Design tiny url

For example, Maimai does not allow you to send overly long URLs and converts them into short links.

1. Scenario

Generate a short URL from a long URL, such as http://www.javaedge.com => http://bit.ly/1ULoQB6.

Restore the long URL from the short URL and redirect to it.

Questions to confirm with the interviewer:
• Do long URLs and short URLs have to correspond one to one?
• If a short URL has not been used for a long time, does it need to be released?

1.1 QPS Analysis
Assume that each of 100M users posts an average of 0.1 microblogs containing a URL per day (out of 10 microblogs, one contains a link).
Average write QPS = 100M * 0.1 / 86400 ≈ 100
Peak write QPS = 100 * 2 = 200
Assume that each user clicks 1 tiny URL per day on average.
Average read QPS = 100M * 1 / 86400 ≈ 1k
Peak read QPS = 1k * 2 = 2k
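The QPS arithmetic above can be reproduced with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and the peak factor of 2 is the multiplier used in the formulas above.

```java
final class CapacityEstimate {
    static final long SECONDS_PER_DAY = 86_400L;

    /** Average requests per second for a given per-user daily rate. */
    static double averageQps(long users, double eventsPerUserPerDay) {
        return users * eventsPerUserPerDay / SECONDS_PER_DAY;
    }

    /** Peak QPS, estimated as the average times a peak factor (2 in the text). */
    static double peakQps(double averageQps, double peakFactor) {
        return averageQps * peakFactor;
    }
}
```

For example, `CapacityEstimate.averageQps(100_000_000L, 0.1)` is roughly 116, which the text rounds to 100, and `averageQps(100_000_000L, 1.0)` is roughly 1157, rounded to 1k.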
Storage: 100M * 0.1 = 10M new records per day. With an average URL length of 100 bytes, that totals 1G per day, so a 1T hard drive lasts about 1,000 days, roughly 3 years.

From the read QPS and storage estimates, we can see that distribution or sharding is not required. To support 2k QPS, a single MySQL instance on SSD is enough.

2. Service - logical block clustering and interface design

The system is actually very simple: it needs only one service, URL Service. Since tiny url has only one UrlService:
Method design:
• UrlService.encode(long_url): encoding method
• UrlService.decode(short_url): decoding method

There are currently two common mainstream styles for endpoint design:
So, which service design does your company choose for its short-link system?

3. Storage data access (best reflects practical experience)
3.1 SQL VS NoSQL

• Do you need transactions? No. nosql+1
• Do you need rich SQL queries? No. nosql+1
• Want to save effort? Tiny URL needs only simple code. nosql+1
• Is the QPS high? 2k, not high. sql+1
• How high is the scalability requirement? Storage and QPS are both low, and a single machine can handle them. sql+1 (with SQL you have to write the scaling code yourself)
• Do you need a sequential ID? It depends on your algorithm.
3.2 Algorithm

Convert a long URL to a 6-character short URL. Given a long URL, return a short URL. Implement two methods:
standard:
Using two hash tables:
The short URL has a fixed format: "http://tiny.url/" + 6 characters, and the characters can be arbitrary. To avoid duplicates, we can either assign keys in lexicographical order, or generate them randomly and use a set to record which keys have already been used.

Use a hash function (not feasible)

For example, take the last 6 characters of the MD5 digest of the long URL. This is fast, but different long URLs can end up with the same 6 characters, and such collisions are hard to resolve.
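The hash-based idea can be sketched as follows. This is an illustration under assumptions, not the original listing: the class name `Md5ShortKey` and the choice of lowercase hex encoding are hypothetical, and the sketch does nothing about collisions, which is exactly why the approach is considered not feasible.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Md5ShortKey {
    /** Returns the last 6 hex characters of the MD5 digest of the long URL. */
    static String shortKey(String longUrl) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(longUrl.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to treat the byte as unsigned before formatting.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.substring(hex.length() - 6);
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With hex output, 6 characters give only 16^6 (about 16.8 million) distinct keys, so collisions become likely long before the 10M-records-per-day volume estimated earlier is stored.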
Randomly generate shortURL + DB deduplication

The long2Short(String url) method randomly picks a 6-character short URL and, if it has not been used before, binds it to the long URL.

Advantages: Simple implementation.
Disadvantage: Generating short links becomes slower as the number of short links grows, because more random candidates collide with keys that are already in use.

Relational database table: only two columns are needed, short key and long URL, with an index on each column.

You can also use NoSQL, but you need to create two tables: one that maps the short key to the long URL, and one that maps the long URL to the short key.
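A minimal in-memory sketch of the random-generation approach is shown below. It is not the original listing: two hash maps stand in for the database table and its two indexes, the 62-character alphabet is an assumption, and `short2Long` is a hypothetical name chosen to mirror `long2Short`.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

final class RandomTinyUrl {
    private static final String PREFIX = "http://tiny.url/";
    // Assumed alphabet; the text only says the characters can be arbitrary.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int KEY_LENGTH = 6;

    // In-memory stand-ins for the two indexed columns of the table.
    private final Map<String, String> longToShort = new HashMap<>();
    private final Map<String, String> shortToLong = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    String long2Short(String url) {
        String existing = longToShort.get(url);
        if (existing != null) {
            return existing; // avoid creating a duplicate mapping
        }
        String shortUrl;
        do {
            shortUrl = PREFIX + randomKey();
        } while (shortToLong.containsKey(shortUrl)); // retry on collision
        longToShort.put(url, shortUrl);
        shortToLong.put(shortUrl, url);
        return shortUrl;
    }

    /** Returns the long URL, or null if the short URL is unknown. */
    String short2Long(String shortUrl) {
        return shortToLong.get(shortUrl);
    }

    private String randomKey() {
        StringBuilder key = new StringBuilder(KEY_LENGTH);
        for (int i = 0; i < KEY_LENGTH; i++) {
            key.append(ALPHABET.charAt(random.nextInt(ALPHABET.length())));
        }
        return key.toString();
    }
}
```

The retry loop is where the slowdown mentioned above comes from: the fuller the key space, the more candidates have to be drawn before an unused one is found.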
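Base32 conversion (microblogging implementation)

Base62: treat the 6-character short key as a base-62 number, so that every short URL corresponds to an integer, namely the auto-increment id of its record.

Number of different URLs that can be represented by 6 characters: 62^6 ≈ 56.8 billion.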
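The conversion between an auto-increment id and a Base62 short key can be sketched like this. The digit order of the alphabet (0-9, then a-z, then A-Z) and the class name `Base62` are assumptions made for the example.

```java
final class Base62 {
    // Assumed digit order: 0-9, a-z, A-Z.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final int BASE = 62;

    /** Converts a non-negative id into its Base62 representation. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder key = new StringBuilder();
        while (id > 0) {
            key.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        return key.reverse().toString(); // digits were produced lowest first
    }

    /** Converts a Base62 key back into the id it encodes. */
    static long decode(String key) {
        long id = 0;
        for (char c : key.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            id = id * BASE + digit;
        }
        return id;
    }
}
```

The largest id that still fits in 6 characters is 62^6 - 1 = 56,800,235,583, which encodes to "ZZZZZZ" with this alphabet.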
Advantages: High efficiency.
Disadvantages: Strong reliance on a global auto-increment id.

Because we need an auto-increment id, we can only use a relational DB table: id (primary key), long url (index).

4. Scale

How can we improve the response speed so that opening a short link is as fast as opening the original link directly? To be clear, this is a read-heavy, write-light business.

4.1 Cache Aside

The cache needs to store two types of data: long to short (needed when creating a short URL) and short to long (needed when resolving a short URL).
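A cache-aside read path for short to long lookups might look like the sketch below. The names are hypothetical: a `HashMap` stands in for a real cache server, the `Function` stands in for the database query, and `databaseReads` exists only to make the effect of the cache visible.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class CacheAsideLookup {
    private final Map<String, String> cache = new HashMap<>(); // stand-in for a cache server
    private final Function<String, String> database;           // stand-in for a DB query
    int databaseReads = 0;                                      // instrumentation only

    CacheAsideLookup(Function<String, String> database) {
        this.database = database;
    }

    /** Resolves a short URL: try the cache first, fall back to the DB, then fill the cache. */
    String shortToLong(String shortUrl) {
        String cached = cache.get(shortUrl);
        if (cached != null) {
            return cached; // cache hit
        }
        databaseReads++;
        String fromDb = database.apply(shortUrl);
        if (fromDb != null) {
            cache.put(shortUrl, fromDb); // populate the cache for later reads
        }
        return fromDb;
    }
}
```

Because reads dominate writes here, most requests after the first one for a given short URL never reach the database.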
4.2 CDN

Use geographic location information to speed up access.

Optimize server access speed:
Optimize data access speed
4.3 When do you need multiple DB servers?

• Insufficient cache resources or a low hit rate
• Too many write operations
• More and more requests cannot be served from the cache

What can multiple DB servers optimize?
• Solve the problem of insufficient storage: storage
• Solve the problem of being too busy: QPS

So what is the main problem with tiny url? Storage is no problem; the key is QPS. So, how to shard?

Vertical splitting: assign different tables to different machines. This is not applicable: the table has only two columns and cannot be split further.

Horizontal splitting:

If id or shortURL is used as the shard key:
• A long2short query can only be broadcast to all N databases
• Why do we need to query long2short? To avoid creating duplicates
• This is fine if you don't need to avoid duplicate creation

If the long URL is used as the shard key:
• A short2long query can only be broadcast to all N DBs

4.3.1 Sharding Key Selection

If one long URL can correspond to multiple short URLs:
• Use the cache to cache all long2short mappings
• When creating a short URL for a long URL, create a new short URL on a cache miss

If one long URL can correspond to only one short URL:
• If using the random generation algorithm
  • Two tables, one for long2short and one for short2long
  • Each mapping is stored twice, so both long2short and short2long queries are supported
• If using base62 conversion
  • There is a serious problem: how to maintain a global auto-increment ID across multiple machines?
  • Generally, a relational DB only supports a global auto-increment ID on a single machine

4.4 Global auto-increment id

4.4.1 Dedicate a DB for auto-increment service

This DB does not store real data and is not responsible for other queries. To avoid a single point of failure, multiple DBs may be required.

4.4.2 Using ZooKeeper

But using a global auto-increment ID is not the best solution for tiny URLs.
Generating a Distributed Sequence Number

4.5 Base62-based sharding strategy

Use hash(long_url) % 62 as the shard key, and put hash(long_url) % 62 directly into the short URL. If the original short key is AB1234, the current short key is AB1234 plus one extra Base62 character that encodes hash(long_url) % 62.
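The idea can be sketched as follows, under several assumptions that are not in the original: `String.hashCode()` plays the role of hash(long_url), the shard character is placed in front of the original key, the alphabet order is 0-9, a-z, A-Z, and all names are hypothetical.

```java
final class ShardedShortKey {
    // Assumed digit order: 0-9, a-z, A-Z.
    private static final String ALPHABET =
            "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    /** Shard number derived from the long URL, in the range 0..61. */
    static int shardOfLongUrl(String longUrl) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(longUrl.hashCode(), 62);
    }

    /** Builds the final short key by prepending the shard character (an assumption). */
    static String withShardPrefix(String longUrl, String originalKey) {
        return ALPHABET.charAt(shardOfLongUrl(longUrl)) + originalKey;
    }

    /** Recovers the shard number from the short key alone. */
    static int shardOfShortKey(String shortKey) {
        return ALPHABET.indexOf(shortKey.charAt(0));
    }
}
```

A long2short query computes the shard from the long URL, and a short2long query reads it from the first character of the key, so neither has to be broadcast.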
In this way, the shard key can be obtained from both the short URL and the long URL.

Disadvantage: The number of DB machines cannot exceed 62.

So, the final optimal architecture is:

4.6 Can it be further optimized?

Communication between the web server and the database. Communication between a centralized server cluster and web servers in other regions is slow: for example, a server in China needs to access a DB in the United States.

Why not let Chinese servers access a Chinese DB? If data is written repeatedly to the Chinese DB, how is the consistency problem solved? It is difficult to solve!

Consider user habits:
So, the final structure is:

You can also maintain a whitelist of domain names to access the DB in the corresponding region.

5. User-defined short links

Implement custom short URLs so that customers can create their own short URLs. That is, you need to implement createCustom on top of the previous methods. Three methods need to be implemented:
Notice:
5.1 Based on Base62

Is it possible to simply add a new column custom_url to URLTable to record the corresponding custom URL? Not feasible! For most rows this column would be empty, which wastes storage space.

Instead, add a new table to store custom URLs: CustomURLTable.

Create a custom short link: query and insert in CustomURLTable.

Create a short link based on the long link:
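Both creation paths can be illustrated with a small in-memory sketch. It is not the original listing: hash maps stand in for URLTable and CustomURLTable, a zero-padded counter stands in for the real key generator, the method names other than createCustom are hypothetical, and a one-to-one mapping between long and short URLs is assumed.

```java
import java.util.HashMap;
import java.util.Map;

final class CustomTinyUrl {
    private static final String PREFIX = "http://tiny.url/";
    private final Map<String, String> longToShortMap = new HashMap<>();
    private final Map<String, String> shortToLongMap = new HashMap<>();
    private long nextId = 0; // stand-in for the real key generator

    /** Binds a user-chosen key, or returns "error" if it conflicts with an existing mapping. */
    String createCustom(String longUrl, String key) {
        String shortUrl = PREFIX + key;
        String boundLong = shortToLongMap.get(shortUrl);
        String boundShort = longToShortMap.get(longUrl);
        if (boundLong == null && boundShort == null) {
            bind(longUrl, shortUrl);
            return shortUrl;
        }
        // The identical mapping already exists: return it instead of failing.
        boolean sameMapping = longUrl.equals(boundLong) && shortUrl.equals(boundShort);
        return sameMapping ? shortUrl : "error";
    }

    /** Returns the existing short URL (custom or generated) or generates a new one. */
    String longToShort(String longUrl) {
        String existing = longToShortMap.get(longUrl);
        if (existing != null) {
            return existing;
        }
        String shortUrl;
        do {
            shortUrl = PREFIX + String.format("%06d", nextId++);
        } while (shortToLongMap.containsKey(shortUrl)); // skip keys taken by custom URLs
        bind(longUrl, shortUrl);
        return shortUrl;
    }

    /** Returns the long URL, or null if the short URL is unknown. */
    String shortToLong(String shortUrl) {
        return shortToLongMap.get(shortUrl);
    }

    private void bind(String longUrl, String shortUrl) {
        longToShortMap.put(longUrl, shortUrl);
        shortToLongMap.put(shortUrl, longUrl);
    }
}
```

In this sketch a custom key and a generated key share one namespace, which is why the generator has to skip keys that a user has already claimed.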
As before, two hash tables are used to handle the mapping between long URLs and short URLs. The additional step is to return "error" when the URL set by the user conflicts with an existing one. Note: if the mapping set by the user is exactly the same as an existing one, the short URL should still be returned.

5.2 Based on random generation algorithm

No changes are needed: just store the custom URL as a short URL!

Reference: https://www.zhihu.com/question/29270034