This article is reprinted from the WeChat public account "Java Chinese Community", written by Lei Ge. Please contact the Java Chinese Community public account to reprint this article. URL deduplication is very common in our daily work and interviews, such as these: It can be seen that similar interview questions have appeared in many well-known Internet companies including Alibaba, NetEase Cloud, Youku, and Zuoyebang. Moreover, problems similar to URL deduplication, such as IP blacklist/whitelist judgment, also often appear in our work, so in this article we will discuss the issue of URL deduplication. URL deduplication ideas Without considering the business scenario and data volume, we can use the following solution to implement URL duplication judgment:
The specific implementation of the above solution is as follows. URL deduplication implementation solution 1. Use Java's Set collection to detect duplicates The Set collection is inherently non-repeatable. It can only store elements with different values. If the values are the same, adding will fail. Therefore, we can determine whether the URL is repeated by the result of adding the Set collection. The implementation code is as follows:
The execution result of the program is: URL already exists: www.apigo.cn From the above results, we can see that the URL duplicate detection function can be realized by using the Set collection. 2. Deduplication of Redis Set The idea of using Redis's Set collection is consistent with the idea of Java's Set collection, both of which are implemented by using the non-repeatability of Set. Let's first use Redis's client redis-cli to implement an example of URL duplication detection: From the above results, we can see that when the addition is successful, it means that the URL is not repeated, but when the addition fails (the result is 0), it means that the URL already exists. Let's use code to implement Redis Set deduplication. The implementation code is as follows:
The execution result of the above program is: URL already exists: www.apigo.cn In the above code, we use the RedisTemplate in Spring Data to implement it. To use the RedisTemplate object in the Spring Boot project, we need to first introduce the spring-boot-starter-data-redis framework. The configuration information is as follows:
Then you need to configure the Redis connection information in the project and configure the following in application.properties:
After the above two steps, we can use the RedisTemplate object normally to operate Redis in the Spring Boot project. 3. Database deduplication We can also use the database to implement URL duplication judgment. First, let's design a URL storage table, as shown in the following figure: The SQL corresponding to this table is as follows:
The id is the auto-increment primary key, and the url field is set as an index. Setting an index can speed up the query. Let's first add two test data to the database, as shown in the following figure: We use SQL statements to query, as shown below: If the result is greater than 0, it means there are duplicate URLs, otherwise it means there are no duplicate URLs. 4. Unique index deduplication We can also use the unique index of the database to prevent URL duplication. Its implementation idea is very similar to the idea of the previous Set collection. First, we set a unique index for the URL field, and then add the URL data. If the addition is successful, it means the URL is not repeated, otherwise it is repeated. The SQL implementation for creating a unique index is as follows:
5. Guava Bloom filter deduplication Bloom Filter was proposed by Bloom in 1970. It is actually a very long binary vector and a series of random mapping functions. Bloom filter can be used to retrieve whether an element is in a set. Its advantages are that the space efficiency and query time are far superior to general algorithms, and its disadvantages are that there is a certain misrecognition rate and deletion difficulty. The core implementation of the Bloom filter is a very large bit array and several hash functions. Assume that the length of the bit array is m and the number of hash functions is k. Take the above figure as an example, the specific operation process: Assume that there are 3 elements {x, y, z} in the set, and the number of hash functions is 3. First, initialize the bit array and set each bit in it to 0. For each element in the set, map the element through 3 hash functions in turn. Each mapping will generate a hash value, which corresponds to a point on the bit array. Then mark the corresponding position of the bit array as 1. When querying whether the W element is in the set, the same method is used to map W to the 3 points on the bit array through hashing. If one of the 3 points is not 1, it can be judged that the element must not exist in the set. On the contrary, if all 3 points are 1, the element may exist in the set. Note: It cannot be judged here whether the element must exist in the set, and there may be a certain misjudgment rate. It can be seen from the figure: Assume that an element corresponds to the 3 points with subscripts 4, 5, and 6 through mapping. Although these 3 points are all 1, it is obvious that these 3 points are the positions obtained by hashing different elements. Therefore, this situation shows that although the element is not in the set, it may correspond to 1, which is the reason for the misjudgment rate. We can use the Guava framework provided by Google to operate the Bloom filter. To achieve this, we first add a reference to Guava in pom.xml and configure it as follows:
Implementation code for URL duplication detection:
The execution result of the above program is: URL already exists: www.apigo.cn 6. Redis Bloom filter deduplication In addition to Guava's Bloom filter, we can also use Redis's Bloom filter to implement URL duplication detection. Before using it, we must first ensure that the Redis server version is greater than 4.0 (this version and above only supports Bloom filters), and enable the Redis Bloom filter function to work properly. Taking Docker as an example, let's demonstrate the installation and activation of Redis Bloom filter. First, download the Redis Bloom filter, and then enable the Bloom filter when restarting the Redis service, as shown in the following figure: After the Bloom filter is enabled, we can use the Redis client redis-cli to implement the Bloom filter URL duplication detection. The command is as follows: In Redis, there are not many Bloom filter operation commands, mainly including the following:
Next, we use code to demonstrate the use of Redis Bloom filter:
The execution result of the above program is: URL already exists: www.apigo.cn Summarize This article introduces 6 solutions for URL deduplication, among which Redis Set, Redis Bloom filter, database and unique index are suitable for distributed systems. If it is a massive distributed system, it is recommended to use Redis Bloom filter to implement URL deduplication. If it is a single machine with massive data, it is recommended to use Guava's Bloom filter to implement URL deduplication. |
<<: 5G core network revenue expected to reach $1 billion by 2020
In recent times, 5G has become popular in the cir...
What is the definition of a smart city? The answe...
Karamay is a desert city that was born and prospe...
At present, the digital economy has become the to...
The global distributed fiber optic sensor market ...
CUBECLOUD (Magic Cube Cloud) has launched a promo...
The generation of etag needs to meet several cond...
[[383535]] 5G has the characteristics of high spe...
Preface I've been reading about HTTP recently...
SaltyfishTech, Saltyfish Cloud, is a Chinese host...
The Docker network mechanism not only builds an e...
After HostDare launched the Los Angeles NVMe SSD ...
[[181278]] On January 6, the Ministry of Science ...
Hostodo has launched a February Special Deal prom...
HostDare has launched a promotion for Premium Chi...