6 ways to remove duplicate URLs! (with detailed code)

This article is reprinted from the WeChat public account "Java Chinese Community" and was written by Lei Ge. Please contact the Java Chinese Community public account for permission to reprint.

URL deduplication comes up frequently in both daily work and interviews. Similar interview questions have appeared at many well-known Internet companies, including Alibaba, NetEase Cloud, Youku, and Zuoyebang. Closely related problems, such as IP blacklist/whitelist checks, also appear regularly in real systems, so in this article we will walk through the options for URL deduplication.

URL deduplication ideas

Without considering the business scenario or data volume, we can detect duplicate URLs with any of the following approaches:

  1. Use Java's Set collection to determine whether the URL is repeated based on the result of adding (successful addition means the URL is not repeated);
  2. Use the Set collection in Redis to determine whether the URL is repeated based on the result of adding;
  3. Store all URLs in the database, and then use SQL statements to determine whether there are duplicate URLs;
  4. Set the URL column in the database as a unique index, and determine whether the URL is duplicated based on the result when adding;
  5. Use Guava's Bloom filter to implement URL duplication detection;
  6. Use Redis's Bloom filter to implement URL duplication detection.

The specific implementation of the above solution is as follows.

URL deduplication implementation solution

1. Use Java's Set collection to detect duplicates

A Set stores each value at most once: attempting to add a value that is already present fails. We can therefore use the return value of Set.add() to detect duplicate URLs. The implementation code is as follows:

import java.util.HashSet;
import java.util.Set;

public class URLRepeat {
    // URLs to be deduplicated
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Set<String> set = new HashSet<>();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean result = set.add(url);
            if (!result) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            }
        }
    }
}

The execution result of the program is:

URL already exists: www.apigo.cn

From the above result, we can see that a Set collection is enough to implement URL duplicate detection.

2. Redis Set deduplication

The idea behind Redis's Set is the same as Java's Set: both rely on a Set rejecting duplicate members. The original article first demonstrated this with the Redis client redis-cli (the screenshot is not reproduced here): SADD returns 1 when the member is newly added, meaning the URL is not a duplicate, and 0 when the member already exists, meaning the URL is a duplicate.

Let's use code to implement Redis Set deduplication. The implementation code is as follows:

// URLs to be deduplicated
public static final String[] URLS = {
        "www.apigo.cn",
        "www.baidu.com",
        "www.apigo.cn"
};

@Autowired
private RedisTemplate<String, String> redisTemplate;

@RequestMapping("/url")
public void urlRepeat() {
    for (int i = 0; i < URLS.length; i++) {
        String url = URLS[i];
        Long result = redisTemplate.opsForSet().add("urlrepeat", url);
        if (result != null && result == 0) {
            // Duplicate URL
            System.out.println("URL already exists: " + url);
        }
    }
}

The execution result of the above program is:

URL already exists: www.apigo.cn

In the above code we use Spring Data's RedisTemplate. To use a RedisTemplate object in a Spring Boot project, we first need to add the spring-boot-starter-data-redis dependency, configured as follows:

<!-- Add the RedisTemplate dependency -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-data-redis</artifactId>
</dependency>

Then you need to configure the Redis connection information in the project and configure the following in application.properties:

spring.redis.host=127.0.0.1
spring.redis.port=6379
# spring.redis.password=123456  # Redis server password; only needed if the server requires one

After the above two steps, we can use the RedisTemplate object normally to operate Redis in the Spring Boot project.

3. Database deduplication

We can also use a database to detect duplicate URLs. First, design a URL storage table; the SQL for the table is as follows:

/*================================================================*/
/* Table: urlinfo                                                 */
/*================================================================*/
create table urlinfo
(
    id    int not null auto_increment,
    url   varchar(1000),
    ctime date,
    del   boolean,
    primary key (id)
);

/*================================================================*/
/* Index: Index_url                                               */
/*================================================================*/
create index Index_url on urlinfo
(
    url
);

The id is the auto-increment primary key, and the url field is set as an index. Setting an index can speed up the query.

First insert two test rows into the table (the screenshots from the original article are not reproduced here), then query for the candidate URL. If the count returned is greater than 0, the URL already exists; otherwise it does not.
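The query itself was shown only as a screenshot in the original article; a duplicate check along these lines (assuming the urlinfo table above, with a hypothetical candidate URL) would be:

```sql
-- A result greater than 0 means the URL already exists in the table
select count(1) from urlinfo where url = 'www.apigo.cn';
```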

4. Unique index deduplication

We can also use a database unique index to prevent duplicate URLs; the idea is very similar to the Set-based approaches above. First set a unique index on the url field, then insert URLs: if the insert succeeds, the URL is new; if it fails with a duplicate-key error, the URL already exists.

The SQL implementation for creating a unique index is as follows:

create unique index Index_url on urlinfo
(
    url
);

5. Guava Bloom filter deduplication

The Bloom filter was proposed by Burton Bloom in 1970. It is essentially a very long binary vector plus a series of random mapping (hash) functions, and it can be used to test whether an element is in a set. Its advantages are space efficiency and query time far better than general-purpose data structures; its disadvantages are a certain false positive rate and the difficulty of deleting elements.

The core implementation of the Bloom filter is a very large bit array and several hash functions. Assume that the length of the bit array is m and the number of hash functions is k.

As a concrete walk-through (the figure from the original article is not reproduced here), suppose the set holds 3 elements {x, y, z} and the number of hash functions is 3. First, initialize the bit array with every bit set to 0. To insert an element, apply the 3 hash functions to it in turn; each hash value corresponds to one position in the bit array, and that position is set to 1.

To query whether an element w is in the set, hash w the same way to obtain 3 positions. If any of those positions is 0, w is definitely not in the set. If all 3 positions are 1, w may be in the set, but this cannot be guaranteed: the bits at those positions may have been set by the hashes of different elements. For example, if w maps to three positions that were each set to 1 by other elements, the filter reports a possible hit even though w was never inserted. This overlap is exactly the source of the false positive rate.
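Before turning to Guava, the mechanism described above can be sketched with a small hand-rolled filter. The class below (SimpleBloomFilter, a name chosen here for illustration, not a library class) uses a java.util.BitSet as the bit array and derives k hash positions from two base hash values:

```java
import java.util.BitSet;

// Minimal illustrative Bloom filter: an m-bit array plus k hash functions.
// A sketch for understanding only, not a production implementation.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.m = m;
        this.k = k;
        this.bits = new BitSet(m);
    }

    // Derive the i-th hash position from two base hashes (double hashing).
    private int index(String value, int i) {
        int h1 = value.hashCode();
        int h2 = (h1 >>> 16) ^ h1;
        if (h2 == 0) h2 = 1; // avoid all k positions collapsing to one
        return Math.floorMod(h1 + i * h2, m);
    }

    // Insert: set the k mapped bits to 1.
    public void put(String value) {
        for (int i = 0; i < k; i++) bits.set(index(value, i));
    }

    // Query: false means definitely absent; true means possibly present.
    public boolean mightContain(String value) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(index(value, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
        String[] urls = {"www.apigo.cn", "www.baidu.com", "www.apigo.cn"};
        for (String url : urls) {
            if (filter.mightContain(url)) {
                // False positives aside, this fires only for the repeated URL
                System.out.println("URL already exists: " + url);
            } else {
                filter.put(url);
            }
        }
    }
}
```

With 65536 bits, 3 hash functions, and only two distinct URLs, the chance of a false positive here is negligible, so the loop reports only the repeated www.apigo.cn.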

We can use the Guava framework provided by Google to operate the Bloom filter. To achieve this, we first add a reference to Guava in pom.xml and configure it as follows:

<!-- Add the Guava framework -->
<!-- https://mvnrepository.com/artifact/com.google.guava/guava -->
<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>28.2-jre</version>
</dependency>

Implementation code for URL duplication detection:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.Charset;

public class URLRepeat {
    // URLs to be deduplicated
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        // Create a Bloom filter
        BloomFilter<String> filter = BloomFilter.create(
                Funnels.stringFunnel(Charset.defaultCharset()),
                10,    // expected number of insertions
                0.01); // desired false positive probability
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            if (filter.mightContain(url)) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                // Store the URL in the Bloom filter
                filter.put(url);
            }
        }
    }
}

The execution result of the above program is:

URL already exists: www.apigo.cn

6. Redis Bloom filter deduplication

In addition to Guava's Bloom filter, we can also use Redis's Bloom filter to implement URL duplicate detection. Before using it, make sure the Redis server version is 4.0 or higher (the Bloom filter module is only supported from that version on) and that the Bloom filter module has been loaded.

Taking Docker as an example, the original article demonstrated (via screenshots not reproduced here) downloading the Redis Bloom filter module, restarting the Redis service with the module loaded, and then using the Redis client redis-cli to run Bloom filter commands for URL duplicate detection.

In Redis, there are not many Bloom filter operation commands, mainly including the following:

  • bf.add adds an element;
  • bf.exists determines whether an element exists;
  • bf.madd adds multiple elements;
  • bf.mexists determines whether multiple elements exist;
  • bf.reserve sets the accuracy of the Bloom filter.

Next, we use code to demonstrate the use of Redis Bloom filter:

import redis.clients.jedis.Jedis;
import utils.JedisUtils;

import java.util.Arrays;

public class BloomExample {
    // Bloom filter key
    private static final String _KEY = "URLREPEAT_KEY";

    // URLs to be deduplicated
    public static final String[] URLS = {
            "www.apigo.cn",
            "www.baidu.com",
            "www.apigo.cn"
    };

    public static void main(String[] args) {
        Jedis jedis = JedisUtils.getJedis();
        for (int i = 0; i < URLS.length; i++) {
            String url = URLS[i];
            boolean exists = bfExists(jedis, _KEY, url);
            if (exists) {
                // Duplicate URL
                System.out.println("URL already exists: " + url);
            } else {
                bfAdd(jedis, _KEY, url);
            }
        }
    }

    /**
     * Add an element.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfAdd(Jedis jedis, String key, String value) {
        // The value is passed as ARGV[1]; only the key belongs in KEYS
        String luaStr = "return redis.call('bf.add', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Arrays.asList(key),
                Arrays.asList(value));
        return result.equals(1L);
    }

    /**
     * Check whether an element exists.
     * @param jedis Redis client
     * @param key   key
     * @param value value
     * @return boolean
     */
    public static boolean bfExists(Jedis jedis, String key, String value) {
        String luaStr = "return redis.call('bf.exists', KEYS[1], ARGV[1])";
        Object result = jedis.eval(luaStr, Arrays.asList(key),
                Arrays.asList(value));
        return result.equals(1L);
    }
}

The execution result of the above program is:

URL already exists: www.apigo.cn

Summary

This article introduced 6 solutions for URL deduplication. Among them, the Redis Set, Redis Bloom filter, database query, and unique index solutions are suitable for distributed systems. For a massive distributed system, a Redis Bloom filter is recommended; for massive data on a single machine, Guava's Bloom filter is recommended.
