Universal crawler techniques: How to properly remove invalid parameters from URLs

We know that a URL consists of several parts: the scheme, the domain, the path, the query, and the fragment.

The query part consists of key-value pairs, each joined by an equals sign and separated from the next by an ampersand. Some of these key-value pairs are valid, that is, essential to the page, for example:

  1. https://open.163.com/newview/movie/courseintro?newurl=MDAPTVFE8

The newurl=MDAPTVFE8 parameter in this URL cannot be modified; change it, and the URL no longer points to this page.

But there are some URLs whose query parameters have no effect on the display of the web page, such as the following two URLs:

  1. https://www.163.com/dy/article/G7NINAJS0514HDK6.html?from=nav
  2. https://www.163.com/dy/article/G7NINAJS0514HDK6.html

When you visit these two URLs, you will find that they open the same page. That is because the website uses these parameters only for analytics, to record which page the user jumped from to reach this one.

When we develop a general news crawler, such optional query parameters seriously interfere with URL-based deduplication. The same news article may carry different query parameters depending on which page the user jumped from, and may therefore be treated as several different articles.
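The interference is easy to reproduce. Here is a minimal sketch of naive set-based URL deduplication, using the two example URLs from above:

```python
# Naive set-based URL dedup: the same article, reached via different pages,
# carries different query parameters and is therefore counted twice.
urls = [
    'https://www.163.com/dy/article/G7NINAJS0514HDK6.html?from=nav',
    'https://www.163.com/dy/article/G7NINAJS0514HDK6.html',
]
seen = set()
for url in urls:
    seen.add(url)
print(len(seen))  # 2 -- one article counted as two
```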

When we deduplicate news, we generally apply three levels of deduplication: by URL, by news body text, and by text semantics, in order of increasing resource consumption. If an article can be confirmed as a duplicate by its URL alone, there is no need to compare body text; if it can be confirmed by body text, there is no need for semantic deduplication. Invalid parameters push more articles into the second level and therefore waste server resources.

To prevent such invalid parameters from interfering with URL-based deduplication, we need to remove them in advance.

Suppose there is a URL: https://www.kingname.info/article?docid=123&from=nav&output=json&ts=1849304323. Through manual annotation, we already know that for the website https://www.kingname.info, the docid and output parameters are valid and must be retained, while the from and ts parameters are invalid and can be removed. So how do we remove these unnecessary fields correctly?

Some students will surely suggest regular expressions. Feel free to try writing one, but be warned: some parameter values may themselves contain = signs, and the values of required fields may contain the names of invalid fields.
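To see why this is tricky, consider a hypothetical query where the value of a next parameter (a made-up name for illustration) itself contains an = sign. A naive regex can easily mangle it, but parse_qs splits each pair on the first = only, and urlencode percent-encodes the embedded = on the way back out:

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical query: the value of `next` itself contains an '=' sign
query = 'docid=123&next=page=2&from=nav'
query_dict = parse_qs(query)
# parse_qs splits each pair on the FIRST '=' only, so the value survives intact
print(query_dict)  # {'docid': ['123'], 'next': ['page=2'], 'from': ['nav']}
query_dict.pop('from')
# urlencode percent-encodes the embedded '=' as %3D
print(urlencode(query_dict, doseq=True))  # docid=123&next=page%3D2
```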

Today, instead of using regular expressions, we will use a few functions from Python's built-in urllib.parse module to remove invalid fields safely and cleanly.

This method uses urlparse, parse_qs, urlencode, and urlunparse. Let's look at the code:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

url = 'https://www.kingname.info/article?docid=123&from=nav&output=json&ts=1849304323'
useless_field = ['from', 'ts']
parser = urlparse(url)
query = parser.query
query_dict = parse_qs(query)
for field in useless_field:
    if field in query_dict:
        query_dict.pop(field)

new_query = urlencode(query_dict, doseq=True)
new_parser = parser._replace(query=new_query)
new_url = urlunparse(new_parser)
print(new_url)
```

Running this code prints https://www.kingname.info/article?docid=123&output=json: the from and ts parameters are gone.

Among them, urlparse and urlunparse are a pair of inverse functions: the former converts a URL into a ParseResult object, and the latter converts a ParseResult object back into a URL string.
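A quick round trip shows the pairing (shortened here to a single docid parameter for readability):

```python
from urllib.parse import urlparse, urlunparse

url = 'https://www.kingname.info/article?docid=123'
parser = urlparse(url)
print(parser)
# ParseResult(scheme='https', netloc='www.kingname.info', path='/article',
#             params='', query='docid=123', fragment='')
print(urlunparse(parser) == url)  # True
```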

The .query attribute of the ParseResult object is a string in the familiar key=value&key=value format; for our example URL it is docid=123&from=nav&output=json&ts=1849304323.

parse_qs and urlencode are likewise a pair of inverse functions: the former converts the .query string into a dictionary, and the latter converts such a dictionary back into a query string.
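For the example URL above, the round trip looks like this. Note that parse_qs puts every value into a list, which is why urlencode needs doseq=True to serialize each list element as its own key=value pair:

```python
from urllib.parse import parse_qs, urlencode

query = 'docid=123&from=nav&output=json&ts=1849304323'
query_dict = parse_qs(query)
print(query_dict)
# {'docid': ['123'], 'from': ['nav'], 'output': ['json'], 'ts': ['1849304323']}
print(urlencode(query_dict, doseq=True))
# docid=123&from=nav&output=json&ts=1849304323
```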

After using parse_qs to convert the query into a dictionary, we can use the dictionary's .pop method to remove the invalid fields and then regenerate a new query string.
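As a small aside, dict.pop accepts a default value, so the membership check in the loop above is optional; .pop(field, None) simply does nothing when the field is absent:

```python
query_dict = {'docid': ['123'], 'from': ['nav'], 'output': ['json']}
for field in ['from', 'ts']:  # 'ts' is not present in this dictionary
    query_dict.pop(field, None)  # no KeyError even for missing keys
print(query_dict)  # {'docid': ['123'], 'output': ['json']}
```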

Since the .query attribute of the ParseResult object is read-only, we cannot overwrite it directly. ParseResult is a named tuple, so we call its ._replace method to build a new ParseResult carrying the new query string, and finally convert that back into a URL.

Using this method, we can safely remove invalid fields from URLs without having to write complex regular expressions.
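Wrapped as a small reusable function (clean_url is a hypothetical name for this sketch, not part of urllib), the whole technique fits in a few lines:

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def clean_url(url, useless_fields):
    """Remove the given query fields from url and return the cleaned URL."""
    parser = urlparse(url)
    query_dict = parse_qs(parser.query)
    for field in useless_fields:
        query_dict.pop(field, None)
    new_query = urlencode(query_dict, doseq=True)
    return urlunparse(parser._replace(query=new_query))

print(clean_url(
    'https://www.kingname.info/article?docid=123&from=nav&output=json&ts=1849304323',
    ['from', 'ts'],
))
# https://www.kingname.info/article?docid=123&output=json
```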

This article is reprinted from the WeChat public account "Weiwen Code". To reprint this article, please contact the Weiwen Code public account.
