We know that a URL consists of several parts, including the scheme, host, path, and query. The query part holds what are called query parameters: key-value pairs in the URL, each joined by an equals sign. Some of these key-value pairs are essential to the page. For example, in a URL that carries newurl=MDAPTVFE8, that parameter cannot be modified: once you change it, the URL no longer points to the same page. But there are also URLs whose query parameters have no effect on what the web page displays.
When you visit two URLs that differ only in such parameters, you will find that they open the same page. This is because the parameters are used by the website itself, typically to record which page the user came from.

When we develop a general-purpose news crawler, these optional query parameters seriously interfere with URL-based deduplication. The same news article may carry different query parameters depending on which page linked to it, so it may be treated as several different articles.

When we deduplicate news, we generally use a three-level logic: deduplication based on the URL, deduplication based on the body text, and deduplication based on text semantics. Their resource consumption increases from one level to the next. Therefore, if an article can be confirmed as a duplicate by its URL, there is no need to compare the body text; if it can be confirmed as a duplicate by its body text, there is no need for semantic deduplication. Invalid parameters increase the number of articles that reach the second level, which consumes more server resources.

To prevent such invalid parameters from interfering with URL-based deduplication, we need to remove them in advance. Suppose there is a URL: https://www.kingname.info/article?docid=123&from=nav&output=json&ts=1849304323. Through manual annotation, we already know that for the website https://www.kingname.info, the docid and output parameters are valid and must be retained, while the from and ts parameters are invalid and can be removed.

So how do we correctly remove these unnecessary parameters? Some readers will certainly suggest regular expressions. You are welcome to try writing one. As a reminder, some parameter values may themselves contain the = symbol, and some required fields may contain the name of an invalid field.
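To make the last pitfall concrete, here is a minimal sketch of a naive regular expression going wrong. The function naive_strip, its pattern, and the datefrom parameter are hypothetical and exist only for illustration; they are not taken from any real site.

```python
import re


def naive_strip(url, name):
    """Naive approach: delete every 'name=value' match with a regex."""
    return re.sub(rf"{re.escape(name)}=[^&]*&?", "", url)


# Hypothetical URL: "datefrom" stands in for a required parameter whose name
# happens to end with the invalid parameter name "from".
url = "https://www.kingname.info/article?docid=123&datefrom=2020&from=nav"

print(naive_strip(url, "from"))
# The pattern also matches inside "datefrom=2020", so the required value is lost.
```

The pattern can be tightened with anchors and lookarounds, but each fix tends to uncover another edge case, which is the reason for avoiding regular expressions here.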
Today, instead of regular expressions, we will use several functions from Python's built-in urllib.parse module to remove invalid fields safely. The method relies on four functions: urlparse, parse_qs, urlencode, and urlunparse.
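A minimal sketch of this approach might look like the following. The function name remove_invalid_params and its signature are assumptions made for illustration, as are the keep_blank_values=True and doseq=True choices; the sample URL and the invalid parameter names from and ts come from the example above.

```python
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse


def remove_invalid_params(url, invalid_params):
    """Return url without the query parameters named in invalid_params."""
    parser = urlparse(url)
    # keep_blank_values=True keeps parameters such as "a=" instead of dropping them.
    query = parse_qs(parser.query, keep_blank_values=True)
    for name in invalid_params:
        # A default of None avoids a KeyError when the parameter is absent.
        query.pop(name, None)
    # parse_qs maps each key to a list of values; doseq=True expands the lists.
    new_query = urlencode(query, doseq=True)
    # _replace returns a new ParseResult; the original object is unchanged.
    return urlunparse(parser._replace(query=new_query))


url = "https://www.kingname.info/article?docid=123&from=nav&output=json&ts=1849304323"
print(remove_invalid_params(url, ["from", "ts"]))
# https://www.kingname.info/article?docid=123&output=json
```

Because field names are compared as whole dictionary keys, a required parameter whose name merely contains an invalid name is left alone. Note that urlencode re-encodes the remaining values, so their percent-encoding may be normalized compared with the input URL.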
urlparse and urlunparse are a pair of inverse functions: the former converts a URL string into a ParseResult object, and the latter converts a ParseResult object back into a URL string. The .query attribute of the ParseResult object is a string; for the sample URL above it is docid=123&from=nav&output=json&ts=1849304323.

parse_qs and urlencode are also a pair of inverse functions. The former converts the .query string into a dictionary that maps each field name to a list of values, and the latter converts such a dictionary back into a string in .query form (pass doseq=True so that each list is expanded into separate key=value pairs).

After parse_qs has converted the query into a dictionary, we can use the dictionary's .pop method to remove the invalid fields and then regenerate a new .query string. ParseResult is a named tuple, so its .query attribute is read-only and cannot be assigned to. Instead we call parser._replace(query=...), which returns a new ParseResult object carrying the new query. Despite the leading underscore, _replace is a documented named-tuple method rather than a private one. Finally, we convert the result back into a URL with urlunparse.

With this method, we can safely remove invalid fields from URLs without having to write complex regular expressions.

This article is reprinted from the WeChat public account "Weiwen Code". To reprint this article, please contact the Weiwen Code public account.