When writing a crawler, you often need to parse a website's list pages. Consider a simple list page in which every item links to a sub-page.
When every link on the page carries a full URL, getting the URL of each item is simple: one XPath expression that selects the href attribute of each <a> tag is enough.

Look closely, though, and you will see that every URL starts with http://127.0.0.1:8000, which is also the address of the list page itself. To keep the markup short, the site can therefore use relative paths in the <a> tags and drop the http://127.0.0.1:8000 prefix.
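As a rough sketch of what the crawler then sees, the example below parses a hypothetical list page whose links use relative paths and extracts each href. The markup and the paths are assumptions for illustration. The standard library's xml.etree.ElementTree, which supports only a small XPath subset, stands in here for a full XPath engine such as lxml, so the input has to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page (an assumption, not the article's original markup).
# It is written as well-formed XML so the standard-library parser can read it.
LIST_PAGE = """
<html>
  <body>
    <ul>
      <li><a href="/book/1.html">Book 1</a></li>
      <li><a href="/book/2.html">Book 2</a></li>
    </ul>
  </body>
</html>
"""

root = ET.fromstring(LIST_PAGE)

# ".//a[@href]" selects every <a> element that has an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

print(hrefs)  # ['/book/1.html', '/book/2.html']
```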
With relative paths in the markup, XPath extracts only half of each URL: the path, without the scheme and host. The browser still recognizes such relative addresses correctly, and clicking one jumps to the right page. The rule is that if a relative path starts with /, the site's scheme, host, and port are prepended to it.

But what if the address of the current list page partially overlaps with the relative path of the link? Suppose the current page is http://127.0.0.1:8000/book and the relative address is /book/1.html. In this case the link can be simplified further by dropping the leading slash from the relative path in the HTML.
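Both kinds of relative path can be checked with urllib.parse.urljoin from the standard library. The sketch below is illustrative only: the page URLs and the path 1.html are sample values, and the last three calls use a path without a leading slash, which is resolved against whatever precedes the rightmost slash of the page URL.

```python
from urllib.parse import urljoin

# Leading slash: the path replaces everything after the host and port.
with_slash = urljoin("http://127.0.0.1:8000/book", "/book/1.html")

# No leading slash: the path replaces whatever follows the rightmost slash
# of the page URL.
from_file = urljoin("http://127.0.0.1:8000/book/index.html", "1.html")
from_dir = urljoin("http://127.0.0.1:8000/book/", "1.html")
no_trailing_slash = urljoin("http://127.0.0.1:8000/book", "1.html")

print(with_slash)         # http://127.0.0.1:8000/book/1.html
print(from_file)          # http://127.0.0.1:8000/book/1.html
print(from_dir)           # http://127.0.0.1:8000/book/1.html
print(no_trailing_slash)  # http://127.0.0.1:8000/1.html
```

The last result is a common surprise: without a trailing slash, book counts as a file name rather than a folder, so it is dropped.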
The browser can still resolve such a link correctly. It knows that when a relative path does not start with /, the path must be joined to the URL of the current page. Note, however, that only the part of the current URL up to and including its rightmost slash is kept, and everything to the right of that slash is discarded. In other words, the relative path is joined to the folder that contains the current file, not to the file itself. For example, if the current page is http://127.0.0.1:8000/book/index.html, then index.html is dropped and the relative path is appended to http://127.0.0.1:8000/book/.

If you can't remember these rules, you can let Python's built-in urllib.parse.urljoin do the joining for you.

At this point you may be thinking that I am just padding out another article today. Is something this simple really worth writing about? Then look at the following case. The current page is http://127.0.0.1:8000/book/index.html and the relative path is 1.html, so why does the browser resolve the link to www.kingname.info/1.html? The key lies in a tag in the page's source code.
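A page that behaves this way might look like the hypothetical markup in the sketch below, which carries a <base> element in its head. The markup, the http scheme on www.kingname.info, and the helper names LinkParser and extract_links are assumptions for illustration, not the original page's source. Using only the standard library, the sketch collects the <base> href and the <a> hrefs, then resolves the link both naively against the page URL and against the <base> href.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the first <base href> and every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def extract_links(page_url, html):
    """Return absolute link URLs, honoring a <base> element if one exists."""
    parser = LinkParser()
    parser.feed(html)
    # The <base> href may itself be relative, so resolve it against the page URL.
    base_url = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base_url, href) for href in parser.hrefs]


# Hypothetical page source (an assumption for illustration).
PAGE = """
<html>
  <head><base href="http://www.kingname.info/"></head>
  <body><a href="1.html">Book 1</a></body>
</html>
"""
PAGE_URL = "http://127.0.0.1:8000/book/index.html"

naive = urljoin(PAGE_URL, "1.html")     # ignores <base>
links = extract_links(PAGE_URL, PAGE)   # honors <base>

print(naive)  # http://127.0.0.1:8000/book/1.html
print(links)  # ['http://www.kingname.info/1.html']
```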
If the head of the HTML contains a <base> tag, the value of its href attribute, rather than the URL of the current page, is used to turn relative paths into absolute ones. If you don't know this, your crawler may build the wrong sub-page URLs.

A website can also use this mechanism to set up a honeypot: the URL built from the <base> tag is the real sub-page address, while the URL built from the current page's URL is the honeypot address. A crawler that requests the latter will either collect fake data or be blocked immediately.

For a detailed description of the <base> tag, see The Document Base URL element[1].

References

[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base