One skill a day: You can make a mistake in splicing a URL, and write a crawler

One skill a day: You can make a mistake in splicing a URL, and write a crawler

When writing crawlers, we often need to parse the list pages of a website. For example, the following example:

  1. <html>
  2. <head>
  3. <meta charset= "utf-8" >
  4. <title>Test relative path</title>
  5. </head>
  6. <body>
  7. <div>
  8. <h1>Book List</h1>
  9. <ul>
  10. <li> <a href="http://127.0.0.1:8000/book/1.html" > First Book</a></li>
  11. <li> <a href="http://127.0.0.1:8000/book/2.html" > The Second Book</a></li>
  12. <li> <a href="http://127.0.0.1:8000/book/3.html" > The third book</a></li>
  13. <li> <a href="http://127.0.0.1:8000/book/4.html" > The fourth book</a></li>
  14. <li> <a href="http://127.0.0.1:8000/book/5.html" > Fifth Book</a></li>
  15. </ul>
  16. </div>
  17. </body>
  18. </html>

The running effect is shown in the figure below:

In this case, I think it is very simple to get the URL of each item. Just write an XPath, as shown below:

If you look closely, you will find that each URL starts with http://127.0.0.1:8000. The address of the current list page is also http://127.0.0.1:8000. So for simplicity, you can use relative paths in the tags:

  1. <html>
  2. <head>
  3. <meta charset= "utf-8" >
  4. <title>Test relative path</title>
  5. </head>
  6. <body>
  7. <div>
  8. <h1>Book List</h1>
  9. <ul>
  10. <li> <a href="/book/1.html" > First Book</a></li>
  11. <li> <a href="/book/2.html" > The Second Book</a></li>
  12. <li> <a href="/book/3.html" > The Third Book</a></li>
  13. <li> <a href="/book/4.html" > Book 4</a></li>
  14. <li> <a href="/book/5.html" > Fifth Book</a></li>
  15. </ul>
  16. </div>
  17. </body>
  18. </html>

The running effect is shown in the figure below. Only half of the URL can be extracted using XPath:

But the browser can correctly recognize such relative addresses, and when you click, it can automatically jump to the correct address:

If the relative path starts with /, the main domain name of the website will be added to the front of the relative path.

But what if the address of the current list page partially overlaps with the relative path of the link? As shown in the following figure:

The address of the current page is http://127.0.0.1:8000/book. The relative address is /book/1.html. In this case, you can simplify it further by not adding a slash in front of the relative path and changing the HTML to:

  1. <html>
  2. <head>
  3. <meta charset= "utf-8" >
  4. <title>Test relative path</title>
  5. </head>
  6. <body>
  7. <div>
  8. <h1>Book List</h1>
  9. <ul>
  10. <li> <a href="1.html" > First Book</a></li>
  11. <li> <a href="2.html" > The Second Book</a></li>
  12. <li> <a href="3.html" > The third book</a></li>
  13. <li> <a href="4.html" > The fourth book</a></li>
  14. <li> <a href="5.html" > Fifth Book</a></li>
  15. </ul>
  16. </div>
  17. </body>
  18. </html>

The running effect is shown in the figure below:

In this case, the browser can still correctly identify it, as shown in the following figure:

The browser knows that if the relative path does not start with /, it will concatenate the URL of the current page with the relative path. But it should be noted that when concatenating, the part to the left of the rightmost slash will be taken. The part to the right will be discarded. It is equivalent to concatenating the file address with the folder where the file is located. As shown in the figure below:

If you can't remember how to distinguish them, you can use Python's own urllib.parse.urljoin to connect, as shown below:

Seeing this, you may think that I have written another article today. Is such a simple thing worth writing an article?

So let's look at the following example:

The domain name is http://127.0.0.1:8000/book/index.html, and the relative domain name is 1.html, but why is the URL automatically recognized by the browser www.kingname.info/1.html?

The key to this problem lies in the tags in the source code:

  1. <html>
  2. <head>
  3. <meta charset= "utf-8" >
  4. <title>Test relative path</title>
  5. <base href= "http://www.kingname.info" >
  6. </head>
  7. <body>
  8. <div>
  9. <h1>Book List</h1>
  10. <ul>
  11. <li> <a href="1.html" > First Book</a></li>
  12. <li> <a href="2.html" > The Second Book</a></li>
  13. <li> <a href="3.html" > The third book</a></li>
  14. <li> <a href="4.html" > The fourth book</a></li>
  15. <li> <a href="5.html" > Fifth Book</a></li>
  16. </ul>
  17. </div>
  18. </body>
  19. </html>

If there is a tag at the head of the HTML code, the value of its href attribute will be used to concatenate an absolute path with the relative path, instead of using the URL of the current page.

If you don't know this, your crawler may have problems when splicing sub-page URLs. The website can also use this mechanism to construct a honeypot. The URL spliced ​​according to the tag is the real sub-page address, and the URL spliced ​​with the current page URL is the honeypot address. When the crawler accesses it, it will capture false data or be blocked immediately.

For a detailed description of the tag, please read: The Document Base URL element[1].

References

[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base

<<:  How powerful is WiFi7? Three times faster than WiFi6, as fast as lightning

>>:  Xi'an Yimatong previously reported: It took two days and two nights to optimize a 1M image to 100kb

Recommend

What kind of ERP system do we need in the post-epidemic era?

In recent years, as the Internet has gradually pe...

How to set up a new router?

[[426343]] 1. Router startup initial sequence Whe...

5G brings edge cloud services to the forefront

Among the many business opportunities created by ...

Industrial IoT and manufacturing will become one of the largest 5G markets

Private 5G networks are attractive to the largest...

China Mobile: 5G package customers reach 331 million

China Mobile released its unaudited financial dat...

How will the network be reconstructed in the 5G era?

The 5G era is approaching. While people are full ...