One skill a day: Even joining URLs can trip you up when writing a crawler

When writing crawlers, we often need to parse a website's list pages. Take the following page as an example:

  <html>
  <head>
    <meta charset="utf-8">
    <title>Test relative path</title>
  </head>
  <body>
    <div>
      <h1>Book List</h1>
      <ul>
        <li><a href="http://127.0.0.1:8000/book/1.html">First Book</a></li>
        <li><a href="http://127.0.0.1:8000/book/2.html">Second Book</a></li>
        <li><a href="http://127.0.0.1:8000/book/3.html">Third Book</a></li>
        <li><a href="http://127.0.0.1:8000/book/4.html">Fourth Book</a></li>
        <li><a href="http://127.0.0.1:8000/book/5.html">Fifth Book</a></li>
      </ul>
    </div>
  </body>
  </html>

Opened in a browser, the page shows a list of five book links.

In this case, getting the URL of each item is simple: a single XPath expression that selects the href attribute of every a tag does the job.
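The original XPath is not preserved in the text, so what follows is only a minimal sketch of the idea. To stay within the standard library it uses xml.etree.ElementTree, which supports a limited subset of XPath and requires well-formed XML, so only the list fragment is parsed. The names LIST_FRAGMENT and extract_hrefs are made up for this illustration; a real crawler would more likely run an expression such as //a/@href through an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: only the <ul> fragment of the page, because
# ElementTree needs well-formed XML and the full page is plain HTML.
LIST_FRAGMENT = """
<ul>
  <li><a href="http://127.0.0.1:8000/book/1.html">First Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/2.html">Second Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/3.html">Third Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/4.html">Fourth Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/5.html">Fifth Book</a></li>
</ul>
"""


def extract_hrefs(fragment: str) -> list:
    """Return the href of every <a> element that has one."""
    root = ET.fromstring(fragment.strip())
    # ".//a[@href]" selects all descendant <a> elements carrying an href.
    return [a.get("href") for a in root.findall(".//a[@href]")]


urls = extract_hrefs(LIST_FRAGMENT)
for url in urls:
    print(url)
```

Because every href here is already an absolute URL, the extracted values can be requested directly.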

If you look closely, you will see that every URL starts with http://127.0.0.1:8000, and the address of the current list page is also http://127.0.0.1:8000. So, for simplicity, the a tags can use relative paths instead:

  <html>
  <head>
    <meta charset="utf-8">
    <title>Test relative path</title>
  </head>
  <body>
    <div>
      <h1>Book List</h1>
      <ul>
        <li><a href="/book/1.html">First Book</a></li>
        <li><a href="/book/2.html">Second Book</a></li>
        <li><a href="/book/3.html">Third Book</a></li>
        <li><a href="/book/4.html">Fourth Book</a></li>
        <li><a href="/book/5.html">Fifth Book</a></li>
      </ul>
    </div>
  </body>
  </html>

The page renders the same way in the browser, but XPath now extracts only half of each URL: the path, without the scheme and host.

The browser, however, resolves these relative addresses correctly, and clicking a link takes you to the right page.

If a relative path starts with /, the browser prepends the site's origin (scheme, host, and port) to it.
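The rule for paths that start with / can be sketched by hand as follows. This is an illustration under simplifying assumptions, not a full URL resolver: the helper names origin_of and join_root_relative are invented here, and protocol-relative links such as //example.com/x are deliberately rejected because they follow a different rule.

```python
from urllib.parse import urlsplit


def origin_of(page_url: str) -> str:
    """Return scheme://host[:port] of the page URL."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}"


def join_root_relative(page_url: str, href: str) -> str:
    """Resolve an href that starts with a single '/' against the page's origin."""
    if not href.startswith("/") or href.startswith("//"):
        raise ValueError("expected a path that starts with a single '/'")
    # The page's own path is irrelevant; only the origin is kept.
    return origin_of(page_url) + href


print(join_root_relative("http://127.0.0.1:8000/", "/book/1.html"))
print(join_root_relative("http://127.0.0.1:8000/some/deep/page.html", "/book/1.html"))
```

Both calls give the same absolute URL, because a path with a leading slash ignores whatever path the current page has.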

But what if the address of the current list page partially overlaps with the relative path of the links?

Suppose the address of the current page is http://127.0.0.1:8000/book/ (with a trailing slash; the reason this matters is explained below) and the relative address is /book/1.html. In this case, you can simplify further by dropping the leading slash and the shared folder from the relative path, changing the HTML to:

  <html>
  <head>
    <meta charset="utf-8">
    <title>Test relative path</title>
  </head>
  <body>
    <div>
      <h1>Book List</h1>
      <ul>
        <li><a href="1.html">First Book</a></li>
        <li><a href="2.html">Second Book</a></li>
        <li><a href="3.html">Third Book</a></li>
        <li><a href="4.html">Fourth Book</a></li>
        <li><a href="5.html">Fifth Book</a></li>
      </ul>
    </div>
  </body>
  </html>

Even in this form, the browser still resolves each link to the correct address.

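The browser knows that when a relative path does not start with /, it must be joined to the URL of the current page. Note, however, how the join works: everything up to and including the rightmost slash of the page URL's path is kept, and whatever follows that slash is discarded. In other words, the relative path is joined to the folder that contains the current file, not to the file itself. For example, from http://127.0.0.1:8000/book/index.html the link 1.html resolves to http://127.0.0.1:8000/book/1.html, whereas from http://127.0.0.1:8000/book (no trailing slash) it would resolve to http://127.0.0.1:8000/1.html.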
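The "keep everything up to the rightmost slash" rule can be written out by hand as a rough sketch. The function name join_relative is hypothetical, and the code makes simplifying assumptions: it ignores query strings, fragments, and dot segments such as ../, all of which a real resolver must handle.

```python
from urllib.parse import urlsplit


def join_relative(page_url: str, href: str) -> str:
    """Resolve an href without a leading slash against the page's folder."""
    parts = urlsplit(page_url)
    path = parts.path
    # Keep the path up to and including its rightmost slash; an empty
    # path (bare host) is treated as the root folder "/".
    directory = path[: path.rfind("/") + 1] or "/"
    return f"{parts.scheme}://{parts.netloc}{directory}{href}"


for page in (
    "http://127.0.0.1:8000/book/index.html",
    "http://127.0.0.1:8000/book/",
    "http://127.0.0.1:8000/book",
):
    print(page, "->", join_relative(page, "1.html"))
```

The third call shows why the trailing slash matters: without it, book is treated as a file name and is dropped from the result.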

If you have trouble remembering these rules, let Python's built-in urllib.parse.urljoin do the joining for you.
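Here is a short sketch of urljoin applied to the cases discussed so far. The page URLs and hrefs are the ones from this article; the variable names cases and resolved are only for illustration.

```python
from urllib.parse import urljoin

# (page URL, href found in the page)
cases = [
    ("http://127.0.0.1:8000/", "http://127.0.0.1:8000/book/1.html"),  # already absolute
    ("http://127.0.0.1:8000/", "/book/1.html"),                       # starts with /
    ("http://127.0.0.1:8000/book/", "1.html"),                        # folder URL
    ("http://127.0.0.1:8000/book/index.html", "1.html"),              # file URL
    ("http://127.0.0.1:8000/book", "1.html"),                         # no trailing slash
]

resolved = [urljoin(page_url, href) for page_url, href in cases]

for (page_url, href), absolute in zip(cases, resolved):
    print(f"{page_url} + {href} -> {absolute}")
```

The first four cases all resolve to http://127.0.0.1:8000/book/1.html; the last one resolves to http://127.0.0.1:8000/1.html because book is treated as a file name rather than a folder.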

At this point you may be thinking that I am just padding out another post. Is something this simple really worth an article?

So let's look at another example.

The page URL is http://127.0.0.1:8000/book/index.html and the relative path is 1.html, so why does the browser resolve the link to http://www.kingname.info/1.html?

The key to this problem lies in the <base> tag in the source code:

  <html>
  <head>
    <meta charset="utf-8">
    <title>Test relative path</title>
    <base href="http://www.kingname.info">
  </head>
  <body>
    <div>
      <h1>Book List</h1>
      <ul>
        <li><a href="1.html">First Book</a></li>
        <li><a href="2.html">Second Book</a></li>
        <li><a href="3.html">Third Book</a></li>
        <li><a href="4.html">Fourth Book</a></li>
        <li><a href="5.html">Fifth Book</a></li>
      </ul>
    </div>
  </body>
  </html>

If there is a <base> tag in the <head> of the HTML, the value of its href attribute is joined with each relative path to form the absolute URL, instead of the URL of the current page.
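A crawler can account for this by looking for a <base> tag before joining links. The following is a minimal sketch using only the standard library; the class LinkCollector, the function resolve_links, and the shortened sample page are assumptions made for this illustration, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the first <base href> and every <a href> in a page."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "base" and self.base_href is None and attr_map.get("href"):
            self.base_href = attr_map["href"]
        elif tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def resolve_links(page_url: str, html: str) -> list:
    """Resolve every link, preferring the <base> href over the page URL."""
    collector = LinkCollector()
    collector.feed(html)
    base_url = page_url
    if collector.base_href:
        # The base href may itself be relative, so resolve it first.
        base_url = urljoin(page_url, collector.base_href)
    return [urljoin(base_url, href) for href in collector.hrefs]


# Hypothetical, shortened version of the page above.
SAMPLE_HTML = """
<html><head><base href="http://www.kingname.info"></head>
<body><a href="1.html">First Book</a><a href="2.html">Second Book</a></body></html>
"""
PAGE_URL = "http://127.0.0.1:8000/book/index.html"

with_base = resolve_links(PAGE_URL, SAMPLE_HTML)
naive = [urljoin(PAGE_URL, href) for href in ("1.html", "2.html")]
print("with <base>:", with_base)
print("page URL only:", naive)
```

Comparing the two lists shows how a crawler that ignores <base> ends up requesting a different address than the browser would.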

If you don't know this, your crawler may run into problems when joining sub-page URLs. A website can also use this mechanism to build a honeypot: the URL joined using the <base> tag is the real sub-page address, while the URL joined using the current page URL is the honeypot address. A crawler that requests the honeypot address will be served fake data or be blocked immediately.

For a detailed description of the <base> tag, please read: The Document Base URL element[1].

References

[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
