One skill a day: Even joining a URL can go wrong when you write a crawler

When writing crawlers, we often need to parse a website's list pages, such as the following:

 <html>
 <head>
   <meta charset="utf-8">
   <title>Test relative path</title>
 </head>
 <body>
   <div>
     <h1>Book List</h1>
     <ul>
       <li><a href="http://127.0.0.1:8000/book/1.html">First Book</a></li>
       <li><a href="http://127.0.0.1:8000/book/2.html">Second Book</a></li>
       <li><a href="http://127.0.0.1:8000/book/3.html">Third Book</a></li>
       <li><a href="http://127.0.0.1:8000/book/4.html">Fourth Book</a></li>
       <li><a href="http://127.0.0.1:8000/book/5.html">Fifth Book</a></li>
     </ul>
   </div>
 </body>
 </html>

The page renders as shown in the figure below:

In this case, getting the URL of each item is straightforward. Just write an XPath expression, as shown below:
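The XPath from the original figure is not reproduced in the text. An expression along the lines of `//a/@href` would probably do the job (that expression is an assumption, not the author's exact one). The sketch below collects the same attribute values using only the standard library's `html.parser`, so it runs without an XPath engine; `LinkExtractor`, `extract_links`, and the shortened `LIST_PAGE` sample are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, roughly what //a/@href would select."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_links(html):
    """Return the raw href values in document order, exactly as written."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# A shortened version of the list page above (two entries instead of five)
LIST_PAGE = """
<ul>
  <li><a href="http://127.0.0.1:8000/book/1.html">First Book</a></li>
  <li><a href="http://127.0.0.1:8000/book/2.html">Second Book</a></li>
</ul>
"""

print(extract_links(LIST_PAGE))
```

Note that the extractor returns each `href` exactly as it appears in the source; it does no URL resolution of its own.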

If you look closely, you will find that each URL starts with http://127.0.0.1:8000, which is also the address of the current list page. So for simplicity, the <a> tags can use relative paths:

 <html>
 <head>
   <meta charset="utf-8">
   <title>Test relative path</title>
 </head>
 <body>
   <div>
     <h1>Book List</h1>
     <ul>
       <li><a href="/book/1.html">First Book</a></li>
       <li><a href="/book/2.html">Second Book</a></li>
       <li><a href="/book/3.html">Third Book</a></li>
       <li><a href="/book/4.html">Fourth Book</a></li>
       <li><a href="/book/5.html">Fifth Book</a></li>
     </ul>
   </div>
 </body>
 </html>

The page renders as shown in the figure below. XPath now extracts only the path portion of each URL:

The browser, however, resolves such relative addresses correctly, and clicking a link takes you to the right page:

If the relative path starts with /, the scheme and host of the website (for example, http://127.0.0.1:8000) are prepended to it.
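As a rough illustration of that rule, the following sketch rebuilds the absolute URL by hand. The helper name `resolve_root_relative` is hypothetical, and the function deliberately handles only paths that begin with a slash.

```python
from urllib.parse import urlsplit


def resolve_root_relative(page_url, href):
    """Join a path that starts with '/' to the scheme and host of page_url."""
    if not href.startswith("/"):
        raise ValueError("this sketch only handles paths that start with '/'")
    parts = urlsplit(page_url)
    # Keep only scheme://host[:port]; the page's own path is dropped entirely
    origin = f"{parts.scheme}://{parts.netloc}"
    return origin + href


print(resolve_root_relative("http://127.0.0.1:8000", "/book/1.html"))
print(resolve_root_relative("http://127.0.0.1:8000/book/index.html", "/book/1.html"))
```

Both calls produce http://127.0.0.1:8000/book/1.html, because the path of the current page plays no part when the link starts with a slash.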

But what if the address of the current list page partially overlaps with the relative path of the link, as in the following figure?

The current page sits under http://127.0.0.1:8000/book/ (for example, http://127.0.0.1:8000/book/index.html), and the relative address is /book/1.html. In this case, the link can be shortened further to a relative path with no leading slash, changing the HTML to:

 <html>
 <head>
   <meta charset="utf-8">
   <title>Test relative path</title>
 </head>
 <body>
   <div>
     <h1>Book List</h1>
     <ul>
       <li><a href="1.html">First Book</a></li>
       <li><a href="2.html">Second Book</a></li>
       <li><a href="3.html">Third Book</a></li>
       <li><a href="4.html">Fourth Book</a></li>
       <li><a href="5.html">Fifth Book</a></li>
     </ul>
   </div>
 </body>
 </html>

The page renders as shown in the figure below:

In this case, the browser still resolves the links correctly, as shown in the following figure:

The browser knows that when a relative path does not start with /, it must be joined to the URL of the current page. Note, however, that only the part of the page URL up to and including its rightmost slash is kept; everything to the right of that slash is discarded. In other words, the relative path is joined to the folder that contains the current file, not to the file itself. As shown in the figure below:
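To make the "rightmost slash" rule concrete, here is a deliberately simplified sketch. The helper `join_to_directory` is hypothetical; it assumes the page URL contains a path and ignores query strings, fragments, and `..` segments, all of which a real resolver has to handle.

```python
def join_to_directory(page_url, relative):
    """Join a slash-less relative path to the 'folder' part of page_url.

    Simplified on purpose: assumes page_url has a path component and
    contains no query string or fragment.
    """
    # Keep everything up to and including the rightmost slash
    directory = page_url[: page_url.rfind("/") + 1]
    return directory + relative


# index.html is discarded, so the link lands in the same folder
print(join_to_directory("http://127.0.0.1:8000/book/index.html", "1.html"))
# -> http://127.0.0.1:8000/book/1.html

# Without a trailing slash, "book" is treated as a file and discarded too
print(join_to_directory("http://127.0.0.1:8000/book", "1.html"))
# -> http://127.0.0.1:8000/1.html
```

The second call shows why a trailing slash on the page URL matters: http://127.0.0.1:8000/book and http://127.0.0.1:8000/book/ resolve the same relative link to different addresses.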

If you can't remember these rules, you can let Python's built-in urllib.parse.urljoin do the joining, as shown below:

At this point, you may be thinking that today's article is just filler. Is something this simple really worth writing about?

Then let's look at the following example:

The page URL is http://127.0.0.1:8000/book/index.html, and the relative path is 1.html, so why does the browser resolve the link to www.kingname.info/1.html?
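The figure with the original call is not available, so the following is a small sketch of how `urljoin` behaves for the cases discussed so far. The `page_url` value is taken from the article's later example; the variable names are illustrative.

```python
from urllib.parse import urljoin

page_url = "http://127.0.0.1:8000/book/index.html"

# No leading slash: joined to the folder of the current page
same_folder = urljoin(page_url, "1.html")
print(same_folder)    # http://127.0.0.1:8000/book/1.html

# Leading slash: joined to the scheme and host only
root_relative = urljoin(page_url, "/book/1.html")
print(root_relative)  # http://127.0.0.1:8000/book/1.html

# Already absolute: returned unchanged
absolute = urljoin(page_url, "http://127.0.0.1:8000/book/1.html")
print(absolute)       # http://127.0.0.1:8000/book/1.html

# Page URL without a trailing slash: "book" is dropped like a file name
no_trailing_slash = urljoin("http://127.0.0.1:8000/book", "1.html")
print(no_trailing_slash)  # http://127.0.0.1:8000/1.html
```

Passing every extracted `href` through `urljoin` means the crawler does not need separate branches for absolute, root-relative, and folder-relative links.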

The key to this puzzle is the <base> tag in the source code:

 <html>
 <head>
   <meta charset="utf-8">
   <title>Test relative path</title>
   <base href="http://www.kingname.info">
 </head>
 <body>
   <div>
     <h1>Book List</h1>
     <ul>
       <li><a href="1.html">First Book</a></li>
       <li><a href="2.html">Second Book</a></li>
       <li><a href="3.html">Third Book</a></li>
       <li><a href="4.html">Fourth Book</a></li>
       <li><a href="5.html">Fifth Book</a></li>
     </ul>
   </div>
 </body>
 </html>

If there is a <base> tag in the <head> of the HTML, the value of its href attribute, rather than the URL of the current page, is joined with each relative path to form the absolute URL.

If you don't know this, your crawler may build the wrong sub-page URLs. A website can also exploit this mechanism to set up a honeypot: the URL joined according to the <base> tag is the real sub-page address, while the URL joined with the current page's URL is the honeypot address. A crawler that requests the latter will receive fake data or be blocked immediately.

For a detailed description of the <base> tag, please read: The Document Base URL element[1].
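One way a crawler might take <base> into account is sketched below, using only the standard library. `BaseAwareLinkParser`, `resolve_links`, and the `SAMPLE` page are hypothetical names for illustration. The sketch assumes that only the first <base> carrying an href counts, which is how browsers treat it, and it is not a full HTML resolver.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseAwareLinkParser(HTMLParser):
    """Record the first <base href> and every <a href> in the document."""

    def __init__(self):
        super().__init__()
        self.base_href = None
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "base" and self.base_href is None and attrs.get("href"):
            # Later <base> tags are ignored once one with an href was seen
            self.base_href = attrs["href"]
        elif tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def resolve_links(page_url, html):
    """Resolve every link against <base href> if present, else page_url."""
    parser = BaseAwareLinkParser()
    parser.feed(html)
    # The base href may itself be relative, so resolve it against the page first
    base_url = urljoin(page_url, parser.base_href) if parser.base_href else page_url
    return [urljoin(base_url, href) for href in parser.hrefs]


SAMPLE = """
<html><head><base href="http://www.kingname.info"></head>
<body><a href="1.html">First Book</a></body></html>
"""

page = "http://127.0.0.1:8000/book/index.html"
print(resolve_links(page, SAMPLE))
# Without a <base> tag the same link resolves against the page URL instead
print(resolve_links(page, '<a href="1.html">First Book</a>'))
```

With the <base> tag present, the link resolves to http://www.kingname.info/1.html; without it, the same href resolves to http://127.0.0.1:8000/book/1.html.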

References

[1] The Document Base URL element: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/base
