It is easy to write a Python crawler, but writing one safely requires more than technical skill; it also requires legal awareness. The Robots protocol is part of that: if you ignore it and crawl something you should not, you may even face jail time!

1. Introduction to the Robots Protocol

The Robots protocol is also called the Crawler protocol or the Robot protocol. Its full name is the Robots Exclusion Protocol, and it tells crawlers and search engines which pages may be crawled and which may not. The rules are usually placed in a text file named robots.txt located in the root directory of the website.

Note that robots.txt only tells a crawler what it should and should not crawl; it does not technically prevent the crawler from fetching the prohibited resources. It is purely a notice. Although you can write a crawler that ignores robots.txt, an ethical and disciplined crawler should follow the rules described in the file as closely as possible; otherwise, legal disputes may arise.

When a crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls only within the range defined in that file. If the file does not exist, the crawler crawls all directly accessible pages of the site. Let's look at an example of a robots.txt file:
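One way such a file might look, assuming (as the discussion below does) that the site wants to expose only its test directory to all crawlers:

```
User-agent: *
Allow: /test/
Disallow: /
```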
This rule applies to all crawlers and allows them to crawl nothing outside the test directory. If this robots.txt file is placed in the root directory of a website, search engine crawlers will only crawl resources under the test directory, and the search engine will no longer be able to find resources in other directories. The User-agent field names the crawler the rules apply to; setting it to * makes the rules apply to all crawlers. You can also target a specific crawler, such as the following setting, which explicitly names the Baidu crawler.
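For example, a record aimed only at the Baidu crawler (whose User-agent is Baiduspider) might look like this; the rules themselves are just an illustration:

```
User-agent: Baiduspider
Allow: /test/
Disallow: /
```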
There are two important directives in a robots.txt file: Disallow and Allow. The former forbids crawling and the latter permits it; in other words, Disallow is a blacklist and Allow is a whitelist. Here are some examples of Robots protocol rules (the corresponding robots.txt snippets follow this list).

1. Prohibit all crawlers from crawling any resource on the website
2. Prevent all crawlers from crawling resources in the /private and /person directories of the website
3. Prohibit only the Baidu crawler from crawling website resources
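The robots.txt files for these three cases might look like the following (each numbered case is a separate robots.txt file; the directory names are taken from the list above):

```
# 1. Prohibit all crawlers from crawling any resource
User-agent: *
Disallow: /

# 2. Prevent all crawlers from crawling /private and /person
User-agent: *
Disallow: /private
Disallow: /person

# 3. Prohibit only the Baidu crawler
User-agent: Baiduspider
Disallow: /
```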
Many search engine crawlers have well-known names; Table 1 lists some of the common ones.

Table 1 Common crawler names

2. Parsing the Robots Protocol

We do not need to parse the Robots protocol ourselves. The robotparser module of the urllib library provides an API for parsing robots.txt files: the RobotFileParser class. The RobotFileParser class can be used in several ways. For example, you can set the URL of the robots.txt file with the set_url method and then query it. The code is along the following lines:
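A minimal sketch of this usage, assuming a hypothetical site www.example.com that serves a robots.txt file:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # point the parser at the robots.txt file
rp.read()                                          # download and parse the file

# Ask whether a given crawler may fetch a given URL on the site.
print(rp.can_fetch('*', 'https://www.example.com/index.html'))
print(rp.can_fetch('Baiduspider', 'https://www.example.com/test/page.html'))
```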
The can_fetch method reports whether, according to the Robots protocol, a given crawler is allowed to fetch a given URL on the site: it returns True if crawling is allowed and False otherwise. The constructor of the RobotFileParser class can also accept the robots.txt URL directly, after which can_fetch is used in the same way to decide whether a page may be fetched.
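A sketch of that second usage, again against the hypothetical www.example.com:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt URL can be passed to the constructor instead of calling set_url.
rp = RobotFileParser('https://www.example.com/robots.txt')
rp.read()  # still required: fetch and parse the file

print(rp.can_fetch('*', 'https://www.example.com/index.html'))
```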
The following example uses the parse method to supply the robots.txt data directly and then prints whether different URLs are allowed to be crawled. This is another way to use the RobotFileParser class.
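A sketch of this approach, feeding a hypothetical robots.txt to parse() as a list of lines:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as text instead of being downloaded.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch('*', 'https://www.example.com/index.html'))
print(rp.can_fetch('*', 'https://www.example.com/private/data.html'))
```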
For the sketch above, the results would be as follows:
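```
True
False
```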
This article is reprinted from the WeChat public account "Geek Origin". To reprint this article, please contact the Geek Origin public account.