Writing crawlers has become "prison programming" because you don't understand the Robots protocol

Writing a Python crawler is easy; writing one safely requires more than technical knowledge, because there are legal considerations too. The Robots protocol is one of them. If you don't understand the Robots protocol and crawl something you shouldn't, you may even face jail time!

1. Introduction to Robots Protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. The rules are placed in a plain-text file named robots.txt, which usually lives in the root directory of the website.

Note that robots.txt only tells a crawler what it should and should not crawl; it does not technically prevent the crawler from fetching the prohibited resources. It is purely advisory. Although you can write a crawler that ignores robots.txt, as an ethical and disciplined crawler author you should follow the rules it describes as closely as possible. Otherwise, you may end up in a legal dispute.

When a crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls only within the range the file defines; if it does not, the crawler crawls all directly accessible pages of the website. Let's look at an example robots.txt file:

    User-agent: *
    Disallow: /
    Allow: /test/

These rules apply to all crawlers and forbid crawling any resource outside the /test/ directory. If this robots.txt file is placed in a site's root directory, a compliant search-engine crawler will only crawl resources under /test/, and resources in other directories will no longer be findable through the search engine.

The User-agent line above names the crawler the rules apply to. Setting it to * makes the rules apply to all crawlers. We can also target a specific crawler; for example, the following setting applies only to the Baidu crawler.

    User-agent: BaiduSpider

There are two important authorization directives in a robots.txt file: Disallow and Allow. The former forbids crawling; the latter permits it. In other words, Disallow is a blacklist and Allow is a whitelist. Here are some examples of Robots protocol rules.

1. Prohibit all crawlers from crawling all resources on the website

    User-agent: *
    Disallow: /

2. Prevent all crawlers from crawling resources in the /private and /person directories of the website

    User-agent: *
    Disallow: /private/
    Disallow: /person/

3. Prohibit only the Baidu crawler from crawling website resources

    User-agent: BaiduSpider
    Disallow: /
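The rules above can also be checked programmatically. As a minimal sketch using Python's built-in urllib.robotparser (covered in the next section), here are the rules from example 2 parsed from an in-memory string; the file paths being tested are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Rules from example 2: block /private/ and /person/ for every crawler
rules = """
User-agent: *
Disallow: /private/
Disallow: /person/
""".splitlines()

robot = RobotFileParser()
robot.parse(rules)

print(robot.can_fetch('*', '/private/data.html'))  # False: under a Disallow path
print(robot.can_fetch('*', '/index.html'))         # True: no rule matches
```

Because parse accepts any iterable of lines, this is also a convenient way to unit-test your own robots.txt before deploying it.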

Many search engine crawlers have specific names. Table 1 lists some commonly used crawler names.

Table 1 Common crawler names

    Crawler Name    Search Engine    Website
    Googlebot       Google           www.google.com
    BaiduSpider     Baidu            www.baidu.com
    360Spider       360 Search       www.so.com
    Bingbot         Bing             www.bing.com
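These names can be combined in a single robots.txt file to give different crawlers different rules; a crawler obeys the group whose User-agent matches it most specifically. As a hypothetical illustration (the /private/ directory is invented for this example):

```
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
```

Under these rules, Googlebot may crawl everything except /private/, while all other crawlers are blocked from the entire site.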

2. Parsing the Robots Protocol

We do not need to parse the Robots protocol ourselves: the robotparser module of the urllib library provides an API for parsing robots.txt files, namely the RobotFileParser class. RobotFileParser can be used in several ways. For example, you can point it at the URL of a robots.txt file with the set_url method, read the file, and then query it. The code is as follows:

    from urllib.robotparser import RobotFileParser

    robot = RobotFileParser()
    robot.set_url('https://www.jd.com/robots.txt')
    robot.read()
    print(robot.can_fetch('*', 'https://www.jd.com/test.js'))

The can_fetch method reports whether, according to the Robots protocol, a given URL on the site may be crawled: it returns True if crawling is allowed and False otherwise.

The RobotFileParser constructor can also accept the robots.txt URL directly. Note that you still need to call read() before can_fetch will give meaningful answers; until the file has been read, can_fetch returns False for every URL.

    robot = RobotFileParser('https://www.jd.com/robots.txt')
    robot.read()
    print(robot.can_fetch('*', 'https://www.jd.com/test.js'))

The following example uses the parse method to feed RobotFileParser the contents of a robots.txt file directly, then outputs whether different URLs are allowed to be crawled. This is another way to use the class.

    from urllib.robotparser import RobotFileParser
    from urllib import request

    robot = RobotFileParser()
    url = 'https://www.jianshu.com/robots.txt'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
        'Host': 'www.jianshu.com',
    }
    req = request.Request(url=url, headers=headers)

    # Fetch the contents of robots.txt and hand them to the parse method
    robot.parse(request.urlopen(req).read().decode('utf-8').split('\n'))
    # Output: True
    print(robot.can_fetch('*', 'https://www.jd.com'))
    # Output: True
    print(robot.can_fetch('*', 'https://www.jianshu.com/p/92f6ac2c350f'))
    # Output: False
    print(robot.can_fetch('*', 'https://www.jianshu.com/search?q=Python&page=1&type=note'))

The results are as follows:

    True
    True
    False
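Besides can_fetch, RobotFileParser can also report politeness hints such as the non-standard but widely used Crawl-delay directive, via its crawl_delay method (available since Python 3.6). A sketch using an invented in-memory robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt carrying a politeness hint
rules = """
User-agent: *
Crawl-delay: 5
Disallow: /search/
""".splitlines()

robot = RobotFileParser()
robot.parse(rules)

print(robot.crawl_delay('*'))               # 5: wait 5 seconds between requests
print(robot.can_fetch('*', '/search/q=1'))  # False
print(robot.can_fetch('*', '/p/abc'))       # True
```

A polite crawler can sleep for the reported delay between requests; crawl_delay returns None when the directive is absent.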

This article is reprinted from the WeChat public account "Geek Origin". To reprint this article, please contact the Geek Origin public account.
