Writing crawlers has become "prison programming" because you don't understand the Robots protocol

Writing a Python crawler is easy; writing one safely requires more than technical knowledge, because there are legal considerations too. The Robots protocol is one of them. If you don't understand the Robots protocol and crawl something you shouldn't, you may even face jail time!

1. Introduction to Robots Protocol

The Robots protocol, also called the crawler protocol or robot protocol, is formally the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be crawled and which may not. The rules are placed in a plain-text file named robots.txt, which usually lives in the root directory of the website.

Note that robots.txt only tells a crawler what it should and should not crawl; it does not technically prevent the crawler from fetching the prohibited resources. It is purely advisory. Although you can write a crawler that ignores robots.txt, as an ethical and disciplined crawler author you should follow the rules it describes as closely as possible. Otherwise, you may end up in a legal dispute.

When a crawler visits a website, it first checks whether a robots.txt file exists in the site's root directory. If it does, the crawler crawls only within the range the file defines; if it does not, the crawler crawls all directly accessible pages of the website. Let's look at an example robots.txt file:

    User-agent: *
    Disallow: /
    Allow: /test/

These rules apply to all crawlers and forbid crawling any resource outside the /test/ directory. If this robots.txt file is placed in a site's root directory, a compliant search-engine crawler will only crawl resources under /test/, and resources in other directories will no longer be findable through the search engine.

The User-agent line above names the crawler the rules apply to. Setting it to * makes the rules apply to all crawlers. We can also target a specific crawler; for example, the following setting applies only to the Baidu crawler.

    User-agent: BaiduSpider

There are two important authorization directives in a robots.txt file: Disallow and Allow. The former forbids crawling; the latter permits it. In other words, Disallow is a blacklist and Allow is a whitelist. Here are some examples of Robots protocol rules.

1. Prohibit all crawlers from crawling all resources on the website

    User-agent: *
    Disallow: /

2. Prevent all crawlers from crawling resources in the /private and /person directories of the website

    User-agent: *
    Disallow: /private/
    Disallow: /person/

3. Prohibit only the Baidu crawler from crawling website resources

    User-agent: BaiduSpider
    Disallow: /
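The rules above can also be checked programmatically. As a minimal sketch using Python's built-in urllib.robotparser (covered in the next section), here are the rules from example 2 parsed from an in-memory string; the file paths being tested are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Rules from example 2: block /private/ and /person/ for every crawler
rules = """
User-agent: *
Disallow: /private/
Disallow: /person/
""".splitlines()

robot = RobotFileParser()
robot.parse(rules)

print(robot.can_fetch('*', '/private/data.html'))  # False: under a Disallow path
print(robot.can_fetch('*', '/index.html'))         # True: no rule matches
```

Because parse accepts any iterable of lines, this is also a convenient way to unit-test your own robots.txt before deploying it.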

Many search engine crawlers have specific names. Table 1 lists some commonly used crawler names.

Table 1 Common crawler names

    Crawler Name    Search Engine    Website
    Googlebot       Google           www.google.com
    BaiduSpider     Baidu            www.baidu.com
    360Spider       360 Search       www.so.com
    Bingbot         Bing             www.bing.com
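These names can be combined in a single robots.txt file to give different crawlers different rules; a crawler obeys the group whose User-agent matches it most specifically. As a hypothetical illustration (the /private/ directory is invented for this example):

```
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
```

Under these rules, Googlebot may crawl everything except /private/, while all other crawlers are blocked from the entire site.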

2. Parsing the Robots Protocol

We do not need to parse the Robots protocol ourselves: the robotparser module of the urllib library provides an API for parsing robots.txt files, namely the RobotFileParser class. RobotFileParser can be used in several ways. For example, you can point it at the URL of a robots.txt file with the set_url method, read the file, and then query it. The code is as follows:

    from urllib.robotparser import RobotFileParser

    robot = RobotFileParser()
    robot.set_url('https://www.jd.com/robots.txt')
    robot.read()
    print(robot.can_fetch('*', 'https://www.jd.com/test.js'))

The can_fetch method reports whether, according to the Robots protocol, a given URL on the site may be crawled: it returns True if crawling is allowed and False otherwise.

The RobotFileParser constructor can also accept the robots.txt URL directly. Note that you still need to call read() before can_fetch will give meaningful answers; until the file has been read, can_fetch returns False for every URL.

    robot = RobotFileParser('https://www.jd.com/robots.txt')
    robot.read()
    print(robot.can_fetch('*', 'https://www.jd.com/test.js'))

The following example uses the parse method to feed RobotFileParser the contents of a robots.txt file directly, then outputs whether different URLs are allowed to be crawled. This is another way to use the class.

    from urllib.robotparser import RobotFileParser
    from urllib import request

    robot = RobotFileParser()
    url = 'https://www.jianshu.com/robots.txt'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36',
        'Host': 'www.jianshu.com',
    }
    req = request.Request(url=url, headers=headers)

    # Fetch the contents of robots.txt and hand them to the parse method
    robot.parse(request.urlopen(req).read().decode('utf-8').split('\n'))
    # Output: True
    print(robot.can_fetch('*', 'https://www.jd.com'))
    # Output: True
    print(robot.can_fetch('*', 'https://www.jianshu.com/p/92f6ac2c350f'))
    # Output: False
    print(robot.can_fetch('*', 'https://www.jianshu.com/search?q=Python&page=1&type=note'))

The results are as follows:

    True
    True
    False
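Besides can_fetch, RobotFileParser can also report politeness hints such as the non-standard but widely used Crawl-delay directive, via its crawl_delay method (available since Python 3.6). A sketch using an invented in-memory robots.txt:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt carrying a politeness hint
rules = """
User-agent: *
Crawl-delay: 5
Disallow: /search/
""".splitlines()

robot = RobotFileParser()
robot.parse(rules)

print(robot.crawl_delay('*'))               # 5: wait 5 seconds between requests
print(robot.can_fetch('*', '/search/q=1'))  # False
print(robot.can_fetch('*', '/p/abc'))       # True
```

A polite crawler can sleep for the reported delay between requests; crawl_delay returns None when the directive is absent.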

This article is reprinted from the WeChat public account "Geek Origin". To reprint this article, please contact the Geek Origin public account.
