1. Headers Verification

The essence of a web crawler is to simulate a network request. To send a request, the script must construct a set of request headers, usually called Headers. Header verification means that the server inspects the key-value pairs in the headers of the HTTP request message. Three pairs are checked most often:

(1) User-Agent: identifies the requester's user agent. If it is missing, the requester is treated as a robot.
(2) Referer: indicates whether the requester reached this page through a normal path; it is widely used in anti-hotlinking.
(3) Cookie: carries the requester's identity and is usually required on websites that can only be accessed after logging in.

This kind of header check is easy to deal with: capture the traffic while visiting the page in a browser and, in most cases, simply copy the contents of the request headers. Note that on login-protected pages the cookies are time-limited and must be refreshed promptly. Some better-protected websites also add extra encrypted parameters to the headers, computed by running local JS.

2. IP Address Records

Recording IP addresses is aimed mainly at malicious crawlers, to stop them from sending a large number of HTTP requests in a short time and tying up the website's resources. The principle is to detect abnormal access: if an IP issues dozens of requests within a short window (for example, 3 seconds), the address is recorded and judged to be a robot. When another HTTP request arrives from that IP, the server replies with status code 403 Forbidden and refuses further access. The advantage of this defense is obvious, but so is its drawback: a one-size-fits-all rule easily hurts human users. To cope with it, a crawler developer should stretch the interval between HTTP requests as much as possible, approaching the speed of a human browsing the page, so that the detection algorithm is not triggered. Alternatively, one can build an IP proxy pool or purchase proxy IPs, as shown in Figure 1.

Figure 1: Fast Proxy IP page

When HTTP requests go through a proxy IP, the local IP is hidden behind it; even if the algorithm flags the address, the crawler only needs to switch to a new one.
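As a minimal sketch of the header check in section 1, the snippet below sends a request carrying the three commonly inspected headers with the requests library. The URL, User-Agent string, and Cookie value are placeholders that would normally be copied from a browser packet capture.

```python
import requests

# Header values are copied from the browser's developer tools after loading the page.
# The Cookie is a placeholder; on login-protected sites it expires and must be refreshed.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Referer": "https://example.com/list",
    "Cookie": "sessionid=xxxxxxxx",
}

resp = requests.get("https://example.com/detail/1", headers=headers, timeout=10)
print(resp.status_code)
```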
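For the IP-based throttling in section 2, a crawler can slow itself down and rotate through a proxy pool. The sketch below assumes a hand-maintained list of proxy addresses; in practice they would come from a purchased service or a self-built pool.

```python
import random
import time

import requests

# A tiny stand-in for a proxy pool; the addresses are placeholders.
PROXIES = [
    "http://111.111.111.111:8888",
    "http://222.222.222.222:8888",
]

def fetch(url):
    proxy = random.choice(PROXIES)
    try:
        # The local IP stays hidden behind the proxy; a blocked proxy is simply skipped next round.
        return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
    except requests.RequestException:
        return None

for page in range(1, 6):
    resp = fetch(f"https://example.com/list?page={page}")
    time.sleep(random.uniform(2, 5))  # keep the request rate close to human browsing speed
```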
3. Ajax Asynchronous Loading

Ajax (Asynchronous JavaScript and XML) is a web development technique for creating interactive web applications. Simply put, while you browse the page the URL itself does not change, but the page content is updated dynamically. As shown in Figure 2, the waterfall-style loading of Baidu Images on the web uses Ajax.

Figure 2: Baidu Images capture

In this case, fetching the page content with a plain GET request will not locate the specific data, because it is generally returned through a data interface. Strictly speaking this is not an anti-crawler technique, but once Ajax is used, the crawler developer has to pick out the right packets from the network traffic instead of simply writing a crawler that GETs the page source. The response is to capture the web traffic and find, among the many packets, the data interface that actually carries the page content: if the data is to be rendered on the page, some packet must deliver it to the client, and the developer only needs to find it. The results returned by such interfaces are generally in JSON format, so a JSON parser is needed to extract the data.

4. Font Anti-Crawler

Unlike the general anti-crawler ideas above, font anti-crawling mainly tampers with the data itself. The target data can be viewed normally in the browser, but turns into garbled characters once copied out. The principle is that the website defines a custom font, builds a mapping relationship, and references the font in its CSS. When the page is viewed in a browser, the font files are fetched automatically, the correspondence is established, and the characters are mapped correctly. A crawler, however, usually requests only the page URL; without the mapping files there is no character set that can resolve these characters, so the result is garbled. As shown in Figure 3, the Shixiseng internship website uses a custom font file.

Figure 3: Shixiseng font anti-crawler

There are two ways to break font anti-crawling. The first is to find the request URL of the font file, download it locally, parse it with an XML parsing tool and then, based on the character correspondence, build a local mapping for replacing characters. The second is to copy the encrypted characters manually, encode them locally to obtain their code points, build a local mapping dictionary, and replace the characters after crawling. The second method works because the encrypted characters are usually few: most are Arabic numerals plus a handful of Chinese characters the site uses frequently, so they can simply be copied by hand and mapped.
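Putting the Ajax loading described in section 3 into practice usually means calling the data interface directly and parsing the JSON it returns. The interface URL, parameters, and field names below are placeholders; the real ones are found by filtering XHR/Fetch requests in the browser's Network panel.

```python
import requests

# Placeholder data interface; the real one is found in the browser's Network panel
# while the page loads more content.
api = "https://example.com/api/items"
params = {"page": 1, "size": 30}

resp = requests.get(api, params=params, timeout=10)
data = resp.json()                   # the interface returns JSON rather than HTML
for item in data.get("items", []):   # field names depend on the actual interface
    print(item.get("title"))
```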
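For the first way of breaking the font anti-crawler in section 4, the font file can be downloaded and inspected with the fontTools library, and the recovered correspondence applied as a simple replacement table. The font URL, code points, and digits below are purely illustrative.

```python
# pip install fonttools requests
import requests
from fontTools.ttLib import TTFont

# Placeholder font URL, normally found through packet capture or in the site's CSS.
font_url = "https://example.com/static/custom-font.woff"
with open("custom.woff", "wb") as f:
    f.write(requests.get(font_url).content)

font = TTFont("custom.woff")
font.saveXML("custom.xml")             # dump to XML to inspect the glyphs
print(font["cmap"].getBestCmap())      # code point -> glyph name correspondence

# After reading the dump, build a local mapping and replace the fake characters.
mapping = {"\ue343": "0", "\ue1d0": "1", "\ue409": "2"}  # illustrative values only
scraped = "\ue343\ue1d0\ue409"
for fake, real in mapping.items():
    scraped = scraped.replace(fake, real)
print(scraped)  # -> "012"
```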
5. Verification Code Anti-Crawler

Malicious crawlers are rampant on today's Internet. The methods above are workable, but malicious crawlers can break through them easily, and verification codes were born to deal with this. From the early letter-and-digit codes to today's click-the-picture codes, verification code technology keeps being updated and iterated, and more types will appear in the future. Verification codes protect two stages: login and registration, and page access. The former blocks malicious crawlers at the door while letting human users in; the latter cleans up the crawlers that get past registration and start scraping pages. If the server detects that an IP has made a large number of requests in a short time, it does not ban the user outright but shows a verification code instead, which avoids hurting real users; it is not a one-size-fits-all rule and is more humane. A human user can naturally pass these click-based codes, but a robot finds this second barrier hard to break through, as shown in Figure 4.

Figure 4: Click-the-picture verification code

The main responses are to connect to one of the major verification code recognition platforms or to train a deep learning model that passes the code on the crawler's behalf. With deep learning frameworks now widespread, training a model is no longer difficult, and simple code recognition can no longer stop crawlers equipped with such models, so website developers add more complex JS parameter encryption behind the recognition step: even if the code is recognized, the final encrypted result is hard to construct, which raises the cracking threshold. Alternatively, with automated testing tools such as Selenium, a trained model can drive simulated human behavior to pass the verification code directly, avoiding the trouble of cracking the JS-encrypted parameters. Automation tools, however, have obvious fingerprints, and some websites add detection of these features to their JS files and refuse service.

6. JavaScript Parameter Encryption

JavaScript (hereafter JS) parameter encryption is commonly seen in POST form submissions, mainly to prevent malicious robots from bulk registration, simulated logins, and the like. Capture the POST packet and you will find that the data typed into the form has been encrypted into an opaque string, which is achieved by running the website's local JS scripts; this alone blocks a large number of malicious crawlers with low technical ability. Dealing with it requires not only familiarity with debugging techniques but also a solid grounding in the JS language, because cracking it usually means being able to read the target site's encryption script, performing a series of deletions and modifications, using static analysis to gradually "pull" the specific encryption functions out of the huge script, running them locally to obtain the encrypted result, and then sending the parameters in a POST packet to get a normal response. There are two main ways to crack it:

(1) Simple encryption can be reproduced directly in Python.
(2) More complex encryption can be handled by extracting the relevant functions into a standalone script and simulating its execution, while also faking some of the browser fingerprints the script checks.
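As a rough sketch of the Selenium route mentioned in section 5: take a screenshot of the challenge element, hand it to a recognition platform or a locally trained model, and click the coordinates that come back. The page URL, the CSS selector, and the hard-coded coordinates standing in for the recognition result are all assumptions.

```python
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()              # assumes a matching chromedriver is installed
driver.get("https://example.com/login")  # placeholder login page

captcha = driver.find_element(By.CSS_SELECTOR, ".captcha-img")  # hypothetical selector
captcha.screenshot("captcha.png")        # save the challenge image for recognition

# The coordinates below stand in for the output of a recognition platform or model
# that was given captcha.png; a real integration would call its API here.
coords = [(35, 60), (120, 48)]

for x, y in coords:
    # Offsets are relative to the element; the reference point differs across Selenium versions.
    ActionChains(driver).move_to_element_with_offset(captcha, x, y).click().perform()
```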
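For approach (2) in section 6, a common pattern is to run the extracted encryption function locally through PyExecJS and submit its output. The file name encrypt.js, the function name sign, and the login endpoint are all hypothetical.

```python
# pip install PyExecJS; a local JS runtime such as Node.js must be available
import execjs
import requests

# encrypt.js holds the encryption logic pulled out of the site's script by static analysis;
# both the file and the exported function name "sign" are placeholders.
with open("encrypt.js", encoding="utf-8") as f:
    js_code = f.read()

ctx = execjs.compile(js_code)
encrypted = ctx.call("sign", "my_password")  # run the extracted JS function locally

resp = requests.post(
    "https://example.com/api/login",         # placeholder endpoint
    data={"username": "demo", "password": encrypted},
)
print(resp.status_code)
```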
7. JS Anti-Debugging

For developers who know the JS language well, the bar set by JS parameter encryption is not high. So, to cut off analysis of the website's encrypted files at the source, JS anti-debugging was born. The simplest form forbids right-clicking and shortcut keys such as F12; it can be bypassed by remapping the shortcuts, or by opening the developer tools in a new window and then switching back to the original page. The harder forms detect whether the user has opened the browser's developer tools or modified the local JS script files, and if so trap the page in an endless debugger loop so the script cannot be debugged. Cracking this requires familiarity with JS hooking, because the source code that checks the console state and the script-file state tends to be similar: one can write a Chrome extension that automatically hooks the anti-debugging code and replaces the offending functions, so the checks pass and static analysis becomes possible again.

8. AST Obfuscation Anti-Crawler

In theory, no anti-crawler measure can keep crawlers out entirely, because a website that wants user traffic will not set the bar so high that normal users cannot get in; as long as the developer's crawler imitates human access closely enough, it can enter the site and run wild. Although crawlers cannot be blocked completely, the cost of entry can be raised and the website's losses minimized. Among all the defenses, JS parameter encryption has a comparatively outstanding protective effect, keeping most low-skill crawler developers out; even on sites protected by verification codes, the HTTP requests behind the code are themselves encrypted with JS. It cannot stop crawlers entirely, but it forces their developers to spend a great deal of time on cracking, which makes it a low-cost yet effective measure; if the site also changes its encryption script frequently, even the most experienced crawler developers will be worn out. The key question is therefore how to make the JS script harder to crack. The common tricks for stopping developers from debugging JS files, forbidding right-click and the developer tools, or detecting them with JS code, have universal workarounds, because their protection level is not high and proficient use of a search engine is enough to bypass them. To delay cracking for as long as possible, the best approach is to use the AST (abstract syntax tree) to heavily obfuscate the JS code, turning it into a file that is unreadable and unrecognizable yet still works normally. As shown in Figure 5, the readability of obfuscated JS code drops sharply, which further increases the difficulty of JS reverse engineering.

Figure 5: Obfuscated JS code

9. Conclusion

It is inevitable that anti-crawler technology cannot eradicate web crawlers: a website's front-end encrypted files can be read by any user, and the website's existence depends on the traffic of real users. Today's crawler technology is developing rapidly and can come close to being indistinguishable from the real thing; even checks on whether the JS file is actually running in a browser can be met by simulating the corresponding object prototypes in the script. Either way, the contest between crawlers and anti-crawlers has, on the one hand, raised the bar for crawlers and strengthened companies' security awareness, and on the other hand driven the development of crawler technology.