Preface

As an important tool for cold-start data and data enrichment, crawlers play a significant role in business development. We have accumulated considerable experience with crawlers over the course of our work, and we would like to share it here, hoping to offer some ideas on technology selection for future projects. We will share our experience around the following points:
Application scenarios of crawlers

In production, crawlers are mainly used in the following scenarios:
Technical selection of crawlers

Next, we will introduce several commonly used technical solutions for crawlers, from the simple to the complex.

Simple crawlers

Speaking of crawlers, you may think the technology is fairly advanced and immediately reach for a crawler framework like Scrapy. Such frameworks are indeed very powerful, but does that mean you must use a framework every time you write a crawler? No, it depends on the situation. If the interface we want to crawl returns only very simple, fixed, structured data (such as JSON), using a framework like Scrapy is sometimes overkill and not very economical.

For example, we had a business requirement to capture the stage-by-stage content in Yuxueyuan for every stage from "less than 4 weeks of pregnancy" to "more than 36 months". For this kind of request, curl in bash is more than enough. First, we use Charles or another packet capture tool to capture the interface data of this page. By observation, we found that only the value of month (representing the number of weeks of pregnancy) differs between requests, so we can crawl all the data along the following lines:

1. Find all the month values corresponding to "less than 4 weeks of pregnancy" through "more than 36 months", and build a month array.
2. Build a curl request with the month value as a variable (in Charles, the curl request can be obtained as shown in the figure).
3. Traverse the months from step 1 one by one. On each iteration, combine the curl template from step 2 with the current month to build and execute a request, and save each response to its own file (one file per stage). The data in these files can then be parsed and analyzed later.

The sample code is roughly as follows; for the convenience of demonstration it has been simplified a lot, but it is enough to understand the principle.
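A minimal sketch of this idea, written in Python for consistency with the later examples (the original used bash and curl); the endpoint URL, headers, and month values are placeholder assumptions, not the real interface:

```python
import requests

# Hypothetical stage values, from "less than 4 weeks of pregnancy" to "more than 36 months";
# the real list comes from the captured requests.
months = [1, 2, 3, 4, 5]

# Placeholder endpoint and headers; in practice they are copied from the Charles capture.
URL = "https://example.com/api/stage_content"
HEADERS = {"User-Agent": "Mozilla/5.0"}

for month in months:
    resp = requests.get(URL, params={"month": month}, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    # Save each stage's response to its own file for later parsing and analysis.
    with open(f"month_{month}.json", "w", encoding="utf-8") as f:
        f.write(resp.text)
```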
In the early days, most of our business used PHP, and many crawler requests were handled in PHP as well. In PHP we can simulate the bash curl request by calling libcurl; for example, when the business needed to crawl the weather conditions of each city, we did it with PHP calling curl in a single line of code. After these two examples, do you feel that crawlers are just that simple? Yes: many such simple crawler implementations can meet the needs of most scenarios.

Creative crawler solutions

The crawler ideas introduced above can solve most day-to-day crawling needs, but sometimes a more creative approach is required. Here are two simple examples:

1. Last year, an operations colleague gave me a link to a Tmall "selected milk powder" page.
Using search engine techniques, we could easily meet this operational demand. As shown in the picture, the steps are as follows:
In this way, we again cleverly met the operational need. Note that the data obtained by this crawler is an HTML page rather than JSON or other structured data, so we need to extract the corresponding URL information (found in the <a> tags) from the HTML, which we can do with regular expressions or XPath. For example, suppose the HTML contains the following div element; it can be extracted with the XPath shown below.
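A minimal sketch in Python with lxml, assuming an illustrative div (the id attribute and fragment are placeholders, not the page's real markup):

```python
from lxml import etree

# Illustrative HTML fragment; the div id is an assumption.
html = '<div id="demo">Hello everyone!</div>'

tree = etree.HTML(html)
# XPath: select the text inside the div with id="demo".
result = tree.xpath('//div[@id="demo"]/text()')
print(result[0])  # Hello everyone!
```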
This extracts "Hello everyone!". Note that in this scenario you still do not need a complex framework like Scrapy; the amount of data is small, so a single thread is enough, and in actual production we handled it with PHP.

2. One day, an operations colleague raised another requirement: crawling Meipai videos. Through packet capture, we found that the URL of each video on Meipai is very simple, and entering it into a browser plays the video normally. So we assumed we could download the video directly through this URL, but in fact the URL points to a segmented stream (m3u8, a playlist format designed to optimize loading speed), and the downloaded video was incomplete. Later we found that opening the website http://www.flvcd.com/, entering the Meipai address, and converting it yields the complete video download address. As shown in the picture, after clicking "GO!" the video address is parsed and the complete download address is returned. Further analysis shows that the request behind the "GO!" button is "http://www.flvcd.com/parse.php?format=&kw= + video address". Therefore, as long as we get a Meipai video address and then call flvcd's conversion request, we obtain the complete video download address. In this way, we solved the problem of not being able to get the complete Meipai address.

Complex crawler design

The data we wanted to crawl above is relatively simple and ready to use. In reality, most of the data we want to crawl is unstructured (HTML pages and so on) and needs further processing (the data-cleaning stage of a crawler). Moreover, each page we crawl is likely to contain a large number of URLs still to be crawled, which means URL queue management is required. In addition, some requests require login, and cookies must be attached to each request, which involves cookie management. In such cases it is worth considering a framework like Scrapy. Whether we write it ourselves or use a crawler framework like Scrapy, the design is basically inseparable from the following modules:
The following figure explains in detail how the various modules work together.
As you can see, with the framework above, if there are many URLs to crawl, the downloading, parsing, and storage workload is very large (for example, we have a business similar to Dianping.com and need to crawl Dianping's data; since this involves millions of merchants, comments, and so on, the data volume is huge), and multi-threaded or distributed crawling becomes necessary. A single-threaded language model such as PHP's is not well suited to this. Python, which supports multi-threading, coroutines, and other features, is more than capable of handling these more complex crawler designs. At the same time, thanks to Python's concise syntax, a large community has produced many mature, ready-to-use libraries, which is very convenient. The well-known Scrapy framework has won a large following with its rich plug-ins and ease of use. Most of our crawler business is implemented with Scrapy, so let us briefly introduce it and, along the way, see how a mature crawler framework is designed. First, we should consider the problems a crawler may encounter while crawling data, so that we understand why a framework is necessary and what points we should consider if we design our own framework in the future:
From the above points, we can see that writing a crawler framework from scratch still takes a lot of effort. Fortunately, Scrapy solves the above problems almost perfectly for us, allowing us to focus on writing the specific parsing and storage logic. Let's see how it implements the functional points above.
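To give a feel for how little boilerplate this leaves to the developer, here is a minimal spider sketch; the spider name, start URL, and selectors are illustrative assumptions rather than our production code:

```python
import scrapy

class DemoSpider(scrapy.Spider):
    # Illustrative spider: name, start URL, and selectors are placeholders.
    name = "demo"
    start_urls = ["https://example.com/list"]

    def parse(self, response):
        # Parse the listing page: yield one item per entry.
        for entry in response.css("div.item"):
            yield {
                "title": entry.css("a::text").get(),
                "url": response.urljoin(entry.css("a::attr(href)").get()),
            }
        # Follow pagination; Scrapy handles scheduling, deduplication, and retries.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Running it with "scrapy crawl demo -o items.json" would export the parsed items, with the downloader, scheduler, and middleware all handled by the framework.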
As you can see, Scrapy solves the main problems listed above. When crawling large amounts of data, it lets us focus on the crawler's business logic without worrying about details such as cookie management and multi-thread management, which greatly reduces our burden and makes it easy to get twice the result with half the effort. (Note: although Scrapy can use Selenium + PhantomJS to crawl dynamic data, PhantomJS has stopped being updated since Google released Puppeteer. Puppeteer is much more powerful than PhantomJS, so if you need to crawl a large amount of dynamic data and have to consider the performance impact, the Puppeteer Node library is definitely worth a try; it is officially maintained by Google and highly recommended.) Now that we understand Scrapy's main design ideas and functions, let's look at how we used Scrapy to develop a crawler project for our audio and video business, and what problems we ran into when building an audio and video crawler.

Audio and video crawler practice

1. A brief introduction to our audio and video crawler system, from several aspects

1. Four main processes
2. Currently supported functions
3. System process distribution diagram

2. Now let's talk about the details step by step

1. Technical selection of the crawler framework

When it comes to crawlers, everyone naturally thinks of Python, so we chose our technical framework from the third-party libraries that stand out in Python, and Scrapy is a very good one. I believe many other crawler developers have also tried this framework. The advantages we feel most deeply after using it for so long are:
2. Design of the crawler pool DB

The crawler pool DB is a very important storage node for the entire crawling pipeline, so in our early childhood education business it has gone through many field changes. Initially, our crawler pool table was just a copy of the official table and stored exactly the same content; after a crawl finished, the data was copied into the official table and the association between them was lost. At that point the crawler pool was nothing more than a draft table full of useless data. Later we discovered that operations needed to see the specific source of each crawled item, but the crawler pool contained no source link, and there was no way to map an album id in the official table back to the crawler pool data. The crawler pool DB therefore underwent its most important change. The first step was to establish the association between crawler pool data and the source site, i.e. the source_link and source_from fields, which hold the original page URL of the content and the definition of the source, respectively. The second step was to establish the association between crawler pool content and official library content; to avoid affecting the official library data, we added a target_id field that points to the content id in the official library. With this in place, we could tell operations the specific source of any crawled content. Operations later found that picking high-quality content out of a large amount of crawled data requires some reference metrics from the source site, such as view counts there. At that point the contents of the crawler pool DB and the official library DB formally diverged: the crawler pool is no longer just a copy of the official library but holds reference data from the source site along with some basic data of the official library. The later feature of synchronizing updated source-site content was also easy to build on this set of relationships. Throughout the whole process, the most important thing was linking three otherwise unrelated pieces: the crawled source-site content, the crawler pool content, and the official library content.
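To make these fields concrete, here is a minimal sketch of a crawler pool record; only source_link, source_from, and target_id come from the description above, the class itself and the extra metric field are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CrawlerPoolItem:
    # Hypothetical record layout for the crawler pool table.
    id: int
    title: str
    source_link: str            # original page URL on the source site
    source_from: int            # code identifying which source site the content came from
    target_id: Optional[int]    # content id in the official library once published
    source_play_count: int = 0  # reference metric from the source site (illustrative field name)
```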
3. Why is there a separate resource processing task?

Originally, downloading resources and the related processing should be completed during the crawling stage, so why is there a separate resource processing step? The first version of the early childhood crawler system did not have this separate step; everything ran serially inside the Scrapy crawling process. However, the shortcomings we found later were:

To address the above issues, we added an intermediate state to the crawler table: a state in which the resource download has failed but the crawled information is retained. We then added independent resource processing tasks and used Python's multithreading to process resources. For the failed content, we run the resource processing task periodically until it succeeds (of course, if it keeps failing, developers need to troubleshoot based on the logs).

4. Why watermark processing is done in post-processing (i.e. after formal storage) rather than in the resource processing stage

First of all, we need to understand the principle of watermark removal: it uses ffmpeg's delogo function (a minimal sketch follows at the end of this section). Unlike converting video formats, which only changes the container, delogo has to re-encode the entire video, so it takes a very long time and uses a lot of CPU. If it were placed in the resource processing stage, the efficiency of transferring resources to Upyun would drop sharply. In addition, Youku alone has more than three kinds of watermarks, and working out the rules for each is very time-consuming, which would further slow down crawling. By making sure resources are stored first and handling watermarks afterwards, operations can flexibly control listing and delisting, and developers get enough time to work out the rules. Moreover, if watermark processing fails, the source video is still there to fall back on.

5. How to remove image watermarks

Many images captured by crawlers carry watermarks, and there is currently no perfect way to remove them. The following methods can be used:
- Search for the original image. Most websites keep both the original image and the watermarked one; if the original link cannot be found, there is nothing to be done.
- Cropping. Since watermarks usually sit at the corners of images, if a cropped image is acceptable, the watermarked part can simply be cropped off proportionally.
- Use the OpenCV library. Call OpenCV, a graphics library, to perform PS-like image inpainting; the effect is passable, but the repair is poor on complex graphics.

3. Problems and solutions

- Interruptions or failures often occur during the resource download phase. [Solution: separate resource downloading and related processing from the crawling process so the task can easily be re-run.]
- Even across different platforms there are many duplicate resources, especially on video websites. [Solution: match resources by title before downloading and filter out exact matches, saving unnecessary download time.]
- During large-scale crawling, the IP may get blocked. [Solution: dynamic IP proxies.]
- The resource acquisition rules of big video websites change frequently (encryption, video slicing, anti-hotlinking, etc.), making development and maintenance costly. [Solution: the you-get third-party library, which supports crawling a large number of mainstream video sites and greatly reduces development and maintenance costs.]
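A minimal sketch of the delogo step mentioned in point 4 above, calling ffmpeg from Python; the watermark coordinates and file names are placeholder assumptions that have to be worked out per site and watermark:

```python
import subprocess

def remove_watermark(src: str, dst: str, x: int, y: int, w: int, h: int) -> None:
    """Re-encode the video with ffmpeg's delogo filter to blur out a watermark region."""
    cmd = [
        "ffmpeg", "-y",
        "-i", src,
        "-vf", f"delogo=x={x}:y={y}:w={w}:h={h}",  # rectangle covering the watermark
        "-c:a", "copy",                            # audio is copied; only video is re-encoded
        dst,
    ]
    subprocess.run(cmd, check=True)

# Example call (placeholder paths and coordinates):
# remove_watermark("source.mp4", "clean.mp4", x=10, y=10, w=160, h=60)
```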
4. Finally, a summary

Our audio and video crawler system may not be applicable to every business line, but the thinking behind it and the solutions to similar problems can certainly serve as a reference for other teams, and I believe the project will give everyone plenty of inspiration.

Crawler management platform

When there are many crawler tasks, the ssh + crontab approach becomes very troublesome; you need a platform where you can view and manage the running status of crawlers at any time. SpiderKeeper + Scrapyd is a ready-made management solution with a good UI. Its features include:
1. Crawler job management: start crawlers on a schedule, and start or stop crawler tasks at any time
2. Crawler logs: logs recorded while crawlers run, useful for troubleshooting crawler problems
3. Crawler status: check which crawlers are running and for how long
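Under the hood, SpiderKeeper drives Scrapyd's HTTP job API; a minimal sketch of using that API directly, where the project and spider names are placeholders:

```python
import requests

SCRAPYD = "http://localhost:6800"   # default Scrapyd address (assumes a local deployment)

# Schedule a crawl job; "myproject" and "demo" are placeholder names.
resp = requests.post(f"{SCRAPYD}/schedule.json",
                     data={"project": "myproject", "spider": "demo"})
print(resp.json())   # e.g. {"status": "ok", "jobid": "..."}

# Check pending/running/finished jobs for the project.
jobs = requests.get(f"{SCRAPYD}/listjobs.json",
                    params={"project": "myproject"}).json()
print(jobs["running"])
```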
Summary

From the above description, we can briefly summarize the technical selection of crawlers: choose the technology appropriate to the complexity of the business scenario, and you can get twice the result with half the effort. Always consider the actual business scenario when selecting technology.

This article is reprinted from the WeChat public account "Ma Hai". To reprint this article, please contact the Ma Hai public account.