Learn Crawler Skills in One Article


Preface

As an important tool for cold-starting and enriching data, crawlers play a big role in business development. We have accumulated a lot of hands-on experience with them along the way and would like to share it here, hoping to offer some ideas on technology selection that will help future projects.

We will share our experience around the following points:

  • Application scenarios of crawlers
  • Technical selection of crawlers
  • Practical explanation: crawler solutions in complex scenarios
  • Crawler Management Platform

Application scenarios of crawlers

In production, crawlers are mainly used in the following scenarios:

  1. Search engines. Google, Baidu, and the like launch countless crawlers every day to fetch web page information, which is what makes searching with them so convenient, comprehensive, and efficient. (How search engines work is explained in detail in a separate article that is well worth reading.)
  2. Cold-starting and enriching data for a new business. A business that has just launched has little data of its own, so we crawl data from other platforms to fill in ours. For example, to build a platform like Dianping, there is no merchant information at the beginning, so we crawl merchant data from Dianping, Meituan, and similar sites to seed our data.
  3. Data service or aggregation companies, such as Tianyancha, Qichacha, Xigua Data, etc.
  4. Provide horizontal data comparison and aggregation services. For example, e-commerce often requires a price comparison system to capture price information of the same product from major e-commerce platforms, such as Pinduoduo, Taobao, and JD.com, in order to provide users with the most affordable product prices. This requires crawling information from major e-commerce platforms.
  5. Black- and gray-market activity, risk control, and so on. For example, when we apply for a loan from certain lenders, the lender first runs risk control to check whether our personal information meets the lending criteria. That personal information is usually crawled from various channels by companies using crawler technology. Of course, scenarios like this should be treated with great caution, or the saying "the better you get at crawlers, the sooner you get into trouble" really will come true.

Technical selection of crawlers

Next, we will introduce several commonly used technical solutions for crawlers, from simple to advanced.

Simple crawlers

When crawlers come up, you might assume the technology is fairly advanced and immediately reach for a framework like Scrapy. Such frameworks are indeed powerful, but does that mean you must use one every time you write a crawler? No! It depends on the situation. If the interface we want to crawl returns simple, fixed, structured data (such as JSON), then using a framework like Scrapy is often overkill and not very economical.

For example, take this business requirement: capture the stage-by-stage content in the Yuxueyuan app for every stage from "less than 4 weeks pregnant" up to "more than 36 months".

For this kind of request, curl in bash is more than enough!

First, we use Charles and other packet capture tools to capture the interface data of this page, as follows

Through observation, we found that only the value of month (representing the number of weeks of pregnancy) in the requested data is different, so we can crawl all the data according to the following ideas:

1. Find all the month values corresponding to "less than 4 weeks pregnant" ~ "more than 36 months", and build a month array.

2. Build a curl request with the month value as a variable. In Charles, we can copy the corresponding curl request as follows:

3. Traverse the month array from step 1. On each iteration, plug the month value into the curl request from step 2 and execute it, saving the result of each request into its own file (one file per stage), so that the data can be parsed and analyzed later.

The sample code is as follows; for the sake of demonstration the curl command has been greatly simplified, but it is enough to show the idea.

    #!/bin/bash

    ## months corresponding to all the stages; for demonstration only two values are taken here
    months=(21 24)
    ## traverse every month value and assemble a curl request for each
    for month in "${months[@]}"
    do
      ## build the curl request with month as a variable and write each response
      ## to its own file ($month.log) for later parsing and analysis
      curl -H 'Host: yxyapi2.drcuiyutao.com' \
           -H 'clientversion: 7.14.1' \
           ...
           -H 'birthday: 2018-08-07 00:00:00' \
           --data "body=month%22%3A$month" \
           --compressed 'http://yxyapi2.drcuiyutao.com/yxy-api-gateway/api/json/tools/getBabyChange' \
           > "$month.log"
    done

In the early days most of our business was in PHP, and many crawler requests were handled there. PHP can also issue curl-style requests by calling libcurl. For example, when the business needed to crawl the weather for each city, a single PHP curl call was enough.

After these two examples, does crawling look pretty simple? It is: simple crawler implementations like these can cover the majority of business scenarios.

Creative crawler solutions

The crawler ideas introduced above can solve most of the daily crawler needs, but sometimes we need some creative ideas. Here are two simple examples:

1. Last year, an operations colleague sent me a link to a Tmall Selected article about milk powder: https://m.tmall.com/mblist/de_9n40_AVYPod5SU93irPS-Q.html. They wanted to extract the content of that article, and also to find every Tmall Selected article that mentions the keyword "milk powder" and extract those as well. This requires some advanced search-engine skills. We noticed that Tmall Selected URLs take the following form:

https://m.tmall.com/mblist/de_ + a unique signature for each article

Using search engine techniques, we can easily meet this operational demand

Referring to the picture above, the steps are as follows:

  1. First, enter the advanced query "milk powder site:m.tmall.com inurl:mblist/de_" into the Baidu search box and click search; the results page shows all Tmall Selected articles containing "milk powder".
  2. Note that the browser has generated the complete search URL in the address bar. Once we have this URL, we can request it ourselves and get back an HTML file containing areas 3 and 4 in the figure above.
  3. In the HTML file from step 2, each title in area 3 corresponds to a URL (of the form ...). A regular expression can pull out the URL for each title, and requesting those URLs gives us the corresponding article content.
  4. Similarly, from the same HTML file we can get the URL of each page in area 4, request those URLs in turn, and repeat step 3 for each of them, so that every page of Tmall Selected articles mentioning milk powder is covered.

In this way we cleverly met the operational need. Note that what this crawler fetches is an HTML file rather than JSON or other structured data, so we need to extract the URL information (contained in the `<a>` tags) from the HTML, which can be done with regular expressions or XPath.

For example, suppose the HTML contains the following div element:

    <div id="test1">Hello everyone!</div>

You can extract its text with the following XPath:

    data = selector.xpath('//div[@id="test1"]/text()').extract()[0]

This extracts "Hello everyone!". Note that even here there is still no need for a complex framework like Scrapy: the data volume is small, a single thread is enough, and in actual production we met the need with PHP.
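To make the whole flow concrete, here is a minimal Python sketch of the milk-powder idea above: request the Baidu advanced-search URL, pull the Tmall Selected article links out of the result HTML, and fetch each article. It is only an illustration; in practice Baidu wraps results in redirect links and applies anti-crawling measures, so a real implementation needs extra handling (headers, delays, resolving redirects).

    import re
    import requests

    # the advanced query from the steps above; Baidu's search endpoint takes it as the wd parameter
    query = "milk powder site:m.tmall.com inurl:mblist/de_"
    html = requests.get("https://www.baidu.com/s", params={"wd": query}, timeout=10).text

    # pull out Tmall Selected article URLs (assumes they appear directly in the HTML;
    # Baidu often wraps them in redirect links that must be resolved first)
    article_urls = set(re.findall(r"https?://m\.tmall\.com/mblist/de_[^\s\"'<>]+", html))

    for url in article_urls:
        article_html = requests.get(url, timeout=10).text
        # ...parse article_html with a regex or XPath as described above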

2. One day, an operations colleague raised another requirement: crawling Meipai videos.

Through packet capture we found that each Meipai video has a very simple URL, and pasting it into a browser plays the video normally. So we assumed we could download the video directly from that URL, but it turned out the URL points to segmented content (m3u8, a playlist format used for segmented streaming to speed up loading), and the downloaded video was incomplete. Later we found that opening the website `http://www.flvcd.com/`, entering the Meipai address, and converting it gives the complete video download address.

As shown in the picture: after clicking "Go!", the video address is parsed and the complete download address is returned.

Further analysis shows that the request behind the "Go!" button is "http://www.flvcd.com/parse.php?format=&kw= + video address". So as long as we have the Meipai video address, we can call flvcd's conversion request to get the complete download address. This solved the problem of not being able to get the complete Meipai video.
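As a rough sketch (assuming the flvcd endpoint still behaves as described above, which may well have changed), the conversion request could be issued like this; parsing the returned HTML for the actual download link is site-specific and omitted:

    from urllib.parse import quote
    import requests

    meipai_url = "https://www.meipai.com/media/XXXXXXXX"   # hypothetical video page URL
    parse_url = "http://www.flvcd.com/parse.php?format=&kw=" + quote(meipai_url, safe="")
    resp = requests.get(parse_url, timeout=10)
    # resp.text is an HTML page containing the complete download address,
    # which would then be extracted with a regex or XPath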

Complex crawler design

The data we crawled above is relatively simple and essentially ready to use. In reality, most of what we crawl is unstructured data (HTML pages and so on) that needs further processing (the data-cleaning stage of a crawler). Moreover, each page we crawl is likely to contain many URLs of further pages to crawl, which means URL queue management is needed. Requests may also require login, and every request may need cookies attached, which brings in cookie management. At that point a framework like Scrapy becomes worth considering. Whether we write it ourselves or use a crawler framework like Scrapy, the design basically cannot do without the following modules:

  • url manager
  • Web page (HTML) downloader, corresponding to urllib2, requests and other libraries in Python
  • (HTML) parser; there are two main parsing approaches:
    • Regular expressions
    • Structured parsing represented by CSS selectors and XPath (i.e. reorganizing the document as a DOM tree and extracting data by locating nodes); html.parser, BeautifulSoup, and lxml in Python all fall into this category

The following figure explains in detail how the various modules work together:

  • First, the scheduler will ask the URL manager if there are any URLs to be crawled.
  • If there is, get the URL and pass it to the downloader for downloading
  • After the downloader has downloaded the content, it will pass it to the parser for further data cleaning. In this step, in addition to extracting valuable data, it will also extract the URL to be crawled for the next crawl.
  • The scheduler puts the URLs to be crawled into the URL manager and stores the valuable data for subsequent applications.
  • The above process will continue to loop until there are no more URLs to be crawled
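To make the module division concrete, here is a minimal single-threaded sketch of the scheduler / URL manager / downloader / parser loop. The libraries (requests, lxml), the seed URL, and the 50-page cap are illustrative choices, not part of the original design:

    from collections import deque

    import requests
    from lxml import etree

    seed = "https://example.com/"        # hypothetical entry page
    to_crawl = deque([seed])             # URL manager: queue of URLs waiting to be crawled
    seen = {seed}                        # URL manager: de-duplication set
    results = []                         # "valuable data" extracted so far

    while to_crawl and len(seen) < 50:   # scheduler: loop until no URLs are left (capped here)
        url = to_crawl.popleft()         # ask the URL manager for the next URL
        html = requests.get(url, timeout=10).text          # downloader
        doc = etree.HTML(html)                             # parser: build a DOM tree
        results.extend(doc.xpath("//title/text()"))        # extract valuable data
        for href in doc.xpath("//a/@href"):                # extract new URLs to crawl
            if href.startswith("http") and href not in seen:
                seen.add(href)
                to_crawl.append(href)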

As you can see, when there are many URLs to crawl, the downloading, parsing, and storing workload is huge (for example, one of our businesses is similar to Dianping and needs to crawl Dianping's data; since it involves millions of merchants, reviews, and so on, the data volume is enormous!), and multi-threaded or distributed crawling becomes necessary. Implementing that in a single-threaded model such as PHP is not appropriate, whereas Python handles these more complex crawler designs comfortably thanks to multi-threading, coroutines, and other features. Python's concise syntax has also attracted a lot of people to write mature, ready-to-use libraries, which is very convenient. The well-known Scrapy framework in particular has won a large following with its rich plug-ins and ease of use. Most of our crawler business is implemented with Scrapy, so let's briefly introduce it and see how a mature crawler framework is designed.

Let's first look at the problems a crawler may run into while fetching data; this makes the need for a framework clear and tells us what to consider if we ever design one ourselves:

  • URL queue management: For example, how to prevent repeated crawling of the same URL (de-duplication). If it is on a single machine, it may be OK, but what if it is a distributed crawling?
  • Cookie management: Some requests require account and password verification. After verification, the obtained cookies are needed to access subsequent page requests on the website. How to cache cookies for further operations?
  • Multi-thread management: As mentioned earlier, when there are many URLs to crawl, the downloading and parsing workload is huge and single-threaded crawling is clearly not enough; but once multi-threading is introduced, managing it becomes a real hassle.
  • User-Agent and dynamic proxy management: The current anti-crawling mechanism is actually quite complete. If we use the same UA and the same IP to make multiple requests to the same website without restraint, it is likely to be blocked immediately. At this time, we need to use random-ua and dynamic proxy to avoid being blocked.
  • Crawling dynamically generated data: Generally, web page data obtained through GET requests contains the data we need, but some data is dynamically generated through Ajax requests. How to crawl it in this case?
  • DEBUG
  • Crawler Management Platform: How to view and manage the status and data of crawlers when there are many crawler tasks

From the above points, we can see that writing a crawler framework from scratch takes a lot of effort. Fortunately, Scrapy solves the problems above almost perfectly, letting us focus on writing the specific parsing and storage logic. Let's see how it implements each of these points.

  • URL queue management: Use the scrapy-redis plug-in to do URL deduplication processing, and use the atomicity of redis to easily handle URL duplication problems
  • Cookie management: As long as you do a login check once, the cookie will be cached and automatically included in subsequent requests, saving us the trouble of managing it ourselves.
  • Multithread management: Just set the concurrency, e.g. CONCURRENT_REQUESTS = 3, in the settings and Scrapy manages concurrent requests for us, with no need to worry about complex logic such as thread creation, destruction, or life cycles.
  • User-Agent and dynamic proxy management: Use the random-useragent plug-in to randomly set a UA for each request, and use proxies such as Ant (mayidaili.com) to add proxy to each request header. In this way, our UA and IP are basically different each time, avoiding the dilemma of being blocked.
  • Crawling dynamic data (generated by ajax, etc.): Use Selenium + PhantomJs to crawl dynamic data
  • DEBUG: Being able to test efficiently whether the crawled data is correct is very important. With an immature framework, every time we want to check whether an XPath or regular expression extracts the right data, the page is likely to be downloaded all over again, which is extremely inefficient. Scrapy shell has a very friendly design: it downloads the page into memory first, and you can then debug XPath expressions in the shell as many times as needed until they work.
  • Use SpiderKeeper+Scrapyd to manage crawlers, GUI operation, simple and easy

As you can see, Scrapy solves the main problems mentioned above. When crawling large amounts of data, it allows us to focus on writing the business logic of the crawler without having to pay attention to details such as cookie management and multi-threading management. This greatly reduces our burden and makes it easy to get twice the result with half the effort!
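As a flavour of how little code this takes, below is a minimal sketch of a Scrapy spider (not our production spider): concurrency is a single setting, extraction uses the same XPath style as earlier, and Scrapy's scheduler de-duplicates followed requests on its own. The start URL and the XPath are illustrative.

    import scrapy

    class MilkPowderSpider(scrapy.Spider):
        name = "milk_powder"
        start_urls = ["https://m.tmall.com/mblist/de_9n40_AVYPod5SU93irPS-Q.html"]
        custom_settings = {
            "CONCURRENT_REQUESTS": 3,   # concurrency is just a setting
        }

        def parse(self, response):
            # extract the data we care about
            yield {
                "url": response.url,
                "title": response.xpath("//title/text()").extract_first(),
            }
            # follow further article links; Scrapy's scheduler de-duplicates them
            for href in response.xpath('//a[contains(@href, "mblist/de_")]/@href').extract():
                yield response.follow(href, callback=self.parse)

Running it with `scrapy runspider spider.py -o items.json` writes the extracted items to a JSON file.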

(Note: although Scrapy can use Selenium + PhantomJS to crawl dynamic data, PhantomJS stopped being maintained after Google released Puppeteer, which is much more powerful. If you need to crawl large amounts of dynamic data and care about performance, the Puppeteer Node library is definitely worth a try; it is an official Google project and highly recommended.)
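For reference, here is a minimal sketch of fetching JavaScript-rendered content with Selenium, using headless Chrome in place of the PhantomJS setup mentioned above (the URL is hypothetical and a matching chromedriver is assumed to be installed):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/ajax-page")  # page whose data is rendered by JavaScript
        html = driver.page_source                    # the DOM after scripts have run
        # ...hand html to the usual XPath/regex parsing step
    finally:
        driver.quit()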

Now that we understand Scrapy's main design ideas and features, let's look at how we used Scrapy to build the crawler project for our audio and video business, and what problems an audio and video crawler runs into.

Audio and video crawler practice

1. Let's briefly introduce the system of our audio and video crawler project from several aspects

1. Four main processes

  • Crawling stage
  • Resource processing (downloading and processing audio, video, and images)
  • Formal storage (writing into the production library)
  • Post-processing stage (e.g. watermark removal)

2. Currently supported functions

  • Crawling various video and audio sites (Himalaya, iQiyi, Youku, Tencent Video, children's songs, etc.)
  • Keeping content from mainstream video and audio sites in sync (Himalaya, Youku)
  • Video watermark (logo) removal
  • Video screenshots (for videos without a cover image)
  • Video transcoding and adaptation (the client does not currently support FLV)

3. System process distribution diagram

2. Let’s talk about the details step by step

1. Technical selection of crawler framework

When it comes to crawlers, everyone naturally thinks of Python, so our technical framework was chosen from the third-party libraries that stand out in Python. Scrapy is an excellent one, and I believe many other crawler developers have tried it as well.

Then let's talk about the advantages of this framework that I feel most deeply after using it for so long:

  • Request scheduling is built on Python's own yield-based coroutines, which saves memory, and the callback style of programming is elegant and comfortable.
  • For efficient filtering and processing of HTML content, the selector's XPath support is genuinely useful.
  • Because the project has been iterated on for a long time, it has a very complete extension API; for example, middlewares can globally hook many event points, and dynamic IP proxying can be implemented by hooking the start of each request.

2. Design of crawler pool db

The crawler pool DB is the key storage node of the whole crawling pipeline, and in our early-childhood-education business it has gone through quite a few field changes.

Initially, our crawler pool table was just a copy of the production table and stored exactly the same content. After a crawl finished, the data was copied into the production table and the association between the two was lost; at that point the crawler pool was purely a scratch table full of useless data.

Later, operations needed to see exactly where each piece of crawled content came from. At that time the crawler pool had no source link, and there was no way to map a production album id back to the crawler pool data. So the crawler pool DB went through its most important change. First, we linked the crawler pool data to the source site by adding the source_link and source_from fields, which hold the original URL of the content and an enum identifying the source. Second, we linked the crawler pool content to the production library content: to avoid touching production data, we added a target_id field pointing to the production content id. With that, we could tell operations exactly where each crawled item came from.

Later still, operations found that picking high-quality content out of a large amount of crawled data needs some reference metrics from the source site, such as the play count there. At this point the crawler pool DB and the production DB formally diverged: the crawler pool is no longer a mere copy of the production library, but holds reference data from the source site plus some basic data from the production library.

The later feature of keeping content in sync with the source site was also easy to build on top of this set of relationships.

Throughout the process, the most important thing was linking up three previously unrelated pieces: the crawled source-site content, the crawler pool content, and the production library content.
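As a sketch of the resulting record shape (only source_link, source_from, and target_id are fields named above; the rest are illustrative placeholders):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class CrawlerPoolItem:
        id: int                        # primary key of the crawler-pool row
        title: str                     # basic data mirrored from the production library
        source_link: str               # original URL of the content on the source site
        source_from: int               # enum identifying which source site it came from
        target_id: Optional[int]       # id of the corresponding production-library record
        source_play_count: int = 0     # reference data from the source site, e.g. play count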

3. Why is there a separate resource-processing task?

Originally, the downloading of resources and some processing should be completed at the same time during the crawling stage, so why is there a separate resource processing process?

First of all, the first version of our early-childhood crawler system did not have this separate step; everything ran serially inside the Scrapy crawl. The shortcomings we found later were:

  • Scrapy's built-in download pipeline is awkward to use and cannot download in parallel, so efficiency is low
  • Audio and video files are large, so all sorts of instability shows up while downloading and merging them, and downloads fail with high probability; when that happened, the crawled metadata was lost with them
  • Serial execution loses a lot of extensibility and makes re-running difficult

To address these issues, we added an intermediate state to the crawler table: the resource download failed, but the crawled metadata is kept. We then added independent resource-processing tasks that use Python multithreading to process resources. Failed items are picked up by a scheduled run of these tasks until they succeed (and if they keep failing, developers troubleshoot based on the logs). A sketch of the idea follows.
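A minimal sketch of such a retry-style resource task, assuming the failure state lives in the crawler table; the status values and the download helper below are hypothetical stand-ins:

    from concurrent.futures import ThreadPoolExecutor

    DOWNLOAD_FAILED, DOWNLOADED = 1, 2          # hypothetical status values in the crawler table

    def download_and_upload(item):
        """Hypothetical stand-in for downloading the media file and pushing it to the CDN."""
        ...

    def process_resource(item):
        try:
            download_and_upload(item)
            item["status"] = DOWNLOADED
        except Exception as exc:
            # keep the crawled metadata; leave the item failed so the next
            # scheduled run (or a developer, after reading the logs) can retry it
            item["status"] = DOWNLOAD_FAILED
            item["last_error"] = str(exc)

    def run_resource_task(failed_items):
        # failed_items: rows whose status is DOWNLOAD_FAILED, fetched by the scheduled job
        with ThreadPoolExecutor(max_workers=4) as pool:
            list(pool.map(process_resource, failed_items))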

4. Why watermark processing happens in the post-processing stage (i.e. after formal storage) rather than in the resource-processing stage

First, understand that watermark removal relies on the delogo function of ffmpeg. Unlike converting the video's container format, which only repackages the streams, delogo must re-encode the entire video, so it takes a very long time and uses a correspondingly large amount of CPU.

Given that, if watermark removal were placed in the resource-processing stage, the rate at which resources move to Upyun would drop sharply. In addition, Youku alone has more than three types of watermark, and working out the rules for each is very time-consuming, which would also slow down crawling. So we first make sure the resource is stored, and only then handle the watermark. On one hand, operations can flexibly control listing and delisting; on the other, developers get enough time to work out the rules. And if watermark removal fails, the source video is still there to fall back on. A sketch of the delogo step follows.
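A minimal sketch of invoking the delogo filter from Python; the watermark rectangle coordinates are illustrative, since each site and watermark type needs its own rule:

    import subprocess

    def remove_watermark(src, dst, x=20, y=20, w=180, h=60):
        # delogo blurs out the rectangle at (x, y) with size w x h; the video is re-encoded
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"delogo=x={x}:y={y}:w={w}:h={h}",
            dst,
        ], check=True)

    # remove_watermark("episode_raw.mp4", "episode_clean.mp4")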

5. How to remove image watermarks

Many images captured by crawlers carry watermarks, and there is currently no perfect way to remove them. The following approaches can be used:

  • Look for the original image. Most websites keep both the original and the watermarked image; if the original link cannot be found, there is nothing to be done.
  • Cropping. Watermarks usually sit in a corner of the image, so if a cropped image is acceptable, the watermarked part can simply be cropped off proportionally.
  • OpenCV inpainting. Call the OpenCV graphics library to repair the watermarked region, similar to retouching in Photoshop; the result is passable, but the repair looks poor on complex graphics. A sketch follows.
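A minimal sketch of the OpenCV approach, assuming the watermark occupies a known rectangle (file names and coordinates are illustrative):

    import cv2
    import numpy as np

    img = cv2.imread("cover_with_watermark.jpg")
    # mask: white where the watermark is, black everywhere else
    mask = np.zeros(img.shape[:2], dtype=np.uint8)
    mask[20:80, 20:200] = 255                      # rectangle covering the watermark
    repaired = cv2.inpaint(img, mask, 3, cv2.INPAINT_TELEA)
    cv2.imwrite("cover_repaired.jpg", repaired)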

3. Problems and Solutions

  • Interruptions or failures often occur during the resource download phase [Solution: split resource downloading and related processing out of the crawling process so tasks can easily be re-run]
  • Even though sources are on different platforms, there are many duplicate resources, especially for video websites [Solution: match resources by title before downloading and skip exact matches, saving redundant download time]
  • IPs may be blocked during large-scale crawling [Solution: dynamic IP proxy]
  • Large video websites frequently change their resource-access rules (encryption, video slicing, anti-hotlinking, etc.), making development and maintenance costly [Solution: the you-get third-party library, which supports a large number of mainstream video sites and greatly reduces development and maintenance cost]

  • App-related interfaces are encrypted [Solution: decompilation]
  • Youku and Tencent videos carry logos [Solution: ffmpeg's delogo function]
  • Crawled content has no connection to any anchor and looks pirated [Solution: attach an in-house anchor ("vest" account) to the content when it is formally stored]
  • The source site keeps updating, but our platform's copy cannot follow [Solution: store the original site link in the DB and update based on the difference]
  • Album-crawling tasks for mainstream video sites such as Youku and iQiyi were driven by text files on the server and had to be triggered manually, which costs manpower [Solution: consolidate the script logic, use the DB as the medium, and trigger via scheduled-task detection]
  • Operations need source-site data such as play counts shown in the admin backend as a basis for review, pinning, and so on [Solution: the crawler table used to lose its association after data was imported into the production table; the association is now kept, and the crawler table has source-site data fields added]
  • Since many of our resources are crawled, resource security and anti-theft matter a lot; how do we keep resources safe once the API returns their addresses? [Solution: Upyun's anti-hotlinking space, where resource addresses are only valid for a limited time]
  • The interface does not return media-file information the platform needs, such as duration [Solution: media-file parsing with ffmpeg]
  • Many downloaded videos cannot be played on the client [Solution: check format and bitrate before uploading to Upyun and transcode if they do not meet the requirements]

4. Finally, let’s make a summary

Our audio and video crawler system may not be directly applicable to every business line, but the way we thought about and solved these problems can serve as a reference and be applied elsewhere, and I believe it offers plenty of inspiration.

Crawler Management Platform

When there are many crawler tasks, managing them over ssh + crontab becomes very troublesome; you need a platform where the running status of crawlers can be viewed and managed at any time.

SpiderKeeper + Scrapyd is a ready-made management solution with a good UI. Its features include:

1. Crawler job management: schedule crawlers to run periodically, and start or stop crawl tasks at any time

2. Crawler logs: logs recorded while crawlers run, useful for investigating crawler problems

3. Crawler status: see which crawlers are running and how long they have been running

Summary

From the above, we can briefly summarize crawler technology selection as follows:

  1. For structured data (JSON, etc.), single-threaded tools such as curl or PHP are enough.
  2. For unstructured data (HTML, etc.), which bash cannot handle well, use regular expressions or XPath to process it, for example in PHP, or with Python libraries such as BeautifulSoup. This applies when there are relatively few URLs to crawl.
  3. When there are many URLs to crawl and a single thread cannot keep up, or when multi-threading, cookie management, dynamic IP proxies, and so on are needed, consider a high-performance crawler framework such as Scrapy.

Choosing the appropriate technology according to the complexity of the business scenario can achieve twice the result with half the effort. We must consider the actual business scenario when selecting technology.

This article is reprinted from the WeChat public account "Ma Hai". To reprint it, please contact the Ma Hai public account.
