50% of the traffic on the Internet is generated by crawlers?

Among the flood of technical jargon, "crawler" is probably the term most familiar to ordinary people. The name itself captures what the technology does: like swarms of insects spread across the Internet, crawling into every corner to collect data. It also captures, to some degree, how people feel about it: insects may be harmless, but they are rarely welcome.

There has been plenty of debate about what crawlers do and whether they help or harm. Because crawlers flood websites with requests from large numbers of IP addresses, consume bandwidth, and threaten user privacy and intellectual property, many Internet companies invest heavily in "anti-crawling".

Compared with crawler technology itself, anti-crawling is actually the more complicated side, and its history is the more interesting one.

How do we fight against crawlers? First courtesy, then force, then court

Anti-crawler technology was born almost simultaneously with crawler technology itself. In the 1990s, when search engines began using crawlers to index websites, search engine practitioners and website owners worked out a "gentleman's agreement" over email: robots.txt. A website could declare which of its content crawlers were allowed to fetch and which was off limits. This protected privacy and sensitive information while still letting the site be indexed by search engines and gain traffic.
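In practice, robots.txt is just a plain-text file at the site root, and a well-behaved crawler checks it before fetching a page. As a minimal sketch (the paths and the "MyBot" user-agent name are illustrative, not from any real site), Python's standard library can parse such a file:

```python
from urllib import robotparser

# A sample robots.txt, as a site might serve it (illustrative content).
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler asks before fetching:
print(rp.can_fetch("MyBot", "https://example.com/private/data"))  # False
print(rp.can_fetch("MyBot", "https://example.com/index.html"))    # True
```

The catch, of course, is exactly what the article describes: nothing in the protocol enforces this check. A crawler that skips it fetches the "disallowed" pages just as easily.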

When crawlers were first invented, the Internet was still in its age of innocence, a paradise of idealists, and most practitioners tacitly honored the agreement; after all, there was little money to be made from information and data back then. But soon the Internet filled up with product listings, air ticket prices, personal information... and under the temptation of profit, some people naturally began to violate the robots agreement.

When the gentleman's agreement failed, websites turned to technical means to keep crawlers out. One approach is spotting crawlers by request frequency: when you browse a site too quickly, it often asks you to enter a CAPTCHA, because that kind of rapid browsing looks a lot like a crawler. Another is periodically changing the site's HTML markup so that crawlers' parsing rules no longer match the page layout.
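The frequency check described above can be sketched as a sliding-window counter. This is a toy illustration rather than any site's real defense; the window size, the threshold, and the `needs_captcha` name are all assumptions:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # illustrative window length
MAX_REQUESTS = 20     # more than this per window looks non-human (assumption)

_recent = defaultdict(deque)  # client IP -> timestamps of recent requests

def needs_captcha(ip, now=None):
    """Record one request from `ip` and report whether to challenge it."""
    now = time.monotonic() if now is None else now
    hits = _recent[ip]
    hits.append(now)
    # Discard requests that have fallen out of the window.
    while hits and now - hits[0] > WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```

A real system would also look at cookies, headers, and interaction events, since a determined crawler can simply rotate IP addresses to stay under any per-IP threshold.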

Even so, there is no way to shut crawlers out of a website entirely; we can only raise the cost of access. If a page can be reached by a human, it can be reached by a crawler. And if a site blocks all crawling outright at the infrastructure level, it will likely vanish from search engine results as well.

So when neither courtesy nor force works against crawlers, the last remaining means of fighting them is to take them to court.

Two lawsuits and 17 years later, the crawlers have not changed, but we have changed

The first crawler lawsuit in history came in 2000, when eBay sued Bidder's Edge (BE), a price comparison site that aggregated listing prices. eBay argued that it had spelled out in its robots agreement which information could not be crawled, and that BE had violated it. BE countered that the content on eBay was contributed collectively by its users, not owned by eBay itself, and that the robots agreement could not serve as a legal reference.

In the end, after repeated debate within the industry and several rounds of argument in court, eBay won, setting a precedent for treating the robots agreement as a primary legal reference.

But the outcome also left many people uneasy. Does it mean that whether a crawler may crawl, how it may crawl, and whose crawler may crawl are all decided by the company being crawled? Once that power is in a company's hands, the profit-seeking selfishness of the business world is quickly exposed.

There is a saying that 50% of Internet traffic is generated by crawlers. Exaggerated or not, it reflects how ubiquitous crawlers are, and they are everywhere because they make money for Internet companies.

Take e-commerce websites. Many are happy to have their listings crawled by price comparison sites or shopping guides, since that drives traffic to their products. But they do not want rival e-commerce sites obtaining their prices and product descriptions, fearing malicious price undercutting or plagiarism. Meanwhile, they crawl their rivals' data themselves, hoping to see everyone else's prices.

This tangled, contradictory feeling resembles the rivalry between top students. A top student will let weaker classmates copy his notes, knowing that however hard they try they will only score 60 or 70 points. But he guards against other top students, because real competition happens only between top students. So "top students among top students" like JD.com and Taobao explicitly prohibit each other from crawling data in their terms of service. Whether either side actually honors that gentleman's agreement is, of course, hard to say.

At the same time, some websites initially allow others to crawl their data, only to sue them later. The classic example is LinkedIn. In 2017, LinkedIn took a data analytics company called HiQ to court after discovering that HiQ was crawling LinkedIn users' employment information and providing it to two other parties that used machine learning to analyze employees' job-hopping tendencies and professional skills.

The result: even though LinkedIn raised the banner of protecting user privacy, it lost, and the federal court ordered it to keep its data interface open. The reasoning was that HiQ had been crawling LinkedIn's data for five years with LinkedIn's full knowledge (LinkedIn had even attended forums and summits organized by HiQ), and only now that LinkedIn was launching a business similar to HiQ's was it trying to cut off HiQ's livelihood.

The reason two lawsuits 17 years apart ended so differently is that our original motives for building crawlers and anti-crawlers have changed: from gathering information and protecting privacy at the start, to capturing commercial gain and countering rivals today.

Crawler makers say: is moral pressure the best anti-crawler method?

We also talked about this topic with two Python programmer friends.

Programmers are individualistic creatures who rarely reach consensus, especially on questions like "What is the best language?" or "Were the early Hammer phones rubbish?" But on anti-crawling, they showed unprecedented agreement.

A programmer at a small OTA (online travel agency) said that when the company was starting out, they were often asked to crawl itineraries from travel websites. They would usually pick sites with stronger traditional-enterprise genes, such as China Youth Travel Service's Atour.com, because their "anti-crawler capability is practically zero."

Another programmer, from a large company, said the dirty work of crawling data is usually outsourced. As for anti-crawling, if the crawling party's technique is good enough not to put excessive bandwidth pressure on the servers, they will even turn a blind eye, as long as their own KPIs are met.

Both also admitted that they sometimes write small crawler scripts out of personal interest, to make certain data easier to get.

On the legality of crawling, they told me the law can do little to stop it, unless it involves wholesale transfer of user-generated content between competing products, like 360 Quick Video's earlier bulk lifting of Bilibili videos, or the recent bulk transfer of Xiaohongshu content by Dianping.com. As for those who crawl others' data for analysis, it is hard to gather evidence and identify the culprit, and the litigation process drags on. Companies struggle to show concretely where they suffered losses, so they usually fall back on the catch-all accusation of "unfair competition."

When we asked whether there were any good technical anti-crawler methods, they said the best anti-crawling is neither technology nor law but public relations: take some screenshots, find a few media outlets, insinuate infringement, database intrusion, and privacy leaks, and you can instantly discredit the other side from the moral high ground, so that no one notices your own anti-crawler technology is not up to scratch. If the other party is a listed company, even better.

When the AI era meets AI crawlers, the war has just begun

"Moral anti-crawling" is only half a joke, but it shows how helpless corporate engineers feel against crawlers. Still, it is foreseeable that as big data and machine learning applications multiply, the era of turning a blind eye and coexisting peacefully with crawlers will soon end.

The core problem is that crawlers greatly increase the difficulty of data analysis.

When data analytics companies use crawlers to gather data, the sheer volume of other crawlers makes that data inaccurate. Inflated article view counts make us misjudge public attention to a news story; the phantom IPs left by crawlers have to be scrubbed out during data cleaning. And the more advanced crawler technology becomes, the closer its behavior gets to a real person's, which makes the analysis harder still. Over time, the algorithms we thought were finding patterns in human behavior may in fact have learned the behavior patterns of robots.
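As a toy illustration of that cleaning step (the thresholds, function names, and heuristic are assumptions, not any real pipeline), one crude approach is to drop clients whose request timing is too fast and too regular to be human:

```python
import statistics

def looks_automated(timestamps, max_mean_gap=0.5, min_jitter=0.05):
    """Flag a client whose inter-request gaps are short and metronomic."""
    if len(timestamps) < 3:
        return False  # too few requests to judge
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # Scripts tend to fire fast (small mean gap) and regularly (small
    # spread); humans are slower and far more erratic.
    return (statistics.mean(gaps) < max_mean_gap
            or statistics.pstdev(gaps) < min_jitter)

def clean_views(view_log):
    """Keep only page views from clients that do not look automated."""
    return {ip: ts for ip, ts in view_log.items()
            if not looks_automated(ts)}
```

This is exactly the arms race the article describes: a crawler that randomizes its delays to mimic human jitter sails straight through a filter like this one.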

At the same time, traffic fluctuations caused by crawlers can also cause misjudgment in machine learning algorithms.

The classic example is dynamic airfare pricing. The site gauges a ticket's popularity from current page views and adjusts the price accordingly. If a swarm of crawlers is browsing the site at that moment, the algorithm will quote a price out of line with real demand, which also hurts consumers' ability to buy at a fair price.
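A toy model makes the distortion concrete. The formula, base price, baseline, and sensitivity below are invented for illustration; no airline prices tickets this simply:

```python
def dynamic_price(base_price, recent_views,
                  baseline_views=1000, sensitivity=0.0001):
    """Raise the price in proportion to page views above a baseline."""
    surge = max(0.0, recent_views - baseline_views) * sensitivity
    return round(base_price * (1 + surge), 2)

# 1,000 genuinely human views: the fare stays at its base.
human_fare = dynamic_price(500.0, 1000)     # 500.0
# 50,000 extra crawler hits masquerade as demand and inflate the fare.
bot_fare = dynamic_price(500.0, 51000)      # 3000.0
```

The human shoppers never changed, but the quoted fare did, which is precisely why view-count-driven pricing breaks down when a large share of "viewers" are scripts.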

Some data analysis companies have even hung out the sign of "AI crawlers": scripts whose behavior mimics ordinary users so closely that the crawled companies struggle to spot them, and which even use image recognition to crack the CAPTCHAs sites deploy to block them.

In this situation, telling humans from robots becomes both harder and more important for websites. Many have begun using machine learning to counter AI crawlers, for example dynamically generated graphical CAPTCHAs to defeat image recognition. Meanwhile, advances in PC and mobile hardware have allowed more complex verification methods, such as biometrics, to join the battle. The two sides now stand on the same footing, fighting technology with technology.

The battle between crawlers and anti-crawlers has lasted more than a decade, but the real "war" has only just begun. Until malicious crawlers are truly defeated, we had better stay skeptical of all the boasting about big data and accurate prediction.
