A complete history of web crawlers

The well-known research organization Aberdeen Group once conducted a survey, and the results were shocking.

Across the entire Internet, web crawlers generate 37.2% of traffic!

In other words, of every 100 visits a website receives, only about 63 come from real humans; the rest of the traffic is generated by bots.

An even more alarming prediction holds that, in the future, more than 50% of Internet traffic will be generated by bots.

In the physical world, humans still worry about the future threat of artificial intelligence; in the virtual world, the traffic generated by robots is already on par with that of humans, and may even exceed it.

At every moment, crawlers imitate human online behavior: strolling around websites, clicking buttons, checking data, and copying down the information they see. They never get tired, repeating the cycle over and over again.

You have surely seen a CAPTCHA. It comes in many forms, but no matter what it looks like, a CAPTCHA has only one purpose: to tell real human users apart from bots.

Open Baidu, search for some information, solve a problem. Without realizing it, you have become one of the many users of crawler technology, because every search engine is built on crawlers.

Crawlers have spread to every corner of the Internet, affecting everyone.

But do you know the past and present of crawlers?

The good side

In 1994, Michael Mauldin ("Xiao Ma"), who was working on a digital library project at Carnegie Mellon University, developed a search engine called Lycos in just three pages of code in order to solve some of the project's difficulties.

Lycos is short for Lycosidae, a family of wolf spiders skilled at catching prey.

This simple search engine revealed the huge business opportunity behind it, and soon afterwards Lycos was officially founded as a company.

In just two years, Lycos went public, in what was then the fastest IPO in history. According to Nielsen/NetRatings, Lycos drew 37 million visitors in October 2002, making it the fifth most-visited website in the world.

However, a pie as big as search could not escape the fate of being carved up by a pack of wolves.

In 1995, one year after the birth of Lycos, two computer science graduate students at Stanford University, Larry Page and Sergey Brin, began work on a program called BackRub.

This program was a search engine that used backlink analysis, counting and weighing the pages that link to a page, to track and record data across the Internet.
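Backlink analysis later matured into PageRank. As an illustrative sketch only (toy data and a hypothetical three-page web, not Google's actual algorithm or code), a few lines of power iteration capture the core idea: a page is important if important pages link to it.

```python
# Toy PageRank-style scoring from backlinks. Illustrative sketch only.
links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
damping = 0.85
rank = {p: 1 / len(pages) for p in pages}  # start with uniform scores

for _ in range(50):  # power iteration until approximate convergence
    new = {p: (1 - damping) / len(pages) for p in pages}
    for src, outs in links.items():
        share = damping * rank[src] / len(outs)  # split score among outlinks
        for dst in outs:
            new[dst] += share
    rank = new

best = max(rank, key=rank.get)
print(best)  # C is linked to by both A and B, so it ranks highest
```

Real web graphs have billions of nodes, so production systems compute this iteration in a distributed fashion, but the fixed point being approximated is the same.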

They were determined to develop a powerful search engine that could be used by people around the world to obtain information from the Internet more conveniently.

In 1998, Page and Brin used all their assets, plus a little financial support from people around them, to establish a company called Google.

Because they lacked funds, they had to buy second-hand computer parts and work out of a garage.

The harsh startup environment made Page and Brin consider selling Google at one point. They approached Yahoo, Excite, and several other Silicon Valley companies, but these companies were only willing to offer $1 million, far below what the founders expected, so the sale fell through.

At almost the same time, on the other side of the world, a young man named Ma Huateng (also nicknamed "Xiao Ma") had developed a chat program called QQ and also wanted to sell it, without success.

History has a surprising way of repeating itself.

No one expected that these two little-known companies would become Internet giants.

Meanwhile, Robin Li ("Xiao Li"), who had spent eight years in the United States, saw that China's Internet market had matured, returned home to start a business, and founded a company called Baidu.

From this point on, a pattern gradually took shape in which Google, Yahoo, and Baidu split the search market three ways.

In those early days, the Internet was still a pure land where gentlemen gathered. Out of respect for the rights of websites, the major search engines negotiated by email and established a gentleman's agreement: robots.txt, the robots exclusion protocol.

A site owner simply places a robots.txt file in the website's root directory stating which content must not be crawled, and well-behaved crawlers honor the agreement and skip that content.
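To make this concrete, here is what a small robots.txt might look like and how a well-behaved crawler checks it before fetching a page. The file contents and `example.com` URLs are hypothetical; the checking is done with Python's standard `urllib.robotparser` module.

```python
from urllib import robotparser

# A hypothetical robots.txt, as a site might serve it at
# https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cart/

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler calls can_fetch() before every request.
print(rp.can_fetch("*", "https://example.com/index.html"))       # True
print(rp.can_fetch("*", "https://example.com/private/a.html"))   # False
print(rp.can_fetch("BadBot", "https://example.com/index.html"))  # False
```

Note that nothing technically enforces these rules: robots.txt is purely an honor-system convention, which is exactly why it is called a gentleman's agreement.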

The evil side

With the development of the Internet, the amount of information has grown rapidly. The entire online world is filled with a lot of valuable information, including product information, flight information, and personal privacy data.

Some bad actors saw huge profits in it.

Tempted by profit, these people began to ignore the robots protocol, writing crawler programs that maliciously scraped content from target websites.

The first crawler lawsuit in history came in 2000, when eBay sued Bidder's Edge, a website that aggregated auction price information.

eBay argued that its robots protocol had made clear which information could and could not be crawled, yet the defendant violated the agreement and scraped data such as product prices anyway.

The defendant countered that the user data and product listings on eBay were uploaded by users and belonged to those users, not to eBay, so the robots agreement was invalid.

Ultimately, the court ruled in favor of eBay.

This case set a precedent for treating the robots protocol as key evidence in crawler disputes.

Today, crawler technology has developed rapidly and is commonly classified into general, focused, incremental, and deep-web crawlers. There are likewise many strategies for selecting crawl targets: by target page features, by target data patterns, by domain concepts, and so on.
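The difference between a general and a focused crawler comes down to which links get followed. The sketch below illustrates this over a tiny in-memory link graph (real crawlers fetch pages over HTTP and parse HTML; all page names and texts here are made up): a focused crawler only expands pages that match its relevance test, so irrelevant branches are never explored.

```python
from collections import deque

# Toy site: page -> (text, outgoing links). In a real crawler these
# would come from HTTP fetches and HTML parsing.
PAGES = {
    "/":           ("home",          ["/news", "/shop"]),
    "/news":       ("flight prices", ["/news/today"]),
    "/news/today": ("flight deals",  []),
    "/shop":       ("shoes",         []),
}

def focused_crawl(start, is_relevant):
    """Breadth-first crawl that only follows links out of relevant pages."""
    seen, queue, hits = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        if is_relevant(text):
            hits.append(url)
            for link in links:          # expand only relevant pages
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return hits

# With an always-true test this degenerates into a general crawler
# and visits every reachable page.
print(focused_crawl("/", lambda t: True))
# A focused crawler hunting flight data skips the /shop branch.
print(focused_crawl("/", lambda t: "flight" in t or t == "home"))
```

Incremental and deep-web crawlers layer further logic on the same skeleton: the former re-visits pages and keeps only what changed, while the latter reaches content hidden behind search forms.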

Crawler technology, whether benign or malicious, will remain woven into the Internet, shaping it every minute of every day.
