A complete history of web crawlers

The well-known research organization Aberdeen Group once conducted a survey, and the results were shocking.

Across the entire Internet, web crawlers generate 37.2% of traffic!

In other words, of every 100 visits a website receives, only about 63 come from real humans; the rest of the traffic is generated by bots.

An even more alarming prediction holds that, in the future, more than 50% of Internet traffic will be generated by bots.

In the physical world, humans still worry about the future threat of artificial intelligence; in the virtual world, the traffic generated by robots is already on par with that of humans, and may even exceed it.

At every moment, crawlers imitate human online behavior: strolling around websites, clicking buttons, checking data, and copying down the information they see. They never get tired, repeating the cycle over and over again.

You have surely seen a CAPTCHA. It comes in many forms, but no matter what it looks like, a CAPTCHA has only one purpose: to tell real human users apart from bots.

Open Baidu, search for some information, solve a problem. Without realizing it, you have become one of the many users of crawler technology, because every search engine is built on crawlers.

Crawlers have spread to every corner of the Internet, affecting everyone.

But do you know the past and present of crawlers?

The good side

In 1994, Michael Mauldin ("Xiao Ma"), who was working on a digital library project at Carnegie Mellon University, developed a search engine called Lycos in just three pages of code in order to solve some of the project's difficulties.

Lycos is short for Lycosidae, a family of wolf spiders skilled at catching prey.

This simple search engine revealed the huge business opportunity behind it, and soon afterwards Lycos was officially founded as a company.

In just two years, Lycos went public, in what was then the fastest IPO in history. According to Nielsen/NetRatings, Lycos drew 37 million visitors in October 2002, making it the fifth most-visited website in the world.

However, a pie as big as search could not escape the fate of being carved up by a pack of wolves.

In 1995, one year after the birth of Lycos, two computer science graduate students at Stanford University, Larry Page and Sergey Brin, began work on a program called BackRub.

This program was a search engine that used backlink analysis, counting and weighing the pages that link to a page, to track and record data across the Internet.
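Backlink analysis later matured into PageRank. As an illustrative sketch only (toy data and a hypothetical three-page web, not Google's actual algorithm or code), a few lines of power iteration capture the core idea: a page is important if important pages link to it.

```python
# Toy PageRank-style scoring from backlinks. Illustrative sketch only.
links = {  # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}
pages = list(links)
damping = 0.85
rank = {p: 1 / len(pages) for p in pages}  # start with uniform scores

for _ in range(50):  # power iteration until approximate convergence
    new = {p: (1 - damping) / len(pages) for p in pages}
    for src, outs in links.items():
        share = damping * rank[src] / len(outs)  # split score among outlinks
        for dst in outs:
            new[dst] += share
    rank = new

best = max(rank, key=rank.get)
print(best)  # C is linked to by both A and B, so it ranks highest
```

Real web graphs have billions of nodes, so production systems compute this iteration in a distributed fashion, but the fixed point being approximated is the same.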

They were determined to develop a powerful search engine that could be used by people around the world to obtain information from the Internet more conveniently.

In 1998, Page and Brin used all their assets, plus a little financial support from people around them, to establish a company called Google.

Because they lacked funds, they had to buy second-hand computer parts and work out of a garage.

The harsh startup environment made Page and Brin consider selling Google at one point. They approached Yahoo, Excite, and several other Silicon Valley companies, but these companies were only willing to offer $1 million, far below what the founders expected, so the sale fell through.

At almost the same time, on the other side of the world, a young man named Ma Huateng (also nicknamed "Xiao Ma") had developed a chat program called QQ and also wanted to sell it, without success.

History has a surprising way of repeating itself.

No one expected that these two little-known companies would become Internet giants.

Meanwhile, Robin Li ("Xiao Li"), who had spent eight years in the United States, saw that China's Internet market had matured, returned home to start a business, and founded a company called Baidu.

From this point on, a pattern gradually took shape in which Google, Yahoo, and Baidu split the search market three ways.

In those early days, the Internet was still a pure land where gentlemen gathered. Out of respect for the rights of websites, the major search engines negotiated by email and established a gentleman's agreement: robots.txt, the robots exclusion protocol.

A site owner simply places a robots.txt file in the website's root directory stating which content must not be crawled, and well-behaved crawlers honor the agreement and skip that content.
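To make this concrete, here is what a small robots.txt might look like and how a well-behaved crawler checks it before fetching a page. The file contents and `example.com` URLs are hypothetical; the checking is done with Python's standard `urllib.robotparser` module.

```python
from urllib import robotparser

# A hypothetical robots.txt, as a site might serve it at
# https://example.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cart/

User-agent: BadBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# A polite crawler calls can_fetch() before every request.
print(rp.can_fetch("*", "https://example.com/index.html"))       # True
print(rp.can_fetch("*", "https://example.com/private/a.html"))   # False
print(rp.can_fetch("BadBot", "https://example.com/index.html"))  # False
```

Note that nothing technically enforces these rules: robots.txt is purely an honor-system convention, which is exactly why it is called a gentleman's agreement.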

The evil side

With the development of the Internet, the amount of information has grown rapidly. The entire online world is filled with a lot of valuable information, including product information, flight information, and personal privacy data.

Some bad actors saw huge profits in it.

Tempted by profit, these people began to ignore the robots protocol, writing crawler programs that maliciously scraped content from target websites.

The first crawler lawsuit in history came in 2000, when eBay sued Bidder's Edge, a website that aggregated auction price information.

eBay argued that its robots protocol had made clear which information could and could not be crawled, yet the defendant violated the agreement and scraped data such as product prices anyway.

The defendant countered that the user data and product listings on eBay were uploaded by users and belonged to those users, not to eBay, so the robots agreement was invalid.

Ultimately, the court ruled in favor of eBay.

This case set a precedent for treating the robots protocol as key evidence in crawler disputes.

Today, crawler technology has developed rapidly and is commonly classified into general, focused, incremental, and deep-web crawlers. There are likewise many strategies for selecting crawl targets: by target page features, by target data patterns, by domain concepts, and so on.
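The difference between a general and a focused crawler comes down to which links get followed. The sketch below illustrates this over a tiny in-memory link graph (real crawlers fetch pages over HTTP and parse HTML; all page names and texts here are made up): a focused crawler only expands pages that match its relevance test, so irrelevant branches are never explored.

```python
from collections import deque

# Toy site: page -> (text, outgoing links). In a real crawler these
# would come from HTTP fetches and HTML parsing.
PAGES = {
    "/":           ("home",          ["/news", "/shop"]),
    "/news":       ("flight prices", ["/news/today"]),
    "/news/today": ("flight deals",  []),
    "/shop":       ("shoes",         []),
}

def focused_crawl(start, is_relevant):
    """Breadth-first crawl that only follows links out of relevant pages."""
    seen, queue, hits = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        if is_relevant(text):
            hits.append(url)
            for link in links:          # expand only relevant pages
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return hits

# With an always-true test this degenerates into a general crawler
# and visits every reachable page.
print(focused_crawl("/", lambda t: True))
# A focused crawler hunting flight data skips the /shop branch.
print(focused_crawl("/", lambda t: "flight" in t or t == "home"))
```

Incremental and deep-web crawlers layer further logic on the same skeleton: the former re-visits pages and keeps only what changed, while the latter reaches content hidden behind search forms.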

Crawler technology, whether benign or malicious, will remain woven into the Internet, shaping it every minute of every day.
