How to quickly master the HTTP protocol (HD mind map)

The HTTP protocol is extremely complex. It touches browsers, crawlers, proxy servers, firewalls, CDNs, web containers, microservices, and more. Its specifications are not unified, and old and new versions of all kinds of software coexist on the Internet. Without a deep understanding of HTTP, you can easily be stumped by all sorts of network problems.

So, how can we quickly master the HTTP protocol?

In my opinion, we need to start from the following four aspects:

  1. To do good work, you must first sharpen your tools. Start by mastering packet capture and the related tools, so that analyzing any network protocol becomes much easier.
  2. Start from the architecture: figure out what problems HTTP is trying to solve, what non-functional constraints it faces, and how it evolved step by step into what it is today.
  3. Become familiar with the protocol formats: the URI format used with tunnels and forward proxies, the transfer formats of multipart bodies and variable-length (chunked) bodies, and the DNS QUESTION/ANSWER sections.
  4. Understand the application scenarios: how do cross-origin requests conflict with the same-origin policy? How can the shared cache on a proxy server be finely controlled?

First, let me share with you the HTTP learning knowledge graph I compiled. You can save it and refer to it from time to time:

(For high-resolution pictures, please see this article: http://note.youdao.com/noteshare?id=56e65085e89f449feb5804a887dbf058&sub=68510F5CB15E46C6BFA7021506A65330)

Below, we will elaborate on these four aspects one by one.

1. Use the right tools

To learn the HTTP protocol well, you need to use at least the following four tools:

1.1 Chrome Network packet capture panel

This tool has 4 advantages:

Quick analysis of HTTP requests

Convenient viewing of TLS/SSL content in decrypted form

Help with analyzing page loading performance

Convenient analysis of WebSocket content

The tool contains 5 panels and supports complex attribute filtering in the Filter input bar. In the request list, you can see what initiated each request and what it triggered in turn, as well as the timing breakdown of each request.

1.2 telnet

This tool is mainly used to hand-craft raw application-layer messages, which helps us understand the format of HTTP as it is actually transmitted over the network.
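To make the "raw message" idea concrete, here is a minimal Python sketch of the bytes one would type into telnet for a simple request, plus a helper that splits a response's status line. The function names and the `example.com` host are illustrative assumptions, and nothing is sent over the network.

```python
def build_raw_request(host, path="/", method="GET"):
    """Return the bytes of a minimal HTTP/1.1 request, as typed into telnet."""
    lines = [
        f"{method} {path} HTTP/1.1",  # request line: method SP target SP version
        f"Host: {host}",              # Host is the one header HTTP/1.1 requires
        "Connection: close",          # ask the server to close after one response
        "",                           # the empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(raw_response):
    """Split the first response line into (version, status code, reason)."""
    first_line = raw_response.split(b"\r\n", 1)[0].decode("ascii")
    parts = first_line.split(" ", 2)
    reason = parts[2] if len(parts) > 2 else ""
    return parts[0], int(parts[1]), reason


if __name__ == "__main__":
    print(build_raw_request("example.com").decode("ascii"))
```

Every line ends in CRLF, and the blank line after the headers is what tells the server that the request is complete.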

1.3 curl

telnet has 2 problems:

1. Typing a complete request every time is too cumbersome. Often we only want to change the method or a single header.

2. It supports neither HTTPS nor body compression, so requests to some sites cannot be made at all.

curl solves both problems. It is likewise used to construct customized HTTP requests and to analyze HTTP response headers or bodies.

1.4 Wireshark

This is an essential tool for learning the complete Web protocol stack. We can capture packets with tcpdump on the server and then analyze them conveniently in Wireshark's graphical interface.

Wireshark is extremely powerful:

It supports both BPF capture filters and display filters applied during analysis.

By following a stream or using the conversation views, we can easily analyze traffic session by session.

Configurable coloring rules let us spot problematic packets at a glance by their color.

By marking and exporting packets, merging files, and shifting timestamps, we can analyze and compare packets captured on multiple machines together.

The Packet Details pane shows the human-readable values parsed from each layer of a packet, and the Packet Bytes pane shows the raw bytes.

It supports packet statistics, which is very convenient when analyzing a large number of HTTP messages.

2. Understand the architecture

To understand the architecture of HTTP, you need to start from the following four aspects:

2.1 What problem does HTTP protocol try to solve?

The HTTP protocol was originally designed to solve communication between people and machines. The browser, the "B" in the so-called "B/S architecture", is a factor we must take into account.

Therefore, the HTTP protocol needs to transmit hypermedia data (including large-granularity data such as pictures and videos).

Many Internet of Things devices now also use HTTP, so it handles machine-to-machine communication as well.

Web crawlers are another reality that HTTP has to face, which is how conventions such as robots.txt came into being.

2.2 What non-functional constraints does the HTTP protocol face?

It mainly includes the following five aspects:

High scalability, because it must serve a global user base over a lifespan of decades

Low threshold, for both use and development. The decline of Java applets and the rise of JavaScript are excellent examples.

Large-granularity data transmission in a distributed environment

Uncontrollable loads and a wide variety of components on the Internet

Backward compatibility: many HTTP/1.1 features must take into account that proxy servers supporting only HTTP/1.0 are still running on the Internet.

2.3 What architectural design does it follow?

HTTP/1.1 was designed following the REST architectural style, which mainly combines the following four constituent styles:

LCS: Layered Client Server. This is why we have tunnels, proxies, gateways, CDNs, load balancers, and similar products.

CSS: Client Stateless Server. This is why we have the request/response model, and why there are size limits on what each request carries, such as the commonly cited 4 KB limit on a cookie or URL.

COD: Code on Demand, which means moving code from the server to the client and running it there. Today's JavaScript-based front-end ecosystem grew out of this style.

$: Cache. Caches are everywhere among HTTP components, both shared and private. If no cache lifetime is specified, an expiration value has to be estimated.
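As an illustration of how an expiration value can be estimated, the sketch below prefers an explicit `max-age` and otherwise falls back to a heuristic based on `Last-Modified`. The 10% fraction is an assumption here (it is the typical value suggested in the HTTP caching RFC, not something stated in this article), and the function name is hypothetical.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def freshness_lifetime(headers, fraction=0.1):
    """Return how long a response may be treated as fresh.

    Uses Cache-Control: max-age when present; otherwise estimates a lifetime
    as a fraction of the time since the resource was last modified.
    """
    for directive in headers.get("Cache-Control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return timedelta(seconds=int(value))

    date = headers.get("Date")
    last_modified = headers.get("Last-Modified")
    if not date or not last_modified:
        return timedelta(0)  # nothing to estimate from: treat as already stale

    age_when_served = parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)
    return max(age_when_served * fraction, timedelta(0))
```

The intuition behind the heuristic is that a resource which has not changed for a long time is unlikely to change in the next short while.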

2.4 What are the characteristics of HTTP protocol?

First of all, we need to understand which layer of the OSI reference model HTTP belongs to and where it sits in the TCP/IP stack.

Secondly, we can derive its definition from the architecture above: a stateless, application-layer protocol that operates in a request/response manner and uses extensible semantics and self-describing message formats to interact flexibly with network-based hypertext information systems.

3. Be familiar with the protocol format

When learning the HTTP protocol format, you should start from the following three aspects:

3.1 Augmented Backus-Naur Form: the ABNF Metalanguage

A metalanguage is used to describe protocol formats, and ABNF is the one that rigorously defines the format of HTTP.

ABNF is not complicated and takes only about 10 minutes to learn. It consists of two parts, operators and core rules, which are not listed here.
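To get a feel for how ABNF rules read, the sketch below translates three rules from the HTTP/1.1 message syntax specification into Python regular expressions. The rules quoted in the comments come from that specification; the `request-target` part is deliberately simplified to "any run of non-space characters", which is an assumption of this example rather than the real grammar.

```python
import re

# tchar = "!" / "#" / "$" / "%" / "&" / "'" / "*" / "+" / "-" / "." /
#         "^" / "_" / "`" / "|" / "~" / DIGIT / ALPHA
TCHAR = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]"

# token = 1*tchar        ("1*" means "one or more")
TOKEN = rf"{TCHAR}+"

# HTTP-version = HTTP-name "/" DIGIT "." DIGIT
HTTP_VERSION = r"HTTP/[0-9]\.[0-9]"

# request-line = method SP request-target SP HTTP-version CRLF
# (method is a token; request-target is simplified here)
REQUEST_LINE = re.compile(
    rf"(?P<method>{TOKEN}) (?P<target>\S+) (?P<version>{HTTP_VERSION})\r\n"
)


def parse_request_line(line):
    """Return the parts of a request line, or None if it violates the grammar."""
    match = REQUEST_LINE.fullmatch(line)
    return match.groupdict() if match else None
```

Note how the grammar allows exactly one space (SP) between the parts, so a line with two spaces after the method is rejected.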

3.2 HTTP protocol format

To master the HTTP protocol format, you need to organize it into a tree-like knowledge graph. Please refer to the HTTP knowledge graph I compiled, linked near the beginning of this article.

3.3 DNS protocol format

We need to master 3 things:

DNS messages are usually carried over UDP, and their overall layout is fixed. You need to understand the meaning of each field.

In the Question section, focus on how the QNAME domain name is encoded and on the meaning of QTYPE.

The Answer section has more fields; pay particular attention to the offset-based (pointer) representation used in the NAME and RDATA fields.
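The QNAME encoding and the offset representation are easier to grasp in code. The following is a minimal sketch using only the standard library: it builds a query message and decodes a name that uses a compression pointer. The function names and the fixed query ID are assumptions for illustration; limits such as the 63-byte maximum label length are not enforced.

```python
import struct


def encode_qname(domain):
    """Encode "www.example.com" as length-prefixed labels ending in a zero byte."""
    encoded = b""
    for label in domain.rstrip(".").split("."):
        raw = label.encode("ascii")
        encoded += bytes([len(raw)]) + raw  # one length byte, then the label
    return encoded + b"\x00"


def build_query(domain, qtype=1, query_id=0x1234):
    """Build a query: 12-byte header plus one question (QTYPE 1 = A record)."""
    # ID, flags (0x0100 = recursion desired), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_qname(domain) + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN
    return header + question


def decode_name(message, offset, max_jumps=16):
    """Decode a name that may contain compression pointers.

    Returns (name, offset of the first byte after the name where it started).
    """
    labels = []
    resume_at = None
    jumps = 0
    while True:
        length = message[offset]
        if length & 0xC0 == 0xC0:  # top two bits set: a 14-bit offset follows
            if resume_at is None:
                resume_at = offset + 2
            jumps += 1
            if jumps > max_jumps:  # guard against pointer loops
                raise ValueError("too many compression pointers")
            offset = ((length & 0x3F) << 8) | message[offset + 1]
        elif length == 0:  # zero-length root label ends the name
            if resume_at is None:
                resume_at = offset + 1
            return ".".join(labels), resume_at
        else:
            labels.append(message[offset + 1 : offset + 1 + length].decode("ascii"))
            offset += 1 + length
```

A pointer such as `0xC0 0x0C` means "continue reading the name at byte 12 of the message", which is where the question's QNAME begins right after the fixed header.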

4. Understand the application scenarios

HTTP has a wide range of application scenarios. Below I list 9 common ones. The methods, response codes, headers, and body encodings covered under the protocol format all tie back to specific scenarios.

4.1 How to negotiate content

Reactive negotiation is rarely used because the RFC does not specify it clearly, while proactive negotiation over language, encoding, media type, and so on is what we deal with every day.
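As a rough sketch of proactive negotiation, the code below parses the q-values of an `Accept-Language` header and picks the best language a server has available. The helper names are hypothetical and the matching is a simplification (exact tag or primary-subtag prefix only), not the full matching algorithm.

```python
def parse_accept(header):
    """Parse an Accept-* header into (value, q) pairs, most preferred first."""
    items = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0  # a missing q parameter means full preference
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        items.append((fields[0].lower(), q))
    # sorted() is stable, so equal q-values keep their order in the header
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_language(accept_language, available, default):
    """Pick the first available language the client accepts, else the default."""
    for tag, q in parse_accept(accept_language):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        if tag == "*" and available:
            return available[0]
        for candidate in available:
            lowered = candidate.lower()
            if lowered == tag or lowered.startswith(tag + "-"):
                return candidate
    return default
```

For example, with `Accept-Language: fr;q=0.5, zh-CN, en;q=0.8` and a server that offers `en` and `zh-CN`, this sketch selects `zh-CN`.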

4.2 How to submit a form

Although there are three encodings for form submission, the most commonly used one places multiple representations, separated by a boundary, in a single body (multipart/form-data). A WAF must consider how to detect SQL injection attacks hidden inside such a body.
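To show what "multiple representations separated by a boundary" looks like on the wire, here is a minimal sketch that assembles a multipart/form-data body by hand. The function name and field names are assumptions for illustration, and the sketch does not escape quotes or line breaks in names, which real code would have to handle.

```python
import uuid


def encode_multipart(fields, files, boundary=None):
    """Return (Content-Type header value, body bytes) for a form submission.

    fields: {name: text value}
    files:  {name: (filename, content bytes, content type)}
    """
    boundary = boundary or uuid.uuid4().hex
    delimiter = b"--" + boundary.encode("ascii")
    crlf = b"\r\n"
    body = b""

    for name, value in fields.items():
        body += delimiter + crlf
        body += f'Content-Disposition: form-data; name="{name}"'.encode("utf-8")
        body += crlf + crlf  # blank line separates part headers from content
        body += value.encode("utf-8") + crlf

    for name, (filename, content, content_type) in files.items():
        body += delimiter + crlf
        body += (
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'
        ).encode("utf-8") + crlf
        body += f"Content-Type: {content_type}".encode("ascii") + crlf + crlf
        body += content + crlf

    body += delimiter + b"--" + crlf  # the closing delimiter has a trailing "--"
    return f"multipart/form-data; boundary={boundary}", body
```

Each part carries its own small header section, which is why a firewall inspecting the body has to parse every part separately instead of scanning one flat string.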

4.3 Use of Range Request

Resumable downloads and multi-threaded downloads of large files both rely on the Range specification. To guard against the resource being updated on the server partway through a multi-request download, the conditional request header If-Range was also introduced.
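The sketch below shows how a download client might compute the header values involved: one `Range` per worker for a multi-threaded download, an open-ended range for resuming, and an `If-Range` validator taken from the first response's ETag. The function names are hypothetical and no requests are actually sent.

```python
def split_ranges(total_size, parts):
    """Return Range header values that cover bytes 0..total_size-1 in pieces."""
    if total_size <= 0 or parts <= 0:
        raise ValueError("total_size and parts must be positive")
    chunk = -(-total_size // parts)  # ceiling division
    ranges = []
    for start in range(0, total_size, chunk):
        end = min(start + chunk, total_size) - 1  # byte ranges are inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


def resume_range(bytes_already_downloaded):
    """Return the Range value for continuing an interrupted download."""
    return f"bytes={bytes_already_downloaded}-"  # open-ended: "to the end"


def range_request_headers(byte_range, etag):
    """Combine Range with If-Range so a changed resource is sent in full."""
    return {"Range": byte_range, "If-Range": etag}
```

With `If-Range`, a server whose copy still matches the validator answers 206 with just the requested bytes, and otherwise answers 200 with the whole new representation, so the client never stitches together pieces of two different versions.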

4.4 Cookie and Session Design

Set-Cookie has many attributes: expires-av and max-age-av limit the validity period, domain-av and path-av limit the scope of use, secure-av restricts the protocol, and httponly-av restricts who may read the cookie.

All these restrictions are about whether it is safe for the browser to use a cookie. At the same time, for convenience, browsers also support third-party cookies, which makes it easier for companies to collect user information.
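The attributes above can be tried out with Python's standard `http.cookies` module. This is a small sketch in which the cookie name, value, and domain are made up for illustration.

```python
from http.cookies import SimpleCookie


def build_set_cookie(name, value, *, domain, path="/", max_age=3600):
    """Return the value of a Set-Cookie header with common restrictions."""
    cookie = SimpleCookie()
    cookie[name] = value
    morsel = cookie[name]
    morsel["max-age"] = max_age  # validity period, in seconds
    morsel["domain"] = domain    # which hosts receive the cookie
    morsel["path"] = path        # which URL paths receive the cookie
    morsel["secure"] = True      # only send over HTTPS
    morsel["httponly"] = True    # hide the cookie from page scripts
    return morsel.OutputString()


if __name__ == "__main__":
    print("Set-Cookie:", build_set_cookie("sid", "abc123", domain="example.com"))
```

Flag attributes such as Secure and HttpOnly appear without a value in the header, while the others appear as name=value pairs.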

4.5 Browser Same-Origin Policy and Cross-Origin Requests

The same-origin policy is a restriction imposed by the browser. If we process responses directly with a network library, we are not subject to it, so its effectiveness depends heavily on the browser's implementation. The same-origin policy also does not protect against CSRF attacks; servers usually defend against CSRF with a token strategy.

Security and convenience must be weighed against each other. For the sake of convenience, cross-origin AJAX requests have to be allowed, and so CORS was born.
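On the server side, CORS comes down to deciding which response headers to send for a given `Origin`. The following is a minimal sketch under stated assumptions: the allow-list, the allowed methods, and the 600-second preflight cache time are all hypothetical values, and the function names are not from any framework.

```python
ALLOWED_ORIGINS = {"https://app.example.com"}  # hypothetical allow-list
ALLOWED_METHODS = ("GET", "POST", "PUT")        # hypothetical policy


def cors_headers(origin, allow_credentials=False):
    """Return CORS response headers for a simple cross-origin request."""
    if origin is None or origin not in ALLOWED_ORIGINS:
        return {}  # no CORS headers: the browser will block script access
    headers = {
        "Access-Control-Allow-Origin": origin,
        "Vary": "Origin",  # the response differs per Origin, so caches must know
    }
    if allow_credentials:
        headers["Access-Control-Allow-Credentials"] = "true"
    return headers


def preflight_headers(origin, requested_method):
    """Return headers answering an OPTIONS preflight, or {} to refuse it."""
    headers = cors_headers(origin)
    if not headers or requested_method not in ALLOWED_METHODS:
        return {}
    headers["Access-Control-Allow-Methods"] = ", ".join(ALLOWED_METHODS)
    headers["Access-Control-Max-Age"] = "600"  # let the browser cache the answer
    return headers
```

Notice that the server only emits headers; the enforcement still happens in the browser, which is exactly why a plain network library is not bound by any of this.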

4.6 Conditional Requests

Conditional requests can not only handle a resource changing in the middle of a multi-threaded download, but also apply to collaborative wiki systems edited by several people, and can be used for cache updates as well. In fact, they have a lot of potential in RESTful API design.
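Here is a simplified sketch of how a server might evaluate the two ETag-based preconditions. It assumes ETags are compared as exact strings (ignoring the strong/weak distinction) and skips the date-based headers, so treat it as an illustration of the idea rather than a complete implementation; the function name is hypothetical.

```python
def evaluate_preconditions(method, current_etag, headers):
    """Return the status code suggested by If-Match / If-None-Match.

    200 here simply means "go ahead and process the request normally".
    """
    def tags(value):
        return [tag.strip() for tag in value.split(",")]

    if_match = headers.get("If-Match")
    if if_match is not None:
        # "Only proceed if the resource is still the version I saw."
        if if_match != "*" and current_etag not in tags(if_match):
            return 412  # Precondition Failed: someone else changed it first

    if_none_match = headers.get("If-None-Match")
    if if_none_match is not None:
        matched = if_none_match == "*" or current_etag in tags(if_none_match)
        if matched:
            # For reads the cached copy is still good; for writes it is a conflict.
            return 304 if method in ("GET", "HEAD") else 412

    return 200
```

In the wiki scenario, an editor sends `If-Match` with the ETag of the version they started from, and a 412 tells them that another person saved a change in the meantime. In the cache scenario, `If-None-Match` lets the server answer 304 without resending the body.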

4.7 Shared Cache and Private Cache

Caches are everywhere on the Internet today. Even if the server does not mark a resource as cacheable, browsers still do their best to estimate a period for which to cache it, because caching can greatly improve the user experience and reduce network load. Many HTTP headers control caching: they govern not only how long a cached entry stays valid, but also the key under which it is stored.
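The "key under which it is stored" is where the `Vary` response header comes in. The sketch below builds a cache key from the method, the URL, and the request headers that `Vary` names; the function name is hypothetical, and the special case `Vary: *` is not handled.

```python
def cache_key(method, url, request_headers, vary):
    """Build a cache key from the URL plus the request headers listed in Vary."""
    normalized = {name.lower(): value.strip() for name, value in request_headers.items()}
    vary_names = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    varying = tuple((name, normalized.get(name, "")) for name in vary_names)
    return (method.upper(), url, varying)


if __name__ == "__main__":
    gzip_key = cache_key("GET", "/app.js", {"Accept-Encoding": "gzip"}, "Accept-Encoding")
    br_key = cache_key("GET", "/app.js", {"Accept-Encoding": "br"}, "Accept-Encoding")
    print(gzip_key != br_key)  # the two encodings are stored as separate entries
```

Without `Vary: Accept-Encoding`, a shared cache could hand a gzip-compressed body to a client that never said it could decode gzip.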

4.8 Application of Redirection

Redirection should be understood along two dimensions, which form four quadrants: method may change | method must not change, and cacheable | non-cacheable.

This leads to five different response status codes: 301, 302, 303, 307, and 308.
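The quadrants can be written down as a small table. In the sketch below, the POST-to-GET rewrite for 301 and 302 models what user agents are permitted to do and commonly do in practice, which is an assumption of this example rather than a hard requirement; 303 is the fifth code, which explicitly asks for the follow-up request to be a GET.

```python
# status code: (permanent, method preserved)
REDIRECTS = {
    301: (True, False),   # permanent, method may change
    302: (False, False),  # temporary, method may change
    303: (False, False),  # "see other": fetch the target with GET
    307: (False, True),   # temporary, method must not change
    308: (True, True),    # permanent, method must not change
}


def is_permanent(status):
    """Permanent redirects are the ones clients and caches may remember."""
    return REDIRECTS[status][0]


def next_method(status, method):
    """Return the method a typical client uses when following the redirect."""
    if status not in REDIRECTS:
        raise ValueError(f"{status} is not a redirect handled here")
    _, preserved = REDIRECTS[status]
    if preserved:
        return method
    if status == 303:
        return method if method in ("GET", "HEAD") else "GET"
    return "GET" if method == "POST" else method  # common 301/302 behavior
```

So a form POST answered with 307 is re-sent as a POST to the new location, while the same POST answered with 303 turns into a GET, which is the usual pattern for showing a result page after a submission.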

4.9 Web Crawler

Crawlers are everywhere, and not just the long-established search engine crawlers. Today, travel sites (such as 12306 train tickets or AirAsia), e-commerce, and social networks (Sina Weibo) are all heavily harassed by crawlers. Crawlers do not just scrape information; they also simulate human behavior, as with the many ticket-grabbing bots and fake followers. On the other hand, to attract Google's and Baidu's crawlers, all kinds of SEO strategies and tutorials have emerged, and many businesses have exploited PageRank loopholes to boost keyword rankings for profit. So understanding how crawlers work is also very important.
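A well-behaved crawler starts by checking robots.txt, and Python's standard library can do this. The sketch below parses an in-memory robots.txt instead of fetching one; the file contents, the `demo-bot` user agent, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_text):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    for url in ("https://example.com/index.html", "https://example.com/private/data"):
        print(url, parser.can_fetch("demo-bot", url))
```

robots.txt is only a convention: it tells cooperative crawlers what to avoid, and it does nothing to stop the ticket-grabbing bots mentioned above.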

Of course, HTTP has far more application scenarios than these, but a thorough grasp of them will give us a solid understanding of the common methods, headers, response codes, and so on in the HTTP protocol.

HTTP is a very important part of the Web protocols. As a programmer, whether you are a front-end or back-end engineer, or work in operations or testing, if you want to interview for a more senior position, understand technical and business architecture from a higher vantage point, and solve problems quickly and efficiently when they arise, Web protocols are a hurdle you cannot avoid. Proficiency in the commonly used Web protocols will help you handle all kinds of network problems at work.
