The HTTP protocol is extremely complex. It touches many components: browsers, crawlers, proxy servers, firewalls, CDNs, web containers, microservices, and more. Its specifications are not unified, and old and new versions of all kinds of software coexist on the Internet. Under these circumstances, without a deep understanding of HTTP you will easily be stumped by all sorts of network problems. So how can we master the HTTP protocol quickly? In my opinion, we need to approach it from the following four aspects.
First, let me share the HTTP learning knowledge graph I compiled. Save it and refer back to it from time to time. (For a high-resolution picture, see: http://note.youdao.com/noteshare?id=56e65085e89f449feb5804a887dbf058&sub=68510F5CB15E46C6BFA7021506A65330) Below, we elaborate on the four aspects one by one.

1. Use the right tools

To learn the HTTP protocol well, you need at least the following four tools.

1.1 The Chrome Network panel

This tool has four advantages:
- it lets you analyze HTTP requests quickly;
- it shows HTTPS content in the clear, since the browser itself terminates TLS/SSL;
- it helps you analyze page-loading performance;
- it makes inspecting WebSocket traffic easy.

The Network panel is made up of five panes, and its Filter bar supports sophisticated property-based filtering. In the request list you can see each request's upstream and downstream (what initiated it and what it triggered), as well as a timing breakdown for every request.

1.2 telnet

telnet is mainly used to hand-craft raw application-layer messages, which helps us understand the exact format in which HTTP actually travels across the network (a sketch follows at the end of this section).

1.3 curl

telnet has two problems:
1. Typing a complete request every time is tedious, when often we only want to change the method or a single header.
2. It supports neither HTTPS nor compressed response bodies, so some sites cannot be reached with it at all.

curl solves these problems nicely. Like telnet, it is used to construct customized HTTP requests and to analyze HTTP response headers and bodies.

1.4 Wireshark

This is the essential tool for studying the complete web protocol stack. A common workflow is to capture packets with tcpdump on the server and then analyze them visually in Wireshark. Wireshark is extremely powerful:
- It supports BPF capture filters as well as display filters during analysis.
- Stream following and conversation statistics make it easy to analyze traffic session by session.
- Configurable coloring rules help you spot problematic packets by color.
- By marking and exporting packets, merging capture files, and shifting timestamps, you can line up and compare captures taken on several machines.
- The Packet Details pane shows the decoded value of every field at every protocol layer, while the Packet Bytes pane shows the raw byte stream.
- Built-in statistics make analyzing large volumes of HTTP messages very convenient.
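To make the point of 1.2 and 1.3 concrete, here is a minimal Python sketch that does exactly what a telnet session would: write a raw HTTP/1.1 request onto a TCP socket and print the response head, which comes back as plain text. The host is a placeholder; it assumes a site that still answers plain HTTP on port 80.

```python
# A minimal sketch of what telnet does for us: open a TCP connection
# and type out a raw HTTP/1.1 request by hand. The host is a
# placeholder; any site that answers plain HTTP on port 80 works.
import socket

HOST = "example.com"  # placeholder target

request = (
    "GET / HTTP/1.1\r\n"
    f"Host: {HOST}\r\n"
    "User-Agent: raw-socket-demo\r\n"
    "Accept: */*\r\n"
    "Connection: close\r\n"
    "\r\n"  # the blank line that terminates the header section
)

with socket.create_connection((HOST, 80), timeout=5) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):  # read until the server closes
        response += chunk

# The status line and headers arrive exactly as typed: readable text.
head, _, body = response.partition(b"\r\n\r\n")
print(head.decode("iso-8859-1"))
```

Sending Connection: close keeps the experiment simple: the server closing the socket marks the end of the response, so we do not have to parse Content-Length or chunked encoding by hand.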
2. Understand the architecture

To understand the architecture of HTTP, start from the following four questions.

2.1 What problem does the HTTP protocol try to solve?

HTTP was originally designed to solve communication between humans and machines: the browser in the so-called "B/S architecture" is a factor we must always take into account, so the protocol has to carry hypermedia, including large-granularity data such as images and video. Today many Internet-of-Things devices speak HTTP as well, so it also solves machine-to-machine communication. And web crawlers are a reality the protocol has to face, which is how specifications such as robots.txt came into being.

2.2 What non-functional constraints does HTTP face?

There are mainly five:
- High extensibility, because it must serve a global user base over a lifespan measured in decades.
- A low barrier to entry, for both users and developers; the decline of Java applets and the rise of JavaScript are an excellent example.
- Large-granularity data transfer in a distributed environment.
- Uncontrollable load and a huge variety of components on the Internet.
- Compatibility with deployed software: many HTTP/1.1 features must account for proxies that only speak HTTP/1.0 and are still running on the Internet.

2.3 What architectural design does it follow?

HTTP/1.1 fully follows the REST architectural style, which is derived from four sub-architectures:
- LCS (layered client-server): spatial layering is why we have tunnels, proxies, gateways, CDNs, load balancers, and similar products.
- CSS (client-stateless-server): hence the request/response interaction model, and why cookie headers and URLs are commonly capped at around 4 KB.
- COD (code on demand): code is moved from the server to the client and executed there; today's JavaScript-centered front-end ecosystem grew out of this constraint.
- $ (cache): caches sit everywhere among HTTP components, both shared and private; when no explicit lifetime is given, a cache must estimate an expiry value heuristically.

2.4 What are the characteristics of the HTTP protocol?

First, understand which layer of the OSI reference model it belongs to and where it sits in the TCP/IP stack. Then, from the architecture above, we can derive its definition: a stateless application-layer protocol that operates in a request/response manner and uses extensible semantics and self-describing message formats to interact flexibly with network-based hypertext information systems.

3. Be familiar with the protocol format

Learning the HTTP protocol format involves three topics.

3.1 The extended Backus-Naur form, ABNF

A metalanguage describes protocol formats, and ABNF is the metalanguage that strictly defines the format of HTTP. ABNF is not complicated and takes only about ten minutes to learn; it consists of operators and core rules, which I will not list here.

3.2 The HTTP message format

To master the HTTP message format, you need to build a tree-shaped map of the knowledge; please refer to the HTTP knowledge graph I compiled, shared at the beginning of this article.

3.3 The DNS message format

Three things to master here (a sketch of query encoding follows below):
- DNS messages are typically carried over UDP and their overall layout is fixed; you need to understand the meaning of each header field.
- In the Questions section, focus on how the QNAME domain name is encoded and on the meaning of QTYPE.
- The Answers section has more fields; pay particular attention to the compressed, offset-based representation used in the NAME and RDATA parts.
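As a companion to 3.3, here is a minimal sketch of building a DNS query by hand in Python, so the QNAME label encoding and the fixed header become visible. It assumes outbound UDP to a public resolver (8.8.8.8) is permitted; the queried name is just an example.

```python
# A sketch of a hand-built DNS query: fixed 12-byte header, then a
# Question whose QNAME is length-prefixed labels ending in a zero byte.
import socket
import struct

def encode_qname(domain: str) -> bytes:
    # "example.com" -> b"\x07example\x03com\x00"
    out = b""
    for label in domain.split("."):
        out += bytes([len(label)]) + label.encode("ascii")
    return out + b"\x00"

# Header fields: ID, FLAGS (0x0100 sets RD, "recursion desired"),
# QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0.
header = struct.pack("!HHHHHH", 0x1234, 0x0100, 1, 0, 0, 0)
# QTYPE=1 (A record), QCLASS=1 (IN).
question = encode_qname("example.com") + struct.pack("!HH", 1, 1)

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.settimeout(3)
    s.sendto(header + question, ("8.8.8.8", 53))
    reply, _ = s.recvfrom(512)

# ANCOUNT sits in bytes 6-7 of the reply header.
ancount = struct.unpack("!H", reply[6:8])[0]
print(f"{ancount} answer(s) in a {len(reply)}-byte reply")
```

In the reply's Answers section you would typically find the NAME field encoded as a two-byte pointer (top two bits set) back to the QNAME at offset 12, which is exactly the offset-based compression mentioned above.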
4. Understand the application scenarios

HTTP's application scenarios are extremely broad; below I list nine common ones. The methods, response codes, headers, and body encodings discussed under the protocol format all map onto concrete scenarios.

4.1 How to negotiate content

Reactive negotiation is rarely used because the RFCs specify it only vaguely, while proactive negotiation over language, encoding, media type, and so on is what we deal with every day.

4.2 How to submit a form

Although there are three encodings for form submission, the most common one packs multiple representations, separated by boundaries, into a single body. A WAF must consider how SQL injection payloads can be smuggled inside such a body.

4.3 Range requests

Resumable and multi-threaded downloads of large files both rely on the Range specification. To guard against the server-side resource changing in the middle of a multi-request download, the conditional header If-Range was also introduced (see the sketch at the end of this section).

4.4 Cookie and session design

Set-Cookie carries many attributes: expires-av and max-age-av limit a cookie's lifetime, domain-av and path-av limit its scope, secure-av restricts it to secure transports, and httponly-av keeps it away from scripts. All of these restrictions revolve around whether it is safe for the browser to use the cookie. At the same time, for convenience, browsers also support third-party cookies, which makes it easier for vendors to collect user information.

4.5 The browser same-origin policy and cross-origin requests

The same-origin policy is a restriction imposed by the browser; a client built directly on a network library is not subject to it, so the policy's effectiveness depends entirely on the browser's implementation. Note also that the same-origin policy does not protect against CSRF attacks; servers usually defend against CSRF with token-based strategies. Security and convenience must always be weighed against each other, and to make cross-origin AJAX requests possible, CORS was born.

4.6 Conditional requests

Conditional requests not only handle a resource changing midway through a multi-threaded download; they also apply to multi-user collaborative wiki systems and to cache revalidation. They still have plenty of room to grow in RESTful API design.

4.7 Shared caches and private caches

Caches are everywhere on today's Internet. Even when the server does not mark a resource as cacheable, browsers do their best to estimate a period for which to cache it, because caching greatly improves user experience and reduces network load. Many HTTP headers control caching: they govern not only how long a cached response stays fresh but also which request attributes form the cache key.

4.8 Redirection

Redirection is best understood along two dimensions, giving four quadrants: whether the method may change, and whether the redirect may be cached. This yields five response status codes:
- 301: cacheable (permanent), method may change;
- 302: not cacheable by default (temporary), method may change;
- 303: tells the client to fetch the result with GET, typically after a form submission;
- 307: not cacheable by default (temporary), method preserved;
- 308: cacheable (permanent), method preserved.

4.9 Web crawlers

Crawlers are everywhere, and no longer just the long-standing search-engine kind: travel sites (12306 train tickets, AirAsia), e-commerce, and social networks (Sina Weibo) are all heavily harassed by them. Crawlers do not just scrape information; they also simulate human behavior, as ticket-grabbing bots and zombie followers do. On the other side, to court the crawlers of Google and Baidu, all manner of SEO strategies and tutorials have sprung up, and some businesses profit by exploiting PageRank loopholes to push up keyword rankings. Understanding how crawlers work is therefore equally important; a small sketch of the polite-crawler side follows.
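Staying with 4.9 for a moment: a well-behaved crawler checks robots.txt before fetching anything. Below is a small sketch using Python's standard urllib.robotparser; the site and the user-agent string are arbitrary examples.

```python
# A polite crawler's first step: fetch robots.txt and honor it.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # downloads and parses the file

for path in ("/", "/private/reports"):
    url = "https://www.example.com" + path
    verdict = "allowed" if rp.can_fetch("demo-crawler", url) else "disallowed"
    print(f"{url}: {verdict}")
```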
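Finally, to tie together the Range and conditional-request scenarios from 4.3 and 4.6, here is a sketch using Python's standard http.client. The host and path are placeholders, and whether you actually receive 206 responses depends on the server supporting range requests.

```python
# A resumable-download sketch: fetch the first kilobyte, remember the
# resource's validator (ETag), then request the next range with
# If-Range so a changed resource yields a fresh 200 instead of two
# mismatched halves stitched together.
import http.client

HOST = "www.example.com"  # placeholder host; path below is too

conn = http.client.HTTPSConnection(HOST, timeout=5)
conn.request("GET", "/big-file.bin", headers={"Range": "bytes=0-1023"})
resp = conn.getresponse()
first_part = resp.read()
etag = resp.getheader("ETag")  # the validator If-Range will carry
print(resp.status, resp.getheader("Content-Range"), etag)

if etag:
    conn.request("GET", "/big-file.bin",
                 headers={"Range": "bytes=1024-2047", "If-Range": etag})
    resp = conn.getresponse()
    # 206: resource unchanged, next slice returned.
    # 200: resource changed, full body returned; restart the download.
    print(resp.status, len(resp.read()), "bytes")
conn.close()
```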
Of course, HTTP's application scenarios go far beyond these nine, but a thorough grasp of them will give you full command of the common methods, headers, response codes, and so on in the HTTP protocol. HTTP is a very important part of the web protocol stack. As a programmer, whether you work on the front end or the back end, in operations or in testing, if you want to interview for a more senior position, see the technical architecture of a business from a higher vantage point, and solve problems quickly and efficiently when they arise, web protocols are a hurdle you cannot avoid. Proficiency in the commonly used web protocols will let you handle all kinds of network problems at work with ease.