WeTest Introduction
We often hear how powerful the server-side systems of Internet applications such as QQ, WeChat, and Taobao are. So what exactly is impressive about the server side of an Internet application? Why does a massive number of users make a server-side system so much more complicated? This article starts from the most basic concepts and explores the fundamentals of server-side system technology.
Capacity is the reason distributed systems exist
When an Internet business becomes popular, the most obvious technical problem is that the servers are overwhelmed. When 10 million users visit your website every day, no single machine can carry the load, whatever hardware it runs on. Server-side programmers therefore have to consider how to use multiple servers to provide the same service; this is the origin of the so-called "distributed system".

The problems caused by a large number of users accessing the same service are not simple, however. On the surface, the basic requirement is "performance": users complain that pages open slowly, or that actions in an online game are laggy. These demands for "speed" actually break down into four things: high throughput, high concurrency, low latency, and load balancing.

High throughput means the system can support a large number of users at the same time. The required throughput is impossible for a single server, so multiple servers must collaborate to provide it, and using those servers effectively, so that no individual server becomes a bottleneck for the whole system, is something a distributed architecture must consider carefully.

High concurrency is an extension of high throughput. While carrying many users, we want each server to work at full capacity, without unnecessary overhead and waiting. But software does not simply handle many tasks at once "as well as possible"; in many cases a program pays extra cost just choosing which task to process next. This, too, is a problem distributed systems have to solve.

Low latency is not an issue for services with few users, but returning results quickly while a large number of users are accessing the system is much harder. Besides requests queueing up, an overly long queue can exhaust memory and saturate bandwidth, and if failed queued requests are retried, overall latency climbs further. Distributed systems therefore adopt many request sorting and distribution techniques so that more servers can respond to user requests as quickly as possible. But since a distributed system must forward a request several times, those forwarding and transfer operations themselves add latency, so a distributed design should also minimize the number of forwarding layers so that requests are handled as quickly as possible.

Finally, Internet users come from all over the world: from networks and lines with different physical latencies, and from different time zones.
To deal with this diversity of user origins effectively, servers must be deployed in different locations, and multiple servers must share simultaneous requests effectively. This so-called load balancing is a task distributed systems are born to perform.

Since distributed systems are practically the only way to solve the capacity problem of Internet services, mastering distributed system technology is extremely important for a server-side programmer. These problems cannot be solved just by learning a few frameworks and libraries, though: when a program that ran on one computer becomes a program running on countless computers at once, development and operations change enormously.

Basic means of improving carrying capacity: the layered model (routing, proxy)
The simplest way to make multiple servers collaborate on a computing task is to let every server be able to complete every request, and send each request to any server at random. The earliest Internet applications did this with DNS round-robin: when a user resolved a domain name, the name mapped to one of several IP addresses, and the access request went to the server with that IP, so that multiple servers (multiple IPs) handled user requests together.

Simply forwarding requests at random cannot solve everything, however. For example, many Internet services require login. After logging in to one server, a user issues further requests; if we forwarded those to random servers, the user's login state would be lost and some requests would fail. So we add a layer of servers in front, which forward each request, based on the user's cookie or login credentials, to the backend server that handles that user's business.

Besides login, we also find that a lot of data must go through a database, and that data can only be concentrated in one place, otherwise queries on other servers would miss the stored results. So we usually separate the database onto a dedicated group of servers. At this point a typical three-layer structure emerges: access, logic, and storage.

This three-layer structure is not a panacea, though. For example, when users need to interact online (online games are the typical case), online-state data split across different logic servers cannot see each other. We then need a special system like an interaction server: when a user logs in, a record is also written there, noting which server the user logged in on, and interactive operations pass through this server first so that messages can be forwarded correctly to the target user's server.

Another example: in an online forum (BBS) system, we cannot write posts into only one database, because too many read requests would slow it down. We often write to different databases by forum section, or write to several databases at the same time.
The article data is thus stored on different servers to cope with large numbers of requests. But when a user reads an article, a dedicated program must find out which server holds that specific article. So we set up a special proxy layer: all article requests go to it first, and it locates the right database, according to our preset storage plan, to fetch the data.

So although the canonical distributed system has three layers, in practice it often has more than three, layered according to business needs. To forward requests to the right process, we design many processes and servers dedicated to forwarding, usually named Proxy or Router, and a multi-layer architecture tends to contain several kinds of them. These proxy processes usually connect front and back ends over TCP. But while TCP is simple, recovering after a failure is not easy, and TCP network programming is somewhat complicated. That is why people designed a better inter-process communication mechanism: the message queue.

Powerful distributed systems can be built out of various Proxy and Router processes, but their management complexity is also very high. People have therefore built further techniques on top of the layered model to make such systems simpler and more efficient.

Concurrency model (multithreading, asynchronous)
When we write server-side programs, we know that most of them handle multiple requests arriving at the same time, so we cannot simply compute an output from a single input as in Hello World: we receive many inputs at once and must return many outputs. During this processing we often need to "wait" or "block", for example waiting for the database to return a result, or waiting for the reply to a request sent to another process. If we processed requests strictly one by one, this idle waiting time would be wasted, user response latency would grow, and overall system throughput would fall drastically.

For handling many requests at once, the industry has two typical solutions: multithreading and asynchrony. In early systems, multithreading (or multi-processing) was the most commonly used technique. Such code is relatively simple to write, because the code inside each thread executes in sequence. But since multiple threads run simultaneously, you cannot guarantee the ordering of code across threads, which is a very serious problem for logic that processes the same data. The simplest example is a news article's view counter: when two ++ operations run at the same time, the result may be 1 instead of 2. Under multithreading we therefore have to add many data locks, and those locks in turn may cause thread deadlocks.
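To make the counter race concrete, here is a minimal Java sketch (the class and field names are invented for the example): the plain ++ loses updates under two threads, while AtomicInteger makes the increment atomic without an explicit lock.

import java.util.concurrent.atomic.AtomicInteger;

public class ViewCounter {
    private int unsafeCount = 0;                            // plain int: ++ is read-modify-write, not atomic
    private final AtomicInteger safeCount = new AtomicInteger();

    void unsafeIncrement() { unsafeCount++; }               // two threads can both read 0 and both write 1
    void safeIncrement()   { safeCount.incrementAndGet(); } // atomic increment, no lock needed

    public static void main(String[] args) throws InterruptedException {
        ViewCounter c = new ViewCounter();
        Runnable reader = () -> {
            for (int i = 0; i < 100000; i++) { c.unsafeIncrement(); c.safeIncrement(); }
        };
        Thread t1 = new Thread(reader), t2 = new Thread(reader);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // unsafeCount is usually less than 200000 because of lost updates; safeCount is always 200000
        System.out.println("unsafe=" + c.unsafeCount + " safe=" + c.safeCount.get());
    }
}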
Therefore the asynchronous callback model became more popular than multithreading. Besides avoiding deadlocks, asynchrony also avoids the needless overhead of repeated thread switching: every thread needs its own stack space, and when many threads run in parallel their stack data must be saved and restored again and again, consuming extra CPU; and since every thread occupies stack space, memory consumption is also huge when a large number of threads exist. The asynchronous callback model solves these problems well, though asynchronous callbacks are more like a "manual version" of parallel processing, in which developers must handle the "parallelism" themselves.

Asynchronous callbacks rest on non-blocking I/O operations (on the network and on files): instead of getting "stuck" inside a read or write call, the call returns immediately with "data or no data". Linux's epoll uses kernel mechanisms to let us quickly "find" the connections and files that are ready for reading or writing. Since every operation is non-blocking, a single process can handle a large number of concurrent requests. Because there is only one process, all data is processed in a fixed order, and statements from two functions can never interleave as in multithreading, so no "locks" are needed. From this perspective, asynchronous non-blocking techniques greatly simplify development, and with only one thread there is no thread-switching overhead either, which makes them the first choice for systems with high throughput and concurrency requirements. The three core epoll calls are:

int epoll_create(int size);
// Create an epoll handle; size hints to the kernel how many descriptors will be watched.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
// Add, modify, or remove (op) interest in descriptor fd on the epoll instance epfd.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
// Wait up to timeout milliseconds and return the descriptors that are ready for I/O.

Buffering technology
In Internet services, most user interactions require immediate results, so latency matters; for services like online games, latency must be kept within tens of milliseconds. To reduce latency, buffering (caching) is one of the most common techniques in Internet services.

In early web systems, if every HTTP request read and wrote the database (MySQL) directly, the database soon stopped responding because its connections were exhausted: a typical database supports only a few hundred connections, while a web application easily has thousands of concurrent requests. This is the most direct reason so many poorly designed websites freeze when too many people visit. To minimize database connections and accesses, many buffering systems were designed: results queried from the database are stored in faster facilities, and as long as nothing relevant has been modified, they are read directly from there.

The most typical web cache system is Memcache. Because of PHP's threading structure it is stateless, and early PHP could not even operate on "heap" memory, so persistent state had to live in another process; Memcache became the simple, reliable open-source software for storing that temporary state. The processing logic of many PHP applications today is to read from the database and then write the result into Memcache, and to try Memcache first when the next request arrives, which can greatly reduce database accesses.
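The read-then-fill logic described above is commonly known as the cache-aside pattern. Below is a minimal Java sketch of it; the CacheClient and Database interfaces are hypothetical stand-ins for a Memcache client and a database layer, not APIs from this article.

import java.util.HashMap;
import java.util.Map;

public class CacheAsideDemo {
    interface CacheClient { String get(String key); void set(String key, String value, int ttlSeconds); }
    interface Database   { String query(String key); }

    static String read(CacheClient cache, Database db, String key) {
        String value = cache.get(key);                    // 1. try the cache first
        if (value == null) {
            value = db.query(key);                        // 2. on a miss, fall back to the database
            if (value != null) cache.set(key, value, 60); // 3. populate the cache (with a TTL) for later requests
        }
        return value;
    }

    public static void main(String[] args) {
        Map<String, String> mem = new HashMap<>();
        CacheClient cache = new CacheClient() {           // in-memory fake standing in for Memcache
            public String get(String k) { return mem.get(k); }
            public void set(String k, String v, int ttl) { mem.put(k, v); } // TTL ignored in this fake
        };
        Database db = key -> "row-for-" + key;            // fake database; Database is a functional interface
        System.out.println(read(cache, db, "news:42"));   // miss: goes to the DB, fills the cache
        System.out.println(read(cache, db, "news:42"));   // hit: served from the cache
    }
}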
However, Memcache itself is an independent server process without any special clustering feature; Memcache processes cannot directly form a unified cluster. If one Memcache is not enough, we must decide in code, by hand, which data goes to which Memcache process. For a truly large distributed website, managing such a cache system is tedious work.

People therefore began to design more efficient buffering systems. From a performance point of view, every Memcache request pulls in-memory data across the network, which is somewhat wasteful, since the requester's own memory could also hold data. This is why many caching algorithms and techniques that use the requester's memory have spread; the simplest keeps data in a hash table in heap memory, evicted with the LRU algorithm.

Memcache's lack of clustering is also a pain point for users, so many people set out to distribute cached data across machines. The simplest idea is read-write separation: every cache write is recorded in multiple cache processes, while a read may go to any of them at random. This works very well when business data has a clear read-write imbalance. However, not every business can be solved by read-write separation: in interactive online businesses such as communities and games, read and write frequencies are similar and latency requirements are strict. So people combine local memory with remote process caches into a two-level cache, and instead of copying each piece of data to all cache processes at once, they distribute the data across processes by some rule. The most popular algorithm for this distribution is so-called "consistent hashing". Its advantage is that when a process fails, there is no need to relocate all the cached data in the cluster. To see why, imagine distributing data by taking the data ID modulo the number of processes: once the number of processes changes, the home process of almost every piece of data changes, which is very bad for fault tolerance.
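The following is a minimal Java sketch of a consistent-hash ring (the hash function and node names are illustrative; production rings usually add virtual nodes and a stronger hash such as MD5 or Murmur). When a node is removed, only the keys that mapped to it move; under the modulo scheme, almost all keys would move.

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    // Maps positions on the ring to node names; TreeMap gives us "next position clockwise".
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(String node)    { ring.put(hash(node), node); }
    void removeNode(String node) { ring.remove(hash(node)); } // only keys between this node and its predecessor move

    String nodeFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no nodes");
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()  // wrap around the ring
                              : tail.get(tail.firstKey());
    }

    private int hash(String s) { return s.hashCode() & 0x7fffffff; } // illustrative only

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("cache-a"); ring.addNode("cache-b"); ring.addNode("cache-c");
        System.out.println(ring.nodeFor("user:10001")); // stays stable when unrelated nodes leave
    }
}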
Oracle has a well-designed commercial product in this area called Coherence. It supports local memory caches cooperating with remote process caches, its cluster processes are completely self-managing, and it supports user-defined computation (processor functions) inside the process where the cached data lives, making it not just a cache but a distributed computing system.

Storage technology (NoSQL)
Everyone is familiar with CAP theory. Yet in the early days of the Internet, when everyone was still using MySQL, many teams racked their brains over how to make the database store more data and carry more connections. In fact, for many businesses the main storage is files, and the database is an auxiliary facility. When NoSQL emerged, people suddenly realized that the data formats of many Internet businesses are so simple that they often do not need the complex tables of relational databases: the index requirement is often just lookup by the primary index, and the database cannot handle more complex full-text search by itself anyway. Many high-concurrency Internet businesses therefore now prefer NoSQL for storage. The earliest NoSQL databases include MongoDB, and Redis now seems the most popular; some teams even treat Redis as part of their buffering system, which in effect acknowledges Redis's performance advantage.

Besides being faster and larger in capacity, the more important feature of NoSQL is that this kind of storage can be retrieved and written only through a single index. That constraint brings the benefit of distribution: we can decide the process (server) where a piece of data is stored according to its primary index, so a database's data can easily be spread over different servers. Under the inevitable trend of distributed systems, the data storage layer finally found its way to distribute.
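As an illustration of single-index access and key-based routing, here is a hedged sketch using the Jedis client for Redis; the shard host names and key scheme are made up for the example, and the modulo routing shown is the naive scheme that consistent hashing improves on.

import redis.clients.jedis.Jedis;

public class UserStore {
    // Hypothetical shard list; which shard holds a key is decided purely by its primary key.
    private static final String[] SHARDS = { "redis-a.internal", "redis-b.internal" };

    private static Jedis shardFor(String key) {
        int i = (key.hashCode() & 0x7fffffff) % SHARDS.length; // modulo here; prefer consistent hashing in practice
        return new Jedis(SHARDS[i], 6379);
    }

    public static void main(String[] args) {
        String key = "user:10001:profile";        // the primary index is the only lookup path
        try (Jedis jedis = shardFor(key)) {
            jedis.set(key, "{\"name\":\"alice\"}");
            System.out.println(jedis.get(key));
        }
    }
}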
A distributed system is not simply a pile of servers running together to meet demand. Compared with a single machine or a small cluster, it brings some special problems of its own.

Hardware failure rate
A distributed system is by definition more than one server. Suppose a server's average failure rate is 1%: with 100 servers, there is almost always one failing. The analogy may not be exact, but as your system involves more and more hardware, hardware failure turns from an accidental event into an inevitable one. When writing ordinary functional code we generally do not consider what to do if the hardware fails; when writing a distributed system, we must face the problem, or a single failed server may stop an entire cluster of hundreds from working properly. Besides failures of a server's own memory and disks, network failures between servers are even more common, and they may be sporadic or recover on their own. Facing this, simply removing the "failed" machine is not enough: the network may recover after a while, and the cluster could lose more than half of its processing capacity over a temporary fault. How to make the distributed system maintain itself automatically and keep serving externally as much as possible, under failures that may occur at any time, becomes a question that must be considered while writing the program. Since we must take such failures into account, we must also deliberately design redundancy and self-maintenance functions into the architecture. These are not business requirements of the product but purely technical functional requirements, and raising the right requirements here, then implementing them correctly, is one of a server-side programmer's most important responsibilities.

Resource utilization optimization
A distributed cluster contains many servers. When the cluster's hardware carrying capacity reaches its limit, the most natural idea is to add machines. However, improving the capacity of a software system by "adding" hardware is not that easy, because making software work across multiple servers requires complex and detailed coordination. When expanding a cluster, we often have to stop the whole cluster's services, modify various configurations, and finally restart a cluster that includes the new servers. Since each server's memory may hold user data, rashly modifying the service configuration of a running cluster is likely to lose or corrupt in-memory data. Runtime expansion is therefore relatively easy for stateless services, such as adding some web servers, but almost impossible as a simple runtime operation for stateful services, such as online games.

Besides expansion, distributed clusters also need to shrink. When user numbers drop and server hardware sits idle, we often want to reclaim those idle resources for other, newer service clusters. Shrinking resembles disaster recovery after a cluster fault, except that the timing and targets of shrinking are predictable. Expanding and shrinking a distributed cluster, ideally while it stays online, raises very complex technical issues: how to modify the interrelated configurations in the cluster correctly and efficiently, how to operate on stateful processes, and how to keep nodes in the cluster communicating normally during the change. A server-side programmer has to spend a great deal of effort on dedicated development for the series of problems caused by changing the state of many processes in a cluster.

Software service content update
Under the agile development model it is now popular to speak of "iteration": continuously updating a service's programs to meet new requirements and fix bugs. If we manage only one server, updating its program is simple: copy the software package and modify the configuration. But performing the same operation on hundreds or thousands of servers cannot be done by logging into each one; every distributed system developer needs batch installation and deployment tools. Besides copying binaries and configuration files, installation involves many other operations: opening firewalls, creating shared memory files, modifying database table structures, rewriting data files, and sometimes even installing new software on the server. If we consider software updates and version upgrades while developing the server-side programs, planning ahead for configuration files, command-line parameters, and system variables, the installation and deployment tools will run faster and more reliably.

Besides installation and deployment, there is another important issue: data across versions. When we upgrade, the persistent data produced by the old program is generally in the old data format, and if the upgrade changes the data format, for example the table structures, the old-format data must be converted and rewritten into the new version's format.
This pushes us to think through table structures when designing data schemas, using the simplest, most direct representations so that future modification is easier; or to anticipate the scope of change early, pre-reserving some fields or storing data in other flexible forms. Beyond persistent data, if there are client programs (such as a mobile app), their upgrades are often not synchronized with the server's. If an upgrade changes the communication protocol, we may be forced to run different server-side systems for different client versions. To avoid maintaining multiple sets of servers at once, we tend to define protocols in a "version compatible" way when developing the software; how to design a protocol with good compatibility is another issue the server side must consider carefully.

Statistics and decision making
Generally, the log data of a distributed system is collected together and counted uniformly. But once the cluster reaches a certain scale, the volume of these logs becomes terrifying; in many cases, processing one day's logs consumes more than a day of computation. Log statistics has therefore become a very specialized activity. The classic distributed statistics model is Google's MapReduce. It is flexible and can use many servers for statistical work, but it is often inconvenient, because expressing statistics in it is very different from the familiar SQL over data tables, so we frequently end up dumping the data into MySQL for more detailed analysis. Because distributed system logs are huge and ever more complex, we have to master technologies like MapReduce to truly do data statistics on distributed systems, and we must keep finding ways to make the statistical work more efficient.

Directory service (ZooKeeper): a basic means of making distributed systems manageable
A distributed system is a whole made of many processes, and each member has some state: the module it is responsible for, its load, its ownership of certain data, and so on. This data about the other processes becomes very important during fault recovery and capacity changes. A simple distributed system can record it in static configuration files: the connection relationships between processes, their IP addresses and ports, and so on. A highly automated distributed system, however, needs this state stored dynamically, so that the programs can do disaster recovery and load balancing on their own. Some programmers write a DIR service (directory service) to record the running state of the cluster's processes; processes associate with the DIR service automatically, so that during disaster recovery, expansion, or load balancing, request destinations can be adjusted automatically from the DIR data, bypassing failed machines or connecting in new servers. But if this is done with a single process, that process becomes the cluster's "single point": if it fails, the whole cluster may stop running.
Therefore the directory service holding cluster state needs to be distributed as well. Fortunately we have ZooKeeper, an excellent piece of open-source software that is exactly a distributed directory service. Simply start an odd number of ZooKeeper processes and they form a small directory service cluster, offering all other processes the ability to read and write its large "configuration tree". The data is not stored in just one ZooKeeper process but carried by multiple processes under a very safe algorithm, which makes ZooKeeper an excellent distributed data storage system.

Since ZooKeeper's data is organized as a tree, like a file directory, we usually use it by binding each process to one of the "branches" and then routing server requests by inspecting those branches, which easily solves the question of request routing (who does the work). The load of each process can also be marked on its branch, so load balancing becomes easy too. The directory service is one of the most critical components of a distributed system, and ZooKeeper is a good open-source project for the job.
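As a sketch of "binding each process to a branch", the following Java snippet registers a service process in ZooKeeper as an ephemeral node, which disappears automatically when the process's session dies; the paths and address payload are invented for the example, and a real registrar would also handle watches and reconnection.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ServiceRegistrar {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble; a null watcher is enough for this sketch.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 5000, null);

        // EPHEMERAL_SEQUENTIAL: the node vanishes when this session dies, so the
        // directory tree always reflects the set of live processes.
        // Assumes the parent path /services/logic already exists.
        String path = zk.create("/services/logic/node-",
                                "10.0.0.7:9000".getBytes(),   // payload: where to reach this process
                                ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("registered at " + path);

        Thread.sleep(Long.MAX_VALUE); // keep the session (and our registration) alive
    }
}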
Message queue services (ActiveMQ, ZeroMQ, JGroups)
When two processes need to communicate across machines, we almost always use a protocol such as TCP or UDP. But writing cross-process communication directly against the network API is very troublesome. Besides a large amount of low-level socket code, we must handle a series of questions: how to find the process to exchange data with, how to keep packets intact and unlost, what to do if the peer process dies or needs to restart, and so on, covering requirements such as disaster recovery, expansion, and load balancing. To solve inter-process communication in distributed systems, people distilled an effective model: the "message queue".

The message queue model abstracts inter-process interaction into the handling of individual messages, with "queues", that is, pipes, to hold messages temporarily. Each process can access one or more queues, reading messages from them (consuming) or writing messages into them (producing). With a buffered pipe in between, we can change process state safely: when a process restarts, it simply resumes consuming messages. Where a message goes is determined by the queue it is stored in, which turns the complex routing problem into a problem of managing static queues.

Message queue services generally provide two simple interfaces, "deliver" and "collect", but managing the queues is more complex, and there are roughly two approaches. Some message queue services advocate point-to-point queue management: a separate message queue between each pair of communicating nodes. The advantage is that messages from different sources do not affect one another: no queue's messages can squeeze out the cache space of other queues, and the consuming program can define its own processing priority, collecting more from one queue and less from the others.

However, such point-to-point queues multiply as the cluster grows, which becomes complicated for memory usage and operations management. More advanced message queue services therefore let different queues share memory space, and automate the address lookup, creation, and deletion of queues. These automations usually rely on the "directory service" described above to register the physical IP and port behind each queue ID; many developers, for example, use ZooKeeper as the central node of their message queue service, while software like JGroups maintains a cluster membership state recording the present and past of every node.

The other type of message queue works like a public mailbox: the message queue service is a process of its own, and any client can deliver messages to it or collect messages from it. This makes queues easier to use and more convenient to operate and manage. The trade-off is that any message now passes through at least two inter-process hops between being sent and being processed, so latency is relatively high, and because delivery and collection are not constrained in advance, bugs creep in more easily.

Whichever message queue service is used, inter-process communication is a problem every distributed server-side system must solve. As a result, a server-side programmer writing distributed code spends most of the time writing message-queue-driven code, which is exactly why EJB 3.0 added "message-driven beans" to the specification.
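To make the "deliver" and "collect" interfaces concrete, here is a hedged sketch using the classic JMS API with an ActiveMQ broker; the broker URL and queue name are placeholders, and error handling is omitted.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class QueueDemo {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://mq-host:61616"); // placeholder broker
        Connection conn = factory.createConnection();
        conn.start();
        Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("orders");               // placeholder queue name

        MessageProducer producer = session.createProducer(queue);  // "deliver": write a message into the queue
        producer.send(session.createTextMessage("order:10001"));

        MessageConsumer consumer = session.createConsumer(queue);  // "collect": read it back (1s timeout)
        TextMessage msg = (TextMessage) consumer.receive(1000);
        System.out.println("received: " + msg.getText());

        conn.close();
    }
}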
Transaction System
In a distributed system, transactions are among the hardest technical problems to solve. Since a piece of processing may be spread across different processes, any process may fail, and that failure requires a rollback, mostly involving several other processes: a diffuse multi-process communication problem. Solving transactions in a distributed system requires two core tools: a stable state storage system, and a convenient, reliable broadcast system.

The status of every step of a transaction must be visible throughout the cluster and survive failures. This requirement is generally met by the cluster's "directory service": if our directory service is robust enough, we can synchronously write the processing status of each transaction step into it, and ZooKeeper again plays an important role here. If a transaction is interrupted and must be rolled back, the rollback involves multiple steps that have already executed. Perhaps rolling back at the entry point is enough (if the data needed for the rollback is kept there), or every processing node may need to roll back. In the latter case, the node that detected the fault broadcasts a message like "Roll back! Transaction ID XXXX" to all other related nodes. The underlying broadcast is usually carried by a message queue service; software such as JGroups provides broadcast directly.

Although we are discussing transaction systems here, the "distributed lock" feature that distributed systems often need can be provided by the same machinery. A "distributed lock" is a constraint that lets every node check first and execute second. If we have an efficient directory service that supports atomic single operations, the lock state is really the state record of a "single-step transaction", with the rollback operation defaulting to "pause and retry later". This "lock" approach is simpler than transaction processing and therefore more reliable, so more and more developers choose such a "lock" service instead of implementing a full "transaction system".
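One common way to build such a check-first lock on ZooKeeper is to treat successful creation of an ephemeral node as acquiring the lock; a hedged sketch follows (the lock path is illustrative, and a production lock would watch for the node's deletion rather than simply retrying later).

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkLock {
    private final ZooKeeper zk;
    public ZkLock(ZooKeeper zk) { this.zk = zk; }

    // Returns true if we hold the lock; the ephemeral node is removed
    // automatically if our process crashes, so the lock cannot leak.
    boolean tryLock(String lockPath) throws Exception {
        try {
            zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                      // we created the node: lock acquired
        } catch (KeeperException.NodeExistsException e) {
            return false;                     // someone else holds it: "pause and retry later"
        }
    }

    void unlock(String lockPath) throws Exception {
        zk.delete(lockPath, -1);              // -1: ignore the version check
    }
}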
Automatic deployment tool (Docker)
The biggest operational requirement of a distributed system is to change service capacity at runtime, expanding or shrinking, possibly without being allowed to interrupt service; and when some nodes fail, new nodes are needed to restore work. If all this were still done the old-fashioned way, by filling in forms, filing requests, entering the machine room, installing servers, deploying software..., efficiency would be hopeless.

In a distributed environment we generally manage services as a "pool". We requisition a batch of machines in advance, run service software on some, and keep the others as spares. Obviously, such a batch of servers will not serve only one business; it hosts several different businesses, and the spare servers become a backup "pool" shared by all of them. As business needs change, some servers "leave" service A and "join" service B. Such frequent service changes depend on highly automated software deployment tools; our operations staff should master deployment tools provided by the developers, rather than thick manuals, to carry out this kind of work. Some experienced development teams unify the underlying frameworks of all their businesses, hoping that most deployment and configuration can be managed by one general system. The open-source world has made similar attempts, the best known being the RPM package format; but RPM packaging is still too complex to fit the deployment needs of server-side programs, so programmable general deployment systems, represented by Chef, appeared later.

To manage a large number of distributed server-side processes, we genuinely need to invest heavily in optimizing their deployment and management. Unifying the runtime specification of server-side processes is the basic precondition for automated deployment management: we can standardize at the "operating system" level with Docker technology, standardize on some PaaS platform technology, or define more specific in-house specifications and build a complete distributed computing platform ourselves.

Log Service (log4j)
Server-side logging has always been important yet easily overlooked. Many teams begin by treating logs as a mere aid for development, debugging, and bug hunting, but they soon find that once the service is live, logs are almost the only effective means of understanding what the program is doing at runtime. Although we have all kinds of profiling tools, most are unsuitable for turning on in a formally operating service, because they seriously degrade performance; so more often than not we must analyze from logs. Logs are essentially line-by-line text, yet because they are so flexible, developers and operators value them highly. Conceptually a log is a vague thing: you can open any file and write some information into it. Modern server systems, however, usually standardize logs along the following lines:

- logs must be written line by line, which makes later statistical analysis easier;
- every log line should carry a uniform header; date and time are the minimum requirement;
- output should be leveled, for example fatal/error/warning/info/debug/trace, and the program should be able to adjust the output level at runtime to save the cost of log printing;
- the header usually carries information such as a user ID or IP address, used to quickly find, locate, and filter a batch of log records, plus other fields for narrowing the viewing range, the so-called "dyeing" feature;
- log files need "rotation": keep several files of fixed size so that long-running services do not fill the disk.

To meet these needs, the open-source world provides many logging libraries, such as the famous log4j and the many-membered log4X family, all widely used and well received.

Compared with log printing, however, log collection and statistics are even more easily overlooked. A distributed system programmer naturally wants to collect and analyze the whole cluster's logs from one central place, and even to obtain certain statistics repeatedly at short intervals to monitor the health of the entire cluster. To achieve this there must be a distributed file system to store the continuously arriving logs (often sent over UDP), and on top of that file system a statistics system in the MapReduce style, so that massive log data can be counted quickly and raise alarms. Some developers use the Hadoop system directly; others use Kafka as the log storage system and build their own statistics programs.
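A minimal sketch of leveled logging with the classic log4j 1.x API follows; the class and messages are invented, and real deployments would configure appenders and line headers in a log4j configuration file rather than with BasicConfigurator.

import org.apache.log4j.BasicConfigurator;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class PaymentService {
    private static final Logger log = Logger.getLogger(PaymentService.class);

    void pay(String userId, long amountCents) {
        // Each line is stamped with the configured header (timestamp, level, logger name).
        log.info("user=" + userId + " pay start, amount=" + amountCents);
        if (amountCents <= 0) {
            log.warn("user=" + userId + " rejected: non-positive amount");
            return;
        }
        log.debug("user=" + userId + " calling bank gateway"); // dropped when the level is INFO
    }

    public static void main(String[] args) {
        BasicConfigurator.configure();                                // console appender, just for this sketch
        Logger.getLogger(PaymentService.class).setLevel(Level.INFO);  // adjust the level at runtime to cut print cost
        new PaymentService().pay("10001", 2500);
    }
}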
Log services are the instrument panel and periscope of distributed operations. Without a reliable log service, the health of the whole system may go unmonitored. So however many or few nodes your distributed system has, serious effort and dedicated development time must be spent building automated statistical analysis of logs.

Problems distributed systems cause for development efficiency, and their solutions
As the sections above suggest, beyond the business's functional requirements, a distributed system needs many additional non-functional requirements, usually designed and implemented so that a multi-process system runs stably and reliably. This "extra" work generally makes the code more complicated, and without good tools, development efficiency drops sharply.

Microservice framework: EJB, WebService
When we discuss distributing server-side software, communication between service processes is unavoidable, and it is not just sending and receiving messages: it also involves message routing, encoding and decoding, reading and writing of service state, and so on. Developing the whole thing by oneself is exhausting. So the industry introduced distributed server-side development frameworks long ago, the most famous being EJB, Enterprise JavaBeans. Any technology labeled "enterprise" usually involves the distributed part, and EJB is indeed a technology for distributed object calls. If multiple processes are to cooperate on a task, the task is decomposed into several "classes" whose objects live in the various process containers and provide services collaboratively. The approach is very "object-oriented": each object is a "microservice" offering some distributed capability.

Other systems moved toward learning from the Internet's own basic model: HTTP. Hence the many WebService frameworks, from open source to commercial software, each with its own WebService implementation. This model reduces the complex operations of routing, encoding, and decoding to the common HTTP operation, which is a very effective abstraction: developers develop and deploy multiple WebServices onto web servers and thereby assemble a distributed system.

Whether we learn EJB or WebService, what we actually need is to simplify the complexity of distributed calls. That complexity comes from having to fold disaster recovery, capacity scaling, and load balancing into cross-process calls. Using one common set of code for all cross-process communication (calls), and implementing non-functional requirements such as disaster recovery, scaling, load balancing, overload protection, and state cache hits uniformly, can greatly simplify the whole distributed system. A microservice framework generally observes the state of all cluster nodes at the routing stage: which addresses run which service processes, what the load of those processes is, and whether they are available. Then, for stateful services, it uses algorithms like consistent hashing to try to raise the cache hit rate.
When the state of nodes in the cluster changes, every node under the microservice framework learns of the change as soon as possible and re-plans subsequent service routing based on the current state, achieving automated routing that avoids overloaded or failed nodes.

Some microservice frameworks also provide tools that turn an IDL into "skeleton" and "stub" code, so that when writing remote calls you need not write any network-related code at all: all the transport-layer and encoding-layer code is generated. EJB, Facebook's Thrift, and Google's gRPC all have this ability. With code generation, writing a functional module usable in distributed form (a function or a class) becomes as simple as writing a local function, which is certainly a very important efficiency gain for distributed systems.

Asynchronous programming tools: coroutines, Future, Lambda
In distributed systems you will inevitably meet large numbers of "callback"-style APIs, because distributed systems involve a great deal of network communication: any business command may be decomposed across multiple processes and composed through multiple network exchanges. With asynchronous non-blocking programming models in fashion, our code runs into "callback functions" everywhere. However, the callback style of asynchronous programming is very unfriendly to code reading. You cannot read the code from beginning to end to understand how a business task is gradually completed: the code belonging to one task is split by the non-blocking callbacks into many callback functions connected across different parts of the codebase. Worse, we sometimes choose the "observer pattern", registering a large number of "event-response functions" in one place and firing events wherever callbacks are needed. Such code is even harder to understand than plainly registered callbacks, because the response functions for an event are usually not found where the event is fired: they live in other files, sometimes change at runtime, and the event names themselves are often baffling, since it is nearly impossible to give understandable names to hundreds of events.

A better way to organize callbacks is often called the Future/Promise model. Its basic idea is to "write all the callbacks together, in one place, up front". It is a very practical programming model: it does not let you eliminate callbacks entirely, but it lets you gather callbacks scattered all over the code and concentrate them in one place, where the same stretch of code shows clearly how each asynchronous step executes, in series or in parallel.
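Here is a hedged Java sketch of the Future/Promise style using CompletableFuture; the "remote calls" are stubbed and all names are invented. Note how one stretch of code shows which steps run in series and which in parallel.

import java.util.concurrent.CompletableFuture;

public class FutureDemo {
    // Stubbed "remote calls": each just computes a value asynchronously.
    static CompletableFuture<String> fetchUser(String id) {
        return CompletableFuture.supplyAsync(() -> "user-" + id);
    }
    static CompletableFuture<Integer> fetchScore(String user) {
        return CompletableFuture.supplyAsync(() -> user.length() * 10);
    }
    static CompletableFuture<String> fetchBadge(String user) {
        return CompletableFuture.supplyAsync(() -> "gold");
    }

    public static void main(String[] args) {
        CompletableFuture<String> result =
            fetchUser("10001")                                 // step 1 (serial)
                .thenCompose(user ->
                    fetchScore(user)                           // steps 2a and 2b run in parallel
                        .thenCombine(fetchBadge(user),
                                     (score, badge) -> user + " score=" + score + " badge=" + badge));
        System.out.println(result.join());                     // wait for the whole chain
    }
}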
Finally, the lambda model. This style became popular with the wide adoption of the JavaScript language. In other languages, defining a callback function used to be quite troublesome: the Java language required designing an interface and then implementing it, downright cumbersome; C/C++ supports function pointers, which are simpler but easily make code hard to understand; scripting languages are better, though they too require defining a function. Writing the callback body directly at the call site is the most convenient to develop and also easier to read. More importantly, lambda generally implies closure: the call stack of the callback is preserved automatically. Where asynchronous operations would otherwise need "session pool"-style state variables, with closures that state simply takes effect naturally, similar to what coroutines provide.
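A small Java illustration of the difference follows; the AsyncHttp interface is a made-up stand-in for any callback-style API. Note how the lambda's closure captures requestId without any explicit "session pool" state.

interface AsyncHttp {
    // A made-up callback-style API: fetch a URL, then hand the body to the callback.
    void get(String url, java.util.function.Consumer<String> onDone);
}

public class LambdaDemo {
    static void fetch(AsyncHttp http) {
        final String requestId = "req-42";    // captured by the closures below

        // Old Java style: a separate anonymous class for every callback.
        http.get("http://example/a", new java.util.function.Consumer<String>() {
            @Override public void accept(String body) {
                System.out.println(requestId + " got " + body.length() + " bytes");
            }
        });

        // Lambda style: the callback body sits at the call site and still sees requestId.
        http.get("http://example/b",
                 body -> System.out.println(requestId + " got " + body.length() + " bytes"));
    }

    public static void main(String[] args) {
        // Trivial synchronous fake implementation, enough to run the sketch.
        fetch((url, onDone) -> onDone.accept("<html>" + url + "</html>"));
    }
}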
Whichever asynchronous programming method is used, the coding complexity is definitely higher than for synchronously called code. So when writing distributed server-side code, we must plan the code structure carefully to prevent functional code from being added haphazardly and destroying readability. Unreadable code is unmaintainable code, and server-side code full of asynchronous callbacks is especially prone to becoming so.

Cloud service models: IaaS/PaaS/SaaS
Throughout the development and use of complex distributed systems, how to operate and maintain a large number of servers and processes is a constant problem. Microservice frameworks, unified deployment tools, and log monitoring services all exist because centrally managing a large number of servers is very difficult. The underlying reason is that a large amount of hardware and network cuts the logical computing power into many small pieces. As computing capability has improved, virtualization technology has emerged to re-unite the divided computing units more intelligently. The most common form is IaaS: when one physical server can run multiple virtual server operating systems, the amount of hardware we must maintain drops by multiples.

The popularity of PaaS lets us deploy and maintain a system runtime environment for a specific programming model, instead of installing operating systems, configuring runtime containers, and uploading code and data one server after another. Before unified PaaS existed, installing large numbers of MySQL databases alone once consumed enormous time and energy.

When a business model is mature enough to be abstracted into fixed software, our distributed system becomes even easier to use. Computing power is then no longer code and libraries but a cloud SaaS providing services over the network. Users need no maintenance or deployment work at all; they just apply for an interface, fill in an expected capacity quota, and use it directly. This not only saves a great deal of the time spent developing the corresponding functions, but also hands a large amount of operations work to the SaaS maintainers, who are more professional at such maintenance.

From IaaS to PaaS to SaaS, the applicable scope of the operations model may grow narrower and narrower, but the convenience of use improves multiply. This also proves that, through division of labor, software work can gain efficiency by moving in ever more specialized and subdivided directions.

Summary: paths to solving distributed system problems
In response to the problem of server carrying capacity, Tencent WeTest has distilled more than ten years of internal practice into stress tests based on real business scenarios and user behavior, helping game developers discover server-side performance bottlenecks, perform targeted performance tuning, reduce server procurement and maintenance costs, and improve user retention and conversion rates.