1. Do you know how Kafka's ultra-high-concurrency network architecture is designed?
Kafka's network communication layer is built on Java NIO and the Reactor design pattern. Let's first look at the complete network communication layer architecture as a whole, as shown in the following figure:
01 Acceptor Thread
In the classic Reactor design pattern there is a "Dispatcher" role, which receives external requests and dispatches them to the actual processing threads. In Kafka's network architecture, this Dispatcher is the "Acceptor thread", the thread that accepts and creates external TCP connections. On the Broker side, each SocketServer instance creates only one Acceptor thread per listener. Its main job is to accept new connections and hand them to the downstream Processor threads, which handle the actual requests.
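To ground this, here is a minimal Java NIO sketch of the Acceptor role. It is an illustration of the pattern, not Kafka's actual SocketServer code, and it assumes a simplified ProcessorSketch class (shown after the next paragraph) that receives the hand-offs:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.List;

// Acceptor sketch: a single thread accepts TCP connections and hands each
// one to a Processor thread round-robin. Illustrative only.
public class AcceptorSketch implements Runnable {
    private final ServerSocketChannel serverChannel;
    private final List<ProcessorSketch> processors; // downstream Processor threads
    private int next = 0; // round-robin cursor (only this thread touches it)

    public AcceptorSketch(int port, List<ProcessorSketch> processors) throws IOException {
        this.serverChannel = ServerSocketChannel.open();
        this.serverChannel.bind(new InetSocketAddress(port));
        this.processors = processors;
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                SocketChannel channel = serverChannel.accept(); // blocks until a client connects
                channel.configureBlocking(false);
                // spread connections evenly across Processor threads
                processors.get(next).accept(channel);
                next = (next + 1) % processors.size();
            } catch (IOException e) {
                e.printStackTrace(); // sketch: real code would log and continue
            }
        }
    }
}
```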
02 Processor Thread
The Acceptor thread only handles inbound connection establishment; the subsequent management of network connections and the distribution of network requests are done by the Processor threads. Each Processor thread creates three queue-like structures when it is instantiated: a newConnections queue (connections just handed over by the Acceptor), a responseQueue (responses waiting to be written back to clients), and inflightResponses (strictly a map, holding responses that have been sent out but whose completion callbacks have not yet run).
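Continuing the sketch, a simplified Processor thread might look as follows: it drains its newConnections queue, registers the sockets with its private Selector, and pushes complete reads into a shared request queue standing in for Kafka's RequestChannel (response handling is omitted to keep it short):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Processor sketch: owns a Selector, a queue of new connections handed over
// by the Acceptor, and forwards what it reads to a shared queue that stands
// in for Kafka's RequestChannel. Illustrative only.
public class ProcessorSketch implements Runnable {
    private final BlockingQueue<SocketChannel> newConnections = new ArrayBlockingQueue<>(20);
    private final BlockingQueue<byte[]> requestQueue; // shared with the handler pool
    private final Selector selector;

    public ProcessorSketch(BlockingQueue<byte[]> requestQueue) throws IOException {
        this.requestQueue = requestQueue;
        this.selector = Selector.open();
    }

    // Called by the Acceptor thread to hand over a new connection.
    public void accept(SocketChannel channel) {
        newConnections.offer(channel);
        selector.wakeup(); // unblock the poll loop so it registers the socket
    }

    @Override public void run() {
        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (!Thread.currentThread().isInterrupted()) {
            try {
                // 1) register connections handed over by the Acceptor
                SocketChannel ch;
                while ((ch = newConnections.poll()) != null)
                    ch.register(selector, SelectionKey.OP_READ);
                // 2) poll sockets and forward what was read as a "request"
                selector.select(300);
                for (SelectionKey key : selector.selectedKeys()) {
                    buf.clear();
                    int n = ((SocketChannel) key.channel()).read(buf);
                    if (n < 0) { key.cancel(); continue; } // peer closed
                    if (n > 0) {
                        byte[] request = new byte[n];
                        buf.flip();
                        buf.get(request);
                        requestQueue.put(request); // hand off to the handler pool
                    }
                }
                selector.selectedKeys().clear();
            } catch (IOException e) {
                // sketch: real code would close the offending channel
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```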
03 RequestHandlerPool Thread Pool
The Acceptor and Processor threads are merely "porters" of requests and responses; the KafkaRequestHandlerPool thread pool is what actually processes Kafka requests. In the ultra-high-concurrency network architecture diagram above, two parameters govern this pipeline: num.network.threads, which sets the number of Processor threads per listener (default 3), and num.io.threads, which sets the size of the I/O worker thread pool, i.e. the number of KafkaRequestHandler threads (default 8).
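Finally, a sketch of the handler-pool idea: num.io.threads worker threads block on the shared request queue, do the real work, and enqueue responses for the Processors to send back. The handle method below is a placeholder; in Kafka the actual dispatch happens in KafkaApis:

```java
import java.util.concurrent.BlockingQueue;

// Handler pool sketch: worker threads (sized by num.io.threads in Kafka)
// take requests from the shared queue, process them, and enqueue responses.
public class RequestHandlerPoolSketch {
    public RequestHandlerPoolSketch(int numIoThreads,
                                    BlockingQueue<byte[]> requestQueue,
                                    BlockingQueue<byte[]> responseQueue) {
        for (int i = 0; i < numIoThreads; i++) {
            Thread t = new Thread(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        byte[] request = requestQueue.take();  // block until work arrives
                        byte[] response = handle(request);     // the "actual processing"
                        responseQueue.put(response);           // a Processor writes it back
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }, "kafka-request-handler-" + i);
            t.start();
        }
    }

    // Placeholder: a real broker dispatches by request type in KafkaApis
    // (produce -> append to log, fetch -> read from log, ...).
    private byte[] handle(byte[] request) {
        return request;
    }
}
```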
Next, let's walk through the core flow of a complete request, in conjunction with the Kafka ultra-high-concurrency network architecture diagram above:
1) A client establishes a TCP connection with the Broker; the Acceptor thread accepts it and hands the new connection to one of the Processor threads in round-robin fashion.
2) The Processor registers the connection with its Selector, reads the request from the socket, and places it into the shared request queue of the RequestChannel.
3) An idle KafkaRequestHandler thread takes the request from the queue and hands it to KafkaApis, which performs the actual processing (for example, appending messages to the log or reading them back).
4) The resulting response is placed into the response queue of the Processor that owns the connection.
5) That Processor takes the response from its response queue and writes it back to the client.
2. Do you know how Kafka's high-throughput log storage architecture is designed?
Kafka is mainly used to process massive data streams. The main features of this scenario are:
1) Write operations: write concurrency is extremely high, on the order of millions of messages per second; writes are pure appends, and existing data is never updated in place.
2) Read operations: reads are simpler than writes, and the dominant pattern is sequential consumption starting from a given Offset rather than random access.
Based on the above two analyses, for write operations, a plain "sequential append log" satisfies Kafka's requirement of millions of TPS on the write path.
How do we query these logs efficiently, then? Since the Offset of a message can be designed as an ordered field, messages are stored in the log file in Offset order, and there is no need to introduce an additional hash table structure. The messages can simply be divided into blocks, and for each block we only index the Offset of the first message in it. Lookup then works somewhat like binary search: first locate the block covering the target Offset, then scan sequentially within that block, as shown in the following figure. This quickly narrows down the position of the message being searched for. In Kafka, this index structure is called a "sparse index": it is sparse precisely because not every message gets an index entry.
This is the storage implementation Kafka finally settles on: sequential append log + sparse index (a code sketch of the idea appears at the end of this section).
Next, let's look at the Kafka log storage structure. As the figure shows, Kafka organizes logs by "topic + partition + replica + segment + index".
Now that we know the overall storage layout, let's look at the Kafka log format, which has gone through several version iterations; here we focus on the V2 format. From the figure we can conclude that the V2 format improves the space utilization of messages mainly through variable-length encoding, and hoists shared fields up into the message batch (RecordBatch). A batch can hold multiple messages, which saves considerable disk space when messages are sent in batches.
Finally, the overall flow of writing log messages to disk is shown in the following figure:
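To make the sparse index concrete, here is a minimal Java sketch of the idea: index only the first message of each block, then answer lookups with a floor search plus a short sequential scan. This illustrates the principle only and is not Kafka's actual .index implementation; the class name and the 4 KB interval are illustrative (Kafka's real knob is log.index.interval.bytes):

```java
import java.util.Map;
import java.util.TreeMap;

// Sparse offset index sketch: only block boundaries are indexed, so the
// index stays tiny even for huge logs. Not Kafka's real implementation.
public class SparseIndexSketch {
    // logical offset -> byte position of the first message of that block
    private final TreeMap<Long, Long> index = new TreeMap<>();
    private final int indexIntervalBytes = 4096;       // like log.index.interval.bytes
    private long bytesSinceLastEntry = Long.MAX_VALUE; // force an entry for the first message

    // Called for every sequential append to the log file.
    public void maybeIndex(long offset, long filePosition, int messageSize) {
        if (bytesSinceLastEntry >= indexIntervalBytes) {
            index.put(offset, filePosition);           // sparse: one entry per block
            bytesSinceLastEntry = 0;
        }
        bytesSinceLastEntry += messageSize;
    }

    // Binary-search the index for the greatest indexed offset <= target;
    // the caller then scans forward sequentially from that file position.
    public long lookup(long targetOffset) {
        Map.Entry<Long, Long> floor = index.floorEntry(targetOffset);
        return floor == null ? 0L : floor.getValue();
    }
}
```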
3. How do you deploy the Kafka online cluster?
Here, starting from the essential skills of an architect and taking an e-commerce platform as an example, we explain how to carry out a production-grade Kafka capacity assessment, and how to win approval for your plan from company leadership and the operations department. For more details, please read: Eight steps to deeply analyze the Kafka production-level capacity assessment solution.

4. How do you monitor the Kafka online system?
Kafka is an important part of large-scale system architectures and plays a vital role, so the stability of the Kafka cluster is particularly important, and the production cluster must be monitored in an all-round way. In general, an online system can be monitored along the following five dimensions.

01 Host Node Monitoring
Host node monitoring means watching the performance of the machines on which the Kafka cluster's Brokers run. It is the most important dimension for Kafka, because many problems in an online environment surface first as host-level performance problems, so host monitoring is usually where issues are discovered first. The main performance indicators are: machine load, CPU usage, memory usage, disk I/O usage, network I/O usage, number of TCP connections, number of open files, and inode usage.

02 JVM Monitoring
Another important dimension is JVM monitoring. Monitoring the JVM process gives you a comprehensive understanding of the Kafka Broker process. Three indicators matter most: the frequency and duration of Full GCs, the size of live objects on the heap, and the total number of application threads.

03 Kafka Cluster Monitoring
Next comes monitoring of the Kafka Broker cluster itself and its various clients, mainly through three means: Kafka's own command-line tools, JMX metrics, and third-party monitoring frameworks such as Kafka Manager or Kafka Eagle.
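As a small illustration of the JMX route, the sketch below polls two key broker health metrics. The MBean names are real Kafka metrics; the host broker1 and port 9999 are assumptions (the broker must be started with remote JMX enabled, e.g. via JMX_PORT):

```java
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Poll two broker health metrics over JMX: under-replicated partitions
// should be 0, and active controller count should be 1 on exactly one broker.
public class JmxMonitorSketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker1:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            Map<String, String> mbeans = Map.of(
                    "UnderReplicatedPartitions",
                    "kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions",
                    "ActiveControllerCount",
                    "kafka.controller:type=KafkaController,name=ActiveControllerCount");
            for (Map.Entry<String, String> e : mbeans.entrySet()) {
                Object value = mbs.getAttribute(new ObjectName(e.getValue()), "Value");
                System.out.println(e.getKey() + " = " + value); // alert on URP > 0
            }
        }
    }
}
```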
04 Kafka Client Monitoring
Client monitoring mainly means monitoring Producers and Consumers. Producers send messages to Kafka, so we need to know the round-trip time (RTT) between the client machines and the Broker machines; for clusters spanning data centers or regions, the RTT is larger, which makes high TPS difficult to sustain.
From the Producer's perspective: request-latency, the latency of message produce requests, is the JMX indicator that deserves the most attention. The running state of the Sender thread is also important: if the Sender thread dies, the user is not directly notified, and the only visible symptom is that message sends on the Producer side fail.
From the Consumer's perspective: for a Consumer Group, watch the join rate and sync rate indicators, which reflect how often rebalances occur. Consumption offsets and message lag (accumulation) should also be tracked.

05 Monitoring Between Brokers
The last dimension is monitoring between Brokers, which mainly means the performance of replica fetching. Follower replicas pull data from Leader replicas in real time, and we want that pull to be as fast as possible. Kafka exposes a particularly important JMX indicator for this, "under-replicated partitions": if, say, a message is supposed to be stored on two Brokers but currently only one holds it, the message's partition counts as under-replicated. This deserves special attention because it can lead to data loss.
Another important indicator is "active controller count". Across the entire cluster, exactly one machine should report 1 and every other machine 0. If any machine reports a value greater than 1, a split brain has occurred, and you should check for a network partition. Kafka itself cannot defend against split brain; it relies entirely on ZooKeeper for this. If a network partition really does occur, there is no clean remedy; the right approach is to fail fast and restart.

5. How do you tune the Kafka online system?
For Kafka, "throughput" and "latency" are the key optimization targets.
Throughput (TPS): the number of messages the Broker or a Client can process per second; the higher the better.
Latency: the time from the Producer sending a message, through the Broker persisting it, to the Consumer successfully consuming it; in contrast to throughput, the shorter the better.
In short, high throughput and low latency are the main goals when tuning a Kafka cluster.

01 Improve Throughput
First, the parameters and measures for improving throughput (a configuration sketch follows this list):
Broker side: appropriately increase num.replica.fetchers so that follower replicas catch up with leaders faster, without exceeding the number of CPU cores.
Producer side: increase batch.size (for example from the default 16 KB to 512 KB), increase linger.ms (for example to 10~100 ms) so batches fill up, enable compression with compression.type=lz4 or zstd, set acks=0 or 1, set retries=0, and enlarge buffer.memory if multiple threads share one Producer instance.
Consumer side: use multiple Consumer processes or threads to consume in parallel, and increase fetch.min.bytes (for example from the default 1 byte to 1 KB or more).
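Below is a minimal sketch of throughput-oriented Producer settings corresponding to the list above. The broker address and topic are placeholders, and the concrete values (512 KB batches, 50 ms linger, lz4) are illustrative starting points rather than universal recommendations:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Throughput-oriented producer configuration (illustrative values).
public class HighThroughputProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 512 * 1024);        // batch.size: 16KB -> 512KB
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);                 // linger.ms: wait to fill batches
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");       // compression.type
        props.put(ProducerConfig.ACKS_CONFIG, "1");                     // acks: leader-only ack
        props.put(ProducerConfig.RETRIES_CONFIG, 0);                    // retries: favor throughput
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, 128 * 1024 * 1024L); // buffer.memory
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "value")); // placeholder topic
        }
    }
}
```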
02 Reduce Latency
The goal of reducing latency is to minimize end-to-end delay. In contrast to the throughput parameters above, there is little to adjust beyond the Producer and Consumer configurations.
On the Producer side, we want messages sent out as quickly as possible, so set linger.ms=0, turn compression off, and set acks=1 so that a send is acknowledged as soon as the leader has it, without waiting for full replica synchronization.
On the Consumer side, keep fetch.min.bytes=1, so that the Broker returns data to the Consumer as soon as any is available, reducing latency.

03 Set the Number of Partitions Reasonably
More partitions are not always better, nor are fewer partitions always better. You need to stand up the cluster, run stress tests, and then adjust the partition count flexibly. Kafka's official scripts can be used for the stress testing:
1) Producer stress test: kafka-producer-perf-test.sh
2) Consumer stress test: kafka-consumer-perf-test.sh
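kafka-producer-perf-test.sh is the right tool for real stress tests; purely as an illustration of what it measures, here is a rough Java throughput probe (the broker address and topic name are placeholders):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Very rough throughput probe; prefer kafka-producer-perf-test.sh for
// real stress tests of candidate partition counts.
public class MiniPerfProbe {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        int numRecords = 1_000_000, recordSize = 1024;
        byte[] payload = new byte[recordSize];
        long start = System.nanoTime();
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < numRecords; i++)
                producer.send(new ProducerRecord<>("perf-test-topic", payload)); // placeholder
            producer.flush(); // wait for all batches to be sent
        }
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("%.0f records/sec, %.2f MB/sec%n",
                numRecords / secs, numRecords * (double) recordSize / (1024 * 1024) / secs);
    }
}
```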