[51CTO.com original article] On May 18-19, 2018, the Global Software and Operations Technology Summit (WOT) hosted by 51CTO was held in Beijing. Technical leaders from companies around the world gathered to discuss the frontiers of software technology and to explore the new boundaries of operations technology. Beyond the star-studded main forum, the conference's 12 sub-forums each had a distinctive focus. At the "Open Source and Container Technology" sub-forum on the afternoon of the 19th, Zhang Fuxing, head of Zhihu's computing platform, delivered a speech entitled "The Evolution of Zhihu's Container Platform and Its Integration Practice with Big Data".
Zhang Fuxing, head of Zhihu's computing platform, holds a master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences. Before joining Zhihu, he worked on distributed storage and cloud platform development at Sohu Research Institute and the Yahoo Beijing R&D Center. He is now mainly responsible for containers, traffic load balancing, and basic big data components on Zhihu's computing platform. His speech covered three topics: the evolution of Zhihu's container platform, the pitfalls encountered in the production environment, and the integration of container technology with big data applications.

The evolution of Zhihu's container platform

Zhang Fuxing began by introducing the evolution of Zhihu's container platform, which can be roughly divided into three stages. In September 2015, the container platform was officially launched in the production environment; by May 2016, 90% of Zhihu's business had been migrated to it; today, in addition to business containers, basic components such as HBase and Kafka are also deployed in containers, with more than 2,000 physical nodes and more than 30,000 containers in total.

Zhang Fuxing summarized three key points in this evolution: the first is the change in technology selection from Mesos to Kubernetes; the second is the architectural adjustment from a single cluster to a multi-cluster hybrid cloud; the third is an optimization in usage, from rolling deployment to the separation of deployment and release.

Zhang Fuxing first shared the reasoning behind the move from Mesos to Kubernetes. Zhihu started running containers in production in 2015, when Kubernetes had just been released and was not yet mature, so Zhihu chose the Mesos technology solution.
The advantage of Mesos is stability: in its architecture the Master carries a relatively light load, so a single cluster can accommodate a large number of nodes — according to the official introduction, up to 50,000. Its disadvantage is that a separate scheduling framework must be developed, which raises the cost of use. The advantages of Kubernetes are its strong community, complete functionality, and low cost of use; however, since it stores all state in etcd, a single cluster cannot scale as far as Mesos — officially, up to 5,000 nodes after etcd was upgraded to v3.

The second change was from a single cluster to a hybrid-cloud, multi-cluster architecture. Why? The production environment requires it. First, we need a grayscale cluster: any parameter change for Mesos or Kubernetes must be verified on the grayscale cluster before being rolled out at scale in production. Second, multiple clusters let capacity grow beyond a single cluster's limit. Third, they provide tolerance for single-cluster failures. The hybrid-cloud part uses a large resource pool on the public cloud, which greatly enlarges the elastic resource pool and absorbs sudden expansion.

While implementing the hybrid-cloud architecture, we also investigated solutions such as Kubernetes Federation. However, Federation was not officially recommended for production at the time, and it had many problems in deployment and management, so we use a self-developed management solution. The general principle is that each of our business services creates a Deployment on multiple clusters at the same time.
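The multi-cluster principle just described — identical Deployment specs everywhere, with only the replica count scaled to each cluster's size — can be sketched as follows. This is illustrative only: the function names, the proportional-split policy, and all cluster names are assumptions, not Zhihu's actual code.

```python
# Sketch: spread one service's replicas across several Kubernetes clusters.
# The Deployment spec (image, CPU, memory) is identical in every cluster;
# only the replica count differs, scaled by each cluster's node count.

def split_replicas(total, cluster_sizes):
    """Split `total` replicas across clusters in proportion to their size.

    cluster_sizes: dict of cluster name -> node count.
    Remainders go to the largest clusters first, so the per-cluster
    counts always sum to `total`.
    """
    total_nodes = sum(cluster_sizes.values())
    counts = {name: total * size // total_nodes
              for name, size in cluster_sizes.items()}
    leftover = total - sum(counts.values())
    for name in sorted(cluster_sizes, key=cluster_sizes.get, reverse=True):
        if leftover == 0:
            break
        counts[name] += 1
        leftover -= 1
    return counts

def render_deployments(app, image, total_replicas, cluster_sizes):
    """One Deployment object per cluster: same spec, different replicas."""
    replicas = split_replicas(total_replicas, cluster_sizes)
    return {
        cluster: {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "metadata": {"name": app},
            "spec": {
                "replicas": replicas[cluster],
                "template": {"spec": {"containers": [
                    {"name": app, "image": image}]}},
            },
        }
        for cluster in cluster_sizes
    }

if __name__ == "__main__":
    deps = render_deployments("web", "registry.example.com/web:v42", 10,
                              {"idc-a": 300, "idc-b": 100})
    for cluster, d in deps.items():
        print(cluster, d["spec"]["replicas"])  # idc-a 8, idc-b 2
```

A fractional split like this also makes grayscale rollouts natural: the grayscale cluster simply receives a small fixed share of the replicas.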
The configuration of these Deployments — the container image version, the CPU and memory resources, and so on — is exactly the same; the only difference is the number of replicas, which is adjusted according to the size of each cluster.

Another change was the switch from rolling deployment to a mode that separates deployment from release. First, our definitions of deployment and release. Consider a typical service launch, which passes through three release stages: intranet, canary, and production. At each stage, verification metrics must be observed before deciding whether to continue the launch or roll back. With a Deployment's rolling update, some containers are upgraded at a time, the service is not interrupted during the upgrade, and the extra resources consumed at any instant are small; but a rolling update's progress cannot be paused at will, so it cannot be mapped onto multiple release stages. If, instead, an independent Kubernetes Deployment is used at each stage, deployment becomes slow; and because old containers are destroyed during the rollout, a rollback requires a full redeployment and is therefore also slow.

To solve this, we separated the deployment and release stages. When a service goes online, a group of new-version container instances is started in the background; once the number of instances meets the release conditions, some instances are released to receive external traffic. While external traffic is being verified, new container instances can continue to be deployed in parallel in the background, so users never perceive the deployment time and deployment completes in seconds. Meanwhile, during the release the old container instances are no longer visible to external traffic, but they are retained for a period of time.
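The deploy/release separation described above can be sketched as a small state machine. This is illustrative only, not Zhihu's implementation; the class and method names are hypothetical.

```python
# Sketch of deployment/release separation: "deploy" brings up new-version
# instances in the background; "release" flips traffic to them once enough
# are ready, keeping the old instances on standby (receiving no traffic)
# so that a rollback is just flipping traffic back.

class Service:
    def __init__(self, version, instances):
        self.live = (version, instances)   # version currently serving traffic
        self.standby = None                # deployed but not released

    def deploy(self, version, instances):
        """Start new-version instances in the background; traffic unchanged."""
        self.standby = (version, instances)

    def release(self, min_ready):
        """Point traffic at the new version once enough instances are up.
        The old instances are kept as standby for fast rollback."""
        if self.standby is None or self.standby[1] < min_ready:
            raise RuntimeError("not enough new instances to release")
        self.live, self.standby = self.standby, self.live

    def rollback(self):
        """Second-level rollback: swap traffic back to the old instances."""
        if self.standby is None:
            raise RuntimeError("nothing to roll back to")
        self.live, self.standby = self.standby, self.live

if __name__ == "__main__":
    svc = Service("v1", instances=10)
    svc.deploy("v2", instances=10)   # background deployment; users see v1
    svc.release(min_ready=10)        # traffic moves to v2; v1 kept as standby
    print(svc.live[0])               # v2
    svc.rollback()                   # problem found: instant switch back
    print(svc.live[0])               # v1
```

The key design point is that both deploy and rollback become pure traffic switches, which is why they complete in seconds rather than at container-creation speed.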
In this way, if a problem appears during the release, traffic can be switched back immediately — a rollback in seconds.

In terms of container usage patterns, Zhihu has gone from the initial stateless web applications to using persistent storage in containers, enabling containerized deployment of some stateful services. For example, based on HostPath, a DaemonSet starts a consul agent on each node; because a DaemonSet guarantees exactly one such Pod per node, there is no HostPath conflict. Based on local volumes, local disks can be scheduled and managed to deploy Kafka brokers, which will be described in detail later in the Kafka solution. In addition, a distributed file system can be mapped into containers via FUSE, currently used mainly in business data reading scenarios.

Another aspect is the container network. In the Kubernetes solution we chose an underlay IP network, in which container IPs and physical machine IPs are completely equal. This makes interconnection simple, and since there is no overlay encapsulation and decapsulation, performance is almost lossless. In addition, with real IPs the source of a network connection can be located easily, which simplifies fault diagnosis. In practice, we give each machine a fixed IP segment, and a CNI IPAM plug-in is responsible for allocating and releasing container IPs within it.

The pitfalls of maintaining the container platform

Zhang Fuxing has extensive experience building and operating container platforms in production, and has also stepped on many technical "pitfalls".
At this WOT summit, he shared some of the more typical failures and traps he has encountered, and the audience benefited greatly from this hard-won experience.

Trap 1: K8S events

Late one dark and windy night, our Kubernetes etcd suddenly became inaccessible. The investigation traced the cause to a performance problem in how Kubernetes stores events by default. Kubernetes records every event change in the cluster in etcd, with a default expiration policy so that events are recycled after a period of time to free etcd's storage. This sounds reasonable, but etcd's TTL implementation performs a very inefficient traversal on each expiration check. As the cluster grew and deployments changed frequently, events accumulated, etcd's load climbed higher and higher, and eventually etcd crashed — taking the entire Kubernetes cluster down with it. Kubernetes is aware of this problem and provides a configuration to record events in a separate etcd cluster. In addition, the entire event etcd can be cleared during off-peak hours, which amounts to implementing the expiration cleanup strategy ourselves.

Trap 2: K8S eviction

The second pitfall is Kubernetes eviction, which once deleted all the Pods in the production environment. The API server was configured for high availability, but even then there is a small probability of total failure. If it is not handled in time — for example, if the API server stays down for more than five minutes — the Nodes lose their connection to it, which triggers the control plane to kill all the Pods and evict them from the cluster. Since version 1.5, an official configuration, --unhealthy-zone-threshold, addresses this.
For example, when more than 30% of the Nodes are in a disconnected state, the Controller Manager's eviction policy is suppressed, preventing the accidental eviction of the whole cluster's containers during a large-scale exception.

Another issue is container port leakage. In port-mapping mode, starting a container would often fail with an error saying the port was already in use. By analyzing the Docker daemon's code, we found that allocating a port and recording it in the daemon's internal persistent store are not atomic: if the daemon restarts in between, it restores only the recorded ports and forgets the ones it had already allocated. After finding the cause, we proposed a corresponding fix upstream.

Pitfall 3: Docker NAT

Another pitfall was encountered with Docker's NAT network. In the system configuration shown on the slide, the handling of out-of-order network packets is too strict. When a process in a container accesses a public-network service, such as an image service on Alibaba Cloud, over a poor network, out-of-order packets that fall outside the TCP sliding window cause this configuration to reset the connection. Turning the configuration off solves the problem.

An attempt to integrate container platforms and big data components

In the big data scenario there are two main processing paths. As shown in the figure, the left is real-time processing and the right is batch processing. Because their starting points differ, these components are designed very differently. Batch processing mainly serves ETL tasks and is therefore mainly used to build data warehouses; it pursues throughput and is not sensitive to latency. Real-time processing, however, is latency-sensitive.
The failure of any node in the real-time path causes delays in data landing and makes data displays unavailable, so the machines running real-time components cannot be allowed to run under particularly high load. We often hit problems like this while maintaining the big data production environment: a business makes some change, the traffic it writes to Kafka suddenly grows sharply, the load on the whole Kafka cluster rises, and if it rises too high the entire cluster may collapse and the production environment is affected.

How should this be governed? The basic idea is to divide clusters by business party and isolate them. But dividing the clusters brought another problem: we went from one cluster to dozens. How can the configuration management and deployment of so many clusters be unified without enormous cost? We used Kubernetes templates, because they make it easy to build multiple identical operating environments with one click. Another problem is that business parties differ greatly in usage: some are heavy users, some very light. A party that writes only a few dozen KB still needs a highly available cluster, and dedicating three machines to it wastes a great deal of resources. Our solution is to use containers to achieve finer-grained resource allocation and configuration.

In managing the big data platform, we practice the idea of DevOps: we are the platform party, not the operations party, and we position ourselves as the developers of the tools. Daily operations — cluster creation, restart, and capacity expansion — are all completed autonomously by the business parties through the PaaS platform.
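The "one template, many isolated clusters" idea above can be sketched as follows. All names and tier values here are hypothetical, used only to illustrate how identically-shaped per-business Kafka clusters can differ solely in broker count and per-broker container resources; this is not Zhihu's actual configuration.

```python
# Sketch: every business party gets an identically-shaped Kafka cluster
# spec from one template. Light users still get 3 brokers for high
# availability, but on small containers instead of whole machines, so
# fine-grained container resources avoid wasting dedicated hardware.

TIERS = {
    # tier: (brokers, CPU cores per broker, memory GiB per broker)
    "small":  (3, 2, 8),
    "medium": (3, 8, 32),
    "large":  (6, 16, 64),
}

def kafka_cluster_spec(business, tier):
    """Render one business party's cluster from the shared template."""
    brokers, cpu, mem_gib = TIERS[tier]
    return {
        "name": f"kafka-{business}",
        "replicas": brokers,
        "container": {"cpu": cpu, "memory_gib": mem_gib},
    }

if __name__ == "__main__":
    # One click: the same template, rendered per business party.
    for business, tier in [("search", "large"), ("comment", "small")]:
        print(kafka_cluster_spec(business, tier))
```

Because every cluster comes from the same template, configuration drift across the dozens of per-business clusters is eliminated, and a new cluster is just another render.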
The benefits of this are, first, lower communication costs — as the company grows, communication between business parties becomes particularly complicated — and second, convenience for the business party: a team that finds it needs more capacity can expand directly on the platform by itself. This reduces our daily workload and lets us focus on the technology itself and on making the platform and its underlying technology better.

The big data platform also provides a wealth of monitoring metrics, another essential part of our DevOps practice. Business teams usually have a relatively shallow understanding of systems such as Kafka or HBase, so how do we convey our accumulated understanding of and experience with these clusters to them? We expose these metrics fully to the business side, hoping to turn the Kafka cluster from a black box into a white box: when a failure occurs, the business side can see the anomalies directly on the metrics system and then cooperate with us — for example, we customize alarm thresholds for them so they can react in time.

In summary, our ideas on integrating containers and big data are: first, isolate clusters by business party; second, use Kubernetes to manage and deploy the many clusters; third, use containers to isolate resources and perform fine-grained allocation and monitoring. In management and operation we practice the concept of DevOps, letting the platform side focus on tool development itself rather than being mired in trivial operations. As for the future, on the one hand we hope to move more basic components onto Kubernetes.
On the other hand, we will provide a PaaS platform on top of these components to give business parties better, more convenient, and more stable services. The ultimate goal is to hand over all of our data center's resources to Kubernetes for unified scheduling and management — realizing a DCOS. The speeches at this WOT Summit are compiled and edited by 51CTO. To learn more, please visit WWW.51CTO.COM. [51CTO original article; when reprinting on partner sites, please indicate the original author and source as 51CTO.com]