Tongcheng-Elong's Wang Xiaobo: Manage your cache this way to handle high-concurrency scenarios with ease!

[51CTO.com original article] The Global Software and Operation Technology Summit, hosted by 51CTO, was held in Beijing on May 18-19, 2018. The summit focused on 12 core topics, including artificial intelligence, big data, the Internet of Things, and blockchain, and brought together 60 front-line experts from China and abroad. It was a high-end technology event and a platform for IT professionals to learn and expand their networks.

At the "High Concurrency and Real-time Processing" session on the afternoon of the 19th, Wang Xiaobo, CTO of the Tongcheng-Elong Air Ticket Business Group, delivered a keynote speech titled "Cache Governance in High Concurrency Scenarios". He covered how to make cache better suited to high concurrency, how to use cache correctly, and how to resolve cache problems through governance. After the session, 51CTO reporters compiled the content of Wang Xiaobo's speech at the WOT2018 Global Software and Operation Technology Summit.

Wang Xiaobo said in his speech that in high-concurrency scenarios, many people regard cache as a life-extending panacea. Wherever there is high-concurrency pressure, a cache is thrown at the problem. Yet sometimes, even with a cache in place, the system still stalls and crashes. Is the cache technology itself at fault? No. The real cause is that the cache is not being managed well.

Take a look at the pitfalls that Tongcheng has encountered

Wang Xiaobo gave a relatively systematic introduction to the “pitfalls” that Tongcheng had encountered.

To relieve the pressure of high concurrency, Tongcheng initially chose Memcached (a distributed caching system) and later switched to Redis (a data structure server that can be used as a database cache), deploying nearly 200 servers. But the situation did not improve. The system crashed often, the scripts called from applications were messy, resources were unevenly distributed across multi-instance deployments, and the data was fragile and prone to loss.

To manage these servers, Tongcheng adopted a master-slave + keepalived mode (keepalived is high-availability software that performs health checks and failover) and chose to upgrade gradually from standalone Redis to clustered Redis. They soon found that once clusters were deployed in large numbers, the operations team had no practical way to maintain them. Although the clusters could be operated uniformly through scripts, they remained uncontrollable, and many operations procedures were liable to bring down the high-concurrency system, which directly affected the business as a whole. "The system could crash at any time, and the operations team was about to collapse," Wang Xiaobo recalled.

What is the crux of all these problems? Wang Xiaobo concluded that the biggest one lies in how technical staff use the cache. People often forget its shortcomings and think only of its one advantage: speed. He gave an example. In a system failure summary report, a technician wrote that he had not expected Redis, which had only 30,000 lines of code in its initial version, to deliver such magical capabilities. This mindset leaves programmers holding a hammer and seeing nails everywhere. In other words, they want to solve every requirement they meet with a cache. Misled in this way, people began to build cache-based log collectors, cache-based countdowns, cache-based counters, and cache-based order systems. Once these functions existed, people were captivated by their speed and neglected the question of how to keep them running reliably.

What are the real faults of cache? Wang Xiaobo summarized them in four points: first, over-reliance, which is the most prominent, since technicians sometimes insist on using a cache where none is needed; second, persisting cached data to disk; third, excessive capacity; and fourth, cache avalanche. Why do these faults occur? Wang Xiaobo believes the biggest cause is abuse, misuse, and laziness on the part of users. In addition, thousands of cache servers are operated without any usage rules, operations staff do not understand development, and developers do not understand operations. As a result, cache is used without design or control, and too many server resources are wasted. These are all common phenomena.
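As a rough illustration of the cache avalanche mentioned above: if many keys are written at the same moment with the same time-to-live (TTL), they also expire at the same moment, and every request falls through to the backend at once. One widely used mitigation, which is not attributed to Wang Xiaobo's talk, is to add random jitter to each TTL. The Python sketch below makes that visible; `jittered_ttl`, `peak_expirations_per_second`, the 300-second base TTL, and the 20% spread are hypothetical names and values chosen only for illustration.

```python
import random
from typing import Optional


def jittered_ttl(base_ttl_s: float, spread: float = 0.2,
                 rng: Optional[random.Random] = None) -> float:
    """Return base_ttl_s shifted randomly by up to +/- `spread` (a fraction)."""
    source = rng if rng is not None else random
    return base_ttl_s * (1.0 + source.uniform(-spread, spread))


def peak_expirations_per_second(ttls) -> int:
    """Count how many keys expire within the busiest one-second bucket."""
    buckets = {}
    for ttl in ttls:
        second = int(ttl)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


# Simulate 1,000 keys that are all written at the same instant.
rng = random.Random(42)  # seeded only to make the sketch repeatable
fixed = [300.0 for _ in range(1000)]
jittered = [jittered_ttl(300.0, 0.2, rng) for _ in range(1000)]

# With a fixed TTL every key expires in the same second; with jitter the
# expirations are spread across roughly a two-minute window.
print("fixed TTL peak:   ", peak_expirations_per_second(fixed))
print("jittered TTL peak:", peak_expirations_per_second(jittered))
```

With a real cache, the jittered value would simply be passed as the expiry when the key is written. Jitter only addresses simultaneous expiry; it does not help when the cache servers themselves fail.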

What kind of cache do people need, and what kind of cache governance? Wang Xiaobo believes that, from a developer's point of view, what people want is a "magic box" that can meet all kinds of high-concurrency requirements. Put simply, developers should not have to care whether the cache is large or small, good or bad. Because developers have limited knowledge of cache technology, the greatest danger is that they use it indiscriminately. It is worth noting that many developers overlook the fact that much of the data in a cache is not hot all the time, and they do not make adequate estimates before high concurrency arrives. As a result, bottlenecks are discovered too late, when the application is already in use.

Tongcheng's "Phoenix Nirvana"

To get real value from cache and cope with high concurrency, the Tongcheng technical team eventually developed the Phoenix solution. In the initial design, they wanted the architecture to offer a simple SDK on the application side for developers to use. A developer only has to declare the project and the relevant data scenario to receive a key. With this key, the SDK assigns the developer a new cache store on which Redis can run, and the whole scheduling platform can call it very quickly. Phoenix can also monitor comprehensively, starting from the client call. More importantly, it can prevent cache collapse and scale capacity up and down dynamically.
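The declare-then-receive-a-key flow described above can be sketched roughly as follows. This is not Phoenix's actual SDK, whose interface the article does not describe; `CacheRegistry`, `CacheStore`, `declare`, and `store_for` are hypothetical names, the key derivation is an arbitrary choice, and an in-memory dictionary stands in for the Redis-backed store.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class CacheStore:
    """In-memory stand-in for a dedicated cache store (hypothetical)."""
    project: str
    scenario: str
    data: dict = field(default_factory=dict)

    def set(self, key, value):
        self.data[key] = value

    def get(self, key, default=None):
        return self.data.get(key, default)


class CacheRegistry:
    """Hands out one store per declared (project, scenario) pair."""

    def __init__(self):
        self._stores = {}

    def declare(self, project: str, scenario: str) -> str:
        # Derive a stable key so repeated declarations map to the same store.
        digest = hashlib.sha256(f"{project}:{scenario}".encode("utf-8"))
        access_key = digest.hexdigest()[:16]
        self._stores.setdefault(access_key, CacheStore(project, scenario))
        return access_key

    def store_for(self, access_key: str) -> CacheStore:
        # Undeclared usage is rejected instead of silently creating a store.
        if access_key not in self._stores:
            raise PermissionError("declare the project and scenario first")
        return self._stores[access_key]


registry = CacheRegistry()
orders_key = registry.declare("air-ticket", "order-lookup")
prices_key = registry.declare("air-ticket", "price-snapshot")

registry.store_for(orders_key).set("order:1", "paid")
print(registry.store_for(orders_key).get("order:1"))
print(registry.store_for(prices_key).get("order:1"))
```

Forcing every caller through a declaration step is what gives a platform a place to attach the monitoring and capacity planning that the article says were missing earlier.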

Later, a proxy layer was added to the Phoenix solution. The time cost of developing clients in multiple languages was too high, and upgrading the client inside applications was a major problem. Wang Xiaobo noted that upgrading middleware embedded in applications is almost always a huge headache: once it is upgraded, the system must be retested and can easily crash again. It is therefore better to exert control through local cache, and disk can serve as cache for infrequently used data.
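One way to read the last point is as a two-tier cache: a small in-memory tier for recently used entries, with less frequently used entries demoted to a slower tier instead of being dropped. The sketch below is a minimal Python illustration under that assumption, not a description of how Phoenix works. `TieredCache` and its methods are hypothetical, the least-recently-used policy is an assumed choice, and a plain dictionary stands in for the disk-backed tier.

```python
from collections import OrderedDict


class TieredCache:
    """Small LRU memory tier backed by a larger, slower cold tier."""

    def __init__(self, hot_capacity: int):
        self.hot_capacity = hot_capacity
        self.hot = OrderedDict()  # most recently used entries sit at the end
        self.cold = {}            # stand-in for a disk-backed tier

    def set(self, key, value):
        self.cold.pop(key, None)  # keep each key in exactly one tier
        self.hot[key] = value
        self.hot.move_to_end(key)
        self._demote_overflow()

    def get(self, key, default=None):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        if key in self.cold:
            value = self.cold.pop(key)
            self.set(key, value)  # promote back into the memory tier
            return value
        return default

    def _demote_overflow(self):
        # Move least recently used entries to the cold tier, not to oblivion.
        while len(self.hot) > self.hot_capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value


cache = TieredCache(hot_capacity=2)
for name in ("a", "b", "c"):
    cache.set(name, name.upper())

print(sorted(cache.hot), sorted(cache.cold))
print(cache.get("a"))
print(sorted(cache.hot), sorted(cache.cold))
```

In a real deployment the cold tier would be an on-disk store, and promotion and demotion would carry I/O costs that this sketch ignores.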

Finally, containers were added to the Phoenix solution. With container deployment in place, Tongcheng's overall monitoring, data migration, and scaling became more flexible and easier to operate, thanks to multiple small clusters plus single nodes, cluster division by scenario, and real-time balanced scheduling of data. Taking data migration as an example, Tongcheng has built a complete migration system that covers traffic expansion, data expansion, and both vertical and horizontal scaling, all of which are handled in a largely automated way.

The above content is compiled by 51CTO reporter based on the interview with Wang Xiaobo, CTO of Tongcheng-Elong Air Ticket Business Group, at the WOT2018 Global Software and Operation Technology Summit. For more information about WOT, please visit .com.

[51CTO original article, please indicate the original author and source as 51CTO.com when reprinting on partner sites]
