Another batch of long transactions, who is to blame for the P0 failure?

In recent weeks there have been many service failures caused by transaction problems. The symptom is always the same: the database connection pool fills up, requests wait a long time for a connection, the request threads eventually hang, and the service throws errors in bulk. At that point plenty of service and database resources sit idle, yet nothing can move forward, and the impact is ugly.

Who takes the blame? The architect, of course: every service stayed alive this time, so there was nothing for operations to answer for.

In interviews you may be asked about transactions, with distributed transactions as the advanced version. In the Internet industry a casual compensating transaction is often enough to get by, because the interfaces involved are short and fast.

But in many enterprise applications, this doesn’t work. Let’s face it.

Why use long transactions?

Many back-office systems with very complex business logic hit the database frequently. To keep the data consistent and to be able to roll back when something goes wrong, they rely on transactions.

Take the simplest stand-alone database transaction as an example.

If a transaction runs for too long, its DB connection is not released until the transaction ends. A transaction that occupies a DB connection for a long time is called a long transaction. Once a flood of external requests hits such an operation concurrently, a large number of DB connections are held without being released, until the connection pool is full.
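To make the anti-pattern concrete, here is a minimal Spring sketch; the OrderService, Order, OrderRepository, and InventoryClient names are hypothetical, stubbed only for illustration. The point is that the @Transactional method keeps a pooled connection for its entire duration, including the slow remote call in the middle.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OrderService {

    private final OrderRepository orderRepository;   // hypothetical data-access bean
    private final InventoryClient inventoryClient;   // hypothetical RPC/HTTP client

    public OrderService(OrderRepository orderRepository, InventoryClient inventoryClient) {
        this.orderRepository = orderRepository;
        this.inventoryClient = inventoryClient;
    }

    // Anti-pattern: the DB connection is held for the whole method,
    // including the remote call, which may take seconds -> long transaction.
    @Transactional
    public void placeOrder(Order order) {
        orderRepository.save(order);          // fast DB write
        inventoryClient.reserve(order);       // slow, unpredictable remote call
        orderRepository.markReserved(order);  // the connection was held the whole time
    }
}

// Hypothetical collaborators, stubbed so the sketch is self-contained.
class Order { }
interface OrderRepository { void save(Order order); void markReserved(Order order); }
interface InventoryClient { void reserve(Order order); }
```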

At this time, if other requests come in, they will most likely fail.

In other words, a handful of long-transaction operations occupy all the connection resources, and even the simplest query interface can no longer run normally.

A few bad apples spoil the whole barrel.

Some puzzling symptoms

When you troubleshoot this kind of problem, you can easily get stuck: jstack shows that most requests are blocked in the Tomcat thread pool, including requests that are normally very fast.

For example, out of Tomcat's 200 threads, 180 are blocked on the /status interface, which normally takes less than 1 ms.

Many people get confused at this point; experience fails them.

The jstack output is misleading here. The real cause of the blockage is the other 20 threads, the ones stuck in long transactions holding database connections.

What improvements have been made?

Keeping transactions short is a basic requirement, including but not limited to:

Control how often slow queries are called and minimize them. In many cases this rule is easier said than done and requires the business to make compromises.

A transaction should not contain RPC calls; keep the transaction's scope small. RPC calls, and calls to other non-transactional resources in general, have unpredictable latency, and wrapping them in a transaction inevitably increases resource usage. Likewise, keep anything prone to timeouts or long blocking, such as HTTP calls and heavy IO, out of the transaction (see the sketch after this list).

Secondary components such as message queues should not sit inside a transaction, otherwise an unavailable queue takes the service down with it. Setting a reasonable timeout on such components is essential, or the transaction will wait forever; but even with timeouts, try to keep them out of transactional code.

Operations on different databases, or on different storage types such as Redis, should not be mixed into one transaction, to avoid them dragging each other down.
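One way to apply these rules, sketched with the same hypothetical names as the earlier example plus a hypothetical MessageProducer: only the DB writes stay inside the transaction boundary, and the remote call and message publish happen after it commits. Whether this split is acceptable is a business decision; if the later steps fail, a retry or compensating action is needed.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class ShortOrderService {

    private final OrderRepository orderRepository;
    private final InventoryClient inventoryClient;
    private final MessageProducer messageProducer;   // hypothetical MQ publisher

    public ShortOrderService(OrderRepository orderRepository,
                             InventoryClient inventoryClient,
                             MessageProducer messageProducer) {
        this.orderRepository = orderRepository;
        this.inventoryClient = inventoryClient;
        this.messageProducer = messageProducer;
    }

    // Only DB work inside the transaction: the connection is returned quickly.
    @Transactional
    public void saveOrder(Order order) {
        orderRepository.save(order);
    }

    // Called by the caller (e.g. the web layer) only after saveOrder() has committed.
    public void afterCommit(Order order) {
        inventoryClient.reserve(order);   // remote call outside the transaction
        messageProducer.send(order);      // MQ publish outside the transaction
        // Failures here must be handled with retries or a compensating transaction.
    }
}

// Hypothetical collaborator, stubbed for illustration.
interface MessageProducer { void send(Order order); }
```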

Some of the advice above clearly runs against the data consistency we are chasing. That is no surprise; it is still a trade-off in the spirit of the CAP theorem. Some businesses would rather block and stop responding than write bad data; others would rather keep the business running and clean up dirty data later through compensating transactions.

It all depends on your choice.

Someone always takes the blame for the design, and someone always makes sacrifices for compensation.

Solution

So how can we quickly resolve the service unavailability caused by long transactions?

Once it happens, there is no quick fix other than adding capacity. Restarting the system may not help either, because the blocked requests will come back even harder.

You might think of enlarging the connection pool, but in practice that does little: the long-transaction requests will quickly fill it up again.

But we can take precautions in advance.

Taking Spring as an example, most transactions are controlled through the @Transactional annotation or other declarative configuration. I suggest preventing and detecting long transactions in the following ways:

1) Review the business code and check whether any of the situations above exist inside a transaction, then move everything except the DB operations out of the transaction.

2) Treat every transactional operation with respect. For transactions whose execution path and running time are hard to predict, add timeouts and alarms so the cause can be identified promptly (a minimal sketch follows below).
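Spring's @Transactional has a timeout attribute (in seconds) that aborts the transaction if it runs too long. Below is a minimal sketch of point 2; the ReportService name and the alert hook are hypothetical stand-ins for a real job and a real alerting system.

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.util.StopWatch;

@Service
public class ReportService {

    // Fail fast: the transaction is limited to 5 seconds.
    @Transactional(timeout = 5)
    public void generateReport(long reportId) {
        StopWatch watch = new StopWatch();
        watch.start();
        try {
            // ... DB operations only ...
        } finally {
            watch.stop();
            if (watch.getTotalTimeMillis() > 3000) {
                // Hypothetical hook: push this to your metrics/alerting system.
                alertSlowTransaction("generateReport", watch.getTotalTimeMillis());
            }
        }
    }

    private void alertSlowTransaction(String name, long millis) {
        // Placeholder: log a warning or emit a metric here.
        System.err.printf("slow transaction %s took %d ms%n", name, millis);
    }
}
```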

At the same time, it is also necessary to strengthen monitoring and assist in troubleshooting.

1) The service can periodically print its database connection pool statistics and use the logs for a first round of troubleshooting (see the sketch after this list).

2) Use jstack to inspect the thread stacks and find where they are blocked.

3) Check the downstream services and contact their owners to pin down the root cause.
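For point 1, here is a minimal scheduled-logging sketch, assuming the Druid pool is in use (DruidDataSource exposes active, pooled, and waiting counters; swap the getters if you use a different pool, and note that @Scheduled requires @EnableScheduling somewhere in the configuration).

```java
import com.alibaba.druid.pool.DruidDataSource;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class PoolMonitor {

    private final DruidDataSource dataSource;

    public PoolMonitor(DruidDataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Print pool usage every 10 seconds; a sustained high active count plus a
    // growing wait-thread count is the long-transaction signature described above.
    @Scheduled(fixedRate = 10_000)
    public void logPoolStats() {
        System.out.printf("db pool: active=%d, pooling=%d, waiting=%d, maxActive=%d%n",
                dataSource.getActiveCount(),
                dataSource.getPoolingCount(),
                dataSource.getWaitThreadCount(),
                dataSource.getMaxActive());
    }
}
```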

xjjdog prefers to rely on monitoring to find such problems quickly. As shown in the figure, connection pool monitoring reveals that the number of connections in use stays high for a long time without being released, while the number of waiting threads shoots up. When this pattern appears, consider whether one of the causes above is responsible.

When a problem occurs, run jstack promptly (and several times) to locate where the threads are blocked, then check whether a downstream service is at fault or whether a slow query is involved.
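jstack itself is run from the command line (jstack <pid>, ideally a few times several seconds apart). If you want an equivalent dump from inside the JVM, for example triggered automatically when the pool alarm fires, the standard ThreadMXBean API can produce one. A minimal sketch:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDumper {

    // Dump all threads with lock information, similar in spirit to `jstack -l`.
    public static String dump() {
        ThreadMXBean threadMXBean = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threadMXBean.dumpAllThreads(true, true)) {
            // ThreadInfo.toString() includes state, lock owner, and stack frames,
            // but note it truncates very deep stacks.
            sb.append(info.toString());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```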

In the best case the service code has already been cleaned up, so the only remaining suspect is a slow query. For slow queries, the Druid connection pool provides SQL aggregation, which shows the execution statistics of each kind of statement. As shown in the figure, SQL requests surged in a short period, the maximum execution time increased, and the connection pool filled up:

It is clear at a glance which SQL statement caused the problem.
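That SQL aggregation comes from Druid's stat filter. One hedged way to enable it programmatically is sketched below; the connection details are placeholders, and if you configure Druid through Spring Boot the same setting is usually exposed as a configuration property instead of code.

```java
import com.alibaba.druid.pool.DruidDataSource;

public class DataSourceConfig {

    public DruidDataSource buildDataSource() throws Exception {
        DruidDataSource dataSource = new DruidDataSource();
        dataSource.setUrl("jdbc:mysql://localhost:3306/demo");  // placeholder connection info
        dataSource.setUsername("demo");
        dataSource.setPassword("demo");
        dataSource.setMaxActive(20);

        // Enable the stat filter so per-SQL execution counts and max execution time
        // are aggregated and viewable in Druid's built-in console (StatViewServlet).
        dataSource.setFilters("stat");
        return dataSource;
    }
}
```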

End

Long transaction problems are high-risk and usually have serious consequences, but they can be prevented through observation and monitoring.

The best solution is, of course, to improve the business model, but this involves development costs and cross-departmental collaboration.

The boss who pays your salary has no patience for your pipe dreams.

In some companies both of these are maddening, and it can be easier to simply take the blame.
