Use wireshark to analyze tcpdump network data packets to quickly troubleshoot and solve the frequent abnormal TCP connection disconnection problem of OceanBase in a certain environment

Use wireshark to analyze tcpdump network data packets to quickly troubleshoot and solve the frequent abnormal TCP connection disconnection problem of OceanBase in a certain environment

1. Problem phenomenon

A customer reported that in one of its acceptance environments, when business colleagues were conducting ordinary business tests (non-performance stress testing scenarios), a microservice frequently encountered exceptions when performing database operations. The microservice log indicated that the underlying cause was:

 com.alipay.oceanbase.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure\n\nThe last packet successfully received from the server was 9 milliseconds ago. The last packet sent successfully to the server was 2 milliseconds ago. at sun.reflect.GeneratedConstructorAccessor143.newInstance(Unknown Source) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:423) at com.alipay.oceanbase.jdbc.Util.handleNewInstance(Util.java:439) at com.alipay.oceanbase.jdbc.SQLError.createCommunicationsException(SQLError.java:1236) at com.alipay.oceanbase.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:4394) at com.alipay.oceanbase.jdbc.MysqlIO.reuseAndReadPacket(MysqlIO.java:4252) at com.alipay.oceanbase.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:4789) at com.alipay.oceanbase.jdbc.MysqlIO.sendCommand(MysqlIO.java:2993) at com.alipay.oceanbase.jdbc.ServerPreparedStatement.serverExecute(ServerPreparedStatement.java:1396) at com.alipay.oceanbase.jdbc.ServerPreparedStatement.executeInternal(ServerPreparedStatement.java:831) at com.alipay.oceanbase.jdbc.PreparedStatement.executeQuery(PreparedStatement.java:2042) at com.zaxxer.hikari.pool.ProxyPreparedStatement.executeQuery(ProxyPreparedStatement.java:52) at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeQuery(HikariProxyPreparedStatement.java) at org.springframework.jdbc.core.JdbcTemplate$1.doInPreparedStatement(JdbcTemplate.java:696) at org.springframework.jdbc.core.JdbcTemplate.execute(JdbcTemplate.java:638) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:688) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:720) at org.springframework.jdbc.core.JdbcTemplate.query(JdbcTemplate.java:730) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:798) at org.springframework.jdbc.core.JdbcTemplate.queryForObject(JdbcTemplate.java:817)

2. Problem Background

The OceanBase version of the database server in this environment is observer 4.2.1.x, while the OceanBase JDBC driver version used by the microservice is 1.1.7. We learned from OB officials that OceanBase JDBC drivers of versions 1.1.7 and earlier are prone to abnormal disconnection. The solution given by OB officials is to upgrade the JDBC driver, and it is recommended to use at least version 2.4.x or above. However, customers have reported that in other environments, including production environments, some microservices use the same version of the OceanBase JDBC driver, and the database server also uses the same version of OceanBase, but basically no such problem occurs, or the frequency of such problem is very low. We still hope that we can check whether there are other reasons, such as whether it is related to the specific configuration of the database or operating system, or even the network conditions. We hope that we can alleviate this problem by adjusting parameters without upgrading the driver in a short period of time.

3. Preliminary analysis of the problem

  • The exception com.alipay.oceanbase.jdbc.exceptions.jdbc4.CommunicationsException:Communications link failure actually refers to the abnormal disconnection of the TCP connection. However, judging from the above error log, the microservice last successfully received a data packet from obproxy 15 milliseconds ago, and the microservice last successfully sent a data packet to obproxy 1 millisecond ago. The time is very short, and our business traffic is not large (the scenario is normal business testing, not performance stress testing). It is preliminarily speculated that it may not be a network packet loss problem, but a configuration problem of obproxy or the server.
  • To rule out network problems, we conducted a preliminary test of the network status of the environment using ping. The result data is as follows. It can be seen that the network status of the intranet environment is good, with low latency and no packet loss:
 ping -c 10 -i 1 172.253.34.74 PING 172.253.34.74 (172.253.34.74) 56(84) bytes of data. 64 bytes from 172.253.34.74: icmp_seq=1 ttl=61 time=0.727 ms 64 bytes from 172.253.34.74: icmp_seq=2 ttl=61 time=0.231 ms 64 bytes from 172.253.34.74: icmp_seq=3 ttl=61 time=0.278 ms 64 bytes from 172.253.34.74: icmp_seq=4 ttl=61 time=0.227 ms 64 bytes from 172.253.34.74: icmp_seq=5 ttl=61 time=0.219 ms 64 bytes from 172.253.34.74: icmp_seq=6 ttl=61 time=0.227 ms 64 bytes from 172.253.34.74: icmp_seq=7 ttl=61 time=0.384 ms 64 bytes from 172.253.34.74: icmp_seq=8 ttl=61 time=0.235 ms 64 bytes from 172.253.34.74: icmp_seq=9 ttl=61 time=0.232 ms 64 bytes from 172.253.34.74: icmp_seq=10 ttl=61 time=0.331 ms --- 172.253.34.74 ping statistics --- 10 packets transmitted, 10 received, 0% packet loss, time 9000ms rtt min/avg/max/mdev = 0.219/0.309/0.727/0.148 ms。

4. Cause of the problem

  • The reasons for abnormal disconnection of TCP connection are complicated. In addition to being related to the client jdbc driver version and the server obproxy version, it is also related to the specific configuration of the database, the specific configuration of the client and server operating systems, and even the network conditions.
  • Since there are too many related configuration parameters, in order to quickly troubleshoot and confirm the problem, we plan to analyze the network data packets first.
  • After coordination, we first used "tcpdump -i any port 2883 -w 2883.pcap" on the microservice side to capture network packets for about 5 minutes. Through the microservice log, we can confirm that the above problem occurred within 5 minutes, namely com.alipay.oceanbase.jdbc.exceptions.jdbc4.CommunicationsException:Communications link failure.
  • After opening the pcap network packets captured by the above command with wireshark, it is found that some TCP connections are indeed closed abnormally through RST, as shown in the following figure. Note that port 2883 here is the listening port of obproxy on the server side:

picture

  • To further confirm the problem, we coordinated to capture packets on the obproxy server. After opening the captured pcap network packets with wireshark, we found that the phenomenon was consistent with that on the client, that is, some TCP connections were indeed closed abnormally through RST, as shown in the following figure. Note that port 2883 here is the listening port of obproxy on the server:

picture

Let’s sort out the results according to the client packet capture:

  • When the TCP connection is successfully created and used normally, the microservice sends the SQL request to the server obproxy through the ob jdbc driver;
  • After about 0.001448 seconds, the server obproxy (port 2883) actively sent a [FIN, ACK] signal to close the TCP connection. This is the first abnormality, because under normal circumstances, obproxy will not actively close the TCP connection.
  • After about 0.03107 seconds, the microservice received the [FIN, ACK] signal from obproxy to close the connection, and replied with a [PUSH, ACK] signal to confirm that it had received the signal to close the connection;
  • After about 0.000033 seconds, the microservice [FIN, ACK] closes the TCP connection. This is the second abnormality, because normally, the microservice will not actively close the TCP connection when waiting for the SQL execution result from the server.
  • After about 0.000172 seconds and 0.000043 seconds, the server obproxy (port 2883) actively sent two [RST] signals to reset the TCP connection; (Combined with the server packet capture, it can be seen that these two [RST] are actually responses to the [PUSH, ACK] and [FIN, ACK] packets sent by the client).

The direct cause of this problem has been confirmed. It is not caused by network timeout or network packet loss, but obproxy actively sends [FIN, ACK] to disconnect for some reason!

  • Further tracking of the TCP flow revealed that obproxy actually responded to the client with an exception message before this: "HY000ORA-01000: maximum open cursors exceeded", as shown in the following figure:

picture

  • Further checking with the command show parameters like '%cursor%' revealed that the open_cursors parameter in this environment was configured to 50, while in other environments without such errors, the parameter was configured to 500 or even 1500!
  • Therefore, the root cause of this problem is that the open_cursors parameter configuration value in this environment is too low. When the number of cursors actually opened at the bottom of a session is greater than the parameter value, the server-side observer sends [FIN] to actively close the TCP connection for self-protection reasons! This also explains why the client will find related logs of frequently creating new JDBC connections.
  • The open_cursors parameter description is as follows: "specifies the maximum number of open cursors a session can have at once. can use this parameter to prevent a session from opening an excessive number of cursors. Range: [0, 65535] in integer."

picture

5. Problem Solving

In the acceptance environment, the command "alter system set open_cursors=1500;" was used to change the value of the parameter open_cursors from 50 to 1500, and the problem was successfully solved. In addition, after consulting the DBA, the relevant specifications are as follows:

  • Parameter open_cursors: [Effective at the tenant level], used to set the number of cursors opened by a single session. The default value is 50. If it is too small, ORA-01000 maxium open cursors exceeded may occur. It can be modified according to the situation. The calculation formula is as follows: tenant memory (64G)*_temporary_file_io_area_size(5%)/macroblock (2m) = 1600. The modification command is as follows: alter system set open_cursors=1500;
  • Parameter temporary_file_io_area_size: [Effective at the tenant level], the total size of SQL intermediate results (such as hash join) that can be used in the storage layer can be appropriately increased (tenant-level parameter, which needs to be adjusted through sys tenant). The modification command is as follows: alter system set "_temporary_file_io_area_size" = 5;

6. Follow up on issues

  • The exception log printing logic of the microservice needs to be optimized. The current log reveals that the underlying cause is "com.alipay.oceanbase.jdbc.exceptions.jdbc4.CommunicationsException:Communications link failure\n\n The last packet successfully received from the server was 9 milliseconds ago. The last packet sent successfully to the server was 2 milliseconds ago.", but does not reveal the underlying cause, namely "HY000ORA-01000: maximum open cursors exceeded". This underlying cause also needs to be reflected in the log to facilitate troubleshooting;
  • Subsequent microservices need to upgrade the OB JDBC driver version. It is recommended to upgrade to the official minimum version, such as OB JDBC driver 2.4.3;

7. Reference Links

  • https://open.oceanbase.com/blog/10900349
  • https://www.oceanbase.com/knowledge-base/oceanbase-database-1000000000763298
  • https://segmentfault.com/a/1190000042448986/en

<<:  NWCU's Smart New Campus 2.0, layout of "IT unified smart operation and maintenance" (Part 2): operation and maintenance organization management and process management​

>>: 

Recommend

Do you know the origin and function of Wi-Fi?

Since its introduction 25 years ago, Wi-Fi has pl...

Knowledge Popularization | 7 Deployment Solutions for 5G Private Networks

[[315546]] What is 5G private network? 5G private...

Fiber will play a key role in 5G development

CommScope recently said that in the future of bro...

ACI's "hardcore security" is more eye-catching

[51CTO.com original article] According to market ...

PacificRack has run away

The tribe has not shared any information about Pa...

Eight facts about data center design and construction

This article points out eight facts in data cente...