Racing against time: Why does Weimob's data recovery take so long?


Several days have passed since Weimob's "delete the database and run" incident. It is reported that Weimob's services have been restored: new users can start all related business activities normally, but data for existing users has not yet been fully recovered. According to its official website, merchant account and equity data have been restored, and roughly 70% of the data was expected to be recovered by the evening of February 28.

Whether you are a B-side user or an onlooker from the general public, you may wonder why the entire recovery cycle takes so long when technologies such as cloud computing, containerized deployment, elastic scaling, and data backup are already so mature. Today, I will share my understanding from a technical perspective.

Before getting into the technology, I would like to mention Luo Pang's New Year's Eve speech "Friends of Time" this year. His idea of "stepping into the game" deeply resonated with me, someone who has worked with IT technology for many years. When we stand outside the game, many things do not look complicated; but once you step in, you find that you had only seen the tip of the iceberg, and that many things are far more complicated and difficult than you thought.


To give a vivid example: people like to pick low-hanging fruit, because our brains tell us it is easy to reach. But a fruit that looks low may not really be low; quite possibly you are just too far away from it. As you move closer, you find it hangs higher than it first appeared, and when you finally stand beneath it, you find it is out of reach.

It is like a mountain: from far away it does not look high, and only when you reach its foot do you realize you could never climb it. Here is a photo of me at the mountaineering base camp on the north slope of Mount Everest, at an altitude of about 5,300 meters. Behind me is the legendary Everest itself, the world's highest peak at 8,848 meters. It may not look that high, but only because I am standing far enough away. In other words, when you think something is simple, it is usually not because it really is simple, but because you do not understand it.


Back to the Weimob incident, the same reasoning applies. Modern large-scale Internet products, whether to-C or to-B, look easy to use from the user's perspective, but the architectural complexity behind them is the part of the iceberg below the waterline, far beyond most people's imagination. I often say that "cognition limits your imagination." So I believe that at this moment, Weimob must be working flat out beneath that iceberg to get the data restored as soon as possible.

Okay, let's get to the technical topics. Clearly, Weimob's main problem is database recovery. Since no specific technical details have been released officially, all I could find online is a very high-level architecture diagram, with no detailed information about the system infrastructure, especially the database architecture. So I can only make some educated guesses from personal experience, with the aim of helping you appreciate the technical complexity involved.

First, let us look at the environments a database can run in. Simplifying a bit, there are three main types:

"Not on the cloud" : Establishing in one's own data center and completely managing hardware, software and data by oneself, this was the mainstream practice before the popularization of cloud platforms. In this model, all related database high availability, capacity expansion, and data backup must be managed and maintained by a very professional team (DBA team and operation and maintenance team), which places relatively high technical requirements on the enterprise.

"Full cloud" : completely built on the cloud environment. Note that the cloud here can be a public cloud or a private cloud. Cloud vendors will provide a full set of solutions to support features such as high availability, capacity expansion and data backup. It can be said that with the popularization of cloud computing and the rapid development of pan-database services (DBaaS), more and more emerging companies will choose this solution.

"Fake cloud computing" : This solution is the most bizarre, a bit like using a Louis Vuitton bag to carry vegetables, but it is not uncommon in the industry. It should be said that this is a product of the transitional stage. This method is to use the cloud solution as a virtual machine. This method is very similar to the above "not on the cloud". It does not take advantage of the cloud at all, but just moves the machines in the data center to the cloud. The disaster recovery and capacity expansion functions that the cloud solution can provide are castrated.

Of the three, "not on the cloud" and "fake cloud" pose greater risks to data than "fully on the cloud". In the first two cases, operations staff have far more opportunity to perform extreme operations such as "rm -rf /*" or "fdisk"; in the "fully on the cloud" case, such commands usually cannot be executed at the operating-system level at all, so the database's data cannot simply be wiped out by an rm -rf /.

If the deletion did not happen at the level of the operating system's data files (backups usually exist as files), then accidentally deleted data can be recovered far more efficiently by using the database's own mechanisms.

Likewise, when facing data mis-operations (for example, a bad batch update of a field in a table), "fully on the cloud" has a clear advantage over "not on the cloud" and "fake cloud". I have personal experience here. A past project used a self-built database, and a DBA mis-operation executed an UPDATE statement without a WHERE clause against the production database, instantly wiping out the bid records of every item up for auction. What followed was a painful full rollback and binlog replay, which in the end took more than four hours. Later, the same mis-operation happened on a cloud database, and the rollback and recovery took only a few minutes.
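To make that binlog-replay step concrete, here is a minimal sketch of MySQL point-in-time recovery, assuming the binlogs survived. The host, binlog file names, and cutoff time are hypothetical placeholders; the idea is that mysqlbinlog's --stop-datetime option stops decoding just before the bad statement, so it is never replayed into the restored instance.

```python
# A minimal sketch of point-in-time recovery via binlog replay, assuming a
# MySQL setup where the binlogs survived. Host, file names, and the cutoff
# time are hypothetical placeholders.
import subprocess

BINLOG_FILES = ["mysql-bin.000101", "mysql-bin.000102"]  # hypothetical names
STOP = "2020-02-23 18:55:00"  # a moment just before the bad UPDATE ran

# Decode binlog events strictly before STOP, so the erroneous statement
# itself is never replayed, and pipe the resulting SQL into the restored
# instance (credentials elided).
dump = subprocess.Popen(
    ["mysqlbinlog", f"--stop-datetime={STOP}", *BINLOG_FILES],
    stdout=subprocess.PIPE,
)
subprocess.run(["mysql", "-h", "recovery-host", "-u", "root"],
               stdin=dump.stdout, check=True)
dump.stdout.close()
```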

From Tencent Cloud's earlier response, we can roughly tell that the deleted Weimob data was not on Tencent Cloud. Combined with the current pace of recovery, we can be almost certain that Weimob most likely did not adopt a "fully on the cloud" architecture, or had only part of its data on the cloud, and it is quite possible that the more extreme "rm -rf /*" or "fdisk" scenario actually occurred. In that scenario, the master and slave database files, full backup files, incremental backup files, and binlogs are all lost together. The technical challenge then lies mainly in disk-level recovery of the kind traditional IT vendors specialize in, a skill cloud vendors no longer cultivate.

To recover all the data in this situation, the technical difficulty is, as you can imagine, very high. As far as I can roughly tell, at least the following technical hurdles must be cleared.

Obtain a full backup. The ideal case is an off-site cold backup or disaster-recovery copy, but since full backups are usually huge, transferring and verifying the files takes a long time. If no off-site full backup is available, then the far more time-consuming disk-recovery route must be taken, and that cannot guarantee 100% success; I will explain shortly why disk recovery takes so long. Another problem is that the full backup may be too "old", which adds further time cost to the subsequent recovery.
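As a rough illustration of the "transfer and verify" cost, here is a minimal sketch that streams a SHA-256 checksum over a copied backup file and compares it against the digest recorded when the backup was taken. The path and expected digest are hypothetical placeholders.

```python
# A minimal sketch of the "transfer and verify" step for a huge full backup:
# stream a SHA-256 over the copied file and compare it with the checksum
# recorded when the backup was taken. Path and digest are placeholders.
import hashlib

def sha256_of(path: str, chunk_size: int = 8 * 1024 * 1024) -> str:
    """Hash the file in 8 MB chunks so memory use stays flat even for TB files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

EXPECTED = "..."  # digest recorded at backup time (placeholder)
if sha256_of("/backups/full-20200222.tar.gz") != EXPECTED:
    raise SystemExit("full backup corrupt or incomplete -- do not restore from it")
```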

Obtain incremental backups. There is often not enough time to ship incremental backups off-site for disaster recovery, so they will most likely have to be recovered from disk as well, which again costs an enormous amount of time and cannot guarantee 100% recovery.

Obtain the binlog. The binlog is a binary log that records all changes to database table structures (CREATE, ALTER TABLE, etc.) and table data (INSERT, UPDATE, DELETE, etc.). It is usually stored on disk as an index file (suffix .index) plus log files (numbered files such as .000001). To record data changes accurately, the binlog is usually written in row format, so the files are not small and there are many of them.
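For a sense of what "obtaining the binlog" involves once files come back from disk recovery, here is a minimal sanity-check sketch, assuming a standard MySQL layout: the .index file is a plain-text list of binlog paths, and every healthy binlog file begins with the 4-byte magic header 0xfe 'b' 'i' 'n'. Paths are hypothetical.

```python
# A minimal sanity check over recovered binlog files: the .index file lists
# one binlog path per line, and each binlog must start with the 4-byte magic
# header. Paths below are hypothetical placeholders.
BINLOG_MAGIC = b"\xfebin"  # bytes 0xfe 'b' 'i' 'n'

def check_binlogs(index_path: str) -> None:
    with open(index_path) as index:
        for line in index:
            binlog = line.strip()
            if not binlog:
                continue
            with open(binlog, "rb") as f:
                header = f.read(4)
            status = "OK" if header == BINLOG_MAGIC else "BAD HEADER (truncated or overwritten?)"
            print(f"{binlog}: {status}")

check_binlogs("/var/lib/mysql/mysql-bin.index")
```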

With the above as inputs, the data can then be imported and recovered at the database level. This process also takes a long time, and it presumes that every one of the files above could be obtained in full. If the backup files themselves have data problems, the extra time cost is even greater. The canonical order is sketched below.
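Purely to illustrate the ordering this step implies, here is a sketch of the canonical roll-forward sequence: full backup first, then incrementals in order, then binlog replay up to just before the deletion. The step bodies are stubs standing in for real tool runs; all names and times are hypothetical.

```python
# Purely illustrative: the canonical recovery order, with stubs standing in
# for real tool invocations (e.g. a physical restore, incremental apply,
# then mysqlbinlog replay). All paths and the cutoff time are hypothetical.
from typing import List

def restore_full_backup(path: str) -> None:
    print(f"[1] restore full backup from {path} (the baseline, possibly days old)")

def apply_incrementals(paths: List[str]) -> None:
    for p in sorted(paths):  # increments must be applied strictly in order
        print(f"[2] apply incremental backup {p}")

def replay_binlogs(stop_datetime: str) -> None:
    print(f"[3] replay binlogs up to {stop_datetime}, excluding the deletion itself")

restore_full_backup("/backups/full-20200222")
apply_incrementals(["/backups/inc-0223-am", "/backups/inc-0223-noon"])
replay_binlogs("2020-02-23 18:55:00")
```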

Finally, a few words on disk file recovery. When we delete files from storage media such as disks, or even format them (low-level formatting aside), the data does not actually disappear from the disk; the file is merely marked as deleted in the file allocation table, while the data area itself is not immediately erased. As long as the file's data area has not been overwritten by subsequent writes, the deleted file can be recovered. This is the theoretical basis of post-deletion disk file recovery.

However, database data files and backup files tend to be very large. If any individual data area has been overwritten, the recovered file will be incomplete, and manual intervention is needed to repair it; the workload and technical difficulty are both enormous, and special instruments and equipment are sometimes required. In more complex cases, file carving is used. File carving is a recovery technique frequently used in digital forensics: it extracts files from a seemingly undifferentiated mass of binary data, that is, a raw disk image, without relying on the disk's file system.
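Here is a minimal sketch of the signature-scanning idea behind file carving: walk a raw disk image looking for known magic bytes and report candidate offsets, without consulting the file system at all. Real forensic carvers also reassemble fragmented files; this only locates headers, and the image path and signature set are illustrative.

```python
# A minimal sketch of signature-based file carving: scan a raw disk image
# for known magic bytes and print candidate offsets. The image path and
# signature set are illustrative only.
SIGNATURES = {
    b"\xfebin": "MySQL binlog",   # bytes 0xfe 'b' 'i' 'n'
    b"\x1f\x8b": "gzip archive",  # many logical backups are gzip-compressed
}

def carve(image_path: str, chunk_size: int = 64 * 1024 * 1024) -> None:
    overlap = max(len(s) for s in SIGNATURES) - 1
    tail = b""
    base = 0  # absolute offset of the first byte of `buf`
    with open(image_path, "rb") as img:
        while chunk := img.read(chunk_size):
            buf = tail + chunk  # keep an overlap so boundary-spanning hits are found
            for sig, name in SIGNATURES.items():
                pos = buf.find(sig)
                while pos != -1:
                    if pos + len(sig) > len(tail):  # skip hits already reported last round
                        print(f"{name} header at byte {base + pos}")
                    pos = buf.find(sig, pos + 1)
            tail = buf[-overlap:]
            base += len(buf) - overlap

carve("/data/recovered-disk.img")
```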

In addition, for a system the size of Weimob's, each vertical business unit may have its own database, and those databases may even use different solutions; this architectural heterogeneity adds yet more difficulty to the recovery. Moreover, even after part of the data has been recovered, it cannot go online immediately: it must wait for the other related data to be recovered, and the data must be cross-checked to make sure nothing is wrong, all of which takes a great deal of time.
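As an illustration of what such cross-checking might look like, here is a minimal sketch that runs identical aggregate queries against two recovered copies and refuses to sign off on any mismatch. sqlite3 stands in for the real databases, and the table and column names are hypothetical.

```python
# A minimal sketch of data cross-checking before go-live: run the same
# aggregate queries on two recovered copies and flag any mismatch.
# sqlite3 stands in for the real databases; names are hypothetical.
import sqlite3

CHECKS = [
    ("row count",    "SELECT COUNT(*) FROM orders"),
    ("amount total", "SELECT ROUND(SUM(amount), 2) FROM orders"),
]

def cross_check(conn_a, conn_b) -> bool:
    ok = True
    for desc, sql in CHECKS:
        a, b = conn_a.execute(sql).fetchall(), conn_b.execute(sql).fetchall()
        if a != b:
            print(f"MISMATCH ({desc}): {a} vs {b}")
            ok = False
    return ok

def make_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn

full = make_db([(1, 9.9), (2, 19.9)])
partial = make_db([(1, 9.9)])  # simulates an incomplete restore
print("safe to go live:", cross_check(full, partial))
```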

These are just the scenarios I can think of. I too am standing far from the mountain, looking at the problem as an outsider, so I believe the actual situation is even more complicated than what I have described. We cannot draw any conclusions about the final recovery outcome yet; all we can do is wait.
