Data Is Growing Explosively: Cloud-Native Data Warehouses for Data-Driven Operations in Practice

At the recent 2021 Alibaba Cloud Financial Data Intelligence Summit session "Cloud Native Drives the 'Growth Dark Horse' of Digital Operations", Wei Chuangxian, a senior technical expert at Alibaba Cloud Database, started from the data value chain and explained how the cloud-native data warehouse supports data-driven operations, full-link marketing, and Alibaba Group's Double 11 business. He also presented best practices and application scenarios for financial customers. This article is compiled from the recording and slides of the talk.

Wei Chuangxian, senior technical expert of Alibaba Cloud Database

I. Background and Trends

1. Alibaba’s 15 years of cloud computing practice

Alibaba's 15-year cloud-native journey can be roughly divided into three stages.

The first stage, from 2006 to 2015, was the Internet-style application architecture stage, when cloud native went from 0 to 1. Alibaba began by building middleware for Taobao, which was the earliest prototype of the cloud. At that time we were running on Oracle databases and IBM minicomputers. But as Taobao traffic grew, the Oracle machines could no longer keep up with the business: within three months we would have been unable to store or process our data. This was a very serious problem, so Alibaba launched its de-IOE plan to move off IBM minicomputers, Oracle databases, and EMC storage.

Alibaba found that although the business was doing very well, it faced many technical challenges. So in 2009 Alibaba established Alibaba Cloud, developed its own FeiTian operating system, and ushered in the cloud era. Taobao and Tmall were merged to build a business middle platform, and the three core middleware systems went live.

FeiTian (Apsara) is a distributed operating system. On top of its basic common modules sit two core services: Pangu, the storage management service, and Fuxi, the resource scheduling service. Storage and resource allocation for applications running on the FeiTian kernel are managed by Pangu and Fuxi. FeiTian's core services fall into four groups: computing, storage, database, and network.

To help developers easily build cloud applications, FeiTian provides a wealth of connection and orchestration services to conveniently connect and organize these core services, including notifications, queues, resource orchestration, distributed transaction management, and more.

The top layer of FeiTian is Cloud Market, the first software trading and delivery platform created by Alibaba Cloud. It is like the "App Store" of cloud computing: users can provision "software + cloud computing resources" with one click on the Alibaba Cloud official website. Cloud Market lists thousands of products and supports software and services delivered as images, containers, orchestration templates, APIs, SaaS, services, and downloads.

This is the earliest basic framework of the cloud and also a cloud-native architecture.

We started working on container scheduling in 2011, beginning with the group's online business, which then moved toward containerization. By 2013, our self-developed FeiTian operating system fully supported the group's business.

In 2015, Alibaba Cloud's cloud-native technology was no longer used only for Alibaba's internal business; it also began to be commercialized externally. That was the first stage.

The second stage is the comprehensive cloud-native stage of the core system from 2016 to 2019.

Since 2017, we have adopted cloud-native technologies not only for online business but also for offline business. The Double 11 Shopping Festival generates a huge volume of transaction data, and the back-end analysis and post-processing of this data are all done offline. Based on cloud-native technology, we unified the underlying resource pools of online and offline business to support millions of e-commerce transactions.

By 2019, 100% of Alibaba's core systems were on the cloud. This was very difficult, because Alibaba's business volume is so large that no ordinary system could support it.

The third stage, from 2020 to the present, is the comprehensive upgrade to the next generation of cloud-native technology. Alibaba established the Cloud Native Technology Committee, and cloud native became Alibaba's new technology strategy. Alibaba's core systems now run entirely on cloud-native products to support major promotions, Alibaba Cloud's cloud-native technology was fully upgraded, and the Serverless era began.

2. Alibaba Cloud's Views on Cloud Computing

How does Alibaba view cloud computing? What is the difference between cloud computing and traditional technologies?

Imagine a village where every household has to dig its own well. Each household decides how big a well to dig based on the number of people in the family, how much water they need, whether they expect guests, and so on. If many guests arrive or there is a drought, the water may not be enough. And besides the cost of digging the well, its daily maintenance is also expensive.

Mapped onto an enterprise, this means that a company running its own IT infrastructure still has to rent machine room space from an operator and buy servers to support its services. If those machines later sit idle, the company still pays heavily for them.

The cloud solves this problem by pooling resources through virtualization. In terms of the well example, it is like building a water plant. A water plant differs from a well in two ways. First, its supply is large: even if 100 guests arrive, there is enough water. Second, there is no large up-front investment in digging; you are charged according to the water you use. Even with the pipe connected, if you never turn on the tap you pay nothing.

This brings enterprises two major benefits. First, when they need to move quickly, they do not have to spend a long time "digging wells"; the service works out of the box. Second, the initial investment is very low.

These are the benefits of the cloud, so what is cloud native?

Cloud native means a standardized service that does not require much advance planning. For example, if I want to carry out a digital transformation, my requirement is simple: I need someone to provide the service and allocate exactly the capacity I need, without advance preparation on my part. As my business grows, the underlying infrastructure grows with it, with very good elasticity. This greatly reduces an enterprise's cost and effort, letting it focus on what it does best and greatly improving efficiency.

Through the above examples, the following points are very easy to understand.

First of all, we believe that containers + K8s will become a new interface for cloud computing, which is a trend in the future.

Secondly, the entire software lifecycle will change. Software lifecycles used to be very long, but cloud-native technologies make iteration faster and faster, extending downward to software-hardware integration and upward to architecture modernization.

Finally, enterprises' digital upgrades will accelerate. In the past, digital transformation was very complicated: purchasing machines, databases, and applications could take three to five years. Today, an enterprise's digital transformation can be completed in just a few months.

3. Industry Trends: Data Production/Processing is Undergoing Qualitative Change

Judging from industry trends, what changes will occur in data in the future and what changes will it bring to applications?

First, we believe that data will keep exploding in size. In 2020, the global data size was about 40 ZB. What does 40 ZB mean? If each movie is 1 GB, 40 ZB is about 40 trillion movies, or roughly 5,000 movies for every person on Earth.

We also estimate that the global data scale in 2025 will be 430% of that in 2020, with growth continuing every year.
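As a rough sanity check on these figures, the arithmetic can be sketched as follows. Decimal units (1 ZB = 10^21 bytes, 1 GB = 10^9 bytes) and steady compound growth over five years are assumptions made here for illustration; they are not stated in the talk.

```python
# Assumed decimal units; the talk does not say which convention it uses.
ZB = 10**21  # bytes in one zettabyte
GB = 10**9   # bytes in one gigabyte

data_2020_zb = 40      # global data size in 2020, from the talk
growth_factor = 4.3    # "430% of 2020", from the talk
years = 5              # 2020 to 2025

# How many 1 GB movies fit into 40 ZB.
movies = data_2020_zb * ZB // GB

# Projected 2025 size, and the yearly growth rate that steady
# compounding would require to reach it.
data_2025_zb = data_2020_zb * growth_factor
annual_growth = growth_factor ** (1 / years) - 1

print(movies)                    # number of 1 GB movies
print(round(data_2025_zb))       # projected size in ZB
print(f"{annual_growth:.1%}")    # implied yearly growth
```

Under these assumptions, 40 ZB corresponds to 40 trillion one-gigabyte movies, and the 2025 estimate works out to about 172 ZB, which implies growth of roughly one third per year.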

The second is real-time data production and processing. In the past, we might look at reports once a month. With big data, we can look at yesterday's data every day. Now data is becoming more and more real-time, with responses in seconds. Take marketing as an example. During the Double 11 Shopping Festival, when merchants find that a campaign in their store is not working, they can adjust the advertising or delivery strategy within a minute or a few minutes to get better results. If the data were fed back daily, then by the time it was seen on November 12 the campaign would have lost most of its effect. Real-time data therefore plays a very important role in scenarios like this, and it will also give rise to real-time applications.

The third is intelligent data production and processing. Unstructured data currently accounts for 80% of all data, mainly text, graphics, images, audio, and video. In the popular live-streaming field in particular, intelligent processing of unstructured data can reveal audience preferences and other information, which helps the business develop. In addition, unstructured data keeps growing at 55% per year and will become a very important source for data analysis.

The fourth is the accelerating migration of data to the cloud. We believe this migration is unstoppable, just as gasoline cars will eventually be replaced by electric cars. It is estimated that by 2025, 49% of data will be stored in the cloud, and that by 2023, 75% of databases will be in the cloud.

4. Industry Trends: Cloud Computing Accelerates the Evolution of Database Systems

Another industry trend cannot be ignored: cloud computing accelerates the evolution of database systems.

First, let's look at the history of databases. In the 1980s and 1990s, the field was dominated by commercial databases such as Oracle and IBM DB2, some of which still hold a place in the market today.

In the 1990s, open-source databases such as PostgreSQL and MySQL began to emerge. MySQL is more commonly used in China, while PostgreSQL is more commonly used abroad. Since then, data volumes have kept growing. When volumes were small, a single machine running PostgreSQL or MySQL was enough. As data exploded, distributed systems or minicomputers became necessary to handle the storage and analysis of large data sets.

Why is data analysis important?

For example, the data warehouse company Snowflake had a market value of $100 billion when it first went public, and it is now worth about $70 billion. For a company that makes only one product, this is a very high valuation. Why is it so high?

I talked with a teacher some time ago. He said that for today's enterprises, especially Internet companies in e-commerce or live streaming, the biggest cost used to be manpower, with wages as the main expense. Now the biggest expense is information and data. To plan for the future, a company needs large amounts of data to analyze what its customers want and need most and where the industry is heading. So it has to purchase a great deal of data and do a great deal of analysis, and this cost now exceeds personnel cost. This is why a company that only does data warehouses can be worth 70 billion US dollars.

After 2000, people started using Hadoop and Spark. From 2010 onward, cloud-native, integrated distributed products began to appear, such as the data warehouse services on AWS and AnalyticDB.

5. Industry Trends: Data Warehouses Accelerate Their Evolution from Big Data to Cloud-Native + Fast Data

The above shows the evolution of data warehouses. Computing has evolved from offline to online, then to integrated offline and online, and then to distributed. Functions have evolved from statistics to AI; data types from structured data to a multi-model fusion of structured and unstructured data; workloads from OLAP to HTAP; hardware to integrated software and hardware; and delivery from On-Premise to Cloud-Native + Serverless.

In different stages of evolution, there are various products to support it.

6. Evolution of database system architecture

The above picture shows the evolution of database system architecture. Put simply, it is like one person working in a factory, then ten people working in a factory, and then many people working across multiple factories. This is the history of the data warehouse: from a single machine to a distributed system in which many users share the same data.

The development of databases resembles the way human work is organized. In the past, a shop could be run by a couple, one responsible for production and the other for sales. As the shop grew and more customers came, the premises stayed the same but the staff might grow to 10. Later the business grew even bigger, 100,000 employees were hired, and they worked across 10 locations. This is the distributed cloud-native data warehouse.
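To make the "many workers in many locations" picture concrete, here is a minimal sketch of how a distributed warehouse might spread rows over nodes and combine partial results. The modulo partitioning rule and the function names are illustrative assumptions, not a description of how AnalyticDB is actually implemented.

```python
from collections import defaultdict


def partition(rows, num_nodes):
    """Assign each (customer_id, amount) row to a node by customer_id."""
    nodes = defaultdict(list)
    for customer_id, amount in rows:
        # Hypothetical placement rule: customer_id modulo node count.
        nodes[customer_id % num_nodes].append((customer_id, amount))
    return nodes


def distributed_total(nodes):
    """Each node sums its own rows; a coordinator adds the partial sums."""
    partial_sums = [sum(amount for _, amount in rows) for rows in nodes.values()]
    return sum(partial_sums)


rows = [(1, 100), (2, 250), (3, 75), (4, 300), (5, 25)]
nodes = partition(rows, num_nodes=3)
total = distributed_total(nodes)
print(total)  # 750, the same answer a single machine would give
```

Each node only ever touches its own share of the data, which is what lets the system grow by adding nodes rather than by buying a bigger machine.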

7. Industry Trends: Key Technologies of Cloud-Native Databases

The above are the key technologies of cloud native databases.

Here we briefly introduce two of them. The first is cloud native. What does it mean? If a user buys a database, the fee should be low when business volume is low or the database sits unused over public holidays, and higher when business volume is high. Charging on demand and by usage is one of our requirements for a data warehouse.

The other is security and reliability. For example, Alibaba has an investment department. If it invests 5 million in Company A and 1 million in Company B, this information is highly confidential and must not be disclosed. If it is managed by employees, those employees may resign, and if a leak occurs after they leave, it is hard to hold them legally accountable. How can this highly confidential information be fully encrypted, so that even the DBA with the highest privileges cannot view it? This is discussed in detail later.

II. Cloud Native and Big Data Applications

1. Challenges facing the business

The business faces many challenges, mainly in four areas.

First of all, the data is scattered and inconsistent, and there are many data sources, so collecting the data is a big challenge.

Secondly, the system is extremely complex, with more than 40 systems or components. It may have started out on Hadoop alone, but now it needs many components: HDFS at the bottom, YARN and HBase above it, and Hive, Flink, and many others on top.

In addition, the analysis is not real-time: in a traditional big data architecture, data can only be processed on a T+1 basis, that is, the day after it is generated.

Finally, the learning cost is high, because the different technologies iterate through versions very quickly.

2. Cloud-native data warehouse + cloud-native data lake build a new generation of data storage and processing solutions

Alibaba Cloud's answer is to keep the architecture as simple as possible: one or two products cover the whole pipeline, and users solve their problems with SQL, for example by analyzing raw data in OSS together with data produced by the various production systems.

3. Cloud Native Data Warehouse: Cloud Native

The cloud-native nature of the cloud-native data warehouse shows first in storage: if there is only one piece of data, only enough storage for that one piece is allocated, and as the amount of data grows, more storage is allocated automatically.

The same is true for computing. If there is no computing or analysis demand, no resources are allocated; resources are allocated only when demand arises. The whole process is pay-as-you-go with elastic resources.
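The difference between provisioning for peak load and paying for actual use can be illustrated with a small calculation. All prices and usage figures below are hypothetical, chosen only to show the shape of the comparison; they are not Alibaba Cloud pricing.

```python
# Hypothetical prices, for illustration only.
FIXED_MONTHLY_COST = 3000.0    # cluster provisioned for peak load
PRICE_PER_COMPUTE_HOUR = 2.0   # pay-as-you-go compute
PRICE_PER_GB_MONTH = 0.02      # pay-as-you-go storage


def pay_as_you_go_cost(compute_hours, stored_gb):
    """Monthly bill when both compute and storage are charged by usage."""
    return compute_hours * PRICE_PER_COMPUTE_HOUR + stored_gb * PRICE_PER_GB_MONTH


idle_month = pay_as_you_go_cost(compute_hours=0, stored_gb=0)
quiet_month = pay_as_you_go_cost(compute_hours=50, stored_gb=500)
busy_month = pay_as_you_go_cost(compute_hours=1200, stored_gb=5000)

for label, cost in [("idle", idle_month), ("quiet", quiet_month), ("busy", busy_month)]:
    print(f"{label}: pay-as-you-go {cost:.2f} vs fixed {FIXED_MONTHLY_COST:.2f}")
```

With these made-up numbers, the usage-based bill tracks the workload (nothing when idle, about 110 in a quiet month, about 2,500 in a busy one), while the fixed cluster costs the same regardless of demand.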

4. Cloud Native Data Warehouse: Integration of Database and Big Data

The above are the key technologies in cloud-native data warehouses. The first is row-column mixed storage, which supports both high-throughput writes and high-concurrency queries.

The second is mixed workloads, meaning that the same system can run ETL jobs and serve queries.

The third is intelligent indexing. Using a database well requires understanding the business and the indexes, and knowing what affects queries and what affects writes. We want the system to be smart enough that users do not have to manage these things themselves.
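The idea behind the row-column mixed storage mentioned above can be sketched by holding the same data in two layouts. This is a conceptual illustration with hypothetical table and field names; the talk does not describe AnalyticDB's actual storage format.

```python
rows = [
    {"order_id": 1, "customer": "a", "amount": 120},
    {"order_id": 2, "customer": "b", "amount": 80},
    {"order_id": 3, "customer": "a", "amount": 200},
]

# Row layout: each record is kept together, so writing a record or
# fetching a whole record by key touches a single entry.
row_store = {r["order_id"]: r for r in rows}

# Column layout: each field is kept together, so an aggregate over one
# field scans only that field's values.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

order = row_store[2]                         # point lookup, row layout
total_amount = sum(column_store["amount"])   # aggregate, column layout

print(order)
print(total_amount)  # 400
```

A mixed layout aims to get the write and lookup behavior of the first form and the scan behavior of the second from a single system.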

5. Next-generation data warehouse solutions

The above is the architecture diagram of the new generation data warehouse solution. The bottom layer is the data warehouse, and above it is the data warehouse model. Alibaba has built many models for Taobao Index and data insights, including linking all information through a single ID and aggregating it into a model. The model layer has a data construction and management engine that handles data warehouse planning, code development, data asset management, data services, and so on.

At the top is business empowerment, which has many applications, including regulatory reporting, business decision-making, risk warning, and marketing and operations.

6. Cloud Data Security

Let's talk about data security on the cloud. Every company has top-secret data, and it faces many threats: administrators or users overstepping their authority, theft of data backups, malicious modification of data, and so on. The goals are therefore that data is fully encrypted during storage, query, and sharing, so that no one (including administrators) can obtain plaintext; that log integrity is ensured in untrusted environments, so that no one (including administrators) can tamper with log files; and that query results are correct in untrusted environments, so that no one (including administrators) can tamper with them.

The previous solution was very simple: data was encrypted as it was written to the database. For example, a value written as 123 would be stored in scrambled form, such as 213 or 312. This seems like a good method, but the problem is that the data can no longer be queried. Suppose we want to find transactions over 50 yuan. After encryption 50 is no longer 50; it may have become 500, while the original 500 may have become 50. The query cannot be run, so the database is reduced to pure storage that cannot be analyzed or queried.
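The failure described here can be reproduced with a toy example. The "cipher" below simply reverses the zero-padded digits of an amount, which is an assumption chosen so that 50 becomes 500 and 500 becomes 50 as in the example above; it is not real encryption and offers no security.

```python
WIDTH = 4  # the toy cipher works on amounts below 10,000


def toy_encrypt(amount):
    """Toy stand-in for encryption: reverse the zero-padded digits. NOT secure."""
    return int(str(amount).zfill(WIDTH)[::-1])


# Reversing the digits twice restores the original value.
toy_decrypt = toy_encrypt

amounts = [20, 50, 500, 120]
stored = [toy_encrypt(a) for a in amounts]  # what the database holds

# What the query "amount > 50" should return.
expected = [a for a in amounts if a > 50]

# Comparing stored ciphertexts against the plaintext threshold.
naive = [toy_decrypt(c) for c in stored if c > 50]

# Comparing stored ciphertexts against the encrypted threshold.
encrypted_threshold = [toy_decrypt(c) for c in stored if c > toy_encrypt(50)]

print(stored)               # [200, 500, 50, 210]
print(expected)             # [500, 120]
print(naive)                # [20, 50, 120]
print(encrypted_threshold)  # []
```

Neither way of comparing ciphertexts gives the right rows, because the scrambling does not preserve order. That is why encrypt-on-write alone turns the database into storage that cannot be queried.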

7. Encrypted Data on the Cloud That Never Leaks

Is there a way for us to do data analysis while keeping it confidential and still using the original SQL?

The key lies in the hardware. With ApsaraDB RDS (PostgreSQL version) running on Shenlong Bare Metal Server (with security-chip TEE technology), the key is stored in the hardware in advance, and all computation and logic are then performed inside the encryption hardware. Because the entire process is protected by the hardware, even if someone copies all of system memory, everything copied is encrypted. So even if operations staff get hold of top-secret data, there is no risk of a leak.

III. Best Practices

Let’s look at a few best practices:

DMP: Full-link Marketing

DMP stands for Data Management Platform, also known as a data marketing platform.

What is the core of marketing? It is finding people: finding the group you care about most, known in the trade as "circling people".

In what scenarios do we need to circle people? Suppose we want to find people interested in cloud native to discuss it together. The process of finding those people is called circling people.

Another scenario comes from Tmall and Taobao. Some time before Double 11, a merchant may believe that a certain customer is likely to buy clothes or a bag this year, so the merchant sends that potential customer some coupons.

The key is to locate and segment the audience accurately. There are about 800 million e-commerce consumers in China, and pushing messages to exactly those who are interested in a given item is the heart of the matter.

Alibaba uses data warehouses to circle people. First, we look for seed groups of several million people whom we consider high-quality customers, such as those who spend more than RMB 5,000 or RMB 10,000 on Taobao every month. Once they are found, the second step is to cluster them.

Clustering means dividing these millions of people into several smaller classes, each with a preferred category: one class likes to buy cosmetics, another digital products, another books. After clustering there may be, say, 100,000 people who like to buy cosmetics, but most of them may already have bought cosmetics and are unlikely to buy again this time.

So we need to find, among the 800 million consumers, the people who really are likely to buy cosmetics. How?

We convert each customer's consumption behavior and purchase history into a vector using an AI model. If two customers have similar purchase behavior, the distance between their vectors is very small, so the approach is simple. For example, we take the people interested in digital products as seeds and search the 800 million consumers; if 10 million people have vectors closest to the seeds, we send digital product advertisements or coupons to those 10 million people. This is how we do business marketing.
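The seed-based lookup can be sketched in a few lines. The three-dimensional vectors, customer IDs, and the choice of Euclidean distance to the seeds' centroid are all illustrative assumptions. This sketch compares against every candidate exactly, whereas a production system at the scale described would rely on an approximate vector index.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest(candidates, target, k):
    """IDs of the k candidates whose vectors are closest to target."""
    ranked = sorted(candidates, key=lambda cid: euclidean(candidates[cid], target))
    return ranked[:k]


# Hypothetical behavior vectors: (digital, cosmetics, books) affinity.
seeds = [[0.9, 0.1, 0.2], [0.8, 0.0, 0.3]]  # known digital-product buyers
population = {
    "u1": [0.85, 0.05, 0.25],
    "u2": [0.10, 0.90, 0.10],
    "u3": [0.70, 0.20, 0.30],
    "u4": [0.05, 0.10, 0.95],
}

target = centroid(seeds)
audience = nearest(population, target, k=2)
print(audience)  # ['u1', 'u3']
```

Here the two customers whose behavior most resembles the seed group are selected for the campaign, while the cosmetics-oriented and book-oriented customers are left out.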

There are several core aspects of this process.

The first is clustering the population and knowing their transaction history; the data must support multi-dimensional analysis along any dimension.

The second is to be able to perform specific analysis on the data in the entire data warehouse.

The third is the vector approximation retrieval after clustering, which finds the people who are close to each class vector and pushes messages to them.

This is the capability we have, which is currently implemented based on AnalyticDB.

Another is ad-hoc queries. For example, we want to find people who are interested in digital products and did not buy an iPhone 12 last year, because they may buy one this year. Or, for those who bought an iPhone 12 and AirPods last year, we think there is a high probability that they will buy an Apple keyboard or an Apple computer. We need to run all kinds of transaction queries over these people to pinpoint our target audience.
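The first query above might look like the following sketch. Python's built-in sqlite3 module is used purely as a stand-in so the example runs anywhere; the table names, columns, sample rows, and the year 2020 for "last year" are all hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE interests (customer_id INTEGER, category TEXT);
CREATE TABLE purchases (customer_id INTEGER, product TEXT, year INTEGER);
INSERT INTO interests VALUES (1, 'digital'), (2, 'digital'), (3, 'cosmetics');
INSERT INTO purchases VALUES
    (1, 'iPhone 12', 2020),
    (2, 'AirPods', 2020),
    (3, 'iPhone 12', 2020);
""")

# Customers interested in digital products who did NOT buy an
# iPhone 12 in the given year.
query = """
SELECT i.customer_id
FROM interests AS i
WHERE i.category = 'digital'
  AND NOT EXISTS (
      SELECT 1
      FROM purchases AS p
      WHERE p.customer_id = i.customer_id
        AND p.product = 'iPhone 12'
        AND p.year = ?
  )
ORDER BY i.customer_id
"""
targets = [row[0] for row in conn.execute(query, (2020,))]
print(targets)  # [2]
conn.close()
```

Customer 1 is excluded for having already bought the phone and customer 3 for not being interested in digital products, leaving customer 2 as the target.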

Refined advertising management

Business Challenges:

1) Keyword search events require high concurrency and real-time storage;

2) All users query conversion rates through the dashboard at the same time, and complex queries have high QPS;

3) High response time requirements to avoid missing the golden period for price adjustment.

Business Value:

1) Unified management of keywords for multiple sites and multiple stores;

2) Handle tens of thousands of TPS concurrent writes;

3) Real-time analysis of massive data and intelligent price adjustment by time period;

4) Rapidly identify and analyze keywords to maximize profits.

Online e-commerce

Business Challenges:

1) The traditional MySQL database is overloaded by analysis, and complex reports over tens or hundreds of millions of rows cannot be returned;

2) Complex reports need to be returned in seconds;

3) The solution must be compatible with the MySQL ecosystem;

4) The business is developing rapidly, with different requirements for computing and storage.

Business Value:

1) RDS + AnalyticDB implements HTAP joint solution and isolates business and analysis;

2) 2-10 times analysis performance improvement;

3) Distributed architecture, horizontal expansion, flexible configuration, support for different data volume and access volume requirements.

This is the stage of comprehensively upgrading the next generation of cloud native technology from 2020 to now - the Serverless era. Alibaba has established a cloud native technology committee, and cloud native has been upgraded to Alibaba's new technology strategy. In the future, the cloud native data warehouse will have more new features to solve more core pain points for the industry. Stay tuned.
