【51CTO.com Quick Translation】In 2014, two years after Instagram was acquired by Facebook, Instagram's engineering team migrated the company's infrastructure from Amazon Web Services (AWS) to Facebook's data centers. Facebook operates multiple data centers in Europe and the United States, but until recently Instagram used only those in the United States.
The main reason Instagram expanded its infrastructure across the ocean is that it was running out of capacity in the United States. As the service continued to grow, Instagram reached a point where it needed to consider leveraging Facebook's data centers in Europe. Local data centers bring another benefit: lower latency for European users, which should translate into a better Instagram experience.

In 2015, Instagram expanded from one data center to three to gain much-needed resiliency: the engineering team did not want a repeat of the 2012 AWS outage, when a major storm in Virginia brought down nearly half of Instagram's instances. Scaling from three data centers to five was easy: we simply increased the replication factor and copied the data to the new regions. Scaling to a data center far away on another continent, however, was much harder.

Understanding the infrastructure

Our infrastructure can generally be divided into two types:
Everyone loves stateless services: they are easy to deploy and scale, and can be started whenever and wherever needed. But we also need stateful services, such as Cassandra, to store user data. Running Cassandra with too many replicas not only increases the complexity of maintaining the database but also wastes capacity, not to mention how slow quorum requests become when they must cross the ocean.

Instagram also uses TAO, a distributed data store for the social graph, as a storage system. We run TAO with a single master per shard, and no slave handles writes for its shard; all writes are forwarded to the shard's master region. Because all writes happen in the master region, located in the United States, write latency from Europe would be unbearable. You may have noticed that our fundamental problem here is the speed of light.

Potential solutions

Can we reduce the time a request takes to travel across the ocean, or even make the round trip disappear entirely? There are two ways to attack this.

1. Partition Cassandra

To keep quorum requests from crossing the ocean, we are considering splitting the dataset into two parts: Cassandra_EU and Cassandra_US. If European users' data is stored in the Cassandra_EU partition and US users' data in the Cassandra_US partition, users' requests will not have to travel long distances to reach their data.

For example, suppose there are five data centers in the US and three in the EU. If we deployed Cassandra in Europe by simply replicating the current cluster, the replication factor would be 8, and quorum requests would have to contact 5 of the 8 replicas. But if we can split the data into two groups, we get a Cassandra_US partition with a replication factor of 5 and a Cassandra_EU partition with a replication factor of 3, and each partition can operate independently without affecting the other.
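The replication-factor arithmetic above can be sketched as a quick calculation. This is only an illustration of Cassandra's quorum rule (a majority of replicas, i.e. floor(RF/2) + 1), not Instagram's actual configuration:

```python
def quorum(replication_factor: int) -> int:
    """Cassandra QUORUM: a majority of replicas must acknowledge."""
    return replication_factor // 2 + 1

# One combined cluster spanning both continents:
combined_rf = 5 + 3          # 5 US replicas + 3 EU replicas
print(quorum(combined_rf))   # 5 of 8 replicas -- some acknowledgments
                             # must always cross the ocean

# Two independent partitions:
print(quorum(5))             # Cassandra_US: 3 of 5 replicas, all in the US
print(quorum(3))             # Cassandra_EU: 2 of 3 replicas, all in the EU
```

With the split, every quorum can be satisfied by replicas on the same continent, so no request has to wait on a trans-Atlantic round trip.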
At the same time, the quorum requests for each partition can stay on their own continent, which solves the round-trip latency problem.

2. Restrict TAO writes to the local region

To reduce TAO write latency, we can restrict all EU writes to the local region. This looks almost the same to the end user: when we send a write to TAO, TAO updates locally and does not block on sending the write synchronously to the master database; instead, it queues the write in the local region. In the region where the write originated, the data is available from TAO immediately; in other regions, it becomes available after propagating out from that region. This is similar to regular writes today, where data propagates out from the master region.

Different services may have different bottlenecks, but by focusing on reducing or eliminating cross-ocean traffic, we can tackle them one by one.

Lessons learned

As with every infrastructure project, we learned some important lessons along the way. Here are a few of the main ones.
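The local-write scheme from potential solution 2 above can be sketched as follows. The class and method names are hypothetical; TAO's real implementation is very different, and this only illustrates the idea of applying a write locally while queueing it for asynchronous propagation to the master region:

```python
import queue

class LocalRegionStore:
    """Illustrative sketch (not TAO's actual API): apply a write to
    local state immediately, and queue it for later asynchronous
    propagation to the master region instead of blocking on it."""

    def __init__(self, region: str):
        self.region = region
        self.data = {}                 # state visible to local readers
        self.outbound = queue.Queue()  # writes awaiting propagation

    def write(self, key, value):
        self.data[key] = value             # visible locally right away
        self.outbound.put((key, value))    # shipped to master region later

    def read(self, key):
        return self.data.get(key)

eu = LocalRegionStore("EU")
eu.write("user:42:bio", "hello")
print(eu.read("user:42:bio"))   # -> hello (no trans-Atlantic round trip)
print(eu.outbound.qsize())      # -> 1 write queued for the master region
```

The trade-off, as the text notes, is that other regions see the write only after it propagates, much like replicas already see master-region writes today.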
Original title: How Instagram is scaling its infrastructure across the ocean, author: Sherry Xiao [Translated by 51CTO. Please indicate the original translator and source as 51CTO.com when reprinting on partner sites]