Analysis of Facebook data center practices and introduction of OCP's main work results

Confidentiality is common practice in the data center industry. In November 2014, I went alone to visit the SuperNAP data center south of Las Vegas. After getting out of the car, I tried several times to photograph the building's exterior with my phone, but was quickly stopped by guards patrolling in a Hummer. I went inside and waited for my appointment. Although armed guards are common in the United States, I was still struck by the guards in the guard room, ready to deal with robbers at any moment. No photography inside a data center is a standard rule, but on previous data center visits I had always been escorted by a dedicated staff member, and I had never experienced security this strict.


Caption: The reception room of SuperNAP 7 Data Center. I waited here for more than 20 minutes and was able to observe the guard room through the small window. Image from SuperNAP official website, same below

This is related to the nature of a colocation data center, which must protect the confidentiality of its tenants. Google, which is its own customer, regards infrastructure as one of its core competitive advantages, as can be seen from the company's consistent emphasis on it. Google therefore kept its data center and custom hardware designs secret for a long time: employees sign a confidentiality agreement when they join, and may not disclose details for a year or two after leaving Google.


Figure: SuperNAP 7 data center at night, a typical American large flat structure

But what about those photos of the interior and exterior of Google's data centers that Google itself has released?

In March 2009, Facebook hired Amir Michael, a hardware engineer who had worked at Google for nearly six years (and before that interned at Cisco), to lead hardware design. On April 1, 2010, Facebook announced the appointment of Ken Patchett to lead its first self-built data center in Prineville, Oregon. Ken Patchett started his career at Compaq and accumulated nearly six years of experience in data center and network operations at Microsoft. After joining Google, he ran the data center in The Dalles, Oregon, and before joining Facebook he spent more than a year in Asia managing Google's own and leased data centers. Having come full circle, he returned to Oregon.

Figure: Guard room of SuperNAP data center

From server design to data center operations, Facebook kept poaching Google's talent, and Google, however reluctant, did not sue, which meant that ever more details would be made public. The bigger surprise was yet to come: in April 2011, on the occasion of the Prineville data center's opening, Facebook announced the launch of the Open Compute Project (OCP), open-sourcing a series of hardware designs including its data center and custom servers.

Within three years, Facebook made two major moves: first poaching talent, then opening up. Although Facebook's data center footprint is an order of magnitude smaller than Google's, it is now routinely mentioned alongside the big three (along with Microsoft and Amazon). OCP has contributed much to this; even Microsoft, the "millionaire buzzer" (a pun on the hum of server fans, not meant derogatorily), has joined.

Facebook took the PR war of data center publicity to a new level. In October 2012, Google disclosed some of its data center technology, inviting journalists to visit and posting nearly 100 high-resolution photos on its website. However, Google remained very secretive about IT equipment (servers and networking) and related technologies, at most mentioning servers it had already retired. The two editions of the book Urs Hölzle co-authored likewise focus on macro-level concepts and design principles at the data center level.

Caption: Google's data center in The Dalles, Oregon, is surrounded by mountains and water (the Columbia River), where team members can enjoy rafting, wind surfing, fishing and hiking. Look at the mountainside in the upper left corner (Source: Google official website)

Interestingly, James Hamilton also commented on the information Google disclosed. AWS, once considered comparable to Google in data center strength and secrecy, now appears to be the most mysterious of all.

In general, what Google reveals covers its distant past and its present state; little from the years of growth in between has been passed down. Facebook's development history may serve as a reference for that period.

From one server to multiple data centers

In February 2004, Mark Zuckerberg launched Facebook from his Harvard dorm room on a single server. Just five years later, it was the world's largest social networking site, with more than 300 million active users, handling 3.9 trillion feed actions, more than 1 billion chat messages, 100 million search requests, and more than 200 billion page views per month...

In the early days, with few users, few photos, and no video, running every service on one server was no problem. The Facebook of 2009 was a different matter: a seemingly simple action such as loading a user's home page touched hundreds of servers within a second, processing tens of thousands of pieces of data scattered across them and assembling the requested information.

It is not hard to imagine how quickly the server count grew. Available signs put Facebook's server numbers at:

In April 2008, about 10,000 units;

In 2009, about 30,000 units;

At least 60,000 units in June 2010...

Even today, this number ranks among the top tier-2 Internet customers (100,000 servers is the threshold for tier 1, a group of only a dozen or so companies that includes Facebook). Energy efficiency became an unavoidable concern: conservatively assuming 200W per server, annual power consumption exceeds 100 million kWh. If data center PUE (Power Usage Effectiveness) could be reduced from 1.5 to 1.1, 42 million kWh of electricity would be saved each year.
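The back-of-the-envelope figures above can be verified in a few lines of Python, using the article's own conservative assumptions (60,000 servers at 200W each):

```python
# Conservative figures from the text: 60,000 servers at 200 W each.
servers = 60_000
watts_per_server = 200
hours_per_year = 24 * 365  # 8,760 hours

# Annual energy of the IT load alone, in kWh.
it_kwh = servers * watts_per_server * hours_per_year / 1000

# PUE = total facility energy / IT energy, so total = IT energy * PUE.
def annual_total(pue):
    return it_kwh * pue

saved_kwh = annual_total(1.5) - annual_total(1.1)
print(f"IT load: {it_kwh / 1e6:.1f} M kWh/year")            # > 100 M kWh, as stated
print(f"Saved by PUE 1.5 -> 1.1: {saved_kwh / 1e6:.1f} M kWh/year")  # ~42 M kWh
```

The IT load alone comes to about 105 million kWh per year, and lowering PUE from 1.5 to 1.1 saves 0.4 times that amount, matching the 42 million kWh quoted in the text.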

Until 2009, Facebook still relied on rented data center space and did not build its own data center. The advantage of renting data center space (deploying servers, networks and other IT facilities by yourself) is that it is delivered quickly and can be completed within 5 months; building a data center takes about a year and more upfront investment, but it can be customized according to its own needs in terms of power supply and heat dissipation, which is more cost-effective for ultra-large-scale users. Google, Microsoft and Amazon have already built their own data centers.

Figure: Two data center buildings in Prineville (Source: Facebook official website, 2014)

In January 2010, Facebook announced the construction of its first data center in Prineville, Oregon, with a planned area of about 14,000 square meters and a target PUE of 1.15. In July of the same year, the social giant decided to double the size of the Prineville data center to about 30,000 square meters; it was completed in December 2010. Thanks to a series of high-efficiency designs, including 100% outside-air cooling with no chillers, PUE can be as low as 1.073. Against the "industry average" of 1.51, the energy saving is slightly better than our earlier assumption.

Figure: The Altoona data center construction site at sunset at the end of August 2013, covering about 194 acres. By mid-November 2013, more than 200 people were working on site every day, for a total of nearly 100,000 hours of work (Source: Facebook official website)

Facebook, which has tasted the sweetness of self-built data centers, continued to build data centers in Forest City, North Carolina (announced in November 2010), Luleå, Sweden (announced in October 2011), and Altoona, Iowa (announced in April 2013). Each data center has been expanded after completion, such as Prineville and Forest City, each of which has added a data center (building) for cold storage. The second phase of Luleå and Altoona projects also started in 2014.

Origin of OCP: Is the disciple better than the master?

Without open source, there would be no Internet industry as we know it, though that is mainly a software story. Google has done a great deal of software open-sourcing, and the famous Hadoop can be seen as an unintended byproduct of Google's "open" papers. In February 2015, Google announced the open-sourcing of MapReduce for C (MR4C), which it obtained through a June 2014 acquisition. MR4C is a MapReduce framework developed in C++; open-sourcing it lets users run native C and C++ code in their own Hadoop environments, good news for the Hadoop community.

Open hardware technology underpins Internet infrastructure, though "open" is not the same as open source. Intel beat IBM and the other RISC vendors (ARM is another story) by building an ecosystem on open hardware technology, yet at least before OCP it was unimaginable that Dell or HP would publish detailed design documents for their servers. Moreover, "open source + open" does not guarantee a transparent result: Google built proprietary data centers on open source software and open hardware technology.

It should be said that Zuckerberg realized very early that a war between Facebook and Google would come, and much sooner than a certain saying familiar to Chinese readers would suggest. Google sells advertising across the entire Web; Facebook sells advertising within its own social network. Just as Tencent keeps Baidu search out of WeChat, Facebook wants its own search engine. In 2013, Facebook launched Graph Search, updated it to Facebook Search in early December 2014, and then removed Microsoft Bing's Web results from Facebook search.

One important difference: Tencent is no smaller than Baidu, while Facebook alone cannot match Google. From servers to data centers, Google started earlier, operates at larger scale, and has its own integrated system. To narrow the infrastructure gap with Google quickly, Facebook hit on a clever approach: grow an ecosystem through open source, by establishing the Open Compute Project (OCP).

Figure: The Open Compute Project logo, with an “f” on the left made of server motherboards (Source: Zhang Guangbin, 2013)

As an open source hardware project, OCP not only publishes the details of Facebook's "from scratch" custom data centers and servers, down to the CAD drawings of racks and motherboards, but also invites the open source community and other partners to use and improve them. In other words, it proceeds in two steps: first release the specifications and mechanical drawings, then refine them with the community.

Measured against hardware players of Facebook's and Google's weight, even the core vendors of the ecosystem, such as Intel, find it hard to adopt such a community-oriented mindset. Indeed, the last company to do something like this was Google, which open-sourced Android to counter Apple's iOS and successfully built a huge ecosystem, like a pack of wolves besieging a tiger.

In this capital- and talent-intensive industry, open source is a good way to compete for talent and has a significant advertising effect. More customers using hardware based on the OCP specification can also increase the purchase volume, helping Facebook reduce costs and play a similar role to group buying.

OpenStack had just emerged at the time, and OCP adopted some similar practices, such as holding a summit in each half of the year; the establishment of the OCP Foundation (Open Compute Project Foundation) was announced at the second OCP Summit, held on October 27, 2011. Hardware design cycles are long, however, so from 2012 the summit became an annual event, and the sixth was held from March 9 to 11, 2015.

Figure: Facebook's infrastructure department (Source: Zhang Guangbin, 2013)

At the fifth OCP Summit held at the end of January 2014, Mark Zuckerberg and Facebook Vice President of Engineering Jay Parikh announced that in the three years since OCP was founded, open source hardware solutions have helped Facebook save $1.2 billion.

By then, OCP had nearly 200 members (including heavyweight traditional enterprise vendors such as Microsoft and VMware, which joined in 2014), 7 solution providers led by Quanta, a large number of validated designs, and adoption by Facebook and Rackspace... Next, I will briefly introduce the organization and main achievements of OCP from two angles: its board of directors and its typical projects.

Board of Directors: A legacy of experience

The importance of establishing a foundation, rather than being controlled by Facebook alone, to the development of OCP is self-evident. The OCP Foundation operates under the management of the board of directors, initially with five directors from five companies.

Frank Frankovsky represents Facebook as the chairman and president of the OCP Foundation. He joined Facebook in October 2009 and served as director and vice president of hardware design and supply chain operations. Prior to that, he served as director of Dell's Data Center Solutions (DCS) department responsible for server customization business for nearly four years and was a product manager at Compaq Computer Corporation in the 1990s.

Caption: A corner of Facebook's hardware lab. In the hardware lab, this is considered quite neat (Source: Zhang Guangbin, 2013)

Mark Roenigk is the COO of Rackspace Hosting. He worked at Microsoft for 9 years, mostly overseeing OEM and supply chain operations, and before that was an engineer at Compaq for 7 years. Rackspace is a well-known hosting provider with rich experience in data center construction, operations and hardware. It also co-founded OpenStack with NASA, making it the only company to have helped create both a software and a hardware open source organization.

Jason Waxman is currently the general manager of Intel's Data Center Division's High Density Computing Business, responsible for Internet data centers, blade servers, and technologies related to future dense data center architectures. He is also responsible for leading Intel's work in cloud computing and serves as a manager on the board of directors of Blade.org and the Server System Infrastructure Forum (SSI Forum). Previously, he served as director of Intel's Xeon processors, related chipsets and platform products, and their customer relationships.

Caption: Facebook's campus in Silicon Valley used to belong to Sun - a great company worth remembering, and the phone that took this photo (Source: Zhang Guangbin, 2013)

Andy Bechtolsheim comes from Arista Networks, though he is better known as a co-founder of Sun Microsystems. He served as Sun's chief system architect, was the first investor in Google, and chaired flash storage startup DSSD, which EMC acquired in May 2014.

Except for Don Duet of Goldman Sachs, whose career has mainly been as a CIO, the four men above all have deep hardware-industry backgrounds, with experience spanning products, technology and the supply chain. That breadth of knowledge and experience is crucial to steering the direction of an open source hardware project.

As mentioned earlier, OCP has many projects under its jurisdiction, ranging from servers to data centers, including racks, storage, networks, and hardware management. In 2014, it launched the HPC (High Performance Computing) project.

Servers: Started at Google, Became a Trend

Facebook was not especially early to hardware customization; its early servers also came from OEMs. Jay Parikh, head of Facebook's infrastructure engineering, said at the GigaOm Structure Europe conference in mid-October 2012 that the Luleå, Sweden data center would be the first where Facebook used no OEM server hardware at all.

Figure: Facebook's data center cluster (public data in 2014), the front-end (FE) cluster includes a large number of web servers and some advertising servers, and a relatively small number of Multifeed servers; the service cluster (SVC) includes search, image, message and other servers, and the back-end (BE) cluster is mainly database servers. This configuration scale may change with the application of the "6-pack" core switch mentioned later.

This is obviously directly related to Amir Michael mentioned at the beginning of this chapter. He joined Facebook half a year earlier than Frank Frankovsky and is also one of the co-founders of OCP. He has served as vice chairman of the OCP Incubation Committee (IC) since January 2013 and became CEO of Coolan in April. The company has a close relationship with Facebook and OCP, and Amir Michael is also a co-founder.

Figure: Infrastructure redundancy between regional data centers. FE (front-end cluster), SVC (service cluster), and BE (back-end cluster) form a whole, which is redundant with the data center in another region (Source: Facebook)#p#

Surpassing often starts with learning and imitation, though this is not what Newton meant by "standing on the shoulders of giants." The first generation of OCP servers, contributed by the Facebook data center team when OCP was founded, borrowed heavily from Google's design, most visibly the 1.5U (66mm) server chassis. The extra height allows larger-diameter 60mm low-speed fans, a significant energy saving over the 40mm fans of a 1U server. The 450W power supply unit (PSU) accepts 277V AC and 48V DC input: the former skips an unnecessary voltage conversion stage compared with 208V, and the latter comes from the backup battery cabinet for short-term ride-through, all to minimize energy loss. Cooling and power delivery work together to keep electricity costs down (saving OPEX).

Figure: Comparison of power conversion stages and losses at the Prineville data center (Source: Facebook)

The other thrust is removing the front panel and BMC; there is no VGA port either, in keeping with Facebook's "vanity-free" (no waste) philosophy. The goal is to cut acquisition cost as far as possible (saving CAPEX), even if the workmanship looks a bit rough. As Jay Parikh put it, OCP servers have fewer features than standard servers and use as few components as possible.

Figure: Power transmission path of the 48V battery cabinet (Source: Facebook)

OCP V1 servers came in two dual-socket flavors: AMD (12-core Opteron 6100) and Intel (6-core Xeon 5600). The 13×13-inch motherboard was manufactured by Quanta. The chassis width (480mm, slightly under 19 inches) and the height unit (Rack U, or RU; 1RU is 1.75 inches, or 44.45mm) both follow the industry's "old rules." Three hard disk trays sit at the back, and both the motherboard and the drives can be removed and installed without tools.

Note: OCP server V1 (left) and V2 (right) use the same 1.5U chassis, with four 60mm fans behind the motherboard, and the hard disk tray on the right is cooled by the power supply module. V2's improvements include: front-mounted hard disks for easy maintenance; two motherboards to increase computing density, but at the expense of the number of possible hard disks; CPU performance improvement (Source: Facebook)

Before the third OCP Summit, held in San Antonio in early May 2012, AMD and Intel contributed second-generation OCP motherboard designs. Thanks to the Xeon E5-2600, Intel began to hold an overwhelming advantage. The Intel OCP v2.0 motherboard, code-named "Windmill", carries two Intel Xeon E5-2600s and has a long, narrow shape (6.5×20 inches, about 165×508mm). The OCP V2 server keeps the 1.5U form factor, but because the motherboard is only half as wide as the first generation's, the same chassis holds two compute nodes, doubling density.

In order to support two motherboards, the power supply module of the V2 server is upgraded to 700W and swapped with the hard disk so that the hard disk can be maintained directly from the front.

After two generations of servers, some problems have been exposed:

Poor PSU redundancy. Unlike the 1+1 redundant supplies of industry-standard servers, both generations have a single PSU. For the OCP V1 server this can be excused by the "cattle model" (if a key component fails, replace the whole server), but a PSU failure in an OCP V2 server takes down two compute nodes, which is harder to accept. Facebook therefore also designed a high-availability (HA) variant that adds a PSU in place of one motherboard, at the cost of compute density.

You can use the solution described in the previous chapter to centralize the PSU at the rack level (China's Scorpio cabinets have already done this), but with the width of a 19-inch chassis, the space left after taking away the PSU is not enough to accommodate a third motherboard (6.5×3=19.5 inches).

Compute and storage are not decoupled. This is most obvious in the OCP V1 server: its three drive trays hold six hard disks, and if a compute node needs only a boot drive, much of that space is wasted in exchange for flexibility that is still insufficient. OCP V2 is less affected only because the second motherboard takes over the space of two drive trays.

A 60mm fan just isn't big enough.

The USB interface is retained to varying degrees, but there is no BMC (Baseboard Management Controller). It goes without saying which one is more valuable for management.

Except for the last point, the other points require changes in the chassis and even the rack design.

Open Rack: Redefining the data center rack

Facebook initially adopted a 19-inch triplet design, called Freedom Triplet, 1713mm wide, slightly narrower than three EIA 310-D racks (600mm × 3) side by side. The two outer racks each carry a top-of-rack (ToR) switch, and each of the three columns holds 30 Open Compute servers, 90 in total. A triplet fully loaded with 90 servers weighs 2,600 pounds (about 1,179 kg), and every two triplets share one backup battery cabinet.

Figure: The Freedom Triplet used with the first two generations of servers saves material and gains stability from being joined side by side. It is also slightly taller than a common 19-inch rack, holding 30 of the 1.5U servers (45U) plus a switch per column (Source: OCP specification)

Facebook soon realized that the EIA 310-D standard, which was formed in the 1950s, did not meet their requirements. EIA 310-D standardized the width between the inner rails of the rack (19 inches), but left the specifications of height, depth, mounting and cabling schemes, and connectors to the manufacturers to define. Facebook believed that this led to unnecessary differentiation in server and rack design, locking customers into specific vendors and their implementations.

Figure: One DC UPS battery cabinet supports a full system of 180 servers in two sets of triple cabinets (Source: Facebook, 2010)#p#

The more critical problem is that a traditional 19-inch rack, once the side walls and rails are accounted for, leaves only 17.5 inches of usable width for IT equipment (servers, storage), which cannot fit three 6.5-inch-wide motherboards or five 3.5-inch hard drives side by side. The narrowness of the rack has long drawn complaints; IBM mainframes and EMC's high-end storage, for example, use racks wider than 60cm, and the EMC Symmetrix VMAX system and storage racks are 30.2 inches (76.7cm) wide, precisely to accommodate larger servers (storage controllers) or more drives.

Expanding the external width, however, does not necessarily improve efficiency, and in volume terms mainframes and high-end storage are niche products; few customers buy thousands of such racks. Facebook's solution keeps the 600mm (nearly 24-inch) external width unchanged while widening the internal clearance from 483mm to 538mm (21 inches), a gain of 55mm (about 2.2 inches), and eliminating the expensive slide rails. Width utilization jumps from roughly 73% (17.5 of 24 inches) to 87.5%, a pioneering move.
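The utilization figures quoted above follow directly from the widths in the text; a quick check (treating the "nearly 24 inch" exterior as exactly 24 inches, as the article's percentages do):

```python
# Widths in inches. 600 mm is ~23.6 in; the text rounds it to "nearly 24".
exterior = 24.0
eia_equipment = 17.5        # usable width in a classic 19-inch (EIA 310-D) rack
open_rack_equipment = 21.0  # Open Rack's 538 mm inner clearance

eia_util = eia_equipment / exterior         # ~0.729 -> the "73%" in the text
open_util = open_rack_equipment / exterior  # 0.875  -> 87.5%
print(f"EIA 310-D: {eia_util:.1%}, Open Rack: {open_util:.1%}")
```

Note also that 21 inches is just wide enough for three of the 6.5-inch v2.0 motherboards (19.5 inches) or five 3.5-inch drives side by side, which is exactly what the chassis designs in the following sections exploit.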

Figure: Open Rack top view (bottom front, top back), showing the expansion of the inner width, front-end maintenance & back-end power supply (Source: OCP specification)

Since the important inner width has changed, each unit is redefined, and the height is slightly enlarged from 44.5mm of the traditional Rack U (RU) to 48mm, named OpenU, or OU for short, and the rack is also named Open Rack. In order to be compatible with previous equipment, 0.5 OU is retained as the minimum unit, but it seems that no non-integer OU products have been launched since then.

Then there is the integrated power supply, divided into three power zones. Each zone has a 3 OU power shelf holding seven 700W PSUs (carried over from the OCP V2 server) in N+1 configuration, giving 4.2kW per zone and 12.6kW for the whole rack. Each rack has two PDUs: 200-277V AC at the left rear and 48V DC at the right rear. Servers draw power from three evenly spaced copper busbars at the back of the rack; the PSUs output 12.5V, just meeting the servers' 12V input requirement.

The Open Rack v0.5 specification was released on December 15, 2011, and was introduced at the third OCP Summit. This version recommends that each power zone be 15 OUs, 12 OUs for IT equipment, and then 2 OUs for ToR switches, with a total height of at least 47 OUs (no less than 2300mm, which seems to be a remnant of the previous Triplet vertical space allocation idea). On September 18, 2012, the Open Rack 1.0 specification was released, which mainly clarified the following points:

Focus on single-row rack design (non-triple cabinet);

The inlet temperature was increased to 35 degrees Celsius, reflecting other Open Compute designs and real data center temperatures;

The switch layout is more flexible and is not limited to the top of the power supply area;

The computing equipment (server/storage) chassis is 1-10 OpenU high and supports direct loading with L-shaped brackets. The L-shaped bracket saves space and cost significantly compared to the traditional server rails, and can be installed without tools and can be fixed in increments of 0.5 OpenU (24mm);

The maximum height depends on the power supply area, but it is recommended not to exceed 2100mm to maintain stability. A common practice is 13 OUs per power supply area, 10 OUs for IT equipment, plus 2 OUs for switches, a total of 41 OUs;

The newly designed clip makes it easier for the chassis power connector to mate with the copper busbar.
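The "common practice" height in the list above can be sanity-checked against the 2100mm recommendation (assuming 48mm per OpenU as defined earlier; the rack frame itself adds some height on top of the rack units):

```python
# Sanity check of the recommended Open Rack V1 vertical layout.
OU_MM = 48       # one OpenU, up from the traditional 44.45 mm RU
zones = 3
per_zone_ou = 13  # 10 OU of IT equipment + 3 OU power shelf per zone
switch_ou = 2

total_ou = zones * per_zone_ou + switch_ou  # 41 OU, as the text says
payload_mm = total_ou * OU_MM               # 1,968 mm of rack units
print(total_ou, payload_mm)  # stays under the 2,100 mm stability ceiling
```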

Figure: Open Rack V1 front view and side view (left front and right back), showing the vertical space allocation (Source: OCP specification)

In summary, the main features of Open Rack are:

Expanded space. The internal utilization rate has been improved innovatively, especially the width left for IT equipment has been greatly increased, the unit height has also been slightly increased, while maintaining compatibility with the original rack standards as much as possible (same external width and similar height);

Centralized power supply. Provides sharing and redundancy within the rack, and servers and other IT equipment can be directly plugged in and out to get power, eliminating the need for manual wiring when racking;

Front-end maintenance. The rear is reserved for power and cooling, so technicians can do their daily work from the cold aisle without entering the hot aisle. Shuttling between both sides not only adds work; equipment is also hard to identify from the rear, which invites mistakes.

There are side effects, of course: the load-bearing walls on both sides become thinner while the IT payload may grow (the Open Rack V1.1 specification allows 950 kg, close to the fully loaded triplet mentioned at the start of this section), challenging the rack's rigidity, especially during whole-rack delivery. Early Open Racks had to add diagonal braces at the rear to prevent deformation.

In the current Open Rack V2 specification, however, the basic rack configuration supports 500 kg of IT equipment in a dynamic environment, and with fastening bolts and other reinforcements the heavy rack configuration (Heavy Rack Config) supports 1,400 kg. For comparison, James Hamilton revealed at the re:Invent 2014 conference that AWS's storage-optimized rack holds 864 (3.5-inch) hard drives and weighs 2,350 pounds (about 1,066 kg). Packing in that density is a science in itself.

Figure: A triplet structure is still more stable (Source: OCP Engineering Workshop)

Note: Open Rack V2 also brings important improvements such as a reorganized power layout and the removal of the separate battery cabinet, which will be introduced in later chapters.

Open Vault: Separating storage from servers

Thanks to Open Rack, the third-generation OCP server (codenamed Winterfell) unveiled at the fourth OCP Summit has a qualitative leap in design:

The motherboard is still v2.0, but the server height grows to 2 OU (emphatically not 1.5 OU), allowing even more efficient 80mm fans;

Larger vertical space is conducive to accommodating full-size GPGPU, supporting two full-height PCIe cards and a 3.5-inch drive slot, all serviceable from the front;

With no PSU in the server chassis, three servers (each with two 80mm fans) fit side by side, drawing power from the copper busbars at the rear; this further raises density (three servers per 2 OU) while keeping the nodes independent of one another;

In terms of appearance, the workmanship is much more refined, and the exposed parts are also better treated. Overall, it is not inferior to the level of general commercial servers.

Figure: Top view and triple-mounted OCP server (Winterfell) for Open Rack V1 (occupying 2 OU rack space in total) (Source: Internet picture combination)

The current OCP server motherboard has developed to V3.1. The size remains unchanged, supports Intel Xeon E5-2600 V3, 16 DIMM/NVDIMM, plus BMC, and supports Open Rack V1 and V2. The three 75W PCIe x8 slots have squeezed the space for the hard drive and replaced it with onboard mSATA/M.2 (2260, 60mm long). Previously, only mSATA was supported and an adapter was required.

Hard disks were marginalized first, and then even the job of installing operating systems was taken away by SSDs. So, what about large-capacity storage?

Figure: Facebook's six server types before there was a storage project. Type II was merged into Type VI (bad news for AMD), so most public documents omit it. The storage configurations of Types IV and V look much like the so-called 2U "storage servers" (Source: Facebook)

We often say that Internet companies do not buy storage (equipment), which refers to traditional enterprise-level arrays such as SAN and NAS, not that there is no demand for large-capacity storage. The AWS storage-optimized rack mentioned above is an example.

The OCP V1 server supports up to six 3.5-inch hard disks. Fully populated, that is not excessive; but if only one or two are needed, the remaining bays cannot be used for anything else. Preserving flexibility means paying in wasted space, and the trouble is that there is not much flexibility to begin with.

At that time, Amir announced a conceptual design for storage-intensive applications: a 4U device holding 50 hard drives, split between two controllers, which could be connected to multiple servers to provide a variable compute-to-storage ratio.

At the third OCP Summit, AMD, by then losing ground, launched a project codenamed Roadrunner based on its dual-socket Opteron 6200 motherboard, with four variants: 1U (HPC option), 1.5U (general purpose), 2U (cloud option), and 3U (storage option). The 2U supports 8 3.5-inch or 25 2.5-inch drives; the 3U supports 12 3.5-inch or 35 2.5-inch drives. Measured by 3.5-inch drive density alone, it does not even match the servers from OEM vendors. Once Open Rack entered practical use, the project faded, and AMD shifted to the ARM camp, maintaining its presence in OCP mainly through the Micro-Server Card.

In general, Amir's idea of separating compute from storage (decoupling, disaggregation) proved more viable. Through the efforts of Per Brashers, then hardware engineering manager, and Chinese engineer Yan Yong, Facebook's Open Vault (codenamed Knox), released at the same summit, was a success. It is a JBOD (Just a Bunch of Disks: a simple collection of hard disks with no processing power, which must be paired with compute nodes) matching Open Rack in width and height (2 OU). It holds 30 3.5-inch hard disks in two trays of 15 each, with a pair of redundant "controllers" whose circuit logic is far simpler than a server motherboard's. It was designed essentially by Facebook alone, first produced by Quanta, then contributed to OCP; like the OCP servers, versions are also produced by other providers (such as Hyve Solutions and Wiwynn).

Figure caption: An Open Vault with one tray of 15 hard disks pulled out. The 2 OU devices above the power-supply zone of the rack in the background are Quanta's JBR, also a JBOD (Source: Zhang Guangbin, 2013)

Open Vault is a very classic design, and there will be a special chapter for analysis later.

Figure caption: Besides the natural refresh of CPU, memory and disk configurations, by 2013 Facebook's Hadoop (Type IV) and Haystack (Type V) servers all used Open Vault, and the cold-storage rack became a new server type (VII). In hardware-architecture terms, it can be understood as a low-performance storage system of one controller and eight JBODs (Source: table compiled from Facebook data)

Now, the Facebook server types that need large-capacity storage, such as Type IV (for Hadoop) and Type V (for Haystack, Facebook's photo service), all get their storage from Open Vault, and a new cold-storage type has been added: one OCP server with 8 Open Vaults (240 hard drives), 18 OU in total, occupying half a rack.
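The arithmetic behind the cold-storage configuration is easy to check. A minimal sketch, using only figures quoted above (a 2 OU server plus eight 2 OU Open Vaults of 30 drives each):

```python
# Rack-space and drive-count check for Facebook's cold-storage server type,
# using the figures quoted in the text.
SERVER_OU = 2          # one Winterfell-style OCP server occupies 2 OU
VAULT_OU = 2           # each Open Vault (Knox) JBOD is 2 OU
DRIVES_PER_VAULT = 30  # 2 trays x 15 x 3.5-inch drives
VAULTS = 8             # eight Open Vaults per cold-storage server

total_ou = SERVER_OU + VAULTS * VAULT_OU      # 2 + 16 = 18 OU
total_drives = VAULTS * DRIVES_PER_VAULT      # 8 x 30 = 240 drives
print(total_ou, total_drives)
```

Eighteen OU is indeed about half of an Open Rack's payload space, which is why two such cold-storage units pair up per rack.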

Data Center: RDDC and Water...

As mentioned earlier, the birth of OCP is closely related to data center construction. The data center electrical and mechanical design specifications contributed by Facebook based on the Prineville data center practice are one of the earliest documents of OCP; the cold storage hardware design specifications contributed by Facebook to OCP include recommendations for the ground layout of cold storage data centers. The cold storage server is the aforementioned configuration.

[[129240]]

Caption: Facebook's Luleå data center, located on the edge of the Arctic Circle, looks a bit like Google's Hamina data center in Finland, which was introduced in the previous chapter? The Maevaara wind farm that provides electricity for the Hamina data center is not far north of Luleå... (Image source: Facebook)

In early March 2014, Marco Magarelli, a design engineer on Facebook's data center design team, wrote on the OCP website that the second building (Luleå 2) on the Luleå campus in Sweden would be built modularly, under the concept of the "Rapid Deployment Data Center" (RDDC). RDDC covers two methods; the second, "flat pack", is avowedly an imitation of IKEA. The real adaptation to local conditions, however, is to Sweden's cold climate (Luleå is less than 100 kilometers from the Arctic Circle). Facebook mechanical and cooling engineer Veerendra Mulay told me in an exchange that building a data center the traditional way takes 11 to 12 months (see Prineville), while RDDC shortens that to 3 to 8 months, largely avoiding Luleå's snowy season (Tencent's Tianjin data center was likewise held up by blizzards during construction).

Figure 1: Different types of modules in chassis mode (Source: Facebook)

The first method, "chassis", starts from a pre-assembled steel frame 12 feet wide and 40 feet long, on the model of a car chassis: build the frame, then attach the parts to it on an assembly line. Cable trays, power busways, control panels and even lighting are pre-installed at the factory. In contrast with traditional construction, this modular approach is like building with Lego bricks.

Figure: Flat pack assembly (Source: Facebook)

As the names suggest, both methods embody a shift from traditional engineering projects to factory-prefabricated products assembled into modules on site. By deploying pre-installed assemblies and prefabricated unit modules, delivering a predictable and reusable product, RDDC achieves site-neutral design, reduces on-site work, improves execution and process, speeds up data center construction, raises utilization, and makes the design easy to replicate in other regions. Improving efficiency ultimately serves the needs of the business.

Figure: The cooling design of the first Prineville data center. The plenum above the ceiling (compare the frame-structure photo of the Altoona data center above) conditions the incoming outside air and mixes it with return hot air in a set proportion.

RDDC benefits greatly from Facebook's championing of fresh-air cooling. With no chillers (chiller-less) and no chilled-water piping, the data center is much easier to modularize; another payoff is a very low PUE (about 1.07). By comparison, Google's data centers, though highly modular, are somewhat encumbered by their cooling-water piping, and their PUE is slightly worse (about 1.12). On the other hand, because it relies on spraying a water mist to regulate temperature and humidity, Facebook's approach gives up a little margin of safety.
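PUE (Power Usage Effectiveness) is simply total facility power divided by IT equipment power, so a PUE of 1.07 means only about 7% overhead beyond the IT load itself. A minimal sketch, with a hypothetical 1000 kW IT load chosen purely for illustration:

```python
def pue(total_facility_kw: float, it_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over IT equipment power."""
    return total_facility_kw / it_kw

# Hypothetical loads that reproduce the PUE figures quoted in the text.
it_load_kw = 1000.0
facebook_pue = pue(1070.0, it_load_kw)   # ~1.07: ~70 kW of cooling/power overhead
google_pue = pue(1120.0, it_load_kw)     # ~1.12: ~120 kW of overhead
print(facebook_pue, google_pue)
```

The 0.05 gap may look small, but at this scale it is tens of kilowatts of continuous overhead per megawatt of IT load.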

[[129244]]

Caption: Google's Dalles data center in Oregon. The blue one is the cold water supply pipe, and the red one is the warm water return pipe. Laying water pipes is a typical engineering project, which is time-consuming and labor-intensive and difficult to modularize. (Source: Google official website)


Figure Note: Comparison of basic design indicators of Facebook's three major data centers (Prineville, Forest City, Luleå) (Source: Facebook)

Network: From the edge to the core

Intel pushed the mezzanine-card design in its Xeon E5-2600 reference platform, especially for network cards, so that high-density machines gain flexibility close to that of standard (PCIe) cards. The idea is well reflected in the OCP Intel V2.0 motherboard, likewise based on the Xeon E5-2600: a mezzanine card conforming to the OCP Mezzanine Card 1.0 specification mounts at the front of the board (cold-aisle side) for easy servicing.

For standard rack servers, the case for mezzanine network cards is less urgent and they add cost, so the response from OEM manufacturers has been lukewarm. Supporters such as Dell pitch flexibility as the main selling point, offering mostly Broadcom or Intel NIC modules in the hope of nudging traditional enterprise users toward 10GbE. OCP servers use Mellanox's 10GbE mezzanine card in volume; rich features such as RoCE (RDMA over Converged Ethernet), which cuts transmission latency, and SR-IOV (Single Root I/O Virtualization) are also among its selling points. Even domestic OEM server makers such as Lenovo have adopted this mezzanine card in their Scorpio 2.0 server nodes.

Figure Note: The server node of Lenovo Scorpio 2.0 uses the 10 Gigabit OCP mezzanine card CX341A, the single-port 10GbE network card of the Mellanox ConnectX-3 EN family, and is originally produced in Israel (Source: Zhang Guangbin)

The OCP Intel V3.0 motherboard adds support for OCP Mezzanine Card 2.0. Version 2.0 of the mezzanine card gains an optional second connector for future high-speed networking (such as 100GbE). For now, the more significant change is the enlarged board area: the supported interface modules grow from 1.0's 2 SFP+ to a choice of 2 QSFP, 4 SFP+, or 4 RJ45/10GBASE-T.

Figure Note: The OCP mezzanine card V2 has three main improvements: adding connector B, expanding the space on the board, and optional I/O area (Source: OCP Engineering Workshop)

Having come this far, it should be pointed out that the mezzanine card belongs to the server project. OCP started relatively late on networking: the networking project was launched only in 2013 and gathered strength in 2014.

According to the official OCP website, the initial goal of the networking project was to develop edge (leaf, i.e. top-of-rack, ToR) switches, followed by spine (roughly equivalent to aggregation) switches and other hardware and software solutions.

Figure Note: There is a rough correspondence between the Aggregation/Access layers (access being, e.g., the ToR) of a traditional three-tier network and the Spine/Leaf layers of a two-tier fabric (Source: Cumulus Networks)
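The defining property of a leaf-spine fabric is that every leaf (ToR) switch has an uplink to every spine switch. A minimal sketch of the resulting link count and leaf oversubscription ratio; the port counts below are hypothetical, chosen only to illustrate the arithmetic:

```python
def fabric_links(leaves: int, spines: int) -> int:
    """In a full-mesh leaf-spine fabric, each leaf has one uplink to every spine."""
    return leaves * spines

def oversubscription(downlink_gbps: float, uplink_gbps: float) -> float:
    """Ratio of server-facing to fabric-facing bandwidth on one leaf switch."""
    return downlink_gbps / uplink_gbps

# Hypothetical pod: 16 leaf switches, 4 spines,
# each leaf with 48 x 10GbE server ports and 4 x 40GbE uplinks.
links = fabric_links(16, 4)                        # 64 leaf-spine links
ratio = oversubscription(48 * 10, 4 * 40)          # 480G down / 160G up = 3.0
print(links, ratio)
```

A ratio of 1.0 would be a non-blocking fabric; real deployments usually accept some oversubscription at the leaf to save uplink ports.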

Network devices are less homogeneous with servers than storage devices are. Given the usual ratio of switches to servers, density is not at the same level either, so expanding space is not a priority. The several existing OCP custom switches are entirely conventional in form factor: standard rack units that fit a 19-inch rack, with traditional power-supply and fan layouts, which helps acceptance in the enterprise market. What OCP network hardware pursues instead is a server-like experience, even a server-like life cycle: a highly modular split between control plane and data plane, and decoupled software and hardware, for customization flexibility (DIY) and freedom from vendor lock-in.

Figure Note: The phased goals of the OCP network project are to decouple from traditional monolithic switches to software and hardware, and then further modularize (Source: Facebook)

The core of the data plane is an ASIC (e.g. from Broadcom) or an FPGA, and many of the designs support 40GbE; the control-plane CPU may be x86 (such as an AMD embedded SoC or an Intel Atom), PowerPC (such as a Freescale multi-core PPC), MIPS (such as a Broadcom multi-core MIPS) or ARM. As of the end of February 2015, OCP had published designs for six switches (one each from Accton, Broadcom/Interface Masters, Mellanox and Intel, and two from Alpha Networks), half of which can be configured as either ToR or aggregation switches as needed.

For decoupling software from hardware, ONIE is the key, and it was the key early task of the OCP networking project. ONIE, the Open Network Install Environment, is an open-source project that defines an open "install environment" for bare-metal network switches. A traditional Ethernet switch ships with a pre-installed operating system, usable and manageable out of the box, but it locks the user in; so-called white-box switches offer freedom of hardware choice, but their differing CPU architectures make the management subsystems heterogeneous, which complicates installing a network operating system on them.

ONIE defines this open-source install environment by combining a boot loader with a modern Linux kernel and BusyBox, providing an environment in which any network operating system can be installed. It helps automate the provisioning of switches in large data centers (thousands of units), letting users manage switches the way they manage Linux servers.
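To make the discovery step concrete: when an ONIE switch boots with no NOS installed, it searches (via DHCP, HTTP, TFTP, etc.) for an installer image under a series of default names, from most machine-specific to fully generic. The sketch below generates those candidate names following the pattern documented in the ONIE specification; the vendor and machine strings are hypothetical examples, and the exact precedence rules live in the spec itself:

```python
def onie_installer_candidates(arch: str, vendor: str,
                              machine: str, revision: str) -> list[str]:
    """Default installer-image names an ONIE switch tries, most specific first
    (pattern per the ONIE documentation; a sketch, not the normative spec)."""
    return [
        f"onie-installer-{arch}-{vendor}_{machine}-r{revision}",
        f"onie-installer-{arch}-{vendor}_{machine}",
        f"onie-installer-{arch}",
        "onie-installer",
    ]

# Hypothetical platform identifiers for illustration.
for name in onie_installer_candidates("x86_64", "accton", "as5712_54x", "0"):
    print(name)
```

Any network OS vendor that publishes an image under one of these names can thus be installed on the bare-metal switch without pre-loading, which is exactly the lock-in escape hatch described above.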


Figure Note: The hardware design of Wedge switches released in June 2014, dual redundant power supply units, 4 fans (Source: Facebook)

Yes, Facebook is advancing toward the core of the network. In June 2014, it showed off a new ToR switch (codenamed Wedge) with up to 16 40GbE ports, a choice of Intel, AMD and ARM CPUs, and a Linux-based operating system (codenamed FBOSS).

Figure Note: The 6-pack hardware platform. Because the PSUs are centralized, the Wedge-based switches inside are more compact in width and sit side by side (Source: Facebook)

On February 11, 2015, Facebook announced "6-pack", its first open modular switch: a 7RU chassis holding 8 Wedge-based switches and 2 fabric cards in six layers, with a layer of power and fans beneath. As the core of the Facebook data center fabric, 6-pack lets Facebook build larger clusters, instead of splitting them up and capping cluster size because of the network links between clusters.

Figure Note: 6-pack internal network data path topology (Source: Facebook)

Both Wedge and 6-pack will be open for design specifications through OCP.

Feedback and change: support from traditional manufacturers

2014 was a year of great change for OCP. Despite some turbulence, the ecosystem grew markedly, above all in its pull on traditional software and hardware vendors.

At the fifth OCP Summit at the end of January, Microsoft's high-profile entry into OCP clearly overshadowed fellow newcomers IBM, Yandex, Cumulus Networks, Box, Panasonic, Bloomberg, IO and LSI (since acquired by Avago). Compared with IBM, which seemed to be merely making inquiries at the door, Microsoft arrived with real sincerity, contributing as its "membership dues" the Open CloudServer (OCS) design underpinning its global cloud services (Windows Azure, Office 365 and Bing).

In data-center scale, Microsoft should be larger than Facebook, let alone the still-catching-up IBM/SoftLayer (a Tier 2 Internet customer with 100,000+ servers). Merely adopting OCP hardware for new procurement would already have been big news; instead, it contributed a full set of hardware design specifications and management-software source code. A gesture of openness around Satya Nadella's taking office?

Obviously it is not that simple: Microsoft has ideas similar to Facebook's.

On the OCP server specification and design page today, the Open CloudServer information is listed first, and it was also the highlight of the server track at the 2014 Engineering Workshop. The OCS 12U chassis is designed for an EIA-310-D 19-inch rack, with half-width compute and storage blades at two nodes per U (1U2), plus centralized fans, PSUs and a management unit (Chassis Manager). That is nothing like Open Rack; it is much closer to the 12U Scorpio 1.0 cabinet (introduced in the next chapter). By that logic, folding the Scorpio project into OCP would indeed pose no technical problem, if only BAT were willing... before the founding of the Open Data Center Committee, of course.


Figure Note: The chassis components of the Open CloudServer. The Chassis Manager card is analogous to the RMC of a Scorpio cabinet; its distinguishing feature is that it runs Windows Server 2012 R2, and Microsoft has open-sourced the chassis-management software code (Source: OCP Engineering Workshop)

Accordingly, the OCS V2 chassis has been upgraded. First, the six PSUs went from 1400W to 1600W each: total capacity is 8kW in N+1 configuration, supporting 24 compute blades, or 4.8kW in N+N. The price is that the required hold-up time on power interruption doubles from 10 to 20 milliseconds, and new fans match the blades' energy consumption.
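The two capacity figures follow directly from the redundancy scheme: with N+1, one of the six supplies is held in reserve; with N+N, half of them are. A minimal sketch of that arithmetic, using the PSU rating from the text:

```python
PSU_WATTS = 1600   # OCS V2 supply rating (upgraded from 1400W in V1)
PSUS = 6           # six PSUs per chassis

def usable_watts(psus: int, redundant: int) -> int:
    """Usable capacity once the redundant supplies are held in reserve."""
    return (psus - redundant) * PSU_WATTS

n_plus_1 = usable_watts(PSUS, 1)          # 5 x 1600W = 8000W (8 kW)
n_plus_n = usable_watts(PSUS, PSUS // 2)  # 3 x 1600W = 4800W (4.8 kW)
print(n_plus_1, n_plus_n)
```

So the same chassis delivers 8kW when it only has to survive one PSU failure, but 4.8kW when every active supply must have a standby twin.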

The improved blade performance also demands more I/O bandwidth. The per-tray I/O of OCS V2 was upgraded from V1's dual 10GbE and dual 6Gb SAS (x4) to 10/40GbE and dual 12Gb SAS (x4), and a PCI Express 3.0 x16 mezzanine card was added.

Figure Note: Server racks inside a Microsoft IT-PAC in 2011, seemingly the predecessor of the Open CloudServer. The rack height appears to be above 50U.

The storage blade is a JBOD holding 10 3.5-inch hard drives; V2 upgraded it from V1's 6Gb SAS to 12Gb SAS. By drive density alone, a rack can reach 800 drives. V1 JBODs can still be used in a V2 chassis. Each compute blade carries 4 3.5-inch drives of its own (V1 also supported 2 2.5-inch SSDs; V2 raises that to 4, plus 8 110mm M.2 PCIe NVMe modules) and can attach 1 to 8 JBODs, for 14 to 84 drives in total.
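The 14-to-84 range quoted above is just the blade's own drives plus its attached JBODs. A minimal sketch checking it with the figures from the text:

```python
INTERNAL_DRIVES = 4   # 3.5-inch drives inside each OCS compute blade
JBOD_DRIVES = 10      # 3.5-inch drives per OCS storage blade (JBOD)

def blade_drive_count(jbods: int) -> int:
    """Drives reachable from one compute blade with 1-8 attached JBODs."""
    assert 1 <= jbods <= 8, "the text specifies 1 to 8 JBODs per blade"
    return INTERNAL_DRIVES + jbods * JBOD_DRIVES

low = blade_drive_count(1)   # 4 + 10 = 14 drives
high = blade_drive_count(8)  # 4 + 80 = 84 drives
print(low, high)
```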

[[129249]]

Picture Note: Facebook's PB-level Blu-ray Archive Storage System (Source: The Register, 2014)

Facebook's Blu-ray disc archive storage system was also shown at the fifth OCP Summit: 42U of space holds 10,000 three-layer 100GB discs for 1PB of capacity, and the data is said to keep for 50 years. Google, Facebook's elder, uses tape, whose single-cartridge capacity is larger; that choice has historical roots, and Facebook believes optical discs represent the future.
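The 1PB claim is straightforward decimal arithmetic (1 PB = 10^6 GB), which a short sketch confirms:

```python
DISCS = 10_000        # three-layer Blu-ray discs in the 42U cabinet
GB_PER_DISC = 100     # 100 GB per disc, per the text

total_gb = DISCS * GB_PER_DISC
total_pb = total_gb / 1_000_000   # decimal units: 1 PB = 1,000,000 GB
print(total_pb)                   # 1.0 PB in 42U of rack space
```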

[[129250]]

Photo Note: The tape backup system of Google's South Carolina Berkeley County data center. This photo was previously misrepresented as Google's server (Source: Google official website)

From the standpoint of offline storage, tape and optical disc each have their strengths, and no winner will emerge in the short term. Soon afterward, in late March 2014, Frank Frankovsky announced that he was leaving Facebook to found a disc-based cold-storage startup, while staying on the OCP Foundation board as an independent member and continuing as foundation chairman and president. Facebook still needed a voice on the board, so its infrastructure director Jason Taylor was added, along with Bill Laing, Microsoft's vice president of cloud and enterprise, expanding the board to seven members.

Figure Note: Adjusted OCP organizational structure (Source: OCP official website)

The veteran storage vendor EMC announced it was joining at the fourth OCP Summit in January 2013, but its thunder was stolen by ARM, which joined at the same time. So when EMC launched its ECS (Elastic Cloud Storage) appliances, built on commodity x86 server hardware, at EMC World 2014, it was asked whether they had anything to do with OCP. By contrast, EMC's subsidiary VMware was far more forthright: it announced it was joining OCP at VMworld 2014 at the end of August, and its EVO: RACK, then still in technology preview, explicitly targeted OCP hardware. After all, VMware itself carries no hardware baggage.
