A look back at five major outages in 2019

Any network service outage can severely disrupt businesses around the world and cause significant losses in revenue and reputation. Application delivery depends on many Internet service providers (ISPs), and increasingly on a large, complex ecosystem of Internet-facing services such as CDN, DNS, DDoS mitigation, and public cloud. Together these services deliver the digital experience users expect, so even a short outage can have a significant impact.

At the same time, enterprises increasingly rely on Internet transport to connect their sites and reach business-critical applications and services. Gone are the days when applications were hosted entirely in private data centers and offices were connected primarily via MPLS. As enterprises adopt SD-WAN, the Internet is replacing or supplementing services such as MPLS. The Internet has therefore become the de facto enterprise backbone, and because it is a "best effort" transport, failures on it can have significant and unforeseen consequences for enterprises.

Over the past year, several large-scale outages rippled across the global Internet, affecting businesses and consumers to varying degrees. Here are five of the most disruptive outages of 2019, in chronological order:

May 13, 2019 - China Telecom outage reveals its global reach

Although this was not the most disruptive outage of 2019, it was a reminder that China Telecom's reach extends far beyond mainland China. On May 13, 2019, China Telecom suffered a major outage that lasted nearly five hours, with effects lingering for several more hours. Severe packet loss on its backbone network primarily affected infrastructure in mainland China, but it also hit multiple China Telecom nodes in Singapore and the United States, including Los Angeles, disrupting more than a hundred services worldwide.

Throughout the outage, all traffic forwarded to the affected nodes was dropped, so many websites and applications became unreachable in both directions: users in China could not reach sites hosted abroad, and users abroad could not reach sites hosted in China.

The outage also affected services based in the United States, including Apple, Amazon, Microsoft, Slack, Workday, and SAP. The figure below shows some of the websites and services that were affected by the outage.

This incident illustrates how much influence Chinese carriers have on the global Internet. Chinese telecommunications providers interconnect with network providers in many parts of the world, so a failure on their networks is felt well beyond China.

June 2, 2019 - The "Summer of Downtime" Begins at Google Cloud

On June 2, 2019, Google Cloud Platform (GCP) experienced a severe network outage that affected services hosted in the US West, US East, and US Central regions. The outage also affected Google's own applications, including G Suite and YouTube. It lasted more than four hours, and Google released an official report on the incident a few days later. ThousandEyes was able to observe the outage in real time and characterize its nature and scale before more detailed information was made public.

Beginning at approximately 9 a.m. ET, we observed 100 percent packet loss for global monitors attempting to connect to services hosted in GCP us-west2-a. Sites hosted in several GCP US East zones, including us-east4-c, saw similar losses.

It turned out that parts of Google's network became completely unavailable because Google's network control plane was accidentally taken offline. Google later revealed that during the outage, a set of automated policies determined which services could or could not be reached over the unaffected parts of the network.

The most important lesson from this outage is that any cloud architecture needs sufficient resilience, whether across multiple regions or across multiple clouds, to withstand future outages. Even in the cloud, IT infrastructure and services will sometimes fail.

June 24, 2019 - Cloudflare users fall victim to routing disaster

Just weeks after WhatsApp users suffered from a massive route leak, the Internet experienced another routing-related incident that was far more damaging.

Cloudflare is a CDN provider. On June 24, 2019, a major BGP route leak seriously affected users trying to reach Cloudflare services, including gaming platforms Discord and Nintendo Life, for nearly two hours. Analysis showed that several parties contributed to the leak. DQE, a transit provider, was the source. The leak was propagated through Allegheny Technologies, a customer of both DQE and Verizon. Verizon then propagated the leaked routes further, widening the impact.

The outage affected about 15% of Cloudflare's global traffic and disrupted services such as Discord, Facebook, and Reddit for about two hours. The route leak also affected access to some AWS services.

The root cause can be traced to BGP optimization software used by DQE, which generated routes to Cloudflare services that were intended for use only inside DQE's own network. When these routes were accidentally leaked to one of its customers, chaos ensued.
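To see why leaked routes can pull traffic away from its proper path, it helps to recall that routers forward each packet along the most specific matching prefix (longest-prefix match). The following is a minimal Python sketch of that rule. It assumes the leaked routes were more specific than the legitimately announced ones, which the account above does not state explicitly. The prefixes (taken from documentation address space) and the next-hop labels are hypothetical.

```python
import ipaddress


def best_route(routes, destination):
    """Return the (prefix, next_hop) entry with the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    # Keep only the routes whose prefix contains the destination address.
    matches = [(net, hop) for net, hop in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific prefix wins.
    return max(matches, key=lambda entry: entry[0].prefixlen)


# Hypothetical routes; 198.51.100.0/24 is reserved for documentation.
legit = (ipaddress.ip_network("198.51.100.0/24"), "origin-network")
leaked = (ipaddress.ip_network("198.51.100.0/25"), "leaking-network")

before = best_route([legit], "198.51.100.10")
after = best_route([legit, leaked], "198.51.100.10")
print("before leak:", before[1])
print("after leak: ", after[1])
```

In this toy table, the destination is routed toward the legitimate origin until the more specific /25 appears, after which the same lookup selects the leaked path. Addresses outside the leaked /25 are unaffected.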

This incident is yet another reminder that in a cloud-centric world, enterprises must have visibility into their networks if they are to successfully deliver services to users.

July 4, 2019 - Apple services disrupted

On July 4, 2019, users connecting to Apple's website and some of its services, such as Apple Pay, experienced severe packet loss for more than 90 minutes, which prevented many of them from connecting to Apple successfully. The packet loss was caused by BGP route flapping, which occurs when a route announcement is made and then withdrawn in rapid succession, often repeatedly.
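Route flapping can be illustrated with a small Python sketch that scans a time-ordered list of announce and withdraw events for a single prefix and counts how many state changes fall within a sliding time window. The event format, the 60-second window, and the threshold of four transitions are assumptions chosen for illustration. They are not taken from any BGP implementation or from the analysis of this incident.

```python
from collections import deque


def max_transitions_in_window(events, window_seconds):
    """Return the largest number of announce/withdraw state changes
    observed within any window of the given length.

    events: list of (timestamp_seconds, action) sorted by time,
            where action is "announce" or "withdraw".
    """
    # Record the time of every change from one state to the other.
    transitions = []
    previous = None
    for timestamp, action in events:
        if previous is not None and action != previous:
            transitions.append(timestamp)
        previous = action

    # Slide a window over the transition times and track the peak count.
    best = 0
    window = deque()
    for t in transitions:
        window.append(t)
        while window[0] < t - window_seconds:
            window.popleft()
        best = max(best, len(window))
    return best


def is_flapping(events, window_seconds=60, threshold=4):
    """Flag a prefix as flapping if it changes state at least `threshold` times in one window."""
    return max_transitions_in_window(events, window_seconds) >= threshold


# Hypothetical event streams for one prefix.
unstable = [(0, "announce"), (5, "withdraw"), (10, "announce"), (15, "withdraw"), (20, "announce")]
stable = [(0, "announce"), (300, "withdraw"), (900, "announce")]
print(is_flapping(unstable), is_flapping(stable))
```

Each withdrawal leaves routers without a path to the prefix until the next announcement arrives, which is one way rapid flapping can translate into packet loss for users.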

The lesson from this incident, which Apple successfully contained early on, is that outages don't happen in a vacuum. Depending on timing and circumstances, a serious outage can go unnoticed, while a minor one can cause a major commotion.

September 6, 2019 - DDoS attackers target Wikipedia

On September 6, 2019, access to Wikipedia sites around the world was disrupted for nearly nine hours due to a large and sustained distributed denial of service (DDoS) attack. DDoS attacks can overwhelm the infrastructure of a target network and create congestion within service provider networks, leading to packet loss.

During the event, HTTP server availability dropped significantly around the world and HTTP response times rose dramatically. Users in many regions could not reliably establish connections to Wikipedia's servers. The attack caused up to 60% packet loss, which further impeded access to Wikipedia sites.

While DDoS incidents occur on the Internet from time to time, organizations should proactively understand the scope and impact of these incidents and verify that DDoS mitigation measures are effective.
