From "cloud" to "fog": cloud computing will die, replaced by distributed peer-to-peer networks

From "cloud" to "fog": cloud computing will die, replaced by distributed peer-to-peer networks

At the height of the cloud computing boom, Viktor Charypar, technical director at the British digital consultancy Red Badger, wrote an article on VentureBeat arguing that cloud services are facing their doomsday, and that peer-to-peer networks are the direction of future development.

The cloud is coming to an end. I know, that's a bold statement, and it may sound a little crazy. But bear with me and let me explain.

The conventional wisdom has long been that applications running on servers, whether web applications or mobile back ends, will live in the cloud. Amazon, Google, and Microsoft keep adding tools to their cloud services that make running software on them easier and more convenient. So hosting your code on AWS, GCP, or Azure seems like the best thing you can do: it is convenient, cheap, easy to automate, and you can scale flexibly...

So why do I predict this will all end? Here are a few reasons:

First, it cannot meet long-term scaling demands

Building a scalable, reliable, and highly available web application, even in the cloud, is quite difficult. If you do it well and your application becomes a huge success, the sheer scale will drain both your money and your energy. And even if your business is very successful, you will eventually run into the limits of cloud computing: the speed of computers and the capacity of storage are growing faster than network bandwidth.

Net neutrality debates aside, this may not look like a problem for most people yet (apart from Netflix and Amazon), but it soon will be. As video quality goes from HD to 4K to 8K, the amount of data we demand is growing dramatically, and VR data sets will soon follow.

This is a problem mainly because of the way we organize the web. There are many users who want to get content and use programs, and relatively few servers that have those programs and content. For example, when I see a funny photo on Slack and want to share it with the 20 people sitting next to me, they all have to download it from the server hosting the service, which has to send the photo 20 times.

As servers move into the cloud, into Amazon's or Google's data centers, the networks around those places need incredible throughput to handle all that data. There also have to be enormous numbers of hard drives to store everyone's data, and CPUs to push it across the network to everyone who wants it. And with the rise of streaming services, things have gotten even worse.

All of this activity requires a lot of energy and cooling, making the entire system inefficient, expensive, and environmentally unfriendly.

Second, it is centralized and fragile

Another problem with storing our data and programs centrally is availability and durability. What if Amazon's data center is hit by an asteroid or destroyed by a tornado? Or what if it simply loses power for a while? The data stored on its machines becomes temporarily inaccessible, or may even be lost for good.

We usually mitigate this by storing data in multiple locations, but that just means more data centers. That may greatly reduce the risk of accidental loss, but what about the data you really, really care about? Your wedding video, photos of your children growing up, or important public information sources like Wikipedia. All of it now lives in the cloud, on services like Facebook, Google Drive, iCloud, or Dropbox. What happens to the data when these services stop operating or lose funding? And even if they never get to that point, they limit how you can access your own data: you have to use their service, and when you share something with friends, they have to go through that service too.

Third, it requires trust, but it cannot provide guarantees

With cloud services, your friends have to trust that the data they receive really comes from you, and you both have to trust the middleman passing it along. In most cases this works fine and is acceptable, but the sites and networks we use have to be registered legal entities, and regulators have the power to force them to do a lot of things. In most cases that is a good thing and can be used to help solve crimes or remove illegal content from the web, but there are also many cases where this power has been abused.

Just weeks ago, the Spanish government did everything it could to prevent a referendum on independence in Catalonia, including blocking information websites telling people where to vote.

Fourth, it makes our data more vulnerable to attacks

The really scary side of a highly centralized internet is the centralization of personal data. The big companies that provide services to us all have a ton of data on us—data that contains enough information to predict what you’ll buy, who you’ll vote for, what house you’re likely to buy, and even how many children you’re likely to have. This information is enough to get a credit card, a loan, or even buy a house in your name.

And you might be fine with that. After all, you chose their service and you trust them. But they are not the ones you need to worry about; it's everyone else. Earlier this year, the credit reporting agency Equifax lost the data of 140 million customers in one of the largest data breaches in history. That data is now public. We could treat this as a once-in-a-decade event that could have been avoided had we been more careful, but it is becoming increasingly clear that data breaches like this are hard to avoid entirely, and too dangerous to tolerate when they do occur. The only way to truly prevent them is not to collect data at that scale in the first place.

So, what will replace the cloud?

The internet as we know it, powered mostly by client-server protocols like HTTP and secured by trust in central authorities such as the certificate authorities behind TLS, is flawed in ways that lead to problems that are largely difficult or impossible to solve. It's time to look for something better: a model in which nobody else stores your personal data at all, where large media files are spread across the network, and where the whole system is entirely peer-to-peer and serverless (and I don't mean "serverless" in the cloud-hosted sense; I mean literally no servers).

I have read a lot of literature in this area and have become quite convinced that peer-to-peer is the inevitable direction of our future. Peer-to-peer network technology replaces the building blocks of the network as we know it with protocols and strategies that solve most of the problems I mentioned above. The goal is a fully distributed, permanently redundant data storage, where every user participating in the network is storing a copy of some of the available data.

If you've heard of BitTorrent, then the following should sound familiar. On BitTorrent, network users can break large data files into smaller chunks or fragments (each chunk has a unique ID) without the need for authorization from any central authority. To download a file, all you need is a "magic" number, a hash, which is a fingerprint of the content. Then, your BitTorrent client will use the "content fingerprint" to find users who have the file fragments, and download the file fragments from them one by one until you have all the fragments.
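
To make the chunk-and-fingerprint idea concrete, here is a minimal Python sketch of content addressing (my own illustration, not BitTorrent's actual piece or metainfo format): a file is cut into fixed-size pieces, each piece gets a hash, and the whole file is identified by a hash of those piece hashes.

```python
# A sketch of BitTorrent-style content addressing (illustration only, not the
# real wire format): split a file into fixed-size pieces and identify each
# piece, and the file as a whole, by a SHA-1 fingerprint.
import hashlib

PIECE_SIZE = 256 * 1024  # 256 KiB pieces, a common BitTorrent piece size

def piece_hashes(data: bytes) -> list[str]:
    """Return the SHA-1 fingerprint of every fixed-size piece of `data`."""
    return [
        hashlib.sha1(data[i:i + PIECE_SIZE]).hexdigest()
        for i in range(0, len(data), PIECE_SIZE)
    ]

def content_id(data: bytes) -> str:
    """A single 'magic number' for the whole file: the hash of its piece hashes."""
    return hashlib.sha1("".join(piece_hashes(data)).encode()).hexdigest()

if __name__ == "__main__":
    photo = b"funny picture bytes" * 100_000
    print(content_id(photo))         # anyone with the same bytes derives the same ID
    print(len(piece_hashes(photo)))  # number of pieces peers can serve in parallel
```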

An interesting question is how peers are matched up. BitTorrent's distributed tracker uses a protocol called Kademlia. In Kademlia, each peer on the network has a unique ID number of the same length as the unique block IDs. A block with a particular ID is stored on the nodes whose IDs are "closest" to the ID of the block. Since both peer IDs and block IDs are effectively random, their distribution should be fairly uniform across the network. Block IDs, however, do not need to be chosen at random; instead they are a cryptographic hash, a unique fingerprint of the content of the block itself, which makes the blocks content-addressable. It also makes it easy to verify the content of a block (by recomputing the fingerprint and comparing) and ensures that users cannot be served data other than the original.
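
Here is a tiny sketch of Kademlia's closeness rule, with deliberately short integer IDs for readability (real Kademlia IDs are much longer, e.g. 160 bits): the distance between two IDs is their bitwise XOR, and a block is stored on the peers whose IDs are closest to the block's ID.

```python
# A sketch of Kademlia's "closeness" rule with toy-sized IDs: distance is
# bitwise XOR, and a block lives on the k peers closest to the block's ID.
def xor_distance(a: int, b: int) -> int:
    return a ^ b

def closest_peers(block_id: int, peer_ids: list[int], k: int = 3) -> list[int]:
    """Return the k peers responsible for storing the block with `block_id`."""
    return sorted(peer_ids, key=lambda p: xor_distance(p, block_id))[:k]

peers = [0b1010, 0b0111, 0b1100, 0b0001]
print(closest_peers(0b1011, peers))  # the peers whose bits best match the block ID
```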

Another interesting feature is that by embedding the ID of one block into the content of another block, you can link the two together in a way that cannot be tampered with. If the content of the linked block changes, its ID will change and the link will be broken. If the embedded link is modified, the ID of the containing block will also change.
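
A minimal sketch of that tamper-evidence property (the block layout here is invented for illustration, not IPFS's or Bitcoin's actual format): block B embeds the hash of block A, so any change to A changes A's ID and breaks the link, and any change to the embedded link changes B's own ID.

```python
# Illustration of a Merkle link: block B embeds the hash of block A, so the
# link breaks if A's content changes, and B's own ID changes if the link does.
import hashlib, json

def block_id(block: dict) -> str:
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

block_a = {"content": "original funny picture"}
block_b = {"content": "channel entry", "link": block_id(block_a)}

print(block_id(block_b))

# Tampering with A gives it a different ID, so B's stored link no longer resolves:
tampered_a = {"content": "edited picture"}
assert block_id(tampered_a) != block_b["link"]
```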

This mechanism of embedding one block's ID into another makes it possible to build blockchains like the one powering Bitcoin and other cryptocurrencies, and even more complex structures generally called directed acyclic graphs, or DAGs for short. (These links are often called Merkle links after Ralph Merkle, who invented them, so if you hear someone talking about Merkle DAGs, you now know roughly what they mean.) A common example of a Merkle DAG is a Git repository: Git keeps its commit history and all directories and files in one giant Merkle DAG.

This leads to another interesting property of content-addressed distributed storage: it is immutable. The content cannot be changed. Instead, new revisions are stored next to existing revisions. Blocks that have not changed between revisions are reused because, by definition, they have the same ID. This also means that the same file cannot be duplicated in such a storage system, translating into efficient storage. So on this new network, every unique funny picture will only exist once (although there will be multiple copies across the swarm).
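
A toy content-addressed store makes the deduplication property easy to see (an illustration, not a real protocol): because the key is derived from the value, storing identical bytes twice occupies one slot, and a revision is simply stored alongside the original.

```python
# A toy content-addressed store: the key is the hash of the value, so storing
# the same bytes twice is a no-op and "updating" just adds a new block.
import hashlib

class ContentStore:
    def __init__(self):
        self.blocks: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self.blocks[key] = data  # overwriting with identical bytes changes nothing
        return key

store = ContentStore()
id_v1 = store.put(b"funny picture")
id_dup = store.put(b"funny picture")    # same content, same ID, stored once
id_v2 = store.put(b"funny picture v2")  # a revision lives alongside the original
assert id_v1 == id_dup and len(store.blocks) == 2
```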

Protocols like Kademlia, Merkle chains, and Merkle DAGs give us the tools to model file hierarchies and revision timelines, and share them across a large P2P network. There are already some protocols using these techniques to build distributed storage that fits our needs. One that looks promising is IPFS.

Naming and sharing issues

Well, with all of this, we can solve some of the problems I raised at the beginning: we get distributed, highly redundant storage on devices connected to the network that can record the history of files and keep all versions when needed. This (almost) solves the problems of availability, capacity, persistence, and content verification. It also solves the bandwidth problem - because data is transmitted peer-to-peer, there will be no situation where the server is overwhelmed.

We also need a scalable computing resource, but this isn’t hard: everyone’s laptops and phones these days are more powerful than most applications need (including fairly complex machine learning calculations), and computing is generally scalable. So as long as we can get each device to do the necessary work for the user, there won’t be a big problem.

So now the funny picture I see on Slack can come from one of the colleagues sitting next to me instead of from Slack's servers (and no ocean gets crossed in the process). However, in order to post a funny picture, I need to update the channel (that is, the channel will no longer be the same as it was before I posted; it will have changed). That sounds simple enough, but it is the hardest part of the whole system.

The hardest part: real-time updates

The idea of an entity that changes over time is really just a human construct that gives the world a sense of order and stability in our heads. We can instead think of such an entity as an identity, a name, that takes on a series of different values (each of which is static and immutable) over time. (Rich Hickey explains this very well in his talks.) Modeling information this way in computers is more natural and leads to more natural consequences. If I tell you something, I can no longer change what I told you or make you forget it. The fact of who the President of the United States is does not change over time; it simply gets superseded by another fact of the same kind, attached to the same identity, the same name. In the Git example, a ref (a branch or tag) can point to different commits at different points in time; its value is the ID of a commit, and making a new commit replaces the value it currently holds. A Slack channel likewise represents an identity whose value keeps growing over time.
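
A small sketch of this identity-versus-value distinction, loosely modeled on a Git ref (the names and structure here are my own): snapshots are immutable and content-addressed, while the named identity is just a pointer that gets re-bound to newer snapshots.

```python
# Identity vs. value: values (snapshots) are immutable and content-addressed,
# while the named identity ("#general") is a pointer re-bound to new values.
import hashlib, json

values: dict[str, dict] = {}   # immutable, content-addressed snapshots
refs: dict[str, str] = {}      # mutable names -> current value ID

def store_value(value: dict) -> str:
    vid = hashlib.sha256(json.dumps(value, sort_keys=True).encode()).hexdigest()
    values[vid] = value
    return vid

def update_ref(name: str, value: dict) -> None:
    refs[name] = store_value(value)  # old snapshots stay put; only the pointer moves

update_ref("#general", {"messages": ["hello"]})
update_ref("#general", {"messages": ["hello", "funny picture"]})
print(values[refs["#general"]])      # the channel's *current* value
```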

The real problem is that we are not the only ones with a channel. Many people are trying to post messages and change channels, sometimes simultaneously, and someone needs to decide what the outcome should be.

In a centralized system, which is the case in almost all current web applications, there is a central entity that decides this outcome and serializes events. However, in a distributed system, everyone is equal, so there needs to be a mechanism to ensure that consensus can be reached on the network.

This is the hardest problem to solve in building a truly distributed web, and it affects nearly every application we use today. It concerns not only concurrent updates but anything that needs to update in "real time": a "single source of truth" that changes over time. The problem is especially hard for databases, and it also affects other critical services such as DNS. Registering a human-readable name for a specific block ID or range of IDs in a decentralized way means that everyone involved needs to agree on what an existing name means; otherwise two different users could see two different files under the same name. Content addressing solves this for machines (remember that a name can only ever point to the one piece of content that matches it), but not for humans.

There are a few main strategies for handling distributed consensus. One is to pick a relatively small group of managers and have them elect a "leader" who decides what the truth is (look into the Paxos and Raft protocols if you're interested); all changes then go through these managers. This is essentially a centralized system, but one that can tolerate losing the central deciding entity or the network splitting apart ("partitions").

Another approach is a proof-of-work based system like the Bitcoin blockchain, where consensus is achieved by making users solve a "puzzle" in order to write an update (e.g., to append a valid block to a Merkle chain). The puzzle is hard to solve but easy to check, and if conflicts still arise, some additional rules decide which version wins. Several other distributed blockchains instead use proof-of-stake based consensus, which reduces the energy required to reach agreement. If you're interested, you can read about proof of stake in this whitepaper from BitFury.
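
For intuition, here is a bare-bones proof-of-work puzzle in the same spirit (a simplified illustration, not Bitcoin's actual difficulty rules): find a nonce so that the block's hash starts with a fixed number of zero hex digits; finding it takes work, checking it is trivial.

```python
# A minimal proof-of-work puzzle: brute-force a nonce until the hash of
# (block data + nonce) starts with DIFFICULTY zero hex digits.
import hashlib

DIFFICULTY = 4  # number of leading zero hex digits required

def solve(block_data: str) -> int:
    nonce = 0
    while not hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY):
        nonce += 1
    return nonce

def verify(block_data: str, nonce: int) -> bool:
    return hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest().startswith("0" * DIFFICULTY)

nonce = solve("append block 42 to the chain")
print(nonce, verify("append block 42 to the chain", nonce))  # hard to find, easy to check
```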

Another approach, for specific problems, is built around CRDTs (conflict-free replicated data types), which in certain cases don't suffer from the consensus problem at all. The simplest example is an incrementing counter: if every update is just "add one", then as long as we make sure each update is applied exactly once, the order doesn't matter and the result will be the same.
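
A grow-only counter is the textbook CRDT for exactly this case; the sketch below follows the standard G-Counter design (naming is my own): each replica increments only its own slot, and merging takes per-replica maxima, so replicas converge regardless of the order in which they sync.

```python
# A minimal grow-only counter CRDT (G-Counter): replicas can merge in any
# order and still agree on the total.
class GCounter:
    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts: dict[str, int] = {}

    def increment(self) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + 1

    def merge(self, other: "GCounter") -> None:
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("alice"), GCounter("bob")
a.increment(); a.increment(); b.increment()
a.merge(b); b.merge(a)                 # merge order doesn't matter
assert a.value() == b.value() == 3
```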

There doesn't seem to be a single clear answer to this question, and there probably never will be, but a lot of smart people are working on it, and there are already many interesting solutions to choose from; you just have to pick a trade-off. The trade-off usually involves the size of the group for which you need consensus and which property of consensus you're willing to give up: availability or consistency (or, technically, partition tolerance, but that seems hard to avoid in a highly distributed system like the one we're discussing). Most applications seem to favor availability over immediate consistency, as long as the state becomes consistent within a reasonable amount of time.

Privacy issues in public file networks

An obvious problem to solve is privacy. How do you store content in a distributed swarm without making everything public? If hiding things is enough, content-addressed storage is a good fit, because in order to find something you need to know the hash of its content. So, in effect, we have three levels of privacy: public, hidden, and private. The answer for the third level seems to lie in cryptography: strongly encrypt the stored content and share the key "out of band" (on paper, over NFC, by scanning a QR code, and so on).
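
As a sketch of that third, private level (assuming the third-party Python `cryptography` package for the encryption itself): encrypt the content first, address the ciphertext by its hash in the public store, and share the key out of band.

```python
# Private data in public content-addressed storage: only ciphertext is ever
# stored in the swarm; the key travels out of band (paper, NFC, QR code, ...).
import hashlib
from cryptography.fernet import Fernet  # pip install cryptography

public_store: dict[str, bytes] = {}     # anyone can read this

key = Fernet.generate_key()             # shared only with intended recipients
ciphertext = Fernet(key).encrypt(b"wedding video bytes")

content_id = hashlib.sha256(ciphertext).hexdigest()
public_store[content_id] = ciphertext   # the swarm only ever sees ciphertext

# A recipient who learns (content_id, key) out of band can recover the data:
plaintext = Fernet(key).decrypt(public_store[content_id])
assert plaintext == b"wedding video bytes"
```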

Relying on cryptography might sound risky at first (after all, hackers always find loopholes), but it's actually no worse than what we do today. In fact, in practice it's likely better. Businesses and governments routinely store sensitive data in ways that are not shared with the public (including the very people the data is about). Instead, it's accessible only to a small number of employees of the organization that holds it, and it's protected by, at best, a password. Typically, if you have access to the systems where that data is stored, you have access to all of it.

But if we move to storing private data in what is essentially a public medium, then we have to protect it with strong cryptography so that it is useless to anyone who gets hold of it. This is the same reasoning behind developers of security-related software open-sourcing their code so anyone can look at it and find problems: knowing how a security system works shouldn't help you break it.

An interesting property of this kind of access control is that once you've given someone access to some data, they can keep that revision of it forever. You can, of course, change the encryption keys for future revisions. This is no worse than what we have today, even if it isn't obvious: anyone who gains access to some data can also copy it.

The interesting challenge in this space is building a good system for verifying identity and sharing private data among a group of people whose identities may change over time. For example, a group of collaborators in a private Git repository. This is definitely possible with some combination of private key passphrases and rotating keys, but making it a smooth experience for users can be challenging.

From Clouds to Fog

Despite the challenges still to be worked out, moving away from the cloud points to a very exciting future. On the technical side, we should see quite a few improvements from peer-to-peer networks: content-addressed storage gives us cryptographic verification of the content itself without trusted authorities; content can stay hosted forever (as long as someone remains interested in it); and we should see significant speed improvements, even far from any data center, at the edges of the developing world or even on another planet.

At some point, even data centers may become a thing of the past. Consumer devices have become so powerful and ubiquitous that computing power and storage space are available almost everywhere.

For businesses running web applications, this change will mean huge cost savings. They will also be able to worry less about the risk of downtime and focus more on adding customer value, which benefits everyone. We will still need cloud-hosted servers, but they will be just one kind of peer among many. We may also see more variety in applications, where the consumer-facing and back-office parts of the same application differ only in their access permissions.

Another huge benefit for both businesses and customers concerns customer data. When there is no longer any need to store huge amounts of customer information centrally, the risk of losing that data shrinks. Leaders in the software engineering community (such as Joe Armstrong, the creator of Erlang, whose talk on this is well worth watching) have long argued that a design in which customers send their data over the internet to a business's programs is backwards, and that businesses should instead send their programs to customers to run against private data that is never directly shared. Such a model seems far more secure, and it doesn't prevent businesses from collecting the user metrics they find useful.

Furthermore, nothing currently prevents hybrid models in which some services remain private and hold on to data.

This kind of application architecture seems like a more natural way to provide computing and software services at scale. It is also closer to the ideal of open information exchange, where anyone can easily pass content along to others, and where what can be published and accessed is not controlled by the private entities that own the servers.

For me, this is very exciting, which is why I'd like to put together a small team and, within a few weeks, build a simple mobile app using some of the technologies mentioned above, to prove the concept and show what peer-to-peer networks can do. The only idea I have so far that is small enough to build fairly quickly and still interesting enough to demonstrate the properties of this approach is a peer-to-peer, truly serverless Twitter clone, and that isn't all that exciting.
