After working for more than 6 years, I still don’t understand the principles and techniques of coroutines

Preface

Hello, my friends!

Dabai has worked in backend development for more than 6 years and has written C/C++, Python, and Go. Whenever coroutines come up, the only keywords that spring to his mind are yield, async, go, and so on.

However, my understanding of coroutines has always been vague, so I decided to figure it out.

Reading the full text should take about 10 minutes — the time of a few short videos — and you will learn far more. A good deal any way you look at it!

The birth of the concept of coroutines

Let me start with a rough conclusion: a coroutine is, above all, a design idea; what we usually talk about are merely its concrete implementations.

Once you grasp the idea, the technical points become very simple. Let's trace how the idea and the techniques came about, starting from the beginning.

Ancient Artifact COBOL

The concept of coroutines appeared earlier than threads and can be traced back to the 1950s. Any discussion of coroutines has to mention COBOL, one of the earliest high-level programming languages and one with remarkable staying power.

At first I thought that COBOL had long disappeared in the long river of history, but I was wrong.

COBOL is a procedural high-level programming language used mainly for data processing, and was for a long time the most widely used high-level language in the world. COBOL is an abbreviation of Common Business-Oriented Language.

Even today there are roughly 10,000 mainframes in operation, and about 200 billion lines of code across 38,000+ legacy systems are written in COBOL — as much as 65% of that legacy code. Many government and corporate organizations in the United States still run on COBOL, so its influence remains huge.

Back in 1958, the American computer scientist Melvin Conway began studying compiler optimization for COBOL on tape storage. It was a very hot topic at the time, and many young talents threw themselves into it — including the future Turing Award winner Donald Ervin Knuth, who also wrote an optimizing compiler.

Looking at the profiles of these two people, I was silent:

Melvin Conway is a heavyweight in his own right: he is the originator of the famous Conway's Law.

Donald Ervin Knuth is a pioneer of algorithms and programming techniques, winner of the 1974 Turing Award, and inventor of the computer typesetting system TeX and the font design system METAFONT. He is renowned worldwide for these achievements and a large body of creative, far-reaching work. His "The Art of Computer Programming" was listed by American Scientist magazine as one of the 12 most important physical-science monographs of the 20th century.


So what is the problem that made these geniuses devote so much energy? Come and take a look!

Technical Difficulties of COBOL Compilers

We all know that a high-level programming language needs a compiler to produce a binary executable. The basic stages of a compiler include reading the character stream, lexical analysis, syntax analysis, semantic analysis, code generation, and code optimization.

This pipeline feeds each stage's output into the next, keeping intermediate results in memory. That is easy on modern machines, but for the COBOL compilers of decades ago it was hard, due to the limits of both software and hardware.

In 1958 storage technology was still primitive. Magnetic tape had first been used as computer storage only in 1951, so the COBOL toolchain of that era depended heavily on tape.

In fact, I searched a lot online for what exactly the compiler's problem was, and found only one concrete point: the compiler could not finish the entire compilation in a single read of the tape — that is, the one-pass compiler had not yet been achieved.

A COBOL program of the time was stored on tape, which did not support random access and could only be read sequentially. At the same time, memory could not hold the tape's entire contents, so if compilation did not finish in one read, the tape had to be read again from the beginning.

So I imagined two possible multi-pass interactions between the COBOL compiler and the tape:

Possible scenario 1

For a COBOL compiler to complete lexical analysis and syntax analysis, the program's source code must be read from tape. In earlier compilers, lexical analysis and syntax analysis were independent of each other, which means:

  • Lexical analysis requires running the tape from beginning to end
  • Syntax analysis requires running the tape from beginning to end

Possible scenario 2

Anyone who has listened to cassette tapes knows the two basic operations: rewind and fast-forward.

To complete the compiler's lexical and syntax analysis, the tape would have to be rewound and fast-forwarded repeatedly to find the parts each analysis needs — similar to a disk seek, with the head moving back and forth — and the tapes of that era did not necessarily support random access at all.

From what information survives, the various stages of COBOL compilers at the time were independent of each other. This combination of software and hardware limits made one-pass compilation impossible.

Collaborative Solutions

In Melvin Conway's compiler design, lexical analysis and syntax analysis run together instead of as independent passes, as in other compilers. The two modules run interleaved, with the compiler's control flow switching back and forth between them:

  • When the lexical analysis module has produced enough tokens from the input morphemes, control transfers to the syntax analysis module
  • When the syntax analysis module has consumed all available tokens, control transfers back to the lexical analysis module
  • Each module maintains its own running state and can voluntarily yield and later resume

It can be seen that the core idea of ​​this solution is:

The cooperative mechanism Melvin Conway built requires each participant to remember its own state when yielding control flow, so that it can resume execution from the yield point when control returns. In short, the whole spirit of coroutines lies in the active yielding and resuming of control flow.

This cooperative task flow is quite similar to hardware interrupts. Under the constraints of the time, the yield/resume cooperative scheme proposed by Melvin Conway is considered the earliest incarnation of the coroutine concept, and a new COBOL compiler could be built on this idea.
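The lexer/parser hand-off described above maps naturally onto Python generators, which are themselves a limited form of coroutine. The tiny "grammar" below (just upper-casing tokens) is purely illustrative, not Conway's actual algorithm:

```python
def lexer(source):
    """Yield one token at a time, suspending after each one."""
    for word in source.split():
        # Control transfers to the consumer at every yield,
        # and execution resumes right here on the next request.
        yield word

def parser(tokens):
    """Pull tokens on demand; the lexer runs only when asked."""
    parsed = []
    for tok in tokens:
        parsed.append(tok.upper())  # stand-in for real grammar rules
    return parsed

print(parser(lexer("move a to b")))  # ['MOVE', 'A', 'TO', 'B']
```

Neither side ever needs the whole input in memory at once — exactly the property that mattered when the "input" was a tape that could only be read sequentially.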

In 1963, Melvin Conway also published a paper to illustrate his idea. Although more than half a century has passed, I am fortunate to have found this paper:

https://melconway.com/Home/pdf/compiler.pdf

To be honest, the paper is hard going — it was written so long ago that it is difficult to connect with, and in the end I gave up. Otherwise I might have pinned down exactly what problems those early compilers had.


Underappreciated coroutines

Although the concept of coroutines appeared earlier than threads, coroutines never took center stage — a talent left unappreciated, you might say.

Back in school, our teachers taught us software design ideas, and the mainstream languages advocated top-down programming:

Decompose the task to be completed: define, design, program, and test the problem at the top level first, pushing each unresolved problem down as a subtask for the next level.

Repeating this — defining, designing, programming, and testing level by level — until every problem is solved by an actual program yields a hierarchical design.

The C language is a typical embodiment of top-down thinking: with the main function as the entry point, modules form a hierarchy of calls, and each module's sub-modules form their own hierarchy below it.

However, the cooperative, mutually-scheduling idea of coroutines clashes with top-down design. Coroutine modules are strongly coupled to each other, which violates the principle of high cohesion and low coupling. Top-down design, by contrast, gives a clear program structure, clear hierarchical calls, and very good readability and maintainability.

Compared with threads, a cooperative task system lets the participants decide when to yield, which costs far less time than the operating system's preemptive scheduling. The latter must save a lot of state on every thread switch in order to restore the scene later, and it switches very frequently, consuming more resources.

In general, coroutines are entirely user-mode behavior: the programmer decides when to yield control, and very few resources go into saving, switching, and restoring state — which fits the goal of improving processor efficiency.

So we can't help but ask: coroutines look good, why haven't they become mainstream?

  • The idea of coroutines did not fit the mainstream of the time
  • Preemptive threads solved most problems, so users felt little pain

In other words: threads do well the things coroutines can do, and for the time being users did not need the things threads do poorly — so coroutines sat on the shelf.

In fact, although coroutines made few waves on x86, a preemptive task system depends on CPU hardware support and thus has relatively high hardware requirements. Cooperative scheduling is a perfect fit for some embedded devices, so coroutines did find use in that other field.

The rise of coroutines

We never stop squeezing the CPU.

For the CPU, tasks fall into two categories: compute-intensive and IO-intensive.

Compute-intensive tasks can already drive the CPU to full utilization, but IO-intensive tasks have always been the hard part of improving CPU utilization.

The pain of IO-intensive tasks

Under preemptive scheduling, IO-intensive tasks do have a remedy: asynchronous + callback.

That is, when an IO operation would block — say, downloading an image — the call returns immediately; when the download completes, a callback processes the result and delivers it to the initiator.

Just like you often go to a breakfast shop, and the fried dough sticks are not ready yet, you are familiar with the boss, so you pay the money first and go to your seat to play with your phone. When your fried dough sticks are ready, the waiter will bring them to you. This is a typical asynchrony + callback.

Although async + callback looks simple in everyday life, it is a headache in program design. In some scenarios it makes a program very hard to read and hard to write correctly. Synchronous IO, by contrast, is inefficient but easy to write.

Take asynchronous image downloading as an example. An image service exposes an asynchronous interface that returns immediately after the caller's request, handing the caller a unique request ID. When the service finishes the download, it puts the result on a message queue, and the caller must keep consuming that MQ to obtain the download result.

Compared with synchronous IO, the originally linear logic is split into several pieces, each with its own state to carry across the boundaries. For most programmers, maintaining that state is a nightmare and a guaranteed breeding ground for bugs.
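The state fragmentation can be illustrated without any real network calls. In the hypothetical callback-style flow below (all names and the fake "download" are made up for illustration), one logical task is split across separate functions that must thread state through by hand:

```python
# Hypothetical callback-style flow: one logical download is split
# across two functions, with state (the request id) passed by hand.
results = {}

def request_download(img_url, on_done):
    """Stand-in for an async service: returns an id immediately."""
    req_id = hash(img_url)  # stand-in for the service's unique request ID
    # ...a real service would perform the download in the background...
    on_done(req_id, f"bytes-of-{img_url}")  # fake completion
    return req_id

def on_download_done(req_id, data):
    # The callback runs far from where the request was made, so it
    # must carry or look up all the context it needs by itself.
    results[req_id] = data

req_id = request_download("cat.png", on_download_done)
print(results[req_id])  # bytes-of-cat.png
```

Even in this toy version, the "what to do with the result" logic lives in a different function from the "make the request" logic — exactly the split-state problem the article describes.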

User-mode collaborative scheduling

With the development of network technology and high concurrency requirements, the inefficiency of preemptive scheduling in handling IO-type tasks has gradually attracted attention, and finally the opportunity for coroutines has come.

Coroutines give the programmer control over IO: when an operation would block, the coroutine hands control to other coroutines, and takes it back once they are done.

The multiple coroutines that pass execution rights to each other via yield are not caller and callee; their relationship is equal, symmetric, and cooperative.

In addition to the contradiction in design ideas, there are some other reasons why coroutines have not prevailed. After all, coroutines are not a silver bullet. Let's take a look at the problems of coroutines:

  • Coroutines cannot exploit multiple cores on their own; they must be combined with processes to run on multiple CPUs
  • The thread + callback mechanism still has plenty of vitality; coroutines cannot fully replace it
  • Having to hand over control voluntarily can starve some coroutines; preemption is fairer
  • Control transfer is decided in user mode and could be handed to malicious code; preemptive scheduling by the operating system is safer

To sum up, coroutines and threads are not at odds. The strength of coroutines is IO handling — exactly where threads are weak. Turning opposition into cooperation is what opens up a new landscape.

Programming languages that embrace coroutines

Heavy IO operations such as network operations, file operations, database operations, and message queue operations are problems that cannot be avoided in any high-level programming language and are also the key to improving program efficiency.

Older languages like Java, C/C++, and Python have turned to third-party packages for coroutine support, to make up for the languages' own shortcomings.

Newer players like Go support coroutines natively at the language level — a full embrace of coroutines, and the foundation of Go's high-concurrency capabilities.

Let's take a look at how they implement coroutines and what are the key points of implementing coroutines.

Python

Python's coroutine support has evolved over multiple versions, from partial to complete:

  • Python 2.x had limited support: generator-based yield implemented part of the idea, but not all of it
  • The third-party library gevent offered a better coroutine implementation, but it was not official
  • Python 3.4 added the asyncio module
  • Python 3.5 introduced async/await syntax
  • In Python 3.6 the asyncio module became more complete and stable
  • From Python 3.7, async and await are reserved keywords

We use the latest async/await to illustrate how Python coroutines are used:

```python
import asyncio
import logging
import os
from pathlib import Path
from time import time

import aiohttp

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

CODEFLEX_IMAGES_URLS = [
    'https://codeflex.co/wp-content/uploads/2021/01/pandas-dataframe-python-1024x512.png',
    'https://codeflex.co/wp-content/uploads/2021/02/github-actions-deployment-to-eks-with-kustomize-1024x536.jpg',
    'https://codeflex.co/wp-content/uploads/2021/02/boto3-s3-multipart-upload-1024x536.jpg',
    'https://codeflex.co/wp-content/uploads/2018/02/kafka-cluster-architecture.jpg',
    'https://codeflex.co/wp-content/uploads/2016/09/redis-cluster-topology.png',
]


async def download_image_async(session, dest_dir, img_url):
    download_path = dest_dir / os.path.basename(img_url)
    async with session.get(img_url) as response:
        with download_path.open('wb') as f:
            while True:
                # await yields control to the event loop while waiting
                chunk = await response.content.read(512)
                if not chunk:
                    break
                f.write(chunk)
    logger.info('Downloaded: ' + img_url)


async def main():
    images_dir = Path('codeflex_images')
    images_dir.mkdir(parents=False, exist_ok=True)

    async with aiohttp.ClientSession() as session:
        tasks = [download_image_async(session, images_dir, img_url)
                 for img_url in CODEFLEX_IMAGES_URLS]
        await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == '__main__':
    start = time()

    event_loop = asyncio.get_event_loop()
    try:
        event_loop.run_until_complete(main())
    finally:
        event_loop.close()

    logger.info('Download time: %s seconds', time() - start)
```

This code shows how to use async/await to implement concurrent image downloading.

Adding the async keyword before an ordinary def turns it into an asynchronous (coroutine) function. Calling it does not run the body; it returns a coroutine object, which is then executed by the event loop.

await means waiting for a task to complete — that is, yielding control, just like yield. asyncio uses an event loop (event_loop) to drive the whole process, and await may only appear inside functions marked async.

The event_loop acts as a manager, switching control between several coroutine functions.
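A stripped-down example (no network involved) makes those three points concrete:

```python
import asyncio

async def greet(name):
    await asyncio.sleep(0)   # yields control back to the event loop
    return f"hello {name}"

coro = greet("world")
print(type(coro).__name__)   # 'coroutine' -- calling greet() did not run it
result = asyncio.run(coro)   # the event loop drives it to completion
print(result)                # hello world
```

asyncio.run (available since Python 3.7) creates the event loop, runs the coroutine to completion, and closes the loop — the same job done manually with get_event_loop/run_until_complete in the download example above.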

C++

C++20 introduced a coroutine framework, but it is still very immature — in effect a low-level substrate for the experts who write coroutine libraries. It is complicated to use and has a high barrier to entry.

As C++ remains the uncrowned king of high-performance server development, major companies have made many attempts at coroutine support: boost.coroutine, WeChat's libco, libgo, and Yunfeng's coroutine library written in C.

To be honest, C++ coroutines are rather involved; I will write about them separately rather than expand here.

Go

The coroutine in Go is called a goroutine, and can be seen as an even lighter-weight, user-mode thread. Goroutines are transparent to the operating system — it cannot schedule them directly — so there must be an intermediate layer that takes them over.

Goroutines are still implemented on top of threads, because threads are the CPU's basic unit of scheduling. The Go runtime maintains a set of data structures and N threads; coroutine code is placed in queues, from which threads pick it up and execute it. This is the famous GMP model.

  • G:Goroutine

Each goroutine corresponds to a G structure, which stores the goroutine's running stack, status, and task function. A G is reusable; to be scheduled for execution, it must first be placed into a P's local queue or the global queue.

  • M:machine

M is the abstraction of an OS thread and represents the resource that actually performs computation. After binding a valid P, it enters a scheduling loop, executing Gs taken from P's local queue.

  • P:Processor

P is an abstract concept, not a physical CPU but a logical processor. When a P has a task, it needs to create or wake up a system thread M to process the task in its queue.

P determines the number of tasks that can run simultaneously; GOMAXPROCS sets the number of Ps, limiting how many system threads can be executing user-level tasks at once.

For M, P provides the relevant execution environment, including memory allocation status, task queue, etc.

The basic process of GMP model operation:

First, a G object is created, and then G is saved in P's local queue or global queue

At this point P wakes up (or creates) an M; an idle M binds a P, takes G from its queue, and then runs a scheduling loop: fetch a G → execute it → clean up → look for the next goroutine.

During the execution of M, context switching may occur at any time. When switching occurs, the execution site of the task needs to be protected so that the site can be restored in the next scheduled execution.

A goroutine's stack lives in its G object, so on a switch only the registers needed to restore the scene (SP, PC, etc.) have to be saved into G.
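GMP is Go-runtime machinery, but the flavor of its run-queue scheduling loop can be sketched in Python with generators. This is only a loose analogy — a single "thread" draining a queue of suspendable tasks, with each task's state kept in the task itself (the generator), just as a goroutine's state is kept in its G:

```python
from collections import deque

def task(name, steps):
    """A toy 'goroutine': does one unit of work, then yields control."""
    for i in range(steps):
        yield f"{name}: step {i}"

def run(queue):
    """A toy scheduling loop over a run queue: pop a task, let it run
    one step, and requeue it if it has not finished."""
    log = []
    while queue:
        g = queue.popleft()
        try:
            log.append(next(g))   # resume the task until its next yield
            queue.append(g)       # not done: back onto the run queue
        except StopIteration:
            pass                  # task finished; drop it
    return log

runq = deque([task("A", 2), task("B", 2)])
print(run(runq))  # ['A: step 0', 'B: step 0', 'A: step 1', 'B: step 1']
```

Note how nothing outside the generator needs to save or restore its "registers" — suspending at yield keeps the task's state intact, which is the same trick that makes goroutine switches so cheap.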

Summary

This article introduced the one-pass problem of COBOL compilers around 1960, the earliest backdrop of cooperative programs, and the key idea of actively yielding and resuming control flow.

Next, it covered the contradiction between the mainstream top-down software design idea and the coroutine idea, as well as the boom of preemptive scheduling and its remaining problems.

It goes on to introduce how IO-intensive tasks hinder the improvement of CPU efficiency, the asynchronous + callback solution of preemptive scheduling for IO-intensive problems, and the processing of coroutines, demonstrating the significant advantages of coroutines in processing IO-intensive tasks.

Finally, the current solution of preemptive scheduling + coroutine IO-intensive processing is explained, including the support and implementation of coroutines at the language level of Python, C++ and Go.

This article does not contain much specific content. It aims to introduce the idea of ​​coroutines and their advantages, but does not elaborate on the implementation details of coroutines in various languages.

Finally, thank you all for your patience in reading. See you next time!
