Rate limiting is a key control mechanism used to manage the flow of requests to an API, much like a throttle. Rate limiting is about more than just capping the total number of requests; it is also about how and where those limits are applied. Depending on the needs of your API, rate limiting can be implemented based on a variety of factors, such as user ID, IP address, or specific types of API calls. For example, a social platform might impose strict rate limits on posting to prevent spam while allowing more frequent requests to read content. Similarly, a service can apply different limits to known users and anonymous traffic, using user IDs or IP addresses to differentiate them. This flexibility makes rate limiting a versatile tool, not only for preventing overload but also for creating a fair and balanced user experience.

1. The main functions of API rate limiting

API rate limiting can prevent DoS attacks and ensure that APIs remain available to legitimate users. At the same time, it can allocate resources fairly, reduce operating costs, and effectively manage billing and quotas for third-party APIs to avoid unexpected charges.
So, what are the API rate limiting methods we commonly use?

2. Token Bucket

A token bucket is a method for managing network traffic in which tokens are added to a bucket at regular intervals. Each request to an API consumes a token. If the bucket has no tokens, the request is rejected, ensuring that the API is not overloaded.

Each token represents permission to send a certain amount of data (such as one API request). When a request arrives, it can only be processed if a token is available, and that token is then removed from the bucket. If the bucket is empty, the request must wait until a new token is added. For example, if the bucket gains 5 tokens per second and each token allows one API request, a maximum of 5 requests per second can be sustained. If request activity is low, tokens accumulate, and as long as enough tokens remain, more than 5 requests can be processed in a single second, which allows occasional bursts.

Token buckets let us control the number of requests a user can make in a given period of time, preventing server overload. They can also help distribute available bandwidth among users or applications, ensuring fair use and preventing congestion. The main limitations of the token bucket are that the bursts it permits can still momentarily overwhelm downstream systems, that the bucket size and refill rate must be tuned carefully for each workload, and that in a distributed deployment the bucket state has to be shared or synchronized across nodes.
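As a concrete illustration, here is a minimal single-process sketch of a token bucket in Python. The class name and parameters are illustrative, not a production implementation (which would also need locking and shared state across servers):

```python
import time

class TokenBucket:
    """Token bucket: tokens refill at a fixed rate; each request consumes one."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: 5 tokens/second with a burst capacity of 20, as in the text above.
bucket = TokenBucket(rate=5, capacity=20)
print(sum(bucket.allow() for _ in range(25)))  # roughly 20: the burst drains the bucket
```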
3. Leaky Bucket

The leaky bucket is another method for network traffic management and API rate limiting, focused on maintaining a consistent output rate. We can imagine a bucket that leaks requests at a steady rate. If requests (water) arrive too fast and the bucket fills up, the excess requests overflow and are lost, which corresponds to being rejected or queued.

In contrast, a token bucket admits traffic as long as tokens remain, and tokens accumulate over time, giving it the flexibility to handle sudden increases in traffic. A leaky bucket is more restrictive: it leaks requests at a constant rate no matter how full the bucket is, because the goal is to keep the output rate constant. This can lead to underutilization during low-traffic periods and means that bursts of traffic cannot be accommodated. For example, an API using a leaky bucket might handle requests at a fixed rate of 5 per second regardless of how many requests are waiting, whereas a token bucket could handle a burst of 20 requests in one second if enough tokens were available and then return to a slower rate once the tokens were exhausted.

Leaky buckets are very useful for networks that require a consistent data flow and can be used for traffic shaping. Although not as flexible as token buckets, they suit APIs with stable traffic, and by smoothing out traffic peaks they help prevent congestion. The main limitations of the leaky bucket are that it cannot absorb legitimate bursts, and that requests queued behind a full bucket can suffer long delays or become stale before they are processed.
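A comparable sketch of a leaky bucket, here modeled as a counter that drains at a constant rate and rejects overflow. Names and parameters are illustrative; a queue-based variant would delay excess requests rather than reject them:

```python
import time

class LeakyBucket:
    """Leaky bucket: requests fill the bucket; it drains at a constant rate.
    Requests arriving while the bucket is full overflow and are rejected."""

    def __init__(self, leak_rate: float, capacity: int):
        self.leak_rate = leak_rate  # requests drained per second
        self.capacity = capacity    # bucket depth before overflow
        self.level = 0.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket according to elapsed time.
        self.level = max(0.0, self.level - (now - self.last) * self.leak_rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False

# Example: drain 5 requests/second; a sudden burst of 20 fills the
# 10-slot bucket and the rest overflow.
bucket = LeakyBucket(leak_rate=5, capacity=10)
print(sum(bucket.allow() for _ in range(20)))  # only the first 10 are accepted
```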
4. Fixed Window Counter

The fixed window counter is a rate limiting strategy for managing API requests and network traffic based on a fixed limit on the number of requests allowed within a specified time window. Time is divided into fixed intervals, or windows, and a maximum number of requests is set for each window. Once the limit within a window is reached, no more requests are accepted until the next window starts.

While token buckets and leaky buckets allow some flexibility in the rate at which requests are processed, fixed window counters are more restrictive: once the limit is reached, no further requests are processed until the next window begins, regardless of actual capacity or demand. Consider an API with a rate limit of 100 requests per hour. Under a fixed window counter, if 100 requests are received in the first 30 minutes, no further requests will be processed for the remaining 30 minutes of that hour, regardless of the actual capacity or demand on the server.

In e-commerce, fixed window counters can secure the transaction process by limiting the number of transactions a user can initiate within a set time frame, strengthening defenses against fraud. In cloud services, resource usage can be controlled by limiting API calls for operations such as starting or stopping virtual machines, ensuring fair resource allocation. They can also manage data transfer from IoT devices to servers, which is critical for preventing server overload and enabling periodic data analysis. The main limitation of fixed window counters is the boundary problem: a client can send the full limit at the end of one window and again at the start of the next, producing up to twice the intended rate across the window edge, and traffic within each window is not smoothed at all.
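A minimal sketch of a per-key fixed window counter. The key name and hour-long window mirror the example above and are illustrative; a production version would also evict counters for expired windows:

```python
import time
from collections import defaultdict

class FixedWindowCounter:
    """Fixed window: at most `limit` requests per `window` seconds, per key."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.counts = defaultdict(int)  # (key, window index) -> request count

    def allow(self, key: str = "global") -> bool:
        window_index = int(time.time() // self.window)
        if self.counts[(key, window_index)] < self.limit:
            self.counts[(key, window_index)] += 1
            return True
        return False

# Example: 100 requests per hour; the 101st in the same window is rejected.
limiter = FixedWindowCounter(limit=100, window=3600)
print(sum(limiter.allow("user-42") for _ in range(101)))  # 100
```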
5. Sliding Window Log

The sliding window log is a sophisticated approach to API rate limiting and network traffic management. Unlike fixed windows, it takes the timing of each individual request into account, providing a more dynamic approach. It keeps a log of the timestamp of each incoming request, and the rate limit is evaluated against the number of requests inside the current sliding window (a continuously moving time frame). If the number of requests in this window exceeds the threshold, new requests are rejected or queued.

Fixed window counters impose strict limits on static time windows, which invites bursts at the edges of each window. Sliding window logs offer a more dynamic approach that adjusts continuously over time, preventing the bursts that are common at the reset point of fixed windows. For example, suppose an API has a limit of 100 requests per minute. With a sliding window log, this limit is evaluated continuously over the past minute: when a request arrives, the log is checked for all requests in the last 60 seconds, and if that count is below 100 the request is allowed; otherwise it is denied. The main limitations of sliding window logging are that storing a timestamp for every request consumes memory that grows with traffic, and that scanning and evicting old timestamps on each request adds computational overhead at high request rates.
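A minimal sketch using a deque of timestamps, matching the 100-requests-per-minute example above (illustrative; a real deployment would need per-client logs in shared storage):

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: keep a timestamp per accepted request; allow a new
    request only if fewer than `limit` occurred in the trailing `window` seconds."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.log = deque()  # timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the trailing window.
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False

# Example: 100 requests per minute, evaluated continuously.
limiter = SlidingWindowLog(limit=100, window=60)
print(sum(limiter.allow() for _ in range(120)))  # 100
```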
6. Sliding Window Counter

The sliding window counter combines elements of the fixed window counter and the sliding window log, aiming to manage network traffic and API requests in a more balanced way. It tracks the number of requests over a rolling time frame rather than a fixed interval, counting requests in the current window while also weighting in part of the previous window's count, which provides smoother transitions between time intervals.

Sliding window logs maintain a log of individual request timestamps, which can be resource intensive. Sliding window counters simplify this by counting requests in a rolling window, reducing the computational overhead compared with recording every timestamp. They are ideal for APIs that experience steady traffic with occasional bursts, as they prevent sudden rejections when managing peak loads. They are also useful in CDNs with variable request rates, ensuring efficient content delivery without overloading the network. The main limitation of sliding window counters is that the rolling count is an approximation: it assumes requests in the previous window were evenly distributed, so the enforced limit can be slightly looser or tighter than intended when traffic is uneven.
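A minimal sketch of the weighted-count approximation (illustrative; as noted above, it assumes the previous window's requests were evenly spread):

```python
import time

class SlidingWindowCounter:
    """Sliding window counter: weight the previous window's count by how much
    of it still overlaps the rolling window, instead of logging every request."""

    def __init__(self, limit: int, window: float):
        self.limit = limit
        self.window = window
        self.curr_start = time.monotonic()
        self.curr_count = 0
        self.prev_count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        elapsed = now - self.curr_start
        if elapsed >= self.window:
            # Roll the window forward; if more than one full window has
            # passed, the previous window is effectively empty.
            self.prev_count = self.curr_count if elapsed < 2 * self.window else 0
            self.curr_count = 0
            self.curr_start += (elapsed // self.window) * self.window
            elapsed = now - self.curr_start
        # Estimate: current count plus the still-overlapping share of the
        # previous window's count.
        weight = 1.0 - elapsed / self.window
        if self.curr_count + self.prev_count * weight < self.limit:
            self.curr_count += 1
            return True
        return False

# Example: 100 requests per minute with smooth transitions between windows.
limiter = SlidingWindowCounter(limit=100, window=60)
print(sum(limiter.allow() for _ in range(150)))  # 100 (previous window is empty here)
```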
7. Rate Limit Characteristics and Countermeasures in Large Model Applications

If you receive an HTTP status code 429 in a large model application, it means the application has hit the rate limits of the large model API. For example, Azure OpenAI imposes limits on tokens per minute (TPM) and requests per minute (RPM). Understanding and managing these limits is important for maintaining smooth operation and avoiding interruptions.

7.1 Basic concepts related to large model rate limits

To understand the reason for rate limiting, we need to review some basic concepts in large model applications (taking Azure OpenAI as an example).
If a large model application uses large prompts or long completions, the server will throttle on TPM even when the RPM limit has not been reached. Conversely, if the workload uses short prompts and completions but issues a large number of API requests, the service will throttle on RPM. The factors that go into the TPM estimate are the size of the prompt text, the max_tokens parameter setting, and the best_of parameter setting.
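As a hypothetical illustration of how those factors combine, the sketch below pre-checks a request against a TPM quota. The function name and the exact estimation formula are assumptions for illustration; the service's own estimator is authoritative:

```python
def fits_tpm(prompt_tokens: int, max_tokens: int, best_of: int,
             tokens_used_this_minute: int, tpm_quota: int = 100_000) -> bool:
    """Pre-check a request against a TPM quota. The estimate mirrors the
    factors above: prompt size, max_tokens, and best_of (illustrative only)."""
    estimated = prompt_tokens + max_tokens * best_of
    return tokens_used_this_minute + estimated <= tpm_quota

# A 2,000-token prompt requesting up to 1,000 completion tokens, twice (best_of=2),
# when 97,000 tokens have already been consumed this minute:
print(fits_tpm(prompt_tokens=2000, max_tokens=1000, best_of=2,
               tokens_used_this_minute=97_000))  # False: would exceed the quota
```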
7.2 Estimating the rate limit

Assume a TPM quota of 100,000 tokens per minute and an RPM quota of 600 requests per minute (based on a conversion metric of approximately 6 RPM per 1,000 TPM).

(1) For token-heavy scenarios: the application processes large documents and needs many tokens per request. If each request uses approximately 1,000 tokens, then under the assumed TPM quota a maximum of 100 requests per minute can be made (100,000 tokens / 1,000 tokens per request). If the application attempts to send all 100 requests in the first 10 seconds, the server will throttle them and return HTTP 429 errors, because rate limiting is evaluated over shorter periods (1 or 10 seconds) to ensure an even distribution.

(2) For request-heavy scenarios: the application handles short prompts and needs fewer tokens per request, about 100 tokens each. With a TPM quota of 100,000, up to 1,000 requests could be made per minute (100,000 tokens / 100 tokens per request). Despite the modest token usage, the RPM quota caps this at 600 requests per minute (6 RPM per 1,000 TPM). If the application exceeds this limit, the server will throttle requests even though total token usage is within the TPM quota.

7.3 Responding to rate limits

The general approach to handling rate limits on large model APIs is as follows: retry failed requests with exponential backoff so that retries are spaced out rather than compounding the overload (see the sketch below); set max_tokens as close as possible to the expected completion length, since overly large values inflate the TPM estimate; spread requests evenly across the minute instead of sending them in bursts; and request a quota increase when sustained demand genuinely exceeds the allotted limits.
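A minimal sketch of retry with exponential backoff and jitter. RateLimitError here stands in for whatever 429 exception your HTTP client or SDK actually raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error raised by your client or SDK."""

def call_with_backoff(call, max_retries: int = 5, base_delay: float = 1.0):
    """Retry a callable on rate-limit errors, waiting exponentially longer
    each time, with random jitter to avoid synchronized retries."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            # Waits ~1s, 2s, 4s, 8s, ... plus up to 1s of jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 1))
    raise RuntimeError("still rate limited after all retries")

# Usage with a hypothetical client:
# result = call_with_backoff(lambda: client.chat.completions.create(...))
```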
Moreover, we can use SDKs such as LangChain to implement load balancing at the client level. This approach distributes API requests across multiple clients, reducing the chance of hitting rate limits. In addition, Azure API Management (APIM) can be used to create custom policies that manage and distribute load more effectively; the official documentation shows how to implement load balancing with APIM.

8. Summary

In the fast-paced API world, rate limiting acts like a pace setter, keeping everything running smoothly and efficiently. The key is to choose the right approach, whether it is the consistency of fixed windows, the precision of sliding window logs, or the balance of sliding window counters. Simple and effective, the right rate limiting strategy is the key to a smooth digital experience, keeping services fast, reliable, and ready for the future. Likewise, we need to handle API rate limiting well in large model applications.