Performance Agreement: API Rate Limit

Rate limiting is a key control mechanism used to manage the flow of requests to an API, much like a throttle. It is about more than just controlling the total number of requests; it is also about how and where those limits are applied. Depending on the needs of your API, rate limiting can be implemented based on a variety of factors, such as user ID, IP address, or specific types of API calls.

For example, a social platform might impose strict rate limits on posting to prevent spam while allowing more frequent read requests. Similarly, a service can apply different limits to requests from known users and anonymous traffic, using user IDs or IP addresses to differentiate between them. This flexibility makes rate limiting a versatile tool, not only for preventing overload but also for creating a fair and balanced user experience.

1. The main function of API rate limiting

API rate limiting can prevent DoS attacks and ensure that APIs are open to legitimate users. At the same time, it can fairly allocate resources, reduce operating costs, and effectively manage billing and quotas for third-party APIs to avoid unexpected charges.

  • Preventing Denial of Service (DoS) Attacks: Just as a shopping mall might restrict entry to prevent overcrowding, rate limiting prevents malicious users from flooding an API with too many requests, causing a denial of service attack. This keeps the API available to legitimate users.
  • Fair distribution of resources: It ensures that the resources of the API are evenly distributed among all users. Without rate limiting, a few heavy users may consume more than their fair share of resources, thereby degrading the quality of service for other users.
  • Managing operational costs: Especially in cloud-based environments, processing a large number of API requests can be costly. Rate limits help control these costs and prevent overuse.
  • Third-party API billing: When an API is used as part of a third-party service, rate limiting is critical to managing billing and usage quotas. It ensures that users stay within the allocated usage limits and avoid unexpected charges.

So, what are the API rate limiting methods we commonly use?

2. Token Bucket

A token bucket is a method for managing network traffic where tokens are added at regular intervals. Each request to an API requires a token. If the bucket does not have a token, the request will be rejected, thus ensuring that the API is not overloaded.

Each token represents permission to send a certain amount of data (such as an API request). When a request arrives, it can only be processed if a token is available, and then the token is removed from the bucket. If the bucket is empty, the request must wait until a new token is added. For example, if the bucket adds 5 tokens per second, and each token allows one API request, a maximum of 5 requests per second can be processed. If the request activity is low, tokens accumulate, and as long as there are enough tokens, more than 5 requests per second can be processed, which allows occasional bursts.

Token buckets allow us to control the number of requests a user can make in a given period of time, thereby preventing server overload. They can also help distribute available bandwidth among users or applications, ensuring fair use and preventing congestion.
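
To make the mechanics concrete, below is a minimal Python sketch of a token bucket limiter. The `TokenBucket` class name and its `rate` and `capacity` parameters are illustrative choices for this example, not taken from any particular library.

```python
import time

class TokenBucket:
    """Minimal token bucket: `rate` tokens are added per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum tokens the bucket can hold
        self.tokens = capacity      # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens based on the time elapsed since the last check.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1        # consume one token for this request
            return True
        return False                # bucket empty: reject (or queue) the request

# Example: 5 tokens per second, bursts of up to 20 requests allowed.
bucket = TokenBucket(rate=5, capacity=20)
print(bucket.allow())
```

In this sketch, tokens accumulate while traffic is low (up to `capacity`), which is what allows the occasional burst described above.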

The limitations of token bucket are as follows:

  • Complexity in high-speed networks: In extremely high-speed or complex network environments, the overhead of maintaining token buckets can be significant.
  • Does not prioritize traffic: It treats all requests equally, which may not be ideal in some systems where requests should have priorities.
  • Tuning difficulty: Setting the right token rate and bucket size requires understanding typical traffic patterns, which is not always trivial.

3. Leaky Bucket

The leaky bucket is another method for network traffic management and API rate limiting, focused on maintaining a consistent output rate. We can picture a bucket that leaks requests at a steady rate. If requests (water) arrive faster than the bucket drains and the bucket fills up, the excess requests overflow and are lost, which corresponds to being rejected or queued.

In contrast, a token bucket lets traffic through as long as tokens remain, and tokens accumulate over time, giving it the flexibility to handle sudden increases in traffic. A leaky bucket is more restrictive, leaking requests at a constant rate. No matter how full the bucket is, the goal is to keep the output rate constant. This can lead to underutilization during low-traffic periods and an inability to accommodate bursts of traffic. For example, an API using a leaky bucket might handle requests at a fixed rate of 5 per second, regardless of how many requests are waiting. With a token bucket, by contrast, the API can handle a burst of 20 requests in a second if enough tokens are available, and then return to a slower rate once the tokens are exhausted.

Leaky buckets are very useful for networks that require consistent data flow and can be used for network traffic shaping. Although not as flexible as token buckets, they are suitable for APIs with stable traffic. At the same time, they avoid sudden traffic peaks and help prevent congestion.
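
As an illustration, here is a minimal leaky bucket sketch in Python, written in the same style as the token bucket example; the `LeakyBucket` class and its `leak_rate` and `capacity` parameters are assumptions made for this example.

```python
import time

class LeakyBucket:
    """Minimal leaky bucket: requests fill the bucket, which drains at `leak_rate` per second."""

    def __init__(self, leak_rate: float, capacity: float):
        self.leak_rate = leak_rate  # requests drained (processed) per second
        self.capacity = capacity    # maximum queued requests before overflow
        self.level = 0.0            # current fill level of the bucket
        self.last_leak = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drain the bucket at a constant rate, regardless of how full it is.
        self.level = max(0.0, self.level - (now - self.last_leak) * self.leak_rate)
        self.last_leak = now
        if self.level < self.capacity:
            self.level += 1         # accept the request into the bucket
            return True
        return False                # bucket full: the request overflows and is rejected

# Example: drain 5 requests per second, queue at most 10.
bucket = LeakyBucket(leak_rate=5, capacity=10)
print(bucket.allow())
```

Note that the bucket drains at a constant rate no matter how full it is, which is exactly the behavior that keeps the output rate steady but makes bursts impossible.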

Limitations of the leaky bucket:

  • No burst handling: Unlike token bucket, it cannot handle sudden traffic spikes efficiently.
  • Potential underutilization: May leave capacity idle during periods of low traffic.
  • Consistent but inflexible: While it ensures a steady rate, it lacks adaptability to changing traffic, which may be necessary for some applications.

4. Fixed Window Counter

Fixed window counter is a rate limiting strategy for managing API requests and network traffic based on setting a fixed limit on the number of requests that can be made within a specified time window. In this approach, time is divided into fixed intervals or windows, and the maximum number of requests allowed is set for each window. Once the limit within that window is reached, no more requests will be accepted until the next window starts.

While token buckets and leaky buckets allow some flexibility in the rate at which requests are processed, fixed window counters are more restrictive: once the limit for a window is reached, no more requests will be processed until the next window starts, regardless of actual capacity or demand. Consider an API with a rate limit of 100 requests per hour. Under a fixed window counter, if 100 requests are received in the first 30 minutes, no further requests will be processed for the remaining 30 minutes of that hour, no matter how much spare capacity the server has.
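
A minimal fixed window counter can be sketched in a few lines of Python; the `FixedWindowCounter` class and its parameters are illustrative assumptions, and a production version would typically keep the counter in a shared store such as Redis rather than in process memory.

```python
import time

class FixedWindowCounter:
    """Minimal fixed window counter: at most `limit` requests per `window_seconds`."""

    def __init__(self, limit: int, window_seconds: int):
        self.limit = limit
        self.window_seconds = window_seconds
        self.window_start = int(time.time()) // window_seconds
        self.count = 0

    def allow(self) -> bool:
        current_window = int(time.time()) // self.window_seconds
        if current_window != self.window_start:
            # A new window has started: reset the counter.
            self.window_start = current_window
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # limit reached: reject until the next window

# Example: 100 requests per hour.
limiter = FixedWindowCounter(limit=100, window_seconds=3600)
print(limiter.allow())
```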

In e-commerce, by using fixed window counters, we can ensure the security of the transaction process by limiting the number of transactions that a user can initiate within a set time frame, thereby enhancing security against fraud. In cloud services, resource usage can be controlled by setting limits on API calls for operations such as starting or stopping virtual machines, thereby ensuring fair resource allocation. We can also manage data transfer from IoT devices to servers, which is critical to preventing server overload and facilitating interval data analysis.

Limitations of fixed window counters:

  • Inflexibility to traffic bursts: Unlike token bucket, it cannot adapt to sudden surges in traffic within the window.
  • Potential for inefficiency: Can leave capacity underutilized, especially if the limit is reached early in the window.
  • "Window reset" issue: Users may experience a sudden influx of permission requests when a new window starts, which may create an uneven server load.
  • Window edge bursting problem: A significant flaw is susceptibility to bursts of traffic at the edge of the window. Consider a scenario where a large number of requests come in before a window reset, and a similar surge occurs immediately after the reset. This can result in more requests than the expected quota being processed in a short period of time, potentially overwhelming the system. Such bursts at the "edge of the window" can create spikes in server load and reduce the effectiveness of rate limiting policies.

5. Sliding Window Log

Sliding window log is a sophisticated approach to API rate limiting and network traffic management. Unlike fixed windows, this approach takes into account the timing of each individual request, providing a more dynamic approach. It keeps a log of the timestamp of each incoming request. The rate limit is then determined based on the number of requests in the current sliding window (a continuously moving time frame). If the number of requests in this window exceeds a threshold, new requests are rejected or queued.

Fixed window counters impose strict limits on static time windows, resulting in potential bursts at the edge of each window. Sliding window logs offer a more dynamic approach that continuously adjusts over time. This prevents bursts of traffic that are common at the reset point of fixed windows. For example, an API has a limit of 100 requests per minute. In a sliding window log, this limit is constantly evaluated over the past minute. If a request comes in, the sliding window log checks all requests in the last 60 seconds. If that number is less than 100, the request is allowed; otherwise, the request is denied.
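
The following Python sketch shows one way to implement a sliding window log; the `SlidingWindowLog` class is an illustrative assumption, and a real deployment would usually store the timestamps in a shared data store rather than in process memory.

```python
import time
from collections import deque

class SlidingWindowLog:
    """Minimal sliding window log: at most `limit` requests in any rolling `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.timestamps = deque()   # log of timestamps of accepted requests

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the rolling window.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False  # already `limit` requests in the last `window_seconds`

# Example: 100 requests per rolling minute.
limiter = SlidingWindowLog(limit=100, window_seconds=60)
print(limiter.allow())
```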

Limitations of sliding window logging:

  • Resource intensive: Maintaining a log of all requests can be computationally expensive, especially for large numbers of requests.
  • Complex implementation: Its dynamic nature makes the implementation more complex compared to fixed window counters.
  • Potential latency: Under heavy traffic, continuously recalculating the sliding window can introduce delay.

6. Sliding Window Counter

The sliding window counter combines elements of the fixed window counter and sliding window log methods, aiming to manage network traffic and API requests in a more balanced way. It tracks the number of requests in a rolling time frame, which is different from a fixed time interval. It counts requests in the current window while also considering partial requests from the previous window, providing smoother transitions between time intervals.

Sliding window logs maintain a log of individual request timestamps, which can be resource intensive. Sliding window counters simplify this by counting requests in a rolling window, reducing the computational overhead compared to recording the timestamp of every request. Sliding window counters are ideal for APIs that experience steady traffic with occasional bursts, as they prevent sudden rejections when managing peak loads. They are also useful in CDNs with variable request rates, helping ensure efficient content delivery without overloading the network.
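
A minimal sketch of a sliding window counter in Python might look like the following; the `SlidingWindowCounter` class and the weighting formula shown are one common way to approximate the rolling count, not the only possible implementation.

```python
import time

class SlidingWindowCounter:
    """Minimal sliding window counter: weighs the previous window's count by its
    remaining overlap with the rolling window, avoiding a per-request timestamp log."""

    def __init__(self, limit: int, window_seconds: float):
        self.limit = limit
        self.window_seconds = window_seconds
        self.current_window = int(time.time() // window_seconds)
        self.current_count = 0
        self.previous_count = 0

    def allow(self) -> bool:
        now = time.time()
        window = int(now // self.window_seconds)
        if window != self.current_window:
            # Roll the windows forward; if more than one window has passed,
            # the previous count is effectively zero.
            self.previous_count = self.current_count if window == self.current_window + 1 else 0
            self.current_count = 0
            self.current_window = window
        # Fraction of the current window that has already elapsed.
        elapsed = (now % self.window_seconds) / self.window_seconds
        # Estimate: part of the previous window still overlaps the rolling window.
        estimated = self.previous_count * (1 - elapsed) + self.current_count
        if estimated < self.limit:
            self.current_count += 1
            return True
        return False

# Example: 100 requests per rolling minute.
limiter = SlidingWindowCounter(limit=100, window_seconds=60)
print(limiter.allow())
```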

Limitations of sliding window counters:

  • Slightly more complex than fixed window: While not as resource intensive as the logging approach, it is more complex than a basic fixed window counter.
  • Possibility of overestimation: In some scenarios, more requests than expected may be allowed due to overlapping windows.

7. Rate Limit Characteristics and Countermeasures in Large Model Applications

If you receive an HTTP status code 429 error in a large model application, it means that we are subject to the rate limit constraints of the large model API. For example, Azure OpenAI imposes limits on "tokens per minute" (TPM) and "requests per minute" (RPM). Understanding and managing these limits is important to maintain smooth operations and avoid interruptions.

7.1 Basic Concepts Related to Large Model Rate Limits

To understand the reason for the rate limit, we need to review some basic concepts in large model applications (taking Azure OpenAI as an example).

  • Token: A unit of text processed by an OpenAI model, used to measure input and output text, affecting usage and calculation. 1 token in English ≈ 4 characters. Different OpenAI models have different token input limits, such as GPT-3.5 Turbo, GPT-4, etc.
  • Quotas: These are the maximum number of requests, tokens, or compute resources that an OpenAI API user can use within a specific time frame.
  • TPM (Tokens Per Minute): Rate limiting based on a count of tokens being processed by the request as it is received. This is essential for rate limiting, but is different from the count used for billing, which is calculated after processing.
  • RPM (Requests Per Minute): RPM is derived from TPM; every 1,000 TPM converts to 6 RPM.
  • Billing Model: There are two types. In the PAYG (Pay-As-You-Go) model, users are charged based on actual usage, which is flexible and cost-effective for varying workloads. In the PTU (Provisioned Throughput Unit) model, a prepaid plan allocates a certain level of capacity to users and is ideal for predictable, high-volume usage.

If a large model application uses large prompts or long completions, the server will throttle on TPM even before the RPM limit is reached. If the workload uses short prompts and completions but issues a large number of API requests, the service will throttle on RPM instead. The factors that feed into the TPM evaluation are as follows:

  • Prompt text: The known number of tokens sent in the prompt.
  • Max_Tokens: The requested upper bound on completion tokens; higher values may result in error code 429.
  • Best_of: The number of completions requested from the LLM.

7.2 Estimating Rate Limit Behavior

Assume a TPM quota of 100,000 tokens per minute and an RPM quota of 600 requests per minute (based on the conversion of approximately 6 RPM per 1,000 TPM). Then:

(1) For scenarios where tokens are used extensively:

The application processes large documents and requires a large number of tokens for each request. If each request uses approximately 1,000 tokens, using the hypothetical TPM quota, a maximum of 100 requests per minute can be made (100,000 tokens / 1,000 tokens per request). If the application attempts to process all 100 requests in the first 10 seconds, the server will throttle the requests, resulting in an HTTP 429 error. This is because rate limiting is calculated over a shorter period of time (1 or 10 seconds) to ensure an even distribution.

(2) For scenarios with a large number of requests:

The application handles short prompts and requires fewer tokens per request, using about 100 tokens per request. If the TPM quota is 100,000, then up to 1,000 requests could be made per minute (100,000 tokens / 100 tokens per request). Despite the high number of requests, the RPM quota will cap it at 600 requests per minute (6 RPM per 1,000 TPM). If the application exceeds this limit, the server will throttle requests even though the total token usage is within the TPM quota.
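
The arithmetic in both scenarios reduces to taking the smaller of the TPM-derived cap and the RPM quota, as this small Python sketch (using the hypothetical quotas above) shows:

```python
def max_requests_per_minute(tpm_quota: int, rpm_quota: int, tokens_per_request: int) -> int:
    """Effective request cap is the lower of the TPM-derived cap and the RPM quota."""
    tpm_cap = tpm_quota // tokens_per_request
    return min(tpm_cap, rpm_quota)

# Token-heavy scenario: 1,000 tokens per request -> TPM is the binding limit.
print(max_requests_per_minute(100_000, 600, 1_000))   # 100
# Request-heavy scenario: 100 tokens per request -> RPM is the binding limit.
print(max_requests_per_minute(100_000, 600, 100))     # 600
```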

7.3 Responding to Rate Limits

The general approaches to handling rate limits for large model APIs are as follows:

  • Set the "max_tokens" and "best_of" parameters to smaller values. Avoid using larger max_tokens values ​​if the expected responses are small.
  • Quota Management: Increase TPM for high-traffic deployments and reduce TPM for limited needs.
  • Implement retry logic: Make sure your LLM application can handle retries.
  • Increase traffic gradually: Avoid drastic changes in workload and increase it gradually.

Moreover, we can use SDKs like LangChain to implement load balancing at the client level. This approach distributes API requests across multiple clients, reducing the possibility of hitting rate limits. In addition, use Azure API Management (APIM) to create custom policies to manage and distribute load more effectively. You can learn how to implement load balancing with APIM here.
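
As a sketch of the retry logic mentioned above, the following Python function retries a call when it fails with HTTP 429, using exponential backoff with jitter. The `call_llm` callable and the `status_code` attribute on the exception are assumptions for this example; in practice you would catch the specific rate-limit error type raised by your client library.

```python
import random
import time

def call_with_backoff(call_llm, max_retries: int = 5):
    """Retry a callable that fails with HTTP 429, using exponential backoff and jitter.
    `call_llm` is a placeholder for whatever client call your application makes."""
    for attempt in range(max_retries):
        try:
            return call_llm()
        except Exception as exc:  # in practice, catch your client's rate-limit error type
            if getattr(exc, "status_code", None) != 429 or attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter before retrying.
            time.sleep(2 ** attempt + random.random())
```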

8. Summary

In the fast-paced API world, rate limiting is like a pace setter, ensuring everything runs smoothly and efficiently. The key is to choose the right approach, whether it is the consistency of fixed windows, the precision of sliding window logs, or the balance of sliding window counters. Simple and effective, the right rate limiting strategy is the key to a smooth digital experience, keeping services fast, reliable, and ready for the future. Likewise, we need to be prepared to respond to API rate limiting in large model applications.
