Throttling Solutions in Standalone and Distributed Scenarios


By Jianfeng Fu (Jianfeng)


Different scenarios require different throttling algorithms. How can you choose the right throttling algorithm for your scenario? This article describes the code and ideas behind several throttling algorithms, such as simple window, sliding window, leaky bucket, token bucket, and sliding log in standalone and distributed throttling scenarios. It also summarizes their complexity and applicable scenarios. Considering the length of this article, I recommend that you add it to your favorites and read it at your convenience.

1. Throttling Scenarios

Throttling is important. In most scenarios, throttling is intended to protect limited downstream resources against traffic bursts and ensure service availability. The throttling threshold is generally flexible, allowing the limit to be increased when needed.

In other scenarios, throttling is tied to billing. For example, some cloud vendors charge based on the number of API calls. Since money is involved, the number of calls generally must not exceed the threshold.

Different throttling algorithms are applicable to different scenarios. Although Sentinel can be applied to most cases, other throttling solutions are still required for specific scenarios.

2. API Definition

For convenience, the sample code in this article is implemented against the following throttler API.

The throttler API defines a common method for applying for a single quota.

You can also define a tryAcquire(String key, int permits) overload to apply for multiple quotas at a time. The implementation ideas are the same.

Some throttling algorithms maintain a throttler instance for each key.

public interface Throttler {
    /**
     * Apply for a quota.
     *
     * @param key the key for applying for a quota
     * @return true if the application succeeds; false otherwise
     */
    boolean tryAcquire(String key);
}

3. Standalone Throttling

The simple window algorithm is also called the fixed window algorithm, to distinguish it from the sliding window algorithm described below.

Throttling limits the visits allowed within a specified time interval. Therefore, the most straightforward idea is to maintain a counter for a given time window to count the number of visits, and then enforce the following rules:

  • If the number of visits is less than the threshold, a visit is allowed, and the number of visits increases by 1.
  • If the number of visits exceeds the threshold, visits are rejected, and the number of visits remains unchanged.
  • If the time window expires, the counter is reset, and the time of the first successful visit after the reset is set as the starting time of the current window. This way, the counter counts the visits in the most recent window.

Code Implementation of SimpleWindowThrottler

/**
 * The time window in ms.
 */
private final long windowInMs;
/**
 * The maximum allowed threshold in the time window.
 */
private final int threshold;
/**
 * The start time of the current window.
 */
private long lastReqTime = System.currentTimeMillis();
/**
 * The counter.
 */
private long counter;

public boolean tryAcquire(String key) {
    long now = System.currentTimeMillis();
    // If the time window starting at lastReqTime has expired, reset the counter
    // and use the current time as the start of a new window.
    if (now - lastReqTime > windowInMs) { // #1
        counter = 0;
        lastReqTime = now; // #2
    }
    if (counter < threshold) { // #3
        counter++; // #4
        return true;
    } else {
        return false;
    }
}

In another common scenario, throttling is performed based on keys. Each key is configured with a time window and a threshold. Therefore, you need to maintain a separate throttler instance for each key.

In practice, multiple threads are usually used to apply for a quota. To briefly express the idea behind the algorithm, the sample code does not incorporate concurrent synchronization.

For example, to turn the simple window algorithm into a thread-safe throttling algorithm, you can mark the tryAcquire method as synchronized.

Alternatively, you can modify the types of read and write variables:

private volatile long lastReqTime = System.currentTimeMillis();
private LongAdder counter = new LongAdder();

However, this method is not truly thread-safe. Imagine the following scenario: threads A and B attempt to obtain quotas in quick succession. Both satisfy the condition at position #1 and proceed to position #2 to modify lastReqTime, and thread B's assignment overwrites thread A's, shifting the window start time backward. Similar races occur at positions #3 and #4. Such races are acceptable if you do not require high throttling precision.
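As a minimal sketch of the synchronized approach, the whole check-and-update can be guarded as one atomic step (the class and constructor here are illustrative, not from the original code):

```java
class SynchronizedSimpleWindowThrottler {
    private final long windowInMs;
    private final int threshold;
    private long windowStart = System.currentTimeMillis();
    private long counter;

    SynchronizedSimpleWindowThrottler(long windowInMs, int threshold) {
        this.windowInMs = windowInMs;
        this.threshold = threshold;
    }

    // synchronized makes the reset check (#1/#2) and the count check (#3/#4)
    // a single atomic step, so no thread can overwrite another's window start.
    public synchronized boolean tryAcquire(String key) {
        long now = System.currentTimeMillis();
        if (now - windowStart > windowInMs) {
            counter = 0;
            windowStart = now;
        }
        if (counter < threshold) {
            counter++;
            return true;
        }
        return false;
    }
}
```

Coarse locking like this is the simplest fix; it trades some throughput for correctness, which is usually acceptable for a per-key throttler.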

The behavior of the simple window algorithm is easy to picture. Assuming that 100 visits are allowed per minute but 200 visits arrive uniformly per minute, the visits curve of the system roughly looks as follows (the counter is reset by the minute):

[Figure: visits curve under uniform traffic]

However, if the traffic is nonuniform, assuming that only a few visits occur at the beginning of the time window (0:00) and then 10 requests arrive per second starting at 0:50, the following visits curve appears:

[Figure: visits curve with a burst near the window boundary]

In the critical 20 seconds (0:50 to 1:10), the system is visited 200 times. In other words, the system experienced twice the allowed traffic near the critical point of the time window in the worst case. The simple window algorithm cannot address this kind of critical mutation.

How can we address the critical mutation in the simple window algorithm? Since the statistical precision in a large time window is low, we can split it into smaller sub-windows and collect statistics on each sub-window separately. In addition, each time a sub-window's duration elapses, the time window slides rightward by one sub-window. This is the idea behind the sliding window algorithm.

[Figure: a 1-minute window split into six 10-second sub-windows]

As shown in the figure above, the 1-minute time window is split into six sub-windows, and a counter is maintained for each sub-window to count the visits within 10 seconds. The time window slides rightward by one sub-window every 10 seconds.

Refer to the example of critical mutation in the simple window algorithm again. Let’s look at how the sliding window algorithm eliminates critical mutation in the figure above. If 100 requests are received from 0:50 to 1:00 (corresponding to the gray grid), the next 100 requests received at 1:00 to 1:10 fall into the yellow grid. The algorithm counts the total visits in the six sub-windows. If the sum exceeds the threshold (100), the algorithm rejects the latter 100 requests.
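Following this idea, a minimal sliding window throttler can be sketched with a circular array of sub-window counters (class and field names are illustrative):

```java
class SlidingWindowThrottler {
    private final long subWindowInMs;   // size of one sub-window
    private final int sampleCount;      // number of sub-windows in the full window
    private final int threshold;        // max requests allowed in the full window
    private final long[] windowStarts;  // start time of each sub-window slot
    private final long[] counters;      // request count of each sub-window slot

    SlidingWindowThrottler(long windowInMs, int sampleCount, int threshold) {
        this.subWindowInMs = windowInMs / sampleCount;
        this.sampleCount = sampleCount;
        this.threshold = threshold;
        this.windowStarts = new long[sampleCount];
        this.counters = new long[sampleCount];
    }

    public synchronized boolean tryAcquire(String key) {
        long now = System.currentTimeMillis();
        long currentStart = now - now % subWindowInMs;
        int idx = (int) ((now / subWindowInMs) % sampleCount);
        // Reuse the slot: if it belongs to an older cycle, reset it for the current sub-window.
        if (windowStarts[idx] != currentStart) {
            windowStarts[idx] = currentStart;
            counters[idx] = 0;
        }
        // Sum the counts of all sub-windows still inside the sliding window.
        long total = 0;
        for (int i = 0; i < sampleCount; i++) {
            if (now - windowStarts[i] < subWindowInMs * sampleCount) {
                total += counters[i];
            }
        }
        if (total < threshold) {
            counters[idx]++;
            return true;
        }
        return false;
    }
}
```

Reusing array slots keeps the memory footprint fixed at sampleCount counters, which is the same design trick Sentinel's LeapArray uses.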

Sentinel provides a lightweight high-performance sliding window throttling algorithm. You can focus on the following classes when you read the code:

(1) StatisticSlot, a functional slot, records and collects statistics on the monitored runtime metrics in different dimensions, such as RT and QPS.

Sentinel adopts the slot chain mode. Functional slots provide different functions (throttling, degradation, and system protection) and are connected in series through the ProcessorSlotChain.

You can check this website for more information.

(2) StatisticSlot records the numbers of requests allowed per second and per minute using StatisticNode#addPassRequest.

(3) The Metric API used for recording corresponds to the implementation class ArrayMetric. The sliding window data structure is LeapArray.

(4) LeapArray maintains the key attributes and structures used by the sliding window algorithm, including:

(a) intervalInMs (the total window size), windowLengthInMs (the sliding sub-window size), and sampleCount (the number of samples):

sampleCount = intervalInMs / windowLengthInMs

In the current implementation, sampleCount is 2 and the total window size is 1s by default. Therefore, the default sliding sub-window size is 500 ms. You can adjust the statistical precision by modifying sampleCount.

(b) Sliding window array: Elements in the array are represented by WindowWrap, including:

  • windowStart: The start time of the sliding window
  • windowLength: The duration of the sliding window
  • value: The content recorded by the sliding window. It is a generic type. The key type is MetricBucket, which contains a set of LongAdder for recording different types of data, such as the number of allowed requests, the number of rejected requests, and the number of abnormal requests.

In other words, recording a request means locating the sliding sub-window that the current time falls into and increasing that window's statistical value by 1. However, locating the window involves quite a few steps. For details about the implementation, see the source code comments in LeapArray#currentWindow.
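The index arithmetic at the heart of that lookup can be sketched as follows. This is a simplification for illustration only; the real LeapArray#currentWindow also handles bucket creation, CAS-based resets, and reuse of expired buckets:

```java
// Simplified sketch of how a timestamp is mapped to a sliding sub-window,
// in the spirit of Sentinel's LeapArray#currentWindow (many details omitted).
class WindowLocator {
    private final int windowLengthInMs; // sub-window size
    private final int sampleCount;      // number of sub-windows

    WindowLocator(int windowLengthInMs, int sampleCount) {
        this.windowLengthInMs = windowLengthInMs;
        this.sampleCount = sampleCount;
    }

    // Index of the array slot that the timestamp falls into.
    int calculateTimeIdx(long timeMillis) {
        long timeId = timeMillis / windowLengthInMs;
        return (int) (timeId % sampleCount);
    }

    // Start time of the sub-window containing the timestamp.
    long calculateWindowStart(long timeMillis) {
        return timeMillis - timeMillis % windowLengthInMs;
    }
}
```

With the default windowLengthInMs of 500 ms and sampleCount of 2, consecutive 500 ms slices alternate between the two array slots, and a slot whose stored windowStart is older than the computed one is known to be stale and gets reset.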

The preceding process is shown in the figure below:

[Figure: Sentinel sliding window statistics process]

The preceding process is based on the source code of version 3.9.21. Earlier versions of Sentinel adopt a data structure called SentinelRollingNumber, but the implementation principle is similar.

Here comes another question. Can the sliding window algorithm accurately control the number of visits within a given time window T to be less than or equal to N?

The answer is no. Take the example of splitting 1 minute into six 10-second sub-windows. Assuming that 20 requests are received per second, and the first request is received at 0:05, 100 requests are allowed in the period from 0:05 to 0:10, and subsequent requests are rejected until the window slides rightward at 1:00. Then, another 100 requests are allowed between 1:00 and 1:05. If 0:05 to 1:05 are regarded as a 1-minute time window, the true number of requests in the window is 200, exceeding the given threshold of 100.

Theoretically, higher precision requires finer-grained splitting of the sliding window. For example, Sentinel enables you to set precision by modifying the sampleCount per unit time. The value of sampleCount is generally determined according to business needs, to achieve a balance between precision and memory consumption.

When using the sliding window algorithm for throttling, we usually see the following traffic curve:

[Figure: traffic curve under sliding window throttling]

The traffic burst reaches the throttling threshold soon after the window starts, so all subsequent requests in the remaining duration of the window are rejected. The impact is especially pronounced when the time window is large, for example, one minute or longer. In practice, we usually expect smooth throttling instead of cutting off the traffic suddenly.

The sliding window algorithm cannot ensure throttling smoothness. Again, when the traffic exceeds the threshold, we aim to control the traffic rate within a range allowed by the system instead of cutting off the traffic at once. Assuming that the average traffic rate is v, throttling is used to control the average traffic rate so that v ≤ N/T.

The leaky bucket algorithm is usually used for traffic shaping in network communication. The idea of the leaky bucket algorithm is to control the traffic rate. Think about the problem of pumping water from a pool while injecting water. Replace the pool with a bucket (one with a hole at the bottom that starts to leak as soon as the water is injected), and regard requests as water injected into the bucket. The water leaked from the bottom of the bucket represents the requests leaving the buffer and to be processed by the server. The water overflowing out of the bucket represents the requests to be discarded. Conceptual analogies are listed below:

  • Maximum Number of Requests Allowed (N): The bucket size
  • Time Window Size (T): The time required for the entire bucket to be drained
  • Maximum Traffic Rate (V): The rate at which the entire bucket of water leaks, which is equal to N/T
  • Request Throttling: Indicates that the water injection rate is greater than the water leakage rate, eventually causing an overflow of water in the bucket

Assume that the bucket is empty at the start time and each visit is injecting water of one unit volume into the bucket. In this case, when we inject water into the bucket at a rate less than or equal to N/T, the water in the bucket never overflows. Once the water injection rate exceeds the water leakage rate, an increasing amount of water accumulates in the bucket until it overflows. In addition, the water leakage rate is always controlled within N/T, thus smoothing traffic.

The traffic rate curve of the leaky bucket algorithm is shown below:

[Figure: traffic rate curve of the leaky bucket algorithm]

The code implementation of the leaky bucket algorithm is shown below:

/**
 * The amount of water left in the bucket.
 */
private long left;
/**
 * The timestamp of the last leakage calculation.
 */
private long lastInjectTime = System.currentTimeMillis();
/**
 * The bucket capacity.
 */
private long capacity;
/**
 * The time required for the bucket to be drained.
 */
private long duration;
/**
 * The water leakage rate of the bucket, which is equal to capacity/duration.
 */
private double velocity;

public boolean tryAcquire(String key) {
    long now = System.currentTimeMillis();
    // Water left in the bucket = previously left water - water leaked since the last calculation.
    // Water leaked during that period = (current time - last calculation time) × leakage rate.
    // If no water has been injected for a long time, the bucket has drained to 0.
    left = Math.max(0, left - (long) ((now - lastInjectTime) * velocity));
    // Record the calculation time so the same interval is not leaked twice.
    lastInjectTime = now;
    // If injecting one unit of water does not overflow the bucket, allow the request.
    if (left + 1 <= capacity) {
        left++;
        return true;
    } else {
        return false;
    }
}

The leaky bucket algorithm can smooth traffic. However, if traffic is nonuniform, it cannot achieve precise control, just like the sliding window algorithm. In extreme cases, the leaky bucket algorithm also accepts traffic approximately twice the threshold N within the time window T.

Assume that the number of visits far exceeds the window size N and the visits flood in at the starting time 0 of the window (0 to T), so that the leaky bucket overflows at some time t (0 < t < T, with t close to 0). Then, visits are accepted in the remaining duration of T − t at a rate of N/T. The number of visits in the entire window is N + (T − t) × N/T, which approaches 2N provided that t is small enough.

Although you can keep the number of visits within N by shrinking the bucket size, traffic is then restricted before the threshold is actually reached.

An implied constraint requires that the water leakage rate be an integer (that is, the capacity N can be exactly divided by the time window size T). Otherwise, the calculated remaining water volume has a rounding error.

In the leaky bucket model, accepting requests is regarded as injecting water into the bucket. If we regard accepting requests as pumping water out of the bucket and injected water as system-allowed traffic, the leaky bucket algorithm becomes a token bucket algorithm.

Having understood the leaky bucket algorithm, you can easily understand the token bucket algorithm. Let's look at the general principle of the token bucket algorithm:

According to the token bucket algorithm, the system generates tokens at a constant rate and puts them into a token bucket with a specific capacity. If the system puts tokens into the token bucket when it is full, excess tokens are discarded. Before handling a request, the system needs to obtain a token from the token bucket. If no token is available, the system rejects the request.

[Figure: working principle of the token bucket algorithm]

The leaky bucket algorithm is essentially the same as the token bucket algorithm, and can be transformed into the token bucket algorithm with slight variations in the code.

long now = System.currentTimeMillis();
// Refill tokens for the elapsed time, capped at the bucket capacity.
left = Math.min(capacity, left + (long) ((now - lastInjectTime) * velocity));
// Record the refill time so the same interval is not refilled twice.
lastInjectTime = now;
// If at least one token is available, consume it and allow the request.
if (left > 0) {
    left--;
    return true;
} else {
    return false;
}

To apply the token bucket algorithm in a production environment, you can use the RateLimiter provided in Guava, which is thread-safe. If you call RateLimiter#acquire when the remaining tokens are insufficient, the thread blocks until sufficient tokens are available instead of the request being rejected immediately, which is useful in some scenarios. In addition to the default SmoothBursty policy, RateLimiter provides the SmoothWarmingUp policy, which lets you specify a warm-up period during which RateLimiter smoothly increases the token release rate to the maximum value. This design fits resource providers that need a warm-up period and cannot serve each visit at a stable rate from the start (for example, services with caches that need to be refreshed regularly). Note, however, that RateLimiter throttles based only on QPS.

Although the leaky bucket algorithm and the token bucket algorithm are similar, they are applicable to different scenarios in practice:

(1) Leaky Bucket Algorithm: Controls the rate over the network. In this algorithm, the input rate can vary, but the output rate remains constant. It is usually used with a FIFO queue.

Imagine the hole of the leaky bucket is fixed in size, so the water leakage rate can be kept constant.

(2) Token Bucket Algorithm: Adds tokens to a bucket at a fixed rate and allows the output rate to vary depending on the traffic burst size.

For example, a system allows up to 60 visits in 60 seconds, which converts to a rate of 1 visit/second. If the system is not visited for a long time, the leaky bucket will be empty. Then, 60 requests flood in. After traffic shaping, the leaky bucket leaks 60 requests downstream over 1 minute at a rate of 1 request/second. With the token bucket algorithm, 60 tokens are fetched from the full token bucket at once and the requests are passed downstream immediately.

In general, the preceding algorithms can adapt to most application scenarios. However, a few scenarios require precise control, that is, require that the number of requests within any given time window T be less than or equal to N. For precise control, you need to log each user request. Upon each throttling judgment, you retrieve the number of log records in the latest time window and check whether it exceeds the throttling threshold. This is the idea of the sliding log algorithm.

Assume that the system receives a request at time t. To determine whether to allow the request, you need to check how many requests were already allowed in the time period from t − T to t. Therefore, theoretically, the number of requests since time t − T can be calculated, provided that the system maintains a queue q recording the time of each allowed request.

Only the records within a maximum time period of T before the current time are considered. Therefore, the length of the queue q changes dynamically, and its maximum length is N, because the queue q records at most N allowed visits.

The sliding log algorithm is similar to the sliding window algorithm. The difference is that the sliding log algorithm slides dynamically based on the log entry times, while the sliding window algorithm slides in fixed steps of the sub-window size.

The pseudo code for the sliding log algorithm is shown below:

# Initialization
counter = 0
q = []

# Request handling process
# 1. Find the first request whose timestamp is greater than or equal to t - T in the queue,
#    that is, the earliest request in the time window T ending at the current time t.
t = now
start = findWindowStart(q, t)
# 2. Truncate the queue, keeping only the records and counts of the last time window T.
q = q[start, q.length - 1]
counter -= start
# 3. Determine whether to allow the request, and if yes, append it to the queue q.
if counter < threshold
    push(q, t)
    counter++
    # Allow the request.
else
    # Enable throttling.

The implementation of findWindowStart depends on the data structure of the queue q. For example, you can use the binary search method for a simple array. We will also discuss the implementation with other data structures later.

If you implement the queue q with an array, you also need a solution for truncating it. A feasible idea is to maintain a pair of head and tail pointers that point to the most recent and earliest valid record indexes in the array, respectively. The implementation of findWindowStart is then to find the corresponding element between the tail and head pointers.
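With a deque instead of a raw array, findWindowStart reduces to popping expired records from the head. A minimal sketch of the sliding log idea (class name illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class SlidingLogThrottler {
    private final long windowInMs;
    private final int threshold;
    // Timestamps of allowed requests, oldest first; never longer than threshold.
    private final Deque<Long> log = new ArrayDeque<>();

    SlidingLogThrottler(long windowInMs, int threshold) {
        this.windowInMs = windowInMs;
        this.threshold = threshold;
    }

    public synchronized boolean tryAcquire(String key) {
        long now = System.currentTimeMillis();
        // Truncate: drop records that fell out of the window [now - windowInMs, now].
        while (!log.isEmpty() && now - log.peekFirst() >= windowInMs) {
            log.pollFirst();
        }
        if (log.size() < threshold) {
            log.addLast(now);
            return true;
        }
        return false;
    }
}
```

Because timestamps are appended in order, the expired records are always at the head, so each record is added and removed exactly once, giving amortized O(1) work per request.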

Although the sliding log algorithm achieves precise control, it results in apparent costs.

First, you need to save a queue with a maximum length of N, which increases the space complexity to O(N). The space usage is even higher in key-specific throttling. Certainly, you could reuse the queue of inactive keys to reduce memory consumption.

Second, you need to determine the time window in the queue, that is, find the request records not earlier than the timestamp t − T using findWindowStart. With a binary search, for example, the time complexity is O(log N).

4. Distributed Throttling

In practice, application services are usually deployed in a distributed manner. Distributed throttling is applicable if shared resources (such as databases) or dependent downstream services have traffic limits.

You can allocate throttling quotas evenly to application servers to enable standalone throttling. However, this method is ineffective in the cases of nonuniform traffic, machine downtime, and temporary scaling.

The core idea behind distributed throttling is similar to standalone throttling. The difference is that distributed throttling requires a synchronization mechanism to maintain a global quota. The synchronization mechanism can work in a centralized or decentralized mode:

(1) Centralized: A central system manages all quotas, and application processes apply for throttling quotas from the central system.

  • The central system maintains state consistency, which is easy to implement.
  • However, throttling will fail when the central system node is unavailable. Therefore, additional protection is required. For example, centralized throttling is usually degraded to standalone throttling when central storage is unavailable.

(2) Decentralized: Application processes store and maintain the states of their respective throttling quotas. Periodic asynchronous communication within a cluster maintains state consistency.

  • Compared with the centralized solution, the decentralized solution can reduce the impact of a single point of failure. However, the implementation is complex, and state consistency cannot be guaranteed.
  • In terms of consistency, availability, and partition tolerance (CAP), the decentralized solution puts more focus on A, while the centralized solution puts more focus on C.

The decentralized solution is rarely applied in production environments. Therefore, we will discuss only the idea of centralized throttling.

In the network architecture for application access, an LVS or NGINX layer is usually deployed before application servers as a unified entry. You can apply entry throttling at this layer. This is essentially a standalone throttling scenario.

Take NGINX as an example. NGINX provides the ngx_http_limit_req_module for throttling and uses the leaky bucket algorithm at the underlying layer.

In the following NGINX throttling configuration, each IP address can request the /login/ API 10 times per second.

limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;

server {
    location /login/ {
        limit_req zone=mylimit;
        proxy_pass http://my_upstream;
    }
}

You can also configure the NGINX throttling directives in other ways. For example, you can add the burst and nodelay parameters to the limit_req directive to allow bursts within a specific range, or combine the geo and map directives to implement throttling based on blacklists and whitelists. Please check the official NGINX documentation for more information: Rate Limiting with NGINX and NGINX Plus.
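For example, a variant of the earlier configuration that tolerates short bursts might look as follows (the burst value of 20 is illustrative):

limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;

server {
    location /login/ {
        # Queue up to 20 excess requests; with nodelay, queued requests are
        # forwarded immediately instead of being paced at the configured rate.
        limit_req zone=mylimit burst=20 nodelay;
        proxy_pass http://my_upstream;
    }
}

Without nodelay, the burst queue smooths excess requests out at the configured rate, which behaves much like the leaky bucket described earlier.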

If the built-in modules of NGINX cannot meet your requirements, you can use custom Lua modules. For more information, see the lua-resty-limit-traffic library provided by OpenResty.

Here, we introduce the concept of TokenServer in Sentinel. Visit this website for more information about the cluster flow control of Sentinel.

The idea of this throttling solution is to find a TokenServer to specially manage throttling quotas, including counting the total number of calls and determining whether a single request is allowed. An application server communicates with the TokenServer as a client to obtain a quota. Since the TokenServer handles the throttling logic in a unified manner, the throttling algorithms used for standalone throttling are also applicable here.

This throttling solution relies heavily on the performance and availability of the TokenServer.

Performance: The standalone TokenServer can easily result in a bottleneck. As shown in the source code of Sentinel, Netty is used for network communication, and data packets are in a custom format. Other than this, little information about performance optimization is found.

Availability: As described in the official documentation of Sentinel, to apply TokenServer-based cluster throttling to the production environment, you need to enable the following capabilities of the TokenServer:

  • Automatic Management and Scheduling: Allocation/election of the TokenServer
  • High Availability: Automatic failover to another server when a server is unavailable.

Currently, the Sentinel TokenServer does not provide these capabilities by default. You need to implement these capabilities through customization or by adding other systems. For example, you can use a distributed consistency protocol for cluster leader election or deploy a group of monitors to monitor the cluster status, which is quite costly.

The idea of storage-based throttling is to store statistical information, such as the throttling counts, in a storage system. Applications obtain statistical information from the storage system and write the latest request information into the storage system. The storage system can be an existing MySQL database or Redis cache. Considering performance, caches are more commonly used. In this article, Tair and Redis are used as examples.

This simple throttling solution can be directly implemented with code:

public boolean tryAcquire(String key) {
    // Build a Tair key at second granularity.
    String wrappedKey = wrapKey(key);
    // Each request increases the value by 1, starting from 0. The key expires after 5s.
    Result<Integer> result = tairManager.incr(NAMESPACE, wrappedKey, 1, 0, 5);
    return result.isSuccess() && result.getValue() <= threshold;
}

private String wrapKey(String key) {
    long sec = System.currentTimeMillis() / 1000L;
    return key + ":" + sec;
}

Isn’t it simple? This throttling solution can adapt to large traffic because of the high performance of Tair.

This Tair-based throttling solution is based on the same idea as the simple window algorithm. It enables QPS control for each key using each second as a time window. QPM and QPD control follows a similar principle. The essence is to use the following API of Tair:

Signature: Result<Integer> incr(int namespace, Serializable key, int value, int defaultValue, int expireTime)

Description: Increases the count. Note: Do not call the put API before you call the incr API.

Parameters:

  • namespace: The namespace allocated during the application.
  • key: The key, which is up to 1 KB.
  • value: The increment.
  • defaultValue: The initial count of the key when the incr API is called for the first time. The first return value is equal to defaultValue plus value.
  • expireTime: The data expiration time in seconds. It can be a relative time or an absolute time (a Unix timestamp). expireTime = 0 indicates that the data never expires. expireTime > 0 specifies an expiration time: if expireTime is greater than the current timestamp, it is treated as an absolute time; otherwise, it is treated as a relative time. expireTime < 0 indicates that the expiration time is of no concern: if an expiration time has been specified before, that time is kept; otherwise, the data never expires. Currently, MDB data never expires.

Return value: A result object. The return value can be negative. If the key does not exist, the first return value is equal to defaultValue plus value. Subsequent incr calls increase value based on this return value.

You also need to note the following issues when using this method:

  • The simple window algorithm is subject to critical mutation.
  • A degradation solution is required to ensure the reliability of Tair. As mentioned earlier, you usually need to enable degradation from centralized throttling to standalone throttling.
  • Time synchronization is required among machines in a cluster. Local times of the machines are used to generate keys, so they must be the same.

Even if the local times of two machines differ only slightly, for example, by 10 ms, the statistics near the division point between time windows will have a large error. For example, if the local time of one machine is 0.990 s and that of another is 1.000 s, the two machines use different keys when calling the incr API, which affects the precision.

Redis supports diverse data structures and delivers high performance. Its single-threaded execution model makes synchronization control convenient, so it is a very suitable storage system for distributed throttling.

(1) The Implementation of the Simple Window Algorithm

The idea of using Redis for throttling based on the simple window algorithm is the same as using Tair. Redis also provides the INCR command for counting, and its single-threaded model provides natural concurrency protection. The official documentation of Redis describes how to use the INCR command to implement a rate limiter. I will give a brief explanation below.

Redis INCR Key

The simplest implementation of the simple window algorithm is shown below:

FUNCTION LIMIT_API_CALL(ip)
    ts = CURRENT_UNIX_TIME()
    keyname = ip + ":" + ts
    current = GET(keyname)
    IF current != NULL AND current > 10 THEN
        ERROR "too many requests per second"
    ELSE
        MULTI
            INCR(keyname, 1)
            EXPIRE(keyname, 10)
        EXEC
        PERFORM_API_CALL()
    END

Similar to the Tair implementation, this implementation maintains a counter for each key at second granularity. The difference is that Redis does not provide an atomic INCR + EXPIRE command. Therefore, after running the INCR command, you need to call the EXPIRE command to set the validity period of the key, and wrap both commands in MULTI and EXEC to make them transactional.

To avoid calling the EXPIRE command every time, you can adopt the following second implementation:

FUNCTION LIMIT_API_CALL(ip):
    current = GET(ip)
    IF current != NULL AND current > 10 THEN
        ERROR "too many requests per second"
    ELSE
        value = INCR(ip)
        IF value == 1 THEN
            EXPIRE(ip, 1)
        END
        PERFORM_API_CALL()
    END

The validity period of the counter is set to 1s during the first call of the INCR command, so the key does not need to be set again afterward.

However, this implementation contains an implicit race condition: If a client fails to call the EXPIRE command after the first INCR call, for example, due to an application crash, the counter never expires and remains present forever.

You can use a lua script to resolve this problem:

local current
current = redis.call("incr", KEYS[1])
if tonumber(current) == 1 then
    redis.call("expire", KEYS[1], 1)
end

The third implementation relies on the list structure of Redis. It is more complex but can record each request.

FUNCTION LIMIT_API_CALL(ip)
    current = LLEN(ip)
    IF current > 10 THEN
        ERROR "too many requests per second"
    ELSE
        IF EXISTS(ip) == FALSE #1
            MULTI
                RPUSH(ip,ip)
                EXPIRE(ip,1)
            EXEC
        ELSE
            RPUSHX(ip,ip)
        END
        PERFORM_API_CALL()
    END

This implementation also contains a race condition: when two clients run the EXISTS check (at position #1) concurrently, both may see false, and the commands in the MULTI/EXEC block run twice. However, this case rarely occurs and has little impact on the accuracy of the counter.

You can further optimize the preceding implementations. The INCR and RPUSH commands return the updated count, so you can use that return value directly instead of issuing a separate GET beforehand, combining the write and the read into a single round trip.

The idea of transforming the simple window algorithm into the sliding window algorithm is similar. You can replace a single key with a hash structure, in which a count is saved for each sub-window. During statistics collection, the counts of all sub-windows in the same hash structure are added up.
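As a sketch of that sub-window idea, the following Java class stands in for the Redis hash: each entry plays the role of one hash field holding a sub-window count (the class and field names are mine; a real version would keep the counts in a Redis hash and prune/sum them server-side, for example in a Lua script):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding window built from fixed sub-windows; stands in for a Redis hash of counts. */
class SlidingWindowCounter {
    private final int limit;
    private final int subWindows;          // number of sub-windows covering the full window
    private final long subWindowMillis;    // width of each sub-window
    private final Deque<long[]> buckets = new ArrayDeque<>(); // {subWindowIndex, count}

    SlidingWindowCounter(int limit, int subWindows, long subWindowMillis) {
        this.limit = limit;
        this.subWindows = subWindows;
        this.subWindowMillis = subWindowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        long index = nowMillis / subWindowMillis;
        // Drop sub-windows that slid out of the full window (HDEL in the Redis version).
        while (!buckets.isEmpty() && buckets.peekFirst()[0] <= index - subWindows) {
            buckets.pollFirst();
        }
        // Sum the remaining sub-window counts (summing the hash fields in the Redis version).
        long total = buckets.stream().mapToLong(b -> b[1]).sum();
        if (total >= limit) {
            return false;
        }
        if (buckets.isEmpty() || buckets.peekLast()[0] != index) {
            buckets.addLast(new long[]{index, 0});
        }
        buckets.peekLast()[1]++;  // HINCRBY in the Redis version
        return true;
    }
}
```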

(2) The Implementation of the Token Bucket or Leaky Bucket Algorithm

Redis can easily implement the token bucket or leaky bucket algorithm. For example, to implement the token bucket algorithm, you can use two keys to store the number of available tokens and the last request time of each user. A cleaner option is to store both fields in a single Redis hash.

In the following example, Redis stores the current quota data of user_1: Two tokens are in the token bucket, and the timestamp of the last visit is 1490868000.


When receiving a new request, the Redis client performs operations the same as those in the standalone throttling algorithms. First, the client obtains the current quota data (HGETALL) from the corresponding hash and calculates the number of tokens to be refilled based on the current timestamp, the timestamp of the last request, and the token refilling speed. Then, the client determines whether to allow the request and updates the timestamp and the number of tokens (HMSET).


Similarly, to achieve high precision, you need to control the concurrent operations of clients.

Problems may occur without synchronization control. For example, if two clients concurrently request the only token left in the bucket, both requests may be allowed.


The following is an example of Lua code:

local tokens_key = KEYS[1]
local timestamp_key = KEYS[2]

local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])

local fill_time = capacity / rate
local ttl = math.floor(fill_time * 2)

local last_tokens = tonumber(redis.call("get", tokens_key))
if last_tokens == nil then
    last_tokens = capacity
end

local last_refreshed = tonumber(redis.call("get", timestamp_key))
if last_refreshed == nil then
    last_refreshed = 0
end

local delta = math.max(0, now - last_refreshed)
local filled_tokens = math.min(capacity, last_tokens + (delta * rate))
local allowed = filled_tokens >= requested
local new_tokens = filled_tokens
-- Return a number instead of a boolean: Redis converts Lua false to nil,
-- which would truncate the returned table.
local allowed_num = 0
if allowed then
    new_tokens = filled_tokens - requested
    allowed_num = 1
end

redis.call("setex", tokens_key, ttl, new_tokens)
redis.call("setex", timestamp_key, ttl, now)

return { allowed_num, new_tokens }
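The token-refill arithmetic used in such a script can also be written as a pure Java function, which makes the math easier to test. This is a simplified, single-process transcription for illustration only; in production the logic must stay inside a Lua script so that the read-compute-write sequence runs atomically in Redis:

```java
/** Single-process transcription of the token-bucket refill math (illustrative only). */
class TokenBucketState {
    double tokens;       // last known token count
    long lastRefreshed;  // timestamp of the last request, in seconds

    TokenBucketState(double capacity) {
        this.tokens = capacity;  // the bucket starts full
        this.lastRefreshed = 0;
    }

    /** rate = tokens refilled per second; returns whether `requested` tokens were granted. */
    boolean tryAcquire(double rate, double capacity, long now, double requested) {
        long delta = Math.max(0, now - lastRefreshed);
        // Refill tokens for the elapsed time, capped at the bucket capacity.
        double filled = Math.min(capacity, tokens + delta * rate);
        boolean allowed = filled >= requested;
        tokens = allowed ? filled - requested : filled;
        lastRefreshed = now;
        return allowed;
    }
}
```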

(3) The Implementation of the Sliding Log Algorithm

The Sorted Set structure of Redis helps implement the sliding log algorithm easily. The rough process is shown below:

(a) Configure a Sorted Set for each user to log requests.

  • The key and value of each element can be the same, that is, the request timestamp.
  • You can set the validity period of the Sorted Set based on the time window size. For example, you can set the validity period to 5s when the time window is 1s to save the memory space of the Redis server when the number of requests is small.

(b) When a new user request is received, run the ZREMRANGEBYSCORE command to delete expired elements in the Sorted Set. You can identify expired elements based on the following criterion:

Request timestamp (t) < Current timestamp (now) – Time window size (interval)

(c) Run the ZADD command to add the current request to the Sorted Set.

(d) Run the ZCOUNT command to obtain the remaining space of the Sorted Set and determine whether throttling is required.

long now = System.currentTimeMillis();
long maxScoreMs = now - windowInSecond * 1000L;
try (Jedis jedis = jedisPool.getResource()) {  // Return the connection to the pool when done.
    Transaction redis = jedis.multi();
    redis.zremrangeByScore(key, 0, maxScoreMs);
    redis.zadd(key, now, now + "-" + Math.random()); // Add a random suffix to make the member unique.
    Response<Long> count = redis.zcard(key);         // Window size after pruning expired entries.
    redis.expire(key, windowInSecond);
    redis.exec();
    boolean allowed = count.get() <= limit;          // `limit` is the window's request quota.
}

The following is an example of JavaScript code:

https://github.com/peterkhayes/rolling-rate-limiter/blob/master/index.js

The sliding log algorithm has higher space complexity than other algorithms. Therefore, note the memory usage of Redis when you use the sliding log algorithm.
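The same bookkeeping, stripped of Redis, fits in a few lines of Java. This is an in-memory illustration of the algorithm only (names are mine); the Sorted Set version above is what you would actually run in a distributed setup:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** In-memory sliding log: stores one timestamp per allowed request. */
class SlidingLogThrottler {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> log = new ArrayDeque<>();

    SlidingLogThrottler(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        // ZREMRANGEBYSCORE: drop entries that fell out of the window.
        while (!log.isEmpty() && log.peekFirst() <= nowMillis - windowMillis) {
            log.pollFirst();
        }
        if (log.size() >= limit) {   // ZCOUNT/ZCARD check
            return false;
        }
        log.addLast(nowMillis);      // ZADD
        return true;
    }
}
```

The memory cost is visible here directly: one stored entry per allowed request, which is why the space complexity is higher than the counter-based algorithms.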

(4) Concurrency Control

All the preceding algorithms contain race conditions unless concurrency control is applied, but additional concurrency control inevitably degrades performance. Therefore, you usually need to trade precision against performance. You can apply concurrency control to Redis-based throttling in the following ways:

  • Use the MULTI and EXEC transaction commands of Redis.
  • Use distributed locks, such as RedLock. Each client must obtain a distributed lock of the corresponding key before performing any operations.
  • Use a Lua script, which Redis executes atomically.

We recommend running performance tests on whichever approach you choose.

Distributed throttling results in network communication, lock, and synchronization overheads, affecting the performance. The reliability of distributed environments also imposes more challenges. How can we design a distributed throttling system with high performance and reliability? This question involves all aspects of the entire system.

I’d like to share some personal thoughts, and any discussion is welcome:

(1) Properly configure hierarchical throttling based on actual demands, attempting to complete throttling at the outer layer. For example, NGINX-based throttling at the API layer is usually combined with application-layer throttling.

(2) Select an appropriate cache system to store dynamic throttling data. This depends on the company’s unified technical architecture.

(3) Complete static throttling configuration on the configuration center (for example, Diamond).

(4) During design, consider the scenarios where distributed throttling is unavailable (for example, when the cache crashes). If necessary, switch to standalone throttling and use Sentinel, which is mature and reliable.

(5) Most scenarios do not require high precision, because bursts within a certain range are usually tolerable, which leaves room for performance optimization. The biggest performance bottleneck is that the cache is accessed on every request. I adopted a compromise in an earlier design:

  • Pre-allocate a specific proportion (for example, 50%) of available quotas to machines in a cluster. Usually, you can evenly allocate the quotas, or you can allocate them based on the weight (if known) of each machine. The machines consume quotas at different rates, and some machines may fail or may be scaled during the process. Therefore, the pre-allocation proportion should not be excessively large or small.
  • When running out of quotas, each machine requests quotas from the central system. An improved solution here is that each machine records its quota consumption rate (equivalent to the traffic rate experienced) and applies for different quotas based on the rate. A machine that consumes quotas faster can apply for more quotas at a time.
  • When the total number of available quotas is less than a specific proportion (for example, 10%), limit the number of quotas that each machine can apply for at a time, calculate the issued quotas based on the remaining window size, and control the quotas issued each time within a specific proportion (for example, 50%) of the remaining quotas, to ensure a smooth transition of the remaining traffic.
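The core of the batch idea above can be sketched as follows. The class and method names are hypothetical, and an `AtomicLong` stands in for the central quota store; in a real system the refill call would be a network round trip to Redis or a dedicated quota service, and the batch size would shrink as the pool drains, as described above:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Each machine consumes a local batch and refills it from a shared pool (sketch only). */
class BatchQuotaClient {
    private final AtomicLong centralPool;  // stand-in for the central quota store
    private final long batchSize;          // quotas fetched per refill round trip
    private long localQuota = 0;

    BatchQuotaClient(AtomicLong centralPool, long batchSize) {
        this.centralPool = centralPool;
        this.batchSize = batchSize;
    }

    synchronized boolean tryAcquire() {
        if (localQuota == 0 && !refill()) {
            return false;  // central pool exhausted
        }
        localQuota--;      // common case: no remote call at all
        return true;
    }

    /** Grab up to batchSize quotas from the central pool in one round trip. */
    private boolean refill() {
        while (true) {
            long available = centralPool.get();
            if (available <= 0) return false;
            long take = Math.min(batchSize, available);
            if (centralPool.compareAndSet(available, available - take)) {
                localQuota = take;
                return true;
            }
        }
    }
}
```

Most requests are served from the local counter, so the central store is contacted only once per batch rather than once per request.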

5. Summary

Distributed throttling is an extension of standalone throttling, so the underlying algorithms are essentially the same. Their complexity and applicable scenarios are summarized in the table below.

[Table: complexity and applicable scenarios of each throttling algorithm]

Original Source:
