Detailed Explanation of Guava RateLimiter’s Throttling Mechanism

Throttling is one of the three effective methods for protecting a high concurrency system. The other two are respectively caching and downgrading. Throttling is used in many scenarios to limit the concurrency and the number of requests. For example, in the event of flash sales, throttling protects your own system and the downstream system from being overwhelmed by tremendous amounts of traffic.

The purpose of throttling is to protect the system by restricting concurrent access or requests, or restricting requests of a specified time window. After the threshold is exceeded, denial of service or traffic shaping is triggered.

Common throttling methods are: 1. Limit the total concurrency. For example, you can limit the size of the database connection pool and thread pool. 2. Limit the instantaneous concurrency. For example, the limit_conn_module of NGINX is used to limit the instantaneously concurrent connections. The Semaphore class of Java can implement the same. 3. Limit the average access rate of a time window. For example, the RateLimiter of Guava and the limit_req module of NGINX can both be used to limit the average access rate per second. 4. Limit the remote API invocation rate. 5. Limit the MQ consumption rate. In addition, we can also implement throttling according to the number of network connections, network traffic, and CPU or memory load.

For example, if we need to limit the concurrency that a method is called simultaneously to less than 100, we can implement it through Semaphore. If we want to limit the average number of times that a method is called during a period to less than 100, we need to use RateLimiter.

Throttling Algorithms

There are two commonly used algorithms for throttling: the leaky bucket algorithm and token bucket algorithm.

You can see from the preceding figure that the water enters the leaky bucket first like the access traffic. Then the water drips out of the bucket like how our system processes the requests. When the water (access traffic) inflows too fast, the bucket gets filled up and then overflows.

The implementation of the leaky bucket algorithm usually relies on the queue. If your system receives a new access request and the queue is not full, it puts the request into the queue. A processor pulls requests from the queue and processes it at a fixed frequency. If the volume of the inbound access requests is too large and the queue becomes full, new requests will be discarded.

The token bucket algorithm works like a bucket that stores a fixed number of tokens, where tokens are added to it at a fixed rate. After the maximum number of tokens stored in the bucket is exceeded, new tokens are discarded or denied. When traffic or network requests arrive, each request must obtain a token. Requests with tokens are processed directly, and one token is removed from the bucket for each request processed. Requests that fail to obtain tokens will be limited: to be directly discarded or wait in the buffer area.

Comparison between the token bucket and the leaky bucket:

  • Tokens are added to the token bucket at a fixed rate. Whether a request can be processed depends on whether sufficient tokens are available in the bucket. When the number of available tokens is reduced to zero, all new requests are denied. The leaky bucket processes requests at a fixed rate. The rate of inbound requests is not limited, but when the accumulated amount of inbound requests exceeds the maximum capacity of the bucket, new inbound requests are denied.
  • The token bucket limits the average inflow rate and allows sudden increase in traffic. The request can be processed as long as it has a token. Three or four tokens can be given at one time. The leaky bucket limits the constant outflow rate, which is set to a fixed value. For example, if the outflow rate is set to one request per second, it cannot process two requests per second. This ensures the outflow rate is always stable, regardless of the inflow rate.
  • The token bucket allows for sudden increase in traffic to some extent, while the leaky bucket is mainly used to ensure the smooth outflow rate.

Guava RateLimiter

Guava is an excellent open-source Java project. It contains several core libraries of Google that are used in their Java-based projects: collections, caching, concurrency libraries, common annotations, string processing, I/O, and so forth.

Guava RateLimiter provides the token bucket algorithm implementations: SmoothBursty and SmoothWarmingUp.

The class diagram of RateLimiter is provided in the preceding figure, where the RateLimiter class provides two factory methods to create two subclasses. This is in line with the idea to replace constructors by using statistic factory methods. This idea was proposed by the book Effective Java, the author of which is one of the main maintainers of Guava libraries.


Use a static method of RateLimiter to create a limiter, and set the number of tokens to be placed into the bucket to five. The returned RateLimiter object ensures that no more than five tokens will be placed into the bucket every second at a fixed rate, to smooth the outflow traffic.

RateLimiter uses the token bucket algorithm and accumulates the tokens. If the token consumption frequency is low, the requests can directly get the tokens without waiting.

RateLimiter accumulates tokens to allow it to cope with sudden increase in traffic. In the following snippet, a thread will directly acquire five tokens. The request will be quickly responded, because the token bucket has the sufficient number of accumulated tokens.

When RateLimiter does not have sufficient number of tokens available, it uses the delayed processing method. The period of time that the current thread ought wait to acquire a token is assumed by the next thread. In other words, the next thread waits additional time on behalf of the current thread.


The RateLimiterSmoothWarmingUp method has a warm-up period after teh startup. It gradually increases the distribution rate to the configured value.

For example, you can use the following snippet to create a ratelimiter with an average token creation rate of 2 tokens/s, and a warm-up period of 3 seconds. The token bucket does not create a token every 0.5 seconds immediately after startup, because a warm-up period of 3 seconds has been set. Instead, the system gradually increases the token creation rate to the pre-set value within 3 seconds, and then creates tokens at a fixed rate. This feature is suitable for scenarios where the system needs some time to warm up after startup.

Source Code Analysis

After getting familiar with the basic usage of the RateLimiter, let us learn about how it is implemented. First, take a look at several important variables and their definitions.


Every time the acquire() method is called, RateLimiter compares the current time with the nextFreeTicketMicros, and refreshes the number of stored tokens storedPermits based on the difference between them and the interval for adding a unit number of tokens stableIntervalMicros. Then the acquire()method sleeps until nextFreeTicketMicros.

The snippet for the acquire() method is provided as follows. The acquire() method calls the reserve function to calculate the period of time that is required for acquiring the target number of tokens. Then it calls the SleepStopwatch method to sleep and return the period off time to be waited.

reserveEarliestAvailable is the key function for refreshing the number of tokens and the time for acquiring the next token nextFreeTicketMicros. Three steps are involved in this process. Step 1: Call the resync function to add tokens. Step 2: Calculate the additional time that the thread must wait after it was preconsumed for the specified number of tokens. Step 3: Update the time that the next token can be acquired nextFreeTicketMicrosand the number of stored tokens storedPermits.

The preconsumption feature of RateLimiter is involved in this case. RateLimiter allows a thread to preconsume tokens without waiting. To acquire tokens, the next thread must wait for an additional period for the preconsumed tokens. For detailed explanation of the coding logic, see the following comments:

The resync function is used to increase the number of stored tokens. Its core logic is (nowMicros - nextFreeTicketMicros) / stableIntervalMicros. Refresh when the current time is greater than nextFreeTicketMicros, or directly return.

Here is an example that may help you better understand the logic of the resync and reserveEarliestAvailablefunctions.

For example, stableIntervalMicros of RateLimiter is 500, which means it generates two tokens per second. The storedPermits is 0, and the nextFreeTicketMicros is 1553918495748. Thread 1 calls acquire(2). The current time is 1553918496248. First, call the resync function to calculate the number of tokens that can be acquired, which is (1553918496248 - 1553918495748)/500 = 1. Ratelimiter supports preconsumption, so nextFreeTicketMicros= nextFreeTicketMicro + 1 × 500 = 1553918496748. Thread 1 does not have to wait.

Then Thread 2 calls acquire(2). First, the resync function finds that the current time is earlier than nextFreeTicketMicros, so it cannot add new tokens. Therefore, two tokens need to be preconsumed, and nextFreeTicketMicros= nextFreeTicketMicro + 2 × 500 = 1553918497748. Thread 2 needs to wait until 1553918496748, which is equal to nextFreeTicketMicros calculated for Thread 1. Thread 3 also needs to wait until nextFreeTicketMicros calculated for Thread 2.


Next, let us take a look at how the SmoothWarmingUp rate limiter is implemented.

The main difference is the SmoothWarmingUp implementation has a warm-up period, during which the token generation rate and the number of tokens increases with time. This process is illustrated as follows. The token (permit) refresh interval reduces with time. When the number of stored tokens reduces from maxPermits to thresholdPermits, the time cost for creating tokens also reduces from the coldInterval to the normal stableInterval.

The code snippet of the SmoothWarmingUp rate limiter is provided as follows, and the logic of the code is explained in comments.


RateLimiter can only be used in throttling of standalone applications. If you want to implement throttling in a cluster, you need to introduce Redis or an Alibaba open-source middleware sentinel. Stay tuned.

Personal blog: remcarpediem

Official WeChat account: remcarpediem



Follow me to keep abreast with the latest technology news, industry insights, and developer trends.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store