What is rate limiting?

Organizations use rate limiting to ensure the fair usage of their API-based services and other web services and resources by clients. It regulates the number of times a user can request a particular API-based service in a given time-frame.

When thousands of users share a service, rate limiting helps keep the system available to all of its clients. Rate limiting also protects a system from denial-of-service (DoS) attacks. In a DoS attack, a malicious user might initiate a huge number of service requests in a short period. This exhausts the system’s resources, so the system becomes unavailable to legitimate users. In some cases, a web service, API, or other resource might receive numerous requests because of an error on the client or user side. Rate limiting is a practical approach that organizations use to avoid such scenarios.

Limits on social media messaging are a typical instance of rate limiting that most internet users encounter. Social media websites like Facebook, LinkedIn, or Instagram often put a cap on the number of direct messages users can send in a day. For example, if a user decides to forward a message to thousands of contacts, the social media service’s rate-limiting logic kicks in. It may block the user from sending any more messages for a certain period.

Why do organizations need rate limiting?

There are several major types of rate limiting. A business can choose the one that best fits the nature of the web services it offers, as explored in greater detail below.

What are the major types of rate limiting?

User-level rate limiting

In cases where a system can uniquely identify a user, it can restrict the number of API requests that the user makes in a time period. For example, if the user is only allowed to make two requests per second, the system denies the user’s third request made in the same second. User-level rate limiting ensures fair usage. However, maintaining usage statistics for each user creates overhead for the system. If those statistics are not required for other reasons, this can be a drain on resources.

Server-level rate limiting

Most API-based services are distributed in nature. That means when a user sends a request, it might be serviced by any one of many servers. In distributed systems, rate limiting can be used for load-sharing among servers. For example, if one of ten servers in a distributed system receives a large share of the requests while the others are mostly idle, the system is not fully utilized. Server-level rate limiting restricts the number of service requests that a particular server can handle. If a server receives requests over this set limit, they are either dropped or routed to another server. Server-level rate limiting helps ensure the system’s availability and helps prevent denial-of-service attacks targeted at a particular server.
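The drop-or-reroute behaviour described above can be sketched in a few lines. This is only an illustration under simple assumptions: the `Server` class, the `route` function, and the per-server limits are hypothetical names and values, and window resets are left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    limit: int        # maximum requests this server accepts per window
    handled: int = 0  # requests accepted in the current window


def route(servers: List[Server], preferred: int) -> Optional[str]:
    """Try the preferred server first, then the others in order.

    Returns the name of the server that took the request,
    or None if every server is at its limit (the request is dropped).
    """
    order = servers[preferred:] + servers[:preferred]
    for server in order:
        if server.handled < server.limit:
            server.handled += 1
            return server.name
    return None


servers = [Server("a", limit=2), Server("b", limit=2)]
results = [route(servers, preferred=0) for _ in range(5)]
print(results)  # ['a', 'a', 'b', 'b', None]
```

The first two requests fit on server "a", the next two overflow to "b", and the fifth is dropped because both servers have reached their limit.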

Geography-based rate limiting

Most API-based services have servers spread across the globe. When a user issues an API request, a server close to the user’s geographic location fulfils it. Organizations implement geography-based rate limiting to restrict the number of service requests from a particular geographic area. This can also be done based on timing. For example, if the number of requests coming from a particular geographic location is small from 1:00 am to 6:00 am, then a web server can have a rate limiting rule for this particular period. If there is an attack on the server during these hours, the number of requests will spike. In the event of a spike, the rate limiting mechanism will then trigger an alert and the organization can quickly respond to such an attack.
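As a rough sketch of the time-based rule described above, the following limiter counts requests per region during a quiet period and records an alert the first time a region exceeds its limit. The region labels, the quiet hours, the limit of three, and the `GeoLimiter` class are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rule: at most 3 requests per region in each hour from 1:00 am to 6:00 am.
QUIET_HOURS = range(1, 6)  # hours 1, 2, 3, 4, 5
QUIET_LIMIT = 3


class GeoLimiter:
    def __init__(self):
        self.counts = Counter()  # requests seen per (region, hour)
        self.alerts = []         # messages an operator would be notified about

    def allow(self, region: str, hour: int) -> bool:
        if hour not in QUIET_HOURS:
            return True  # this sketch applies no rule outside the quiet hours
        key = (region, hour)
        self.counts[key] += 1
        if self.counts[key] > QUIET_LIMIT:
            if self.counts[key] == QUIET_LIMIT + 1:
                # Raise the alert once, when the limit is first exceeded.
                self.alerts.append(f"request spike from {region} at {hour}:00")
            return False
        return True


limiter = GeoLimiter()
decisions = [limiter.allow("eu-west", hour=2) for _ in range(5)]
print(decisions)       # [True, True, True, False, False]
print(limiter.alerts)  # ['request spike from eu-west at 2:00']
```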

What are the algorithms used for rate limiting?

Fixed window rate limiting

Fixed window rate limiting restricts the number of API requests accepted within a fixed time window. For example, a server can have a rate limiting component that implements a fixed window algorithm and accepts only 100 requests per minute. The time-frame is fixed, and it starts at a specific time. For example, the server will serve only 100 requests between 10:00 am and 10:01 am. At 10:01 am, the window resets. Fixed window algorithms can be implemented at the user level or the server level. If implemented at the user level, each user can make 100 requests per minute. If implemented at the server level, all users can collectively make 100 requests per minute.

The below diagram shows the workflow of the fixed window algorithm with user-level rate limiting. In this workflow, each user is assigned a key for unique identification, and a counter is attached to each unique user. When the user makes a request within a time-frame, the counter increments.
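The per-user counter workflow can be sketched as follows. This is a minimal in-memory illustration, not a production design: the `FixedWindowLimiter` class and its parameters are hypothetical, and the clock is injectable so the behaviour can be shown without waiting for real time to pass.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowLimiter:
    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # user key -> (index of the window last seen, requests counted in it)
        self.counters: Dict[str, Tuple[int, int]] = {}

    def allow(self, user_key: str) -> bool:
        window = int(self.clock() // self.window_seconds)
        seen_window, count = self.counters.get(user_key, (window, 0))
        if seen_window != window:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            self.counters[user_key] = (window, count)
            return False
        self.counters[user_key] = (window, count + 1)
        return True


now = [0.0]  # fake clock, in seconds
limiter = FixedWindowLimiter(limit=2, window_seconds=60, clock=lambda: now[0])
print([limiter.allow("alice") for _ in range(3)])  # [True, True, False]
now[0] = 60.0  # the next window begins
print(limiter.allow("alice"))  # True
```

Each user key has its own counter, so one user reaching the limit does not affect another user in the same window.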

Advantages and disadvantages of the fixed window algorithm

The fixed window algorithm is easy to implement as it is based on a fixed time-frame. As the request count renews at the beginning of every time-frame, the fixed window algorithm doesn’t cause starvation of newer requests. A disadvantage of the fixed window algorithm, on the other hand, is that it can lead to a rush of requests, especially at the beginning of the time window. For example, if a server allows 1000 requests per second, all 1000 requests can arrive simultaneously, potentially overloading the server. This issue can arise because there is no restriction on the minimum time gap between two requests.

Leaky bucket rate limiting

Unlike the fixed window algorithm, the leaky bucket algorithm doesn’t rely on time windows. Instead, it has a fixed-length queue that doesn’t depend on the time. The web server services requests on a first-come, first-served basis. Each new request is added to the end of the queue. If the queue is full when a new request arrives, the server drops the request.
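The queue behaviour can be sketched as below. The `LeakyBucket` class and its method names are hypothetical. The sketch assumes the server calls `leak()` at a fixed interval (for example from a timer), which is what keeps the outflow constant; the timer itself is left out.

```python
from collections import deque


class LeakyBucket:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.queue = deque()

    def offer(self, request) -> bool:
        """Add a request to the end of the queue; drop it if the queue is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def leak(self):
        """Hand the oldest queued request to the server, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None


bucket = LeakyBucket(capacity=2)
print([bucket.offer(r) for r in ("r1", "r2", "r3")])  # [True, True, False]
print(bucket.leak())       # r1
print(bucket.offer("r4"))  # True
```

The third request is dropped because the queue is full. Once the oldest request leaks out, there is room for a new one.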

Advantages and disadvantages of the leaky bucket algorithm

One major advantage of the leaky bucket algorithm is that, since it is based on a queue, it is easy to implement. Also, it presents API requests to the server at a constant rate. Unlike the fixed window algorithm, there will not be a burst in the number of requests at any particular time. However, as the leaky bucket algorithm is based on a static queue, there is a chance of starvation, meaning that the newer requests may not get serviced at all. This issue is not present in a fixed window, as the time window is periodically refreshed to accept new requests.

Sliding window rate limiting algorithm

The sliding window algorithm is quite similar to the fixed window algorithm, except for the time window’s starting point. In the sliding window, the time window starts only when a new request is made. For example, if the first request is made at 10:02 am, and the server allows two requests per minute, the time-frame is 10:02 to 10:03 am.
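One common way to realise a window that is anchored to request times rather than to the clock is to keep a log of recent request timestamps. The sketch below assumes that variant. The `SlidingWindowLimiter` class is a hypothetical name, it tracks a single client for brevity, and the clock is injectable for demonstration.

```python
import time
from collections import deque
from typing import Callable


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.timestamps = deque()  # times of accepted requests, oldest first

    def allow(self) -> bool:
        now = self.clock()
        # Forget accepted requests that are a full window old or older.
        while self.timestamps and now - self.timestamps[0] >= self.window_seconds:
            self.timestamps.popleft()
        if len(self.timestamps) >= self.limit:
            return False
        self.timestamps.append(now)
        return True


now = [0.0]  # seconds since the first request (10:02:00 in the example above)
limiter = SlidingWindowLimiter(limit=2, window_seconds=60, clock=lambda: now[0])
outcomes = []
for t in (0.0, 30.0, 45.0, 60.0, 61.0):
    now[0] = t
    outcomes.append(limiter.allow())
print(outcomes)  # [True, True, False, True, False]
```

The request at 45 seconds is rejected because two requests were already accepted in the preceding minute. The request at 60 seconds is accepted because the first request has aged out of the window.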

The sliding window algorithm effectively addresses the request-burst issue that the fixed window algorithm faces. It also addresses the starvation issue that the leaky bucket algorithm faces.

What are the main uses of rate limiting?

The primary aim of rate limiting is ensuring fair use of shared resources. Beyond that, rate limiting is a versatile technique that organizations may make use of for a wide variety of reasons.

Rate limiting offers extra security

Rate limiting helps prevent denial-of-service (DoS) attacks. In a DoS attack, a malicious user initiates a massive number of service requests to bring down the system. For example, if a concert promoter and ticket sales website gets a million requests in a second once a concert goes on sale, it will choke the system, and the web server and database may become unavailable. With rate limiting, the website can prevent any such attempts. A denial of service can happen even if the client has no malicious intent, for example when an error in the system that issues the requests (the client side) causes a flood of requests. Rate limiting also guards against such unintended overloads.

Access control

Rate limiting deals not only with limiting the number of requests; it can also be adapted to limit the level of access. For example, if there is an API-based service to view and modify the personal details of a user, the rate-limiting algorithm can implement different access levels. One set of users can only view the personal details, while a second set can both view and modify the details.

Metering for APIs

In API business models, rate limiting can be used for metering usage. For example, if a user has signed up for a plan that allows 1000 API requests per hour, the rate limiting logic will restrict any API request above the cap. The rate limiting logic can also dynamically allow the user to purchase additional requests.
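A minimal sketch of plan-based metering might look like the following. The plan names, the "pro" quota, and the `MeteredAccount` class are hypothetical, and billing and the hourly reset schedule are left out.

```python
# Hypothetical plans and their hourly request quotas.
PLAN_QUOTAS = {"basic": 1000, "pro": 5000}


class MeteredAccount:
    def __init__(self, plan: str):
        self.quota = PLAN_QUOTAS[plan]  # requests allowed in the current hour
        self.used = 0                   # requests consumed in the current hour

    def record_request(self) -> bool:
        """Count a request against the quota; reject it once the cap is reached."""
        if self.used >= self.quota:
            return False
        self.used += 1
        return True

    def purchase(self, extra_requests: int) -> None:
        """Raise the cap when the user buys additional requests."""
        self.quota += extra_requests

    def start_new_hour(self) -> None:
        self.used = 0


account = MeteredAccount("basic")
accepted = sum(account.record_request() for _ in range(1001))
print(accepted)                  # 1000
account.purchase(10)
print(account.record_request())  # True
```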

Guarantees performance

A key objective of implementing rate limiting logic is to ensure the performance of an API. When the system allows unlimited requests, the performance of the server degrades and the API slows down. In extreme cases, the server might fail to take up any requests. This may lead to cascading failures in a distributed system, where the load from the non-functioning server is distributed to the other servers and gradually overloads them. Rate limiting prevents this condition by restricting requests at either the user level or the server level.

Ensures availability

One of the main requirements of API-based services is their 24/7 availability. Every second, thousands of users access an API. Even a few seconds of outage can result in a huge loss for the organization. Therefore, it is in the best interest of every organization to strive for zero downtime. Rate limiting and other techniques like load-sharing allow organizations to restrict sudden bursts in API requests and ensure the system’s availability.

Rate limiting at the client and server sides

Server-side rate limiting

Restricting user requests at the server side is a widely practiced method for rate limiting. It is in the service provider’s interest to cap user requests to ensure performance and availability. As we have seen in the above section, organizations use server-side rate limiting for a variety of purposes.

Client-side rate limiting

It might seem that rate limiting is solely in the API providers’ interest. In practice, the users or clients of the service can also benefit from rate limiting. Clients who adhere to usage restrictions can be assured that the service provider will fulfil their requests. If a user does not take rate limiting into account, they may face sudden or unexpected rejection of their requests, which can affect the functionality of their solutions.

For example, a mobile application might be built upon the Facebook messaging API. When people use this application, they are actually using Facebook messaging in the background. In this case, it is the client who must take steps to ensure the application adheres to the rate limiting norms. If the rate limiting is overlooked at the client-side, users of this mobile application might face unexpected errors. It is in the best interest of API users to enforce rate limiting at the client-side.
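One simple way for a client to stay within a provider’s limit is to space out its own calls. The sketch below assumes the provider allows a known number of requests per second. The `ClientThrottle` class is a hypothetical helper, and the clock and sleep functions are injectable so the behaviour can be shown without real delays.

```python
import time
from typing import Callable


class ClientThrottle:
    """Spaces out outgoing calls so the client stays at or below max_per_second."""

    def __init__(self, max_per_second: float,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = 0.0  # earliest time the next call may be sent

    def wait(self) -> float:
        """Block until the next call is allowed; return how long we waited."""
        now = self.clock()
        delay = max(0.0, self.next_allowed - now)
        if delay > 0:
            self.sleep(delay)
        self.next_allowed = max(now, self.next_allowed) + self.min_interval
        return delay


# Demonstration with a fake clock so nothing actually sleeps.
now = [100.0]

def fake_sleep(seconds: float) -> None:
    now[0] += seconds

throttle = ClientThrottle(max_per_second=2, clock=lambda: now[0], sleep=fake_sleep)
print(throttle.wait())  # 0.0  (first call goes out immediately)
print(throttle.wait())  # 0.5  (second call is held back half a second)
```

In a real application, the client would call `throttle.wait()` immediately before each API request.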

What are the challenges in implementing rate limiting?

Distributed systems

Implementing rate limiting logic in a system with distributed servers is a challenge. When a user requests an API-based service from a distributed system, the rate limiting components in the servers should synchronize with each other. For example, suppose a user has already used four of a five-requests-per-minute quota and issues two more requests, which go to two different servers. Each server pulls up the user’s request count and sees that the user has issued only four requests, so both servers allow the request. Instead of the original five requests per minute, the user gets six. Many such synchronization issues and race conditions can occur when rate limiting is implemented in distributed systems.

There are multiple solutions to rate limiting inconsistencies in distributed systems. One is for servers to use software locks to protect the user request counter. In this method, only one server at a time can access the counter, which stores the number of requests made in a time window. Another simple but less elegant solution is sticky sessions, wherein a single server always serves a given user. This, however, defeats the purpose of a distributed system.
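The locking idea can be illustrated in a single process, with threads standing in for servers. This is only an analogy: `threading.Lock` works within one process, whereas a real deployment would need a lock or atomic operation provided by a shared store. The `SharedCounter` class is a hypothetical name.

```python
import threading


class SharedCounter:
    """Stands in for the request counter that all servers share."""

    def __init__(self, limit: int, count: int = 0):
        self.limit = limit
        self.count = count
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # The check and the increment happen under one lock, so two servers
        # cannot both read "4" and both allow the request.
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


# The user has already used four of five requests in this window.
counter = SharedCounter(limit=5, count=4)
results = []

def server_handles_request() -> None:
    results.append(counter.try_acquire())

threads = [threading.Thread(target=server_handles_request) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [False, True]
print(counter.count)    # 5
```

Exactly one of the two concurrent requests is allowed, so the user stays within the five-request quota.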

Computational overheads

Adding a layer of logic for rate limiting increases the computational overhead of a server. If the system is already overloaded, adding the rate limiting logic would slow it down further, and users may experience delays.

Instead of implementing the rate limiting logic within the same server, it could be incorporated as an external component. Whenever a user requests an API, the request is first routed through this external component which decides to either allow or reject the request.

Failure in rate limiting components

Adding a layer of complexity for rate limiting introduces a new point of failure. Even if the API is up and running, an error in the rate limiting component will result in the rejection of requests.

A good rate limiting implementation should be able to quickly recover from failures. This can be in the form of hard resets when failures occur. Also, just like the backup servers in API-based services, there could be a duplicate rate limiting component that can take up the role if the original component fails.
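The duplicate-component idea can be sketched as a wrapper that falls back to a backup limiter when the primary one fails. Everything here is hypothetical: the class names, the `allow(key)` interface, and the stub limiters used for the demonstration. The sketch also ignores how the two components would share request counts.

```python
class FailoverLimiter:
    """Asks the primary limiter first and falls back to the backup if it fails."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.failovers = 0  # how many times the backup had to step in

    def allow(self, key: str) -> bool:
        try:
            return self.primary.allow(key)
        except Exception:
            # The primary component is unavailable, so the duplicate takes over.
            self.failovers += 1
            return self.backup.allow(key)


# Stub limiters for the demonstration.
class BrokenLimiter:
    def allow(self, key: str) -> bool:
        raise ConnectionError("rate limiter unavailable")


class CountingLimiter:
    def __init__(self, limit: int):
        self.limit = limit
        self.seen = {}

    def allow(self, key: str) -> bool:
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.limit


limiter = FailoverLimiter(primary=BrokenLimiter(), backup=CountingLimiter(limit=2))
print([limiter.allow("alice") for _ in range(3)])  # [True, True, False]
print(limiter.failovers)                           # 3
```

Requests continue to be rate limited by the backup instead of being rejected outright when the primary component fails.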