
Rate Limiting Strategies to Protect APIs and Control Cloud Costs

Introduction

Rate limiting controls how many requests a client can make during a time window. It protects availability, ensures fair usage, and prevents surprise cloud bills caused by abusive or accidental traffic spikes.

Why it matters

  • Stability: Prevents resource starvation and cascading failures.
  • Fairness: Stops noisy neighbors from degrading service for others.
  • Cost control: Caps bursty traffic that would otherwise inflate egress/compute costs.
  • Abuse defense: Mitigates brute force, scraping, and DoS attempts.

Common techniques

  • Fixed window: Allow N requests per minute/hour. Simple, but can allow bursts at window boundaries.
  • Sliding window (log or counters): Counts requests in the last T seconds. Fairer than fixed window; higher state overhead if using logs (see the counter sketch after this list).
  • Token bucket: Tokens refill at rate r; requests consume 1 token. Supports controlled bursts up to bucket size b.
  • Leaky bucket (queue): Processes at a constant rate; excess is queued or dropped. Smooths traffic aggressively.
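
To make the sliding-window counter concrete, here is a minimal in-memory sketch (the window size, limit, and function names are illustrative; a production version would share state in Redis and expire idle keys):

// Sliding-window counter: approximate the last windowMs of traffic by
// weighting the previous fixed window against the current one.
const windowMs = 60_000; // 1-minute rolling window (illustrative)
const limit = 100;       // max requests per rolling window (illustrative)

const counters = new Map(); // key -> { windowStart, current, previous }

function allow(key, now = Date.now()) {
  const w = Math.floor(now / windowMs) * windowMs; // start of current fixed window
  let c = counters.get(key);
  if (!c || c.windowStart !== w) {
    // Roll forward; anything older than one full window counts as zero.
    const previous = c && c.windowStart === w - windowMs ? c.current : 0;
    c = { windowStart: w, current: 0, previous };
    counters.set(key, c);
  }
  // Weight the previous window by how much of it the rolling window still covers.
  const overlap = 1 - (now - w) / windowMs;
  const estimated = c.previous * overlap + c.current;
  if (estimated >= limit) return false; // reject: over the approximate limit
  c.current += 1;
  return true;
}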

Comparison table

Rate limiting strategies comparison

Strategy        | Burst handling           | Fairness | State/complexity                       | Good for
--------------- | ------------------------ | -------- | -------------------------------------- | ---------------------------------------------
Fixed window    | Allows boundary bursts   | Lower    | Low                                    | Simple per-IP or per-key caps
Sliding window  | Controls boundary bursts | High     | Medium (logs) / Low (rolling counter)  | Public APIs needing fairness
Token bucket    | Supports limited bursts  | High     | Low                                    | User-tier limits; sustained rate with bursts
Leaky bucket    | Strict smoothing         | High     | Low–Medium                             | Backpressure on bursty writers

Choosing keys and tiers

  • Key by identity: API key, OAuth client, or user ID (not only IP); see the key-building sketch after this list.
  • Segment by tier: Free vs. Pro vs. Enterprise limits.
  • Path/operation classes: Stricter limits for expensive endpoints.
  • Geo/region-aware: Apply limits close to where traffic enters.
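
Putting these choices together, a hedged sketch of a key-and-limit lookup (the tier names, numbers, and operation classes below are invented for illustration):

// Compose a rate-limit key and limits from identity, tier, and operation class.
const TIER_LIMITS = {
  free:       { rate: 2,   burst: 10 },  // illustrative numbers
  pro:        { rate: 20,  burst: 100 },
  enterprise: { rate: 100, burst: 500 },
};

// Expensive endpoints get their own, stricter class.
const OPERATION_CLASS = {
  "POST /v1/reports": "expensive",
  "GET /v1/status": "cheap",
};

function limitFor(user, method, path) {
  const op = OPERATION_CLASS[`${method} ${path}`] ?? "default";
  const { rate, burst } = TIER_LIMITS[user.tier] ?? TIER_LIMITS.free;
  return {
    key: `rl:${user.id}:${op}`, // identity-based key, segmented by operation
    rate: op === "expensive" ? rate / 10 : rate, // stricter for costly endpoints
    burst,
  };
}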

Practical snippets

Nginx per-IP limiting with burst (leaky bucket under the hood)

# /etc/nginx/conf.d/ratelimit.conf
# Shared 10 MB zone keyed by client IP; sustained rate capped at 10 req/s.
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
  listen 443 ssl http2;
  server_name api.example.com;

  location /v1/ {
    # Allow bursts of up to 20 extra requests; nodelay serves them immediately
    # instead of queueing them at the configured rate.
    limit_req zone=perip burst=20 nodelay;
    limit_req_status 429;  # respond 429 instead of the default 503
    proxy_pass http://backend;
  }
}

Express + Redis (token bucket-ish)

// Token-bucket middleware backed by Redis (ioredis). A Lua script would make
// the read-modify-write atomic; this simplified version can race under load.
const express = require("express");
const Redis = require("ioredis");

const app = express();
const redis = new Redis();

const rate = 5;   // tokens refilled per second
const burst = 50; // bucket capacity
const ttl = 3600; // seconds before an idle bucket expires

app.use(async (req, res, next) => {
  const key = `bucket:${req.user.id}`; // assumes auth middleware has set req.user
  const now = Date.now();

  let bucket = await redis.hgetall(key); // {} when the key does not exist yet
  if (!bucket.last) bucket = { tokens: burst, last: now };

  // Refill for the time elapsed since the last request, capped at burst.
  const elapsed = (now - Number(bucket.last)) / 1000;
  const tokens = Math.min(burst, Number(bucket.tokens) + elapsed * rate);

  if (tokens < 1) {
    // Tell well-behaved clients when the next token will be available.
    res.set("Retry-After", String(Math.ceil((1 - tokens) / rate)));
    return res.status(429).json({ error: "Too Many Requests" });
  }

  await redis.hset(key, { tokens: tokens - 1, last: now });
  await redis.expire(key, ttl);
  next();
});

Return 429 Too Many Requests with a Retry-After header. For idempotent calls, clients should wait at least the advertised delay before retrying.

Best practices

  • Layer limits: Edge (CDN/WAF) + gateway + app-level for expensive ops.
  • Expose headers: X-RateLimit-Limit, -Remaining, -Reset for developer UX (see the middleware sketch after this list).
  • Protect auth flows: Stricter limits on login, password reset, token minting.
  • Adaptive limits: Tighten during incidents; loosen for trusted clients.
  • Separate write vs. read: Writes usually get lower thresholds.
  • Monitor and alert: Track 429s, latency, and cache hit ratios to tune values.
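
For the headers above, a minimal Express sketch (checkLimit is a hypothetical helper wrapping whichever limiter you use; the X-RateLimit-* names follow the common convention):

// Attach rate-limit headers to every response so clients can self-throttle.
app.use(async (req, res, next) => {
  // checkLimit (hypothetical) returns the caller's current quota state.
  const { limit, remaining, resetEpochSeconds, allowed } = await checkLimit(req);

  res.set({
    "X-RateLimit-Limit": String(limit),
    "X-RateLimit-Remaining": String(Math.max(0, remaining)),
    "X-RateLimit-Reset": String(resetEpochSeconds),
  });

  if (!allowed) {
    res.set("Retry-After", String(resetEpochSeconds - Math.floor(Date.now() / 1000)));
    return res.status(429).json({ error: "Too Many Requests" });
  }
  next();
});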

Common pitfalls

  • Only per-IP: Breaks behind NATs and misses authenticated abuse. Use identity keys.
  • Clock skew: Distributed windows can drift—prefer server-side counters (Redis/Lua).
  • Retry storms: Clients hammering after 429. Add jitter and backoff guidance (see the client sketch after this list).
  • Caching interaction: Cacheable responses reduce pressure—set explicit Cache-Control.
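
On the retry-storm point, a client-side sketch of backoff with jitter (the attempt cap is illustrative; fetch assumes Node 18+ or a browser):

// Retry idempotent requests after 429, honoring Retry-After and adding jitter.
async function fetchWithBackoff(url, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429) return res;

    // Prefer the server's Retry-After (seconds); fall back to exponential backoff.
    const baseSeconds = Number(res.headers.get("Retry-After")) || 2 ** attempt;
    // Wait at least the advertised delay, plus up to 100% jitter.
    const delayMs = baseSeconds * (1 + Math.random()) * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`Still rate limited after ${maxAttempts} attempts: ${url}`);
}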

Conclusion

Effective rate limiting blends the right algorithm (sliding window or token bucket), the right keys (per user or app), and multi-layer enforcement. Do this well and you protect uptime, ensure fairness, and keep cloud costs predictable.

Tip: Start conservative, ship metrics, then tune limits per endpoint based on real traffic and error budgets.