
boto3 Retry Decorator with Exponential Backoff for ThrottlingException

Python decorator that adds automatic retry logic with exponential backoff and jitter to any boto3 API call, handling ThrottlingException and transient AWS errors.

January 20, 2025 9 min read ~20 min to complete DB
The Situation

Production resilience — AWS throttles API calls per account per region. Batch scripts that call describe_instances 500 times will hit limits. Automatic retry with backoff is essential.

5 Steps
5 Services Used
~20 min Duration
Advanced Difficulty

Problem Statement

Your compliance script calls describe_instances in a loop across 50 regions and 20 accounts. After 30 seconds, you start getting ThrottlingException: Rate exceeded. Without retry logic, your script crashes and you have incomplete data. With exponential backoff, it slows down automatically and completes successfully.


Why Exponential Backoff + Jitter?

Naive retry (bad):     retry immediately → still throttled → fail
Fixed delay (better):  wait 1s → retry → wait 1s → retry
Exponential (good):    wait 1s → 2s → 4s → 8s → 16s (backs off)
Exponential + Jitter (best): 0.8s → 1.7s → 3.9s → 7.2s (avoids thundering herd)

Jitter adds random variation so that 100 concurrent threads don’t all wake up and retry at the same moment — which would just cause another wave of throttling.
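The strategies above can be compared with a small sketch. The helper name `backoff_schedule` and its parameters are made up for this illustration and are not part of boto3; the 0.75 to 1.25 jitter range is the one used by the decorator later in this article.

```python
import random


def backoff_schedule(retries, base_delay=1.0, max_delay=30.0, jitter=False, rng=None):
    """Return the list of waits (in seconds) before each retry.

    Hypothetical helper, for illustration only.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(retries):
        # Double the wait on every retry, but never exceed max_delay.
        delay = min(base_delay * (2 ** attempt), max_delay)
        if jitter:
            # Scale by a random factor so concurrent callers drift apart.
            delay *= rng.uniform(0.75, 1.25)
        delays.append(delay)
    return delays


plain = backoff_schedule(5)                 # [1.0, 2.0, 4.0, 8.0, 16.0]
jittered = backoff_schedule(5, jitter=True, rng=random.Random(42))
```

Each value in `jittered` lands within 25% of the matching value in `plain`, so two workers that were throttled at the same instant are unlikely to retry at the same instant.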


Complete Decorator

import boto3
import time
import random
import functools
import logging
from botocore.exceptions import ClientError
from botocore.config import Config

logger = logging.getLogger(__name__)


# ── Decorator: add retry to any function ─────────────────────────
def aws_retry(
    max_retries: int   = 5,
    base_delay: float  = 0.5,
    max_delay: float   = 30.0,
    jitter: bool       = True,
    retryable_errors: set = None,
):
    """
    Decorator that wraps any function with automatic retry logic.

    max_retries:      total number of retry attempts (not counting the first try)
    base_delay:       initial wait before first retry (seconds)
    max_delay:        cap the wait time at this many seconds
    jitter:           add random variation to prevent thundering herd
    retryable_errors: set of AWS error codes to retry on

    Exponential backoff formula:
      delay = min(base_delay × 2^attempt, max_delay)
      with jitter: delay × uniform(0.75, 1.25)

    functools.wraps(func) copies the original function's __name__,
    __doc__, __module__, __qualname__ and __annotations__ to the
    wrapper — essential for debugging and introspection.
    """
    if retryable_errors is None:
        retryable_errors = {
            "ThrottlingException",
            "RequestLimitExceeded",
            "TooManyRequestsException",
            "ServiceUnavailable",
            "InternalServerError",
            "RequestTimeout",
            "ProvisionedThroughputExceededException",
            "LimitExceededException",
            "RequestExpired",
            "Throttling",               # Some services use just "Throttling"
            "SlowDown",                 # S3 uses this for throttling
        }

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None

            for attempt in range(max_retries + 1):   # +1: attempt 0 is the initial try
                try:
                    return func(*args, **kwargs)

                except ClientError as e:
                    error_code = e.response["Error"]["Code"]
                    error_msg  = e.response["Error"]["Message"]

                    # Non-retryable: re-raise immediately
                    if error_code not in retryable_errors:
                        raise

                    last_exception = e

                    # We've exhausted all retries
                    if attempt == max_retries:
                        logger.error(
                            f"{func.__name__} failed after {max_retries} retries. "
                            f"Last error: {error_code}: {error_msg}"
                        )
                        raise

                    # Calculate wait time
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        # Multiply by random value between 0.75 and 1.25
                        # This spreads retries across time instead of bunching them
                        delay *= (0.75 + random.random() * 0.5)

                    logger.warning(
                        f"{func.__name__} attempt {attempt + 1}/{max_retries + 1} failed "
                        f"({error_code}). Retrying in {delay:.2f}s..."
                    )
                    time.sleep(delay)

            raise last_exception   # Should never reach here, but satisfies type checkers

        return wrapper
    return decorator


# ── Usage: function-level decorator ──────────────────────────────
@aws_retry(max_retries=5, base_delay=1.0)
def list_all_instances(region: str) -> list:
    """List all EC2 instances in a region with auto-retry on throttle."""
    ec2 = boto3.client("ec2", region_name=region)
    instances = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            instances.extend(reservation["Instances"])
    return instances


@aws_retry(max_retries=3, base_delay=0.5)
def get_secret(secret_name: str) -> dict:
    """Retrieve secret from Secrets Manager with retry."""
    import json
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


# ── Usage: class-based wrapper (all methods auto-retry) ───────────
class AWSClientWithRetry:
    """
    Wraps a boto3 client so that EVERY method call is automatically
    retried on throttling. Useful when you use a single client extensively.

    __getattr__ is called when Python can't find an attribute on the object.
    We intercept it to wrap any callable (boto3 method) with the retry decorator.
    """

    def __init__(
        self,
        service: str,
        region: str = "us-east-1",
        max_retries: int = 5,
        **boto_kwargs,
    ):
        self._client      = boto3.client(service, region_name=region, **boto_kwargs)
        self._max_retries = max_retries

    def __getattr__(self, name: str):
        """
        Called when accessing any attribute not found on this object.
        Returns the boto3 method wrapped with the retry decorator.
        """
        attr = getattr(self._client, name)
        if callable(attr):
            return aws_retry(max_retries=self._max_retries)(attr)
        return attr


# ── Method 3: botocore built-in retry config (simpler) ───────────
def get_client_with_builtin_retry(service: str, region: str = "us-east-1"):
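    """
    botocore has built-in retry logic, configured through Config.
    retries["mode"] options:
      "legacy"   - the default; up to 5 total attempts, exponential backoff (base factor 2)
      "standard" - up to 3 total attempts by default, exponential backoff capped at 20s
      "adaptive" - standard mode plus client-side rate limiting via a token bucket
                   (documented by botocore as experimental)

    In a Config object, max_attempts counts retries only (the initial call
    is extra), so max_attempts=5 means 1 initial + up to 5 retries.
    Use total_max_attempts if you want a total that includes the first call.
    """
    config = Config(
        retries={
            "mode":         "adaptive",   # Standard retries + client-side rate limiting
            "max_attempts": 10,           # Up to 10 retries after the initial call
        },
        connect_timeout=5,
        read_timeout=30,
    )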
    return boto3.client(service, region_name=region, config=config)


# ── Combining approaches ──────────────────────────────────────────
if __name__ == "__main__":
    # Approach 1: Function decorator (best for specific functions)
    instances = list_all_instances("ap-south-1")
    print(f"Found {len(instances)} instances")

    # Approach 2: Class wrapper (best when reusing a client heavily)
    ec2 = AWSClientWithRetry("ec2", region="ap-south-1", max_retries=5)

    # All these calls will auto-retry on ThrottlingException:
    response = ec2.describe_vpcs()
    sgs = ec2.describe_security_groups()
    subnets = ec2.describe_subnets()
    print(f"VPCs: {len(response['Vpcs'])}")

    # Approach 3: botocore adaptive mode (simplest — built-in)
    s3 = get_client_with_builtin_retry("s3")
    buckets = s3.list_buckets()["Buckets"]
    print(f"S3 buckets: {len(buckets)}")

Retry Timing Comparison

| Attempt | Base delay (0.5s × 2^n) | With jitter (×0.75–1.25) |
|---|---|---|
| 1st retry | 0.5s | 0.38–0.63s |
| 2nd retry | 1.0s | 0.75–1.25s |
| 3rd retry | 2.0s | 1.50–2.50s |
| 4th retry | 4.0s | 3.00–5.00s |
| 5th retry | 8.0s | 6.00–10.00s |
| Max cap | 30.0s | 22.5–37.5s |
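When budgeting a script's run time it helps to add these rows up. A minimal sketch, assuming the same formula as the decorator and the worst case where every attempt is throttled; the function name `total_wait_bounds` is made up for this example.

```python
def total_wait_bounds(max_retries=5, base_delay=0.5, max_delay=30.0,
                      jitter_low=0.75, jitter_high=1.25):
    """Return (no_jitter, best_case, worst_case) total sleep time in seconds
    when all retries are used. Hypothetical helper, for illustration only."""
    # One nominal delay per retry: 0.5, 1.0, 2.0, ... capped at max_delay.
    nominal = [min(base_delay * (2 ** n), max_delay) for n in range(max_retries)]
    total = sum(nominal)
    # Jitter scales every delay by a factor in [jitter_low, jitter_high).
    return total, total * jitter_low, total * jitter_high


no_jitter, best, worst = total_wait_bounds()
```

With the defaults this gives 15.5 seconds of sleeping without jitter and between 11.625 and 19.375 seconds with it, not counting the time spent in the API calls themselves.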

Key Commands Explained

| Command | What it does |
|---|---|
| `@functools.wraps(func)` | Copies original function metadata to the wrapper (preserves `__name__`, `__doc__`) |
| `e.response["Error"]["Code"]` | The AWS error type string (e.g., "ThrottlingException") |
| `e.response["Error"]["Message"]` | Human-readable error description |
| `min(base_delay * (2 ** attempt), max_delay)` | Exponential backoff capped at max_delay |
| `random.random() * 0.5` | Random value 0.0–0.5, added to 0.75 to get a 0.75–1.25 multiplier |
| `time.sleep(delay)` | Block the current thread for the calculated delay |
| `Config(retries={"mode": "adaptive"})` | botocore built-in adaptive retry with token bucket |
| `__getattr__(self, name)` | Python dunder called for attribute misses, used here for transparent method wrapping |

Common Issues

Decorator not retrying — Check that the error code in your ClientError matches one in retryable_errors. Print e.response["Error"]["Code"] to see the exact value.

Jitter pushing delays past max_delay: The cap is applied before the jitter multiplier, so a delay can reach 1.25 × max_delay (37.5s with the default 30s cap). The multiplier itself, 0.75 + random × 0.5, always stays between ×0.75 and ×1.25.
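If max_delay has to be a hard ceiling, one option is to apply the cap after the jitter instead of before it. This is a variation on the article's formula, not what the decorator above does, and `capped_jitter_delay` is a hypothetical helper.

```python
import random


def capped_jitter_delay(attempt, base_delay=0.5, max_delay=30.0, rng=random):
    """Exponential backoff with jitter where max_delay is a hard upper bound."""
    delay = base_delay * (2 ** attempt)
    delay *= 0.75 + rng.random() * 0.5   # same 0.75-1.25 multiplier as the decorator
    return min(delay, max_delay)         # cap last, so jitter cannot push past it


# At attempt 10 the uncapped delay would be several hundred seconds.
samples = [capped_jitter_delay(10) for _ in range(1000)]
```

The trade-off is that once the exponential term passes max_delay, every caller sleeps exactly max_delay, so the spreading effect of jitter disappears at the cap.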

Don’t wrap write operations blindly — Retrying create_security_group on a transient error can create duplicate resources. Add idempotency checks (e.g., check if the resource exists before creating).
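The "check before creating" idea can be sketched as follows. `ensure_security_group` is a hypothetical helper, and `FakeEC2` is an in-memory stand-in so the example runs without AWS credentials; with a real client you would pass `boto3.client("ec2")` instead.

```python
def ensure_security_group(ec2, name, description, vpc_id):
    """Return the GroupId for `name` in `vpc_id`, creating the group only if absent."""
    existing = ec2.describe_security_groups(
        Filters=[
            {"Name": "group-name", "Values": [name]},
            {"Name": "vpc-id", "Values": [vpc_id]},
        ]
    )["SecurityGroups"]
    if existing:
        return existing[0]["GroupId"]      # already there: safe to call again
    created = ec2.create_security_group(
        GroupName=name, Description=description, VpcId=vpc_id
    )
    return created["GroupId"]


class FakeEC2:
    """In-memory stand-in for a boto3 EC2 client, for demonstration only."""

    def __init__(self):
        self.groups = []

    def describe_security_groups(self, Filters):
        wanted = {f["Name"]: f["Values"] for f in Filters}
        matches = [
            g for g in self.groups
            if g["GroupName"] in wanted["group-name"] and g["VpcId"] in wanted["vpc-id"]
        ]
        return {"SecurityGroups": matches}

    def create_security_group(self, GroupName, Description, VpcId):
        group = {"GroupId": f"sg-{len(self.groups):04d}",
                 "GroupName": GroupName, "VpcId": VpcId}
        self.groups.append(group)
        return {"GroupId": group["GroupId"]}


fake = FakeEC2()
first = ensure_security_group(fake, "web", "web tier", "vpc-123")
second = ensure_security_group(fake, "web", "web tier", "vpc-123")  # simulated retry
```

Both calls return the same group ID and only one group exists afterwards. A small window remains between the check and the create if several processes run at once, so this reduces the risk rather than removing it.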


🔍 Line-by-Line Code Walkthrough

Imports

| Line | Why It’s Used |
|---|---|
| `import functools` | Provides `functools.wraps`, the key tool for writing proper decorators |
| `from botocore.exceptions import ClientError` | The exception class for all AWS API errors. Has `.response["Error"]["Code"]` to identify the error type |
| `from botocore.config import Config` | Allows configuring retry behavior, timeouts, and connection pooling at the client level |

aws_retry(max_retries, base_delay, max_delay, jitter, retryable_errors) — The Outer Decorator Factory

def aws_retry(
    max_retries: int   = 5,
    base_delay: float  = 0.5,
    max_delay: float   = 30.0,
    jitter: bool       = True,
    retryable_errors: set = None,
):
| Parameter | Explanation |
|---|---|
| `max_retries=5` | Total retry attempts AFTER the first try. So 5 retries = 6 total attempts |
| `base_delay=0.5` | Wait 0.5 seconds before the first retry. Each subsequent retry doubles this |
| `max_delay=30.0` | Cap the backoff at 30 seconds, which prevents waiting 512s on attempt 10 |
| `jitter=True` | Multiplies the delay by a random factor (0.75–1.25). Prevents 100 threads all retrying at the same moment (“thundering herd”) |
| `retryable_errors: set = None` | Which AWS error codes trigger a retry. None means use the built-in set of throttling/transient codes |
if retryable_errors is None:
    retryable_errors = {
        "ThrottlingException",
        "RequestLimitExceeded",
        "SlowDown",
        ...
    }
| Line | Explanation |
|---|---|
| `retryable_errors is None` | We use None as default (not `set()`) because mutable default arguments in Python are shared across calls, which is a subtle source of bugs. None plus this check is the safe pattern |
| `{"ThrottlingException", ...}` | A set for O(1) lookup. When `error_code not in retryable_errors` is checked on every exception, set lookup is faster than list search |
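The shared-default pitfall is easy to reproduce in isolation. The two function names below are made up for the demonstration and do not appear in the decorator.

```python
def add_code_shared(code, codes=set()):   # BUG: one set object shared by every call
    codes.add(code)
    return codes


def add_code_safe(code, codes=None):      # safe: a fresh set for each call
    if codes is None:
        codes = set()
    codes.add(code)
    return codes


a = add_code_shared("Throttling")
b = add_code_shared("SlowDown")   # same object as `a`; now holds both codes
c = add_code_safe("Throttling")
d = add_code_safe("SlowDown")     # independent set; holds only "SlowDown"
```

In `aws_retry` the default set is only read, never modified, so the None check is a defensive habit there rather than a fix for an observed bug.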

decorator(func) and wrapper(*args, **kwargs) — The Closure

def decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
| Line | Explanation |
|---|---|
| `def decorator(func):` | `aws_retry(...)` returns `decorator`. This is the two-level pattern required when a decorator takes arguments. `@aws_retry(max_retries=5)` calls `aws_retry()` first, then applies the returned decorator to the function |
| `@functools.wraps(func)` | Copies `func.__name__`, `func.__doc__`, `func.__module__`, `func.__qualname__`, `func.__annotations__` to `wrapper`. Without this, your wrapped function’s `__name__` would be "wrapper", which breaks logging, tracebacks, and introspection |
| `def wrapper(*args, **kwargs):` | Accepts any arguments the original function takes and passes them through. The retry logic is completely transparent to callers |
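The effect of `functools.wraps` can be seen by decorating the same function with and without it. The decorator and function names here are illustrative only.

```python
import functools


def without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def with_wraps(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@without_wraps
def list_buckets_a():
    """List buckets."""


@with_wraps
def list_buckets_b():
    """List buckets."""


# Metadata as seen by logging, tracebacks, and help():
names = (list_buckets_a.__name__, list_buckets_b.__name__)
docs = (list_buckets_a.__doc__, list_buckets_b.__doc__)
```

Here `names` is `("wrapper", "list_buckets_b")` and the first docstring is lost. `functools.wraps` also sets `__wrapped__` on the wrapper, which gives access to the undecorated function.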

The Retry Loop

for attempt in range(max_retries + 1):   # +1: attempt 0 is the initial try
    try:
        return func(*args, **kwargs)
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_msg  = e.response["Error"]["Message"]
        if error_code not in retryable_errors:
            raise
| Line | Explanation |
|---|---|
| `range(max_retries + 1)` | If max_retries=5, this is `range(6)` → attempts 0,1,2,3,4,5. Attempt 0 is the initial call, attempts 1–5 are retries |
| `return func(*args, **kwargs)` | On success, immediately returns the result. The loop stops here, with no further retry overhead |
| `e.response["Error"]["Code"]` | boto3 puts the AWS error code in `e.response["Error"]["Code"]`. Examples: "ThrottlingException", "AccessDenied", "NoSuchBucket" |
| `e.response["Error"]["Message"]` | Human-readable description: "Rate exceeded", "Access Denied", etc. |
| `if error_code not in retryable_errors: raise` | For non-retryable errors (e.g., "AccessDeniedException", "NoSuchBucket"), re-raise immediately. Retrying would be pointless and wasteful |
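The loop's two exits (retry on a listed code, re-raise on anything else) can be exercised without AWS. This is a reduced sketch, not the article's decorator: `FakeClientError` is a stand-in that only mimics the `.response` shape of `ClientError`, and `call_with_retry` and `make_flaky` are hypothetical helpers with an injectable `sleep` so nothing actually waits.

```python
class FakeClientError(Exception):
    """Stand-in for botocore's ClientError (same .response shape), demo only."""

    def __init__(self, code, message="simulated"):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}


RETRYABLE = {"ThrottlingException", "SlowDown"}


def call_with_retry(func, max_retries=3, sleep=lambda seconds: None):
    """Same control flow as the wrapper's loop; sleep is injectable for tests."""
    for attempt in range(max_retries + 1):       # attempt 0 is the initial call
        try:
            return func()
        except FakeClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_retries:
                raise                            # non-retryable, or out of retries
            sleep(0.5 * (2 ** attempt))


def make_flaky(codes):
    """Return a function that raises each code in turn, then returns 'ok'."""
    pending = list(codes)

    def flaky():
        flaky.calls += 1
        if pending:
            raise FakeClientError(pending.pop(0))
        return "ok"

    flaky.calls = 0
    return flaky


throttled_twice = make_flaky(["ThrottlingException", "SlowDown"])
result = call_with_retry(throttled_twice)        # succeeds on the third call

denied = make_flaky(["AccessDenied"])
try:
    call_with_retry(denied)
    denied_outcome = "returned"
except FakeClientError:
    denied_outcome = "raised"                    # re-raised after a single call
```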

Exponential Backoff Formula

delay = min(base_delay * (2 ** attempt), max_delay)
if jitter:
    delay *= (0.75 + random.random() * 0.5)
time.sleep(delay)
| Line | Explanation |
|---|---|
| `base_delay * (2 ** attempt)` | Exponential growth: attempt 0 → ×1, attempt 1 → ×2, attempt 2 → ×4, attempt 3 → ×8, etc. |
| `min(..., max_delay)` | Caps the delay. Without this, attempt 10 would wait 0.5 × 2^10 = 512 seconds |
| `random.random()` | Returns a float in [0.0, 1.0). Multiply by 0.5 gives [0.0, 0.5). Add 0.75 gives [0.75, 1.25) |
| `delay *= (0.75 + random.random() * 0.5)` | Jitter: random multiplier between 0.75× and 1.25×. Each thread gets a different delay, spreading retries across time |
| `time.sleep(delay)` | Blocks the current thread. In a Lambda or single-threaded script, this is fine. In async code (asyncio), you’d use `await asyncio.sleep(delay)` instead |
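For the asyncio case mentioned in the last row, the same backoff can be written as an async decorator. This is a sketch under assumptions: `async_retry` is a hypothetical name, boto3 itself is synchronous, and `ConnectionError` stands in for whatever exception an async client would raise.

```python
import asyncio
import functools
import random


def async_retry(max_retries=3, base_delay=0.5, max_delay=30.0,
                retry_on=(ConnectionError,)):
    """Retry an async function with exponential backoff and jitter."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    delay *= 0.75 + random.random() * 0.5
                    # Unlike time.sleep, this lets other tasks run while waiting.
                    await asyncio.sleep(delay)
        return wrapper
    return decorator


attempts = []


@async_retry(max_retries=3, base_delay=0.001)
async def flaky_call():
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated transient failure")
    return "ok"


result = asyncio.run(flaky_call())   # fails twice, then succeeds
```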

AWSClientWithRetry.__getattr__ — Transparent Method Wrapping

def __getattr__(self, name: str):
    attr = getattr(self._client, name)
    if callable(attr):
        return aws_retry(max_retries=self._max_retries)(attr)
    return attr
| Line | Explanation |
|---|---|
| `__getattr__(self, name)` | Python calls `__getattr__` only when the attribute is NOT found through normal lookup. Since describe_instances is not defined on AWSClientWithRetry, Python calls this method with name="describe_instances" |
| `getattr(self._client, name)` | Gets the actual method from the underlying boto3 client |
| `if callable(attr)` | `callable()` returns True for functions and methods, False for properties, strings, etc. |
| `return aws_retry(...)(attr)` | `aws_retry(max_retries=5)` returns `decorator`. Calling `decorator(attr)` returns the wrapped method. We return it without calling it, because the caller will call it |
| `return attr` | For non-callable attributes (like meta), return them as-is with no wrapping |
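Because `__getattr__` runs on every miss, the class above builds a new wrapper each time a method is accessed. One optional refinement is to cache the wrapped method on the instance. The sketch below is an assumption-laden variation, not the article's class: `FakeClient` stands in for a boto3 client and `passthrough` stands in for `aws_retry(...)`.

```python
import functools


class FakeClient:
    """Stand-in for a boto3 client, for demonstration only."""
    region = "ap-south-1"

    def describe_vpcs(self):
        return {"Vpcs": []}


def passthrough(func):
    """Placeholder for aws_retry(...); wraps without changing behavior."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


class CachingProxy:
    def __init__(self, client):
        self._client = client
        self.lookups = 0          # counts how often __getattr__ runs

    def __getattr__(self, name):
        # Only reached when normal attribute lookup on the proxy fails.
        if name.startswith("_"):
            # Guards against infinite recursion if _client is not set yet.
            raise AttributeError(name)
        self.lookups += 1
        attr = getattr(self._client, name)
        if callable(attr):
            attr = passthrough(attr)
            setattr(self, name, attr)   # cache: later accesses skip __getattr__
        return attr


proxy = CachingProxy(FakeClient())
proxy.describe_vpcs()
proxy.describe_vpcs()   # served from the instance, __getattr__ not called again
```

After the two calls `proxy.lookups` is 1, which shows that the second access never reached `__getattr__`.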

get_client_with_builtin_retry — botocore’s Built-In Retry

config = Config(
    retries={"mode": "adaptive", "max_attempts": 10},
    connect_timeout=5,
    read_timeout=30,
)
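| Field | Explanation |
|---|---|
| `"mode": "adaptive"` | Standard-mode retries plus client-side rate limiting driven by a token bucket: when the service throttles, the client lowers its own send rate, then raises it again as calls succeed. botocore documents this mode as experimental |
| `"mode": "standard"` | Exponential backoff (base factor 2, capped at 20 seconds) with a default of 3 total attempts. Simpler, with no client-side rate limiting |
| `"mode": "legacy"` | The original retry handler and the default if you don’t set mode: 5 total attempts by default, exponential backoff with base factor 2, and a more limited set of retried errors |
| `"max_attempts": 10` | In a Config object this counts retries only, so 10 = 1 initial call + up to 10 retries. Use `total_max_attempts` to set a total that includes the initial call |
| `connect_timeout=5` | Give up connecting to the AWS API endpoint after 5 seconds. Catches DNS failures and network partitions quickly |
| `read_timeout=30` | Give up waiting for response data after 30 seconds. Some operations (like large S3 copies) need longer |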
Services Used
EC2, S3, IAM, boto3, botocore
Prerequisites
  • Python 3.8+
  • boto3
  • botocore
What You Learned
  • Python functools.wraps decorator pattern
  • botocore ClientError structure
  • Exponential backoff formula
  • Jitter to prevent thundering herd
  • Adaptive retry mode in botocore
