boto3 Retry Decorator with Exponential Backoff for ThrottlingException

Problem Statement

Your compliance script calls describe_instances in a loop across 50 regions and 20 accounts. After 30 seconds, you start getting ThrottlingException: Rate exceeded. Once the SDK’s default retries run out, the script crashes and leaves you with incomplete data. With a more patient exponential backoff, it slows down automatically and completes successfully.

Why Exponential Backoff + Jitter?

Naive retry (bad):     retry immediately → still throttled → fail
Fixed delay (better):  wait 1s → retry → wait 1s → retry
Exponential (good):    wait 1s → 2s → 4s → 8s → 16s (backs off)
Exponential + Jitter (best): 0.8s → 1.7s → 3.9s → 7.2s (avoids thundering herd)

Jitter adds random variation so that 100 concurrent threads don’t all wake up and retry at the same moment — which would just cause another wave of throttling.
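A small simulation makes the effect visible. This is an illustrative sketch only: the client count, the 0.75–1.25 multiplier, and the helper name `retry_times` are assumptions chosen to match the examples above, not measurements of real AWS behavior.

```python
import random

def retry_times(clients: int, base_delay: float, attempt: int,
                jitter: bool, rng: random.Random) -> list[float]:
    """Return the delay each client would sleep before retry number `attempt`."""
    delay = base_delay * (2 ** attempt)
    if not jitter:
        return [delay] * clients
    # Each client draws its own multiplier in [0.75, 1.25)
    return [delay * (0.75 + rng.random() * 0.5) for _ in range(clients)]

rng = random.Random(42)  # fixed seed so the run is repeatable
no_jitter = retry_times(100, base_delay=1.0, attempt=2, jitter=False, rng=rng)
with_jitter = retry_times(100, base_delay=1.0, attempt=2, jitter=True, rng=rng)

# Without jitter every client wakes at exactly 4.0s;
# with jitter the wake-ups are spread across the 3.0-5.0s window.
print(f"no jitter:   {len(set(no_jitter))} distinct wake-up time(s)")
print(f"with jitter: spread from {min(with_jitter):.2f}s to {max(with_jitter):.2f}s")
```

Without jitter all 100 simulated clients share a single wake-up time, which is exactly the synchronized burst that triggers the next round of throttling.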

Complete Decorator

import boto3
import time
import random
import functools
import logging
from botocore.exceptions import ClientError
from botocore.config import Config

logger = logging.getLogger(__name__)


# ── Decorator: add retry to any function ─────────────────────────
def aws_retry(
    max_retries: int   = 5,
    base_delay: float  = 0.5,
    max_delay: float   = 30.0,
    jitter: bool       = True,
    retryable_errors: set = None,
):
    """
    Decorator that wraps any function with automatic retry logic.

    max_retries:      total number of retry attempts (not counting the first try)
    base_delay:       initial wait before first retry (seconds)
    max_delay:        cap the wait time at this many seconds
    jitter:           add random variation to prevent thundering herd
    retryable_errors: set of AWS error codes to retry on

    Exponential backoff formula:
      delay = min(base_delay × 2^attempt, max_delay)
      with jitter: delay × uniform(0.75, 1.25)

    functools.wraps(func) copies the original function's __name__,
    __doc__, __module__, __qualname__ and __annotations__ to the
    wrapper — essential for debugging and introspection.
    """
    if retryable_errors is None:
        retryable_errors = {
            "ThrottlingException",
            "RequestLimitExceeded",
            "TooManyRequestsException",
            "ServiceUnavailable",
            "InternalServerError",
            "RequestTimeout",
            "ProvisionedThroughputExceededException",
            "LimitExceededException",
            "RequestExpired",
            "Throttling",               # Some services use just "Throttling"
            "SlowDown",                 # S3 uses this for throttling
        }

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None

            for attempt in range(max_retries + 1):   # +1: attempt 0 is the initial try
                try:
                    return func(*args, **kwargs)

                except ClientError as e:
                    error_code = e.response["Error"]["Code"]
                    error_msg  = e.response["Error"]["Message"]

                    # Non-retryable: re-raise immediately
                    if error_code not in retryable_errors:
                        raise

                    last_exception = e

                    # We've exhausted all retries
                    if attempt == max_retries:
                        logger.error(
                            f"{func.__name__} failed after {max_retries} retries. "
                            f"Last error: {error_code}: {error_msg}"
                        )
                        raise

                    # Calculate wait time
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        # Multiply by random value between 0.75 and 1.25
                        # This spreads retries across time instead of bunching them
                        delay *= (0.75 + random.random() * 0.5)

                    logger.warning(
                        # Total attempts = initial call + max_retries
                        f"{func.__name__} attempt {attempt + 1}/{max_retries + 1} failed "
                        f"({error_code}). Retrying in {delay:.2f}s..."
                    )
                    time.sleep(delay)

            raise last_exception   # Should never reach here, but satisfies type checkers

        return wrapper
    return decorator


# ── Usage: function-level decorator ──────────────────────────────
@aws_retry(max_retries=5, base_delay=1.0)
def list_all_instances(region: str) -> list:
    """List all EC2 instances in a region with auto-retry on throttle."""
    ec2 = boto3.client("ec2", region_name=region)
    instances = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate():
        for reservation in page["Reservations"]:
            instances.extend(reservation["Instances"])
    return instances


@aws_retry(max_retries=3, base_delay=0.5)
def get_secret(secret_name: str) -> dict:
    """Retrieve secret from Secrets Manager with retry."""
    import json
    sm = boto3.client("secretsmanager")
    response = sm.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


# ── Usage: class-based wrapper (all methods auto-retry) ───────────
class AWSClientWithRetry:
    """
    Wraps a boto3 client so that EVERY method call is automatically
    retried on throttling. Useful when you use a single client extensively.

    __getattr__ is called when Python can't find an attribute on the object.
    We intercept it to wrap any callable (boto3 method) with the retry decorator.
    """

    def __init__(
        self,
        service: str,
        region: str = "us-east-1",
        max_retries: int = 5,
        **boto_kwargs,
    ):
        self._client      = boto3.client(service, region_name=region, **boto_kwargs)
        self._max_retries = max_retries

    def __getattr__(self, name: str):
        """
        Called when accessing any attribute not found on this object.
        Returns the boto3 method wrapped with the retry decorator.
        """
        attr = getattr(self._client, name)
        if callable(attr):
            return aws_retry(max_retries=self._max_retries)(attr)
        return attr


# ── Method 3: botocore built-in retry config (simpler) ───────────
def get_client_with_builtin_retry(service: str, region: str = "us-east-1"):
    """
    botocore has built-in retry logic via Config.
    retry.mode options:
      "legacy"   — default; up to 5 total attempts by default, exponential backoff
      "standard" — up to 3 total attempts by default, exponential backoff
      "adaptive" — standard mode plus client-side rate limiting via a token bucket
                   (best for throttling)

    In Config, "max_attempts" counts retries AFTER the initial call, while
    "total_max_attempts" counts every call including the first.
    So total_max_attempts=5 means 1 initial + 4 retries.
    """
    config = Config(
        retries={
            "mode":               "adaptive",   # Adaptive token bucket algorithm
            "total_max_attempts": 10,           # 1 initial call + up to 9 retries
        },
        connect_timeout=5,
        read_timeout=30,
    )
    return boto3.client(service, region_name=region, config=config)


# ── Combining approaches ──────────────────────────────────────────
if __name__ == "__main__":
    # Approach 1: Function decorator (best for specific functions)
    instances = list_all_instances("ap-south-1")
    print(f"Found {len(instances)} instances")

    # Approach 2: Class wrapper (best when reusing a client heavily)
    ec2 = AWSClientWithRetry("ec2", region="ap-south-1", max_retries=5)

    # All these calls will auto-retry on ThrottlingException:
    response = ec2.describe_vpcs()
    sgs = ec2.describe_security_groups()
    subnets = ec2.describe_subnets()
    print(f"VPCs: {len(response['Vpcs'])}")

    # Approach 3: botocore adaptive mode (simplest — built-in)
    s3 = get_client_with_builtin_retry("s3")
    buckets = s3.list_buckets()["Buckets"]
    print(f"S3 buckets: {len(buckets)}")

Retry Timing Comparison

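Retry	Calculation (base_delay = 0.5s)	Delay before jitter	Delay with jitter (×0.75–1.25)
1st retry	0.5 × 2^0	0.5s	0.38–0.63s
2nd retry	0.5 × 2^1	1.0s	0.75–1.25s
3rd retry	0.5 × 2^2	2.0s	1.50–2.50s
4th retry	0.5 × 2^3	4.0s	3.00–5.00s
5th retry	0.5 × 2^4	8.0s	6.00–10.00s
At the max_delay cap	min(..., 30.0)	30.0s	22.50–37.50s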
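The table can be reproduced from the backoff formula. The sketch below assumes the decorator's default parameters; the helper name `backoff_schedule` is hypothetical and not part of the decorator itself. The upper bound of each jitter range is exclusive because `random.random()` never returns 1.0.

```python
def backoff_schedule(max_retries: int = 5, base_delay: float = 0.5,
                     max_delay: float = 30.0) -> list[tuple]:
    """Return (retry_number, nominal, low, high) for each retry."""
    rows = []
    for attempt in range(max_retries):
        nominal = min(base_delay * (2 ** attempt), max_delay)
        # Jitter multiplies the nominal delay by a factor in [0.75, 1.25)
        rows.append((attempt + 1, nominal, nominal * 0.75, nominal * 1.25))
    return rows

schedule = backoff_schedule()
for retry, nominal, low, high in schedule:
    print(f"retry {retry}: {nominal:4.1f}s  (jitter range {low:.2f}-{high:.2f}s)")

# Upper bound on the time spent sleeping if every retry is needed
worst_case_total = sum(high for _, _, _, high in schedule)
print(f"worst-case total wait: {worst_case_total:.2f}s")
```

With the defaults, the five nominal delays add up to 15.5s, so a call that exhausts every retry sleeps for somewhat under 20s in total, on top of the request time itself.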

Key Commands Explained

Command	What it does
`@functools.wraps(func)`	Copies original function metadata to the wrapper (preserves `__name__`, `__doc__`)
`e.response["Error"]["Code"]`	The AWS error type string (e.g., `"ThrottlingException"`)
`e.response["Error"]["Message"]`	Human-readable error description
`min(base_delay * (2 ** attempt), max_delay)`	Exponential backoff capped at max_delay
`random.random() * 0.5`	Random value 0.0–0.5, added to 0.75 to get 0.75–1.25 multiplier
`time.sleep(delay)`	Block the current thread for the calculated delay
`Config(retries={"mode": "adaptive"})`	botocore built-in adaptive retry with token bucket
`__getattr__(self, name)`	Python dunder called for attribute misses — used for transparent method wrapping

Common Issues

Decorator not retrying — Check that the error code in your ClientError matches one in retryable_errors. Print e.response["Error"]["Code"] to see the exact value.

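Jitter causing very long delays — Jitter is applied after the max_delay cap, so the actual sleep can reach 1.25 × max_delay (37.5s with max_delay=30). The multiplier itself always stays between ×0.75 and ×1.25. If max_delay must be a hard ceiling, apply min(delay, max_delay) again after the jitter step.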
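If the cap needs to be a hard limit, one option is to apply the jitter first and clamp afterwards. This is a sketch of that variant, not the decorator's current behavior; `capped_delay` is a hypothetical helper name.

```python
import random

def capped_delay(attempt: int, base_delay: float = 0.5,
                 max_delay: float = 30.0, rng=random) -> float:
    """Backoff with jitter applied before the cap, so max_delay is a hard limit."""
    raw = base_delay * (2 ** attempt)
    jittered = raw * (0.75 + rng.random() * 0.5)
    return min(jittered, max_delay)

# At attempt 10 the raw delay is 512s, far above the cap
delays = [capped_delay(attempt=10) for _ in range(1000)]
print(max(delays))
```

The trade-off is that once the cap is reached, every caller sleeps for exactly max_delay, so the spreading effect of jitter is lost at that point.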

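Don’t wrap write operations blindly — Retrying a non-idempotent call such as run_instances after a timeout or server error can create duplicate resources if the first request actually succeeded. Pass an idempotency token where the API supports one (e.g., ClientToken on run_instances), or check whether the resource already exists before creating it.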
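One way to make a create call safe to retry is an idempotency token, which some AWS APIs accept (for example `ClientToken` on EC2 `run_instances`). The sketch below generates the token once and reuses it on every attempt. `FakeEC2`, `launch_once`, and the AMI ID are hypothetical stand-ins so the example runs without AWS credentials; the backoff sleep is left out for brevity.

```python
import uuid

class FakeEC2:
    """Hypothetical stand-in for an EC2 client: remembers tokens it has seen."""
    def __init__(self):
        self.launched = {}      # token -> instance id
        self.calls = 0

    def run_instances(self, ClientToken: str, **kwargs):
        self.calls += 1
        # Same token -> same result, mimicking an idempotent API
        if ClientToken not in self.launched:
            self.launched[ClientToken] = f"i-{len(self.launched):08x}"
        instance_id = self.launched[ClientToken]
        if self.calls == 1:
            # Simulate a response lost after the server already did the work
            raise TimeoutError("response lost")
        return {"Instances": [{"InstanceId": instance_id}]}

def launch_once(client, max_attempts: int = 3, **kwargs):
    token = str(uuid.uuid4())   # generated ONCE, outside the retry loop
    for attempt in range(max_attempts):
        try:
            return client.run_instances(ClientToken=token, **kwargs)
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise

fake = FakeEC2()
result = launch_once(fake, ImageId="ami-hypothetical", MinCount=1, MaxCount=1)
print(result["Instances"][0]["InstanceId"], "after", fake.calls, "calls")
print("resources created:", len(fake.launched))
```

The important detail is that the token is created outside the loop. A fresh token per attempt would defeat the purpose, because the service would treat each retry as a new request.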

🔍 Line-by-Line Code Walkthrough

Imports

Line	Why It’s Used
`import functools`	Provides `functools.wraps` — the key tool for writing proper decorators
`from botocore.exceptions import ClientError`	The exception class for all AWS API errors. Has `.response["Error"]["Code"]` to identify the error type
`from botocore.config import Config`	Allows configuring retry behavior, timeouts, and connection pooling at the client level

`aws_retry(max_retries, base_delay, max_delay, jitter, retryable_errors)` — The Outer Decorator Factory

def aws_retry(
    max_retries: int   = 5,
    base_delay: float  = 0.5,
    max_delay: float   = 30.0,
    jitter: bool       = True,
    retryable_errors: set = None,
):

Parameter	Explanation
`max_retries=5`	Total retry attempts AFTER the first try. So 5 retries = 6 total attempts
`base_delay=0.5`	Wait 0.5 seconds before the first retry. Each subsequent retry doubles this
`max_delay=30.0`	Cap the backoff at 30 seconds — prevents waiting 512s on attempt 10
`jitter=True`	Multiplies the delay by a random factor (0.75–1.25). Prevents 100 threads all retrying at the same moment (“thundering herd”)
`retryable_errors: set = None`	Which AWS error codes trigger a retry. `None` means use the built-in set of throttling/transient codes

if retryable_errors is None:
    retryable_errors = {
        "ThrottlingException",
        "RequestLimitExceeded",
        "SlowDown",
        ...
    }

Line	Explanation
`retryable_errors is None`	We use `None` as default (not `set()`) because mutable default arguments in Python are shared across calls — a subtle bug. `None` + this check is the safe pattern
`{"ThrottlingException", ...}`	A set for O(1) lookup. When `error_code not in retryable_errors` is checked on every exception, set lookup is faster than list search
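The shared-default pitfall can be reproduced in a few lines. The function names here are hypothetical and exist only to contrast the two patterns.

```python
def add_code_shared(code, codes=set()):      # BUG: one set shared by every call
    codes.add(code)
    return codes

def add_code_safe(code, codes=None):         # safe: fresh set per call
    if codes is None:
        codes = set()
    codes.add(code)
    return codes

add_code_shared("Throttling")
second_shared = add_code_shared("SlowDown")  # still contains "Throttling"

add_code_safe("Throttling")
second_safe = add_code_safe("SlowDown")      # contains only "SlowDown"

print(sorted(second_shared))   # ['SlowDown', 'Throttling']
print(sorted(second_safe))     # ['SlowDown']
```

The decorator never mutates `retryable_errors`, so the bug would not bite today, but the `None` default keeps it safe if someone later adds codes to the set in place.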

`decorator(func)` and `wrapper(*args, **kwargs)` — The Closure

def decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):

Line	Explanation
`def decorator(func):`	`aws_retry(...)` returns `decorator`. This is the two-level pattern required when a decorator takes arguments. `@aws_retry(max_retries=5)` calls `aws_retry()` first, then applies the returned `decorator` to the function
`@functools.wraps(func)`	Copies `func.__name__`, `func.__doc__`, `func.__module__`, `func.__qualname__`, `func.__annotations__` to `wrapper`. Without this, your wrapped function’s `__name__` would be `"wrapper"` — breaking logging, tracebacks, and introspection
`def wrapper(*args, **kwargs):`	Accepts any arguments the original function takes and passes them through. The retry logic is completely transparent to callers
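The effect of `functools.wraps` is easy to check directly. In this minimal sketch the two decorators and the decorated functions are hypothetical and differ only in whether `wraps` is applied.

```python
import functools

def without_wraps(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

def with_wraps(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper

@without_wraps
def list_buckets_plain():
    """List buckets."""

@with_wraps
def list_buckets_wrapped():
    """List buckets."""

print(list_buckets_plain.__name__)      # wrapper
print(list_buckets_plain.__doc__)       # None
print(list_buckets_wrapped.__name__)    # list_buckets_wrapped
print(list_buckets_wrapped.__doc__)     # List buckets.
```

`functools.wraps` also sets a `__wrapped__` attribute pointing at the original function, which is handy in tests that need to call the undecorated version.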

The Retry Loop

for attempt in range(max_retries + 1):   # +1: attempt 0 is the initial try
    try:
        return func(*args, **kwargs)
    except ClientError as e:
        error_code = e.response["Error"]["Code"]
        error_msg  = e.response["Error"]["Message"]
        if error_code not in retryable_errors:
            raise

Line	Explanation
`range(max_retries + 1)`	If `max_retries=5`, this is `range(6)` → attempts 0,1,2,3,4,5. Attempt 0 is the initial call, attempts 1-5 are retries
`return func(*args, **kwargs)`	On success, immediately returns the result. The loop stops here — no more retry overhead
`e.response["Error"]["Code"]`	boto3 puts the AWS error code in `e.response["Error"]["Code"]`. Examples: `"ThrottlingException"`, `"AccessDenied"`, `"NoSuchBucket"`
`e.response["Error"]["Message"]`	Human-readable description: `"Rate exceeded"`, `"Access Denied"`, etc.
`if error_code not in retryable_errors: raise`	For non-retryable errors (e.g., `"AccessDeniedException"`, `"NoSuchBucket"`), re-raise immediately — retrying would be pointless and wasteful
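The loop's two branches (retry on a throttle, re-raise on anything else) can be exercised without touching AWS. The sketch below is a simplified version of the loop, not the decorator itself. `FakeClientError` is a hypothetical stand-in that only mimics the `.response` shape of botocore's `ClientError`, and the sleep is a no-op so the example finishes instantly.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for botocore's ClientError (same .response shape)."""
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.response = {"Error": {"Code": code, "Message": message}}

RETRYABLE = {"ThrottlingException", "SlowDown"}

def call_with_retry(func, max_retries: int = 5, sleep=lambda seconds: None):
    for attempt in range(max_retries + 1):       # attempt 0 is the initial call
        try:
            return func()
        except FakeClientError as e:
            code = e.response["Error"]["Code"]
            if code not in RETRYABLE or attempt == max_retries:
                raise                            # non-retryable, or out of retries
            sleep(0.5 * 2 ** attempt)            # backoff (no-op sleep in this demo)

calls = {"throttled": 0, "denied": 0}

def throttled_twice():
    calls["throttled"] += 1
    if calls["throttled"] <= 2:
        raise FakeClientError("ThrottlingException", "Rate exceeded")
    return "ok"

def always_denied():
    calls["denied"] += 1
    raise FakeClientError("AccessDenied", "Access Denied")

print(call_with_retry(throttled_twice), calls["throttled"])   # ok 3
try:
    call_with_retry(always_denied)
except FakeClientError as e:
    print(e.response["Error"]["Code"], calls["denied"])        # AccessDenied 1
```

The throttled function is called three times (one initial call plus two retries), while the access-denied function is called exactly once.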

Exponential Backoff Formula

delay = min(base_delay * (2 ** attempt), max_delay)
if jitter:
    delay *= (0.75 + random.random() * 0.5)
time.sleep(delay)

Line	Explanation
`base_delay * (2 ** attempt)`	Exponential growth: attempt 0 → ×1, attempt 1 → ×2, attempt 2 → ×4, attempt 3 → ×8, etc.
`min(..., max_delay)`	Caps the delay. Without this, attempt 10 would wait `0.5 × 2^10 = 512 seconds`
`random.random()`	Returns a float in `[0.0, 1.0)`. Multiply by 0.5 gives `[0.0, 0.5)`. Add 0.75 gives `[0.75, 1.25)`
`delay *= (0.75 + random.random() * 0.5)`	Jitter: random multiplier between 0.75× and 1.25×. Each thread gets a different delay, spreading retries across time
`time.sleep(delay)`	Blocks the current thread. In a Lambda or single-threaded script, this is fine. In async code (asyncio), you’d use `await asyncio.sleep(delay)` instead
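For async code, the same backoff logic can be written with `await asyncio.sleep(delay)`. This is a sketch under assumptions: boto3 calls are blocking, so it only applies if you use an async client library. The decorator name `async_retry` is hypothetical, and the example retries a plain coroutine on `ConnectionError` as a stand-in for a throttled call.

```python
import asyncio
import functools
import random

def async_retry(max_retries: int = 3, base_delay: float = 0.01,
                max_delay: float = 1.0, retry_on=(ConnectionError,)):
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return await func(*args, **kwargs)
                except retry_on:
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    delay *= 0.75 + random.random() * 0.5
                    # Yields to the event loop instead of blocking the thread
                    await asyncio.sleep(delay)
        return wrapper
    return decorator

attempts = 0

@async_retry(max_retries=3)
async def flaky_call() -> str:
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("simulated transient failure")
    return "done"

result = asyncio.run(flaky_call())
print(result, attempts)   # done 3
```

While one coroutine is waiting out its backoff, other tasks on the same event loop keep running, which `time.sleep` would prevent.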

`AWSClientWithRetry.__getattr__` — Transparent Method Wrapping

def __getattr__(self, name: str):
    attr = getattr(self._client, name)
    if callable(attr):
        return aws_retry(max_retries=self._max_retries)(attr)
    return attr

Line	Explanation
`__getattr__(self, name)`	Python calls `__getattr__` only when the attribute is NOT found through normal lookup. Since `describe_instances` is not defined on `AWSClientWithRetry`, Python calls this method with `name="describe_instances"`
`getattr(self._client, name)`	Gets the actual method from the underlying boto3 client
`if callable(attr)`	`callable()` returns `True` for functions and methods, `False` for properties, strings, etc.
`return aws_retry(...)(attr)`	`aws_retry(max_retries=5)` returns `decorator`. Calling `decorator(attr)` returns the wrapped method. We return it without calling it — the caller will call it
`return attr`	For non-callable attributes (like `meta`), return them as-is — no wrapping needed
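The interception behavior can be observed with a stand-in client. `FakeClient` and `LoggingProxy` are hypothetical classes used only to show which lookups reach `__getattr__`; the pass-through `wrapped` function marks where a retry wrapper would be applied.

```python
class FakeClient:
    """Hypothetical stand-in for a boto3 client."""
    region_name = "ap-south-1"                 # non-callable attribute

    def describe_vpcs(self):
        return {"Vpcs": []}

class LoggingProxy:
    def __init__(self, client):
        self._client = client
        self.intercepted = []                  # names that went through __getattr__

    def __getattr__(self, name):
        # Only reached when normal lookup on LoggingProxy fails
        self.intercepted.append(name)
        attr = getattr(self._client, name)
        if callable(attr):
            def wrapped(*args, **kwargs):      # a retry wrapper would go here
                return attr(*args, **kwargs)
            return wrapped
        return attr

proxy = LoggingProxy(FakeClient())
print(proxy.describe_vpcs())      # {'Vpcs': []}
print(proxy.region_name)          # ap-south-1
print(proxy.intercepted)          # ['describe_vpcs', 'region_name']
```

Attributes set in `__init__`, such as `_client` and `intercepted`, are found by normal lookup and never reach `__getattr__`. One caveat with this pattern: if `_client` is ever missing (for example on an instance created without running `__init__`), the `self._client` access inside `__getattr__` recurses until Python raises `RecursionError`.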

`get_client_with_builtin_retry` — botocore’s Built-In Retry

config = Config(
    retries={"mode": "adaptive", "total_max_attempts": 10},
    connect_timeout=5,
    read_timeout=30,
)

Field	Explanation
`"mode": "adaptive"`	Standard-mode retries plus client-side rate limiting. A token bucket slows down outgoing requests after throttling responses, so the send rate adapts to the actual throttle limit
`"mode": "standard"`	Exponential backoff with up to 3 total attempts by default. Simpler, with no client-side rate limiting
`"mode": "legacy"`	Original retry handler (up to 5 total attempts by default, exponential backoff). The default if you don’t set `mode`
`"total_max_attempts": 10`	Total attempts including the initial call. So 10 = 1 initial + 9 retries. The older `max_attempts` key counts retries only, so `"max_attempts": 10` would allow up to 11 calls
`connect_timeout=5`	Give up connecting to the AWS API endpoint after 5 seconds. Catches DNS failures and network partitions quickly
`read_timeout=30`	Give up waiting for the response body after 30 seconds. Some operations (like large S3 copies) need longer
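To make the attempt count explicit, botocore's `retries` dict also accepts a `total_max_attempts` key, which counts the initial request. The helper below is a hypothetical convenience (the name `retry_settings` and its validation rules are assumptions, not part of botocore) that builds the dict and rejects obviously invalid values before a client is created.

```python
VALID_MODES = {"legacy", "standard", "adaptive"}

def retry_settings(total_attempts: int, mode: str = "standard") -> dict:
    """Build the `retries` dict for botocore.config.Config.

    `total_max_attempts` counts the initial request, so
    total_attempts=10 means 1 initial call + up to 9 retries.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown retry mode: {mode!r}")
    if total_attempts < 1:
        raise ValueError("total_attempts must be at least 1")
    return {"mode": mode, "total_max_attempts": total_attempts}

settings = retry_settings(10, mode="adaptive")
print(settings)   # {'mode': 'adaptive', 'total_max_attempts': 10}
```

The result would be passed as `Config(retries=retry_settings(10, mode="adaptive"))`. The same settings can also be supplied outside the code through the `AWS_RETRY_MODE` and `AWS_MAX_ATTEMPTS` environment variables.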
