The Situation
AWS API best practice: most list and describe APIs return results in pages, and the page size varies by service (commonly between 50 and 1,000 items). Forgetting to paginate means silently missing resources, a dangerous bug in security and compliance scripts.
Problem Statement
You write ec2.describe_instances() and it works in dev with 5 instances. In production with 1,200 instances, a single call may return only the first page (for example, the first 1,000) without raising any error. Your security audit reports “no violations”, but 200 instances were never checked. Pagination is not optional.
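The failure mode can be reproduced without touching AWS. The sketch below uses a hypothetical stand-in function, fake_describe_instances, that serves 1,200 made-up instance IDs in pages of 1,000; the names, the flat "Instances" key, and the page size are assumptions for illustration only, not the real EC2 response shape.

```python
# Hypothetical stand-in for a paginated AWS API: 1,200 fake instance IDs,
# served 1,000 per page with a NextToken, mimicking the scenario above.
ALL_INSTANCES = [f"i-{n:04d}" for n in range(1200)]
PAGE_SIZE = 1000


def fake_describe_instances(NextToken=None):
    """Return one page of results, plus a NextToken if more remain."""
    start = int(NextToken) if NextToken else 0
    end = start + PAGE_SIZE
    response = {"Instances": ALL_INSTANCES[start:end]}
    if end < len(ALL_INSTANCES):
        response["NextToken"] = str(end)
    return response


# Single call: looks fine, raises nothing, and is incomplete.
first_page = fake_describe_instances()
seen_single_call = first_page["Instances"]

# Token loop: keep calling until the response carries no NextToken.
seen_all = []
token = None
while True:
    page = fake_describe_instances(NextToken=token)
    seen_all.extend(page["Instances"])
    token = page.get("NextToken")
    if token is None:
        break

print(len(seen_single_call), len(seen_all))
```

The single call and the loop both succeed; only the loop sees every instance, which is why the bug stays hidden until the account grows past one page.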
Which APIs Are Paginated?
Almost all AWS list/describe APIs are paginated. Common ones:
| API | Default Page Size | Result Key |
|---|---|---|
| ec2:DescribeInstances | 100–1000 | Reservations |
| iam:ListUsers | 100 | Users |
| s3:ListObjectsV2 | 1000 | Contents |
| cloudtrail:LookupEvents | 50 | Events |
| rds:DescribeDBInstances | 100 | DBInstances |
Complete Utility
import boto3
from botocore.exceptions import OperationNotPageableError
from typing import Generator, Any, Optional


# ── Method 1: Built-in paginator (preferred) ──────────────────────
def paginate_all(
    client,
    method_name: str,
    result_key: str,
    **kwargs,
) -> Generator[Any, None, None]:
    """
    Generic paginator for any boto3 API that supports pagination.

    client: a boto3 service client (e.g., boto3.client("ec2"))
    method_name: the API method name as a string (e.g., "describe_instances")
    result_key: the top-level dict key containing results (e.g., "Reservations")
    **kwargs: any additional arguments to pass to the API (Filters, etc.)

    How boto3 paginators work:
    - client.get_paginator("method_name") returns a Paginator object.
    - paginator.paginate(**kwargs) returns a PageIterator.
    - Each iteration yields one page (a dict with the same structure as
      a single API call response).
    - boto3 automatically adds the continuation token (NextToken, Marker,
      etc.) to each subsequent request and stops when there are no more pages.

    We use yield from to yield items one at a time, which makes this a
    lazy generator that never loads all results into memory at once.
    This is critical when you have millions of S3 objects.
    """
    try:
        # get_paginator() raises OperationNotPageableError if the method
        # has no built-in paginator; only then do we fall back to the
        # manual token loop. Every other error (ClientError, network
        # failure) propagates, so a failure halfway through a listing can
        # never restart from page one and yield duplicates.
        paginator = client.get_paginator(method_name)
        for page in paginator.paginate(**kwargs):
            yield from page.get(result_key, [])
    except OperationNotPageableError:
        # Fallback: manual NextToken/Marker loop for non-standard pagination
        yield from _manual_paginate(client, method_name, result_key, **kwargs)


def _manual_paginate(
    client,
    method_name: str,
    result_key: str,
    **kwargs,
) -> Generator[Any, None, None]:
    """
    Manual NextToken pagination for APIs that don't have a built-in paginator.
    Some older APIs use 'Marker' instead of 'NextToken'.
    """
    method = getattr(client, method_name)
    while True:
        response = method(**kwargs)
        yield from response.get(result_key, [])
        # Try NextToken first, then Marker (IAM uses Marker)
        next_token = response.get("NextToken") or response.get("Marker")
        if not next_token:
            break
        # Set the appropriate continuation token for the next call
        if "NextToken" in response:
            kwargs["NextToken"] = next_token
        else:
            kwargs["Marker"] = next_token


# ── Method 2: Collect all results into a list (convenience) ───────
def paginate_all_list(
    client,
    method_name: str,
    result_key: str,
    **kwargs,
) -> list:
    """
    Wrapper that collects all paginated results into a list.
    Use when you need to access results multiple times or check length.
    For very large result sets, prefer the generator version.
    """
    return list(paginate_all(client, method_name, result_key, **kwargs))


# ── Method 3: Paginate with a callback ────────────────────────────
def paginate_with_callback(
    client,
    method_name: str,
    result_key: str,
    callback,
    **kwargs,
) -> int:
    """
    Process each item with a callback function as pages arrive.
    Returns total items processed.
    Useful for writing results to a file/DB without buffering everything.
    """
    count = 0
    for item in paginate_all(client, method_name, result_key, **kwargs):
        callback(item)
        count += 1
    return count


# ── Usage examples ─────────────────────────────────────────────────
if __name__ == "__main__":
    ec2 = boto3.client("ec2", region_name="ap-south-1")
    s3 = boto3.client("s3")
    iam = boto3.client("iam")
    ct = boto3.client("cloudtrail")

    # ── Example 1: List all running EC2 instances ─────────────────
    # Without pagination you'd call ec2.describe_instances() and risk
    # missing instances if there are more than the default page size.
    print("Running EC2 instances:")
    instance_count = 0
    for reservation in paginate_all(
        ec2,
        "describe_instances",
        "Reservations",
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}],
    ):
        for instance in reservation["Instances"]:
            print(f" {instance['InstanceId']}")
            instance_count += 1
    print(f"Total: {instance_count} running instances\n")

    # ── Example 2: List all IAM users ─────────────────────────────
    # iam.list_users() returns 100 users per page by default (MaxItems
    # can raise that to 1000 per call); the paginator follows the
    # Marker token for us either way.
    all_users = paginate_all_list(iam, "list_users", "Users")
    print(f"Total IAM users: {len(all_users)}\n")

    # ── Example 3: List all S3 objects in a bucket ────────────────
    # S3 can have BILLIONS of objects, so never load them all into a list.
    # Use the generator to process one at a time.
    bucket_name = "my-data-bucket"
    total_size = 0
    object_count = 0
    for obj in paginate_all(s3, "list_objects_v2", "Contents", Bucket=bucket_name):
        total_size += obj["Size"]
        object_count += 1
    print(f"Bucket {bucket_name}: {object_count:,} objects, {total_size / 1e9:.2f} GB\n")

    # ── Example 4: CloudTrail events with callback ─────────────────
    from datetime import datetime, timedelta, timezone

    login_events = []

    def collect_console_logins(event):
        if event.get("EventName") == "ConsoleLogin":
            login_events.append(event)

    # Timezone-aware "now"; datetime.utcnow() is deprecated since Python 3.12.
    now = datetime.now(timezone.utc)
    count = paginate_with_callback(
        ct,
        "lookup_events",
        "Events",
        collect_console_logins,
        StartTime=now - timedelta(days=7),
        EndTime=now,
    )
    print(f"Processed {count} CloudTrail events, {len(login_events)} console logins")

    # ── Example 5: Paginate RDS instances ─────────────────────────
    rds = boto3.client("rds")
    db_instances = paginate_all_list(rds, "describe_db_instances", "DBInstances")
    print(f"\nTotal RDS instances: {len(db_instances)}")
    for db in db_instances:
        print(f" {db['DBInstanceIdentifier']} ({db['DBInstanceStatus']})")

Why Paginators Beat Manual NextToken
# ❌ WRONG: silently misses resources beyond the first page
response = ec2.describe_instances()   # Returns ONLY the first page!
instances = response["Reservations"]  # May be incomplete

# ❌ FRAGILE: manual, verbose, and easy to forget
response = ec2.describe_instances()
all_reservations = response["Reservations"]
while "NextToken" in response:
    response = ec2.describe_instances(NextToken=response["NextToken"])
    all_reservations.extend(response["Reservations"])

# ✅ CORRECT: paginator handles everything
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate():
    for reservation in page["Reservations"]:
        process(reservation)

# ✅ BEST: use our generic utility
for reservation in paginate_all(ec2, "describe_instances", "Reservations"):
    process(reservation)

Key Commands Explained
| Command | What it does |
|---|---|
| client.get_paginator("method_name") | Returns a boto3 Paginator for the given API method |
| paginator.paginate(**kwargs) | Returns a PageIterator that yields one page dict per iteration |
| page.get(result_key, []) | Extracts the result list from each page, defaulting to [] if the key is absent |
| yield from iterable | Delegates iteration to the inner iterable (lazy generator composition) |
| response.get("NextToken") | Returns None if no more pages (loop terminates) |
| getattr(client, method_name) | Gets a method by name string, which allows dynamic method dispatch |
| PaginationConfig={"MaxItems": 500} | Limits total results across all pages |

# Limit total results (useful for sampling or testing)
paginator = client.get_paginator("list_objects_v2")
for page in paginator.paginate(
    Bucket="my-bucket",
    PaginationConfig={
        "MaxItems": 100,        # Stop after 100 total items
        "PageSize": 50,         # 50 items per API call
        "StartingToken": None,  # Or a token saved from an earlier run, to resume
    },
):
    # "Contents" is absent from the page when the bucket or prefix is empty
    process(page.get("Contents", []))

🔍 Line-by-Line Code Walkthrough
Imports
| Line | Why It’s Used |
|---|---|
| import boto3 | AWS SDK, needed for creating service clients |
| from typing import Generator, Any, Optional | Type hints. Generator[Any, None, None] declares that a function returns a generator that yields values of any type |

paginate_all(client, method_name, result_key, **kwargs)
def paginate_all(client, method_name: str, result_key: str, **kwargs) -> Generator[Any, None, None]:
| Part | Explanation |
|---|---|
| client | Any boto3 service client (e.g., boto3.client("ec2"), boto3.client("s3")) |
| method_name: str | The API method name as a string (e.g., "describe_instances", "list_objects_v2"), which lets this function work with ANY paginated API |
| result_key: str | The key in each page response that contains the list of results (e.g., "Reservations", "Contents", "Users") |
| **kwargs | Any additional arguments to pass through to the underlying API (e.g., Filters=[...], Bucket="my-bucket") |
| -> Generator[Any, None, None] | Return type hint: this is a generator function. Any = items can be any type. First None = no values are sent into the generator. Second None = no return value |

try:
    paginator = client.get_paginator(method_name)
    for page in paginator.paginate(**kwargs):
        yield from page.get(result_key, [])

| Line | Explanation |
|---|---|
| client.get_paginator(method_name) | Dynamically creates a Paginator for the named method. boto3 knows which response key carries the continuation token (NextToken, Marker, and so on) automatically |
| paginator.paginate(**kwargs) | Returns a PageIterator. Each iteration yields one full API response dict (one page) |
| yield from page.get(result_key, []) | yield from delegates iteration: it yields each item in the list one at a time to the caller. This is the key to making this a lazy generator (memory efficient). page.get(result_key, []) defaults to [] if the key is absent (some pages may have no results) |

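The laziness described above can be checked with a small experiment. This is a minimal sketch using a hypothetical page source (fake_pages) in place of a real PageIterator; it records which pages were actually requested while the caller takes only the first four items.

```python
from itertools import islice

calls = []  # records which page numbers were actually fetched


def fake_pages(total_pages=5, page_size=3):
    """Hypothetical page source: yields dicts shaped like API responses."""
    for page_no in range(total_pages):
        calls.append(page_no)
        yield {"Items": [f"item-{page_no}-{i}" for i in range(page_size)]}


def iter_items(pages, result_key):
    """Same pattern as paginate_all: flatten pages into single items."""
    for page in pages:
        yield from page.get(result_key, [])


# Take only four items; the generator stops pulling pages after that.
first_four = list(islice(iter_items(fake_pages(), "Items"), 4))
print(first_four)
print("pages fetched:", calls)
```

Only the first two of the five pages are requested, because the fourth item lives on the second page. With a real client, that translates into fewer API calls when the caller stops early.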
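except OperationNotPageableError:
    yield from _manual_paginate(client, method_name, result_key, **kwargs)

| Line | Explanation |
|---|---|
| except OperationNotPageableError | Catches botocore.exceptions.OperationNotPageableError, which get_paginator raises when the method doesn’t have a built-in paginator. Catching a bare Exception here would also swallow unrelated failures, such as a network error halfway through a listing, and then restart from the first page and yield duplicates |
| yield from _manual_paginate(...) | Falls back to the manual NextToken implementation. yield from is available in Python 3.3+ and is valid inside a try/except block |
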
_manual_paginate(client, method_name, result_key, **kwargs)
method = getattr(client, method_name)
while True:
    response = method(**kwargs)
    yield from response.get(result_key, [])
    next_token = response.get("NextToken") or response.get("Marker")
    if not next_token:
        break
    if "NextToken" in response:
        kwargs["NextToken"] = next_token
    else:
        kwargs["Marker"] = next_token

| Line | Explanation |
|---|---|
| getattr(client, method_name) | Gets a method by name string. getattr(ec2_client, "describe_instances") returns the describe_instances method object. This enables dynamic dispatch |
| while True: | Infinite loop that continues until we break when there are no more pages |
| response = method(**kwargs) | Calls the API. **kwargs passes all accumulated parameters including any pagination tokens |
| yield from response.get(result_key, []) | Yields all items from this page to the caller |
| response.get("NextToken") or response.get("Marker") | Tries NextToken first (modern APIs), then Marker (older APIs like IAM use Marker). The or ensures we get whichever is present |
| if not next_token: break | None (key absent) or "" (empty string) both evaluate to falsy, which stops the loop |
| kwargs["NextToken"] = next_token | Injects the continuation token into kwargs so the next method(**kwargs) call fetches the next page |

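The manual loop above recognises only NextToken and Marker. Some APIs use a different pair of names; S3 ListObjectsV2, for example, takes ContinuationToken in the request and returns NextContinuationToken in the response. One possible variation, sketched here under that assumption, is to pass the token names in explicitly. paginate_with_token and fake_list_objects_v2 are hypothetical names introduced for this example, and the fake stands in for a real client method.

```python
def paginate_with_token(method, result_key, request_token_key,
                        response_token_key, **kwargs):
    """Token loop with explicit request/response token names (a sketch)."""
    params = dict(kwargs)  # copy so the caller's arguments are untouched
    while True:
        response = method(**params)
        yield from response.get(result_key, [])
        token = response.get(response_token_key)
        if not token:
            break
        params[request_token_key] = token


# Hypothetical S3-like fake: 7 keys served 3 per page.
KEYS = [f"logs/{n}.txt" for n in range(7)]


def fake_list_objects_v2(Bucket, ContinuationToken=None, MaxKeys=3):
    start = int(ContinuationToken) if ContinuationToken else 0
    end = start + MaxKeys
    truncated = end < len(KEYS)
    response = {
        "Contents": [{"Key": k} for k in KEYS[start:end]],
        "IsTruncated": truncated,
    }
    if truncated:
        response["NextContinuationToken"] = str(end)
    return response


keys = [
    obj["Key"]
    for obj in paginate_with_token(
        fake_list_objects_v2,
        "Contents",
        "ContinuationToken",
        "NextContinuationToken",
        Bucket="example-bucket",
    )
]
print(keys)
```

In practice the built-in paginator already knows these names for list_objects_v2, so a helper like this only matters for operations that have no paginator at all.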
paginate_all_list(...)
def paginate_all_list(client, method_name, result_key, **kwargs) -> list:
    return list(paginate_all(client, method_name, result_key, **kwargs))

| Line | Explanation |
|---|---|
| list(paginate_all(...)) | Consumes the entire generator and stores all results in a list. Use when you need random access (results[5]), length check (len(results)), or need to iterate multiple times |
| When to prefer generator vs list? | Generator = memory efficient, process as data arrives. List = needed when you must check len(), sort, or iterate multiple times |

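The practical difference shows up as soon as results are read twice. A minimal sketch, with a toy generator (numbers) standing in for paginate_all:

```python
def numbers():
    """Toy generator standing in for paginate_all(...)."""
    yield from range(3)


gen = numbers()
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted

as_list = list(numbers())  # what paginate_all_list does
length = len(as_list)      # works on a list

try:
    len(numbers())         # generators have no length
    has_len = True
except TypeError:
    has_len = False

print(first_pass, second_pass, length, has_len)
```

A second loop over a generator silently yields nothing, which in an audit script looks exactly like "no resources found". Collect into a list whenever the results are needed more than once.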
paginate_with_callback(client, method_name, result_key, callback, **kwargs)
count = 0
for item in paginate_all(client, method_name, result_key, **kwargs):
    callback(item)
    count += 1
return count

| Line | Explanation |
|---|---|
| callback(item) | Calls the user-provided function with each item. The callback can write to a database, file, or process data without buffering everything |
| count += 1 | Tracks total items processed. Returned for logging or reporting |
| Use case | Streaming processing, e.g., processing 1 million S3 objects without loading all their metadata into RAM first |

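As an illustration of the streaming use case, the sketch below writes each record to CSV as it arrives. for_each is a hypothetical stand-in that mirrors the loop inside paginate_with_callback, and fake_objects is made-up data shaped like S3 object records; an in-memory buffer replaces the file a real script would open.

```python
import csv
import io


def for_each(items, callback):
    """Stand-in for paginate_with_callback: apply callback, return count."""
    count = 0
    for item in items:
        callback(item)
        count += 1
    return count


def fake_objects():
    """Hypothetical stream of S3-style object records."""
    for n in range(4):
        yield {"Key": f"data/{n}.bin", "Size": 100 * (n + 1)}


buffer = io.StringIO()  # a real script might use open("objects.csv", "w")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["Key", "Size"])

# The callback writes one row per item; nothing is accumulated in a list.
processed = for_each(
    fake_objects(),
    lambda obj: writer.writerow([obj["Key"], obj["Size"]]),
)
rows = buffer.getvalue().splitlines()
print(processed, rows[:2])
```

Memory use stays flat regardless of how many records flow through, since each one is written and then dropped.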
Usage Example — Why paginate_all Instead of Direct Call
# ❌ WRONG: silently misses resources beyond the first page
response = ec2.describe_instances()
instances = response["Reservations"]  # May be incomplete!

# ✅ CORRECT: never misses a result
for reservation in paginate_all(ec2, "describe_instances", "Reservations"):
    process(reservation)

| Point | Explanation |
|---|---|
| The silent failure danger | describe_instances() without pagination may return the first page only (for example, up to 1000 instances). In a small test account it looks correct. In production it silently drops instances |
| No error is raised | AWS doesn’t error when there are more results; it just omits them from the response. The response includes "NextToken", but if you don’t check for it, you never know more data exists |
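Where a single call has to stay (for instance in legacy code), one defensive option is to fail loudly when the response hints at more pages. The helper below is a sketch, not part of boto3: assert_complete is a hypothetical name, and the list of continuation keys is illustrative rather than exhaustive, so it should be checked against the specific API's response shape before being relied on.

```python
# Keys that commonly signal "more results exist" in AWS responses.
# Illustrative only; individual APIs may use other names.
CONTINUATION_KEYS = ("NextToken", "NextMarker", "Marker", "NextContinuationToken")


def assert_complete(response):
    """Raise if a single-call response indicates that more pages exist."""
    if response.get("IsTruncated"):
        raise RuntimeError("response is truncated; paginate instead")
    for key in CONTINUATION_KEYS:
        if response.get(key):
            raise RuntimeError(f"response contains {key}; paginate instead")
    return response


# A complete response passes through unchanged.
ok = assert_complete({"Reservations": []})

# A response carrying a continuation token is rejected.
try:
    assert_complete({"Reservations": [], "NextToken": "abc"})
    rejected = False
except RuntimeError:
    rejected = True

print(ok, rejected)
```

This turns the silent omission into an explicit error, though using a paginator from the start remains the simpler fix.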