Python for DevOps Automation

Automating Cloud with SDKs

Every serious DevOps automation task eventually talks to a cloud API. You could craft raw HTTP requests with the requests library (covered in Lesson 4), but the cloud providers ship official SDKs that handle authentication, request signing, automatic retries, regional endpoints, pagination, and resource waiters out of the box. For AWS, that SDK is boto3. Understanding its internal model — sessions, clients, resources, paginators, and waiters — is not optional if you plan to operate AWS infrastructure at scale. This lesson teaches that model the way a senior SRE would explain it to a new hire on their first week.

The boto3 Object Hierarchy

boto3 exposes three distinct abstraction layers. Understanding which to use — and when — is the first lesson most tutorials skip.

Session: A configuration container. It holds the credentials, region, and profile that all subsequent API calls inherit. Every boto3 interaction flows through a session, whether you created one explicitly or not.
Client: A thin, low-level wrapper that maps 1:1 to AWS service API operations. Every API action in the AWS documentation corresponds to exactly one client method. Responses are plain Python dicts. This is the layer you use when you need precise control, when the Resource abstraction does not cover a service, or when you need to pass through raw request parameters.
Resource: A higher-level, object-oriented wrapper over clients. An S3 Bucket object has methods like .upload_file() and .objects.all() instead of raw API calls. Resources are convenient but exist for only a handful of services (including S3, EC2, DynamoDB, IAM, SQS, and SNS), and AWS no longer adds new features to the resource interface. For anything else, use a client directly.

Key idea: Ops tooling that has to run at scale tends to be built on clients rather than resources. Resources hide detail that you sometimes need (for example, the raw response metadata used for debugging). Build on clients; use resources only when their convenience genuinely simplifies your code.
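To make the "raw response metadata" point concrete, here is a minimal sketch of pulling debugging fields out of a client response. The sample_response dict is a hand-written stand-in shaped like a typical client response (real values will differ), and request_debug_info is a hypothetical helper name, not part of boto3.

```python
def request_debug_info(response: dict) -> dict:
    """Extract the fields that help when tracing a request or a retry."""
    meta = response.get("ResponseMetadata", {})
    return {
        "request_id": meta.get("RequestId"),
        "http_status": meta.get("HTTPStatusCode"),
        "retry_attempts": meta.get("RetryAttempts", 0),
    }


# Hand-written stand-in for what a call such as ec2.describe_instances() returns.
sample_response = {
    "Reservations": [],
    "ResponseMetadata": {
        "RequestId": "example-request-id",
        "HTTPStatusCode": 200,
        "HTTPHeaders": {},
        "RetryAttempts": 1,
    },
}

info = request_debug_info(sample_response)
print(info)
```

Because client responses are plain dicts, this kind of helper works for any service without knowing anything about the operation that produced the response.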

Sessions: The Right Way to Handle Credentials

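The default boto3 credential chain checks, in simplified order of precedence, environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN), the shared files ~/.aws/credentials and ~/.aws/config, container credentials (ECS task roles), and finally the EC2 instance metadata service (instance profiles). On Lambda, the execution role's credentials arrive as environment variables. For most scripts on properly configured infrastructure this chain works automatically, so you never hardcode anything.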

Create an explicit Session when your script needs to work with multiple accounts (cross-account automation), multiple regions, or when you want to assume a role via STS. Passing a session object through your code rather than calling the module-level boto3.client() makes credentials visible, testable, and mockable.

import boto3
from botocore.config import Config

# --- Explicit session: recommended for all non-trivial scripts ---

# Profile-based session (reads from ~/.aws/credentials [prod] section)
session = boto3.Session(profile_name="prod", region_name="us-east-1")

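# OR: role assumption via STS (cross-account automation pattern).
# The STS client comes from the base session so its credentials are explicit too.
# The returned credentials are temporary: they expire after DurationSeconds
# and the session built from them below will not refresh them.
sts = session.client("sts")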
assumed = sts.assume_role(
    RoleArn="arn:aws:iam::123456789012:role/DeployAutomation",
    RoleSessionName="ops-script",
    DurationSeconds=3600,
)
creds = assumed["Credentials"]

session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
    region_name="us-east-1",
)

# Create clients from the session, not from the boto3 module directly
# botocore.config.Config controls retry behaviour and timeouts
config = Config(
    retries={"max_attempts": 5, "mode": "adaptive"},
    connect_timeout=5,
    read_timeout=30,
)

ec2 = session.client("ec2", config=config)
s3  = session.client("s3",  config=config)
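The claim that an injected session is "testable and mockable" can be sketched with the standard library's unittest.mock. The helper name current_account and the canned account ID are illustrative assumptions; get_caller_identity is the real STS operation being simulated.

```python
from unittest.mock import MagicMock


def current_account(session) -> str:
    """Return the AWS account ID behind whatever session the caller passes in."""
    return session.client("sts").get_caller_identity()["Account"]


# In a test, a MagicMock stands in for boto3.Session: no credentials, no network.
fake_session = MagicMock()
fake_session.client.return_value.get_caller_identity.return_value = {
    "Account": "123456789012",  # canned value, for illustration only
}

account = current_account(fake_session)
print(account)  # 123456789012
```

In production the same function receives a real boto3.Session, which makes it a cheap guard against running a cross-account script in the wrong account.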

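Pro practice: Set the retry mode explicitly rather than relying on the default, which is the older "legacy" mode. Both "standard" and "adaptive" retry with exponential backoff and jitter; adaptive mode additionally rate-limits the client itself when AWS returns throttling errors such as ThrottlingException or RequestLimitExceeded. That client-side throttling is what you want during a mass deployment or incident response script, when hammering a throttled API only makes things worse. Note that the boto3 documentation still describes adaptive mode as experimental, so test it under load before depending on it.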
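For intuition about what "exponential backoff with jitter" means, here is a sketch of the general technique. It is not botocore's implementation; the function name, the base of 2, and the 20-second cap are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def backoff_delays(max_retries: int, base: float = 2.0, cap: float = 20.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return one sleep duration per retry, drawn below a growing, capped ceiling."""
    rng = rng or random.Random()
    delays = []
    for retry in range(max_retries):
        ceiling = min(cap, base ** retry)  # 1s, 2s, 4s, 8s, ... never above cap
        # Jitter: pick a random point below the ceiling so that many clients
        # retrying at once do not all hit the API at the same instant.
        delays.append(rng.uniform(0, ceiling))
    return delays


delays = backoff_delays(max_retries=5, rng=random.Random(42))
for number, delay in enumerate(delays, start=1):
    print(f"retry {number}: would sleep {delay:.2f}s")
```

The ceiling doubles on every retry while the actual sleep is randomized, which is what spreads load out when a whole fleet is throttled simultaneously.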

Paginators: Never Assume a Single Response Page

One of the most common bugs in ops scripts is calling a list API and silently missing results. AWS list APIs are paginated: a single call returns a bounded number of items (often 100 to 1,000) plus a continuation token (NextToken, Marker, NextContinuationToken; the name varies by service). If you do not follow the token, you see only the first page. On a small account this works by accident. On a bucket with 10,000 objects, a single list_objects_v2 call returns at most 1,000 keys and the script silently misses the other 9,000.

boto3 paginators eliminate this problem entirely. A paginator is a boto3 object that knows which token field to follow for a given API and automatically issues successive requests until the results are exhausted. The calling code sees a single iterator.
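To see what a paginator saves you from writing, here is a sketch of the token-following loop done by hand. iter_pages and fake_list_objects_v2 are hypothetical names; the fake stands in for s3.list_objects_v2 so the sketch runs offline, and it uses the ContinuationToken / NextContinuationToken field names that operation uses.

```python
from typing import Callable, Iterator, Optional


def iter_pages(call: Callable[..., dict], token_param: str, token_field: str,
               **params) -> Iterator[dict]:
    """Yield every page of a list API by following its continuation token."""
    token: Optional[str] = None
    while True:
        request = dict(params)
        if token is not None:
            request[token_param] = token
        page = call(**request)
        yield page
        token = page.get(token_field)
        if not token:  # no token in the response means this was the last page
            return


# Offline stand-in for s3.list_objects_v2: 5 keys, served 2 per page.
KEYS = [f"logs/2025/file-{i}.gz" for i in range(5)]


def fake_list_objects_v2(Bucket: str, ContinuationToken: str = "0") -> dict:
    start = int(ContinuationToken)
    page = {"Contents": [{"Key": key} for key in KEYS[start:start + 2]]}
    if start + 2 < len(KEYS):
        page["NextContinuationToken"] = str(start + 2)
    return page


keys = [
    obj["Key"]
    for page in iter_pages(fake_list_objects_v2, "ContinuationToken",
                           "NextContinuationToken", Bucket="my-data-bucket")
    for obj in page["Contents"]
]
print(len(keys))  # 5
```

The caller has to know the token names for each operation, which is exactly the per-service knowledge a boto3 paginator carries for you.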

import boto3

session = boto3.Session(region_name="us-east-1")
ec2 = session.client("ec2")

# --- WRONG: may silently miss instances if there are >1000 ---
# response = ec2.describe_instances()
# reservations = response["Reservations"]

# --- RIGHT: use a paginator ---
paginator = ec2.get_paginator("describe_instances")

# page_iterator yields one response dict per page
page_iterator = paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)

instances = []
for page in page_iterator:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instances.append({
                "id":   instance["InstanceId"],
                "type": instance["InstanceType"],
                "az":   instance["Placement"]["AvailabilityZone"],
            })

print(f"Found {len(instances)} running instances")

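# --- Paginator with result filtering ---
# Prefix narrows the listing server-side; .search() below filters client-side,
# so it trims the results but not the number of API calls.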
s3 = session.client("s3")
s3_paginator = s3.get_paginator("list_objects_v2")

# PaginationConfig: PageSize sets items per request; MaxItems would cap the total
pages = s3_paginator.paginate(
    Bucket="my-data-bucket",
    Prefix="logs/2025/",
    PaginationConfig={"PageSize": 1000},
)

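# .search() applies a JMESPath expression to each page and yields the matches.
# A page without "Contents" (nothing under the prefix) yields None, so drop those.
large_objects = [
    obj for obj in pages.search("Contents[?Size > `10485760`]") if obj is not None
]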
print(f"Objects >10 MB: {len(large_objects)}")

Production pitfall: Never call list_objects_v2, describe_instances, list_users, or any other list API without a paginator in production code. Default page sizes are small: IAM list_users returns 100 users per call unless you raise MaxItems, and S3 list_objects_v2 returns at most 1,000 keys. An account with 101 IAM users will silently report only 100 in a compliance script, and nobody will notice until an audit fails. Make paginators the default, not the exception.
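Where a single unpaginated call is unavoidable (a quick one-off, say), a guard can at least turn silent truncation into a loud failure. This is a sketch: ensure_complete is a hypothetical helper, and the marker list below is an assumption covering common field names, not an exhaustive catalogue of every service.

```python
# Response fields that commonly signal "there is another page" (not exhaustive).
CONTINUATION_FIELDS = ("NextToken", "NextMarker", "NextContinuationToken")


def ensure_complete(response: dict) -> dict:
    """Return the response unchanged, or raise if it is only the first page."""
    truncated = bool(response.get("IsTruncated"))
    has_token = any(response.get(field) for field in CONTINUATION_FIELDS)
    if truncated or has_token:
        raise RuntimeError("response is truncated; use a paginator for this call")
    return response


# Hand-written responses shaped like IAM list_users output, for illustration.
complete = ensure_complete({"Users": [{"UserName": "alice"}], "IsTruncated": False})
print(len(complete["Users"]))  # 1

try:
    ensure_complete({"Users": [], "IsTruncated": True, "Marker": "opaque-token"})
except RuntimeError as exc:
    print(f"refused: {exc}")
```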

Waiters: Blocking Until AWS Is Ready

Cloud resources take time to reach their desired state. An EC2 instance moves through pending → running. An RDS snapshot transitions from creating → available. A CloudFormation stack goes from CREATE_IN_PROGRESS → CREATE_COMPLETE. Your automation script must wait for these transitions before taking the next action — but polling manually with time.sleep() loops is fragile, burns API quota, and produces ugly code.

boto3 waiters encapsulate the correct polling logic for each resource state transition: the right API to call, the right field to inspect, the right poll interval (typically 15 seconds), and the right maximum number of attempts before giving up. You get a clean blocking call that either returns when the resource is ready or raises WaiterError when the attempts run out or the resource enters a failure state.
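The logic a waiter wraps is roughly the loop below. This is a simplified sketch, not botocore's code: wait_for_state and WaitTimeout are hypothetical names, and the sleep function is injectable only so the demo runs instantly.

```python
import time
from typing import Callable, Set


class WaitTimeout(Exception):
    """Raised when the target state is never reached (hypothetical name)."""


def wait_for_state(get_state: Callable[[], str], target: str,
                   failure_states: Set[str], delay: float = 15,
                   max_attempts: int = 40,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Poll get_state() until it returns target; return the number of polls used."""
    for attempt in range(1, max_attempts + 1):
        state = get_state()
        if state == target:
            return attempt
        if state in failure_states:
            # No point waiting longer: the resource can never reach the target.
            raise RuntimeError(f"resource entered terminal state {state!r}")
        if attempt < max_attempts:
            sleep(delay)
    raise WaitTimeout(f"still not {target!r} after {max_attempts} polls")


# Offline demo: a scripted sequence of states and a no-op sleep.
states = iter(["pending", "pending", "running"])
polls = wait_for_state(lambda: next(states), "running", {"terminated"},
                       sleep=lambda seconds: None)
print(polls)  # 3
```

A real waiter adds the per-resource knowledge on top: which describe call to make, which response field holds the state, and which states count as failures.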

import boto3
from botocore.exceptions import WaiterError

session = boto3.Session(region_name="us-east-1")
ec2 = session.client("ec2")

# --- Launch an instance and wait for it to be running ---
response = ec2.run_instances(
    ImageId="ami-0c02fb55956c7d316",   # example AMI ID; IDs are region-specific, so look up a current one
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "ops-worker"}],
    }],
)
instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}, waiting for running state...")

# ec2.get_waiter("instance_running") polls describe_instances every 15s
# Default: up to 40 attempts = 10 minutes max wait
waiter = ec2.get_waiter("instance_running")
try:
    waiter.wait(
        InstanceIds=[instance_id],
        WaiterConfig={"Delay": 15, "MaxAttempts": 40},
    )
    print(f"{instance_id} is now running")
except WaiterError as exc:
    print(f"Waiter timed out or failed: {exc}")
    raise SystemExit(1)

# --- Custom waiter config for slow-transitioning resources ---
# RDS snapshot: check every 30s, give up after 60 minutes
rds = session.client("rds")
rds_waiter = rds.get_waiter("db_snapshot_available")
rds_waiter.wait(
    DBSnapshotIdentifier="my-db-snap-2025-06-11",
    WaiterConfig={"Delay": 30, "MaxAttempts": 120},
)
print("RDS snapshot is available")
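The Delay and MaxAttempts values above come from simple arithmetic: total wait is roughly delay times attempts. A small helper can derive them from a time budget instead of hardcoding both. waiter_config is a hypothetical name; the returned dict uses the same WaiterConfig keys as the examples above.

```python
import math


def waiter_config(timeout_seconds: int, delay_seconds: int) -> dict:
    """Build a WaiterConfig dict whose total wait covers timeout_seconds."""
    if timeout_seconds <= 0 or delay_seconds <= 0:
        raise ValueError("timeout and delay must be positive")
    return {
        "Delay": delay_seconds,
        # Round up so the waiter never gives up before the budget is spent.
        "MaxAttempts": math.ceil(timeout_seconds / delay_seconds),
    }


print(waiter_config(timeout_seconds=600, delay_seconds=15))   # {'Delay': 15, 'MaxAttempts': 40}
print(waiter_config(timeout_seconds=3600, delay_seconds=30))  # {'Delay': 30, 'MaxAttempts': 120}
```

The result can be passed straight through, for example waiter.wait(InstanceIds=[instance_id], WaiterConfig=waiter_config(600, 15)).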

Figure: boto3 object hierarchy. A Session configures credentials; Clients (low-level) and Resources (OO) issue API calls; Paginators and Waiters layer on top of Clients for pagination and state polling.

Putting It Together: A Production-Grade Script

Here is a realistic ops script that decommissions EC2 instances that have been stopped for more than 30 days in a given region. It combines the main ideas from this lesson: an explicit session with retry config, a paginator to enumerate all instances, a server-side filter to match stopped ones, structured error handling, and a --dry-run flag that gates the destructive call.

"""
decommission_stopped.py — terminate EC2 instances stopped for >30 days.

Usage:
    python decommission_stopped.py --region us-east-1 --dry-run
    python decommission_stopped.py --region us-east-1
"""
import argparse
import logging
from datetime import datetime, timezone, timedelta

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger(__name__)

CUTOFF_DAYS = 30


def get_session(region: str) -> boto3.Session:
    return boto3.Session(region_name=region)


def build_client(session: boto3.Session, service: str):
    cfg = Config(
        retries={"max_attempts": 5, "mode": "adaptive"},
        connect_timeout=5,
        read_timeout=30,
    )
    return session.client(service, config=cfg)


def stopped_instances_older_than(ec2, days: int) -> list[dict]:
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["stopped"]}]
    )
    old = []
    for page in pages:
        for r in page["Reservations"]:
            for inst in r["Instances"]:
                state_reason = inst.get("StateTransitionReason", "")
                # State reason includes the stop timestamp for user-initiated stops
                # e.g. "User initiated (2025-04-10 14:23:00 GMT)"
                # LaunchTime says nothing about how long the instance has been
                # stopped, so skip instances whose stop time cannot be parsed.
                try:
                    ts_str = state_reason.split("(")[1].rstrip(")")
                    ts = datetime.strptime(ts_str, "%Y-%m-%d %H:%M:%S GMT").replace(
                        tzinfo=timezone.utc
                    )
                except (IndexError, ValueError):
                    log.warning("Skipping %s: no stop time in %r",
                                inst["InstanceId"], state_reason)
                    continue

                if ts < cutoff:
                    old.append({"id": inst["InstanceId"], "stopped_at": ts})
    return old


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--region", required=True)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    session = get_session(args.region)
    ec2 = build_client(session, "ec2")

    targets = stopped_instances_older_than(ec2, CUTOFF_DAYS)
    if not targets:
        log.info("No instances qualify for decommission")
        return

    ids = [t["id"] for t in targets]
    log.info("Candidate instances: %s", ids)

    if args.dry_run:
        log.info("[DRY RUN] Would terminate %d instance(s): %s", len(ids), ids)
        return

    try:
        ec2.terminate_instances(InstanceIds=ids)
        log.info("Termination issued for %s", ids)
    except ClientError as exc:
        log.error("terminate_instances failed: %s", exc)
        raise SystemExit(1)


if __name__ == "__main__":
    main()
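The timestamp parsing inside stopped_instances_older_than is the part of this script most worth unit testing, because it decides what gets terminated. One way to isolate it is a standalone helper, sketched here on the assumption that the reason string keeps the format shown in the script's comment. parse_stop_time is a hypothetical name.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches the "(YYYY-MM-DD HH:MM:SS GMT)" suffix seen in the script's comment.
_STOP_TS = re.compile(r"\((\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) GMT\)")


def parse_stop_time(state_reason: str) -> Optional[datetime]:
    """Return the UTC stop time embedded in the reason string, or None if absent."""
    match = _STOP_TS.search(state_reason or "")
    if match is None:
        return None
    parsed = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


print(parse_stop_time("User initiated (2025-04-10 14:23:00 GMT)"))  # 2025-04-10 14:23:00+00:00
print(parse_stop_time(""))                                           # None
```

Returning None instead of guessing lets the caller decide what to do with instances whose stop time is unknown, which matters in a script whose next step is termination.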

Production pitfall: Always add a --dry-run flag to any destructive automation. But note that "dry run" means two different things here. Many EC2 APIs accept a DryRun=True parameter that AWS evaluates server-side purely as a permission check; boto3 raises ClientError either way, with error code DryRunOperation if the call would have succeeded and UnauthorizedOperation if IAM denies it. Your own --dry-run CLI flag, as shown above, skips the destructive API call entirely, which is safer. Use both layers in combination: your flag for end-to-end testing, and AWS DryRun=True for IAM permission pre-checks.
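The server-side permission check can be sketched as a small wrapper around any EC2 client method that accepts DryRun. is_authorized, FakeClientError, and fake_operation are hypothetical names. To keep the sketch runnable without botocore installed, it catches Exception and inspects the .response attribute that botocore's ClientError carries; in real code, catching botocore.exceptions.ClientError directly is the tidier choice.

```python
def is_authorized(operation, **params) -> bool:
    """Return True if a DryRun of the operation reports the caller is permitted."""
    try:
        operation(DryRun=True, **params)
    except Exception as exc:
        # ClientError exposes the parsed error as exc.response; anything without
        # that shape (network failures, bugs) is re-raised untouched.
        response = getattr(exc, "response", None)
        code = response.get("Error", {}).get("Code") if isinstance(response, dict) else None
        if code == "DryRunOperation":
            return True
        if code == "UnauthorizedOperation":
            return False
        raise
    # A DryRun request is expected to raise either way, so this is a surprise.
    raise RuntimeError("DryRun request returned without raising")


class FakeClientError(Exception):
    """Offline stand-in that mimics the shape of botocore's ClientError."""

    def __init__(self, code: str):
        super().__init__(code)
        self.response = {"Error": {"Code": code, "Message": "illustrative only"}}


def fake_operation(code: str):
    def operation(**kwargs):
        raise FakeClientError(code)
    return operation


allowed = is_authorized(fake_operation("DryRunOperation"),
                        InstanceIds=["i-0123456789abcdef0"])
denied = is_authorized(fake_operation("UnauthorizedOperation"),
                       InstanceIds=["i-0123456789abcdef0"])
print(allowed, denied)  # True False
```

With a real client the call would look like is_authorized(ec2.terminate_instances, InstanceIds=ids), run before the destructive call so a missing permission fails fast.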

Error Handling for SDK Calls

boto3 surfaces two categories of exceptions. botocore.exceptions.ClientError wraps errors returned by the AWS service (HTTP 4xx/5xx): missing permissions, a resource that does not exist, a malformed parameter. botocore.exceptions.BotoCoreError is the base class for failures raised on the client side, such as timeouts, SSL errors, unreachable endpoints, and missing credentials. Neither inherits from the other, so in production scripts catch both, and log the error code, not just the message.

import logging

from botocore.exceptions import ClientError, BotoCoreError

log = logging.getLogger(__name__)

def safe_describe_instance(ec2, instance_id: str) -> dict | None:
    try:
        resp = ec2.describe_instances(InstanceIds=[instance_id])
        return resp["Reservations"][0]["Instances"][0]
    except ClientError as exc:
        code = exc.response["Error"]["Code"]
        if code == "InvalidInstanceID.NotFound":
            log.warning("Instance %s not found", instance_id)
            return None
        # UnauthorizedOperation, InvalidParameterValue, etc. — reraise
        log.error("ClientError [%s] for %s: %s", code, instance_id, exc)
        raise
    except BotoCoreError as exc:
        # Network-level failure — the retry config already handled retries
        log.error("Connection error: %s", exc)
        raise

Key takeaway: The pattern you have built in this lesson (explicit session with retry config, a paginator for every list call, a waiter for every state transition, structured error handling) is the same pattern you will find in mature boto3-based tooling such as cloud-custodian. The ideas carry over to other cloud SDKs: the google-cloud-* libraries for GCP and the Azure SDK for Python both offer client objects, paged iterators, and pollers for long-running operations. Apply the pattern consistently and you will write automation that holds up in production at scale.