Serverless & Event-Driven Operations

Step Functions & Orchestration

Every distributed serverless system eventually confronts the same problem: a chain of Lambda functions calling each other directly through the Invoke API creates invisible, unobservable coupling. You lose execution history, retries are manual, error paths live in application code, and a single timeout cascades silently. AWS Step Functions solves this by making workflow state explicit, durable, and observable, lifting coordination logic out of your functions.

Orchestration vs. Choreography

These two patterns reflect fundamentally different answers to the question "who decides what happens next?"

Choreography — each service reacts to events it cares about. Order Service emits order.placed, Inventory Service listens and emits inventory.reserved, Billing listens to that and emits payment.charged. No central brain. Extremely decoupled, scales horizontally, but the overall workflow is implicit — spread across multiple consumers. Debugging a failure means tracing correlation IDs across five CloudWatch log groups and two EventBridge buses.
Orchestration — a central coordinator (the state machine) calls each participant in order, passes outputs as inputs, handles failures, and maintains durable state. The workflow is explicit, visible, and versionable. Coupling is looser than Lambda-to-Lambda direct invocation but tighter than pure choreography.

Neither pattern is universally superior. Pure choreography works well at very large scale when teams own bounded contexts and eventual consistency is acceptable. For multi-step business transactions with compensation logic (saga pattern), Step Functions orchestration wins on observability and correctness guarantees.

Choreography (left) routes events through a bus with no central coordinator; orchestration (right) makes the workflow explicit in a Step Functions state machine with built-in compensation paths.
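The compensation idea behind the saga pattern can be sketched in a few lines of plain Python. This is a toy, in-memory orchestrator rather than anything Step Functions runs internally; the names `run_saga`, `reserve`, `release`, `charge`, and `refund` are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Each step pairs an action with the compensation that undoes it.
Step = Tuple[str, Callable[[Dict], None], Callable[[Dict], None]]


def run_saga(steps: List[Step], state: Dict) -> List[str]:
    """Run steps in order; on failure, undo completed steps in reverse order."""
    log: List[str] = []
    completed: List[Step] = []
    for step in steps:
        name, action, _undo = step
        try:
            action(state)
        except Exception as exc:
            log.append(f"{name}:failed:{exc}")
            # Compensate only the steps that actually finished, newest first.
            for done_name, _action, undo in reversed(completed):
                undo(state)
                log.append(f"{done_name}:compensated")
            return log
        completed.append(step)
        log.append(f"{name}:ok")
    return log


# Hypothetical participants: reservation succeeds, payment is declined.
def reserve(state: Dict) -> None:
    state["reserved"] = True


def release(state: Dict) -> None:
    state["reserved"] = False


def charge(state: Dict) -> None:
    raise RuntimeError("PaymentDeclined")


def refund(state: Dict) -> None:
    state["charged"] = False
```

Calling `run_saga([("reserve", reserve, release), ("charge", charge, refund)], {"orderId": "o-1"})` reserves stock, fails on payment, and then releases the reservation. A state machine expresses the same control flow declaratively through Catch and Next, and adds durable state and execution history.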

Amazon States Language — Production Patterns

Step Functions state machines are defined in Amazon States Language (ASL), a JSON-based language; tooling such as AWS SAM also accepts YAML definitions. The core state types are: Task (invoke work), Choice (branch), Parallel (fan-out), Map (iterate with concurrency), Wait (sleep or wait until a timestamp), and Pass (transform data), plus the terminal Succeed and Fail states. Standard Workflows give you exactly-once workflow execution, run for up to one year, and keep a full execution history in the console; use these for business-critical transactions. Express Workflows run for up to five minutes and are billed at roughly $1 per million requests plus a duration charge, which suits high-throughput streaming pipelines where at-least-once execution (the model for asynchronous Express Workflows) is acceptable.

The waitForTaskToken integration pattern is indispensable in production: your Lambda receives a task token, performs async work (calls an external API, waits for a human approval), then calls SendTaskSuccess or SendTaskFailure with the token when complete. The state machine parks itself, incurring no poll cost, and resumes exactly where it left off — even days later.
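The callback side of waitForTaskToken might look like the sketch below. It assumes a boto3 Step Functions client (`boto3.client("stepfunctions")`) is passed in as `sfn_client`; `complete_task` and the `InsufficientStock` exception class are hypothetical names, chosen so that the reported error name can be matched by a Catch clause.

```python
import json
from typing import Any, Callable


class InsufficientStock(Exception):
    """Hypothetical domain error; its class name is what ErrorEquals matches."""


def complete_task(sfn_client: Any, task_token: str, do_work: Callable[[], dict]) -> str:
    """Run do_work() and report the outcome to Step Functions via the task token."""
    try:
        result = do_work()
    except Exception as exc:
        # The 'error' string is the error name the state machine can catch.
        sfn_client.send_task_failure(
            taskToken=task_token,
            error=type(exc).__name__,
            cause=str(exc),
        )
        return "failed"
    # 'output' must be a JSON string; it becomes the result of the waiting Task state.
    sfn_client.send_task_success(taskToken=task_token, output=json.dumps(result))
    return "succeeded"
```

Whichever process finishes the asynchronous work calls `complete_task` with the token that arrived in the Lambda event (for example `event["taskToken"]` when the Payload maps it under that key). The same token can also be used with `send_task_heartbeat` when the state sets HeartbeatSeconds.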

# order-pipeline.asl.yaml — Standard Workflow: e-commerce order saga with compensation
Comment: "Order fulfillment with saga compensation on failure"
StartAt: ValidateOrder
States:

  ValidateOrder:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:ValidateOrder
    ResultPath: $.validation
    Retry:
      - ErrorEquals: [Lambda.ServiceException, Lambda.AWSLambdaException]
        IntervalSeconds: 2
        MaxAttempts: 3
        BackoffRate: 2
    Catch:
      - ErrorEquals: [ValidationError]
        ResultPath: $.error
        Next: FailOrder
    Next: ReserveInventory

  ReserveInventory:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke.waitForTaskToken
    Parameters:
      FunctionName: arn:aws:lambda:us-east-1:123456789012:function:ReserveInventory
      Payload:
        orderId.$: $.orderId
        taskToken.$: $$.Task.Token
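    # Attach the callback output beside the input instead of replacing it,
    # so later states still see orderId and customerEmail
    ResultPath: $.reservation
    HeartbeatSeconds: 30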
    Catch:
      - ErrorEquals: [States.HeartbeatTimeout, InsufficientStock]
        Next: FailOrder
    Next: ChargePayment

  ChargePayment:
    Type: Task
    Resource: arn:aws:states:::lambda:invoke
    Parameters:
      FunctionName: arn:aws:lambda:us-east-1:123456789012:function:ChargePayment
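      Payload.$: $
    # Keep the original input and store the Lambda response under $.payment
    ResultPath: $.payment
    Catch:
      - ErrorEquals: [PaymentDeclined, States.TaskFailed]
        # Preserve the input so ReleaseReservation still receives orderId
        ResultPath: $.error
        Next: ReleaseReservation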
    Next: FanOutNotifications

  ReleaseReservation:
    Type: Task
    Resource: arn:aws:lambda:us-east-1:123456789012:function:ReleaseReservation
    Next: FailOrder

  FanOutNotifications:
    Type: Parallel
    Branches:
      - StartAt: SendEmail
        States:
          SendEmail:
            Type: Task
            Resource: arn:aws:states:::aws-sdk:ses:sendEmail
            Parameters:
              Destination:
                ToAddresses.$: States.Array($.customerEmail)
              Message:
                Subject:
                  Data: "Order Confirmed"
                Body:
                  Text:
                    Data.$: States.Format('Order {} confirmed.', $.orderId)
              Source: orders@example.com
            End: true
      - StartAt: PublishSNS
        States:
          PublishSNS:
            Type: Task
            Resource: arn:aws:states:::sns:publish
            Parameters:
              TopicArn: arn:aws:sns:us-east-1:123456789012:OrderEvents
              Message.$: States.JsonToString($)
            End: true
    End: true

  FailOrder:
    Type: Fail
    Error: OrderFailed
    Cause: "Order could not be completed; compensation applied"

Deploying with Terraform and AWS SAM

At production scale you version-control the ASL definition alongside your Lambda code. With Terraform, the aws_sfn_state_machine resource takes the definition as a JSON string, typically rendered from a template file. With SAM, state machines live in template.yaml as AWS::Serverless::StateMachine resources; with the CDK they are defined in code. The example below deploys a Standard Workflow with Terraform, then uses the AWS CLI to pull the failure events from the most recent failed execution, which makes for a fast inner loop when debugging state transitions.

# Terraform — deploy a Standard Workflow from an ASL file
resource "aws_sfn_state_machine" "order_pipeline" {
  name     = "order-pipeline-${var.env}"
  role_arn = aws_iam_role.sfn_exec.arn
  type     = "STANDARD"

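  # The API expects JSON, so decode the rendered YAML and re-encode it.
  # The variables below only take effect where the ASL file references them,
  # e.g. Resource: ${validate_fn} in place of a hardcoded ARN.
  definition = jsonencode(yamldecode(templatefile("${path.module}/order-pipeline.asl.yaml", {
    validate_fn  = aws_lambda_function.validate_order.arn
    reserve_fn   = aws_lambda_function.reserve_inventory.arn
    charge_fn    = aws_lambda_function.charge_payment.arn
    release_fn   = aws_lambda_function.release_reservation.arn
    notify_topic = aws_sns_topic.order_events.arn
  })))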

  logging_configuration {
    log_destination        = "${aws_cloudwatch_log_group.sfn.arn}:*"
    # Execution data can contain payloads; keep it out of production logs
    include_execution_data = var.env != "prod"
    level                  = "ERROR"
  }

  tracing_configuration {
    enabled = true   # X-Ray across all state transitions
  }
}

# Shell: inspect the most recent failed execution to debug a failure
EXEC_ARN=$(aws stepfunctions list-executions \
  --state-machine-arn arn:aws:states:us-east-1:123456789012:stateMachine:order-pipeline-prod \
  --status-filter FAILED --max-results 1 \
  --query "executions[0].executionArn" --output text)

# Direct Lambda ARN tasks emit LambdaFunctionFailed; service integrations emit TaskFailed
aws stepfunctions get-execution-history \
  --execution-arn "$EXEC_ARN" \
  --query "events[?type=='TaskFailed' || type=='LambdaFunctionFailed' || type=='ExecutionFailed']" \
  --output json
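When the raw history is too noisy to read, a small helper can reduce it to the error and cause of each failure. The sketch below works on the `events` list of a GetExecutionHistory response (for example, the parsed JSON output of the CLI without a `--query` filter); `summarize_failures` is a hypothetical helper name, not part of any AWS SDK.

```python
from typing import Dict, List

# History event types that carry an error and cause, mapped to their detail key.
FAILURE_DETAIL_KEYS = {
    "TaskFailed": "taskFailedEventDetails",
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
    "ExecutionFailed": "executionFailedEventDetails",
}


def summarize_failures(events: List[Dict]) -> List[Dict]:
    """Return one compact record per failure event, in history order."""
    failures = []
    for event in events:
        detail_key = FAILURE_DETAIL_KEYS.get(event.get("type"))
        if detail_key is None:
            continue  # not a failure event
        details = event.get(detail_key, {})
        failures.append(
            {
                "id": event.get("id"),
                "type": event["type"],
                "error": details.get("error"),
                "cause": details.get("cause"),
            }
        )
    return failures
```

The last record is usually the ExecutionFailed event, and the one before it points at the state that actually failed.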

Production Failure Modes and Senior-Level Trade-offs

Step Functions introduces its own failure modes that teams new to orchestration hit in production:

Payload size limits — state machine input/output is capped at 256 KB. For large payloads (ML inference results, document contents), store the data in S3 and pass an S3 reference through the state machine. Use ResultSelector to strip Lambda responses down to only the fields you need before they enter the state.
Express Workflow at-least-once semantics — if your downstream work is not idempotent, a re-executed Express Workflow can charge a customer twice or create duplicate records. Always implement idempotency keys (order ID + step name as the DynamoDB key) when using Express mode.
Fan-out with Map state concurrency — MaxConcurrency: 0 (the default) places no limit on parallelism, so Step Functions runs as many iterations concurrently as it can. At 10,000 items per Map execution you can exhaust Lambda concurrency quotas or overwhelm a downstream API. Set MaxConcurrency to a value aligned with your service quota limits; 40-100 for Lambda and 10-20 for third-party APIs are reasonable starting points.
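Standard Workflow cost at high frequency — each state transition costs $0.000025. A 10-state workflow running 100,000 times per day makes about 1,000,000 transitions, or roughly $25/day in state transitions alone, not counting Lambda. Express Workflows are billed per request and by duration rather than per transition, so for high-frequency, short-duration flows they are often cheaper by an order of magnitude or more, depending on execution time and memory.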
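The idempotency-key advice in the list above can be sketched with a conditional write. This assumes a boto3 DynamoDB Table resource (`boto3.resource("dynamodb").Table(...)`) whose partition key is a string attribute named `pk`; the attribute name and the helper `run_once` are assumptions made for illustration.

```python
from typing import Any, Callable, Optional, Tuple


def run_once(
    table: Any, order_id: str, step_name: str, work: Callable[[], Any]
) -> Tuple[str, Optional[Any]]:
    """Run work() at most once per (order, step) by claiming a key first."""
    key = f"{order_id}#{step_name}"
    try:
        # The write succeeds only if no item with this key exists yet.
        table.put_item(
            Item={"pk": key},
            ConditionExpression="attribute_not_exists(pk)",
        )
    except Exception as exc:
        # botocore's ClientError exposes the service error code on .response
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return ("skipped", None)  # another attempt already claimed this step
        raise
    try:
        return ("done", work())
    except Exception:
        # Release the claim so a retry can attempt the work again.
        table.delete_item(Key={"pk": key})
        raise
```

A production version would likely also store the step's result and a status on the item, so that a duplicate invocation can return the original outcome instead of only skipping.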

When you mix orchestration and choreography — the hybrid saga pattern — Step Functions orchestrates the local transaction scope (validate, reserve, charge) while EventBridge carries the final order.fulfilled event to decoupled downstream consumers (analytics, warehouse, CRM). This gives you transaction correctness where it matters and decoupling where it scales.
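The hand-off from the orchestrated scope to EventBridge could be sketched as follows, assuming a boto3 EventBridge client (`boto3.client("events")`) is passed in. The bus name `orders`, the source `orders.pipeline`, and both function names are hypothetical; only the `order.fulfilled` detail type comes from the text above.

```python
import json
from typing import Any, Dict


def build_fulfilled_entry(order_id: str, bus_name: str = "orders") -> Dict[str, str]:
    """Build one PutEvents entry announcing that an order was fulfilled."""
    return {
        "Source": "orders.pipeline",      # hypothetical source name
        "DetailType": "order.fulfilled",
        "Detail": json.dumps({"orderId": order_id}),  # Detail must be a JSON string
        "EventBusName": bus_name,         # hypothetical bus name
    }


def publish_fulfilled(events_client: Any, order_id: str, bus_name: str = "orders") -> Dict:
    """Publish the event and fail loudly if EventBridge rejected the entry."""
    response = events_client.put_events(Entries=[build_fulfilled_entry(order_id, bus_name)])
    # PutEvents can return HTTP 200 while individual entries fail.
    if response.get("FailedEntryCount", 0) > 0:
        raise RuntimeError(f"order.fulfilled not published: {response.get('Entries')}")
    return response
```

The same event can also be emitted without a Lambda by ending the state machine with a Task that uses the EventBridge PutEvents service integration, which keeps the publish step visible in the execution history.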

Never put secrets or PII inside the state machine input or output. Step Functions retains execution history, including each state's input and output, and shows it in the console and API to anyone with states:GetExecutionHistory; with include_execution_data enabled, the same payloads are also written to CloudWatch Logs. Pass a customer ID, not a credit card number, and have the Lambda look up the sensitive data from a secure store such as Secrets Manager using that ID.

Observability: Execution History vs. CloudWatch Metrics

Standard Workflows give you a visual execution graph in the console — indispensable for debugging. For operational alerting at scale, set CloudWatch alarms on the built-in metrics: ExecutionsFailed, ExecutionsTimedOut, and ExecutionThrottled. Enable include_execution_data: true only in non-production environments or for sampled executions — at high volume it significantly increases CloudWatch Logs costs. In production, log at ERROR level and rely on X-Ray traces to reconstruct the full call graph across Lambda invocations within the workflow.
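An alarm on ExecutionsFailed might be defined as in the sketch below. It builds the keyword arguments for boto3's CloudWatch `put_metric_alarm`; the function name, the alarm naming scheme, the five-minute period, and the threshold of one failure are all assumptions to adjust for your own traffic.

```python
from typing import Any, Dict


def failed_executions_alarm(
    state_machine_arn: str, sns_topic_arn: str, threshold: int = 1
) -> Dict[str, Any]:
    """Build put_metric_alarm arguments for failed executions of one state machine."""
    machine_name = state_machine_arn.rsplit(":", 1)[-1]
    return {
        "AlarmName": f"{machine_name}-executions-failed",  # hypothetical naming scheme
        "Namespace": "AWS/States",
        "MetricName": "ExecutionsFailed",
        "Dimensions": [{"Name": "StateMachineArn", "Value": state_machine_arn}],
        "Statistic": "Sum",
        "Period": 300,                 # evaluate in 5-minute windows
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # No data points means no failures, so do not alarm on missing data.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }
```

With a client created as `boto3.client("cloudwatch")`, the alarm would be created by `cloudwatch.put_metric_alarm(**failed_executions_alarm(arn, topic_arn))`. Swapping MetricName for ExecutionsTimedOut or ExecutionThrottled covers the other two metrics.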