Service Mapping Across Clouds
A senior DevOps engineer evaluating a cross-cloud migration or designing a cloud-portable architecture does not compare logos — they compare service contracts. AWS S3, Azure Blob Storage, and Google Cloud Storage all store objects, but their consistency models, pricing dimensions, lifecycle rule syntax, and failure behavior differ in ways that bite you in production at exactly the wrong moment. This lesson builds your mental translation table: storage, messaging, managed databases, and serverless compute across AWS, Azure, and GCP — not just the names, but the behavioral and operational differences that matter at scale.
Object Storage: S3 vs Blob Storage vs Cloud Storage
Object storage is the first thing every cloud architect maps because nearly every service depends on it — build artifacts, logs, ML datasets, static assets. The three providers converge on the same core API shape but diverge on semantics.
AWS S3 sets the de-facto standard. Objects live in buckets whose names are globally unique, each created in a chosen region. Since December 2020, S3 provides strong read-after-write consistency for PUTs, DELETEs, and LIST operations, so there is no stale-read window after you write an object. Lifecycle rules, Intelligent-Tiering, Glacier transitions, and S3 Object Lambda are the most mature in the industry. Access control is a three-layer cake: IAM policies, bucket policies, and ACLs. ACLs are discouraged for new workloads; the BucketOwnerEnforced Object Ownership setting disables them, and it has been the default for new buckets since April 2023.
Azure Blob Storage organizes blobs inside containers inside storage accounts. The storage account is the billing and networking boundary, a detail that surprises AWS engineers who expect a flat bucket model. Access tiers are Hot, Cool, Cold, and Archive; lifecycle rules are defined in JSON management policies. Azure also offers a Hierarchical Namespace (ADLS Gen2) that turns Blob Storage into a filesystem with real directories, atomic directory renames, and POSIX-style ACLs for big-data workloads. S3 and GCS have no long-standing direct equivalent; their hierarchical options (S3 Express One Zone directory buckets and GCS buckets with hierarchical namespace enabled) are considerably newer.
Google Cloud Storage is closest to S3 in model: globally unique bucket names, no intermediate account container, and strong consistency for object reads, writes, and listings. GCS is the native home for BigQuery external tables and Vertex AI datasets, giving it an ecosystem advantage for analytics workloads. The gsutil CLI is now legacy; Google recommends gcloud storage, which is significantly faster for parallel transfers.
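To make the addressing difference concrete, the sketch below builds the public HTTPS URL of the same object on each provider. It assumes the standard endpoint formats (virtual-hosted-style for S3, the blob endpoint for Azure, storage.googleapis.com for GCS); the bucket, account, and container names are hypothetical, and the helpers only format strings, with no authentication or API calls.

```python
from urllib.parse import quote

# Hypothetical bucket/account/container names; these helpers only build URL
# strings and do not authenticate or call any service.


def s3_url(bucket: str, region: str, key: str) -> str:
    # Virtual-hosted-style S3 URL: bucket -> key (two levels).
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"


def azure_blob_url(account: str, container: str, blob: str) -> str:
    # Azure adds a level: storage account -> container -> blob.
    return f"https://{account}.blob.core.windows.net/{container}/{quote(blob)}"


def gcs_url(bucket: str, obj: str) -> str:
    # GCS mirrors S3: bucket -> object (two levels).
    return f"https://storage.googleapis.com/{bucket}/{quote(obj)}"


if __name__ == "__main__":
    key = "builds/app.tar.gz"
    print(s3_url("my-artifacts", "us-east-1", key))
    print(azure_blob_url("mystorageacct", "artifacts", key))
    print(gcs_url("my-artifacts", key))
```

Only the Azure helper needs three names to locate an object, which is why code that assumes a flat bucket/key pair usually needs an extra configuration value when it moves to Blob Storage.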
Messaging: Queues, Topics, and Event Buses
Messaging is where the naming confusion is worst. AWS alone has at least four overlapping services (SQS, SNS, EventBridge, and Kinesis); Azure and GCP each made different structural choices that reflect different philosophies about what a messaging primitive should do.
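AWS SQS is a pull-based queue. Consumers poll for messages, and each received message is hidden from other consumers for the visibility timeout: if the consumer deletes it before the timeout expires, it is gone; otherwise it becomes visible again and is redelivered. Messages that are never deleted stay in the queue until the retention period expires (4 days by default, up to 14 days). SQS Standard offers at-least-once delivery with best-effort ordering; SQS FIFO offers exactly-once processing and strict ordering within a message group, capped by default at 300 API calls per second, or 3,000 messages per second with batching (high-throughput mode raises this considerably). SNS is a fan-out pub/sub service: one publish, many subscribers (SQS queues, Lambda, HTTP endpoints). EventBridge is a rule-based event bus that routes events from AWS services or your own apps to targets based on JSON pattern matching. As of 2025, EventBridge is the commonly recommended integration layer for new serverless architectures.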
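Because the visibility timeout is the most commonly misunderstood part of SQS, here is a minimal in-memory model of it under simplifying assumptions: a manual clock, integer message IDs instead of receipt handles, and no retention period or dead-letter queue. ToyQueue is a hypothetical teaching class, not the SQS API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class _Message:
    body: str
    visible_at: float = 0.0  # time at which the message can be received again
    receive_count: int = 0


@dataclass
class ToyQueue:
    """In-memory model of SQS visibility semantics; not the SQS API."""
    visibility_timeout: float = 30.0
    _messages: dict[int, _Message] = field(default_factory=dict)
    _next_id: int = 0

    def send(self, body: str) -> int:
        self._next_id += 1
        self._messages[self._next_id] = _Message(body)
        return self._next_id

    def receive(self, now: float) -> Optional[tuple[int, str]]:
        # Hand out the first visible message and hide it (it is NOT removed).
        for msg_id, msg in self._messages.items():
            if msg.visible_at <= now:
                msg.visible_at = now + self.visibility_timeout
                msg.receive_count += 1
                return msg_id, msg.body
        return None

    def delete(self, msg_id: int) -> None:
        # Only an explicit delete removes a message; a timeout never does.
        self._messages.pop(msg_id, None)

    def __len__(self) -> int:
        return len(self._messages)


if __name__ == "__main__":
    q = ToyQueue(visibility_timeout=30)
    q.send("resize image 42")
    print(q.receive(now=0))        # consumer A receives the message
    print(q.receive(now=10))       # hidden while A works: None
    msg_id, _ = q.receive(now=31)  # A crashed; timeout expired, B receives it
    q.delete(msg_id)               # B succeeds and deletes
    print(len(q))                  # 0
```

The design point the model captures is that receiving and deleting are separate steps: a consumer that crashes between them causes a redelivery, which is why SQS Standard consumers must tolerate duplicates.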
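Azure Service Bus covers the enterprise queueing ground of both SQS and SNS: queues for point-to-point delivery and topics with subscriptions for fan-out. Sessions give ordered processing per session ID (comparable to SQS FIFO message groups), and it adds dead-letter queues and message deferral. Azure Event Grid maps most closely to SNS and EventBridge: reactive, push-based, filtered routing of events from Azure resources to handlers. Azure Event Hubs is a different beast: a partitioned log similar to Apache Kafka (it can expose a Kafka-compatible endpoint), designed for streaming telemetry at millions of events per second, not for task queues.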
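Google Cloud Pub/Sub unifies what AWS splits across SQS and SNS: it is both a queue and a fan-out system. A topic can have multiple subscriptions; each subscription tracks acknowledgments independently, so every subscription receives its own copy of each message published after it was created. Pub/Sub delivers at least once by default, with exactly-once delivery available as an opt-in for pull subscriptions; ordering keys guarantee in-order delivery for messages that share a key and are published in the same region. Eventarc is GCP's answer to EventBridge: it routes Cloud Audit Logs entries, direct events from Google services, and Pub/Sub messages to Cloud Run or Cloud Functions based on event filters.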
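The per-subscription independence can be modeled in a few lines. This is a hypothetical in-memory sketch, not the google-cloud-pubsub client: it collapses delivery and acknowledgment into one call and ignores ack deadlines and ordering keys, but it shows that one publish fans out to every subscription and that each subscription drains its own backlog.

```python
from collections import deque


class ToyTopic:
    """In-memory model of Pub/Sub fan-out; not the google-cloud-pubsub API.

    Each subscription keeps its own backlog, so a slow or crashed subscriber
    never changes what another subscription receives.
    """

    def __init__(self) -> None:
        self._backlogs: dict[str, deque] = {}

    def create_subscription(self, name: str) -> None:
        # Like a real subscription, it only sees messages published after creation.
        self._backlogs[name] = deque()

    def publish(self, data: str) -> None:
        # One publish fans out to every attached subscription.
        for backlog in self._backlogs.values():
            backlog.append(data)

    def pull_and_ack(self, subscription: str, max_messages: int) -> list[str]:
        # Simplification: delivery and acknowledgment happen in one step.
        backlog = self._backlogs[subscription]
        batch = []
        while backlog and len(batch) < max_messages:
            batch.append(backlog.popleft())
        return batch

    def backlog_size(self, subscription: str) -> int:
        return len(self._backlogs[subscription])


if __name__ == "__main__":
    topic = ToyTopic()
    topic.publish("dropped")  # no subscriptions yet, so nobody receives it
    topic.create_subscription("billing")
    topic.create_subscription("audit")
    for i in range(3):
        topic.publish(f"order-{i}")
    print(topic.pull_and_ack("billing", 2))
    print(topic.backlog_size("audit"))
```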
Managed Databases: Relational, NoSQL, and the Proprietary Tier
Database service mapping has three layers: managed open-source (MySQL, PostgreSQL), proprietary scaled relational, and NoSQL document stores.
Managed open-source relational is the simplest row in the table: AWS RDS maps to Azure Database for PostgreSQL / MySQL and to Cloud SQL. All three run vanilla Postgres or MySQL with automated backups, read replicas, and point-in-time recovery. Minor differences: RDS also runs Oracle, SQL Server, MariaDB, and Db2; Cloud SQL also runs SQL Server (Azure covers SQL Server through the separate Azure SQL family); Azure Flexible Server supports zone-redundant HA without a standby proxy hop.
Proprietary scaled relational is where providers diverge sharply. Amazon Aurora decouples compute from a shared distributed storage layer: you scale readers horizontally and storage grows automatically to 128 TiB. Azure SQL Hyperscale does the same for SQL Server workloads, and Google's closest analog is AlloyDB for PostgreSQL, a Postgres-compatible service separate from Cloud SQL that adds a columnar engine for analytical queries. Google Spanner goes furthest: a globally distributed, externally consistent, horizontally scalable relational database that runs SQL across regions with no application-level sharding. Spanner remains the most mature option for global strong consistency at petabyte scale; AWS's Aurora DSQL (announced in late 2024) targets similar multi-region strong consistency but is far younger, and Azure has no direct equivalent.
NoSQL / document stores: DynamoDB ↔ Cosmos DB ↔ Firestore. All three are fully managed key-value/document stores with optional multi-region replication, targeting single-digit millisecond latency. Critical operational differences: DynamoDB lets you choose consistency per read (eventually consistent is the default and costs half as many RCUs as a strongly consistent read); Cosmos DB exposes five consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual), with a default set per account that individual requests can relax; Firestore provides strong consistency for reads and queries. DynamoDB's provisioned mode bills for read and write capacity units (RCU/WCU) whether you use them or not and throttles requests when spikes exceed them, so spiky traffic forces a choice between over-provisioning and throttling; on-demand mode absorbs spikes but costs more per request at steady, high utilization. Cosmos DB's request-unit (RU) model has the same gotcha.
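Capacity planning for DynamoDB is arithmetic on item size. The helpers below apply AWS's documented provisioned-capacity rules: one read unit covers one strongly consistent read per second of an item up to 4 KB (half a unit if eventually consistent, two units if transactional), and one write unit covers one write per second of an item up to 1 KB (two units if transactional), with item sizes rounded up. The function names are illustrative, and the sketch ignores on-demand pricing, secondary indexes, and Streams.

```python
import math

# Read-unit multiplier per consistency mode (relative to a strong read).
_READ_MULTIPLIER = {"eventual": 0.5, "strong": 1.0, "transactional": 2.0}


def read_capacity_units(item_size_bytes: int, reads_per_second: float,
                        consistency: str = "eventual") -> float:
    # Each read is billed in 4 KB units, rounded up per item.
    units_per_read = math.ceil(item_size_bytes / 4096)
    return units_per_read * _READ_MULTIPLIER[consistency] * reads_per_second


def write_capacity_units(item_size_bytes: int, writes_per_second: float,
                         transactional: bool = False) -> float:
    # Each write is billed in 1 KB units, rounded up per item.
    units_per_write = math.ceil(item_size_bytes / 1024)
    return units_per_write * (2.0 if transactional else 1.0) * writes_per_second
```

For example, 100 strongly consistent reads per second of 6 KB items need 200 RCU, while the same traffic read eventually consistently needs 100 RCU.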
Serverless Compute: Lambda vs Azure Functions vs Cloud Functions / Cloud Run
Serverless function mapping is superficially simple — all three providers run your code on request without managing servers — but the operational characteristics differ enough to shape architectural decisions.
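AWS Lambda supports up to 15 minutes of execution, 10 GB of memory, and 10 GB of ephemeral storage. Cold starts range from roughly 100 ms (Node.js or Python with small packages) to several seconds (Java or .NET with heavy frameworks, unless mitigated with SnapStart or provisioned concurrency). VPC attachment no longer adds a significant cold-start penalty: since 2019, Lambda uses shared Hyperplane ENIs created when the function is configured rather than at invocation time. Lambda is triggered natively by S3 events, SQS messages, DynamoDB Streams, Kinesis, API Gateway, EventBridge, and dozens more, without any polling infrastructure you manage.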
Azure Functions runs on a Consumption plan (true serverless, billed per execution) or a Premium plan (pre-warmed instances, no cold start, VNet integration). The Durable Functions extension adds stateful orchestration, comparable to AWS Step Functions, written as ordinary code inside the function runtime. Orchestration state is persisted automatically to a storage backend (Azure Storage in the function app's storage account by default), so you do not provision or operate a separate workflow service. This is a significant architectural advantage for long-running, multi-step workflows.
Google Cloud Functions (2nd gen), now branded Cloud Run functions, is backed by Cloud Run, Google's container-based serverless platform. Cloud Run runs any container (not just runtime-specific function packages), handles requests lasting up to 60 minutes, and scales to zero. For teams already containerizing everything, Cloud Run is often the better target: it bridges the serverless/container divide that AWS partially addresses with Lambda container image support but does not fully close.
Lifecycle Rules: A Concrete Syntax Comparison
Object storage lifecycle rules illustrate how conceptually identical features require completely different configuration syntax — and why a single Terraform module often cannot abstract them transparently.
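As an illustration, take one intent: move objects under logs/ to an infrequent-access tier after 30 days and delete them after 365 days. The sketch below builds it as three Python dicts: the S3 shape accepted by boto3's put_bucket_lifecycle_configuration, an Azure Storage management policy, and the lifecycle field of a GCS bucket resource. The tier choices (STANDARD_IA, Cool, NEARLINE) are rough analogs picked for illustration, not price or latency equivalents, and the rule name "tier-then-expire" is hypothetical.

```python
import json


def s3_lifecycle(prefix: str, cool_after_days: int, delete_after_days: int) -> dict:
    # One rule carries both the transition and the expiration.
    return {"Rules": [{
        "ID": "tier-then-expire",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [{"Days": cool_after_days, "StorageClass": "STANDARD_IA"}],
        "Expiration": {"Days": delete_after_days},
    }]}


def azure_management_policy(container: str, prefix: str, cool_after_days: int,
                            delete_after_days: int) -> dict:
    # Azure prefix filters must begin with the container name.
    return {"rules": [{
        "enabled": True,
        "name": "tier-then-expire",
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"],
                        "prefixMatch": [f"{container}/{prefix}"]},
            "actions": {"baseBlob": {
                "tierToCool": {"daysAfterModificationGreaterThan": cool_after_days},
                "delete": {"daysAfterModificationGreaterThan": delete_after_days},
            }},
        },
    }]}


def gcs_lifecycle(prefix: str, cool_after_days: int, delete_after_days: int) -> dict:
    # GCS uses one action/condition pair per rule, so two rules are needed.
    condition = {"matchesPrefix": [prefix]}
    return {"lifecycle": {"rule": [
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {**condition, "age": cool_after_days}},
        {"action": {"type": "Delete"},
         "condition": {**condition, "age": delete_after_days}},
    ]}}


if __name__ == "__main__":
    print(json.dumps(s3_lifecycle("logs/", 30, 365), indent=2))
    print(json.dumps(azure_management_policy("app-logs", "logs/", 30, 365), indent=2))
    print(json.dumps(gcs_lifecycle("logs/", 30, 365), indent=2))
```

Only the Azure builder needs the container name, because Azure prefix filters must start with it. Differences like this are where a shared Terraform module's common interface starts to leak.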
Production Failure Modes Unique to Each Service
Every service mapping exercise must include the failure modes that do not appear in the official docs but do appear in your 3 AM pager alerts.
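- S3 + Lambda event notifications: S3 event notifications to Lambda do not guarantee ordering and can deliver the same event more than once. If your Lambda function is not idempotent, duplicate or out-of-order events under high write rates can double-apply side effects or let an older event overwrite the result of a newer one. Design all event-driven functions to be idempotent first, and use the sequencer field in each record to order events for the same object key.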
- Azure Service Bus message lock expiry: If your consumer takes longer to process a message than the lock duration (default 60 seconds, maximum 5 minutes), Service Bus releases the lock and another consumer picks the message up, causing duplicate processing. Set the lock duration comfortably above your P99 processing time (2x is a reasonable rule of thumb), and renew the lock programmatically, for example with the SDK's automatic lock renewal, for anything that can exceed the 5-minute maximum.
- Pub/Sub subscription backlog: Pub/Sub retains unacknowledged messages per subscription for up to the subscription's retention period (7 days by default). If a subscriber crashes and the backlog grows for hours, restarting the consumer triggers a flood of messages that can overwhelm downstream services. Use flow control (maximum outstanding messages and bytes) in your Pub/Sub client to throttle delivery on restart.
- DynamoDB hot partitions: DynamoDB distributes reads and writes across partitions by partition key, and each partition serves at most 3,000 RCU and 1,000 WCU per second. A poorly chosen partition key (e.g., a status field with only three values) concentrates traffic on a small number of partitions, triggering ProvisionedThroughputExceededException even when your aggregate throughput is far below the table limit. Design partition keys with high cardinality: user ID, order ID, or a composite key.
- Cosmos DB RU throttling: Cosmos DB divides a container's provisioned RU/s evenly across its physical partitions. A sudden spike in reads for a popular document (viral content, flash sale) can exhaust the RU budget of the physical partition that holds it and return HTTP 429 errors to every request routed to that partition, even if other partitions have headroom. Use the SDK's built-in retry policy with exponential backoff and design for partition isolation of hot items.
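The idempotency advice for S3 notifications can be sketched as a consumer that skips any event whose sequencer is not newer than the last one applied for the same bucket and key. The record fields used (s3.bucket.name, s3.object.key, s3.object.sequencer, eventName) follow the S3 event notification format; IdempotentIndexer is a hypothetical class that a Lambda handler could call, and its in-memory dict stands in for a durable store that would need an atomic conditional write in production.

```python
from urllib.parse import unquote_plus


class IdempotentIndexer:
    """Hypothetical S3-event consumer that records object changes once each.

    `latest` stands in for a durable store; a real deployment needs an atomic
    check-and-set (for example a conditional write) instead of a dict.
    """

    def __init__(self) -> None:
        self.latest: dict[tuple[str, str], int] = {}   # (bucket, key) -> sequencer
        self.applied: list[tuple[str, str, str]] = []  # side effects actually run

    def handle(self, event: dict) -> None:
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (spaces as '+').
            key = unquote_plus(record["s3"]["object"]["key"])
            # Integer comparison matches the documented "left-pad with zeros,
            # then compare lexicographically" rule for hex sequencers.
            seq = int(record["s3"]["object"]["sequencer"], 16)
            previous = self.latest.get((bucket, key))
            if previous is not None and seq <= previous:
                continue  # duplicate or stale (out-of-order) event: skip it
            self.latest[(bucket, key)] = seq
            self.applied.append((bucket, key, record["eventName"]))
```

Sequencers are only comparable for the same object key, which is why the state is keyed by (bucket, key) rather than kept as one global high-water mark.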
Choosing the Right Abstraction Layer
Once you have internalized the service map, the practical question is how much abstraction to build over it. There are three common patterns at big-tech scale:
- Thin wrappers with provider-specific configuration: Use Terraform modules that expose a common interface (e.g., object_storage, message_queue) but generate provider-native resources underneath. This gives you portability at the provisioning layer without hiding provider-specific knobs you will need in production.
- Open-source middleware: Use provider-agnostic runtimes (MinIO for S3-compatible storage, NATS or Apache Kafka for messaging, Dapr for pub/sub and state abstraction) in front of or in place of the native cloud service. This trades operational complexity for portability. It is justified for tier-1 services where portability is a hard requirement, and overkill for most workloads.
- Application-level abstraction: Encode the service contract in your application interface (e.g., a StoragePort or QueuePort in hexagonal architecture) and swap implementations per deployment environment. This keeps cloud-specific code in adapters and your business logic portable, and it is a common approach in large organizations for core services.
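The application-level pattern can be sketched with a typing.Protocol as the port and interchangeable adapters. StoragePort, InMemoryStorage, S3Storage, and archive_invoice are hypothetical names; the S3 adapter assumes an injected boto3 S3 client and uses its put_object and get_object calls, and a Blob Storage or GCS adapter would implement the same two methods.

```python
from typing import Protocol


class StoragePort(Protocol):
    """Hypothetical port: the only storage contract business logic depends on."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStorage:
    """Test adapter that satisfies StoragePort structurally."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


class S3Storage:
    """Sketch of an S3 adapter; `client` is assumed to be a boto3 S3 client."""

    def __init__(self, client, bucket: str) -> None:
        self._client = client
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._client.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        response = self._client.get_object(Bucket=self._bucket, Key=key)
        return response["Body"].read()


def archive_invoice(storage: StoragePort, invoice_id: str, pdf: bytes) -> str:
    # Business logic sees only the port, never a cloud SDK.
    key = f"invoices/{invoice_id}.pdf"
    storage.put(key, pdf)
    return key
```

Because archive_invoice depends only on the port, unit tests run against InMemoryStorage and production wiring chooses the provider adapter at startup.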
The translation table in this lesson is not academic — it is the foundation for making those architectural decisions with precision rather than guesswork. Know the services cold, know where they behave identically and where they diverge, and your multi-cloud architecture will be built on solid ground rather than optimistic assumptions.