Multi-Cloud: Azure & GCP

The Multi-Cloud Reality

Walk into the infrastructure review of any Fortune-500 company and you will almost certainly find services running on at least two cloud providers — often three. This is not an accident, a mistake, or a symptom of poor governance. It is the predictable outcome of how large organizations actually grow, acquire companies, negotiate contracts, and manage risk. Understanding the real drivers behind multi-cloud — and the hard limits of "cloud portability" — is the foundation every senior DevOps engineer needs before touching a single Terraform module or a cross-cloud VPN.

How Organizations Actually End Up Multi-Cloud

The narrative that engineering teams choose multi-cloud upfront is largely a myth. In practice, companies land in a multi-cloud posture through one of five paths:

Mergers and acquisitions. Your company acquires a startup that ran entirely on GCP. The board expects the deal to close in six months. Migrating 200 services to AWS in that window is not feasible. The realistic path is a hybrid network connection (VPN or Interconnect) while you plan a longer-term consolidation — which often never fully happens because the acquired services ship features faster than migration tickets get prioritized.
Vendor-specific best-of-breed services. BigQuery is widely regarded as one of the strongest managed analytics engines. Microsoft Entra ID (formerly Azure Active Directory) is already the corporate identity provider because the company was Microsoft-first before it moved workloads to AWS. Snowflake runs on the cloud the data team chose before platform engineering existed. Each team picks the best tool for its job, and those tools span providers.
Negotiation leverage. A $20M annual cloud spend gives you meaningful negotiating power — but only if the vendor believes you can move. Maintaining a real workload on a second provider, even a small one, is frequently justified internally as a hedge against price increases and lock-in. Finance teams understand this argument even when engineering teams resist the operational overhead.
Regulatory and data-residency requirements. Some jurisdictions require data to remain in-country, and not every provider has a local region in every regulated market. A global SaaS company may serve EU customers from AWS eu-central-1, Japanese customers from GCP asia-northeast1, and Middle Eastern customers from Azure UAE North — not because the architecture is elegant, but because those are the only options that satisfy data-residency regulations in each market.
Risk diversification after a major outage. The December 2021 AWS us-east-1 outage and the 2022 GCP multi-region event reminded enterprises that any single provider can fail in ways that take hours to resolve. The 2024 CrowdStrike incident, a faulty endpoint-security update rather than a cloud outage, showed that a single vendor dependency can do the same. Boards and CISOs increasingly require a documented failover capability on a second provider for tier-1 services, regardless of the engineering cost.

The pattern to internalize: multi-cloud is almost always an organizational outcome, not an engineering decision. Your job as a DevOps engineer is not to prevent it but to build the tooling and practices that make it manageable: consistent observability, unified secrets management, infrastructure-as-code patterns that stay consistent across providers, and clear runbooks for cross-cloud incident response.

The Portability Myth

Every cloud provider — and every Kubernetes vendor — sells the idea that their platform is portable: move your workloads anywhere. The reality is more constrained. Portability exists at different layers, and each layer has a different cost.

Container-layer portability (high, cheap): A Docker image built for linux/amd64 or linux/arm64 runs identically on EKS, GKE, or AKS. The compute layer is genuinely portable. This is the layer Kubernetes was designed to standardize.
Infrastructure-layer portability (medium, expensive): Terraform modules abstract provider APIs behind a consistent HCL interface, but a module that provisions an AWS ALB cannot provision an Azure Application Gateway by changing a variable. You need parallel modules, parallel state files, and parallel CI pipelines. The abstraction cost is real engineering time.
Managed-service portability (low, very expensive): Aurora PostgreSQL is PostgreSQL-compatible, but it is not stock Postgres: its storage, replication, and failover behavior are AWS-specific. Cloud Spanner is a proprietary Google service, not an implementation of any open standard. BigQuery's SQL dialect, partitioning strategies, and slot-based pricing have no drop-in equivalent on AWS or Azure. The moment your application depends on a managed service beyond plain PostgreSQL, which every major provider offers, you have accepted lock-in at the data layer, and data-layer lock-in is the hardest to reverse.
Operational portability (lowest, hardest): Your teams know CloudWatch, not Azure Monitor. Your on-call runbooks reference aws ec2 describe-instances, not az vm list. Cognitive overhead is real. Multi-cloud doubles the tool surface your engineers must stay current on.

The abstraction trap: teams sometimes try to solve portability by building an internal abstraction layer, a platform API that hides AWS and GCP behind a common interface. This sounds elegant and usually ends badly. You build the slowest, least-featured version of each cloud provider, you own all the bugs, and you block your teams from using provider-native features that would have solved their problems in a day. Abstract at the process layer (IaC patterns, Helm charts, observability pipelines), not at the API layer.

What Big-Tech Actually Standardizes

Rather than chasing full portability, high-performing engineering organizations standardize the things that genuinely matter across clouds:

Identity and access: a single IdP (Okta, Entra ID, Google Workspace) federated to all three clouds via SAML/OIDC. Engineers log in once; role assumption is provider-specific but governed centrally.
Secrets management: HashiCorp Vault (or its cloud-native equivalent) as the single source of truth for secrets, with cloud-provider auth backends. No secrets are stored in cloud-native secret managers in isolation — they are all managed through the Vault API.
Observability: a single pane of glass — Datadog, Grafana Cloud, or a self-hosted Prometheus/Thanos stack — that ingests metrics, logs, and traces regardless of provider. CloudWatch metrics are exported; GCP Cloud Monitoring metrics are exported. Engineers see one dashboard for the whole fleet.
Cost visibility: a FinOps platform (Apptio Cloudability, CloudHealth, or the open-source OpenCost) that normalizes spend across providers into a single report with consistent tagging taxonomy.
Networking: a defined transit architecture — typically AWS Transit Gateway + GCP HA VPN or dedicated interconnects — with consistent IP address management (IPAM) to prevent CIDR overlap across providers. Overlapping CIDRs in a multi-cloud network are one of the most painful production problems to remediate.

# ── Detect CIDR overlap before establishing cross-cloud peering ────────────────
# Pre-flight check to run before provisioning a VPN or interconnect.

AWS_CIDR="10.10.0.0/16"
GCP_CIDR="10.10.128.0/18"   # overlaps with the AWS block above!

# Pass the shell variables to Python as arguments so the check uses the real values.
python3 - "$AWS_CIDR" "$GCP_CIDR" <<'EOF'
import ipaddress, sys
aws = ipaddress.ip_network(sys.argv[1])
gcp = ipaddress.ip_network(sys.argv[2])
if aws.overlaps(gcp):
    # Two overlapping CIDR blocks are always nested, so the shared range is the smaller block.
    shared = aws if aws.prefixlen >= gcp.prefixlen else gcp
    print(f"OVERLAP DETECTED: {aws} and {gcp} share addresses.")
    print(f"Overlapping range: {shared}")
    sys.exit(1)
else:
    print("No overlap; safe to proceed with peering.")
EOF
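
Detecting overlap after the fact is only half of IPAM; the other half is handing out non-overlapping blocks in the first place. The following is a minimal sketch, assuming a single private supernet that is carved into equal-sized blocks per environment. The supernet `10.0.0.0/8`, the owner labels, and the helper names `allocate_blocks` and `find_overlaps` are illustrative assumptions, not part of any provider's tooling.

```python
import ipaddress
from itertools import combinations


def allocate_blocks(supernet, new_prefix, owners):
    """Assign each owner the next unused /new_prefix block from the supernet."""
    pool = ipaddress.ip_network(supernet).subnets(new_prefix=new_prefix)
    allocation = {}
    for owner in owners:
        try:
            allocation[owner] = next(pool)
        except StopIteration:
            raise ValueError(f"supernet {supernet} exhausted before {owner}") from None
    return allocation


def find_overlaps(blocks):
    """Return every pair of owners whose blocks share at least one address."""
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(blocks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Hypothetical plan: one /16 per environment, all carved from 10.0.0.0/8.
plan = allocate_blocks("10.0.0.0/8", 16, ["aws-prod", "gcp-prod", "azure-prod"])
for owner, block in plan.items():
    print(f"{owner}: {block}")

# A block assigned by hand outside the allocator collides with aws-prod,
# and the pairwise check catches it.
plan["gcp-legacy"] = ipaddress.ip_network("10.0.128.0/18")
print(find_overlaps(plan))
```

Allocating from one generator keeps the plan collision-free by construction, while the pairwise check guards against blocks that were assigned by hand or inherited through an acquisition.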

Figure: A typical enterprise multi-cloud footprint. Each provider hosts different workloads; a cross-cutting standardization layer manages secrets, identity, observability, IaC, and cost across all three.
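
The cost-visibility item in the standardization list depends on a consistent tagging taxonomy, because each provider's teams tend to spell the same tag differently. The sketch below shows one way that normalization could look. It assumes simplified billing rows that have already been converted to USD. The field names (`cost_usd`, `tags`), the tag keys in `CANONICAL_KEYS`, and the sample figures are hypothetical and do not reflect any provider's actual export schema.

```python
from collections import defaultdict

# Hypothetical mapping from provider-specific tag or label keys to one canonical key.
CANONICAL_KEYS = {
    "Team": "team", "team": "team", "owner-team": "team",
    "Environment": "env", "env": "env", "environment": "env",
}


def normalize_tags(raw_tags):
    """Map known tag keys to canonical names and lowercase the values."""
    normalized = {}
    for key, value in raw_tags.items():
        canonical = CANONICAL_KEYS.get(key)
        if canonical is not None:
            normalized[canonical] = value.strip().lower()
    return normalized


def spend_by_team(records):
    """Sum cost per team across providers; untagged spend goes to 'unallocated'."""
    totals = defaultdict(float)
    for record in records:
        tags = normalize_tags(record.get("tags", {}))
        totals[tags.get("team", "unallocated")] += record["cost_usd"]
    return dict(totals)


# Illustrative rows only; real exports need per-provider parsing first.
records = [
    {"provider": "aws", "cost_usd": 1200.0, "tags": {"Team": "Payments", "Environment": "prod"}},
    {"provider": "gcp", "cost_usd": 300.0, "tags": {"team": "payments", "env": "prod"}},
    {"provider": "azure", "cost_usd": 150.0, "tags": {"owner-team": "identity"}},
    {"provider": "gcp", "cost_usd": 50.0, "tags": {}},
]
print(spend_by_team(records))
```

Keeping an explicit "unallocated" bucket makes untagged spend visible instead of silently dropping it, which is usually the first thing a FinOps review asks about.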

Pragmatic Decision Framework: When to Go Multi-Cloud

Before adding a second cloud provider, run this checklist honestly. Every "no" increases the operational debt you are taking on:

Do you have a dedicated platform engineering team that will own the cross-cloud tooling? (Solo DevOps engineers cannot sustain two cloud footprints at big-tech quality.)
Is the use case genuinely better served by provider B, or are you using provider B because a team chose it before platform standards existed?
Have you modeled the steady-state operational cost — not just the migration cost — including on-call burden, tooling licenses, and training?
Do you have a documented plan for cross-cloud incident response? Who owns the bridge call when the inter-cloud VPN goes down at 3 AM?
Is the regulatory or business driver for multi-cloud documented and signed off by a stakeholder, or is this an engineering preference dressed up as a strategy?

# ── Quick audit: find what cloud resources an org already has running ──────────
# Run this in a shell with AWS, GCP, and Azure CLIs configured and authenticated.
# The GCP and Azure listings are truncated to the first 20 lines by `head`.

echo "=== AWS: Running EC2 instances per region ==="
aws ec2 describe-regions --query 'Regions[].RegionName' --output text | \
  tr '\t' '\n' | while read -r region; do
    count=$(aws ec2 describe-instances --region "$region" \
      --filters Name=instance-state-name,Values=running \
      --query 'length(Reservations[].Instances[])' --output text 2>/dev/null)
    # Treat a failed or empty lookup as zero instances for that region.
    [ "${count:-0}" -gt 0 ] 2>/dev/null && echo "  $region: $count instances"
  done

echo ""
echo "=== GCP: Compute instances per zone ==="
# Covers only the currently configured gcloud project.
gcloud compute instances list --format="table(zone,name,status)" \
  --filter="status=RUNNING" 2>/dev/null | head -20

echo ""
echo "=== Azure: VMs per resource group ==="
# Covers only the currently selected Azure subscription.
az vm list --show-details \
  --query "[?powerState=='VM running'].[resourceGroup,name,location]" \
  --output table 2>/dev/null | head -20
Start with observability, not compute: before you write a single cross-cloud Terraform module, instrument both environments into a single Grafana or Datadog workspace. You cannot operate what you cannot see. Unified observability is widely regarded as the highest-ROI first step in a multi-cloud initiative, because it surfaces the real traffic patterns, failure modes, and cost drivers that should inform every subsequent architectural decision.
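
Shipping metrics from two clouds into one workspace only helps if the series are comparable once they arrive. The sketch below illustrates the kind of normalization step an ingestion pipeline might apply to VM CPU metrics. The `Sample` type, the `METRIC_MAP` table, and the common name `vm_cpu_percent` are hypothetical. The source metric names and units reflect a general understanding of each provider's VM CPU metric and should be checked against current provider documentation.

```python
from dataclasses import dataclass
from typing import Optional

# (provider, source metric) -> (common name, multiplier to convert to percent).
# Assumption: AWS and Azure report CPU as a percentage, GCP as a 0.0-1.0 fraction.
METRIC_MAP = {
    ("aws", "CPUUtilization"): ("vm_cpu_percent", 1.0),
    ("gcp", "compute.googleapis.com/instance/cpu/utilization"): ("vm_cpu_percent", 100.0),
    ("azure", "Percentage CPU"): ("vm_cpu_percent", 1.0),
}


@dataclass(frozen=True)
class Sample:
    provider: str
    metric: str
    value: float
    resource: str


def normalize(sample: Sample) -> Optional[Sample]:
    """Rename a provider metric to the common name and convert its unit.

    Returns None for metrics that have no mapping, so callers can decide
    whether to drop them or pass them through unchanged.
    """
    mapping = METRIC_MAP.get((sample.provider, sample.metric))
    if mapping is None:
        return None
    name, factor = mapping
    return Sample(sample.provider, name, sample.value * factor, sample.resource)


# Illustrative samples; resource names are made up.
raw = [
    Sample("aws", "CPUUtilization", 75.0, "i-example"),
    Sample("gcp", "compute.googleapis.com/instance/cpu/utilization", 0.5, "vm-example"),
    Sample("azure", "Percentage CPU", 20.0, "vm-example-2"),
]
for sample in raw:
    print(normalize(sample))
```

Without the unit conversion, a GCP instance at 50% CPU would plot as 0.5 next to an AWS instance at 75, which is exactly the kind of mismatch that makes a shared dashboard misleading.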

The Honest State of Multi-Cloud in 2025

After years of multi-cloud hype, the industry has settled into a pragmatic consensus: active-active multi-cloud for arbitrary workloads is not economically viable for most organizations. What works at scale is a tiered model:

Tier 1 (primary cloud, 80-90% of spend): one provider (AWS, Azure, or GCP) as the primary platform. Deep native integrations, managed services, and provider-specific expertise. This is where your product runs.
Tier 2 (secondary cloud, 10-20% of spend): a second provider for specific, justified workloads, such as BigQuery for analytics, Entra ID for identity, a secondary Kubernetes cluster for regulatory failover, or a managed database used by an acquired business unit.
Tier 3 (standardization tooling, spans both): the cross-cutting concerns described above: Vault, a single IdP, unified observability, cost normalization. This tier is not a cloud provider; it is your internal platform layer.
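
The spend bands in the tiered model are easy to check against real billing totals. As a rough sketch, the following computes each provider's share of spend and reports whether the largest provider reaches the 80% lower bound given for Tier 1. The function names, the `primary_floor` parameter, and the monthly figures are hypothetical.

```python
def spend_shares(monthly_spend):
    """Return each provider's fraction of total spend."""
    total = sum(monthly_spend.values())
    if total <= 0:
        raise ValueError("total spend must be positive")
    return {provider: amount / total for provider, amount in monthly_spend.items()}


def classify(monthly_spend, primary_floor=0.80):
    """Identify the primary provider and check it against the Tier 1 lower bound."""
    shares = spend_shares(monthly_spend)
    primary = max(shares, key=shares.get)
    return {
        "primary": primary,
        "primary_share": shares[primary],
        "matches_tiered_model": shares[primary] >= primary_floor,
    }


# Made-up monthly totals in USD.
spend = {"aws": 850_000.0, "gcp": 120_000.0, "azure": 30_000.0}
result = classify(spend)
print(result["primary"], f"{result['primary_share']:.0%}", result["matches_tiered_model"])
```

A footprint where no provider reaches the floor is not necessarily wrong, but it is a signal that the organization is paying the operational cost of two or more primary platforms.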

The tutorials ahead will teach you Azure and GCP in depth — their compute models, networking primitives, managed Kubernetes offerings, and DevOps toolchains. As you learn each service, keep coming back to this lesson's question: why would an organization actually need this, and what is the true operational cost of adding it? That question separates engineers who deploy infrastructure from engineers who design it.