Migration Strategies

Moving a real enterprise workload to the cloud is never a clean lift-and-shift. The systems that matter most (the ERP that touches payroll, the monolith that has been running for twelve years, the database cluster that finance will not let you touch without a sign-off chain) all arrive with different risk profiles, ownership structures, and modernization ambitions. Gartner introduced the vocabulary around 2011 with the original 5 Rs; AWS later extended it to 6 Rs and then to the 7 Rs, which are now the de facto migration taxonomy in most enterprise migration programmes.

This lesson teaches you how to classify workloads against the 7 Rs, how to organize the actual work into migration waves, and how to apply the strangler fig pattern to incrementally replace legacy systems without a big-bang cutover that risks the entire business.

The 7 Rs — Workload Classification

Each R is a migration strategy. The goal of the discovery phase is to assign every in-scope workload to exactly one R. That classification drives tooling choices, timeline estimates, team structure, and risk controls.

Retire — Decommission. ~10–20% of a typical enterprise estate falls here: redundant apps, shadow IT, systems with zero active users. Retiring before migrating is the highest ROI activity in any programme. Do not migrate what you can delete.
Retain — Leave on-premises for now, with a scheduled re-evaluation date. Applies to systems with a regulatory hold, hardware-tied software licenses, or less than 18 months remaining before end-of-life. Retain is not a permanent decision — set a quarterly review cadence.
Rehost — "Lift and shift." Move the workload to the cloud with no code changes. Fastest migration path; uses AWS MGN (Application Migration Service) for server replication or a snapshot-based approach. Cost reductions of around 30% are often quoted, but they come from rightsizing instances during the move rather than from the move itself. No modernization: the technical debt moves with it.
Relocate — Move the platform layer, not just the server. Classically used for VMware-on-premises workloads migrated to VMware Cloud on AWS, preserving the exact hypervisor environment and operational tooling. Often a stepping stone to replatform or refactor later.
Replatform — "Lift, tinker, and shift." Migrate with targeted improvements that do not change the core architecture. Examples: move self-managed MySQL to Amazon RDS (managed patching, failover), move a Java app to Elastic Beanstalk (managed runtime), containerize an app without rewriting it. Typically delivers 40–60% operational savings versus rehost.
Repurchase — Replace with a SaaS product. Move from a self-hosted CRM to Salesforce, from on-premises HR software to Workday, from self-managed email to Google Workspace. The migration is mostly data migration and user training, not infrastructure engineering.
Refactor / Re-architect — Redesign the application to be cloud-native. Break a monolith into microservices, migrate from a relational store to DynamoDB for a high-throughput access pattern, adopt event-driven architecture with SQS/EventBridge. Highest cost and risk; highest long-term business value. Reserved for workloads where the current architecture is the bottleneck to growth.
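
The classification above can be sketched as a small decision function. This is a minimal illustration, not an AWS tool: the function name `classify_workload`, its six inputs, and above all the order in which the rules are checked are assumptions made for the example, and a real programme would derive them from discovery data and stakeholder interviews.

```shell
# Hypothetical decision order; tune the rules to your own discovery data.
# Usage: classify_workload ACTIVE_USERS REG_HOLD SAAS_FIT ARCH_BOTTLENECK MANAGED_FIT ON_VMWARE
# ACTIVE_USERS is an integer; every other argument is the literal "yes" or "no".
classify_workload() {
  local users="$1" hold="$2" saas="$3" bottleneck="$4" managed="$5" vmware="$6"
  if [ "$users" -eq 0 ]; then echo "retire"            # nothing to migrate
  elif [ "$hold" = "yes" ]; then echo "retain"         # regulatory or licence hold
  elif [ "$saas" = "yes" ]; then echo "repurchase"     # a SaaS product replaces it
  elif [ "$bottleneck" = "yes" ]; then echo "refactor" # architecture limits growth
  elif [ "$managed" = "yes" ]; then echo "replatform"  # fits a managed service
  elif [ "$vmware" = "yes" ]; then echo "relocate"     # move the VMware platform
  else echo "rehost"                                   # default: lift and shift
  fi
}

classify_workload 0   no no no no  no   # an app with no active users
classify_workload 250 no no no yes no   # an app whose database fits a managed service
```

Each workload gets exactly one R because the first matching rule wins; reordering the rules changes the outcome for workloads that match several conditions.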

A rule-of-thumb distribution: in a typical 200-application portfolio, expect roughly 15% Retire, 15% Retain, 40% Rehost, 10% Replatform, 5% Repurchase, and 15% Refactor. Relocate is omitted from this split because it only applies to estates running VMware on-premises. These ratios shift as you move from the initial migration wave (Rehost-heavy) toward modernization waves (Refactor-heavy). Track them in a migration tracker spreadsheet tied to your landing zone account vending process.
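
To make those ratios concrete, the sketch below converts them into workload counts for a portfolio of a given size. The percentages are the rule-of-thumb figures quoted above, not measured data, and `portfolio_split` is a hypothetical helper name.

```shell
# Expected workload count per strategy for a portfolio of TOTAL applications.
# The name:percent pairs are the rule-of-thumb ratios from the text.
portfolio_split() {
  local total="$1" pair name pct
  for pair in retire:15 retain:15 rehost:40 replatform:10 repurchase:5 refactor:15; do
    name="${pair%%:*}"   # text before the colon
    pct="${pair##*:}"    # text after the colon
    # Integer arithmetic: rounds down when total * pct is not a multiple of 100.
    echo "$name $(( total * pct / 100 ))"
  done
}

portfolio_split 200
```

For 200 applications this gives 30 Retire, 30 Retain, 80 Rehost, 20 Replatform, 10 Repurchase, and 30 Refactor, which is a useful sanity check on early team-sizing estimates.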

Discovery — Assigning the R

You cannot classify what you cannot see. Discovery tooling maps the application portfolio before the first workload moves. AWS offers two first-party tools for this:

AWS Application Discovery Service (ADS) — Agent-based (aws-discovery-agent) or agentless (via vCenter API). Captures CPU, memory, disk, network dependency data over 30+ days. Feeds directly into AWS Migration Hub.
AWS Migration Hub — Central tracking plane. Every migration task from MGN, DMS, the now-retired Server Migration Service, or a partner tool reports status here. Use it as the single pane of glass for programme leadership.

# ── Enable Migration Hub in the home region (once per account for Type ACCOUNT) ─
aws migrationhub-config create-home-region-control \
  --home-region us-east-1 \
  --target '{"Type":"ACCOUNT"}'

# ── Check current home region ──────────────────────────────────────────────────
aws migrationhub-config describe-home-region-controls \
  --query 'HomeRegionControls[0].HomeRegion'

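# ── List all discovered servers in ADS ────────────────────────────────────────
# (ADS has no describe-servers command; servers are listed as configuration items.
#  Attribute keys contain dots, so they are quoted inside the JMESPath query.)
aws discovery list-configurations \
  --configuration-type SERVER \
  --query 'configurations[*].{ServerId:"server.configurationId",Hostname:"server.hostName",OS:"server.osName"}' \
  --output table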

# ── Export dependency data for offline analysis ───────────────────────────────
EXPORT_ID=$(aws discovery start-export-task \
  --export-data-format "CSV" \
  --query 'exportId' --output text)
echo "Export started: $EXPORT_ID"
aws discovery describe-export-tasks --export-ids $EXPORT_ID \
  --query 'exportsInfo[0].{Status:exportStatus,S3Url:configurationsDownloadUrl}'
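
Once the export is downloaded, a first pass over the dependency data can be as simple as counting how many distinct hosts each server talks to. The sketch below assumes a simplified, hypothetical CSV layout (`source_ip,dest_ip,dest_port` with a header row); the real ADS export uses its own file and column names, so adjust the field numbers before using it.

```shell
# Count distinct destination IPs per source server from a connections CSV.
# Assumed (hypothetical) columns: source_ip,dest_ip,dest_port, with a header row.
# Prints "<fan-out> <source_ip>", highest fan-out first.
dependency_fanout() {
  awk -F',' '
    NR > 1 && !seen[$1 FS $2]++ { fanout[$1]++ }   # count each source/dest pair once
    END { for (src in fanout) print fanout[src], src }
  ' "$1" | sort -k1,1nr -k2,2
}
```

Running `dependency_fanout connections.csv | head` surfaces the servers with the most outbound dependencies, which are the ones least likely to be safe to move on their own.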

Migration Waves — Sequencing the Work

A migration wave is a time-boxed batch of workloads that move together. Wave design is the discipline of sequencing migration to manage risk, resource contention, and dependency chains. Big-tech migration programmes typically use a three-tier wave structure:

Three-wave migration structure: foundations first, then business apps, then mission-critical cutover.

Wave 1 (Pilot): Move dev/test environments first — they have zero production blast radius and teach your team how the landing zone, networking, and IAM patterns actually behave with real workloads. Learn your mistakes here, not in Wave 3.

Wave 2 (Business apps): Move medium-priority systems that have clear owners and tested rollback paths. Run AWS Database Migration Service (DMS) continuous replication so the source database stays live during migration — enabling a cut-back window of hours rather than days.

Wave 3 (Mission-critical): Move the systems that cannot go down without executive escalation. By Wave 3 your team has done this dozens of times. Use AWS MGN to launch test instances first, validate them under production-equivalent load, then cut over and flip DNS.

# ── AWS MGN: initialize the service (once per account and region) ─────────────
# (Run before installing the MGN replication agent on any source host)
aws mgn initialize-service --region us-east-1

# ── Tag a server for a specific migration wave ─────────────────────────────────
SERVER_ID="s-0a1b2c3d4e5f67890"
aws mgn tag-resource \
  --resource-arn "arn:aws:mgn:us-east-1:123456789012:source-server/$SERVER_ID" \
  --tags MigrationWave=wave-2,Application=crm,Strategy=rehost

# ── Check replication lag before scheduling a cutover ─────────────────────────
aws mgn describe-source-servers \
  --filters "replicationTypes=AGENT_BASED" \
  --query 'items[*].{ID:sourceServerID,State:dataReplicationInfo.dataReplicationState,Lag:dataReplicationInfo.lagDuration}' \
  --output table

# ── Launch test instance (no production impact) ────────────────────────────────
aws mgn start-test \
  --source-server-ids $SERVER_ID

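# ── Mark test as complete and initiate real cutover ───────────────────────────
# (The mgn CLI has no finish-test command; advance the lifecycle state instead)
aws mgn change-server-life-cycle-state \
  --source-server-id $SERVER_ID \
  --life-cycle state=READY_FOR_CUTOVER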

aws mgn start-cutover \
  --source-server-ids $SERVER_ID

The Strangler Fig Pattern

The strangler fig tree in nature grows around an existing tree, gradually replacing it until the host is entirely encased and eventually dies. Martin Fowler named the corresponding software pattern in 2004, and it remains the dominant approach to safely replacing a monolith at runtime — without a big-bang rewrite that paralyzes delivery for 18 months and routinely fails.

The mechanics: you stand up a new system alongside the old one, route a growing slice of traffic to the new system piece by piece, and retire the old code path only when the new path has proven itself at production traffic. The old monolith never goes dark in one event; it slowly stops receiving traffic until it can be switched off safely.

Strangler fig pattern: the proxy intercepts all traffic and routes an expanding share to new services until the monolith is idle and can be retired.

The proxy — typically an API Gateway, an Application Load Balancer with weighted target groups, or a service mesh like Istio — is the control plane for the strangler. It makes the migration reversible at every step.
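
With an Application Load Balancer as the proxy, shifting traffic comes down to changing the weights on a listener's forward action. The helper below only builds the JSON for that action; `strangler_forward_action` and the variables in the commented usage are hypothetical names, and the ARNs are placeholders you would supply.

```shell
# Print the --default-actions JSON for an ALB listener that splits traffic
# between the legacy and the new target group. NEW_PCT is an integer 0-100.
strangler_forward_action() {
  local legacy_arn="$1" new_arn="$2" new_pct="$3"
  if [ "$new_pct" -lt 0 ] || [ "$new_pct" -gt 100 ]; then
    echo "new_pct must be between 0 and 100" >&2
    return 1
  fi
  # The two weights always sum to 100, so they read directly as percentages.
  printf '[{"Type":"forward","ForwardConfig":{"TargetGroups":[{"TargetGroupArn":"%s","Weight":%d},{"TargetGroupArn":"%s","Weight":%d}]}}]\n' \
    "$legacy_arn" "$(( 100 - new_pct ))" "$new_arn" "$new_pct"
}

# Example (not executed here; LISTENER_ARN and the target group ARNs are placeholders):
# aws elbv2 modify-listener --listener-arn "$LISTENER_ARN" \
#   --default-actions "$(strangler_forward_action "$LEGACY_TG_ARN" "$NEW_TG_ARN" 10)"
```

Rolling back is the same call with a smaller percentage, which is what keeps each step of the strangler reversible.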

Feature flags as the strangler control plane: at large-scale operators such as Netflix and Amazon, the strangler is not purely routing-layer; it is also controlled by feature flags (using tools such as LaunchDarkly or AWS AppConfig). A flag can route 1% of users to the new service while you monitor error rates, and it can be rolled back in seconds without touching ALB weights. This is safer than DNS-level switching for high-value flows like checkout.
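
The core of a percentage rollout is a stable mapping from user to bucket, so the same user always lands on the same side of the flag. The sketch below illustrates that idea with the POSIX `cksum` CRC; it is not how LaunchDarkly or AppConfig are implemented, and `user_bucket` and `route_for_user` are hypothetical names.

```shell
# Deterministically map a user ID to a bucket in the range 0-99.
user_bucket() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)   # first field is the CRC
  echo $(( crc % 100 ))
}

# Route to "new" when the user's bucket falls below the rollout percentage.
route_for_user() {
  local user_id="$1" rollout_pct="$2"
  if [ "$(user_bucket "$user_id")" -lt "$rollout_pct" ]; then
    echo "new"
  else
    echo "legacy"
  fi
}
```

Because buckets are fixed per user, raising the rollout from 1 to 5 keeps the original 1% on the new service and adds another 4%; setting it to 0 sends everyone back to the legacy path.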

Database Migration — The Hard Part

Application servers are usually stateless and easily replaced. The database is not. The most common migration failure mode is a botched database cutover that corrupts or loses data. AWS Database Migration Service (DMS) addresses this with continuous replication (change data capture, CDC): after an initial full load, it streams changes from the source database to the target in near-real-time, so the target typically stays seconds behind the source. When you are ready to cut over, you stop writes to the source, wait for the lag to drain to zero, and only then flip the connection string.

# ── Create a DMS replication instance ─────────────────────────────────────────
aws dms create-replication-instance \
  --replication-instance-identifier prod-migration-ri \
  --replication-instance-class dms.r6i.xlarge \
  --allocated-storage 200 \
  --vpc-security-group-ids sg-0abc123def456789 \
  --replication-subnet-group-identifier prod-migration-subnet-group \
  --multi-az \
  --engine-version 3.5.1

# ── Create source endpoint (on-prem MySQL) ────────────────────────────────────
aws dms create-endpoint \
  --endpoint-identifier onprem-mysql-src \
  --endpoint-type source \
  --engine-name mysql \
  --server-name 10.0.1.50 \
  --port 3306 \
  --username dms_user \
  --password "$DB_PASSWORD" \
  --database-name orders_db

# ── Create target endpoint (Amazon RDS MySQL) ──────────────────────────────────
aws dms create-endpoint \
  --endpoint-identifier rds-mysql-target \
  --endpoint-type target \
  --engine-name mysql \
  --server-name orders-db.cluster-xyz.us-east-1.rds.amazonaws.com \
  --port 3306 \
  --username dms_user \
  --password "$RDS_PASSWORD" \
  --database-name orders_db

# ── Create and start a full-load + CDC replication task ───────────────────────
aws dms create-replication-task \
  --replication-task-identifier orders-full-load-cdc \
  --source-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:XXXXXXXX \
  --target-endpoint-arn arn:aws:dms:us-east-1:123456789012:endpoint:YYYYYYYY \
  --replication-instance-arn arn:aws:dms:us-east-1:123456789012:rep:ZZZZZZZZ \
  --migration-type full-load-and-cdc \
  --table-mappings file://table-mappings.json \
  --replication-task-settings file://task-settings.json

aws dms start-replication-task \
  --replication-task-arn arn:aws:dms:us-east-1:123456789012:task:AAAAAAAA \
  --start-replication-task-type start-replication

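# ── Monitor progress and lag before cutover (target: < 5 seconds) ─────────────
# ReplicationTaskStats has no latency field; CDC lag is published to CloudWatch
# as CDCLatencySource (seconds) in the AWS/DMS namespace.
aws dms describe-replication-tasks \
  --filters Name=replication-task-arn,Values=arn:aws:dms:... \
  --query 'ReplicationTasks[0].{Status:Status,Loaded:ReplicationTaskStats.TablesLoaded,Errored:ReplicationTaskStats.TablesErrored}'

# TASK_RESOURCE_ID is a placeholder for the task's resource ID (assumed here to be
# the final segment of the task ARN). The date calls use GNU date syntax.
aws cloudwatch get-metric-statistics \
  --namespace AWS/DMS \
  --metric-name CDCLatencySource \
  --dimensions Name=ReplicationInstanceIdentifier,Value=prod-migration-ri \
               Name=ReplicationTaskIdentifier,Value="$TASK_RESOURCE_ID" \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Maximum \
  --query 'Datapoints[*].{Time:Timestamp,LagSeconds:Maximum}'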

Never do a cold-stop cutover on a live database. The pattern is: (1) quiesce writes to the source by putting the app in maintenance mode for around 60 seconds, (2) wait for DMS lag to reach zero, (3) promote the target, (4) update the connection string, (5) bring the app back. If anything fails between steps 3 and 5, you revert by pointing the app back at the source. Once the target has accepted writes, reverting without data loss requires a second DMS task, created in advance, that replicates from the target back to the source; DMS does not reverse an existing task on its own. Teams that skip the quiesce step routinely cause split-brain data states that require hours of manual reconciliation.
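
Step 2 of that sequence is easy to get wrong by eyeballing a dashboard, so it helps to script the wait. The function below is a minimal sketch: `wait_for_zero_lag` is a hypothetical helper, and its first argument is any command of yours that prints the current lag in whole seconds (for example, a wrapper around a CloudWatch query).

```shell
# Poll LAG_CMD until it reports 0, or give up after MAX_ATTEMPTS polls.
# Usage: wait_for_zero_lag LAG_CMD MAX_ATTEMPTS SLEEP_SECONDS
# Returns 0 when lag reached zero, 1 on timeout or an unusable reading.
wait_for_zero_lag() {
  local lag_cmd="$1" max_attempts="$2" sleep_seconds="$3" attempt=1 lag
  while [ "$attempt" -le "$max_attempts" ]; do
    lag=$($lag_cmd) || return 1
    case "$lag" in
      ''|*[!0-9]*) echo "unexpected lag reading: $lag" >&2; return 1 ;;
    esac
    if [ "$lag" -eq 0 ]; then
      echo "lag is zero after $attempt poll(s)"
      return 0
    fi
    attempt=$(( attempt + 1 ))
    sleep "$sleep_seconds"
  done
  echo "lag still non-zero after $max_attempts poll(s); do not cut over" >&2
  return 1
}
```

A cutover script would call it as `wait_for_zero_lag read_dms_lag 30 2 || exit 1` (with `read_dms_lag` being your own lag reader) and only proceed to promoting the target when it returns success.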

Migration Failure Modes at Scale

These are the patterns that derail real enterprise migration programmes:

Dependency sprawl discovered mid-wave. An app you labelled "standalone" turns out to call seventeen other services via hardcoded IPs. Discovery tooling solves this — 30+ days of network flow data before the first wave starts, not after.
License compliance breaking rehost. Some software licenses are tied to the physical host MAC address or specific CPU socket counts. Moving the OS image to EC2 violates the license. Always run a license audit before rehosting commercial software.
Rollback plans that were never tested. Every cutover must have a tested rollback path. "We will just restore the backup" is not a rollback plan — it is a recovery plan with unknown RTO. If the rollback has never been executed in a drill, it will fail under pressure.
Wave scope creep. Letting a wave grow from 10 servers to 40 servers to "just a few more" eliminates the controlled blast radius. Hard-cap wave size and use the leftover servers as the seed of the next wave.
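
The hard cap on wave size described above can be enforced mechanically rather than by negotiation. This sketch reads server IDs from standard input and rolls any overflow into the next batch; `plan_waves` and the `wave-N` labels are illustrative names, not an MGN feature.

```shell
# Read server IDs (one per line) on stdin and assign them to waves of at most
# CAP servers. Overflow rolls into the next wave instead of growing this one.
# Usage: plan_waves CAP < server-ids.txt
plan_waves() {
  local cap="$1"
  # n counts the servers seen so far; blank lines are skipped by the NF guard.
  awk -v cap="$cap" 'NF { printf "wave-%d %s\n", int(n / cap) + 1, $1; n++ }'
}

printf 's-1\ns-2\ns-3\ns-4\ns-5\n' | plan_waves 2
```

With a cap of 2, the five servers land in three waves, and the fifth server becomes the seed of the next wave rather than an exception added to the second.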

Migration readiness score: AWS offers the Migration Readiness Assessment (MRA), a structured interview against the six perspectives of the AWS Cloud Adoption Framework (business, people, governance, platform, security, operations). Running the MRA before Wave 1 surfaces organizational blockers that tooling cannot fix. Programme managers who skip it tend to hit the same blockers in month 6 that the MRA would have surfaced in month 1.