
Cloud cost: a checklist before your next AWS bill surprise

A concrete audit checklist to cut AWS spend without re-architecting. Works through the top 12 cost categories in order of impact.

Anthra AI Team · Engineering Team · 7 min read
Table of contents
  • Before you start
  • 1. EC2 rightsizing (biggest ROI usually)
  • 2. Savings Plans and Reserved Instances
  • 3. Forgotten resources
  • 4. S3 storage classes
  • 5. RDS and Aurora efficiency
  • 6. EKS/Kubernetes cost controls
  • 7. NAT gateway and data transfer traps
  • 8. Logging and observability spend
  • 9. Environment governance
  • 10. Commitment strategy review cadence
  • 11. Cost anomaly detection and alerts
  • 12. Build a 90-day optimization roadmap
  • A practical prioritization formula
  • Closing

Most AWS bills have 30-50% fat. Not because teams are careless, but because cloud costs grow incrementally — a new service here, a size-up there, a forgotten dev environment — and nobody ever does the full audit.

This is the checklist we work through on infrastructure engagements. It's ordered by impact: items near the top typically yield more savings than items near the bottom.

Before you start

Pull these three things from your AWS account:

  1. Cost Explorer, grouped by service, last 90 days.
  2. Cost and Usage Reports (CUR) into Athena or QuickSight — more granular than Cost Explorer.
  3. Trusted Advisor (if you have Business/Enterprise Support) for baseline rightsizing recommendations.

Now work through the items below in order.

1. EC2 rightsizing (biggest ROI usually)

Pull CloudWatch metrics for every EC2 instance for the last 30 days. Look at:

  • CPU utilization p95 — if it's consistently under 40%, the instance is oversized
  • Memory utilization p95 (requires the CloudWatch agent) — same threshold
  • Network throughput — if it's under 10% of the instance's max, networking isn't the constraint
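These thresholds are easy to turn into a screening script over exported CloudWatch data. A minimal sketch — the input shape is a hypothetical export format, not an AWS API response, and the 40% cutoff mirrors the checklist above:

```python
# Screen instances for rightsizing candidates using p95 utilization figures
# exported from CloudWatch. Field names ("cpu_p95", "mem_p95") are assumptions.

def rightsizing_candidates(instances, cpu_max=40.0):
    """Return ids of instances whose p95 CPU and memory both sit under cpu_max percent."""
    flagged = []
    for inst in instances:
        cpu_ok = inst["cpu_p95"] < cpu_max
        # No CloudWatch agent means no memory data: assume busy, don't flag.
        mem_ok = inst.get("mem_p95", 100.0) < cpu_max
        if cpu_ok and mem_ok:
            flagged.append(inst["id"])
    return flagged

fleet = [
    {"id": "i-app-1", "cpu_p95": 15.0, "mem_p95": 30.0},  # downsize candidate
    {"id": "i-db-1",  "cpu_p95": 72.0, "mem_p95": 81.0},  # leave alone
    {"id": "i-job-1", "cpu_p95": 22.0},                   # no memory agent installed
]
print(rightsizing_candidates(fleet))  # ['i-app-1']
```

Feeding it a 30-day export of your whole fleet gives you the shortlist for step 1 below.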

The 3-step rightsizing process

  1. Downsize where possible. If an m5.2xlarge averages 15% CPU and 30% memory, drop to m5.xlarge. If it still averages 30% CPU, drop to m5.large. Be aggressive. It's easier to size back up than to notice over-provisioning.

  2. Switch generations. Graviton-based instances (m6g, m7g, c7g) are typically 20% cheaper than x86 equivalents with similar or better performance. If your code runs on Linux ARM64, migrate. Most modern stacks (Java, Python, Go, Node) work on Graviton with zero code changes.

  3. Consider Spot for stateless workloads. CI/CD runners, batch jobs, stateless API servers with proper graceful-shutdown handling can run on Spot for 70-90% discount.

⚠️Beware auto-scaling groups that never scale down

Many ASGs are configured to scale up aggressively and down conservatively (or never). Check your scale-in policies. Often the simplest cost win is a properly-tuned target-tracking scaling policy.
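A target-tracking policy that permits scale-in is a small piece of config. Below is the payload you'd pass as `--target-tracking-configuration` to `aws autoscaling put-scaling-policy`; the 50% CPU target is an illustrative value to tune per workload:

```json
{
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ASGAverageCPUUtilization"
  },
  "TargetValue": 50.0,
  "DisableScaleIn": false
}
```

`DisableScaleIn: false` is the part many ASGs get wrong — with it set to `true`, the group only ever grows.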

2. Savings Plans and Reserved Instances

If you have stable, predictable compute usage and you're paying on-demand, you're leaving 40-60% on the table.

The commitment ladder

  • Compute Savings Plans (1-year, no upfront) — most flexible, ~30% discount. Safe for most workloads.
  • Compute Savings Plans (3-year, all upfront) — biggest discount (~55%), commit only what you're certain will persist.
  • EC2 Instance Savings Plans — lock to specific instance family, larger discount than Compute Savings Plans but less flexible.
  • Reserved Instances — more rigid, more discount. Mostly superseded by Savings Plans.

The strategy

  1. Start with Compute Savings Plans (1-year, no upfront) covering 60-70% of your baseline compute.
  2. Leave the top 30-40% on-demand to handle variance.
  3. Review coverage and utilization monthly. Adjust next purchase accordingly.
  4. Don't commit more than 80% of predicted usage — waste is worse than no discount.

Use AWS Cost Explorer's Savings Plans recommendations as a starting point, but verify — recommendations can over-commit.
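Before committing, it's worth sanity-checking the arithmetic behind steps 1-4 yourself. A sketch — the discount and baseline figures are illustrative, not AWS quotes:

```python
# Estimate blended hourly cost for a partial Savings Plans commitment.
# All numbers are illustrative; pull real rates from AWS pricing.

def blended_cost(baseline_hourly, coverage, discount):
    """baseline_hourly: on-demand $/hr; coverage: fraction committed; discount: SP discount."""
    committed = baseline_hourly * coverage
    return committed * (1 - discount) + baseline_hourly * (1 - coverage)

on_demand = 100.0  # $/hr on-demand baseline
cost = blended_cost(on_demand, coverage=0.65, discount=0.30)
savings_pct = (on_demand - cost) / on_demand * 100
print(f"${cost:.2f}/hr, {savings_pct:.1f}% saved")  # $80.50/hr, 19.5% saved
```

Run it with a few coverage values to see why the marginal benefit of pushing past ~80% coverage rarely justifies the risk of stranded commitment.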

3. Forgotten resources

The easiest wins are things nobody's using but still paying for.

Run through this list:

  • Unattached EBS volumes. Filter for volumes in the available state. Delete after snapshotting if you're paranoid.
  • Old EBS snapshots. Anything over 90 days old is probably forgotten. Keep your backup retention policy in mind.
  • Idle RDS instances. CPU < 5% for 30 days. Probably a dev/test leftover. Stop or snapshot-and-delete.
  • Idle load balancers. ELBs with zero traffic for 30+ days. ~$20-40/month each.
  • Unused Elastic IPs. Unassociated EIPs cost $3.60/month each. Usually cheap individually, but we've seen accounts with 200+.
  • Old AMIs and ECR images. Set lifecycle policies (keep last 10 tagged, delete untagged > 30 days).
  • CloudWatch log groups without retention. Default is infinite retention. Set to 30-90 days unless you have compliance reasons otherwise.
💡Script it

Write a weekly Lambda that reports forgotten resources to Slack. Makes cleanup a habit, not a project.
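The core of that Lambda is just a filter over resource listings. A minimal sketch on sample data — the field names loosely mimic `aws ec2 describe-volumes`/`describe-snapshots` output, and the real function would call the API and post findings to Slack:

```python
from datetime import datetime, timedelta, timezone

def forgotten_resources(volumes, snapshots, now, max_snapshot_age_days=90):
    """Flag unattached EBS volumes and snapshots older than the cutoff."""
    findings = []
    for v in volumes:
        if v["State"] == "available":  # not attached to any instance
            findings.append(f"unattached volume {v['VolumeId']}")
    cutoff = now - timedelta(days=max_snapshot_age_days)
    for s in snapshots:
        if s["StartTime"] < cutoff:
            findings.append(f"stale snapshot {s['SnapshotId']}")
    return findings

now = datetime(2026, 2, 1, tzinfo=timezone.utc)
volumes = [{"VolumeId": "vol-1", "State": "available"},
           {"VolumeId": "vol-2", "State": "in-use"}]
snapshots = [{"SnapshotId": "snap-1", "StartTime": datetime(2025, 9, 1, tzinfo=timezone.utc)},
             {"SnapshotId": "snap-2", "StartTime": datetime(2026, 1, 20, tzinfo=timezone.utc)}]
print(forgotten_resources(volumes, snapshots, now))
```

Extend the same pattern to idle load balancers, unassociated EIPs, and the rest of the list.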

4. S3 storage classes

S3 is one of the most-wasted services. Data lands in Standard and stays forever at full price.

Lifecycle rules to set today

  • Move logs and historical exports to STANDARD_IA after 30 days.
  • Move archival data to GLACIER_IR or DEEP_ARCHIVE based on retrieval needs.
  • Expire temporary artifacts aggressively (7-30 days).
  • Keep explicit exceptions for legal/compliance retention sets.

Start with one bucket category at a time (app logs, data lake raw, backups) and validate retrieval behavior before broad rollout.
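A lifecycle configuration implementing the first three rules, in the shape accepted by `aws s3api put-bucket-lifecycle-configuration`. The prefixes and day counts are examples to adapt:

```json
{
  "Rules": [
    {
      "ID": "logs-to-colder-tiers",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 180, "StorageClass": "GLACIER_IR" }
      ]
    },
    {
      "ID": "expire-tmp-artifacts",
      "Filter": { "Prefix": "tmp/" },
      "Status": "Enabled",
      "Expiration": { "Days": 14 }
    }
  ]
}
```

Carve out compliance-retention prefixes with their own rules rather than exceptions in your head.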

5. RDS and Aurora efficiency

Databases are usually the second-largest line item after compute.

Audit:

  • CPU, memory, and IOPS utilization over 30 days.
  • Replica utilization and lag behavior.
  • Storage allocation vs. actual data footprint.
  • Expensive long-running queries causing oversized instances.

Common wins:

  • Rightsize writer/reader nodes independently.
  • Remove stale replicas created for one-off events.
  • Tune autovacuum and indexing before scaling hardware.
  • Use Aurora Serverless v2 only where burst patterns justify it.

6. EKS/Kubernetes cost controls

Kubernetes cost sprawl is easy because waste hides at multiple layers.

Checklist:

  • Enforce CPU/memory requests and limits on all workloads.
  • Remove zombie namespaces and stale preview environments.
  • Run the cluster autoscaler with sane scale-down policies.
  • Adopt Karpenter or equivalent for better bin-packing.
  • Use Spot node groups for resilient stateless workloads.

If your pods have no requests, you are flying blind on both reliability and cost.
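Requests and limits are set per container. A minimal Deployment snippet — the names and values are placeholders to tune per workload:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels: { app: api }
  template:
    metadata:
      labels: { app: api }
    spec:
      containers:
        - name: api
          image: example/api:1.0   # placeholder image
          resources:
            requests:              # what the scheduler bin-packs on
              cpu: 250m
              memory: 256Mi
            limits:                # hard ceiling; exceeding memory => OOMKill
              memory: 512Mi
```

Many teams deliberately omit a CPU limit (leaving only the request) to avoid throttling; memory limits are the non-negotiable part.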

7. NAT gateway and data transfer traps

Many teams underestimate transfer and egress.

Review:

  • NAT gateway cost by AZ and service path.
  • Cross-AZ traffic patterns from chatty services.
  • Unnecessary internet egress for internal service calls.
  • CloudFront opportunities for cacheable content/API responses.

Architectural fixes here often produce disproportionately high savings.

8. Logging and observability spend

CloudWatch and third-party observability bills grow silently.

Do this:

  • Set log retention by class (dev, staging, prod).
  • Sample high-volume logs once debugging is complete.
  • Avoid shipping duplicate logs to multiple sinks.
  • Reserve high-cardinality metrics for truly actionable use cases.

You need visibility. You do not need infinite raw telemetry.
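Retention-by-class is one lookup table away from being automated. A sketch — the day counts and the `/env/service` naming convention are assumptions; in a real script each group would get a boto3 `put_retention_policy(logGroupName=..., retentionInDays=...)` call:

```python
# Map environment class to a CloudWatch Logs retention period (days).
# Day counts are illustrative defaults, not recommendations for regulated data.
RETENTION_DAYS = {"dev": 14, "staging": 30, "prod": 90}

def retention_for(log_group_name, default=30):
    """Infer the class from an assumed '/env/service' naming convention."""
    env = log_group_name.strip("/").split("/")[0]
    return RETENTION_DAYS.get(env, default)

print(retention_for("/prod/api"))     # 90
print(retention_for("/dev/worker"))   # 14
```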

9. Environment governance

Cost waste is frequently organizational, not technical.

Set baseline governance:

  • Required tags: owner, team, env, cost_center, expiry_date.
  • Automatic shutdown policies for non-prod during off-hours.
  • Monthly review per team with top spend deltas.
  • Explicit owner for every major service line item.

No owner means no optimization.
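The required-tag rule is easy to enforce mechanically. A sketch that reports missing tags per resource — the tag keys come from the list above, the inventory shape is an assumption:

```python
REQUIRED_TAGS = {"owner", "team", "env", "cost_center", "expiry_date"}

def missing_tags(resources):
    """Map resource id -> sorted list of required tags it lacks."""
    report = {}
    for res in resources:
        missing = REQUIRED_TAGS - set(res.get("tags", {}))
        if missing:
            report[res["id"]] = sorted(missing)
    return report

inventory = [
    {"id": "i-web", "tags": {"owner": "ana", "team": "core", "env": "prod",
                             "cost_center": "42", "expiry_date": "none"}},
    {"id": "vol-tmp", "tags": {"env": "dev"}},  # untagged leftover
]
print(missing_tags(inventory))
```

Wire the same check into CI for your IaC, or run it nightly against a resource inventory, and untagged spend stops accumulating.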

10. Commitment strategy review cadence

Savings Plans are not "set and forget."

Run a monthly FinOps review:

  • Commitment coverage (% of baseline on discounted pricing).
  • Commitment utilization (how much committed spend is actually used).
  • Instance family drift and seasonal trends.
  • Upcoming workload changes that alter commitment assumptions.

Bad commitments create hidden waste just as surely as on-demand usage.
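The two headline metrics are simple ratios, worth computing from your own CUR data rather than trusting a dashboard blindly. The figures below are illustrative:

```python
def coverage(discounted_spend, total_compute_spend):
    """Share of compute spend on Savings Plans / RI pricing."""
    return discounted_spend / total_compute_spend

def utilization(used_commitment, total_commitment):
    """Share of committed spend actually consumed."""
    return used_commitment / total_commitment

# Illustrative month: $70k of $100k compute on discounted pricing,
# $68k consumed of a $70k commitment.
print(f"coverage    {coverage(70_000, 100_000):.0%}")    # 70%
print(f"utilization {utilization(68_000, 70_000):.1%}")  # 97.1%
```

Low utilization means you over-committed (money burned); low coverage means you under-committed (discount left on the table). Track both every month.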

11. Cost anomaly detection and alerts

You need early warning before month-end surprises.

Minimum setup:

  • Daily spend anomaly alerts by service and account.
  • Budget thresholds (50%, 80%, 100%) routed to owners.
  • Sudden egress or NAT spikes flagged separately.
  • Deployment-to-cost correlation in change logs.

Fast feedback loops are what make cost discipline sustainable.
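As a baseline before (or alongside) AWS Cost Anomaly Detection, a trailing-window z-score over daily spend catches the big spikes. The threshold and window here are assumptions to tune:

```python
from statistics import mean, stdev

def is_anomalous(daily_spend, today, z_threshold=3.0, min_days=7):
    """Flag today's spend if it sits z_threshold std devs above the trailing mean."""
    if len(daily_spend) < min_days:
        return False  # not enough history to judge
    mu, sigma = mean(daily_spend), stdev(daily_spend)
    if sigma == 0:
        return today > mu  # perfectly flat history: any increase is notable
    return (today - mu) / sigma > z_threshold

history = [820, 815, 840, 830, 818, 825, 835]  # last 7 days of spend, $
print(is_anomalous(history, today=1450))  # True — investigate before month-end
print(is_anomalous(history, today=845))   # False — normal variance
```

Run it per service and per account, not just on the total: a NAT spike can hide inside a flat overall bill.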

12. Build a 90-day optimization roadmap

Treat this as a delivery program, not a one-time cleanup.

Break actions into:

  • Weeks 1-2: quick cleanup and rightsizing.
  • Weeks 3-6: deeper service-level tuning.
  • Weeks 7-12: architectural changes with larger ROI.

Each item should have:

  • owner
  • expected savings range
  • implementation risk
  • verification method

A practical prioritization formula

Use this to rank actions:

Priority score = (estimated monthly savings x confidence) / implementation effort

This prevents teams from spending weeks on low-impact optimizations while large obvious savings remain untouched.
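The same formula as code, applied to a few made-up actions to show how it ranks them:

```python
def priority(monthly_savings, confidence, effort_days):
    """Priority score = (estimated monthly savings x confidence) / implementation effort."""
    return monthly_savings * confidence / effort_days

# Hypothetical roadmap items: ($/month saved, confidence 0-1, effort in days).
actions = [
    ("delete unattached EBS volumes", priority(400, 0.95, 0.5)),
    ("Graviton migration",            priority(6000, 0.70, 15)),
    ("rewrite service X",             priority(2000, 0.40, 30)),
]
for name, score in sorted(actions, key=lambda a: -a[1]):
    print(f"{score:8.1f}  {name}")
```

Note how the tiny cleanup task outranks the big migration: near-certain savings at trivial effort beat larger but slower, riskier wins.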

Closing

Most AWS cost reduction comes from disciplined execution of boring fundamentals. Do the high-leverage checks consistently, automate the recurring controls, and reserve architecture changes for where they create real ROI.
