AWS API

All responses from the AWS API (whether you use the CLI or an SDK) should be treated as eventually consistent. A recent change might not appear in the result yet. Two clients making the same request at the same time can get different responses. There is no read-your-writes guarantee.
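One way to cope with the missing read-your-writes guarantee is to poll until the change becomes visible. The following is a minimal sketch, not an official AWS helper: the function name wait_until_visible and the read callback are assumptions made for illustration, and the timeout values are arbitrary.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_until_visible(
    read: Callable[[], Optional[T]],
    timeout_seconds: float = 60.0,
    poll_interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Poll `read` until it returns something other than None.

    `read` is a hypothetical callback that wraps a describe/get call and
    returns None as long as the recent change is not visible yet.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = read()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("change did not become visible in time")
        sleep(poll_interval_seconds)
```

The sleep and clock parameters exist only so the loop can be tested without waiting. In a real script, read would wrap the SDK call that looks up the resource you just created.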

Retry a failed call to the AWS API if the error is retriable (for example, throttling or a temporary server-side error).

All APIs are rate limited. When retrying, apply exponential backoff with a random component (jitter).
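Exponential backoff with a random component could look like the sketch below. It is an illustration under assumptions: RetriableError is a hypothetical marker exception (a real script would map throttling errors from the SDK onto it), and the delay values are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetriableError(Exception):
    """Hypothetical marker for errors worth retrying (e.g., throttling)."""


def call_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay_seconds: float = 0.5,
    max_delay_seconds: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on RetriableError with jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return operation()
        except RetriableError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # The upper bound doubles with every attempt, up to a maximum.
            cap = min(max_delay_seconds, base_delay_seconds * 2 ** attempt)
            # Random component: sleep anywhere between 0 and the bound so
            # that many clients do not retry at the same moment.
            sleep(random.uniform(0, cap))
    raise AssertionError("unreachable")
```

The AWS SDKs already retry many errors on their own, so a wrapper like this is mostly relevant for logic the SDK does not cover, such as a multi-step script.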

CloudTrail can be used to debug failed requests.

Exercise: If you write a script to copy snapshots from one AWS account to another, what are your assumptions?

AWS Resources

Most AWS resources send metrics to CloudWatch. You have to create CloudWatch Alarms to monitor those metrics. Pro tip: We offer a product to configure your AWS monitoring and manage your incidents using Slack.
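As an illustration, the sketch below builds the parameters for a CloudWatch Alarm on the CPU utilization of an EC2 instance. The function name, the alarm name pattern, and the threshold, period, and evaluation values are assumptions chosen for the example, not recommendations.

```python
from typing import Any, Dict


def cpu_alarm_params(
    instance_id: str,
    sns_topic_arn: str,
    threshold_percent: float = 80.0,
) -> Dict[str, Any]:
    """Build keyword arguments for CloudWatch's put_metric_alarm call.

    The alarm fires when the average CPU utilization stays above the
    threshold for three consecutive five-minute periods (assumed values).
    """
    return {
        "AlarmName": f"{instance_id}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 3,
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notified when the alarm fires
    }


# With boto3 installed and credentials configured, the alarm could be
# created like this (not executed here):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**cpu_alarm_params(...))
```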

An Infrastructure as Code (IaC) tool (e.g., CloudFormation) is used to create/update/delete resources and rolls back on error. A deployment pipeline (e.g., CodePipeline) invokes the IaC tool.

Exercise: Pick three AWS resources with finite resources (CPU, Memory, Disk, …) and check if you monitor them with CloudWatch Alarms.

Economics

Labor is expensive. Take it into account when comparing the costs of AWS services. For example, running a database on EC2 seems cheaper than RDS, but how many hours of labor does the EC2 solution require?
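The comparison can be made concrete with a small calculation. All numbers below are made up for illustration; plug in your own bill, hours, and hourly rate.

```python
def monthly_total_cost(
    infrastructure_usd: float,
    labor_hours: float,
    hourly_rate_usd: float,
) -> float:
    """Monthly cost of a solution including the labor to operate it."""
    return infrastructure_usd + labor_hours * hourly_rate_usd


# Hypothetical numbers: the self-managed database has the lower AWS bill
# but needs more hours of maintenance per month.
self_managed = monthly_total_cost(infrastructure_usd=200, labor_hours=10, hourly_rate_usd=100)
managed = monthly_total_cost(infrastructure_usd=400, labor_hours=1, hourly_rate_usd=100)

print(self_managed)  # 1200
print(managed)       # 500
```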

Managed services from AWS are a good choice.

There are many ways to solve a problem with AWS. Know what you optimize for and design accordingly.

Exercise: Pick an infrastructure service that your team operates and calculate how many hours/month you work to maintain the solution.

Network

The smallest unit to reason about is the Elastic Network Interface (ENI). Internal traffic in AWS is received on or sent from an ENI. An EC2 instance comes with at least one ENI, and so does an RDS instance, an ElastiCache node, and so on.

If a security group or network ACL (NACL) blocks a packet, VPC Flow Logs record the rejection (with a delay). Issues with route tables are not visible in Flow Logs.
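To find blocked packets, you can filter flow log records by their action field. The sketch below assumes the default flow log record format (version 2) with space-separated fields; the function names are hypothetical, and custom log formats would need a different field list.

```python
from typing import Dict, Iterable, List

# Field order of the default (version 2) flow log record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_record(line: str) -> Dict[str, str]:
    """Split one default-format flow log record into named fields."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def rejected_records(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return the records of packets blocked by a security group or NACL."""
    records = (parse_record(line) for line in lines if line.strip())
    return [record for record in records if record["action"] == "REJECT"]


# Sample records for illustration only.
sample = [
    "2 123456789010 eni-1235b8ca123456789 172.31.16.139 172.31.16.21 20641 22 6 20 4249 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 172.31.9.69 172.31.9.12 49761 3389 6 20 4249 1418530010 1418530070 REJECT OK",
]
for record in rejected_records(sample):
    print(record["interface_id"], record["srcaddr"], "->", record["dstaddr"], record["dstport"])
```

Each record names the ENI (interface_id) that saw the traffic, which fits the ENI as the unit to reason about.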

Security groups provide enough security to control network traffic. NACLs are not needed most of the time.

Rules for traffic inside a VPC reference security groups (not IP addresses) as source or destination.
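A rule that references a security group instead of an IP range could be built as sketched below. The function name and the security group IDs are hypothetical; the dictionary follows the IpPermissions structure that the EC2 API accepts for ingress rules.

```python
from typing import Any, Dict


def ingress_from_security_group(
    source_group_id: str,
    port: int,
    protocol: str = "tcp",
) -> Dict[str, Any]:
    """Build an ingress rule that allows traffic from another security group.

    The source is identified by its security group ID, so the rule keeps
    working when the IP addresses of the clients change.
    """
    return {
        "IpProtocol": protocol,
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_group_id}],
    }


# Example: allow the (hypothetical) application security group to reach a
# database on port 5432. With boto3 this could be applied as follows
# (not executed here):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0123456789abcdef0",  # hypothetical database group
#       IpPermissions=[ingress_from_security_group("sg-0fedcba9876543210", 5432)],
#   )
rule = ingress_from_security_group("sg-0fedcba9876543210", 5432)
```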