Challenge accepted!

In this first part of the challenge, we’ll determine:
- how we should encode data transmitted by 1M devices
- how we should store it
- roughly the amount of compute resources required
- where to host this system
## Napkin time

### Network bandwidth
Let’s look at some back-of-the-envelope calculations. Given our specs:
- 1M devices generating 10 data points/s ⇒ 10M data points/s entering the pipeline

Depending on the method we use to encode the data, we are looking at anywhere from ~10 bytes/sec/device, i.e. ~10 MB/sec aggregate with state-of-the-art time series compression [link to paper about state of the art time series compression and adjust multiplier], up to a naive 44 bytes/sec/device, i.e. ~44 MB/sec aggregate. The sketch below runs the numbers.
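Here is that arithmetic as a few lines of runnable Python; the bytes-per-point figures are assumed encodings for illustration, not measurements:

```python
# Napkin math: aggregate ingest bandwidth for the fleet.
DEVICES = 1_000_000
POINTS_PER_SEC = 10  # data points per device per second

# Assumed bytes per data point under two encodings (not measurements):
# ~1 byte/point with state-of-the-art compression, ~4.4 bytes/point naive.
ENCODINGS = {"compressed": 1.0, "naive": 4.4}

for name, bytes_per_point in ENCODINGS.items():
    per_device = bytes_per_point * POINTS_PER_SEC    # B/s per device
    aggregate_mb = DEVICES * per_device / 1_000_000  # MB/s fleet-wide
    print(f"{name:>10}: {per_device:.0f} B/s/device -> ~{aggregate_mb:.0f} MB/s aggregate")
```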
### Storage requirements
Assuming a 10x compression factor over the streamed data once it lands at rest, the naive ~44 MB/sec stream works out to roughly 380 GB/day, or about 11.4 TB/month (see the sketch below).
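A quick check of those figures, under the same assumptions (naive stream as baseline, 10x compression at rest):

```python
# Napkin math: storage at rest, assuming a 10x compression factor
# over the streamed data (baseline: the naive ~44 MB/s stream).
STREAM_MB_PER_SEC = 44
COMPRESSION_FACTOR = 10
SECONDS_PER_DAY = 86_400

gb_per_day = STREAM_MB_PER_SEC * SECONDS_PER_DAY / COMPRESSION_FACTOR / 1_000
print(f"~{gb_per_day:.0f} GB/day at rest")         # ~380 GB/day
print(f"~{gb_per_day * 30 / 1_000:.1f} TB/month")  # ~11.4 TB/month
```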
### Compute requirements

## Selecting a cloud provider
Let’s see what we can obtain in terms of compute, storage and networking with a budget of $1000 / month:
[Chart: compute, storage and networking obtainable for $1000/month on AWS, Azure, DO, GCP and Hetzner]

*Cloud benchmark written in Pulumi*
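To give a flavour of the benchmark, here is a minimal Pulumi sketch that provisions a single Hetzner Cloud server in Python; the server type, image and location are illustrative assumptions, not the benchmark's actual configuration:

```python
import pulumi
import pulumi_hcloud as hcloud

# Minimal sketch: provision one Hetzner Cloud server.
# server_type, image and location are illustrative assumptions.
server = hcloud.Server(
    "ingest-node",
    server_type="cx31",    # 2 vCPUs / 8 GB RAM shared instance
    image="ubuntu-22.04",
    location="nbg1",       # Nuremberg
)

pulumi.export("ipv4", server.ipv4_address)
```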
### Disclaimer
You’ll notice that the major cloud providers are not topping the list, but we are purposely looking at the solution space with a rather narrow lens. We are asking for:
- a small number of resources,
- at the lowest possible cost,
- with no need to auto-scale the system with traffic (IoT pipelines, at least on the data ingestion side, tend to have very predictable usage patterns),
- without fancy networking services like global load balancers.
### Settled
Cost constraints are a major part of this challenge, so we’ll proceed with [Hetzner Cloud](link to Hetzner).
I’d be thrilled if someone achieves the same on a major cloud provider; if you do, please consider contributing to this challenge. Thank you!
In [part 2](link to post), we’ll dive into architecture and various trade-offs.