Challenge accepted!

In this first part of the challenge, we’ll determine:

  • how we should encode data transmitted by 1M devices
  • how we should store it
  • roughly the amount of compute resources required
  • where to host this system

Napkin time

Network bandwidth

Let’s look at some back-of-the-envelope calculations. Given our specs:

  • 1M devices generating 10 data points / s ⇒ 10M data points / s in aggregate

Depending on the method we use to encode data, we are looking at anywhere from ~10 bytes/sec/device (≈10 MB/sec in aggregate) [link to paper about state of the art time series compression and adjust multiplier] up to a naive 44 bytes/sec/device (≈44 MB/sec in aggregate).
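Here is a minimal sketch of that napkin math in TypeScript. The per-point sizes are assumptions chosen to match the figures above (≈1 byte/point with aggressive time series compression, 4.4 bytes/point naive), not measured values:

```typescript
// Back-of-the-envelope bandwidth estimate.
const DEVICES = 1_000_000;
const POINTS_PER_SEC = 10;

// Aggregate throughput in MB/s for a given encoded size per data point.
function aggregateMBps(bytesPerPoint: number): number {
  return (DEVICES * POINTS_PER_SEC * bytesPerPoint) / 1_000_000;
}

console.log(`compressed: ${aggregateMBps(1.0)} MB/s`); // ≈ 10 MB/s
console.log(`naive:      ${aggregateMBps(4.4)} MB/s`); // ≈ 44 MB/s
```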

Storage requirements

Assuming a 10x compression factor over the (naive) streamed data, the 44 MB/sec stream shrinks to roughly 4.4 MB/sec on disk, i.e. about 380 GB/day.
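A quick sketch of that storage math, with the 10x factor carried as an assumption:

```typescript
// Storage estimate under an assumed 10x compression factor over the stream.
const STREAM_MBPS = 44;        // naive aggregate stream from above
const COMPRESSION_FACTOR = 10; // assumption, not a measured value
const SECONDS_PER_DAY = 86_400;

const diskMBps = STREAM_MBPS / COMPRESSION_FACTOR;     // 4.4 MB/s
const gbPerDay = (diskMBps * SECONDS_PER_DAY) / 1_000; // ≈ 380 GB/day
const tbPerMonth = (gbPerDay * 30) / 1_000;            // ≈ 11.4 TB/month

console.log({ diskMBps, gbPerDay, tbPerMonth });
```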

Compute requirements

Selecting a cloud provider

Let’s see what we can obtain in terms of compute, storage and networking with a budget of $1000 / month:

[Chart: compute, storage and networking obtainable for $1000 / month across AWS, Azure, DO, GCP and Hetzner]

Cloud benchmark written in Pulumi
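The benchmark provisions a comparable instance on each provider. As a flavor of what that looks like, here is a minimal sketch of the Hetzner leg using the `@pulumi/hcloud` provider; the server type, image and location are illustrative choices, not the challenge’s actual configuration:

```typescript
import * as hcloud from "@pulumi/hcloud";

// Provision a single benchmark node on Hetzner Cloud.
// serverType/image/location are placeholders; pick whatever
// fits inside the $1000 / month budget being compared.
const server = new hcloud.Server("bench-node", {
  serverType: "cx22",
  image: "ubuntu-24.04",
  location: "nbg1",
});

// Export the public IP so the benchmark harness can reach the node.
export const ip = server.ipv4Address;
```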

Disclaimer

You’ll notice that the major cloud providers are not topping the list, but we are purposely looking at the solution space with a rather narrow lens. We are asking for:

  • a small number of resources,
  • at the lowest possible cost,
  • with no need to auto-scale the system based on traffic (IoT pipelines, at least on the data ingestion side, tend to have very predictable usage patterns),
  • without fancy networking services like global load balancers.

Settled

Cost constraints are a major part of this challenge, so we’ll proceed with [Hetzner Cloud](link to Hetzner).

I’d be thrilled if someone achieves the same on a major cloud provider; if you do, please consider contributing to this challenge. Thank you!

In [part 2](link to post), we’ll dive into architecture and various trade-offs.