Data pipeline - summary

https://itnext.io/big-data-pipeline-recipe-c416c1782908

  • OLTP vs OLAP
  • Data warehouse vs data lake vs data lakehouse
  • Hadoop vs No Hadoop
  • Batch vs streaming
  • ETL vs ELT

Evolution of modern data pipeline

https://www.youtube.com/watch?v=zNwN-zm4qPE

  • None
  • Batch
  • Realtime
  • …
  • Automation
  • Decentralized

ETL/ELT → graphs and data products

https://jack-vanlightly.com/blog/2024/11/26/dismantling-elt-the-case-for-graphs-not-silos

Backend at Netflix

https://medium.com/swlh/a-design-analysis-of-cloud-based-microservices-architecture-at-netflix-98836b2da45f

Data lakehouse

AWS Lake Formation

https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc

Databricks Lakehouse

https://docs.airbyte.com/integrations/destinations/databricks/

Data lakes - Intro (AWS flavor)

https://www.upsolver.com/blog/intro-to-aws-data-lakes-components-architecture

Data lakes - Best practices

https://www.upsolver.com/blog/best-practices-high-performance-data-lake

  1. Make several copies of the data
  2. Set a retention policy
  3. Understand the data you’re bringing in
  4. Partition your data
  5. Use readable file formats
  6. Merge small files
  7. Data governance and access control
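
Points 4 and 6 above (partitioning and merging small files) can be illustrated with a minimal sketch. It assumes records carry an ISO timestamp in a field named "ts" and are stored as JSON lines on a local disk; the function names, the "dt=" directory layout, and the file names are hypothetical choices for illustration, not taken from the linked article.

```python
import json
from collections import defaultdict
from pathlib import Path


def partition_key(record):
    # Hive-style partition directory derived from the event date
    # (assumes an ISO timestamp in a hypothetical "ts" field).
    return f"dt={record['ts'][:10]}"


def write_partitioned(records, root, batch_name):
    """Write one JSON-lines file per partition for a batch of records."""
    groups = defaultdict(list)
    for record in records:
        groups[partition_key(record)].append(record)
    for key, rows in groups.items():
        part_dir = Path(root) / key
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / f"{batch_name}.jsonl", "w") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")


def compact_partition(part_dir, merged_name="merged.jsonl"):
    """Merge all small batch files of one partition into a single file."""
    part_dir = Path(part_dir)
    small_files = sorted(
        p for p in part_dir.glob("*.jsonl") if p.name != merged_name
    )
    if not small_files:
        return
    # Append so that an earlier merged file is preserved.
    with open(part_dir / merged_name, "a") as out:
        for path in small_files:
            out.write(path.read_text())
    # Remove the small files only after the merged file is written.
    for path in small_files:
        path.unlink()
```

Each ingested batch lands in the directory for its date, so a query for one day reads only that directory; running compact_partition periodically keeps the file count per partition low. A real data lake would more likely use a columnar format and object storage, which this sketch leaves out.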

Modern data intensive architecture

https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/

Data warehouse vs OLAP RDBMS

https://statsbot.co/blog/modern-data-warehouse/

Data Vault - fantastic intro and comparison to Star Schema (Kimball) and Data Warehouse (Inmon)

https://www.youtube.com/watch?v=zNwN-zm4qPE

ELT and data pipelines

Airbyte - open-source connectors for ELT

https://airbyte.com/blog/data-integration

Meltano ELT

https://meltano.com/

  • based on a set of standard Singer.io taps (extractors) and targets (loaders)
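
A rough sketch of how a tap and a target talk to each other: in the Singer convention a tap writes SCHEMA, RECORD and STATE messages as JSON lines, and a target reads them. The example below keeps everything in memory instead of piping stdout to stdin; the "users" stream, its fields, and the "last_id" bookmark are made up for illustration, and the functions are not part of any Singer or Meltano library.

```python
import json


def tap(rows, stream="users"):
    """Yield Singer-style messages, one JSON document per line."""
    yield json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    })
    for row in rows:
        yield json.dumps({"type": "RECORD", "stream": stream, "record": row})
    # STATE lets a later run resume where this one stopped;
    # the bookmark shape ("last_id") is an assumption of this sketch.
    last_id = rows[-1]["id"] if rows else None
    yield json.dumps({"type": "STATE", "value": {"last_id": last_id}})


def target(lines):
    """Consume messages, collecting records per stream and the last state."""
    loaded = {}
    state = None
    for line in lines:
        message = json.loads(line)
        if message["type"] == "RECORD":
            loaded.setdefault(message["stream"], []).append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
        # SCHEMA messages are ignored here; a real target would use them
        # to create or validate the destination table.
    return loaded, state
```

Because both sides only agree on the message format, any tap can be combined with any target, which is what lets Meltano mix extractors and loaders freely.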

Simple reporting DB

https://tech.fretlink.com/build-your-own-data-lake-for-reporting-purposes/

  • plus simple dashboards with Metabase

See also