Data pipeline - summary
https://itnext.io/big-data-pipeline-recipe-c416c1782908
- OLTP vs OLAP
- Data warehouse vs data lake vs data lakehouse
- Hadoop vs No Hadoop
- Batch vs streaming
- ETL vs ELT
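A minimal sketch of the ETL vs ELT distinction, using an in-memory SQLite database as a stand-in for the warehouse. The table names, sample rows, and cleaning rules are hypothetical and chosen only for illustration. ETL cleans the rows in application code before loading; ELT loads the raw rows first and transforms them with SQL inside the database.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a source system.
RAW_ORDERS = [
    ("o1", "  Alice ", "12.50"),
    ("o2", "bob", "7.25"),
    ("o3", "ALICE", "3.00"),
]


def run_etl(conn):
    """ETL: clean the rows in application code, then load only the result."""
    conn.execute("CREATE TABLE orders_etl (id TEXT, customer TEXT, amount REAL)")
    cleaned = [
        (oid, name.strip().lower(), float(amount))
        for oid, name, amount in RAW_ORDERS
    ]
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load the rows untouched, then transform inside the database."""
    conn.execute("CREATE TABLE orders_raw (id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO orders_raw VALUES (?, ?, ?)", RAW_ORDERS)
    conn.execute(
        """
        CREATE TABLE orders_elt AS
        SELECT id,
               lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM orders_raw
        """
    )


def totals(conn, table):
    """Sum amounts per customer; `table` is one of our own fixed names."""
    rows = conn.execute(
        f"SELECT customer, SUM(amount) FROM {table} "
        "GROUP BY customer ORDER BY customer"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn)
run_elt(conn)
etl_totals = totals(conn, "orders_etl")
elt_totals = totals(conn, "orders_elt")
```

Both paths end with the same cleaned data. The practical difference is that the ELT path keeps `orders_raw` around, so the transformation can be changed and re-run later without re-extracting from the source.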
Evolution of modern data pipeline
https://www.youtube.com/watch?v=zNwN-zm4qPE
- None
- Batch
- Realtime
- …
- Automation
- Decentralized
ETL/ELT → graphs and data products
https://jack-vanlightly.com/blog/2024/11/26/dismantling-elt-the-case-for-graphs-not-silos
Backend at Netflix
Data lakehouse
AWS Lake Formation
Databricks Lakehouse
https://docs.airbyte.com/integrations/destinations/databricks/
Data lakes - Intro (AWS flavor)
https://www.upsolver.com/blog/intro-to-aws-data-lakes-components-architecture
Data lakes - Best practices
- Make several copies of the data
- Set a retention policy
- Understand the data you’re bringing in
- Partition your data
- Use readable file formats
- Merge small files
- Apply data governance and access control
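Two of the practices above, partitioning and merging small files, can be sketched with the standard library alone. This is an illustration under assumptions, not how any particular lake engine does it: the `date=YYYY-MM-DD` directory layout follows the common Hive-style convention, and the event fields, file names, and JSON Lines format are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_partitioned(events, root):
    """Write each event to its own small file under root/date=YYYY-MM-DD/."""
    for i, event in enumerate(events):
        # The first 10 characters of an ISO timestamp are the date.
        part_dir = Path(root) / f"date={event['ts'][:10]}"
        part_dir.mkdir(parents=True, exist_ok=True)
        (part_dir / f"part-{i:05d}.jsonl").write_text(json.dumps(event) + "\n")


def compact_partition(part_dir):
    """Merge all small part files in one partition into a single file."""
    part_dir = Path(part_dir)
    small_files = sorted(part_dir.glob("part-*.jsonl"))
    merged = part_dir / "compacted.jsonl"
    with merged.open("w") as out:
        for small in small_files:
            out.write(small.read_text())
    # Remove the small files only after the merged file is fully written.
    for small in small_files:
        small.unlink()
    return merged


# Hypothetical events spread over two days.
events = [
    {"ts": "2024-01-01T09:00:00", "user": "a"},
    {"ts": "2024-01-01T17:30:00", "user": "b"},
    {"ts": "2024-01-02T08:15:00", "user": "a"},
]
root = Path(tempfile.mkdtemp())
write_partitioned(events, root)
for part_dir in sorted(root.iterdir()):
    compact_partition(part_dir)
```

A query that filters on date only needs to read the matching directory, and after compaction each partition holds one larger file instead of many tiny ones.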
Modern data intensive architecture
https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/
Data warehouse vs OLAP RDBMS
https://statsbot.co/blog/modern-data-warehouse/
Data Vault - fantastic intro and comparison to Star Schema (Kimball) and Data Warehouse (Inmon)
https://www.youtube.com/watch?v=zNwN-zm4qPE
ELT and data pipelines
Airbyte - Open-source connectors for ELT
https://airbyte.com/blog/data-integration
Meltano ELT
- based on a set of standard Singer.io taps (extractors) and targets (loaders)
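A rough sketch of what a Singer tap emits: newline-delimited JSON messages of type SCHEMA, RECORD, and STATE, which a target reads from the tap's output. The `users` stream, its fields, and the `last_id` bookmark layout are hypothetical; real taps are usually built on a Singer library or SDK rather than hand-written like this.

```python
import io
import json
import sys


def emit(message, out):
    """Write one Singer message as a single line of JSON."""
    out.write(json.dumps(message) + "\n")


def run_tap(rows, out=sys.stdout):
    """Emit a SCHEMA message, one RECORD per row, then a STATE bookmark."""
    emit(
        {
            "type": "SCHEMA",
            "stream": "users",  # hypothetical stream name
            "schema": {
                "type": "object",
                "properties": {
                    "id": {"type": "integer"},
                    "name": {"type": "string"},
                },
            },
            "key_properties": ["id"],
        },
        out,
    )
    for row in rows:
        emit({"type": "RECORD", "stream": "users", "record": row}, out)
    if rows:
        # The state value is free-form; this bookmark layout is an assumption.
        emit({"type": "STATE", "value": {"users": {"last_id": rows[-1]["id"]}}}, out)


# Capture the output in memory instead of piping it to a target.
buffer = io.StringIO()
run_tap([{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}], out=buffer)
messages = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because taps and targets only agree on this message stream, any tap can in principle be piped into any target, which is what Meltano builds on.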
Simple reporting DB
https://tech.fretlink.com/build-your-own-data-lake-for-reporting-purposes/
- plus simple dashboards with Metabase