Overview of file formats
https://davidgomes.com/understanding-parquet-iceberg-and-data-lakehouses-at-broad/
- Avro
- Parquet
- Arrow
- CSV
https://www.vladsiv.com/big-data-file-formats/?ref=davidgomes.com
- Avro, Parquet, ORC
Tools that work natively with Parquet
- Presto/Trino
- Spark
- DuckDB
- Hive
- Dremio
- Impala
- AWS Athena
- Apache Drill
Detect PII and data schema
https://github.com/capitalone/DataProfiler
- supports CSV, JSON, Avro, Parquet
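A rough sketch of what column profiling means in practice. This is not DataProfiler's API: it uses only the Python standard library, and `infer_type`, `profile_csv`, and the email regex are hypothetical names made up for illustration. Real PII detection covers far more than email addresses.

```python
import csv
import io
import re

# Hypothetical pattern for illustration only; real PII detection needs much more.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def infer_type(values):
    """Return the narrowest type ('int', 'float', 'string') that fits every value."""
    if not values:
        return "string"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Map each column to its inferred type and whether it looks like email PII."""
    rows = list(csv.DictReader(io.StringIO(text)))
    report = {}
    for column in (rows[0].keys() if rows else []):
        # Ignore empty cells so one missing value does not force 'string'.
        values = [row[column] for row in rows if row[column] != ""]
        report[column] = {
            "type": infer_type(values),
            "looks_like_email": bool(values)
            and all(EMAIL_RE.match(v) for v in values),
        }
    return report


SAMPLE = "id,price,contact\n1,9.5,ann@example.com\n2,12,bob@example.com\n"
REPORT = profile_csv(SAMPLE)
```

For the sample above, `REPORT` marks `id` as int, `price` as float, and `contact` as a string column that looks like email.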
Optimize for analytics and BI
https://www.metabase.com/learn/data-diet/analytics/data-model-mistakes
Run SQL queries on CSV/TSV files
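These notes don't name a tool here, so as a stand-in: a minimal sketch that loads delimited text into an in-memory SQLite table using only the Python standard library, then queries it with SQL. `load_delimited`, the `orders` table, and the sample data are hypothetical.

```python
import csv
import io
import sqlite3


def load_delimited(conn, table, text, delimiter=","):
    """Load delimited text (header row first) into a new SQLite table."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


TSV = "city\tsales\nLisbon\t10\nPorto\t5\nLisbon\t7\n"
conn = sqlite3.connect(":memory:")
load_delimited(conn, "orders", TSV, delimiter="\t")

# Columns have no declared type, so values are stored as text;
# CAST makes the sum numeric.
TOTALS = conn.execute(
    "SELECT city, SUM(CAST(sales AS INTEGER)) FROM orders "
    "GROUP BY city ORDER BY city"
).fetchall()
```

The same function handles CSV with the default delimiter. For a file on disk, pass its contents as `text`.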
BI Visualization (Metabase)
- free when self-hosted
- large selection of sources
Plugin for ClickHouse
https://github.com/enqueue/metabase-clickhouse-driver
Future trends
https://motherduck.com/blog/big-data-is-dead/
- Data sizes follow a power law: very few customers have petabytes, a few have terabytes, and the majority have under 1 TB
- Storage requirements grow faster than compute
- Workload sizes are small compared to overall data sizes
- Big Data is getting smaller: as new instances grow larger, fewer workloads exceed what a single machine can handle
- Data is a liability