Overview of file formats

https://davidgomes.com/understanding-parquet-iceberg-and-data-lakehouses-at-broad/

  • Avro
  • Parquet
  • Arrow
  • CSV

https://www.vladsiv.com/big-data-file-formats/?ref=davidgomes.com

  • Avro, Parquet, ORC

Tools that work natively with Parquet

  • Presto/Trino
  • Spark
  • DuckDB
  • Hive
  • Dremio
  • Impala
  • AWS Athena
  • Apache Drill

Detect PII and data schema

https://github.com/capitalone/DataProfiler

  • supports CSV, JSON, Avro, Parquet

Optimize for analytics and BI

https://www.metabase.com/learn/data-diet/analytics/data-model-mistakes

Run SQL queries on CSV/TSV files

http://harelba.github.io/q/
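
The idea of running SQL over delimited text can be sketched with Python's standard library alone. This is not q's command line or its implementation, just a minimal illustration under stated assumptions: the helper name `query_csv`, the table name `t`, and the sample data are all hypothetical, the first row is assumed to be a header, and every value is loaded as text.

```python
import csv
import io
import sqlite3


def query_csv(csv_text, sql, table="t", delimiter=","):
    """Load delimited text into an in-memory SQLite table and run one query.

    Hypothetical helper for illustration: the first row is treated as the
    header, and all values are stored as text.
    """
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    header, data = rows[0], rows[1:]
    conn = sqlite3.connect(":memory:")
    try:
        # Quote column names so headers with spaces or keywords still work.
        columns = ", ".join('"{}"'.format(name.replace('"', '""')) for name in header)
        conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            'INSERT INTO "{}" VALUES ({})'.format(table, placeholders), data
        )
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# Made-up sample data, not taken from any of the linked articles.
SAMPLE = "name,size_gb\nevents,120\nusers,3\nlogs,870\n"

result = query_csv(
    SAMPLE,
    "SELECT name, CAST(size_gb AS INTEGER) FROM t "
    "WHERE CAST(size_gb AS INTEGER) > 100 "
    "ORDER BY CAST(size_gb AS INTEGER) DESC",
)
print(result)
```

For TSV input, pass `delimiter="\t"`. Because values are loaded as text, numeric comparisons need an explicit `CAST`, as in the query above.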

BI Visualization

https://www.metabase.com/

  • free when self-hosted
  • large selection of sources

Plugin for ClickHouse

https://github.com/enqueue/metabase-clickhouse-driver

Future trends

https://motherduck.com/blog/big-data-is-dead/

  • Data sizes follow a power law (very few customers have petabytes, few have terabytes, the majority have < 1 TB)
  • Storage requirements grow faster than compute
  • Workload sizes are small compared to overall data sizes
  • Big Data is getting smaller (relative to the size of the machine instances now available)
  • Data is a liability (keeping it carries ongoing cost and legal risk)