Introduction
https://aws.amazon.com/what-is/mlops/
https://www.hopsworks.ai/dictionary/machine-learning-infrastructure
Data ingestion / preparation / feature engineering infra
- data lakes
- feature stores
- data warehouses
- data management tools (e.g. DVC)
- workflow orchestration
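A core idea behind feature stores in the list above is point-in-time correct reads: serving the feature value as it was at a given timestamp, so training data does not leak future information. A minimal in-memory sketch (all class and method names hypothetical; real feature stores such as Hopsworks or Feast add offline/online stores, joins, and versioning on top of this idea):

```python
from dataclasses import dataclass, field

@dataclass
class FeatureStore:
    # (entity_id, feature_name) -> list of (timestamp, value)
    _features: dict = field(default_factory=dict)

    def write(self, entity_id, name, ts, value):
        self._features.setdefault((entity_id, name), []).append((ts, value))

    def read_as_of(self, entity_id, name, ts):
        """Return the latest value at or before `ts` (point-in-time read)."""
        latest = None
        for t, v in sorted(self._features.get((entity_id, name), [])):
            if t <= ts:
                latest = v
        return latest

store = FeatureStore()
store.write("user_1", "avg_spend", ts=1, value=10.0)
store.write("user_1", "avg_spend", ts=5, value=12.5)
print(store.read_as_of("user_1", "avg_spend", ts=3))  # -> 10.0, no leakage from ts=5
```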
Model training infra
- model registry
- specialized hardware (GPU, TPU)
- scalable computational resources
- frameworks (e.g. Spark) and tools (e.g. Slurm) for distributing training jobs
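The distributed-training frameworks above mostly implement synchronous data parallelism: each worker computes a gradient on its data shard, gradients are averaged (an "all-reduce"), and every worker applies the same update. A toy in-process simulation of that pattern (real systems like Horovod or torch.distributed run this across nodes over NCCL/MPI):

```python
# Toy model: y = w * x, squared-error loss, two workers each holding a shard.
def local_gradient(w, shard):
    # gradient of mean squared error on this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # stand-in for the cross-worker all-reduce collective
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.01):
    grads = [local_gradient(w, s) for s in shards]  # computed in parallel in reality
    return w - lr * all_reduce_mean(grads)

shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # true w = 2
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 2))  # -> 2.0
```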
Model deployment and serving infra
- tools to package, version, deploy and serve
- containerization and orchestration technologies (Docker, Kubernetes)
- serverless computing platforms for inference
- serving frameworks (e.g. TensorFlow Serving and TorchServe)
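Serving frameworks route each request to a named, versioned model artifact. A minimal sketch of that routing logic (class and model names illustrative; TorchServe and TensorFlow Serving implement the same idea with real artifacts behind HTTP/gRPC endpoints):

```python
class ModelServer:
    def __init__(self):
        self._models = {}  # (name, version) -> predict callable

    def deploy(self, name, version, predict_fn):
        self._models[(name, version)] = predict_fn

    def predict(self, name, payload, version=None):
        if version is None:  # default to the latest deployed version
            version = max(v for (n, v) in self._models if n == name)
        return self._models[(name, version)](payload)

server = ModelServer()
server.deploy("sentiment", 1, lambda t: "pos" if "good" in t else "neg")
server.deploy("sentiment", 2, lambda t: "pos" if "good" in t or "great" in t else "neg")
print(server.predict("sentiment", "a great movie"))             # routes to v2 -> "pos"
print(server.predict("sentiment", "a great movie", version=1))  # pinned to v1 -> "neg"
```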
Infra for monitoring
- integrates with logging frameworks, metrics tracking tools, and anomaly detection systems
- monitoring for model drift detection
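One common drift-detection metric is the Population Stability Index (PSI), which compares the binned distribution of a feature in production against the training distribution. A self-contained sketch (bin count and thresholds are conventional rules of thumb, not fixed standards):

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """PSI rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(data):
        counts = [0] * bins
        for x in data:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(data) + eps for c in counts]  # eps avoids log(0)

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train   = [i / 100 for i in range(100)]          # roughly uniform on [0, 1)
same    = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]    # mass moved to the upper half
print(psi(train, same) < 0.1)      # True: no drift
print(psi(train, shifted) > 0.25)  # True: drift detected
```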
Best practices for ML infrastructure
Modularity and flexibility
Modular design involves breaking down complex infrastructure systems into smaller, reusable components or modules. Each module serves a specific function and can be included, upgraded, or replaced without affecting the whole system. This approach promotes flexibility and agility, allowing developers to adapt their infrastructure to changing requirements and technologies.
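A minimal sketch of this modular principle: stages share one interface (a callable taking and returning data), so any stage can be swapped without touching the rest of the pipeline. Stage names here are illustrative:

```python
class Pipeline:
    def __init__(self, *stages):
        self.stages = list(stages)  # each stage: data -> data

    def run(self, data):
        for stage in self.stages:
            data = stage(data)
        return data

# Two interchangeable scaling modules with the same interface:
def minmax_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def identity(xs):
    return xs

dedupe = lambda xs: sorted(set(xs))
print(Pipeline(dedupe, minmax_scale).run([3, 1, 3, 2]))  # -> [0.0, 0.5, 1.0]
print(Pipeline(dedupe, identity).run([3, 1, 3, 2]))      # -> [1, 2, 3]
```

Swapping `minmax_scale` for `identity` (or any new module) changes one argument, not the pipeline.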
Automation
Automated model lifecycle management simplifies the deployment, monitoring, and optimization of ML models, reducing the burden on ML teams. Automation that streamlines the ML development phases also reduces manual intervention and minimizes human-introduced errors.
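The lifecycle-automation loop can be reduced to one sketch: a monitor checks a quality metric and triggers retraining and redeployment only when it degrades. Thresholds and hook names are hypothetical:

```python
def lifecycle_step(current_accuracy, threshold, retrain_fn, deploy_fn):
    """Retrain and redeploy only when accuracy falls below the threshold."""
    if current_accuracy < threshold:
        model = retrain_fn()
        deploy_fn(model)
        return "retrained"
    return "ok"

events = []
retrain = lambda: "model_v2"                      # stand-in for a training job
deploy = lambda m: events.append(f"deployed {m}")  # stand-in for a deploy hook

print(lifecycle_step(0.93, threshold=0.9, retrain_fn=retrain, deploy_fn=deploy))  # -> "ok"
print(lifecycle_step(0.85, threshold=0.9, retrain_fn=retrain, deploy_fn=deploy))  # -> "retrained"
print(events)  # -> ['deployed model_v2']
```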
Public model registries
On-prem GPUs
Consumer-grade for best value
https://lambdalabs.com/blog/best-gpu-2022-sofar
ML infrastructure
https://imbue.com/research/70b-infrastructure/
- Setup of bare metal cluster (4096 GPUs)
- InfiniBand and NVLink
- Ubuntu MAAS
- Monitoring with Prometheus
- Failure modes
- Health checks
- Performance diagnostics (Py-Spy, Torch profiler, Nvidia Nsight)
- S3 caching in the cluster
- Kraken - Distributed P2P Docker registry
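For the Prometheus monitoring mentioned above, scrape targets expose metrics in Prometheus's text exposition format. A hand-rolled sketch of that format (metric names illustrative; real setups use the official prometheus_client library rather than formatting strings by hand):

```python
def render_metrics(metrics):
    """Render {name: (help, type, [(labels_dict, value), ...])} as Prometheus text."""
    lines = []
    for name, (help_text, mtype, samples) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {mtype}")
        for labels, value in samples:
            label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
            lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

metrics = {
    "gpu_ecc_errors_total": (
        "Count of GPU ECC errors observed.",
        "counter",
        [({"node": "gpu-001", "gpu": "0"}, 3)],
    ),
}
print(render_metrics(metrics))
```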
Cloud GPUs
Table comparing all providers
https://www.paperspace.com/gpu-cloud-comparison
Paperspace
Lambda Labs
https://lambdalabs.com/service/gpu-cloud
Jarvis Labs
- Supports spot instances with auto-saved storage
Vast
Modal
Latitude.sh
https://www.latitude.sh/pricing
Replicate
Together.ai
Deepinfra
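Spot/preemptible instances like those noted above (e.g. Jarvis Labs with auto-saved storage) only make sense if training can resume after the instance is reclaimed. A checkpoint/resume sketch using stdlib pickle (the fake loss update stands in for a real training step):

```python
import os
import pickle
import tempfile

def train(total_steps, ckpt_path):
    state = {"step": 0, "loss": None}
    if os.path.exists(ckpt_path):        # resume after preemption
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss"] = 1.0 / state["step"]  # stand-in for a real training step
        with open(ckpt_path, "wb") as f:     # checkpoint after every step
            pickle.dump(state, f)
    return state

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
train(3, ckpt)          # instance "preempted" after 3 steps
final = train(5, ckpt)  # new instance resumes at step 3, finishes at 5
print(final["step"])    # -> 5
```

The auto-saved storage such providers offer is what keeps `ckpt_path` alive across instance restarts.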
Tools
https://research.aimultiple.com/mlops-tools/
Data prep / exploration
- ML libraries in Docker
Visualization / programmable dashboards
- Dash
- Streamlit
- Voilà
- …
Custom ML platforms
https://www.aporia.com/building-an-ml-platform-from-scratch/
- AWS
- DVC
- MLFlow
- FastAPI
- Aporia