Introduction

https://aws.amazon.com/what-is/mlops/

https://www.hopsworks.ai/dictionary/machine-learning-infrastructure

Data ingestion / preparation / feature engineering infra

  • data lakes
  • feature stores
  • data warehouses
  • data management tools (e.g. DVC)
  • workflow orchestration
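
The feature-store idea in the list above can be sketched as a toy in-memory lookup keyed by entity and feature name. Everything here is hypothetical and illustrative only; real feature stores (e.g. Feast, Hopsworks) add offline/online stores, point-in-time joins, and versioning:

```python
from datetime import datetime, timezone

class InMemoryFeatureStore:
    """Toy feature store: maps (entity_id, feature_name) to the latest value.
    Illustrative sketch only, not a real library API."""

    def __init__(self):
        self._data = {}

    def put(self, entity_id, feature_name, value):
        # Store the latest value together with an ingestion timestamp.
        self._data[(entity_id, feature_name)] = (value, datetime.now(timezone.utc))

    def get_features(self, entity_id, feature_names):
        # Assemble a feature vector for one entity; None for missing features.
        return {
            name: self._data.get((entity_id, name), (None, None))[0]
            for name in feature_names
        }

store = InMemoryFeatureStore()
store.put("user_42", "avg_session_len", 12.5)
store.put("user_42", "purchases_30d", 3)
vector = store.get_features("user_42", ["avg_session_len", "purchases_30d", "churn_score"])
```

The point of the abstraction is that training and serving read the same feature vector through one interface, instead of re-deriving features in two code paths.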

Model training infra

  • model registry
  • specialized hardware (GPU, TPU)
  • scalable computational resources
  • frameworks (e.g. Spark) and tools (e.g. Slurm) for distributing training jobs
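
The model-registry bullet above can be sketched as a tiny version-tracking wrapper. This is purely illustrative (names are made up); real registries such as MLflow's also store artifacts, stage transitions, and lineage:

```python
class ModelRegistry:
    """Toy model registry: auto-incrementing versions per model name."""

    def __init__(self):
        self._models = {}  # name -> list of (version, model_obj, metadata)

    def register(self, name, model, **metadata):
        versions = self._models.setdefault(name, [])
        version = len(versions) + 1
        versions.append((version, model, metadata))
        return version

    def latest(self, name):
        # Return (version, model, metadata) for the newest registered version.
        return self._models[name][-1]

registry = ModelRegistry()
registry.register("churn-model", {"weights": [0.1, 0.9]}, accuracy=0.87)
v = registry.register("churn-model", {"weights": [0.2, 0.8]}, accuracy=0.91)
```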

Model deployment and serving infra

  • tools to package, version, deploy and serve
  • containerization and orchestration technologies (Docker, Kubernetes)
  • serverless computing platforms for inference
  • serving frameworks (e.g. TensorFlow Serving and TorchServe)
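
The "package and serve" bullet can be illustrated with a stdlib-only toy inference endpoint. The model and names here are placeholders; a real deployment would use one of the serving frameworks listed and run behind a containerized, load-balanced setup:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Placeholder "model": a fixed linear score, illustrative only.
    return 0.5 * features["x1"] + 0.25 * features["x2"]

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for the demo

# Bind to an ephemeral port and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Client side: POST a feature vector, get a score back.
req = urllib.request.Request(
    f"http://127.0.0.1:{port}",
    data=json.dumps({"x1": 2.0, "x2": 4.0}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

The same request/response contract is what serving frameworks standardize, alongside batching, versioned endpoints, and hardware acceleration.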

Infra for monitoring

  • integrates with logging frameworks, metrics tracking tools, and anomaly detection systems
  • monitoring for model drift detection
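
One common drift-detection signal is the Population Stability Index (PSI) between a baseline feature distribution and the live one. A minimal stdlib sketch (the 0.1 / 0.25 cut-offs are a widely used but informal rule of thumb):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp with eps so empty bins don't produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [100, 200, 300, 200, 100]
# Identical distribution -> PSI of 0 (no drift).
no_drift = psi(baseline, [100, 200, 300, 200, 100])
# Mass shifted toward the low bins -> clearly elevated PSI.
drifted = psi(baseline, [300, 250, 200, 100, 50])
```

In practice the same computation runs on a schedule against recent serving traffic, with alerts wired into the metrics/anomaly-detection stack mentioned above.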

Best practices for ML infrastructure

Modularity and flexibility

Modular design involves breaking down complex infrastructure systems into smaller, reusable components or modules. Each module serves a specific function and can be included, upgraded, or replaced without affecting the whole system. This approach promotes flexibility and agility, allowing developers to adapt their infrastructure to changing requirements and technologies.
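
The paragraph above can be made concrete with a small sketch: pipeline steps that share one interface, so each can be added, reordered, or replaced independently. All class and field names below are invented for illustration:

```python
from typing import Protocol

class FeatureTransformer(Protocol):
    """Interface every pipeline module must satisfy."""
    def transform(self, rows: list) -> list: ...

class DropNulls:
    """One self-contained module: filters incomplete rows."""
    def transform(self, rows):
        return [r for r in rows if all(v is not None for v in r.values())]

class ScaleAmounts:
    """Another module: scales a numeric field. Swappable independently."""
    def transform(self, rows):
        return [{**r, "amount": r["amount"] / 100} for r in rows]

def run_pipeline(rows, steps):
    # Upgrading one step never requires touching the others.
    for step in steps:
        rows = step.transform(rows)
    return rows

data = [{"amount": 250, "user": "a"}, {"amount": 100, "user": None}]
out = run_pipeline(data, [DropNulls(), ScaleAmounts()])
```

Swapping `ScaleAmounts` for a different normalizer changes one list entry, not the pipeline, which is the flexibility the paragraph describes.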

Automation

Automated model lifecycle management simplifies the deployment, monitoring, and optimization of ML models, reducing the burden on ML teams. Automating the ML development phases also reduces manual intervention and minimizes human-introduced errors.
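
As a toy illustration of the lifecycle automation described above, a policy that turns monitoring metrics into actions (all names and the 0.8 threshold are hypothetical; a real system would wire this into a scheduler or CI/CD pipeline):

```python
def automated_lifecycle(metric_history, threshold=0.8):
    """Toy automation policy: map each monitoring window's accuracy to an
    action, instead of relying on a human to notice degradation."""
    actions = []
    for step, accuracy in enumerate(metric_history):
        if accuracy < threshold:
            actions.append((step, "retrain"))  # quality degraded -> retrain
        else:
            actions.append((step, "keep"))     # model still healthy
    return actions

# Accuracy drifting downward over four monitoring windows.
actions = automated_lifecycle([0.91, 0.88, 0.79, 0.75])
```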

Public model registries

On-prem GPUs

Consumer-grade GPUs for best value

https://lambdalabs.com/blog/best-gpu-2022-sofar

ML infrastructure

https://imbue.com/research/70b-infrastructure/

  • Setup of bare metal cluster (4096 GPUs)
  • InfiniBand and NVLink
  • Ubuntu MAAS
  • Monitoring with Prometheus
  • Failure modes
  • Health checks
  • Performance diagnostics (Py-Spy, Torch profiler, Nvidia Nsight)
  • S3 caching in the cluster
  • Kraken - Distributed P2P Docker registry

Cloud GPUs

Table comparing all providers

https://www.paperspace.com/gpu-cloud-comparison

Paperspace

Lambda Labs

https://lambdalabs.com/service/gpu-cloud

Jarvis Labs

https://jarvislabs.ai/

  • Supports spot instances with auto-saved storage

Vast

https://vast.ai/

https://modal.com/

Latitude.sh

https://www.latitude.sh/pricing

Replicate

https://replicate.com/

Together.ai

https://www.together.ai/

Deepinfra

https://deepinfra.com/

Tools

https://research.aimultiple.com/mlops-tools/

Data prep / exploration

https://hands-on.cloud/how-to-run-jupiter-keras-tensorflow-pandas-sklearn-and-matplotlib-in-docker-container/

  • ML libraries in Docker

Visualization / programmable dashboards

https://medium.datadriveninvestor.com/streamlit-vs-dash-vs-voil%C3%A0-vs-panel-battle-of-the-python-dashboarding-giants-177c40b9ea57

  • Dash
  • Streamlit
  • Voilà
  • …

Custom ML platforms

https://www.aporia.com/building-an-ml-platform-from-scratch/

  • AWS
  • DVC
  • MLflow
  • FastAPI
  • Aporia

See also