Introduction

https://aws.amazon.com/what-is/mlops/

https://www.hopsworks.ai/dictionary/machine-learning-infrastructure

Data ingestion / preparation / feature engineering infra

  • data lakes
  • feature stores
  • data warehouses
  • data management tools (e.g. DVC)
  • workflow orchestration
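
The feature-store idea in the list above can be sketched as a toy in-memory lookup keyed by entity and feature name. Everything here is hypothetical and illustrative only; real feature stores (e.g. Feast, Hopsworks) add offline/online stores, point-in-time joins, and versioning:

```python
from datetime import datetime, timezone

class InMemoryFeatureStore:
    """Toy feature store: maps (entity_id, feature_name) to the latest value.
    Illustrative sketch only, not a real library API."""

    def __init__(self):
        self._data = {}

    def put(self, entity_id, feature_name, value):
        # Store the latest value together with an ingestion timestamp.
        self._data[(entity_id, feature_name)] = (value, datetime.now(timezone.utc))

    def get_features(self, entity_id, feature_names):
        # Assemble a feature vector for one entity; None for missing features.
        return {
            name: self._data.get((entity_id, name), (None, None))[0]
            for name in feature_names
        }

store = InMemoryFeatureStore()
store.put("user_42", "avg_session_len", 12.5)
store.put("user_42", "purchases_30d", 3)
vector = store.get_features("user_42", ["avg_session_len", "purchases_30d", "churn_score"])
```

The point of the abstraction is that training and serving read the same feature vector through one interface, instead of re-deriving features in two code paths.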

Model training infra

  • model registry
  • specialized hardware (GPU, TPU)
  • scalable computational resources
  • frameworks (e.g. Spark) and tools (e.g. Slurm) for distributing training jobs
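
The model-registry bullet above can be sketched as a tiny version-tracking wrapper. This is purely illustrative (names are made up); real registries such as MLflow's also store artifacts, stage transitions, and lineage:

```python
class ModelRegistry:
    """Toy model registry: auto-incrementing versions per model name."""

    def __init__(self):
        self._models = {}  # name -> list of (version, model_obj, metadata)

    def register(self, name, model, **metadata):
        versions = self._models.setdefault(name, [])
        version = len(versions) + 1
        versions.append((version, model, metadata))
        return version

    def latest(self, name):
        # Return (version, model, metadata) for the newest registered version.
        return self._models[name][-1]

registry = ModelRegistry()
registry.register("churn-model", {"weights": [0.1, 0.9]}, accuracy=0.87)
v = registry.register("churn-model", {"weights": [0.2, 0.8]}, accuracy=0.91)
```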

Model deployment and serving infra

  • tools to package, version, deploy and serve
  • containerization and orchestration technologies (Docker, Kubernetes)
  • serverless computing platforms for inference
  • serving frameworks (e.g. TensorFlow Serving and TorchServe)
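
The "package and serve" bullet can be illustrated with a stdlib-only toy inference endpoint. The model and names here are placeholders; a real deployment would use one of the serving frameworks listed and run behind a containerized, load-balanced setup:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    # Placeholder "model": a fixed linear score, illustrative only.
    return 0.5 * features["x1"] + 0.25 * features["x2"]

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        features = json.loads(self.rfile.read(length))
        body = json.dumps({"score": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging for the demo

# Bind to an ephemeral port and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), InferenceHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# Client side: POST a feature vector, get a score back.
req = urllib.request.Request(
    f"http://127.0.0.1:{port}",
    data=json.dumps({"x1": 2.0, "x2": 4.0}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

The same request/response contract is what serving frameworks standardize, alongside batching, versioned endpoints, and hardware acceleration.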

Infra for monitoring

  • integrates with logging frameworks, metrics tracking tools, and anomaly detection systems
  • monitoring for model drift detection
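
One common drift-detection signal is the Population Stability Index (PSI) between a baseline feature distribution and the live one. A minimal stdlib sketch (the 0.1 / 0.25 cut-offs are a widely used but informal rule of thumb):

```python
import math

def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp with eps so empty bins don't produce log(0).
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

baseline = [100, 200, 300, 200, 100]
# Identical distribution -> PSI of 0 (no drift).
no_drift = psi(baseline, [100, 200, 300, 200, 100])
# Mass shifted toward the low bins -> clearly elevated PSI.
drifted = psi(baseline, [300, 250, 200, 100, 50])
```

In practice the same computation runs on a schedule against recent serving traffic, with alerts wired into the metrics/anomaly-detection stack mentioned above.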

Best practices for ML infrastructure

Modularity and flexibility

Modular design involves breaking down complex infrastructure systems into smaller, reusable components or modules. Each module serves a specific function and can be included, upgraded, or replaced without affecting the whole system. This approach promotes flexibility and agility, allowing developers to adapt their infrastructure to changing requirements and technologies.
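
The paragraph above can be made concrete with a small sketch: pipeline steps that share one interface, so each can be added, reordered, or replaced independently. All class and field names below are invented for illustration:

```python
from typing import Protocol

class FeatureTransformer(Protocol):
    """Interface every pipeline module must satisfy."""
    def transform(self, rows: list) -> list: ...

class DropNulls:
    """One self-contained module: filters incomplete rows."""
    def transform(self, rows):
        return [r for r in rows if all(v is not None for v in r.values())]

class ScaleAmounts:
    """Another module: scales a numeric field. Swappable independently."""
    def transform(self, rows):
        return [{**r, "amount": r["amount"] / 100} for r in rows]

def run_pipeline(rows, steps):
    # Upgrading one step never requires touching the others.
    for step in steps:
        rows = step.transform(rows)
    return rows

data = [{"amount": 250, "user": "a"}, {"amount": 100, "user": None}]
out = run_pipeline(data, [DropNulls(), ScaleAmounts()])
```

Swapping `ScaleAmounts` for a different normalizer changes one list entry, not the pipeline, which is the flexibility the paragraph describes.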

Automation

Automated model lifecycle management simplifies the deployment, monitoring, and optimization of ML models, reducing the burden on ML teams. Automating the ML development phases also reduces manual intervention and minimizes human-introduced errors.
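
As a toy illustration of the lifecycle automation described above, a policy that turns monitoring metrics into actions (all names and the 0.8 threshold are hypothetical; a real system would wire this into a scheduler or CI/CD pipeline):

```python
def automated_lifecycle(metric_history, threshold=0.8):
    """Toy automation policy: map each monitoring window's accuracy to an
    action, instead of relying on a human to notice degradation."""
    actions = []
    for step, accuracy in enumerate(metric_history):
        if accuracy < threshold:
            actions.append((step, "retrain"))  # quality degraded -> retrain
        else:
            actions.append((step, "keep"))     # model still healthy
    return actions

# Accuracy drifting downward over four monitoring windows.
actions = automated_lifecycle([0.91, 0.88, 0.79, 0.75])
```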

Public model registries

On-prem GPUs

Consumer-grade GPUs for best value

https://lambdalabs.com/blog/best-gpu-2022-sofar

ML infrastructure

https://imbue.com/research/70b-infrastructure/

  • Setup of bare metal cluster (4096 GPUs)
  • InfiniBand and NVLink
  • Ubuntu MAAS
  • Monitoring with Prometheus
  • Failure modes
  • Health checks
  • Performance diagnostics (Py-Spy, Torch profiler, Nvidia Nsight)
  • S3 caching in the cluster
  • Kraken - Distributed P2P Docker registry

Cloud GPUs

Table comparing all providers

https://www.paperspace.com/gpu-cloud-comparison

Paperspace

Lambda Labs

https://lambdalabs.com/service/gpu-cloud

Jarvis Labs

https://jarvislabs.ai/

  • Supports spot instances with auto-saved storage

Vast

https://vast.ai/

https://modal.com/

Latitude.sh

https://www.latitude.sh/pricing

Replicate

https://replicate.com/

Together.ai

https://www.together.ai/

Deepinfra

https://deepinfra.com/

Tools

https://research.aimultiple.com/mlops-tools/

Data prep / exploration

https://hands-on.cloud/how-to-run-jupiter-keras-tensorflow-pandas-sklearn-and-matplotlib-in-docker-container/

  • ML libraries in Docker

Visualization / programmable dashboards

https://medium.datadriveninvestor.com/streamlit-vs-dash-vs-voil%C3%A0-vs-panel-battle-of-the-python-dashboarding-giants-177c40b9ea57

  • Dash
  • Streamlit
  • Voilà
  • …

Custom ML platforms

https://www.aporia.com/building-an-ml-platform-from-scratch/

  • AWS
  • DVC
  • MLflow
  • FastAPI
  • Aporia

See also