ML Infrastructure

ML infrastructure is the collection of systems and tools that support the full ML lifecycle: data storage, compute, training, serving, and tooling. A well-designed ML platform reduces friction for practitioners and increases reliability.

Components of ML Infrastructure

Storage layer
  ├── Data lake (S3, GCS, Azure Blob)
  ├── Data warehouse (BigQuery, Snowflake, Redshift)
  ├── Feature store (Feast, Tecton)
  └── Model/artifact store (MLflow, W&B)

Compute layer
  ├── Training cluster (GPU/TPU nodes)
  ├── Serving cluster (Kubernetes + GPUs)
  └── Batch processing (Spark, Dataflow)

Orchestration layer
  ├── Pipeline orchestrator (Airflow, Kubeflow, Metaflow)
  └── Experiment scheduler (SLURM, Ray)

Developer tooling
  ├── Notebooks (JupyterHub, SageMaker Studio)
  ├── Experiment tracking (MLflow, W&B)
  └── CI/CD (GitHub Actions, Jenkins)

Cloud ML Platforms

Amazon SageMaker: end-to-end managed ML platform on AWS. Includes SageMaker Studio (notebooks), SageMaker Training (managed training jobs), SageMaker Pipelines (pipeline orchestration), SageMaker Model Registry, and SageMaker Endpoints (serving).

Google Vertex AI: managed training, hyperparameter tuning, pipelines (built on Kubeflow Pipelines), model registry, and endpoints. Tight integration with BigQuery and GCS.

Azure ML: end-to-end ML platform on Azure. Workspaces, datasets, environments, pipelines, model registry, endpoints.

Databricks: Unified data + AI platform. MLflow integration, distributed Spark compute, Delta Lake, Unity Catalog.

Storage Architecture

Data lake: raw, unprocessed data in object storage. Cheap, with no schema enforced on write. Used for archiving and for feeding the feature pipeline.

Data warehouse: structured, query-optimized columnar storage. SQL access; used for analytics and offline feature computation.

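Lakehouse (Delta Lake, Apache Iceberg): adds a transactional table layer on top of data lake files, combining data lake flexibility with warehouse-style ACID transactions. Supports UPSERT, DELETE, and time travel over open-format files (typically Parquet) in object storage.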

Object storage (S3/GCS): stores model checkpoints, training datasets, artifacts. Virtually unlimited scale; cheap cold storage.

Compute Management

Instance types for ML:

Task                  Preferred instance   Notes
--------------------  -------------------  -------------------------------
Large model training  A100 80GB, H100      NVLink interconnect, multi-node
Fine-tuning           A100 40GB, A10G      Single node often sufficient
Online serving        A10G, T4, CPU        Cost-efficient for low latency
Batch inference       A100 80GB            Maximize throughput
Development           T4, A10G             Cheap interactive GPU

Auto-scaling groups: scale training workers dynamically based on queue depth. Reduces idle costs.
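
The queue-depth rule can be sketched as a small sizing function. This is a minimal illustration, not any cloud provider's API: the function name `desired_workers` and the parameters `jobs_per_worker`, `min_workers`, and `max_workers` are assumptions introduced here.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int,
                    min_workers: int = 0, max_workers: int = 16) -> int:
    """Return the worker count needed for the queue, clamped to the group limits."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # Round up so a partially filled worker still gets provisioned.
    needed = math.ceil(queue_depth / jobs_per_worker)
    # Clamp to the configured floor and ceiling of the scaling group.
    return max(min_workers, min(max_workers, needed))
```

With `min_workers=0` the group scales to zero when the queue is empty, which is where the idle-cost saving comes from; a nonzero floor trades some idle cost for faster job start.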

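Spot/preemptible instances: typically 60-90% cheaper than on-demand, but the provider can reclaim them at short notice. Training must checkpoint regularly and resume from the latest checkpoint; use frameworks with built-in checkpointing support (PyTorch Lightning, DeepSpeed).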
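
The checkpoint-and-resume requirement can be illustrated with a framework-free sketch. It assumes a JSON checkpoint file and a stand-in "training step"; the function names and the `preempt_at` parameter (used only to simulate a reclaimed instance) are hypothetical. Real jobs would rely on the checkpointing built into their training framework.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file and rename, so a preemption mid-write cannot corrupt the checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename within the same directory


def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss_sum": 0.0}


def train(path: str, total_steps: int, checkpoint_every: int = 10, preempt_at=None) -> dict:
    """Run (or resume) training until total_steps, checkpointing periodically."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        if preempt_at is not None and state["step"] == preempt_at:
            raise RuntimeError("simulated spot preemption")
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # stand-in for a real training step
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)  # final checkpoint at completion
    return state
```

Work done since the last checkpoint is lost on preemption, so `checkpoint_every` is a trade-off between checkpoint I/O overhead and the amount of recomputation after a restart.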

ML Platform Abstractions

Managed notebooks: for example JupyterHub on Kubernetes, giving each user an isolated environment with GPU access. Practitioners use them to explore data and develop experiments.

Training launchers: abstraction over cluster scheduling. Submit a training job (Docker image + command + resource request); the launcher handles SLURM sbatch or Kubernetes job submission.
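
A launcher of this kind can be sketched as one job spec rendered for two backends. The names `TrainingJob`, `to_kubernetes_job`, and `to_sbatch_args` are hypothetical. The `nvidia.com/gpu` resource assumes the NVIDIA device plugin is installed on the cluster, and the SLURM rendering ignores the Docker image, because running containers under SLURM depends on site-specific plugins.

```python
import shlex
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingJob:
    """Backend-independent description of a training job (hypothetical spec)."""
    name: str
    image: str
    command: List[str]
    gpus: int = 1


def to_kubernetes_job(job: TrainingJob) -> dict:
    """Render the spec as a Kubernetes batch/v1 Job manifest (as a dict)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": job.name},
        "spec": {
            "backoffLimit": 0,  # do not retry automatically in this sketch
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "trainer",
                        "image": job.image,
                        "command": job.command,
                        # Assumes the NVIDIA device plugin exposes this resource.
                        "resources": {"limits": {"nvidia.com/gpu": job.gpus}},
                    }],
                }
            },
        },
    }


def to_sbatch_args(job: TrainingJob) -> List[str]:
    """Render the spec as an sbatch argument list (the image is not used here)."""
    return [
        "sbatch",
        f"--job-name={job.name}",
        f"--gres=gpu:{job.gpus}",
        f"--wrap={shlex.join(job.command)}",
    ]
```

Keeping the spec separate from the rendering means practitioners describe a job once, and the platform team can change or add backends without touching user code.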

Cluster autoscaler: scales Kubernetes node pools up when pending pods require GPU nodes, and scales idle nodes down. Cloud providers support this on their managed node pools (GKE node pools, EKS node groups).

Shared environments via Docker: practitioners build Docker images with their requirements; reproducible across local, CI, and cluster environments.

Resource Quotas and Cost Control

GPU quotas: limit GPU hours per team or project to control costs.
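
A quota check of this kind could look like the sketch below, run at job admission time. The class `GpuQuota` and its method `try_reserve` are hypothetical, and the sketch assumes usage is tracked in memory as GPU-hours (GPUs times wall-clock hours); a real platform would persist usage and reset it per billing period.

```python
from collections import defaultdict


class GpuQuota:
    """Track GPU-hour usage per team against a fixed limit (in-memory sketch)."""

    def __init__(self, limits_gpu_hours: dict):
        self.limits = dict(limits_gpu_hours)
        self.used = defaultdict(float)

    def try_reserve(self, team: str, gpus: int, hours: float) -> bool:
        """Reserve gpus * hours for the team; return False if it would exceed the quota."""
        requested = gpus * hours
        limit = self.limits.get(team, 0.0)  # teams without a quota get nothing
        if self.used[team] + requested > limit:
            return False
        self.used[team] += requested
        return True
```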

Spot instance strategy: run training on spot instances with checkpointing; run serving on on-demand instances to meet SLA guarantees.

Cost allocation tagging: tag cloud resources by team, project, and environment. Charge back costs to teams.
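
Chargeback from tags reduces to grouping costs by a tag value. The sketch below assumes a simplified billing record with `cost_usd` and `tags` fields; these field names and the `chargeback` function are illustrative and do not match any provider's billing export format.

```python
from collections import defaultdict


def chargeback(resources: list, tag_key: str = "team") -> dict:
    """Sum cost per value of tag_key; resources missing the tag go to 'untagged'."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["cost_usd"]
    return dict(totals)
```

Reporting the `untagged` bucket explicitly is a deliberate choice here: it makes gaps in tagging visible instead of silently dropping those costs.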

Idle resource cleanup: auto-terminate idle notebooks, auto-delete old artifact versions (with retention policy).
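
Both cleanup rules can be expressed as pure selection functions that a scheduled job would call. The record fields (`name`, `last_activity`, `id`, `created`), the function names, and the default thresholds (2 hours idle, keep the latest 3 versions, 90 days maximum age) are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def idle_notebooks(notebooks: list, now: datetime,
                   max_idle: timedelta = timedelta(hours=2)) -> list:
    """Return names of notebooks with no activity for longer than max_idle."""
    return [nb["name"] for nb in notebooks if now - nb["last_activity"] > max_idle]


def versions_to_delete(versions: list, now: datetime, keep_latest: int = 3,
                       max_age: timedelta = timedelta(days=90)) -> list:
    """Return ids of artifact versions that are both outside the newest
    keep_latest versions and older than max_age."""
    ordered = sorted(versions, key=lambda v: v["created"], reverse=True)
    # The newest keep_latest versions are always retained, whatever their age.
    candidates = ordered[keep_latest:]
    return [v["id"] for v in candidates if now - v["created"] > max_age]
```

Returning the list of candidates, instead of deleting inside the function, lets the job log or dry-run the selection before anything is removed.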

Security and Compliance

IAM (Identity and Access Management): restrict access to datasets and models by role, following least privilege. Data scientists can access training data and serving systems can access model artifacts, but neither is granted the other's access.
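
The role separation described here can be sketched as a deny-by-default permission table. The role names, resource names, and the `is_allowed` function are hypothetical; in practice the cloud provider's IAM policies enforce this, not application code.

```python
# Hypothetical role-to-permission table: each entry is (resource, action).
ROLE_PERMISSIONS = {
    "data-scientist": {("training-data", "read")},
    "serving-system": {("model-artifacts", "read")},
}


def is_allowed(role: str, resource: str, action: str) -> bool:
    """Deny by default: allow only (resource, action) pairs listed for the role."""
    return (resource, action) in ROLE_PERMISSIONS.get(role, frozenset())
```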

Data encryption: at rest (S3 SSE, GCS CMEK) and in transit (TLS). Required for PII and health data.

Audit logging: log all access to sensitive datasets, recording who accessed what and when. Typically needed to demonstrate GDPR and HIPAA compliance.
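
An audit record is easiest to search and retain when it is structured. The sketch below emits one JSON line per access using Python's standard `logging` module; the logger name `audit`, the function `audit_event`, and the record fields are assumptions, and shipping the log to tamper-resistant storage is left out.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

audit_logger = logging.getLogger("audit")


def audit_event(user: str, dataset: str, action: str,
                when: Optional[datetime] = None) -> str:
    """Build one JSON audit record, emit it on the audit logger, and return it."""
    when = when or datetime.now(timezone.utc)
    record = json.dumps({
        "timestamp": when.isoformat(),
        "user": user,
        "dataset": dataset,
        "action": action,
    }, sort_keys=True)
    audit_logger.info(record)
    return record
```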

Model access control: model registry permissions. Only authorized users can push to production.

Secrets management: keep credentials (API keys, database passwords) in a secrets manager (Vault, AWS Secrets Manager). Never store them in code or in container environment variables.
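
One common way to avoid both code and environment variables is to have the secrets manager's agent deliver each secret as a file mounted into the container, which the application reads at startup. The sketch below assumes that setup; the function `read_secret` and the default directory `/run/secrets` are illustrative, and the actual mount path depends on the secrets manager and how it is deployed.

```python
from pathlib import Path


def read_secret(name: str, secrets_dir: str = "/run/secrets") -> str:
    """Read a secret from a file mounted by the secrets manager (assumed layout:
    one file per secret, named after the secret)."""
    path = Path(secrets_dir) / name
    if not path.is_file():
        raise KeyError(f"secret {name!r} is not mounted under {secrets_dir}")
    # Strip the trailing newline that many tools append when writing the file.
    return path.read_text().strip()
```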