MLflow

We show how LaminDB can be integrated with MLflow to track the training process and associate datasets & parameters with models.

# !pip install 'lamindb[jupyter]' torchvision lightning mlflow
!lamin init --storage ./lamin-mlops
import lamindb as ln
import mlflow
import lightning

from torch import utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from autoencoder import LitAutoEncoder

ln.track()
 connected lamindb: anonymous/lamin-mlops
 created Transform('gzc8jHQCT8So0000'), started new Run('C2O99Zv7...') at 2025-04-15 16:32:53 UTC
 notebook imports: autoencoder lamindb==1.4.0 lightning==2.5.1 mlflow-skinny==2.21.3 mlflow==2.21.3 torch==2.6.0 torchvision==0.21.0

Define a model

We use a basic PyTorch Lightning autoencoder as an example model.

Code of LitAutoEncoder
Simple autoencoder model
import torch
import lightning

from torch import optim, nn


class LitAutoEncoder(lightning.LightningModule):
    def __init__(self, hidden_size: int, bottleneck_size: int) -> None:
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, bottleneck_size),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 28 * 28),
        )
        self.save_hyperparameters()

    def training_step(
        self, batch: tuple[torch.Tensor, torch.Tensor], batch_idx: int
    ) -> torch.Tensor:
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self) -> optim.Optimizer:
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
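
To sanity-check the architecture before training, you can run a quick forward pass through the encoder and decoder; a minimal sketch (the random batch below is made up purely for illustration):

import torch

from autoencoder import LitAutoEncoder  # already imported above

# same hyperparameters as MODEL_CONFIG below
model = LitAutoEncoder(hidden_size=32, bottleneck_size=16)

x = torch.randn(4, 28 * 28)   # fake batch of 4 flattened 28x28 images
z = model.encoder(x)          # -> shape (4, 16)
x_hat = model.decoder(z)      # -> shape (4, 784)
print(z.shape, x_hat.shape)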

Query & download the MNIST dataset

We saved the MNIST dataset in the curation notebook; it now shows up in the Artifact registry:

ln.Artifact.filter(kind="dataset").df()
id                    1
uid                   MpstrPat0WKPNb7V0000
key                   testdata/mnist
description           None
suffix
kind                  dataset
otype                 None
size                  54950048
hash                  amFx_vXqnUtJr0kmxxWK2Q
n_files               4
n_observations        None
_hash_type            md5-d
_key_is_virtual       True
_overwrite_versions   True
space_id              1
storage_id            1
schema_id             None
version               None
is_latest             True
run_id                1
created_at            2025-04-15 16:32:15.362000+00:00
created_by_id         1
_aux                  None
_branch_code          1

You can also find it on lamin.ai if you have connected your instance.

instance view
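
If you don't have the curation notebook at hand, the dataset can be registered in a few lines; a minimal sketch (the local ./testdata/mnist download path is an assumption):

from torchvision.datasets import MNIST

# download the raw MNIST files locally, then register the folder as a dataset Artifact
MNIST("./testdata/mnist", download=True)
ln.Artifact("./testdata/mnist", key="testdata/mnist", kind="dataset").save()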

Let’s get the dataset:

artifact = ln.Artifact.get(key="testdata/mnist")
artifact
Artifact(uid='MpstrPat0WKPNb7V0000', is_latest=True, key='testdata/mnist', suffix='', kind='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_files=4, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-04-15 16:32:15 UTC)

And download it to a local cache:

path = artifact.cache()
path
PosixUPath('/home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/MpstrPat0WKPNb7V')

Create a PyTorch-compatible dataset:

dataset = MNIST(path.as_posix(), transform=ToTensor())
dataset
Dataset MNIST
    Number of datapoints: 60000
    Root location: /home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/MpstrPat0WKPNb7V
    Split: Train
    StandardTransform
Transform: ToTensor()

Monitor training with MLflow

Train our example model and track the training progress with MLflow.
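
For simplicity, the cell below downloads MNIST again into ./data via torchvision; you could equally build the DataLoader from the cached artifact loaded above, for example:

from torch import utils

# reuse the dataset created from the cached LaminDB artifact
train_loader = utils.data.DataLoader(dataset, batch_size=32)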

mlflow.pytorch.autolog()

MODEL_CONFIG = {"hidden_size": 32, "bottleneck_size": 16, "batch_size": 32}

# Start MLflow run
with mlflow.start_run() as run:
    train_dataset = MNIST(
        root="./data", train=True, download=True, transform=ToTensor()
    )
    train_loader = utils.data.DataLoader(
        train_dataset, batch_size=MODEL_CONFIG["batch_size"]
    )

    # Initialize model
    autoencoder = LitAutoEncoder(
        MODEL_CONFIG["hidden_size"], MODEL_CONFIG["bottleneck_size"]
    )

    # Create checkpoint callback
    from lightning.pytorch.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        dirpath="model_checkpoints",
        filename=f"{run.info.run_id}_last_epoch",
        save_top_k=1,
        monitor="train_loss",
    )

    # Train model
    trainer = lightning.Trainer(
        accelerator="cpu",
        limit_train_batches=3,
        max_epochs=2,
        callbacks=[checkpoint_callback],
    )

    trainer.fit(model=autoencoder, train_dataloaders=train_loader)

    # Get run information
    run_id = run.info.run_id
    metrics = mlflow.get_run(run_id).data.metrics
    params = mlflow.get_run(run_id).data.params

    # Access model artifacts path
    model_uri = f"runs:/{run_id}/model"
    artifacts_path = run.info.artifact_uri
2025/04/15 16:32:54 WARNING mlflow.utils.autologging_utils: MLflow pytorch autologging is known to be compatible with 1.9.0 <= torch <= 2.6.0, but the installed version is 2.6.0+cu124. If you encounter errors during autologging, try upgrading / downgrading torch to a compatible version, or try upgrading MLflow.
100%|██████████| 9.91M/9.91M [00:00<00:00, 74.4MB/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 7.05MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 45.1MB/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 21.2MB/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2025/04/15 16:32:55 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: "/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/mlflow/pytorch/_lightning_autolog.py:465: UserWarning: Autologging is known to be compatible with pytorch-lightning versions between 1.9.0 and 2.5.0.post0 and may not succeed with packages outside this range."
/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:654: Checkpoint directory /home/runner/work/lamin-mlops/lamin-mlops/docs/model_checkpoints exists and is not empty.

  | Name    | Type       | Params | Mode 
-----------------------------------------------
0 | encoder | Sequential | 25.6 K | train
1 | decoder | Sequential | 26.4 K | train
-----------------------------------------------
52.1 K    Trainable params
0         Non-trainable params
52.1 K    Total params
0.208     Total estimated model params size (MB)
8         Modules in train mode
0         Modules in eval mode
/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
/opt/hostedtoolcache/Python/3.13.2/x64/lib/python3.13/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 86.48it/s, v_num=0]
2025/04/15 16:32:55 WARNING mlflow.utils.checkpoint_utils: Checkpoint logging is skipped, because checkpoint 'save_best_only' config is True, it requires to compare the monitored metric value, but the provided monitored metric value is not available.
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 129.07it/s, v_num=0]
2025/04/15 16:32:56 WARNING mlflow.utils.checkpoint_utils: Checkpoint logging is skipped, because checkpoint 'save_best_only' config is True, it requires to compare the monitored metric value, but the provided monitored metric value is not available.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 99.38it/s, v_num=0]

2025/04/15 16:33:03 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.

See the training progress in the MLflow UI:

MLflow training UI
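
If you are running this guide locally, you can launch the UI yourself; assuming MLflow's default local ./mlruns tracking store, something like:

mlflow ui --backend-store-uri ./mlruns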

Save model in LaminDB

# save checkpoint as a model in LaminDB
artifact = ln.Artifact(
    f"model_checkpoints/{run_id}_last_epoch.ckpt",
    key="testmodels/mlflow/litautoencoder.ckpt",  # is automatically versioned
    type="model",
).save()

# create a label with the MLflow run name
mlflow_run_name = mlflow.get_run(run_id).data.tags.get(
    "mlflow.runName", f"run_{run_id}"
)
experiment_label = ln.ULabel(
    name=mlflow_run_name, description="mlflow run name"
).save()

# annotate the model Artifact
artifact.ulabels.add(experiment_label)

# define the associated model hyperparameters in ln.Param
for k, v in MODEL_CONFIG.items():
    ln.Param(name=k, dtype=type(v).__name__).save()
artifact.params.add_values(MODEL_CONFIG)

# look at Artifact annotations
artifact.describe()
artifact.params
 returning existing Param record with same name: 'hidden_size'
 returning existing Param record with same name: 'bottleneck_size'
 returning existing Param record with same name: 'batch_size'
Artifact .ckpt
├── General
│   ├── .uid = 'lDyI2YLMkCkkEqY40000'
│   ├── .key = 'testmodels/mlflow/litautoencoder.ckpt'
│   ├── .size = 636275
│   ├── .hash = 'os2Q_RafO1sN-54-vDdBrQ'
│   ├── .path = /home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/lDyI2YLMkCkkEqY40000.ckpt
│   ├── .created_by = anonymous
│   ├── .created_at = 2025-04-15 16:33:03
│   └── .transform = 'MLflow'
└── Labels
    └── .ulabels                    ULabel                     stylish-ox-59                            
Artifact .ckpt
└── Params
    └── batch_size                  int                        32                                       
        bottleneck_size             int                        16                                       
        hidden_size                 int                        32                                       

See the checkpoints:

MLflow checkpoints UI
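
Because the checkpoint Artifact is annotated with the run-name label, you can also query it back by that label; a sketch using Django-style lookups (the label name differs per run):

# query model artifacts annotated with this MLflow run's label
ln.Artifact.filter(kind="model", ulabels__name=mlflow_run_name).df()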

If you want to re-use the checkpoint later on, you can download it like so:

ln.Artifact.get(key="testmodels/mlflow/litautoencoder.ckpt").cache()
PosixUPath('/home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/lDyI2YLMkCkkEqY40000.ckpt')

Or on the CLI:

lamin get artifact --key 'testmodels/mlflow/litautoencoder.ckpt'
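
To restore the model from the cached checkpoint, you can pass the path to Lightning's load_from_checkpoint; this works because the hyperparameters were stored via save_hyperparameters() in the model above:

from autoencoder import LitAutoEncoder

ckpt_path = ln.Artifact.get(key="testmodels/mlflow/litautoencoder.ckpt").cache()
model = LitAutoEncoder.load_from_checkpoint(ckpt_path.as_posix())
model.eval()
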
ln.finish()
! cells [(9, 11)] were not run consecutively
 finished Run('C2O99Zv7') after 10s at 2025-04-15 16:33:04 UTC
! calling anonymously, will miss private instances
!rm -rf ./lamin-mlops
!lamin delete --force lamin-mlops
! calling anonymously, will miss private instances
 deleting instance anonymous/lamin-mlops