MLflow¶
We show how LaminDB can be integrated with MLflow to track the training process and associate datasets & parameters with models.
# !pip install 'lamindb[jupyter]' torchvision lightning mlflow
!lamin init --storage ./lamin-mlops
Show code cell output
• resetting django module variables
→ connected lamindb: anonymous/lamin-mlops
import lamindb as ln
import mlflow
import lightning
from torch import utils
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
from autoencoder import LitAutoEncoder
Show code cell output
→ connected lamindb: anonymous/lamin-mlops
Tracking models in both LaminDB and MLflow¶
Note
It is not always necessary to track all model parameters and metrics in both LaminDB and MLflow. However, if specific artifacts or runs should be queryable by particular model attributes, such as the learning rate, then those attributes need to be tracked in LaminDB as well. Below, we show how to do this for the batch size and learning rate; the approach generalizes to additional features.
# define model run parameters & features
MODEL_CONFIG = {"batch_size": 32, "lr": 0.001}
hyperparameter = ln.Feature(name="Autoencoder hyperparameter", is_type=True).save()
hyperparams = ln.Feature.from_dict(MODEL_CONFIG, str_as_cat=True)
for param in hyperparams:
    param.type = hyperparameter
    param.save()
ln.track(params=MODEL_CONFIG)
→ created Transform('eLaoqQwwt2m60000'), started new Run('vCefiQAg...') at 2025-09-14 14:08:37 UTC
→ params: batch_size=32, lr=0.001
→ notebook imports: autoencoder lamindb==1.11.0 lightning==2.5.5 mlflow-skinny==3.3.2 mlflow-tracing==3.3.2 mlflow==3.3.2 torch==2.8.0 torchvision==0.23.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("eLaoqQwwt2m6", params={...})
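With the run parameters registered as features and passed to ln.track(), runs can later be queried by them. A minimal sketch of such queries, assuming the names defined above (the params__ lookup follows the lamindb run-parameter docs and its exact syntax may vary across versions):
# list the hyperparameter features registered above
ln.Feature.filter(type=hyperparameter).to_dataframe()
# query runs by a tracked parameter value (assumes params__<name> lookups are supported)
ln.Run.filter(params__lr=0.001).to_dataframe()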
Define a model¶
We use a basic PyTorch Lightning autoencoder as an example model.
Code of LitAutoEncoder
import torch
import lightning
from torch import optim, nn


class LitAutoEncoder(lightning.LightningModule):
    def __init__(self, hidden_size: int, bottleneck_size: int) -> None:
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, bottleneck_size),
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 28 * 28),
        )
        self.save_hyperparameters()

    def training_step(
        self, batch: tuple[torch.Tensor, torch.Tensor], batch_idx: int
    ) -> torch.Tensor:
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self) -> optim.Optimizer:
        optimizer = optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
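As a quick sanity check of the architecture, you can push a random batch through the encoder and decoder; a minimal sketch using the class defined above:
import torch

model = LitAutoEncoder(hidden_size=32, bottleneck_size=16)
x = torch.randn(8, 28 * 28)  # a fake batch of 8 flattened 28x28 images
z = model.encoder(x)  # compress to the 16-dimensional bottleneck
x_hat = model.decoder(z)  # reconstruct back to 784 dimensions
print(z.shape, x_hat.shape)  # torch.Size([8, 16]) torch.Size([8, 784])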
Query & download the MNIST dataset¶
We saved the MNIST dataset in a curation notebook, and it now shows up in the Artifact registry:
ln.Artifact.filter(kind="dataset").to_dataframe()
Show code cell output
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | RFRgEHdpXja7yPrH0000 | testdata/mnist | None | | dataset | None | 54950048 | amFx_vXqnUtJr0kmxxWK2Q | 4 | None | md5-d | True | True | 1 | 1 | None | None | True | 1 | 2025-09-14 14:08:06.578000+00:00 | 1 | {'af': {'0': True}} | 1
Let’s get the dataset:
artifact = ln.Artifact.get(key="testdata/mnist")
artifact
Show code cell output
Artifact(uid='RFRgEHdpXja7yPrH0000', is_latest=True, key='testdata/mnist', suffix='', kind='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_files=4, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-09-14 14:08:06 UTC)
And download it to a local cache:
path = artifact.cache()
path
Show code cell output
PosixUPath('/home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/RFRgEHdpXja7yPrH')
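To see which raw files were cached, you can list the directory; a small sketch using the returned path object:
# the artifact is a folder; list the cached MNIST files
sorted(p.name for p in path.rglob("*") if p.is_file())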
Create a PyTorch-compatible dataset:
dataset = MNIST(path.as_posix(), transform=ToTensor())
dataset
Show code cell output
Dataset MNIST
Number of datapoints: 60000
Root location: /home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/RFRgEHdpXja7yPrH
Split: Train
StandardTransform
Transform: ToTensor()
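Each element of the dataset is an (image, label) pair, where ToTensor() yields a 1x28x28 float tensor; for example:
# inspect a single sample
image, label = dataset[0]
image.shape, label  # torch.Size([1, 28, 28]) and an integer class label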
Monitor training with MLflow¶
Train our example model and track the training progress with MLflow.
# enable MLflow PyTorch autologging
mlflow.pytorch.autolog()

with mlflow.start_run() as mlflow_run:
    train_dataset = MNIST(
        root="./data", train=True, download=True, transform=ToTensor()
    )
    train_loader = utils.data.DataLoader(train_dataset, batch_size=32)

    # Initialize model
    autoencoder = LitAutoEncoder(32, 16)

    # Create checkpoint callback
    from lightning.pytorch.callbacks import ModelCheckpoint

    checkpoint_callback = ModelCheckpoint(
        dirpath="model_checkpoints",
        filename=f"{mlflow_run.info.run_id}_last_epoch",
        save_top_k=1,
        monitor="train_loss",
    )

    # Train model
    trainer = lightning.Trainer(
        accelerator="cpu",
        limit_train_batches=3,
        max_epochs=2,
        callbacks=[checkpoint_callback],
    )
    trainer.fit(model=autoencoder, train_dataloaders=train_loader)

    # Get run information and link the MLflow run to the LaminDB run
    run_id = mlflow_run.info.run_id
    ln.context.run.reference = run_id

    # save model summary artifact
    local_model_summary_path = (
        f"{mlflow_run.info.artifact_uri.removeprefix('file://')}/model_summary.txt"
    )
    mlflow_model_summary_af = ln.Artifact(
        local_model_summary_path,
        key="testmodels/mlflow/model_summary.txt",
        kind="model",
    ).save()

    # save checkpoint as a model
    mlflow_model_ckpt_af = ln.Artifact(
        f"model_checkpoints/{run_id}_last_epoch.ckpt",
        key="testmodels/mlflow/litautoencoder.ckpt",
        kind="model",
    ).save()
Show code cell output
0%| | 0.00/9.91M [00:00<?, ?B/s]
3%|▎ | 262k/9.91M [00:00<00:03, 2.61MB/s]
21%|██▏ | 2.13M/9.91M [00:00<00:00, 11.8MB/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 33.2MB/s]
0%| | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 1.02MB/s]
0%| | 0.00/1.65M [00:00<?, ?B/s]
24%|██▍ | 393k/1.65M [00:00<00:00, 3.65MB/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 10.3MB/s]
0%| | 0.00/4.54k [00:00<?, ?B/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 9.69MB/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:76: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
2025/09/14 14:08:40 WARNING mlflow.utils.autologging_utils: MLflow autologging encountered a warning: "/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/mlflow/pytorch/_lightning_autolog.py:467: UserWarning: Autologging is known to be compatible with pytorch-lightning versions between 2.0.7 and 2.5.2 and may not succeed with packages outside this range."
/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:751: Checkpoint directory /home/runner/work/lamin-mlops/lamin-mlops/docs/model_checkpoints exists and is not empty.
| Name | Type | Params | Mode
-----------------------------------------------
0 | encoder | Sequential | 25.6 K | train
1 | decoder | Sequential | 26.4 K | train
-----------------------------------------------
52.1 K Trainable params
0 Non-trainable params
52.1 K Total params
0.208 Total estimated model params size (MB)
8 Modules in train mode
0 Modules in eval mode
/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=3` in the `DataLoader` to improve performance.
/opt/hostedtoolcache/Python/3.13.7/x64/lib/python3.13/site-packages/lightning/pytorch/loops/fit_loop.py:310: The number of training batches (3) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
Training: | | 0/? [00:00<?, ?it/s]
Training: | | 0/? [00:00<?, ?it/s]
Epoch 0: 0%| | 0/3 [00:00<?, ?it/s]
Epoch 0: 33%|███▎ | 1/3 [00:00<00:00, 59.07it/s]
Epoch 0: 33%|███▎ | 1/3 [00:00<00:00, 56.86it/s, v_num=0]
Epoch 0: 67%|██████▋ | 2/3 [00:00<00:00, 80.25it/s, v_num=0]
Epoch 0: 67%|██████▋ | 2/3 [00:00<00:00, 78.55it/s, v_num=0]
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 93.09it/s, v_num=0]
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 91.61it/s, v_num=0]
Epoch 0: 100%|██████████| 3/3 [00:00<00:00, 89.58it/s, v_num=0]
2025/09/14 14:08:40 WARNING mlflow.utils.checkpoint_utils: Checkpoint logging is skipped, because checkpoint 'save_best_only' config is True, it requires to compare the monitored metric value, but the provided monitored metric value is not available.
Epoch 0: 0%| | 0/3 [00:00<?, ?it/s, v_num=0]
Epoch 1: 0%| | 0/3 [00:00<?, ?it/s, v_num=0]
Epoch 1: 33%|███▎ | 1/3 [00:00<00:00, 129.46it/s, v_num=0]
Epoch 1: 33%|███▎ | 1/3 [00:00<00:00, 120.92it/s, v_num=0]
Epoch 1: 67%|██████▋ | 2/3 [00:00<00:00, 123.91it/s, v_num=0]
Epoch 1: 67%|██████▋ | 2/3 [00:00<00:00, 119.77it/s, v_num=0]
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 126.55it/s, v_num=0]
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 123.53it/s, v_num=0]
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 120.01it/s, v_num=0]
2025/09/14 14:08:40 WARNING mlflow.utils.checkpoint_utils: Checkpoint logging is skipped, because checkpoint 'save_best_only' config is True, it requires to compare the monitored metric value, but the provided monitored metric value is not available.
INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=2` reached.
Epoch 1: 100%|██████████| 3/3 [00:00<00:00, 92.01it/s, v_num=0]
2025/09/14 14:08:46 WARNING mlflow.models.model: Model logged without a signature and input example. Please set `input_example` parameter when logging the model to auto infer the model signature.
! calling anonymously, will miss private instances
See the training progress in the MLflow UI:

See the checkpoints:

If you later want to re-use the checkpoint, you can get it via:
ln.Artifact.get(key="testmodels/mlflow/litautoencoder.ckpt").cache()
Show code cell output
PosixUPath('/home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-mlops/.lamindb/2l3qoKmi3TdoKMNn0000.ckpt')
Or on the CLI:
lamin get artifact --key 'testmodels/mlflow/litautoencoder.ckpt'
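To load the checkpoint back into the Lightning module, a minimal sketch (load_from_checkpoint restores the model from the hyperparameters stored via save_hyperparameters):
ckpt_path = ln.Artifact.get(key="testmodels/mlflow/litautoencoder.ckpt").cache()
model = LitAutoEncoder.load_from_checkpoint(ckpt_path.as_posix())
model.eval()  # ready for inference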
ln.finish()
Show code cell output
! cells [(10, 12)] were not run consecutively
→ finished Run('vCefiQAg') after 10s at 2025-09-14 14:08:47 UTC
Show code cell content
!rm -rf ./lamin-mlops
!lamin delete --force lamin-mlops
! calling anonymously, will miss private instances
• deleting instance anonymous/lamin-mlops