Curate MNIST
# pip install lamindb torch torchvision lightning
!lamin init --storage ./lamin-mlops
! using anonymous user (to identify, call: lamin login)
→ initialized lamindb: anonymous/lamin-mlops
import lamindb as ln
from pathlib import Path
ln.track()
→ connected lamindb: anonymous/lamin-mlops
→ created Transform('CEclTPhlW1XR0000'), started new Run('F7Rlu8uY...') at 2025-10-07 12:36:52 UTC
→ notebook imports: lamindb==1.12.1 torchvision==0.23.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("CEclTPhlW1XR")
Download the MNIST dataset and save it as an artifact in LaminDB to keep track of the training data associated with our model.
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
dataset = MNIST(Path.cwd() / "download_mnist", download=True, transform=ToTensor())
0%| | 0.00/9.91M [00:00<?, ?B/s]
100%|██████████| 9.91M/9.91M [00:00<00:00, 151MB/s]
0%| | 0.00/28.9k [00:00<?, ?B/s]
100%|██████████| 28.9k/28.9k [00:00<00:00, 47.7MB/s]
0%| | 0.00/1.65M [00:00<?, ?B/s]
100%|██████████| 1.65M/1.65M [00:00<00:00, 34.2MB/s]
0%| | 0.00/4.54k [00:00<?, ?B/s]
100%|██████████| 4.54k/4.54k [00:00<00:00, 12.1MB/s]
# the raw files are already extracted, so the gzipped archives are no longer needed
!rm -r download_mnist/MNIST/raw/*.gz
!ls -r download_mnist/MNIST/raw
train-labels-idx1-ubyte t10k-labels-idx1-ubyte
train-images-idx3-ubyte t10k-images-idx3-ubyte
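Before registering the folder, it can be useful to confirm that the four remaining files are intact IDX files. The following is a minimal sketch, not part of the original notebook: `read_idx_header` and `IDX_DTYPES` are hypothetical helpers, and the demo runs on a small synthetic file so that it does not depend on the download. It relies on the IDX layout of a 4-byte magic number (two zero bytes, a type code, the number of dimensions) followed by one big-endian 4-byte integer per dimension.

```python
import struct
import tempfile
from pathlib import Path

# IDX type codes (third byte of the magic number) mapped to readable names.
IDX_DTYPES = {
    0x08: "uint8",
    0x09: "int8",
    0x0B: "int16",
    0x0C: "int32",
    0x0D: "float32",
    0x0E: "float64",
}


def read_idx_header(path):
    """Return (dtype_name, shape) read from the header of an IDX file."""
    with open(path, "rb") as f:
        # Magic number: two zero bytes, one type byte, one dimension-count byte.
        zero, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if zero != 0 or dtype_code not in IDX_DTYPES:
            raise ValueError(f"{path} does not look like an IDX file")
        # One big-endian unsigned 32-bit integer per dimension.
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
    return IDX_DTYPES[dtype_code], shape


# Demo on a tiny synthetic file (2 images of 28x28 zero pixels).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo-images-idx3-ubyte"
    header = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28)
    demo.write_bytes(header + bytes(2 * 28 * 28))
    demo_header = read_idx_header(demo)

print(demo_header)
```

Applied to `download_mnist/MNIST/raw/train-images-idx3-ubyte`, the same helper would be expected to report a `uint8` array of shape `(60000, 28, 28)`.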
training_data_artifact = ln.Artifact(
"download_mnist/",
key="testdata/mnist",
kind="dataset",
description="Complete MNIST dataset directory containing training and test data",
).save()
training_data_artifact
! calling anonymously, will miss private instances
Artifact(uid='kbGhkjzNE5xAgTZs0000', is_latest=True, key='testdata/mnist', description='Complete MNIST dataset directory containing training and test data', suffix='', kind='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_files=4, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-07 12:37:01 UTC, is_locked=False)
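The `size` and `n_files` fields in the output above can be cross-checked by hand. The sketch below assumes the standard MNIST layout (60,000 training and 10,000 test images of 28x28 `uint8` pixels, one `uint8` label per image) and the IDX header size of 4 bytes plus 4 bytes per dimension; `idx_size` is a hypothetical helper introduced only for this arithmetic.

```python
def idx_size(shape):
    """Byte size of a uint8 IDX file: header plus one byte per value."""
    n_values = 1
    for dim in shape:
        n_values *= dim
    # 4 bytes of magic number, then 4 bytes per dimension, then the data.
    return 4 + 4 * len(shape) + n_values


expected_sizes = {
    "train-images-idx3-ubyte": idx_size((60000, 28, 28)),
    "train-labels-idx1-ubyte": idx_size((60000,)),
    "t10k-images-idx3-ubyte": idx_size((10000, 28, 28)),
    "t10k-labels-idx1-ubyte": idx_size((10000,)),
}
total_size = sum(expected_sizes.values())

print(len(expected_sizes), total_size)  # 4 54950048
```

The total of 54950048 bytes across 4 files matches `size=54950048` and `n_files=4` in the artifact record, which suggests that only the uncompressed files were registered.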
ln.finish()
→ finished Run('F7Rlu8uY') after 10s at 2025-10-07 12:37:02 UTC
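Once the run is finished, a later training notebook can rebuild the dataset from the registered artifact instead of downloading it again. The following is a sketch under assumptions rather than a step from this notebook: `load_registered_mnist` is a hypothetical helper, it requires a connected LaminDB instance, and it assumes the cached folder mirrors `download_mnist/` so that it contains `MNIST/raw/`.

```python
def load_registered_mnist(key="testdata/mnist"):
    """Sketch: rebuild the torchvision dataset from the registered artifact."""
    # Imported lazily so that defining this helper needs no connected instance.
    import lamindb as ln
    from torchvision.datasets import MNIST
    from torchvision.transforms import ToTensor

    # Look up the artifact by the key used when it was saved.
    artifact = ln.Artifact.get(key=key)
    # Sync the folder to the local cache and get its path.
    cache_path = artifact.cache()
    # Assumption: the cached folder contains MNIST/raw/ with the four IDX files.
    return MNIST(cache_path.as_posix(), download=False, transform=ToTensor())
```

Calling `load_registered_mnist()` inside a tracked run would also be expected to record the artifact as an input of that run, which links the training data to whatever model the run produces.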