Curate MNIST

# pip install lamindb torch torchvision lightning
!lamin init --storage ./lamin-mlops
! using anonymous user (to identify, call: lamin login)
 initialized lamindb: anonymous/lamin-mlops
import lamindb as ln
from pathlib import Path

ln.track()
 connected lamindb: anonymous/lamin-mlops
 created Transform('CEclTPhlW1XR0000'), started new Run('F7Rlu8uY...') at 2025-10-07 12:36:52 UTC
 notebook imports: lamindb==1.12.1 torchvision==0.23.0
 recommendation: to identify the notebook across renames, pass the uid: ln.track("CEclTPhlW1XR")

Download the MNIST dataset and save it in LaminDB so that the training data associated with our model is tracked.

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

dataset = MNIST(Path.cwd() / "download_mnist", download=True, transform=ToTensor())

# the gzipped archives are no longer needed once torchvision has extracted them
!rm -r download_mnist/MNIST/raw/*.gz
!ls -r download_mnist/MNIST/raw
train-labels-idx1-ubyte  t10k-labels-idx1-ubyte
train-images-idx3-ubyte  t10k-images-idx3-ubyte
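
The four remaining files are in the IDX binary format: a 4-byte magic number (two zero bytes, a data type code, and the number of dimensions) followed by one big-endian 32-bit size per dimension. As an optional sanity check before saving, the headers can be inspected with a minimal sketch like the one below. The helper name `read_idx_header` is hypothetical and not part of torchvision or LaminDB, and the directory path assumes the download location used above.

```python
import struct
from pathlib import Path


def read_idx_header(path):
    """Return (dtype_code, shape) parsed from the header of an IDX file.

    Hypothetical helper for illustration; not part of torchvision or lamindb.
    """
    with open(path, "rb") as f:
        # Magic number: two zero bytes, a data type code, the number of dimensions
        zero1, zero2, dtype_code, ndim = struct.unpack(">BBBB", f.read(4))
        if zero1 != 0 or zero2 != 0:
            raise ValueError(f"{path} does not look like an IDX file")
        # One big-endian unsigned 32-bit integer per dimension
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
    return dtype_code, shape


# Assumes the download directory used earlier in this notebook
raw_dir = Path("download_mnist/MNIST/raw")
if raw_dir.exists():
    for idx_file in sorted(raw_dir.glob("*-ubyte")):
        print(idx_file.name, read_idx_header(idx_file))
```

A data type code of 8 (`0x08`) denotes unsigned bytes, which is what the `ubyte` suffix in the file names refers to.
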
training_data_artifact = ln.Artifact(
    "download_mnist/",
    key="testdata/mnist",
    kind="dataset",
    description="Complete MNIST dataset directory containing training and test data",
).save()
training_data_artifact
! calling anonymously, will miss private instances
Artifact(uid='kbGhkjzNE5xAgTZs0000', is_latest=True, key='testdata/mnist', description='Complete MNIST dataset directory containing training and test data', suffix='', kind='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_files=4, branch_id=1, space_id=1, storage_id=1, run_id=1, created_by_id=1, created_at=2025-10-07 12:37:01 UTC, is_locked=False)
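
The reported `size=54950048` and `n_files=4` can be cross-checked with a little arithmetic. The sketch below assumes the standard MNIST shapes (60,000 training and 10,000 test images of 28×28 unsigned bytes, one label byte per image) and that the artifact size is the sum of the four uncompressed IDX files; `idx_file_size` is a hypothetical helper, not a LaminDB function.

```python
from math import prod


def idx_file_size(shape, itemsize=1):
    """Expected size in bytes of an IDX file (hypothetical helper).

    Header: 4-byte magic number plus 4 bytes per dimension.
    Payload: one item of `itemsize` bytes per array element.
    """
    header = 4 + 4 * len(shape)
    return header + prod(shape) * itemsize


# Assumed standard MNIST shapes; all four files store unsigned bytes
expected_shapes = {
    "train-images-idx3-ubyte": (60000, 28, 28),
    "train-labels-idx1-ubyte": (60000,),
    "t10k-images-idx3-ubyte": (10000, 28, 28),
    "t10k-labels-idx1-ubyte": (10000,),
}

sizes = {name: idx_file_size(shape) for name, shape in expected_shapes.items()}
total_size = sum(sizes.values())
print(total_size)  # 54950048
```

Under these assumptions the total matches the `size` field of the saved artifact, which suggests that only the four extracted files were registered.
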
ln.finish()
 finished Run('F7Rlu8uY') after 10s at 2025-10-07 12:37:02 UTC