Curate MNIST

# !pip install -q 'lamindb[jupyter,aws]' torch torchvision lightning wandb
!lamin init --storage ./lamin-mlops
! using anonymous user (to identify, call: lamin login)
→ connected lamindb: anonymous/lamin-mlops
import lamindb as ln
from pathlib import Path

ln.context.uid = "EgmnhRJ5Hw1S0000"
ln.context.track()
→ connected lamindb: anonymous/lamin-mlops
→ notebook imports: lamindb==0.76.6 torchvision==0.19.1
→ created Transform(uid='EgmnhRJ5Hw1S0000') & created Run(started_at='2024-09-10 15:19:32 UTC')

Download the MNIST dataset and save it in LaminDB to track the training data associated with our model.

from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
dataset = MNIST(Path.cwd() / "download_mnist", download=True, transform=ToTensor())
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-images-idx3-ubyte.gz

Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-labels-idx1-ubyte.gz
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz

Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-images-idx3-ubyte.gz

Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden

Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw

# remove the .gz archives so that only the four extracted files are saved
!rm download_mnist/MNIST/raw/*.gz
!ls -r download_mnist/MNIST/raw
train-labels-idx1-ubyte  t10k-labels-idx1-ubyte
train-images-idx3-ubyte  t10k-images-idx3-ubyte
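
The four remaining files use the IDX binary format. As a quick sanity check that extraction worked, the header of each file can be read with the standard library alone. This is a minimal sketch, not part of the original notebook: `read_idx_header` is a hypothetical helper, and it assumes the usual IDX layout of two zero bytes, a one-byte type code, a one-byte dimension count, and one big-endian 32-bit integer per dimension.

```python
import struct
from pathlib import Path


def read_idx_header(path):
    """Return (type_code, shape) parsed from the header of an IDX file.

    Hypothetical helper for illustration; assumes the standard IDX layout.
    """
    with open(path, "rb") as f:
        # Magic number: two zero bytes, the data type code, the number of dimensions.
        zero1, zero2, type_code, ndim = struct.unpack(">BBBB", f.read(4))
        if zero1 != 0 or zero2 != 0:
            raise ValueError(f"{path} does not look like an IDX file")
        # One big-endian unsigned 32-bit integer per dimension.
        shape = struct.unpack(f">{ndim}I", f.read(4 * ndim))
    return type_code, shape


# Only runs where the download from the cells above exists.
raw_dir = Path("download_mnist/MNIST/raw")
if raw_dir.exists():
    for file in sorted(raw_dir.glob("*-ubyte")):
        print(file.name, read_idx_header(file))
```

For MNIST, the type code should be 8 (unsigned byte); the image files should report three dimensions (count, rows, columns) and the label files one.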
training_data_artifact = ln.Artifact(
    "download_mnist/",
    key="testdata/mnist",
    type="dataset",
).save()
training_data_artifact
Artifact(uid='RvBhDpB4tjkyaxys0000', is_latest=True, key='testdata/mnist', suffix='', type='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_objects=4, _hash_type='md5-d', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, updated_at='2024-09-10 15:19:40 UTC')
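
The `n_objects=4` and `size=54950048` fields in the output above can be cross-checked against the local folder. The sketch below is an illustration under assumptions: `summarize_dir` is a hypothetical helper, and it assumes that `n_objects` counts the files in the folder and `size` sums their bytes, which is consistent with the numbers shown but not confirmed here against the LaminDB documentation. The expected total follows from the IDX layout: a 16-byte header plus one byte per pixel for the image files, and an 8-byte header plus one byte per label for the label files.

```python
from pathlib import Path


def summarize_dir(path):
    """Count regular files under `path` (recursively) and sum their sizes in bytes.

    Hypothetical helper for illustration only.
    """
    files = [p for p in Path(path).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Expected size derived from the IDX layout of the four MNIST files.
expected_size = (
    (16 + 60000 * 28 * 28)    # train-images-idx3-ubyte
    + (8 + 60000)             # train-labels-idx1-ubyte
    + (16 + 10000 * 28 * 28)  # t10k-images-idx3-ubyte
    + (8 + 10000)             # t10k-labels-idx1-ubyte
)
assert expected_size == 54950048  # matches `size` in the Artifact output

# Only runs where the download from the cells above exists.
if Path("download_mnist").exists():
    print(summarize_dir("download_mnist"))
```

If the `.gz` archives had not been removed first, the folder would hold eight files rather than four, and the totals would differ from the ones recorded on the artifact.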

Once the MNIST training dataset is saved in LaminDB, it shows up in LaminHub.

# save your notebook
# ln.context.finish()