Curate MNIST
# !pip install -q 'lamindb[jupyter,aws]' torch torchvision lightning wandb
!lamin init --storage ./lamin-mlops
! using anonymous user (to identify, call: lamin login)
→ connected lamindb: anonymous/lamin-mlops
import lamindb as ln
from pathlib import Path
ln.context.uid = "EgmnhRJ5Hw1S0000"
ln.context.track()
→ connected lamindb: anonymous/lamin-mlops
→ notebook imports: lamindb==0.76.11 torchvision==0.19.1
→ created Transform('EgmnhRJ5'), started new Run('eD8WzRua') at 2024-10-08 12:09:07 UTC
Download the MNIST dataset and save it in LaminDB to keep track of the training data that is associated with our model.
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
dataset = MNIST(Path.cwd() / "download_mnist", download=True, transform=ToTensor())
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████| 9912422/9912422 [00:00<00:00, 132408410.44it/s]
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-labels-idx1-ubyte.gz
100%|██████████| 28881/28881 [00:00<00:00, 5929594.88it/s]
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/train-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|██████████| 1648877/1648877 [00:00<00:00, 45687445.64it/s]
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-images-idx3-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Failed to download (trying next):
HTTP Error 403: Forbidden
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|██████████| 4542/4542 [00:00<00:00, 8927145.63it/s]
Extracting /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw/t10k-labels-idx1-ubyte.gz to /home/runner/work/lamin-mlops/lamin-mlops/docs/download_mnist/MNIST/raw
# the gzipped archives are no longer needed once they have been extracted
!rm download_mnist/MNIST/raw/*.gz
!ls -r download_mnist/MNIST/raw
train-labels-idx1-ubyte t10k-labels-idx1-ubyte
train-images-idx3-ubyte t10k-images-idx3-ubyte
training_data_artifact = ln.Artifact(
"download_mnist/",
key="testdata/mnist",
type="dataset",
).save()
training_data_artifact
Artifact(uid='eqwwZJwepOKxYl1T0000', is_latest=True, key='testdata/mnist', suffix='', type='dataset', size=54950048, hash='amFx_vXqnUtJr0kmxxWK2Q', n_objects=4, _hash_type='md5-d', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-10-08 12:09:13 UTC)
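The `size` and `n_objects` fields in this representation can be cross-checked against the local folder. The sketch below is an illustration rather than LaminDB's own logic: it assumes that `n_objects` counts the files in the folder and that `size` is the sum of their sizes in bytes, which is consistent with the numbers shown here. The helper name `summarize_directory` is hypothetical.

```python
from pathlib import Path


def summarize_directory(root):
    """Return (number of files, total size in bytes) below root, recursively.

    Hypothetical helper; assumes the artifact's n_objects and size fields
    correspond to these two quantities.
    """
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


# Expected size of the four uncompressed IDX files, derived from the format:
# image files have a 16-byte header plus 28 * 28 bytes per image,
# label files have an 8-byte header plus 1 byte per label.
expected_size = (
    (16 + 60_000 * 28 * 28)  # train-images-idx3-ubyte
    + (8 + 60_000)           # train-labels-idx1-ubyte
    + (16 + 10_000 * 28 * 28)  # t10k-images-idx3-ubyte
    + (8 + 10_000)           # t10k-labels-idx1-ubyte
)
print(expected_size)  # 54950048, the size reported by the artifact

# Compare with what is actually on disk, if the folder exists locally
root = Path("download_mnist")
if root.exists():
    print(summarize_directory(root))
```

If the `.gz` archives had not been removed before saving, both the file count and the total size would be larger than these values.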
Once the artifact has been saved, the MNIST training dataset also appears in LaminHub.
# save your notebook
# ln.context.finish()