Croissant
Croissant 🥐 is a high-level format building on schema.org for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It works with existing datasets to make them easier to find, use, and support with tools.
Here, we demonstrate how LaminDB can be used to validate Croissant files:
# pip install lamindb
!lamin init --storage ./test-lamin-croissant
→ initialized lamindb: anonymous/test-lamin-croissant
import lamindb as ln
import json
ln.track()
→ connected lamindb: anonymous/test-lamin-croissant
→ created Transform('W4Im6iUQV8bK0000', key='croissant.ipynb'), started new Run('RXmr7QcGEwN73f8V') at 2025-11-05 21:34:23 UTC
→ notebook imports: lamindb==1.15.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("W4Im6iUQV8bK")
croissant_path, dataset1_path = ln.examples.croissant.mini_immuno()
croissant_path
PosixPath('mini_immuno.anndata.zarr_metadata.json')
with open(croissant_path) as f:
    dictionary = json.load(f)
print(json.dumps(dictionary, indent=2))
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "https://mlcommons.org/croissant/",
    "ml": "http://ml-schema.org/",
    "sc": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "data": "https://mlcommons.org/croissant/data/",
    "rai": "https://mlcommons.org/croissant/rai/",
    "format": "https://mlcommons.org/croissant/format/",
    "citeAs": "https://mlcommons.org/croissant/citeAs/",
    "conformsTo": "https://mlcommons.org/croissant/conformsTo/",
    "@language": "en",
    "repeated": "https://mlcommons.org/croissant/repeated/",
    "field": "https://mlcommons.org/croissant/field/",
    "examples": "https://mlcommons.org/croissant/examples/",
    "recordSet": "https://mlcommons.org/croissant/recordSet/",
    "fileObject": "https://mlcommons.org/croissant/fileObject/",
    "fileSet": "https://mlcommons.org/croissant/fileSet/",
    "source": "https://mlcommons.org/croissant/source/",
    "references": "https://mlcommons.org/croissant/references/",
    "key": "https://mlcommons.org/croissant/key/",
    "parentField": "https://mlcommons.org/croissant/parentField/",
    "isLiveDataset": "https://mlcommons.org/croissant/isLiveDataset/",
    "separator": "https://mlcommons.org/croissant/separator/",
    "extract": "https://mlcommons.org/croissant/extract/",
    "subField": "https://mlcommons.org/croissant/subField/",
    "regex": "https://mlcommons.org/croissant/regex/",
    "column": "https://mlcommons.org/croissant/column/",
    "path": "https://mlcommons.org/croissant/path/",
    "fileProperty": "https://mlcommons.org/croissant/fileProperty/",
    "md5": "https://mlcommons.org/croissant/md5/",
    "jsonPath": "https://mlcommons.org/croissant/jsonPath/",
    "transform": "https://mlcommons.org/croissant/transform/",
    "replace": "https://mlcommons.org/croissant/replace/",
    "dataType": "https://mlcommons.org/croissant/dataType/",
    "includes": "https://mlcommons.org/croissant/includes/",
    "excludes": "https://mlcommons.org/croissant/excludes/"
  },
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "description": "A few samples from the immunology dataset",
  "url": "https://lamin.ai/laminlabs/lamindata/artifact/tCUkRcaEjTjhtozp0000",
  "creator": {
    "@type": "Person",
    "name": "falexwolf"
  },
  "dateCreated": "2025-07-16",
  "cr:projectName": "Mini Immuno Project",
  "datePublished": "2025-07-16",
  "version": "1.0",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "citation": "Please cite this dataset as: mini immuno (2025)",
  "encodingFormat": "zarr",
  "distribution": [
    {
      "@type": "cr:FileSet",
      "@id": "mini_immuno.anndata.zarr",
      "containedIn": {
        "@id": "directory"
      },
      "encodingFormat": "zarr"
    }
  ],
  "cr:recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "#samples",
      "name": "samples",
      "description": "my sample"
    }
  ]
}
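The metadata printed above declares one resource under `distribution` and one record set under `cr:recordSet`. As a rough illustration of how that structure can be inspected with the standard library alone, the sketch below summarizes a trimmed copy of the same document. The helper name `summarize_croissant` is hypothetical and not part of LaminDB or the Croissant tooling, and the fallback to an unprefixed `recordSet` key is an assumption about how other files may spell it.

```python
import json

# Trimmed copy of the metadata printed above (the "@context" block is omitted).
croissant_doc = json.loads("""
{
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "distribution": [
    {"@type": "cr:FileSet", "@id": "mini_immuno.anndata.zarr", "encodingFormat": "zarr"}
  ],
  "cr:recordSet": [
    {"@type": "cr:RecordSet", "@id": "#samples", "name": "samples"}
  ]
}
""")


def summarize_croissant(doc: dict) -> dict:
    """Hypothetical helper: list the resources and record sets a Croissant dict declares."""
    # Each distribution entry describes a file or a set of files.
    resources = [
        (entry.get("@type"), entry.get("@id"))
        for entry in doc.get("distribution", [])
    ]
    # The example file uses the prefixed key "cr:recordSet";
    # falling back to "recordSet" is an assumption for other files.
    record_sets = doc.get("cr:recordSet", doc.get("recordSet", []))
    return {
        "name": doc.get("name"),
        "resources": resources,
        "record_sets": [rs.get("name") for rs in record_sets],
    }


summary = summarize_croissant(croissant_doc)
print(summary)
```

A summary like this can serve as a quick sanity check before handing the file to a curation step, since a missing `distribution` entry shows up as an empty `resources` list.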
dataset1_path
PosixPath('mini_immuno.anndata.zarr')
artifact = ln.integrations.curate_from_croissant(croissant_path)
! calling anonymously, will miss private instances
! file path mini_immuno.anndata.zarr is not part of a known storage location, will be duplicated to: StorageSettings(root='/home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant', uid='5Tqt1lldyO8z')
The project label, license, description, version tag, and file paths are automatically extracted from the Croissant file. Support for further metadata fields may be added in the future.
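To make the list of extracted fields concrete, here is a minimal sketch of reading them from a Croissant dictionary with plain Python. It is not LaminDB's implementation. The `CroissantFields` class and `extract_fields` function are hypothetical names, joining name and description with " - " simply mirrors the description shown in this page's `describe()` output, and taking file paths from the `@id` of each `distribution` entry is an assumption based on the example file.

```python
from dataclasses import dataclass
from typing import Optional

# Trimmed copy of the example metadata shown earlier on this page.
croissant_doc = {
    "name": "Mini immuno dataset",
    "description": "A few samples from the immunology dataset",
    "cr:projectName": "Mini Immuno Project",
    "version": "1.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "cr:FileSet", "@id": "mini_immuno.anndata.zarr", "encodingFormat": "zarr"}
    ],
}


@dataclass
class CroissantFields:
    """Hypothetical container for the fields named in the text."""
    project: Optional[str]
    license: Optional[str]
    description: str
    version: Optional[str]
    file_paths: list


def extract_fields(doc: dict) -> CroissantFields:
    # Combine name and description, skipping whichever one is missing.
    parts = [doc.get("name", ""), doc.get("description", "")]
    description = " - ".join(part for part in parts if part)
    # Assumption: each distribution entry's "@id" is a path to the data.
    file_paths = [
        entry["@id"] for entry in doc.get("distribution", []) if "@id" in entry
    ]
    return CroissantFields(
        project=doc.get("cr:projectName"),
        license=doc.get("license"),
        description=description,
        version=doc.get("version"),
        file_paths=file_paths,
    )


fields = extract_fields(croissant_doc)
print(fields)
```

Using `dict.get` throughout keeps the sketch tolerant of files that omit optional fields such as `cr:projectName`.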
artifact.describe()
Artifact: mini_immuno.anndata.zarr (1.0)
|   description: Mini immuno dataset - A few samples from the immunology dataset
├── uid: ljkRXQRE9kX3yKHA0000            run: RXmr7Qc (croissant.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: N4_3ooEU3qkk97bWijn9XA         size: 21.5 KB
│   branch: main                         space: all
│   created_at: 2025-11-05 21:34:24 UTC  created_by: anonymous
│   n_files: 95
├── storage/path:
│   /home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant/.lamindb/ljkRXQRE9kX3yKHA.anndata.zarr
└── Labels
    └── .projects  Project  Mini Immuno Project
        .ulabels   ULabel   https://creativecommons.org/licenses/by…
ln.finish()
→ finished Run('RXmr7QcGEwN73f8V') after 2s at 2025-11-05 21:34:25 UTC