Croissant

Croissant 🥐 is a high-level format building on schema.org for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It works with existing datasets to make them easier to find, use, and support with tools.

Here, we demonstrate how LaminDB can be used to validate Croissant files:

# pip install lamindb
!lamin init --storage ./test-lamin-croissant
→ initialized lamindb: anonymous/test-lamin-croissant
import lamindb as ln
import json

ln.track()
→ connected lamindb: anonymous/test-lamin-croissant
→ created Transform('W4Im6iUQV8bK0000', key='croissant.ipynb'), started new Run('RXmr7QcGEwN73f8V') at 2025-11-05 21:34:23 UTC
→ notebook imports: lamindb==1.15.0
• recommendation: to identify the notebook across renames, pass the uid: ln.track("W4Im6iUQV8bK")
croissant_path, dataset1_path = ln.examples.croissant.mini_immuno()
croissant_path
PosixPath('mini_immuno.anndata.zarr_metadata.json')
with open(croissant_path) as f:
    dictionary = json.load(f)

print(json.dumps(dictionary, indent=2))
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "https://mlcommons.org/croissant/",
    "ml": "http://ml-schema.org/",
    "sc": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "data": "https://mlcommons.org/croissant/data/",
    "rai": "https://mlcommons.org/croissant/rai/",
    "format": "https://mlcommons.org/croissant/format/",
    "citeAs": "https://mlcommons.org/croissant/citeAs/",
    "conformsTo": "https://mlcommons.org/croissant/conformsTo/",
    "@language": "en",
    "repeated": "https://mlcommons.org/croissant/repeated/",
    "field": "https://mlcommons.org/croissant/field/",
    "examples": "https://mlcommons.org/croissant/examples/",
    "recordSet": "https://mlcommons.org/croissant/recordSet/",
    "fileObject": "https://mlcommons.org/croissant/fileObject/",
    "fileSet": "https://mlcommons.org/croissant/fileSet/",
    "source": "https://mlcommons.org/croissant/source/",
    "references": "https://mlcommons.org/croissant/references/",
    "key": "https://mlcommons.org/croissant/key/",
    "parentField": "https://mlcommons.org/croissant/parentField/",
    "isLiveDataset": "https://mlcommons.org/croissant/isLiveDataset/",
    "separator": "https://mlcommons.org/croissant/separator/",
    "extract": "https://mlcommons.org/croissant/extract/",
    "subField": "https://mlcommons.org/croissant/subField/",
    "regex": "https://mlcommons.org/croissant/regex/",
    "column": "https://mlcommons.org/croissant/column/",
    "path": "https://mlcommons.org/croissant/path/",
    "fileProperty": "https://mlcommons.org/croissant/fileProperty/",
    "md5": "https://mlcommons.org/croissant/md5/",
    "jsonPath": "https://mlcommons.org/croissant/jsonPath/",
    "transform": "https://mlcommons.org/croissant/transform/",
    "replace": "https://mlcommons.org/croissant/replace/",
    "dataType": "https://mlcommons.org/croissant/dataType/",
    "includes": "https://mlcommons.org/croissant/includes/",
    "excludes": "https://mlcommons.org/croissant/excludes/"
  },
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "description": "A few samples from the immunology dataset",
  "url": "https://lamin.ai/laminlabs/lamindata/artifact/tCUkRcaEjTjhtozp0000",
  "creator": {
    "@type": "Person",
    "name": "falexwolf"
  },
  "dateCreated": "2025-07-16",
  "cr:projectName": "Mini Immuno Project",
  "datePublished": "2025-07-16",
  "version": "1.0",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "citation": "Please cite this dataset as: mini immuno (2025)",
  "encodingFormat": "zarr",
  "distribution": [
    {
      "@type": "cr:FileSet",
      "@id": "mini_immuno.anndata.zarr",
      "containedIn": {
        "@id": "directory"
      },
      "encodingFormat": "zarr"
    }
  ],
  "cr:recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "#samples",
      "name": "samples",
      "description": "my sample"
    }
  ]
}
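The printed JSON-LD can be inspected with plain Python. The sketch below works on a trimmed copy of the dictionary shown above (only keys that appear in the output) and lists the distribution entries and record sets. `summarize_croissant` is a hypothetical helper introduced for illustration; it is not part of LaminDB or the Croissant tooling.

```python
# Trimmed copy of the dictionary printed above.
# In the notebook, `dictionary` could be passed instead of `example`.
example = {
    "@type": "Dataset",
    "name": "Mini immuno dataset",
    "distribution": [
        {
            "@type": "cr:FileSet",
            "@id": "mini_immuno.anndata.zarr",
            "containedIn": {"@id": "directory"},
            "encodingFormat": "zarr",
        }
    ],
    "cr:recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "#samples",
            "name": "samples",
            "description": "my sample",
        }
    ],
}


def summarize_croissant(doc: dict) -> dict:
    """Hypothetical helper: collect resource and record set identifiers."""
    return {
        "name": doc.get("name"),
        # Each distribution entry describes a file or a set of files.
        "resources": [
            (entry.get("@type"), entry.get("@id"))
            for entry in doc.get("distribution", [])
        ],
        # Each record set describes a table-like view of the data.
        "record_sets": [rs.get("@id") for rs in doc.get("cr:recordSet", [])],
    }


summary = summarize_croissant(example)
print(summary)
```

Using `.get` with defaults keeps the helper tolerant of Croissant files that omit optional sections such as `cr:recordSet`.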
dataset1_path
PosixPath('mini_immuno.anndata.zarr')
artifact = ln.integrations.curate_from_croissant(croissant_path)
! calling anonymously, will miss private instances
! file path mini_immuno.anndata.zarr is not part of a known storage location, will be duplicated to: StorageSettings(root='/home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant', uid='5Tqt1lldyO8z')

The project label, license, description, version tag, and file paths are extracted automatically from the Croissant file. Support for further metadata fields may be added in the future.
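To make this mapping concrete, the following sketch pulls the same fields out of a Croissant dictionary with plain Python. It only illustrates which JSON keys carry the values; `extract_core_metadata` is a hypothetical helper, the name/description join is an assumption, and the actual logic inside `curate_from_croissant` may differ.

```python
def extract_core_metadata(doc: dict) -> dict:
    """Hypothetical helper, not LaminDB's implementation."""
    return {
        "project": doc.get("cr:projectName"),
        "license": doc.get("license"),
        # Assumption: name and description are joined with " - ".
        "description": f"{doc.get('name')} - {doc.get('description')}",
        "version": doc.get("version"),
        # Assumption: the "@id" of each distribution entry is the file path.
        "file_paths": [
            entry["@id"] for entry in doc.get("distribution", []) if "@id" in entry
        ],
    }


# Values copied from the Croissant file printed earlier on this page.
example = {
    "name": "Mini immuno dataset",
    "description": "A few samples from the immunology dataset",
    "cr:projectName": "Mini Immuno Project",
    "version": "1.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "cr:FileSet", "@id": "mini_immuno.anndata.zarr"}
    ],
}

metadata = extract_core_metadata(example)
for key, value in metadata.items():
    print(f"{key}: {value}")
```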

artifact.describe()
Artifact: mini_immuno.anndata.zarr (1.0)
|   description: Mini immuno dataset - A few samples from the immunology dataset
├── uid: ljkRXQRE9kX3yKHA0000            run: RXmr7Qc (croissant.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: N4_3ooEU3qkk97bWijn9XA         size: 21.5 KB
│   branch: main                         space: all
│   created_at: 2025-11-05 21:34:24 UTC  created_by: anonymous
│   n_files: 95
├── storage/path:
│   /home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant/.lamindb/ljkRXQRE9kX3yKHA.anndata.zarr
└── Labels
    └── .projects                       Project                            Mini Immuno Project
        .ulabels                        ULabel                             https://creativecommons.org/licenses/by…
ln.finish()
→ finished Run('RXmr7QcGEwN73f8V') after 2s at 2025-11-05 21:34:25 UTC