Croissant

Croissant 🥐 is a high-level format for machine learning datasets, built on schema.org, that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It works with existing datasets to make them easier to find, use, and support with tools.

Here, we demonstrate how LaminDB can be used to validate Croissant files:

# pip install lamindb
!lamin init --storage ./test-lamin-croissant
→ initialized lamindb: anonymous/test-lamin-croissant
import lamindb as ln
import json

ln.track()
→ connected lamindb: anonymous/test-lamin-croissant
→ created Transform('JrCl7M9EHpVm0000', key='croissant.ipynb'), started new Run('6RypAcsqgR0yYbal') at 2026-03-06 16:25:05 UTC
→ notebook imports: lamindb-core==2.3a2
• recommendation: to identify the notebook across renames, pass the uid: ln.track("JrCl7M9EHpVm")
croissant_path, dataset1_path = ln.examples.croissant.mini_immuno()
croissant_path
/opt/hostedtoolcache/Python/3.13.12/x64/lib/python3.13/site-packages/anndata/_io/zarr.py:44: UserWarning: Writing zarr v2 data will no longer be the default in the next minor release. v3 data will be written by default. If you are explicitly setting this configuration, consider migrating to the zarr v3 file format.
  f = open_write_group(store)
PosixPath('mini_immuno.anndata.zarr_metadata.json')
with open(croissant_path) as f:
    dictionary = json.load(f)

print(json.dumps(dictionary, indent=2))
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "https://mlcommons.org/croissant/",
    "ml": "http://ml-schema.org/",
    "sc": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "data": "https://mlcommons.org/croissant/data/",
    "rai": "https://mlcommons.org/croissant/rai/",
    "format": "https://mlcommons.org/croissant/format/",
    "citeAs": "https://mlcommons.org/croissant/citeAs/",
    "conformsTo": "https://mlcommons.org/croissant/conformsTo/",
    "@language": "en",
    "repeated": "https://mlcommons.org/croissant/repeated/",
    "field": "https://mlcommons.org/croissant/field/",
    "examples": "https://mlcommons.org/croissant/examples/",
    "recordSet": "https://mlcommons.org/croissant/recordSet/",
    "fileObject": "https://mlcommons.org/croissant/fileObject/",
    "fileSet": "https://mlcommons.org/croissant/fileSet/",
    "source": "https://mlcommons.org/croissant/source/",
    "references": "https://mlcommons.org/croissant/references/",
    "key": "https://mlcommons.org/croissant/key/",
    "parentField": "https://mlcommons.org/croissant/parentField/",
    "isLiveDataset": "https://mlcommons.org/croissant/isLiveDataset/",
    "separator": "https://mlcommons.org/croissant/separator/",
    "extract": "https://mlcommons.org/croissant/extract/",
    "subField": "https://mlcommons.org/croissant/subField/",
    "regex": "https://mlcommons.org/croissant/regex/",
    "column": "https://mlcommons.org/croissant/column/",
    "path": "https://mlcommons.org/croissant/path/",
    "fileProperty": "https://mlcommons.org/croissant/fileProperty/",
    "md5": "https://mlcommons.org/croissant/md5/",
    "jsonPath": "https://mlcommons.org/croissant/jsonPath/",
    "transform": "https://mlcommons.org/croissant/transform/",
    "replace": "https://mlcommons.org/croissant/replace/",
    "dataType": "https://mlcommons.org/croissant/dataType/",
    "includes": "https://mlcommons.org/croissant/includes/",
    "excludes": "https://mlcommons.org/croissant/excludes/"
  },
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "description": "A few samples from the immunology dataset",
  "url": "https://lamin.ai/laminlabs/lamindata/artifact/tCUkRcaEjTjhtozp0000",
  "creator": {
    "@type": "Person",
    "name": "falexwolf"
  },
  "dateCreated": "2025-07-16",
  "cr:projectName": "Mini Immuno Project",
  "datePublished": "2025-07-16",
  "version": "1.0",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "citation": "Please cite this dataset as: mini immuno (2025)",
  "encodingFormat": "zarr",
  "distribution": [
    {
      "@type": "cr:FileSet",
      "@id": "mini_immuno.anndata.zarr",
      "containedIn": {
        "@id": "directory"
      },
      "encodingFormat": "zarr"
    }
  ],
  "cr:recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "#samples",
      "name": "samples",
      "description": "my sample"
    }
  ]
}
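
To make the layout of the file easier to see, the following sketch walks a trimmed copy of the dictionary printed above and lists its file resources and record sets. The helper name `summarize_croissant` is hypothetical and is not part of LaminDB or of the Croissant tooling; the key names are copied from the output above, and the `@context` block is omitted for brevity.

```python
# Trimmed copy of the Croissant dictionary printed above ("@context" omitted).
croissant = {
    "@type": "Dataset",
    "name": "Mini immuno dataset",
    "encodingFormat": "zarr",
    "distribution": [
        {
            "@type": "cr:FileSet",
            "@id": "mini_immuno.anndata.zarr",
            "containedIn": {"@id": "directory"},
            "encodingFormat": "zarr",
        }
    ],
    "cr:recordSet": [
        {
            "@type": "cr:RecordSet",
            "@id": "#samples",
            "name": "samples",
            "description": "my sample",
        }
    ],
}


def summarize_croissant(doc: dict) -> dict:
    """Hypothetical helper: list the resources and record sets of a Croissant dict."""
    # Each "distribution" entry describes a file or a set of files.
    resources = [(r.get("@type"), r.get("@id")) for r in doc.get("distribution", [])]
    # Each "cr:recordSet" entry describes a group of records in the data.
    record_sets = [rs.get("@id") for rs in doc.get("cr:recordSet", [])]
    return {"name": doc.get("name"), "resources": resources, "record_sets": record_sets}


summary = summarize_croissant(croissant)
print(summary)
```

Using `.get()` throughout keeps the sketch tolerant of files that omit `distribution` or `cr:recordSet`.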
dataset1_path
PosixPath('mini_immuno.anndata.zarr')
artifact = ln.integrations.curate_from_croissant(croissant_path)
! calling anonymously, will miss private instances
! file path mini_immuno.anndata.zarr is not part of a known storage location, will be duplicated to: StorageSettings(root='/home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant', uid='si4NcORvhs8t')

The project label, license, description, version tag, and file paths are extracted automatically from the Croissant file. Support for further metadata fields may be added in the future.
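
As a rough illustration of which keys carry this metadata, the sketch below pulls the same fields out of a trimmed copy of the Croissant dictionary with plain Python. It is not how `curate_from_croissant` is implemented. The function name `extract_basic_metadata` is hypothetical, and the mapping from keys to fields (for example, treating the `@id` of each `distribution` entry as a file path) is an assumption based on the file shown earlier.

```python
# Trimmed copy of the Croissant dictionary, keeping only the keys used below.
croissant = {
    "name": "Mini immuno dataset",
    "description": "A few samples from the immunology dataset",
    "cr:projectName": "Mini Immuno Project",
    "version": "1.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {
            "@type": "cr:FileSet",
            "@id": "mini_immuno.anndata.zarr",
            "encodingFormat": "zarr",
        }
    ],
}


def extract_basic_metadata(doc: dict) -> dict:
    """Hypothetical helper: collect the fields named in the text above."""
    return {
        "project": doc.get("cr:projectName"),
        "license": doc.get("license"),
        "description": doc.get("description"),
        "version": doc.get("version"),
        # Assumption: the "@id" of a distribution entry doubles as its path.
        "file_paths": [r["@id"] for r in doc.get("distribution", []) if "@id" in r],
    }


metadata = extract_basic_metadata(croissant)
for field, value in metadata.items():
    print(f"{field}: {value}")
```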

artifact.describe()
Artifact: mini_immuno.anndata.zarr (1.0)
|   description: Mini immuno dataset - A few samples from the immunology dataset
├── uid: gwsbqIaopdgDZBym0000            run: 6RypAcs (croissant.ipynb)
│   kind: dataset                        otype: AnnData                
│   hash: N4_3ooEU3qkk97bWijn9XA         size: 21.5 KB                 
│   branch: main                         space: all                    
│   created_at: 2026-03-06 16:25:07 UTC  created_by: anonymous         
│   n_files: 95                                                        
├── storage/path: 
│   /home/runner/work/lamin-mlops/lamin-mlops/docs/test-lamin-croissant/.lamindb/gwsbqIaopdgDZBym.anndata.zarr
└── Labels
    └── .ulabels                       ULabel                               https://creativecommons.org/licenses/b…
        .projects                      Project                              Mini Immuno Project                    
ln.finish()
→ finished Run('6RypAcsqgR0yYbal') after 2s at 2026-03-06 16:25:08 UTC