Croissant

Croissant πŸ₯ is a high-level format building on schema.org for machine learning datasets that combines metadata, resource file descriptions, data structure, and default ML semantics into a single file. It works with existing datasets to make them easier to find, use, and support with tools.

Here, we demonstrate how LaminDB can be used to validate Croissant files:

!lamin init --storage ./lamin-croissant
→ initialized lamindb: anonymous/lamin-croissant
import lamindb as ln
import json

ln.track()
→ connected lamindb: anonymous/lamin-croissant
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
→ created Transform('uEghcMT1tecm0000'), started new Run('aPDtpvjZ...') at 2025-09-07 19:25:29 UTC
→ notebook imports: lamindb==1.11a1
• recommendation: to identify the notebook across renames, pass the uid: ln.track("uEghcMT1tecm")
croissant_path, dataset1_path = ln.examples.croissant.mini_immuno()
croissant_path
PosixPath('mini_immuno.anndata.zarr_metadata.json')
with open(croissant_path) as f:
    dictionary = json.load(f)

print(json.dumps(dictionary, indent=2))
{
  "@context": {
    "@vocab": "https://schema.org/",
    "cr": "https://mlcommons.org/croissant/",
    "ml": "http://ml-schema.org/",
    "sc": "https://schema.org/",
    "dct": "http://purl.org/dc/terms/",
    "data": "https://mlcommons.org/croissant/data/",
    "rai": "https://mlcommons.org/croissant/rai/",
    "format": "https://mlcommons.org/croissant/format/",
    "citeAs": "https://mlcommons.org/croissant/citeAs/",
    "conformsTo": "https://mlcommons.org/croissant/conformsTo/",
    "@language": "en",
    "repeated": "https://mlcommons.org/croissant/repeated/",
    "field": "https://mlcommons.org/croissant/field/",
    "examples": "https://mlcommons.org/croissant/examples/",
    "recordSet": "https://mlcommons.org/croissant/recordSet/",
    "fileObject": "https://mlcommons.org/croissant/fileObject/",
    "fileSet": "https://mlcommons.org/croissant/fileSet/",
    "source": "https://mlcommons.org/croissant/source/",
    "references": "https://mlcommons.org/croissant/references/",
    "key": "https://mlcommons.org/croissant/key/",
    "parentField": "https://mlcommons.org/croissant/parentField/",
    "isLiveDataset": "https://mlcommons.org/croissant/isLiveDataset/",
    "separator": "https://mlcommons.org/croissant/separator/",
    "extract": "https://mlcommons.org/croissant/extract/",
    "subField": "https://mlcommons.org/croissant/subField/",
    "regex": "https://mlcommons.org/croissant/regex/",
    "column": "https://mlcommons.org/croissant/column/",
    "path": "https://mlcommons.org/croissant/path/",
    "fileProperty": "https://mlcommons.org/croissant/fileProperty/",
    "md5": "https://mlcommons.org/croissant/md5/",
    "jsonPath": "https://mlcommons.org/croissant/jsonPath/",
    "transform": "https://mlcommons.org/croissant/transform/",
    "replace": "https://mlcommons.org/croissant/replace/",
    "dataType": "https://mlcommons.org/croissant/dataType/",
    "includes": "https://mlcommons.org/croissant/includes/",
    "excludes": "https://mlcommons.org/croissant/excludes/"
  },
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "description": "A few samples from the immunology dataset",
  "url": "https://lamin.ai/laminlabs/lamindata/artifact/tCUkRcaEjTjhtozp0000",
  "creator": {
    "@type": "Person",
    "name": "falexwolf"
  },
  "dateCreated": "2025-07-16",
  "cr:projectName": "Mini Immuno Project",
  "datePublished": "2025-07-16",
  "version": "1.0",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "citation": "Please cite this dataset as: mini immuno (2025)",
  "encodingFormat": "zarr",
  "distribution": [
    {
      "@type": "cr:FileSet",
      "@id": "mini_immuno.anndata.zarr",
      "containedIn": {
        "@id": "directory"
      },
      "encodingFormat": "zarr"
    }
  ],
  "cr:recordSet": [
    {
      "@type": "cr:RecordSet",
      "@id": "#samples",
      "name": "samples",
      "description": "my sample"
    }
  ]
}
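As a rough illustration of what a structural check on such a file could look like, the sketch below tests whether a handful of top-level keys are present. It uses only the standard library and a trimmed inline copy of the file above. The function `missing_keys` and the tuple `EXPECTED_KEYS` are hypothetical names introduced here, and the key list is an assumption for illustration, not the Croissant specification's list of required properties and not what LaminDB checks.

```python
import json

# Hypothetical list of keys to check, chosen for illustration only;
# it is not the Croissant specification's list of required properties.
EXPECTED_KEYS = ("@type", "name", "description", "license", "distribution")


def missing_keys(document, expected=EXPECTED_KEYS):
    """Return the expected top-level keys that are absent from the document."""
    return [key for key in expected if key not in document]


# Trimmed inline copy of the file printed above (the "@context" block is omitted).
example = json.loads("""
{
  "@type": "Dataset",
  "name": "Mini immuno dataset",
  "description": "A few samples from the immunology dataset",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "distribution": [
    {"@type": "cr:FileSet", "@id": "mini_immuno.anndata.zarr", "encodingFormat": "zarr"}
  ]
}
""")

# A document with all expected keys yields an empty list.
complete = missing_keys(example)
# A document lacking some keys yields them in the order of EXPECTED_KEYS.
incomplete = missing_keys({"@type": "Dataset", "name": "no license here"})

print(complete)
print(incomplete)
```

A check like this only confirms that keys exist; it says nothing about whether their values are well-formed.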
dataset1_path
PosixPath('mini_immuno.anndata.zarr')
artifact = ln.integrations.curate_from_croissant(croissant_path)
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
! calling anonymously, will miss private instances
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions

The project label, license, description, version tag, and file paths are extracted automatically from the Croissant file. Support for further metadata may be added in the future.
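To make the mapping concrete, here is a minimal sketch of how these fields could be read from the JSON with plain Python. It is not LaminDB's implementation. The helper `extract_basic_metadata` is a hypothetical name, and the sketch assumes that the project label lives under `cr:projectName` and that each file path is the `@id` of a `distribution` entry, as the example file suggests.

```python
def extract_basic_metadata(document):
    """Collect project, license, description, version, and file paths.

    Assumption: file paths are the "@id" values of the "distribution" entries.
    """
    file_paths = [
        entry["@id"]
        for entry in document.get("distribution", [])
        if "@id" in entry
    ]
    return {
        "project": document.get("cr:projectName"),
        "license": document.get("license"),
        "description": document.get("description"),
        "version": document.get("version"),
        "file_paths": file_paths,
    }


# Trimmed inline copy of the example Croissant file.
example = {
    "name": "Mini immuno dataset",
    "description": "A few samples from the immunology dataset",
    "cr:projectName": "Mini Immuno Project",
    "version": "1.0",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [
        {"@type": "cr:FileSet", "@id": "mini_immuno.anndata.zarr", "encodingFormat": "zarr"}
    ],
}

metadata = extract_basic_metadata(example)
for key, value in metadata.items():
    print(f"{key}: {value}")
```

Missing keys come back as `None` (or an empty list for file paths) rather than raising, which keeps the sketch tolerant of sparse files.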

artifact.describe()
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
Artifact .anndata.zarr · AnnData · dataset
├── General
│   ├── description: Mini immuno dataset (mini_immuno.anndata.zarr) - A few samples from the immunology dataset
│   ├── uid: 7ghTXedt67EIhGCT0000          hash: N4_3ooEU3qkk97bWijn9XA
│   ├── size: 21.5 KB                      transform: croissant.ipynb
│   ├── space: all                         branch: all
│   ├── created_by: anonymous              created_at: 2025-09-07 19:25:31
│   ├── n_files: 95                        version: 1.0
│   └── storage path: /home/runner/work/lamin-mlops/lamin-mlops/docs/lamin-croissant/7ghTXedt67EIhGCT
└── Labels
    └── .projects                       Project                            Mini Immuno Project                     
        .ulabels                        ULabel                             https://creativecommons.org/licenses/by…
ln.finish()
! cells [(7, 9)] were not run consecutively
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
! creating spaces manually on sqlite databases is possible for demo purposes, but does *not* affect access permissions
→ finished Run('aPDtpvjZ') after 2s at 2025-09-07 19:25:32 UTC
!rm -rf ./lamin-croissant
!lamin delete --force lamin-croissant
! calling anonymously, will miss private instances
• deleting instance anonymous/lamin-croissant