Curate datasets

Curating a dataset with LaminDB means three things:

  1. Validate that the dataset matches a desired schema

  2. If validation fails, standardize the dataset (e.g., by fixing typos, mapping synonyms) or update registries

  3. Annotate the dataset by linking it against metadata entities so that it becomes queryable
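
The three steps can be pictured with a toy registry. The following is a minimal pure-Python sketch, not LaminDB's implementation: the `REGISTRY` and `SYNONYMS` collections and the `validate`, `standardize`, and `annotate` helpers are hypothetical names introduced only for illustration.

```python
# Toy registry of valid cell type names (hypothetical, for illustration only)
REGISTRY = {"B cell", "T cell"}
# Known synonyms mapped to their registered names (hypothetical)
SYNONYMS = {"B-cell": "B cell", "T-cell": "T cell"}


def validate(values):
    """Return the distinct values that are not in the registry."""
    return [v for v in dict.fromkeys(values) if v not in REGISTRY]


def standardize(values):
    """Replace known synonyms with their registered names."""
    return [SYNONYMS.get(v, v) for v in values]


def annotate(values):
    """Collect the distinct registered labels a dataset can be queried by."""
    return sorted(set(values) & REGISTRY)


column = ["B-cell", "T cell", "B-cell"]
print(validate(column))       # step 1: ['B-cell'] is not in the registry
column = standardize(column)  # step 2: map the synonym to its registered name
print(validate(column))       # [] because nothing is left to fix
print(annotate(column))       # step 3: ['B cell', 'T cell']
```

In LaminDB itself, the registries are database tables and the annotation step links the artifact to registry records, as the rest of this guide shows.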

In this guide we’ll curate common data structures. Here is a guide for the underlying low-level API.

Note: If you know either pydantic or pandera, here is an FAQ that compares LaminDB with both of these tools.

# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
 initialized lamindb: testuser1/test-curate
import lamindb as ln

ln.track("MCeA3reqZG2e")
 connected lamindb: testuser1/test-curate
 created Transform('MCeA3reqZG2e0000'), started new Run('ALxOESc5...') at 2025-05-08 07:31:29 UTC
 notebook imports: lamindb==1.5.0

DataFrame

Allow a flexible schema

We’ll be working with the mini immuno dataset:

df = ln.core.datasets.mini_immuno.get_dataset1()
df
        | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor
sample1 | 1 | 3 | 5 | DMSO | was ok | B cell | B cell | EFO:0008913 | 0.1% | 24 | D0001
sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-positive, alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None

This is how we curate it in a script.

curate_dataframe_flexible.py
import lamindb as ln

ln.core.datasets.mini_immuno.define_features_labels()
schema = ln.examples.schemas.valid_features()
df = ln.core.datasets.small_dataset1(otype="DataFrame")
artifact = ln.Artifact.from_df(
    df, key="examples/dataset1.parquet", schema=schema
).save()
artifact.describe()

Let’s run the script.

!python scripts/curate_dataframe_flexible.py
 connected lamindb: testuser1/test-curate
 connected lamindb: testuser1/test-curate
! no run & transform got linked, call `ln.track()` & re-run
Artifact .parquet/DataFrame
├── General
│   ├── .uid = '8LSupwDuFG5eszOL0000'
│   ├── .key = 'examples/dataset1.parquet'
│   ├── .size = 9108
│   ├── .hash = 'D2ZSlO6x7-OIfdf0MkTzRQ'
│   ├── .n_observations = 3
│   ├── .path = 
│   │   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/8LSupwDuFG5e
│   │   szOL0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-05-08 07:31:38
├── Dataset features
│   └── columns7         [Feature]                                           
assay_oid           cat[bionty.Exper…  single-cell RNA sequencing       
cell_type_by_expe…  cat[bionty.CellT…  B cell, CD8-positive, alpha-beta…
cell_type_by_model  cat[bionty.CellT…  B cell, T cell                   
perturbation        cat[ULabel[Pertu…  DMSO, IFNG                       
donor               str                                                 
concentration       str                                                 
treatment_time_h    num                                                 
└── Labels
    └── .cell_types         bionty.CellType    B cell, T cell, CD8-positive, al…
        .experimental_fac…  bionty.Experimen…  single-cell RNA sequencing       
        .ulabels            ULabel             DMSO, IFNG                       

The script defined the following features & labels through define_features_labels():

import lamindb as ln
import bionty as bt

# define valid labels
perturbation_type = ln.ULabel(name="Perturbation", is_type=True).save()
ln.ULabel(name="DMSO", type=perturbation_type).save()
ln.ULabel(name="IFNG", type=perturbation_type).save()
bt.CellType.from_source(name="B cell").save()
bt.CellType.from_source(name="T cell").save()

# define valid features
ln.Feature(name="perturbation", dtype=perturbation_type).save()
ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
ln.Feature(name="donor", dtype=str, nullable=True).save()
ln.Feature(name="concentration", dtype=str).save()
ln.Feature(name="treatment_time_h", dtype="num", coerce_dtype=True).save()

And the following schema through valid_features():

import lamindb as ln

schema = ln.Schema(name="valid_features", itype=ln.Feature).save()

Require a minimal set of columns

If we’d like to curate the dataframe with a minimal set of required columns, we can use the following schema.

import lamindb as ln

schema = ln.Schema(
    name="Mini immuno schema",
    features=[
        ln.Feature.get(name="perturbation"),
        ln.Feature.get(name="cell_type_by_model"),
        ln.Feature.get(name="assay_oid"),
        ln.Feature.get(name="donor"),
        ln.Feature.get(name="concentration"),
        ln.Feature.get(name="treatment_time_h"),
    ],
    flexible=True,  # _additional_ columns in a dataframe are validated & annotated
).save()

If the dataframe lacks one of the required columns, we’ll get a validation error.

curate_dataframe_minimal_errors.py
import lamindb as ln

schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
df = ln.core.datasets.small_dataset1(otype="DataFrame")
df.pop("donor")  # remove donor column to trigger validation error
try:
    artifact = ln.Artifact.from_df(
        df, key="examples/dataset1.parquet", schema=schema
    ).save()
except ln.errors.ValidationError as error:
    print(error)

Let’s run the script.

!python scripts/curate_dataframe_minimal_errors.py
 connected lamindb: testuser1/test-curate
 returning existing ULabel record with same name: 'Perturbation'
 returning existing ULabel record with same name: 'DMSO'
 returning existing ULabel record with same name: 'IFNG'
 returning existing Feature record with same name: 'perturbation'
 returning existing Feature record with same name: 'cell_type_by_model'
 returning existing Feature record with same name: 'cell_type_by_expert'
 returning existing Feature record with same name: 'assay_oid'
 returning existing Feature record with same name: 'donor'
 returning existing Feature record with same name: 'concentration'
 returning existing Feature record with same name: 'treatment_time_h'
! no run & transform got linked, call `ln.track()` & re-run
 creating new artifact version for key='examples/dataset1.parquet' (storage: '/home/runner/work/lamindb/lamindb/docs/test-curate')
column 'donor' not in dataframe. Columns in dataframe: ['ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'perturbation', 'sample_note', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h']
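
The behavior of a minimal, flexible schema can be sketched in plain Python. This is an illustration under stated assumptions, not LaminDB's code: `columns_to_annotate` and the toy `registered` set are hypothetical, and skipping additional columns that match no registered feature (such as sample_note) is an assumption based on the describe() output earlier in this guide.

```python
def columns_to_annotate(present, required, registered, flexible=True):
    """Return the columns that get validated and annotated.

    Raises ValueError if a required column is absent. With flexible=True,
    additional columns that match a registered feature are included too.
    """
    for name in required:
        if name not in present:
            raise ValueError(f"column {name!r} not in dataframe")
    selected = list(required)
    if flexible:
        # assumption: extra columns without a registered feature are skipped
        selected += [c for c in present if c in registered and c not in required]
    return selected


required = ["perturbation", "donor"]
registered = {"perturbation", "donor", "cell_type_by_expert"}  # toy feature registry

full = ["perturbation", "sample_note", "cell_type_by_expert", "donor"]
print(columns_to_annotate(full, required, registered))

try:
    columns_to_annotate(["perturbation", "sample_note"], required, registered)
except ValueError as error:
    print(error)  # the missing required column is reported
```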

Resolve synonyms and typos

Let’s now look at the same dataset but assume there are synonyms and typos.

df = ln.core.datasets.mini_immuno.get_dataset1(
    with_cell_type_synonym=True, with_cell_type_typo=True
)
df
        | ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor
sample1 | 1 | 3 | 5 | DMSO | was ok | B-cell | B cell | EFO:0008913 | 0.1% | 24 | D0001
sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-pos alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-pos alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None

Let’s reuse the schema that defines a minimal set of columns we expect in the dataframe.

schema = ln.core.datasets.mini_immuno.define_mini_immuno_schema_flexible()
schema.describe()
 returning existing ULabel record with same name: 'Perturbation'
 returning existing ULabel record with same name: 'DMSO'
 returning existing ULabel record with same name: 'IFNG'
 returning existing Feature record with same name: 'perturbation'
 returning existing Feature record with same name: 'cell_type_by_model'
 returning existing Feature record with same name: 'cell_type_by_expert'
 returning existing Feature record with same name: 'assay_oid'
 returning existing Feature record with same name: 'donor'
 returning existing Feature record with same name: 'concentration'
 returning existing Feature record with same name: 'treatment_time_h'
 returning existing schema with same hash: Schema(uid='57DJnhVJIVwJn18m', name='Mini immuno schema', n=6, is_type=False, itype='Feature', hash='4LJqB7CAbdUdbJcXo8lBVA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:42 UTC)
Schema 
├── .uid = '57DJnhVJIVwJn18m'
├── .name = 'Mini immuno schema'
├── .itype = 'Feature'
├── .ordered_set = False
├── .maximal_set = False
├── .created_by = testuser1 (Test User1)
├── .created_at = 2025-05-08 07:31:42
└── Feature6
    └── name               dtype                                      optional  nullab…  coerce_dtype  default_val…
        perturbation       cat[ULabel[Perturbation]]                  ✗         ✓        ✗             unset       
        cell_type_by_mod…  cat[bionty.CellType]                       ✗         ✓        ✗             unset       
        assay_oid          cat[bionty.ExperimentalFactor.ontology_i…  ✗         ✓        ✗             unset       
        donor              str                                        ✗         ✓        ✗             unset       
        concentration      str                                        ✗         ✓        ✗             unset       
        treatment_time_h   num                                        ✗         ✓        ✓             unset       

Create a curator object using the dataset & the schema.

curator = ln.curators.DataFrameCurator(df, schema)

The validate() method validates that your dataset adheres to the criteria defined by the schema. It identifies which values are already validated (exist in the registries) and which are potentially problematic (do not yet exist in our registries).

try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
! 2 terms not validated in feature 'cell_type_by_expert': 'B-cell', 'CD8-pos alpha-beta T cell'
    1 synonym found: "B-cell" → "B cell"
    → curate synonyms via: .standardize("cell_type_by_expert")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type_by_expert')
2 terms not validated in feature 'cell_type_by_expert': 'B-cell', 'CD8-pos alpha-beta T cell'
    1 synonym found: "B-cell" → "B cell"
    → curate synonyms via: .standardize("cell_type_by_expert")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type_by_expert')
# check the non-validated terms
curator.cat.non_validated
{'cell_type_by_expert': ['B-cell', 'CD8-pos alpha-beta T cell']}

For cell_type_by_expert, we saw that “B-cell” and “CD8-pos alpha-beta T cell” are not validated.

First, let’s standardize the synonym “B-cell” as suggested.

curator.cat.standardize("cell_type_by_expert")
# now we have only one non-validated cell type left
curator.cat.non_validated
{'cell_type_by_expert': ['CD8-pos alpha-beta T cell']}

For “CD8-pos alpha-beta T cell”, let’s understand which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup
Lookup objects from the public:
 .perturbation
 .cell_type_by_model
 .cell_type_by_expert
 .assay_oid
 .columns
 
Example:
    → categories = curator.lookup()["cell_type"]
    → categories.alveolar_type_1_fibroblast_cell

To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell
CellType(ontology_id='CL:0000625', name='CD8-positive, alpha-beta T cell', definition='A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.', synonyms='CD8-positive, alpha-beta T-cell|CD8-positive, alpha-beta T lymphocyte|CD8-positive, alpha-beta T-lymphocyte', parents=array(['CL:0000791'], dtype=object))
# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)
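
The lookup attribute used above, cd8_positive_alpha_beta_t_cell, is an identifier-friendly form of the name 'CD8-positive, alpha-beta T cell'. As a rough sketch of how such a key could be derived (the `to_lookup_key` helper is hypothetical, and the normalization LaminDB actually applies may differ):

```python
import re


def to_lookup_key(name: str) -> str:
    """Turn a display name into a lowercase, underscore-separated identifier."""
    # collapse every run of characters outside [0-9a-z] into one underscore
    key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those cases
    return f"_{key}" if key[:1].isdigit() else key


print(to_lookup_key("CD8-positive, alpha-beta T cell"))
# cd8_positive_alpha_beta_t_cell
```

This is why tab completion on a lookup object is a convenient way to find the canonical spelling of a term.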

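For perturbation, the values “DMSO” and “IFNG” were already registered by define_features_labels(). If a column contained new values, we could add them to the registry like this: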

# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")
# validate again
curator.validate()

Save a curated artifact.

artifact = curator.save_artifact(key="examples/my_curated_dataset.parquet")
 returning existing artifact with same hash: Artifact(uid='8LSupwDuFG5eszOL0000', is_latest=True, key='examples/dataset1.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=9108, hash='D2ZSlO6x7-OIfdf0MkTzRQ', n_observations=3, space_id=1, storage_id=1, schema_id=1, created_by_id=1, created_at=2025-05-08 07:31:38 UTC); to track this artifact as an input, use: ln.Artifact.get()
! key examples/dataset1.parquet on existing artifact differs from passed key examples/my_curated_dataset.parquet
 returning existing schema with same hash: Schema(uid='mVKo5vx4I80Pi3gZ', n=7, is_type=False, itype='Feature', hash='LNY9e8vhNpAOJRviIWwMCQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:38 UTC)
artifact.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = '8LSupwDuFG5eszOL0000'
│   ├── .key = 'examples/dataset1.parquet'
│   ├── .size = 9108
│   ├── .hash = 'D2ZSlO6x7-OIfdf0MkTzRQ'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/8LSupwDuFG5eszOL0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-05-08 07:31:38
│   └── .transform = 'Curate datasets'
├── Dataset features
│   └── columns7                 [Feature]                                                           
assay_oid                   cat[bionty.ExperimentalF…  single-cell RNA sequencing               
cell_type_by_expert         cat[bionty.CellType]       B cell, CD8-positive, alpha-beta T cell  
cell_type_by_model          cat[bionty.CellType]       B cell, T cell                           
perturbation                cat[ULabel[Perturbation]]  DMSO, IFNG                               
donor                       str                                                                 
concentration               str                                                                 
treatment_time_h            num                                                                 
└── Labels
    └── .cell_types                 bionty.CellType            B cell, T cell, CD8-positive, alpha-beta…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     DMSO, IFNG                               

AnnData

AnnData, like all the data structures that follow, is a composite structure that stores different arrays in different slots.

Allow a flexible schema

We can also allow a flexible schema for an AnnData and only require that it’s indexed with Ensembl gene IDs.

curate_anndata_flexible.py
import lamindb as ln

ln.core.datasets.mini_immuno.define_features_labels()
adata = ln.core.datasets.mini_immuno.get_dataset1(otype="AnnData")
schema = ln.examples.schemas.anndata_ensembl_gene_ids_and_valid_features_in_obs()
artifact = ln.Artifact.from_anndata(
    adata, key="examples/mini_immuno.h5ad", schema=schema
).save()
artifact.describe()

Let’s run the script.

!python scripts/curate_anndata_flexible.py
 connected lamindb: testuser1/test-curate
 returning existing ULabel record with same name: 'Perturbation'
 returning existing ULabel record with same name: 'DMSO'
 returning existing ULabel record with same name: 'IFNG'
 returning existing Feature record with same name: 'perturbation'
 returning existing Feature record with same name: 'cell_type_by_model'
 returning existing Feature record with same name: 'cell_type_by_expert'
 returning existing Feature record with same name: 'assay_oid'
 returning existing Feature record with same name: 'donor'
 returning existing Feature record with same name: 'concentration'
 returning existing Feature record with same name: 'treatment_time_h'
 connected lamindb: testuser1/test-curate
 connected lamindb: testuser1/test-curate
 returning existing schema with same hash: Schema(uid='0000000000000000', name='valid_features', n=-1, is_type=False, itype='Feature', hash='kMi7B_N88uu-YnbTLDU-DA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:37 UTC)
! no run & transform got linked, call `ln.track()` & re-run
 returning existing schema with same hash: Schema(uid='mVKo5vx4I80Pi3gZ', n=7, is_type=False, itype='Feature', hash='LNY9e8vhNpAOJRviIWwMCQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:38 UTC)
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'zagzIJAd5AXNDDFf0000'
│   ├── .key = 'examples/mini_immuno.h5ad'
│   ├── .size = 31672
│   ├── .hash = 'FB3CeMjmg1ivN6HDy6wsSg'
│   ├── .n_observations = 3
│   ├── .path = 
│   │   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/zagzIJAd5AXN
│   │   DDFf0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-05-08 07:31:56
├── Dataset features
│   ├── obs7             [Feature]                                           
│   │   assay_oid           cat[bionty.Exper…  single-cell RNA sequencing       
│   │   cell_type_by_expe…  cat[bionty.CellT…  B cell, CD8-positive, alpha-beta…
│   │   cell_type_by_model  cat[bionty.CellT…  B cell, T cell                   
│   │   perturbation        cat[ULabel[Pertu…  DMSO, IFNG                       
│   │   donor               str                                                 
│   │   concentration       str                                                 
│   │   treatment_time_h    num                                                 
│   └── var.T3           [bionty.Gene.ens…                                   
CD8A                num                                                 
CD4                 num                                                 
CD14                num                                                 
└── Labels
    └── .cell_types         bionty.CellType    B cell, T cell, CD8-positive, al…
        .experimental_fac…  bionty.Experimen…  single-cell RNA sequencing       
        .ulabels            ULabel             DMSO, IFNG                       

Under the hood, this used the following schema:

import lamindb as ln
import bionty as bt

obs_schema = ln.examples.schemas.valid_features()
varT_schema = ln.Schema(
    name="valid_ensembl_gene_ids", itype=bt.Gene.ensembl_gene_id
).save()
schema = ln.Schema(
    name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()

This schema transposes the var DataFrame during curation, so that one validates and annotates the var.T schema, i.e., [ENSG00000153563, ENSG00000010610, ENSG00000170458]. Without transposing, one would annotate with the schema of var, i.e., [gene_symbol, gene_type].
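
To see why the transposition matters, here is a small pure-Python stand-in for the var table. It is a sketch only: the dict-of-dicts layout, the `transpose` and `column_names` helpers, and the metadata values are illustrative assumptions, not how AnnData or LaminDB store the data.

```python
# Toy stand-in for adata.var: rows are genes, columns are per-gene metadata
var = {
    "ENSG00000153563": {"gene_symbol": "CD8A", "gene_type": "protein_coding"},
    "ENSG00000010610": {"gene_symbol": "CD4", "gene_type": "protein_coding"},
    "ENSG00000170458": {"gene_symbol": "CD14", "gene_type": "protein_coding"},
}


def transpose(table):
    """Swap rows and columns of a dict-of-dicts table."""
    transposed = {}
    for row, cells in table.items():
        for column, value in cells.items():
            transposed.setdefault(column, {})[row] = value
    return transposed


def column_names(table):
    """Column names of a dict-of-dicts table, in first-seen order."""
    return list(dict.fromkeys(c for cells in table.values() for c in cells))


print(column_names(var))             # ['gene_symbol', 'gene_type']
print(column_names(transpose(var)))  # the three Ensembl gene IDs
```

A schema that validates column names therefore sees gene identifiers only after the table has been transposed.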

https://lamin-site-assets.s3.amazonaws.com/.lamindb/gLyfToATM7WUzkWW0001.png

Resolve typos

import lamindb as ln
adata = ln.core.datasets.mini_immuno.get_dataset1(
    with_gene_typo=True, with_cell_type_typo=True, otype="AnnData"
)
adata
AnnData object with n_obs × n_vars = 3 × 3
    obs: 'perturbation', 'sample_note', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
    uns: 'temperature', 'experiment', 'date_of_study', 'study_note'
schema = ln.examples.schemas.anndata_ensembl_gene_ids_and_valid_features_in_obs()
schema.describe()
Schema(uid='0000000000000002', name='anndata_ensembl_gene_ids_and_valid_features_in_obs', n=-1, is_type=False, itype='Composite', otype='AnnData', dtype='num', hash='GTxxM36n9tocphLfdbNt9g', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:53 UTC)
    obs: Schema(uid='0000000000000000', name='valid_features', n=-1, is_type=False, itype='Feature', hash='kMi7B_N88uu-YnbTLDU-DA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:37 UTC)
    var.T: Schema(uid='0000000000000001', name='valid_ensembl_gene_ids', n=-1, is_type=False, itype='bionty.Gene.ensembl_gene_id', dtype='num', hash='1gocc_TJ1RU2bMwDRK-WUA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:53 UTC)

Check the slots of a schema:

schema.slots
{'obs': Schema(uid='0000000000000000', name='valid_features', n=-1, is_type=False, itype='Feature', hash='kMi7B_N88uu-YnbTLDU-DA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:37 UTC),
 'var.T': Schema(uid='0000000000000001', name='valid_ensembl_gene_ids', n=-1, is_type=False, itype='bionty.Gene.ensembl_gene_id', dtype='num', hash='1gocc_TJ1RU2bMwDRK-WUA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:53 UTC)}
curator = ln.curators.AnnDataCurator(adata, schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
! 1 term not validated in feature 'cell_type_by_expert' in slot 'obs': 'CD8-pos alpha-beta T cell'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type_by_expert')
1 term not validated in feature 'cell_type_by_expert' in slot 'obs': 'CD8-pos alpha-beta T cell'
    → fix typos, remove non-existent values, or save terms via: curator.slots['obs'].cat.add_new_from('cell_type_by_expert')

As above, we leverage a lookup object with valid cell types to find the correct name.

valid_cell_types = curator.slots["obs"].cat.lookup()["cell_type_by_expert"]
adata.obs["cell_type_by_expert"] = adata.obs[
    "cell_type_by_expert"
].cat.rename_categories(
    {"CD8-pos alpha-beta T cell": valid_cell_types.cd8_positive_alpha_beta_t_cell.name}
)

The validated AnnData can be subsequently saved as an Artifact:

adata.obs.columns
Index(['perturbation', 'sample_note', 'cell_type_by_expert',
       'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h',
       'donor'],
      dtype='object')
curator.slots["var.T"].cat.add_new_from("columns")
! using default organism = human
! 1 term not validated in feature 'columns' in slot 'var.T': 'GeneTypo'
    → fix typos, remove non-existent values, or save terms via: curator.slots['var.T'].cat.add_new_from('columns')
curator.validate()
artifact = curator.save_artifact(key="examples/my_curated_anndata.h5ad")
 returning existing schema with same hash: Schema(uid='mVKo5vx4I80Pi3gZ', n=7, is_type=False, itype='Feature', hash='LNY9e8vhNpAOJRviIWwMCQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:38 UTC)

Access the schema for each slot:

artifact.features.slots
{'obs': Schema(uid='mVKo5vx4I80Pi3gZ', n=7, is_type=False, itype='Feature', hash='LNY9e8vhNpAOJRviIWwMCQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:38 UTC),
 'var.T': Schema(uid='8UHCSk6EllJxGTfd', n=3, is_type=False, itype='bionty.Gene.ensembl_gene_id', dtype='num', hash='8e68Zm15DA4DuC39LJr6JA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, run_id=1, created_at=2025-05-08 07:32:09 UTC)}

The saved artifact has been annotated with validated features and labels:

artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'VANb2PUqJ72nwm4u0000'
│   ├── .key = 'examples/my_curated_anndata.h5ad'
│   ├── .size = 31672
│   ├── .hash = 'yeNWx0-dOGGkANQbocU4Sg'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/VANb2PUqJ72nwm4u0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-05-08 07:32:09
│   └── .transform = 'Curate datasets'
├── Dataset features
│   ├── obs7                     [Feature]                                                           
│   │   assay_oid                   cat[bionty.ExperimentalF…  single-cell RNA sequencing               
│   │   cell_type_by_expert         cat[bionty.CellType]       B cell, CD8-positive, alpha-beta T cell  
│   │   cell_type_by_model          cat[bionty.CellType]       B cell, T cell                           
│   │   perturbation                cat[ULabel[Perturbation]]  DMSO, IFNG                               
│   │   donor                       str                                                                 
│   │   concentration               str                                                                 
│   │   treatment_time_h            num                                                                 
│   └── var.T3                   [bionty.Gene.ensembl_gen…                                           
CD8A                        num                                                                 
CD4                         num                                                                 
└── Labels
    └── .cell_types                 bionty.CellType            B cell, T cell, CD8-positive, alpha-beta…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     DMSO, IFNG                               

MuData

import lamindb as ln
import bionty as bt


# define the global obs schema
obs_schema = ln.Schema(
    name="mudata_papalexi21_subset_obs_schema",
    features=[
        ln.Feature(name="perturbation", dtype="cat[ULabel[Perturbation]]").save(),
        ln.Feature(name="replicate", dtype="cat[ULabel[Replicate]]").save(),
    ],
).save()

# define the ['rna'].obs schema
obs_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_obs_schema",
    features=[
        ln.Feature(name="nCount_RNA", dtype=int).save(),
        ln.Feature(name="nFeature_RNA", dtype=int).save(),
        ln.Feature(name="percent.mito", dtype=float).save(),
    ],
).save()

# define the ['hto'].obs schema
obs_schema_hto = ln.Schema(
    name="mudata_papalexi21_subset_hto_obs_schema",
    features=[
        ln.Feature(name="nCount_HTO", dtype=int).save(),
        ln.Feature(name="nFeature_HTO", dtype=int).save(),
        ln.Feature(name="technique", dtype=bt.ExperimentalFactor).save(),
    ],
).save()

# define ['rna'].var schema
var_schema_rna = ln.Schema(
    name="mudata_papalexi21_subset_rna_var_schema",
    itype=bt.Gene.symbol,
    dtype=float,
).save()

# define composite schema
mudata_schema = ln.Schema(
    name="mudata_papalexi21_subset_mudata_schema",
    otype="MuData",
    slots={
        "obs": obs_schema,
        "rna:obs": obs_schema_rna,
        "hto:obs": obs_schema_hto,
        "rna:var": var_schema_rna,
    },
).save()

# curate a MuData
mdata = ln.core.datasets.mudata_papalexi21_subset()
bt.settings.organism = "human"  # set the organism to map gene symbols
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
artifact = curator.save_artifact(key="examples/mudata_papalexi21_subset.h5mu")
assert artifact.schema == mudata_schema
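
The slot keys in the composite schema above ("obs", "rna:obs", "rna:var") address nested parts of the data object, and keys elsewhere in this guide end in ".T" to request a transposed table. How LaminDB resolves these keys internally is not shown here. The following is a hypothetical sketch of one way such keys could be parsed, with `parse_slot` being an illustrative name.

```python
def parse_slot(key):
    """Split a slot key into (path components, transpose flag)."""
    parts = key.split(":")
    transposed = parts[-1].endswith(".T")
    if transposed:
        # drop the ".T" suffix so that only the attribute name remains
        parts[-1] = parts[-1][: -len(".T")]
    return parts, transposed


for key in ["obs", "rna:obs", "rna:var", "tables:table:var.T"]:
    print(key, "->", parse_slot(key))
```

Under this reading, "rna:obs" would point at the obs table of the "rna" modality, while a bare "obs" would refer to the global obs table.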

SpatialData

define_schema_spatialdata.py
import lamindb as ln
import bionty as bt


attrs_schema = ln.Schema(
    features=[
        ln.Feature(name="bio", dtype=dict).save(),
        ln.Feature(name="tech", dtype=dict).save(),
    ],
).save()

sample_schema = ln.Schema(
    features=[
        ln.Feature(name="disease", dtype=bt.Disease, coerce_dtype=True).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            coerce_dtype=True,
        ).save(),
    ],
).save()

tech_schema = ln.Schema(
    features=[
        ln.Feature(name="assay", dtype=bt.ExperimentalFactor, coerce_dtype=True).save(),
    ],
).save()

obs_schema = ln.Schema(
    features=[
        ln.Feature(name="sample_region", dtype="str").save(),
    ],
).save()

# Schema enforces only registered Ensembl Gene IDs are valid (maximal_set=True)
varT_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, maximal_set=True).save()

sdata_schema = ln.Schema(
    name="spatialdata_blobs_schema",
    otype="SpatialData",
    slots={
        "attrs:bio": sample_schema,
        "attrs:tech": tech_schema,
        "attrs": attrs_schema,
        "tables:table:obs": obs_schema,
        "tables:table:var.T": varT_schema,
    },
).save()
!python scripts/define_schema_spatialdata.py
 connected lamindb: testuser1/test-curate
! record with similar name exists! did you mean to load it?
<BasicQuerySet [Feature(uid='0uW4wRBAMM3i', name='assay_oid', dtype='cat[bionty.ExperimentalFactor.ontology_id]', array_rank=0, array_size=0, space_id=1, created_by_id=1, created_at=2025-05-08 07:31:34 UTC)]>
curate_spatialdata.py
import lamindb as ln

spatialdata = ln.core.datasets.spatialdata_blobs()
sdata_schema = ln.Schema.get(name="spatialdata_blobs_schema")
curator = ln.curators.SpatialDataCurator(spatialdata, sdata_schema)
try:
    curator.validate()
except ln.errors.ValidationError:
    pass

spatialdata.tables["table"].var.drop(index="ENSG00000999999", inplace=True)

# validate again (must pass now) and save artifact
artifact = ln.Artifact.from_spatialdata(
    spatialdata, key="examples/spatialdata1.zarr", schema=sdata_schema
).save()
artifact.describe()
!python scripts/curate_spatialdata.py
 connected lamindb: testuser1/test-curate
/opt/hostedtoolcache/Python/3.10.17/x64/lib/python3.10/site-packages/spatialdata/models/models.py:1144: UserWarning: Converting `region_key: region` to categorical dtype.
  return convert_region_column_to_categorical(adata)
! 1 term not validated in feature 'columns' in slot 'tables:table:var.T': 'ENSG00000999999'
    → fix typos, remove non-existent values, or save terms via: curator.slots['tables:table:var.T'].cat.add_new_from('columns')
! no run & transform got linked, call `ln.track()` & re-run
INFO     The Zarr backing store has been changed from None the new file path:   
         /home/runner/.cache/lamindb/Ln3BBrSx2Zsk0OR90000.zarr                  
 returning existing schema with same hash: Schema(uid='OR1h132V0Fqjbjql', n=2, is_type=False, itype='Feature', hash='DNescPFT3WrjT3-SH4BJCw', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:32:12 UTC)
 returning existing schema with same hash: Schema(uid='LUB8N4LN8hRBnigV', n=1, is_type=False, itype='Feature', hash='kz-su5wbYWfHbl6TKSwnFA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:32:12 UTC)
 returning existing schema with same hash: Schema(uid='lduvi4voz5YvbvGo', n=2, is_type=False, itype='Feature', hash='PhntTZsl57lydvnKtGSXfg', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:32:12 UTC)
 returning existing schema with same hash: Schema(uid='eGqwzeiSBbnFEtwL', n=1, is_type=False, itype='Feature', hash='npTwcpHIAUu3wCznPiSGTA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-05-08 07:32:12 UTC)
Artifact .zarr/SpatialData
├── General
│   ├── .uid = 'Ln3BBrSx2Zsk0OR90000'
│   ├── .key = 'examples/spatialdata1.zarr'
│   ├── .size = 12121732
│   ├── .hash = 'ikSJOoKg6sA-nexcJh_s_g'
│   ├── .n_files = 113
│   ├── .path = 
│   │   /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/Ln3BBrSx2Zsk
│   │   0OR9.zarr
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-05-08 07:32:28
├── Dataset features
│   ├── attrs:bio2       [Feature]                                           
│   │   developmental_sta…  cat[bionty.Devel…  adult stage                      
│   │   disease             cat[bionty.Disea…  Alzheimer disease                
│   ├── attrs:tech1      [Feature]                                           
│   │   assay               cat[bionty.Exper…  Visium Spatial Gene Expression   
│   ├── attrs2           [Feature]                                           
│   │   bio                 dict                                                
│   │   tech                dict                                                
│   ├── tables:table:obs  [Feature]                                           
│   │   sample_region       str                                                 
│   └── tables:table:var.…  [bionty.Gene.ens…                                   
BRCA2               num                                                 
BRAF                num                                                 
└── Labels
    └── .diseases           bionty.Disease     Alzheimer disease                
        .experimental_fac…  bionty.Experimen…  Visium Spatial Gene Expression   
        .developmental_st…  bionty.Developme…  adult stage                      

Other data structures

If you have other data structures, read: How do I validate & annotate arbitrary data structures?

!rm -rf ./test-curate
!lamin delete --force test-curate
 deleting instance testuser1/test-curate