Curate DataFrames and AnnDatas

Curating a dataset with LaminDB means three things:

  1. Validate that the dataset matches a desired schema

  2. If the dataset doesn’t validate, standardize it, e.g., by fixing typos or mapping synonyms

  3. Annotate the dataset by linking it against metadata entities so that it becomes queryable
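
The three steps can be illustrated without any library. The following is only a plain-Python sketch of the idea, not how LaminDB implements it: `registry` and `synonyms` are hypothetical stand-ins for a metadata registry and its synonym table.

```python
# Hypothetical stand-ins for a registry of known terms and its synonym table.
registry = {"astrocyte", "oligodendrocyte"}
synonyms = {"astrocytic glia": "astrocyte"}

values = ["astrocytic glia", "oligodendrocyte", "cerebral pyramidal neuron"]

# 1. Validate: collect the values that the registry does not know.
non_validated = [v for v in values if v not in registry]

# 2. Standardize: replace known synonyms with their canonical name.
standardized = [synonyms.get(v, v) for v in values]
still_unknown = [v for v in standardized if v not in registry]

# 3. Annotate: the validated labels are what the dataset would be linked to.
labels = sorted(set(standardized) & registry)

print(non_validated)
print(still_unknown)
print(labels)
```

Values that remain unknown after standardization need a manual decision: either fix them or add them to the registry as new terms.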

Curate a DataFrame

# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
 initialized lamindb: testuser1/test-curate

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "perturbation": pd.Categorical(["DMSO", "IFNG", "DMSO"]),
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": pd.Categorical(
            [
                "cerebral pyramidal neuron",
                "astrocytic glia",
                "oligodendrocyte",
            ]
        ),
        "assay_ontology_id": pd.Categorical(
            ["EFO:0008913", "EFO:0008913", "EFO:0008913"]
        ),
        "donor": ["D0001", "D0002", None],
    },
    index=["obs1", "obs2", "obs3"],
)
df
 connected lamindb: testuser1/test-curate
perturbation temperature cell_type assay_ontology_id donor
obs1 DMSO 37.2 cerebral pyramidal neuron EFO:0008913 D0001
obs2 IFNG 36.3 astrocytic glia EFO:0008913 D0002
obs3 DMSO 38.2 oligodendrocyte EFO:0008913 None

Define a schema to validate this dataset.

schema = ln.Schema(
    name="My example schema",
    features=[
        ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
        ln.Feature(name="temperature", dtype=float).save(),
        ln.Feature(name="cell_type", dtype=bt.CellType).save(),
        ln.Feature(
            name="assay_ontology_id", dtype=bt.ExperimentalFactor.ontology_id
        ).save(),
        ln.Feature(name="donor", dtype=str, nullable=True).save(),
    ],
).save()
# display the associated features as a dataframe
schema.features.df()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms _expect_many _curation space_id type_id run_id created_at created_by_id _aux _branch_code
id
1 TbmOCTGMLFJL perturbation cat[ULabel] None None None 0 0 None None None True None 1 None None 2025-02-27 13:54:48.107000+00:00 1 {'af': {'0': None, '1': True}} 1
2 OUEC5i8v9HC6 temperature float None None None 0 0 None None None True None 1 None None 2025-02-27 13:54:48.115000+00:00 1 {'af': {'0': None, '1': True}} 1
3 D144tFa2O4nE cell_type cat[bionty.CellType] None None None 0 0 None None None True None 1 None None 2025-02-27 13:54:48.528000+00:00 1 {'af': {'0': None, '1': True}} 1
4 u0vEQlt6YSsX assay_ontology_id cat[bionty.ExperimentalFactor.ontology_id] None None None 0 0 None None None True None 1 None None 2025-02-27 13:54:48.534000+00:00 1 {'af': {'0': None, '1': True}} 1
5 XRPmRFt98s4W donor str None None None 0 0 None None None True None 1 None None 2025-02-27 13:54:48.540000+00:00 1 {'af': {'0': None, '1': True}} 1
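
Conceptually, each feature pairs a column name with an expected dtype and a nullability flag. The sketch below mimics that kind of check in plain Python. It is an illustration under stated assumptions, not LaminDB's validation logic: `schema_spec` and `check_columns` are hypothetical names, and the nullability values are chosen for the example.

```python
# Hypothetical schema: column name -> (expected Python type, nullable).
schema_spec = {
    "temperature": (float, False),
    "donor": (str, True),
}

columns = {
    "temperature": [37.2, 36.3, 38.2],
    "donor": ["D0001", "D0002", None],
}


def check_columns(columns, schema_spec):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for name, (expected_type, nullable) in schema_spec.items():
        if name not in columns:
            problems.append(f"missing column: {name}")
            continue
        for value in columns[name]:
            if value is None:
                # Nulls are only a problem if the column is not nullable.
                if not nullable:
                    problems.append(f"{name}: null not allowed")
            elif not isinstance(value, expected_type):
                problems.append(
                    f"{name}: {value!r} is not {expected_type.__name__}"
                )
    return problems


problems = check_columns(columns, schema_spec)
broken = check_columns({"temperature": [37.2, None]}, schema_spec)
print(problems)
print(broken)
```

The `None` in `donor` passes because that column is declared nullable, mirroring `nullable=True` on the `donor` feature above.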

Create a Curator from the dataset and the schema.

curator = ln.curators.DataFrameCurator(df, schema)

The validate() method validates that your dataset adheres to the criteria defined by the schema. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
 saving validated records of 'cell_type'
 added 2 records from public with CellType.name for "cell_type": 'astrocyte', 'oligodendrocyte'
 saving validated records of 'assay_ontology_id'
 added 1 record from public with ExperimentalFactor.ontology_id for "assay_ontology_id": 'EFO:0008913'
 mapping "perturbation" on ULabel.name
!   2 terms are not validated: 'DMSO', 'IFNG'
    → fix typos, remove non-existent values, or save terms via .add_new_from("perturbation")
 mapping "cell_type" on CellType.name
!   2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
    1 synonym found: "astrocytic glia" → "astrocyte"
    → curate synonyms via .standardize("cell_type")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
    1 synonym found: "astrocytic glia" → "astrocyte"
    → curate synonyms via .standardize("cell_type")
    for remaining terms:
    → fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
# check the non-validated terms
curator.cat.non_validated
{'perturbation': ['DMSO', 'IFNG'],
 'cell_type': ['cerebral pyramidal neuron', 'astrocytic glia']}

For cell_type, two values are not validated: “cerebral pyramidal neuron” and “astrocytic glia”.

First, let’s standardize the synonym “astrocytic glia”, as suggested by the validation output.

curator.cat.standardize("cell_type")
 standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"
# now we have only one non-validated cell type left
curator.cat.non_validated
{'perturbation': ['DMSO', 'IFNG'], 'cell_type': ['cerebral pyramidal neuron']}

For “cerebral pyramidal neuron”, let’s understand which cell type in the public ontology might be the actual match.

# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup
Lookup objects from the public:
 .perturbation
 .cell_type
 .assay_ontology_id
 .columns
 
Example:
    → categories = curator.lookup()["cell_type"]
    → categories.alveolar_type_1_fibroblast_cell

To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0010012', 'CL:0000598'], dtype=object))
# fix the cell type
df.cell_type = df.cell_type.cat.rename_categories(
    {"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name}
)
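
When browsing a lookup object is impractical, a fuzzy string match can shortlist candidate names before you pick one. The following sketch uses `difflib` from the Python standard library, which is not part of LaminDB. The `ontology_names` list is a hypothetical, hand-written stand-in for names taken from the public ontology.

```python
from difflib import get_close_matches

# Hypothetical shortlist standing in for names from the public ontology.
ontology_names = [
    "astrocyte",
    "oligodendrocyte",
    "cerebral cortex pyramidal neuron",
]

# Rank candidates by string similarity and keep those above the cutoff.
suggestions = get_close_matches(
    "cerebral pyramidal neuron", ontology_names, n=3, cutoff=0.6
)
print(suggestions)
```

String similarity is only a hint: confirm the suggested term against its definition in the ontology before renaming categories.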

For perturbation, we want to add the new values “DMSO” and “IFNG” to the registry.

# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")
 added 2 records with ULabel.name for "perturbation": 'IFNG', 'DMSO'
# validate again
curator.validate()
 saving validated records of 'cell_type'
 added 1 record from public with CellType.name for "cell_type": 'cerebral cortex pyramidal neuron'
 "perturbation" is validated against ULabel.name
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id

Save a curated artifact.

artifact = curator.save_artifact(key="my_datasets/my_curated_dataset.parquet")
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_dataset.parquet'
 storing artifact 'UE3miar5VH3ZwCNr0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UE3miar5VH3ZwCNr0000.parquet'
! run input wasn't tracked, call `ln.track()` and re-run
 5 unique terms (100.00%) are validated for name
 returning existing schema with same hash: Schema(uid='I3GBKPfRJnlq9DEEIEzF', name='My example schema', n=5, itype='Feature', is_type=False, hash='x_Wetns1Gi_r8gjlRGQBIg', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-27 13:54:48 UTC)
! updated otype from None to DataFrame
artifact.describe()
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'UE3miar5VH3ZwCNr0000'
│   ├── .key = 'my_datasets/my_curated_dataset.parquet'
│   ├── .size = 4759
│   ├── .hash = 'maTJHYf0LMnN08S4ZqQL9Q'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/UE3miar5VH3ZwCNr0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-27 13:54:52
├── Dataset features/schema
│   └── columns • 5                 [Feature]
│       assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing
│       cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│       perturbation                cat[ULabel]                DMSO, IFNG
│       temperature                 float
│       donor                       str
└── Labels
    └── .cell_types                 bionty.CellType            astrocyte, oligodendrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     IFNG, DMSO                               

Curate an AnnData

For an AnnData, we additionally specify how to validate the var index (here, against Ensembl gene IDs).

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3],
        "ENSG00000276977": [4, 5, 6],
        "ENSG00000198851": [7, 8, 9],
        "ENSG00000010610": [10, 11, 12],
        "ENSG00000153563": [13, 14, 15],
        "ENSGcorrupted": [16, 17, 18],
    },
    index=df.index,
)
# obs will validate because we already curated the dataframe above
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
    obs: 'perturbation', 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
# define var schema
var_schema = ln.Schema(
    name="my_var_schema",
    itype=bt.Gene.ensembl_gene_id,  # identifier type
    dtype=int,
).save()

# define composite schema
anndata_schema = ln.Schema(
    name="small_dataset1_anndata_schema",
    otype="AnnData",  # object type
    components={"obs": schema, "var": var_schema},
).save()
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
 saving validated records of 'columns'
 added 5 records from public with Gene.ensembl_gene_id for "columns": 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000198851'
 "perturbation" is validated against ULabel.name
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id

Subset the AnnData to validated genes only:

adata_validated = adata[:, ~adata.var.index.isin(["ENSGcorrupted"])].copy()
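
Hardcoding the offending identifier works for a toy example. For larger datasets, a purely syntactic pre-check can flag malformed identifiers. The sketch below assumes human Ensembl gene IDs without a version suffix (`ENSG` followed by 11 digits). The names `ENSEMBL_HUMAN_GENE`, `well_formed` and `malformed` are illustrative, and a format check is not a substitute for validating against the registry.

```python
import re

# Human Ensembl gene IDs: "ENSG" followed by exactly 11 digits (no version suffix).
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}")

var_names = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "ENSGcorrupted",
]

# One boolean per identifier: True if it has the expected shape.
well_formed = [ENSEMBL_HUMAN_GENE.fullmatch(name) is not None for name in var_names]
malformed = [name for name, ok in zip(var_names, well_formed) if not ok]
print(malformed)
```

Converted to a NumPy boolean array, `well_formed` could serve as the column mask in place of the `isin` expression above. Well-formed identifiers can still be absent from the registry, so the curator's validation remains the authoritative check.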

Now let’s validate the subsetted object:

curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
curator.validate()
 "perturbation" is validated against ULabel.name
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id

The validated AnnData can then be saved as an Artifact:

artifact = curator.save_artifact(key="my_datasets/my_curated_anndata.h5ad")
 "perturbation" is validated against ULabel.name
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_anndata.h5ad'
 storing artifact 'P5C6wPIeTP25hRBd0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/P5C6wPIeTP25hRBd0000.h5ad'
! run input wasn't tracked, call `ln.track()` and re-run
 parsing feature names of X stored in slot 'var'
    5 unique terms (100.00%) are validated for ensembl_gene_id
    linked: Schema(uid='mR4OllbjkY9VkzsH6ueO', n=5, dtype='int', itype='bionty.Gene', is_type=False, hash='nmFTQkXy239ruKDl8gDLSw', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=<django.db.models.expressions.DatabaseDefault object at 0x7f94dda7f230>)
 parsing feature names of slot 'obs'
    5 unique terms (100.00%) are validated for name
    returning existing schema with same hash: Schema(uid='I3GBKPfRJnlq9DEEIEzF', name='My example schema', n=5, itype='Feature', is_type=False, hash='x_Wetns1Gi_r8gjlRGQBIg', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-27 13:54:48 UTC)
!    updated otype from None to DataFrame
    linked: Schema(uid='I3GBKPfRJnlq9DEEIEzF', name='My example schema', n=5, itype='Feature', is_type=False, otype='DataFrame', hash='x_Wetns1Gi_r8gjlRGQBIg', minimal_set=True, ordered_set=False, maximal_set=False, created_by_id=1, space_id=1, created_at=2025-02-27 13:54:48 UTC)
 saved 1 feature set for slot: 'var'

The saved artifact has been annotated with validated features and labels:

artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'P5C6wPIeTP25hRBd0000'
│   ├── .key = 'my_datasets/my_curated_anndata.h5ad'
│   ├── .size = 25672
│   ├── .hash = 'RdKHw6itdYPBD88Pjoxe4g'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/P5C6wPIeTP25hRBd0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-02-27 13:54:55
├── Dataset features/schema
│   ├── var • 5                     [bionty.Gene]
│   │   TCF7                        int
│   │   PDCD1                       int
│   │   CD8A                        int
│   │   CD4                         int
│   │   CD3E                        int
│   └── obs • 5                     [Feature]
│       assay_ontology_id           cat[bionty.ExperimentalF…  single-cell RNA sequencing
│       cell_type                   cat[bionty.CellType]       astrocyte, cerebral cortex pyramidal neu…
│       perturbation                cat[ULabel]                DMSO, IFNG
│       temperature                 float
│       donor                       str
└── Labels
    └── .cell_types                 bionty.CellType            astrocyte, oligodendrocyte, cerebral cor…
        .experimental_factors       bionty.ExperimentalFactor  single-cell RNA sequencing               
        .ulabels                    ULabel                     IFNG, DMSO                               

Standardize an AnnData

If you need more control, you can access the underlying "var" and "obs" DataFrameCurator objects directly.

curator.slots["var"]
curator.slots["obs"]
<lamindb.curators.DataFrameCurator at 0x7f94e8b47d10>
# revert the previous cell type standardization
df["cell_type"] = df["cell_type"].cat.rename_categories(
    {"astrocyte": "astrocytic glia"}
)
# an AnnData where a cell type matches a synonym
adata_with_synonym = ad.AnnData(X=adata_validated.X, var=adata_validated.var, obs=df)
adata_with_synonym
AnnData object with n_obs × n_vars = 3 × 5
    obs: 'perturbation', 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
curator = ln.curators.AnnDataCurator(adata_with_synonym, anndata_schema)
try:
    curator.validate()
except ln.errors.ValidationError as error:
    print(error)
 "perturbation" is validated against ULabel.name
 mapping "cell_type" on CellType.name
!   1 term is not validated: 'astrocytic glia'
    1 synonym found: "astrocytic glia" → "astrocyte"
    → curate synonyms via .standardize("cell_type")
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
1 term is not validated: 'astrocytic glia'
    1 synonym found: "astrocytic glia" → "astrocyte"
    → curate synonyms via .standardize("cell_type")
curator.slots["obs"].cat.standardize("cell_type")
 standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"
curator.validate()
 "perturbation" is validated against ULabel.name
 "cell_type" is validated against CellType.name
 "assay_ontology_id" is validated against ExperimentalFactor.ontology_id

Summary

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

  1. Defining validation criteria

  2. Validating data against existing registries

  3. Adding new validated entries to registries

  4. Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.

!rm -rf ./test-curate
!lamin delete --force test-curate
 deleting instance testuser1/test-curate