Curate datasets¶
Curating a dataset with LaminDB means three things:
Validate that the dataset matches a desired schema
In case the dataset doesn’t validate, standardize it, e.g., by fixing typos or mapping synonyms
Annotate the dataset by linking it against metadata entities so that it becomes queryable
In this guide we’ll curate common data structures. Here is a guide for the underlying low-level API.
Curate a DataFrame¶
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --modules bionty
Show code cell output
→ initialized lamindb: testuser1/test-curate
Let’s start with a DataFrame that we’d like to validate.
import lamindb as ln
import bionty as bt
import pandas as pd
df = ln.core.datasets.small_dataset1(
with_cell_type_synonym=True, with_cell_type_typo=True
)
df
Show code cell output
→ connected lamindb: testuser1/test-curate
| ENSG00000153563 | ENSG00000010610 | ENSG00000170458 | perturbation | sample_note | cell_type_by_expert | cell_type_by_model | assay_oid | concentration | treatment_time_h | donor |
---|---|---|---|---|---|---|---|---|---|---|---|
sample1 | 1 | 3 | 5 | DMSO | was ok | B-cell | B cell | EFO:0008913 | 0.1% | 24 | D0001 |
sample2 | 2 | 4 | 6 | IFNG | looks naah | CD8-pos alpha-beta T cell | T cell | EFO:0008913 | 200 nM | 24 | D0002 |
sample3 | 3 | 5 | 7 | DMSO | pretty! 🤩 | CD8-pos alpha-beta T cell | T cell | EFO:0008913 | 0.1% | 6 | None |
Define a schema that specifies the minimal set of columns we expect in such a dataset.
schema = ln.Schema(
name="My immuno schema",
features=[
ln.Feature(name="perturbation", dtype=ln.ULabel).save(),
ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save(),
ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save(),
ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save(),
ln.Feature(name="donor", dtype=str, nullable=True).save(),
ln.Feature(name="concentration", dtype=str).save(),
ln.Feature(name="treatment_time_h", dtype=float, coerce_dtype=True).save(),
],
).save()
# display the associated features as a dataframe
schema.features.df()
Show code cell output
id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | FAwDqDmJehnC | perturbation | cat[ULabel] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:45.935000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
2 | PYnOIEn8zcSt | cell_type_by_model | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.217000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
3 | EZzWJLMxuVEd | cell_type_by_expert | cat[bionty.CellType] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.223000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
4 | 8nMeyS4dM9M9 | assay_oid | cat[bionty.ExperimentalFactor.ontology_id] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.229000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
5 | 8ZTLDeWt1Qbe | donor | str | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.235000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
6 | cO1g6Ht9piw3 | concentration | str | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.240000+00:00 | 1 | {'af': {'0': None, '1': True, '2': False}} | 1 |
7 | o8qclzwfRg5W | treatment_time_h | float | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | None | 2025-03-16 20:57:46.246000+00:00 | 1 | {'af': {'0': None, '1': True, '2': True}} | 1 |
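To make the roles of `nullable` and `coerce_dtype` more concrete, here is a minimal sketch in plain Python. It does not use LaminDB: `ColumnSpec` and `check_rows` are hypothetical names invented for this illustration, and the logic is only an assumption about what a "minimal set of columns" check could look like (required columns must be present, extra columns are tolerated).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Hypothetical stand-in for a feature definition; not a LaminDB class."""

    name: str
    dtype: type
    nullable: bool = False
    coerce: bool = False


def check_rows(rows, specs):
    """Check each row against the specs; return (cleaned_rows, problems)."""
    cleaned, problems = [], []
    for i, row in enumerate(rows):
        out = dict(row)  # columns not covered by the specs are kept untouched
        for spec in specs:
            if spec.name not in row:
                problems.append(f"row {i}: missing column {spec.name!r}")
                continue
            value = row[spec.name]
            if value is None:
                if not spec.nullable:
                    problems.append(f"row {i}: {spec.name!r} must not be null")
                continue
            if not isinstance(value, spec.dtype):
                if spec.coerce:
                    try:
                        out[spec.name] = spec.dtype(value)
                    except (TypeError, ValueError):
                        problems.append(f"row {i}: cannot coerce {spec.name!r}")
                else:
                    problems.append(
                        f"row {i}: {spec.name!r} is not {spec.dtype.__name__}"
                    )
        cleaned.append(out)
    return cleaned, problems


specs = [
    ColumnSpec("donor", str, nullable=True),
    ColumnSpec("concentration", str),
    ColumnSpec("treatment_time_h", float, coerce=True),
]
rows = [
    {"donor": "D0001", "concentration": "0.1%", "treatment_time_h": 24, "sample_note": "was ok"},
    {"donor": None, "concentration": "0.1%", "treatment_time_h": 6, "sample_note": "pretty!"},
]
cleaned, problems = check_rows(rows, specs)
```

In this sketch the integer `treatment_time_h` values are converted to floats, the missing donor is accepted because the column is nullable, and the extra `sample_note` column passes through unchecked.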
Create a Curator using the dataset and the schema.
curator = ln.curators.DataFrameCurator(df, schema)
The validate() method validates that your dataset adheres to the criteria defined by the schema. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).
try:
curator.validate()
except ln.errors.ValidationError:
pass
Show code cell output
• saving validated records of 'cell_type_by_model'
✓ added 2 records from public with CellType.name for "cell_type_by_model": 'T cell', 'B cell'
• saving validated records of 'assay_oid'
✓ added 1 record from public with ExperimentalFactor.ontology_id for "assay_oid": 'EFO:0008913'
• mapping "perturbation" on ULabel.name
! 2 terms are not validated: 'DMSO', 'IFNG'
→ fix typos, remove non-existent values, or save terms via .add_new_from("perturbation")
✓ "cell_type_by_model" is validated against CellType.name
• mapping "cell_type_by_expert" on CellType.name
! 2 terms are not validated: 'B-cell', 'CD8-pos alpha-beta T cell'
1 synonym found: "B-cell" → "B cell"
→ curate synonyms via .standardize("cell_type_by_expert")
for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type_by_expert")
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
# check the non-validated terms
curator.cat.non_validated
Show code cell output
{'perturbation': ['DMSO', 'IFNG'],
'cell_type_by_expert': ['B-cell', 'CD8-pos alpha-beta T cell']}
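Conceptually, the mapping above is a per-column set difference between the values in the dataset and the values known to a registry. The following is a minimal sketch of that idea in plain Python; the function `non_validated` and the in-memory `registries` dictionary are hypothetical stand-ins, not LaminDB's implementation.

```python
def non_validated(table, registries):
    """Map each checked column to its unique values missing from the registry."""
    report = {}
    for column, known in registries.items():
        # dict.fromkeys de-duplicates while preserving first-seen order
        missing = [v for v in dict.fromkeys(table[column]) if v not in known]
        if missing:
            report[column] = missing
    return report


# toy column data mirroring the example dataset
table = {
    "perturbation": ["DMSO", "IFNG", "DMSO"],
    "cell_type_by_model": ["B cell", "T cell", "T cell"],
    "cell_type_by_expert": ["B-cell", "CD8-pos alpha-beta T cell", "CD8-pos alpha-beta T cell"],
}
# hypothetical registry contents at this point of the walkthrough
registries = {
    "perturbation": set(),
    "cell_type_by_model": {"B cell", "T cell"},
    "cell_type_by_expert": {"B cell", "T cell"},
}
report = non_validated(table, registries)
```

Columns whose values are all known, such as `cell_type_by_model` here, do not appear in the report.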
For cell_type_by_expert, we saw that “B-cell” and “CD8-pos alpha-beta T cell” are not validated.
First, let’s standardize the synonym “B-cell” as suggested.
curator.cat.standardize("cell_type_by_expert")
Show code cell output
✓ standardized 1 synonym in "cell_type_by_expert": "B-cell" → "B cell"
# now we have only one non-validated cell type left
curator.cat.non_validated
Show code cell output
{'perturbation': ['DMSO', 'IFNG'],
'cell_type_by_expert': ['CD8-pos alpha-beta T cell']}
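Standardization of this kind can be thought of as a dictionary lookup from synonym to canonical name. The sketch below assumes that each record stores its synonyms as a pipe-separated string; `build_synonym_map`, `standardize`, and the example records are hypothetical and only illustrate the mechanism.

```python
def build_synonym_map(records):
    """Map every synonym to the canonical name of its record."""
    mapping = {}
    for record in records:
        # filter(None, ...) skips empty entries, e.g. from an empty synonyms string
        for synonym in filter(None, record["synonyms"].split("|")):
            mapping[synonym] = record["name"]
    return mapping


def standardize(values, mapping):
    """Replace synonyms with canonical names; return new values and what changed."""
    changed = {v: mapping[v] for v in dict.fromkeys(values) if v in mapping}
    return [mapping.get(v, v) for v in values], changed


# hypothetical records; the synonym strings are illustrative only
records = [
    {"name": "B cell", "synonyms": "B-cell|B lymphocyte"},
    {"name": "T cell", "synonyms": "T-cell|T lymphocyte"},
]
values = ["B-cell", "CD8-pos alpha-beta T cell", "CD8-pos alpha-beta T cell"]
standardized, changed = standardize(values, build_synonym_map(records))
```

A value that is neither a canonical name nor a known synonym, like the abbreviated CD8 label, is left untouched, which is why it still shows up as non-validated.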
For “CD8-pos alpha-beta T cell”, let’s understand which cell type in the public ontology might be the actual match.
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curator.cat.lookup()` to get a lookup object of existing records in your instance
lookup = curator.cat.lookup(public=True)
lookup
Show code cell output
Lookup objects from the public:
.perturbation
.cell_type_by_model
.cell_type_by_expert
.assay_oid
.columns
Example:
→ categories = curator.lookup()["cell_type"]
→ categories.alveolar_type_1_fibroblast_cell
To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type_by_expert" column
cell_types = lookup["cell_type_by_expert"]
cell_types.cd8_positive_alpha_beta_t_cell
Show code cell output
CellType(ontology_id='CL:0000625', name='CD8-positive, alpha-beta T cell', definition='A T Cell Expressing An Alpha-Beta T Cell Receptor And The Cd8 Coreceptor.', synonyms='CD8-positive, alpha-beta T-cell|CD8-positive, alpha-beta T lymphocyte|CD8-positive, alpha-beta T-lymphocyte', parents=array(['CL:0000791'], dtype=object))
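The attribute name `cd8_positive_alpha_beta_t_cell` is derived from the record name `CD8-positive, alpha-beta T cell`. How exactly the lookup object builds its keys is not shown here, so the following is only a guess at one plausible normalization (lowercase, with runs of non-alphanumeric characters collapsed to underscores). `to_lookup_key` and `make_lookup` are hypothetical helpers.

```python
import re
from types import SimpleNamespace


def to_lookup_key(name):
    """Turn a free-text name into an attribute-style key (assumed normalization)."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")


def make_lookup(names):
    """Build a simple object with one attribute per name, for tab completion."""
    return SimpleNamespace(**{to_lookup_key(n): n for n in names})


lookup_sketch = make_lookup(["CD8-positive, alpha-beta T cell", "B cell"])
match = lookup_sketch.cd8_positive_alpha_beta_t_cell
```

Under this assumed rule, the name from the example message, "alveolar type 1 fibroblast cell", would map to `alveolar_type_1_fibroblast_cell`.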
# fix the cell type name
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
{"CD8-pos alpha-beta T cell": cell_types.cd8_positive_alpha_beta_t_cell.name}
)
For perturbation, we want to add the new values “DMSO” and “IFNG”.
# this adds perturbations that were _not_ validated
curator.cat.add_new_from("perturbation")
Show code cell output
✓ added 2 records with ULabel.name for "perturbation": 'IFNG', 'DMSO'
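As a mental model, adding new terms amounts to inserting only those values that the registry does not contain yet. The sketch below uses a plain Python set as a hypothetical registry; the function shares its name with the curator method purely for readability and is not the library's code.

```python
def add_new_from(values, registry):
    """Add values missing from the registry; return the newly added ones."""
    new = [v for v in dict.fromkeys(values) if v not in registry]
    registry.update(new)
    return new


perturbations = set()  # hypothetical, initially empty label registry
added = add_new_from(["DMSO", "IFNG", "DMSO"], perturbations)
added_again = add_new_from(["DMSO", "IFNG"], perturbations)
```

Because the operation only adds missing values, running it twice is harmless. It also accepts any value without checking it, so typos are better fixed before this step.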
# validate again
curator.validate()
Show code cell output
• saving validated records of 'cell_type_by_expert'
✓ added 1 record from public with CellType.name for "cell_type_by_expert": 'CD8-positive, alpha-beta T cell'
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
✓ "cell_type_by_expert" is validated against CellType.name
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
Save a curated artifact.
artifact = curator.save_artifact(key="my_datasets/my_curated_dataset.parquet")
Show code cell output
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_dataset.parquet'
✓ storing artifact 'NrBenO34pMhO1TUA0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/NrBenO34pMhO1TUA0000.parquet'
! run input wasn't tracked, call `ln.track()` and re-run
✓ 7 unique terms (63.60%) are validated for name
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
✓ loaded 7 Feature records matching name: 'perturbation', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
! did not create Feature records for 4 non-validated names: 'ENSG00000010610', 'ENSG00000153563', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='w5vFPFAWKul4xgzITGx5', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='7xSKmcRDwl2GnTv9sG_ivQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:46 UTC)
! updated otype from None to DataFrame
artifact.describe()
Show code cell output
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'NrBenO34pMhO1TUA0000'
│   ├── .key = 'my_datasets/my_curated_dataset.parquet'
│   ├── .size = 9012
│   ├── .hash = 'iBiiWBkIitgFtLcru2CLyA'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/NrBenO34pMhO1TUA0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-16 20:57:50
├── Dataset features/.feature_sets
│   └── columns • 7 [Feature]
│       assay_oid              cat[bionty.ExperimentalF…   single-cell RNA sequencing
│       cell_type_by_expert    cat[bionty.CellType]        B cell, CD8-positive, alpha-beta T cell
│       cell_type_by_model     cat[bionty.CellType]        B cell, T cell
│       perturbation           cat[ULabel]                 DMSO, IFNG
│       donor                  str
│       concentration          str
│       treatment_time_h       float
└── Labels
    └── .cell_types            bionty.CellType             T cell, B cell, CD8-positive, alpha-beta…
        .experimental_factors  bionty.ExperimentalFactor   single-cell RNA sequencing
        .ulabels               ULabel                      IFNG, DMSO
Curate an AnnData¶
Here we additionally specify which var_index to validate against.
import anndata as ad
X = pd.DataFrame(
{
"ENSG00000081059": [1, 2, 3],
"ENSG00000276977": [4, 5, 6],
"ENSG00000198851": [7, 8, 9],
"ENSG00000010610": [10, 11, 12],
"ENSG00000153563": [13, 14, 15],
"ENSGcorrupted": [16, 17, 18],
},
index=df.index, # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata
Show code cell output
AnnData object with n_obs × n_vars = 3 × 6
obs: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'perturbation', 'sample_note', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
# define var schema
var_schema = ln.Schema(
name="my_var_schema",
itype=bt.Gene.ensembl_gene_id, # identifier type
dtype=int,
).save()
# define composite schema
anndata_schema = ln.Schema(
name="small_dataset1_anndata_schema",
otype="AnnData", # object type
components={"obs": schema, "var": var_schema},
).save()
Check the slots of a schema:
anndata_schema.slots
Show code cell output
{'obs': Schema(uid='w5vFPFAWKul4xgzITGx5', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='7xSKmcRDwl2GnTv9sG_ivQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:46 UTC),
'var': Schema(uid='FcZnUq79tgCbPxugGQdg', name='my_var_schema', n=-1, dtype='int', itype='bionty.Gene.ensembl_gene_id', is_type=False, hash='EQaIs3JSpQGzwUVoubcUbA', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:50 UTC)}
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
try:
curator.validate()
except ln.errors.ValidationError as error:
print(error)
Show code cell output
• saving validated records of 'columns'
✓ added 5 records from public with Gene.ensembl_gene_id for "columns": 'ENSG00000081059', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000276977', 'ENSG00000153563'
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
✓ "cell_type_by_expert" is validated against CellType.name
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
Subset the AnnData to validated genes only:
adata_validated = adata[:, ~adata.var.index.isin(["ENSGcorrupted"])].copy()
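Instead of hard-coding the offending identifier, malformed IDs can be detected with a simple format check. The sketch below relies on the fact that human Ensembl gene IDs consist of the prefix `ENSG` followed by 11 digits. It is only a syntactic pre-filter: a well-formed ID may still be absent from the registry, and versioned IDs with a suffix such as `.1` are deliberately not matched. The helper `split_gene_ids` is a hypothetical name.

```python
import re

# human Ensembl gene IDs: "ENSG" followed by exactly 11 digits
ENSEMBL_HUMAN_GENE = re.compile(r"ENSG\d{11}")


def split_gene_ids(gene_ids):
    """Split IDs into (well_formed, malformed) based on their format alone."""
    well_formed, malformed = [], []
    for gene_id in gene_ids:
        target = well_formed if ENSEMBL_HUMAN_GENE.fullmatch(gene_id) else malformed
        target.append(gene_id)
    return well_formed, malformed


gene_ids = [
    "ENSG00000081059",
    "ENSG00000276977",
    "ENSG00000198851",
    "ENSG00000010610",
    "ENSG00000153563",
    "ENSGcorrupted",
]
well_formed, malformed = split_gene_ids(gene_ids)
```

The resulting `malformed` list could take the place of the hard-coded list passed to `isin(...)` in the subsetting step.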
Now let’s validate the subsetted object:
curator = ln.curators.AnnDataCurator(adata_validated, anndata_schema)
curator.validate()
Show code cell output
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
✓ "cell_type_by_expert" is validated against CellType.name
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
The validated AnnData can be subsequently saved as an Artifact:
artifact = curator.save_artifact(key="my_datasets/my_curated_anndata.h5ad")
Show code cell output
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
✓ "cell_type_by_expert" is validated against CellType.name
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
! no run & transform got linked, call `ln.track()` & re-run
• path content will be copied to default storage upon `save()` with key 'my_datasets/my_curated_anndata.h5ad'
✓ storing artifact 'QvRVF7drw5qTvIdq0000' at '/home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/QvRVF7drw5qTvIdq0000.h5ad'
! run input wasn't tracked, call `ln.track()` and re-run
• parsing feature names of X stored in slot 'var'
✓ 5 unique terms (100.00%) are validated for ensembl_gene_id
✓ linked: Schema(uid='D6Z7pHSctTHeawGW9OBx', n=5, dtype='int', itype='bionty.Gene', is_type=False, hash='nmFTQkXy239ruKDl8gDLSw', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=<django.db.models.expressions.DatabaseDefault object at 0x7f2d9c72ce50>)
• parsing feature names of slot 'obs'
✓ 7 unique terms (63.60%) are validated for name
! 4 unique terms (36.40%) are not validated for name: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'sample_note'
✓ loaded 7 Feature records matching name: 'perturbation', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
! did not create Feature records for 4 non-validated names: 'ENSG00000010610', 'ENSG00000153563', 'ENSG00000170458', 'sample_note'
→ returning existing schema with same hash: Schema(uid='w5vFPFAWKul4xgzITGx5', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='7xSKmcRDwl2GnTv9sG_ivQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:46 UTC)
! updated otype from None to DataFrame
✓ linked: Schema(uid='w5vFPFAWKul4xgzITGx5', name='My immuno schema', n=7, itype='Feature', is_type=False, otype='DataFrame', hash='7xSKmcRDwl2GnTv9sG_ivQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:46 UTC)
✓ saved 1 feature set for slot: 'var'
Access the schema for each slot:
artifact.features.slots
Show code cell output
{'var': Schema(uid='D6Z7pHSctTHeawGW9OBx', n=5, dtype='int', itype='bionty.Gene', is_type=False, hash='nmFTQkXy239ruKDl8gDLSw', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:54 UTC),
'obs': Schema(uid='w5vFPFAWKul4xgzITGx5', name='My immuno schema', n=7, itype='Feature', is_type=False, hash='7xSKmcRDwl2GnTv9sG_ivQ', minimal_set=True, ordered_set=False, maximal_set=False, space_id=1, created_by_id=1, created_at=2025-03-16 20:57:46 UTC)}
The saved artifact has been annotated with validated features and labels:
artifact.describe()
Show code cell output
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'QvRVF7drw5qTvIdq0000'
│   ├── .key = 'my_datasets/my_curated_anndata.h5ad'
│   ├── .size = 31400
│   ├── .hash = 'DqsPrPqnKBCg7zrD5IPBtw'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/QvRVF7drw5qTvIdq0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2025-03-16 20:57:54
├── Dataset features/.feature_sets
│   ├── var • 5 [bionty.Gene]
│   │   TCF7                   int
│   │   CD3E                   int
│   │   CD4                    int
│   │   PDCD1                  int
│   │   CD8A                   int
│   └── obs • 7 [Feature]
│       assay_oid              cat[bionty.ExperimentalF…   single-cell RNA sequencing
│       cell_type_by_expert    cat[bionty.CellType]        B cell, CD8-positive, alpha-beta T cell
│       cell_type_by_model     cat[bionty.CellType]        B cell, T cell
│       perturbation           cat[ULabel]                 DMSO, IFNG
│       donor                  str
│       concentration          str
│       treatment_time_h       float
└── Labels
    └── .cell_types            bionty.CellType             T cell, B cell, CD8-positive, alpha-beta…
        .experimental_factors  bionty.ExperimentalFactor   single-cell RNA sequencing
        .ulabels               ULabel                      IFNG, DMSO
Standardize an AnnData¶
If you need more control, you can access DataFrameCurator objects for the "var" and "obs" slots, respectively.
curator.slots
Show code cell output
{'obs': <lamindb.curators.DataFrameCurator at 0x7f2df071afe0>,
'var': <lamindb.curators.DataFrameCurator at 0x7f2da8eaf970>}
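The slot mapping suggests a composite design: one object that validates the whole dataset by delegating to a checker per slot, while still exposing each checker for targeted fixes. Here is a minimal sketch of that pattern in plain Python. `ColumnChecker` and `CompositeChecker` are hypothetical classes written for this illustration and say nothing about how lamindb structures its curators internally.

```python
class ColumnChecker:
    """Hypothetical per-slot checker; a stand-in, not lamindb's DataFrameCurator."""

    def __init__(self, values, registry):
        self.values = list(values)
        self.registry = set(registry)

    def non_validated(self):
        """Unique values that are not in the registry, in first-seen order."""
        return [v for v in dict.fromkeys(self.values) if v not in self.registry]

    def standardize(self, synonyms):
        """Replace known synonyms in place."""
        self.values = [synonyms.get(v, v) for v in self.values]

    def validate(self):
        return not self.non_validated()


class CompositeChecker:
    """Validates a dataset by delegating to one checker per slot."""

    def __init__(self, slots):
        self.slots = dict(slots)

    def validate(self):
        # evaluate every slot (no short-circuit) so that all slots are checked
        results = {name: checker.validate() for name, checker in self.slots.items()}
        return all(results.values())


composite = CompositeChecker(
    {
        "obs": ColumnChecker(["B-cell", "T cell"], {"B cell", "T cell"}),
        "var": ColumnChecker(["ENSG00000153563"], {"ENSG00000153563"}),
    }
)
first_pass = composite.validate()  # fails because of the synonym in "obs"
composite.slots["obs"].standardize({"B-cell": "B cell"})
second_pass = composite.validate()
```

Fixing a single slot and then re-validating the composite mirrors the workflow in this section: standardize through the `"obs"` slot, then call `validate()` on the top-level curator again.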
# revert the previous cell type standardization
df["cell_type_by_expert"] = df["cell_type_by_expert"].cat.rename_categories(
{"B cell": "B-cell"}
)
# an AnnData where a cell type matches a synonym
adata_with_synonym = ad.AnnData(X=adata_validated.X, var=adata_validated.var, obs=df)
adata_with_synonym
Show code cell output
AnnData object with n_obs × n_vars = 3 × 5
obs: 'ENSG00000153563', 'ENSG00000010610', 'ENSG00000170458', 'perturbation', 'sample_note', 'cell_type_by_expert', 'cell_type_by_model', 'assay_oid', 'concentration', 'treatment_time_h', 'donor'
curator = ln.curators.AnnDataCurator(adata_with_synonym, anndata_schema)
try:
curator.validate()
except ln.errors.ValidationError:
pass
Show code cell output
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
• mapping "cell_type_by_expert" on CellType.name
! 1 term is not validated: 'B-cell'
1 synonym found: "B-cell" → "B cell"
→ curate synonyms via .standardize("cell_type_by_expert")
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
curator.slots["obs"].cat.standardize("cell_type_by_expert")
Show code cell output
✓ standardized 1 synonym in "cell_type_by_expert": "B-cell" → "B cell"
curator.validate()
Show code cell output
✓ "perturbation" is validated against ULabel.name
✓ "cell_type_by_model" is validated against CellType.name
✓ "cell_type_by_expert" is validated against CellType.name
✓ "assay_oid" is validated against ExperimentalFactor.ontology_id
Summary¶
We’ve walked through validating, standardizing, and annotating datasets in these key steps:
Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata
By following these steps, you can ensure your data is standardized and well-curated.
If you have other data structures, read: How do I validate & annotate arbitrary data structures?.
Show code cell content
!rm -rf ./test-curate
!lamin delete --force test-curate
• deleting instance testuser1/test-curate