Curate DataFrames and AnnDatas¶
Curating a dataset with LaminDB means three things:
Validate: ensure the dataset meets predefined validation criteria
Standardize: transform the dataset so that it meets validation criteria, e.g., by fixing typos or using standard instead of ad hoc identifiers
Annotate: link the dataset against validated metadata so that it becomes queryable
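To make the third point concrete, here is a minimal plain-Python sketch of what "queryable" can mean: once labels are validated and linked, datasets can be found by label. The `links` index and the `annotate` and `query` helpers are hypothetical illustrations, not LaminDB's API or implementation.

```python
from collections import defaultdict

# hypothetical in-memory index: validated label -> set of dataset names
links = defaultdict(set)

def annotate(dataset_name, labels, registry):
    """Link a dataset to every label that exists in the registry."""
    for label in labels:
        if label in registry:
            links[label].add(dataset_name)

def query(label):
    """Return the names of all datasets linked to a label, sorted."""
    return sorted(links.get(label, ()))

registry = {"astrocyte", "oligodendrocyte"}  # toy registry of validated labels
annotate("dataset_a", ["astrocyte", "oligodendrocyte"], registry)
annotate("dataset_b", ["astrocyte", "astrocytic glia"], registry)  # 2nd label is not linked

print(query("astrocyte"))        # ['dataset_a', 'dataset_b']
print(query("astrocytic glia"))  # []
```

Only labels that pass validation end up in the index, which is why a non-standard spelling such as "astrocytic glia" cannot be used to find the dataset.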
If a dataset passes validation, curating it takes two lines of code:
curator = ln.Curator.from_df(df, ...) # create a Curator and pass criteria in "..."
curator.save_artifact() # validates the content of the dataset and saves it as an annotated artifact
Beyond having valid content, the curated dataset is now queryable via metadata identifiers found in the dataset because they have been validated & linked against LaminDB registries.
Beyond validating metadata identifiers, LaminDB also validates data types and dataset schema.
How does validation in LaminDB compare to validation in pandera?
Like LaminDB, pandera validates the dataset schema (i.e., column names and dtypes). pandera is only available for DataFrame-like datasets and cannot annotate datasets; i.e., can’t make datasets queryable.
However, it offers an API for range-checks, both for numerical and string-like data. If you need such checks, you can combine LaminDB and pandera-based validation.
import pandas as pd
import pandera as pa
# data to validate
df = pd.DataFrame({
"column1": [1, 4, 0, 10, 9],
"column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
"column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
})
# define schema
schema = pa.DataFrameSchema({
"column1": pa.Column(int, checks=pa.Check.le(10)),
"column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
"column3": pa.Column(str, checks=[
pa.Check.str_startswith("value_"),
# define custom checks as functions that take a Series as input and
# output a boolean or a boolean Series
pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
]),
})
validated_df = schema(df) # this corresponds to curator.validate() in LaminDB
print(validated_df)
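One way to combine the two kinds of validation is to run them as independent gates and accept the dataset only if both pass. The sketch below mimics this in plain Python: `in_range` plays the role of a pandera-style range check and `in_registry` the role of a registry-membership check. Both helpers and the `registry` set are hypothetical stand-ins, not pandera or LaminDB calls.

```python
def in_range(values, upper):
    """Range check: every value must be <= upper."""
    return all(v <= upper for v in values)

def in_registry(values, registry):
    """Membership check: every value must be a known identifier."""
    return all(v in registry for v in values)

column1 = [1, 4, 0, 10, 9]
column3 = ["value_1", "value_2", "value_3", "value_2", "value_1"]
registry = {"value_1", "value_2", "value_3"}  # hypothetical registry content

# run each gate independently so that a failure can be attributed to one check
checks = {
    "column1 <= 10": in_range(column1, upper=10),
    "column3 in registry": in_registry(column3, registry),
}
dataset_is_valid = all(checks.values())
print(checks, dataset_is_valid)
```

Keeping the gates separate means a range failure and an unknown identifier are reported as distinct problems.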
# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty
→ connected lamindb: testuser1/test-curate
Curate a DataFrame¶
Let’s start with a DataFrame that we’d like to validate.
import lamindb as ln
import bionty as bt
import pandas as pd
df = pd.DataFrame(
{
"temperature": [37.2, 36.3, 38.2],
"cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
"assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
"donor": ["D0001", "D0002", "D0003"]
},
index = ["obs1", "obs2", "obs3"]
)
df
→ connected lamindb: testuser1/test-curate
| | temperature | cell_type | assay_ontology_id | donor |
|---|---|---|---|---|
| obs1 | 37.2 | cerebral pyramidal neuron | EFO:0008913 | D0001 |
| obs2 | 36.3 | astrocytic glia | EFO:0008913 | D0002 |
| obs3 | 38.2 | oligodendrocyte | EFO:0008913 | D0003 |
Define validation criteria and create a Curator object.
# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
"cell_type": bt.CellType.name,
"assay_ontology_id": bt.ExperimentalFactor.ontology_id,
"donor": ln.ULabel.name,
}
# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)
✓ added 3 records with Feature.name for "columns": 'cell_type', 'assay_ontology_id', 'donor'
The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).
curate.validate()
• saving validated records of 'cell_type'
✓ added 2 records from public with CellType.name for "cell_type": 'oligodendrocyte', 'astrocyte'
• saving validated records of 'assay_ontology_id'
✓ added 1 record from public with ExperimentalFactor.ontology_id for "assay_ontology_id": 'EFO:0008913'
• mapping "cell_type" on CellType.name
! 2 terms are not validated: 'cerebral pyramidal neuron', 'astrocytic glia'
1 synonym found: "astrocytic glia" → "astrocyte"
→ curate synonyms via .standardize("cell_type") for remaining terms:
→ fix typos, remove non-existent values, or save terms via .add_new_from("cell_type")
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
• mapping "donor" on ULabel.name
! 3 terms are not validated: 'D0001', 'D0002', 'D0003'
→ fix typos, remove non-existent values, or save terms via .add_new_from("donor")
False
# check the non-validated terms
curate.non_validated
{'cell_type': ['cerebral pyramidal neuron', 'astrocytic glia'],
'donor': ['D0001', 'D0002', 'D0003']}
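The dictionary above can be thought of as a per-column set difference between the values in the dataset and the values in the matching registry. A minimal sketch of that idea, assuming toy registries held in plain Python sets (the `find_non_validated` helper is hypothetical and is not how LaminDB implements validation):

```python
def find_non_validated(columns, registries):
    """Return {column: [values missing from its registry]}, in first-seen order."""
    result = {}
    for name, values in columns.items():
        known = registries[name]
        # dict.fromkeys() removes duplicates while preserving order
        missing = [v for v in dict.fromkeys(values) if v not in known]
        if missing:
            result[name] = missing
    return result

columns = {
    "cell_type": ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"],
    "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
    "donor": ["D0001", "D0002", "D0003"],
}
# toy registries; the real ones live in the LaminDB instance
registries = {
    "cell_type": {"oligodendrocyte", "astrocyte"},
    "assay_ontology_id": {"EFO:0008913"},
    "donor": set(),
}

non_validated = find_non_validated(columns, registries)
is_valid = not non_validated  # the dataset validates only if nothing is missing
print(non_validated, is_valid)
```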
For cell_type, we saw that “cerebral pyramidal neuron” and “astrocytic glia” are not validated.
First, let’s standardize the synonym “astrocytic glia” as suggested:
curate.standardize("cell_type")
✓ standardized 1 synonym in "cell_type": "astrocytic glia" → "astrocyte"
# now we have only one non-validated term left
curate.non_validated
{'cell_type': ['cerebral pyramidal neuron'],
'donor': ['D0001', 'D0002', 'D0003']}
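Conceptually, standardizing a column means replacing every known synonym with its canonical name and leaving all other values untouched. A minimal sketch, assuming a toy synonym table in a plain dictionary (the `standardize` function here is a hypothetical illustration, not the Curator method):

```python
def standardize(values, synonyms):
    """Replace each known synonym by its canonical name; keep other values as-is."""
    return [synonyms.get(v, v) for v in values]

synonyms = {"astrocytic glia": "astrocyte"}  # toy synonym table
cell_types = ["cerebral pyramidal neuron", "astrocytic glia", "oligodendrocyte"]

standardized = standardize(cell_types, synonyms)
print(standardized)
```

Values without a synonym entry, such as "cerebral pyramidal neuron", pass through unchanged, which is why one non-validated term remains.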
For “cerebral pyramidal neuron”, let’s understand which cell type in the public ontology might be the actual match.
# to check the correct spelling of categories, pass `public=True` to get a lookup object from public ontologies
# use `lookup = curate.lookup()` to get a lookup object of existing records in your instance
lookup = curate.lookup(public=True)
lookup
Lookup objects from the public:
.cell_type
.assay_ontology_id
.donor
.columns
Example:
→ categories = curator.lookup()["cell_type"]
→ categories.alveolar_type_1_fibroblast_cell
To look up public ontologies, use .lookup(public=True)
# here is an example for the "cell_type" column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron
CellType(ontology_id='CL:4023111', name='cerebral cortex pyramidal neuron', definition='A Pyramidal Neuron With Soma Located In The Cerebral Cortex.', synonyms=None, parents=array(['CL:0000598', 'CL:0010012'], dtype=object))
# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
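The attribute-style access used above (`cell_types.cerebral_cortex_pyramidal_neuron`) relies on record names being turned into valid Python identifiers. The sketch below shows one way such a lookup object could be built; the normalization rule in `to_attribute_name` is an assumption for illustration, and LaminDB's actual rules may differ.

```python
import re
from types import SimpleNamespace

def to_attribute_name(name):
    """Lowercase a name and replace runs of non-alphanumeric characters with '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

def make_lookup(names):
    """Build an object whose attributes map back to the original names."""
    return SimpleNamespace(**{to_attribute_name(n): n for n in names})

toy_lookup = make_lookup(["cerebral cortex pyramidal neuron", "astrocyte"])
print(toy_lookup.cerebral_cortex_pyramidal_neuron)
```

Because attributes can be tab-completed in an interactive session, this pattern helps find the correct spelling of a term without typing it in full.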
For donor, we want to add the new donors “D0001”, “D0002”, and “D0003”:
# this adds donors that were _not_ validated
curate.add_new_from("donor")
✓ added 3 records with ULabel.name for "donor": 'D0001', 'D0003', 'D0002'
# validate again
curate.validate()
• saving validated records of 'cell_type'
✓ added 1 record from public with CellType.name for "cell_type": 'cerebral cortex pyramidal neuron'
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
✓ "donor" is validated against ULabel.name
True
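The reason validation now passes for donor is that the previously unknown values have become registry entries. A minimal sketch of that effect, assuming a toy registry held in a Python set (this `add_new_from` function is a hypothetical stand-in, not the Curator method):

```python
def add_new_from(values, registry):
    """Add all values that are not yet in the registry; return what was added."""
    added = [v for v in dict.fromkeys(values) if v not in registry]
    registry.update(added)
    return added

donor_registry = set()  # toy registry, initially empty
donors = ["D0001", "D0002", "D0003"]

valid_before = all(d in donor_registry for d in donors)
added = add_new_from(donors, donor_registry)
valid_after = all(d in donor_registry for d in donors)
print(valid_before, added, valid_after)
```

Adding new terms should be a deliberate choice: it makes any value valid, including a typo, so it is best reserved for identifiers that are genuinely new.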
Save a curated artifact.
artifact = curate.save_artifact(description="My curated dataframe")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
! 1 unique term (25.00%) is not validated for name: 'temperature'
! did not create Feature record for 1 non-validated name: 'temperature'
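The 25.00% in the warning is consistent with one of the four columns not being registered as a feature: `temperature` was not part of `categoricals`. The arithmetic below is one plausible reading of the log message, not a description of LaminDB's internals.

```python
columns = ["temperature", "cell_type", "assay_ontology_id", "donor"]
registered_features = {"cell_type", "assay_ontology_id", "donor"}  # from `categoricals`

# columns that have no matching feature record
not_registered = [c for c in columns if c not in registered_features]
share = 100 * len(not_registered) / len(columns)
print(f"{len(not_registered)} of {len(columns)} columns ({share:.2f}%): {not_registered}")
```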
artifact.describe(print_types=True)
Artifact .parquet/DataFrame
├── General
│   ├── .uid = 'kqg4mqd2qzm5fJJN0000'
│   ├── .size = 3786
│   ├── .hash = 'LZCfO2VdCCz0bzQ2cGpDEw'
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/kqg4mqd2qzm5fJJN0000.parquet
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2024-12-20 15:03:42
├── Dataset features/.feature_sets
│   └── columns • 3 [Feature]
│       assay_ontology_id   cat[bionty.ExperimentalF…   single-cell RNA sequencing
│       cell_type           cat[bionty.CellType]        astrocyte, cerebral cortex pyramidal neu…
│       donor               cat[ULabel]                 D0001, D0002, D0003
└── Labels
    └── .cell_types             bionty.CellType             oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors   bionty.ExperimentalFactor   single-cell RNA sequencing
        .ulabels                ULabel                      D0001, D0003, D0002
Curate an AnnData¶
Here we additionally specify which var_index to validate against.
import anndata as ad
X = pd.DataFrame(
{
"ENSG00000081059": [1, 2, 3],
"ENSG00000276977": [4, 5, 6],
"ENSG00000198851": [7, 8, 9],
"ENSG00000010610": [10, 11, 12],
"ENSG00000153563": [13, 14, 15],
"ENSGcorrupted": [16, 17, 18]
},
index=df.index # because we already curated the dataframe above, it will validate
)
adata = ad.AnnData(X=X, obs=df)
adata
AnnData object with n_obs × n_vars = 3 × 6
obs: 'temperature', 'cell_type', 'assay_ontology_id', 'donor'
curate = ln.Curator.from_anndata(
adata,
var_index=bt.Gene.ensembl_gene_id, # validate var.index against Gene.ensembl_gene_id
categoricals=categoricals,
organism="human",
)
curate.validate()
• saving validated records of 'var_index'
✓ added 5 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000153563'
• mapping "var_index" on Gene.ensembl_gene_id
! 1 term is not validated: 'ENSGcorrupted'
→ fix typos, remove non-existent values, or save terms via .add_new_from_var_index()
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
✓ "donor" is validated against ULabel.name
False
Non-validated terms can be accessed via:
curate.non_validated
{'var_index': ['ENSGcorrupted']}
Subset the AnnData to validated genes only:
adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
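The slice above keeps every column whose gene identifier is not in the list of non-validated terms. The same filtering logic, written in plain Python on a dictionary of columns so that it runs without AnnData (the dictionary layout is only an illustration of the data above):

```python
# columns of the expression matrix, keyed by gene identifier
X = {
    "ENSG00000081059": [1, 2, 3],
    "ENSG00000276977": [4, 5, 6],
    "ENSG00000198851": [7, 8, 9],
    "ENSG00000010610": [10, 11, 12],
    "ENSG00000153563": [13, 14, 15],
    "ENSGcorrupted": [16, 17, 18],
}
non_validated_genes = ["ENSGcorrupted"]

# keep only columns whose identifier passed validation; order is preserved
X_validated = {
    gene: counts for gene, counts in X.items() if gene not in non_validated_genes
}
print(list(X_validated))
```

Dropping columns discards data, so this is appropriate only when the non-validated identifiers are known to be invalid rather than merely missing from the registry.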
Now let’s validate the subsetted object:
curate = ln.Curator.from_anndata(
adata_validated,
var_index=bt.Gene.ensembl_gene_id, # validate var.index against Gene.ensembl_gene_id
categoricals=categoricals,
organism="human",
)
curate.validate()
✓ "var_index" is validated against Gene.ensembl_gene_id
✓ "cell_type" is validated against CellType.name
✓ "assay_ontology_id" is validated against ExperimentalFactor.ontology_id
✓ "donor" is validated against ULabel.name
True
The validated object can subsequently be saved as an Artifact:
artifact = curate.save_artifact(description="test AnnData")
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
! 1 unique term (25.00%) is not validated for name: 'temperature'
! did not create Feature record for 1 non-validated name: 'temperature'
The saved artifact has been annotated with validated features and labels:
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '2vIY4wTZeETfaqV60000'
│   ├── .size = 20336
│   ├── .hash = '8z6kAdTVBaDIDuA6aivzNg'
│   ├── .n_observations = 3
│   ├── .path = /home/runner/work/lamindb/lamindb/docs/test-curate/.lamindb/2vIY4wTZeETfaqV60000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   └── .created_at = 2024-12-20 15:03:48
├── Dataset features/.feature_sets
│   ├── obs • 3 [Feature]
│   │   assay_ontology_id   cat[bionty.ExperimentalF…   single-cell RNA sequencing
│   │   cell_type           cat[bionty.CellType]        astrocyte, cerebral cortex pyramidal neu…
│   │   donor               cat[ULabel]                 D0001, D0002, D0003
│   └── var • 5 [bionty.Gene]
│       TCF7    int
│       PDCD1   int
│       CD3E    int
│       CD4     int
│       CD8A    int
└── Labels
    └── .cell_types             bionty.CellType             oligodendrocyte, astrocyte, cerebral cor…
        .experimental_factors   bionty.ExperimentalFactor   single-cell RNA sequencing
        .ulabels                ULabel                      D0001, D0003, D0002
We’ve walked through validating, standardizing, and annotating datasets in these key steps:
Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata
By following these steps, you can ensure your data is standardized and well-curated.
If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.