Curate dataframes with an EHR schema
In a previous guide, you defined generic Schema objects for DataFrame and other objects.
This guide walks through an example EHR schema.
For a comparable schema related to scRNA-seq data, see the CELLxGENE schema (Curate AnnData based on the CELLxGENE schema).
# pip install 'lamindb[bionty]'
!lamin init --storage ./test-ehrschema --modules bionty
→ initialized lamindb: testuser1/test-ehrschema
import lamindb as ln
import bionty as bt
import pandas as pd
ln.track("2XEr2IA4n1w40000")
→ connected lamindb: testuser1/test-ehrschema
→ created Transform('2XEr2IA4n1w40000'), started new Run('6G5Xm9AQ...') at 2025-02-20 07:29:38 UTC
→ notebook imports: bionty==1.1.0 lamindb==1.1.0 pandas==2.2.3
We want to ensure that:
- the dataframe has the columns disease, phenotype, developmental_stage, and age
- if columns or values are missing, we standardize the dataframe with default values
- any values that are present map against specific versions of pre-defined ontologies
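To make these three requirements concrete, here is a minimal plain-Python sketch of what such rules amount to. It is a conceptual stand-in, not lamindb's implementation: `ColumnRule` and `standardize_and_check` are hypothetical names, and the table is a plain dict of lists rather than a DataFrame.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnRule:  # hypothetical helper, not a lamindb class
    name: str
    default: Optional[str] = None         # fills missing values when set
    allowed: Optional[frozenset] = None   # permitted terms; None means unrestricted


def standardize_and_check(table, rules):
    """Return a standardized copy of the table and a list of remaining problems."""
    fixed = {name: list(values) for name, values in table.items()}
    problems = []
    for rule in rules:
        # Requirement 1: the column must exist.
        if rule.name not in fixed:
            problems.append(f"missing column: {rule.name}")
            continue
        # Requirement 2: missing values are replaced by the default.
        if rule.default is not None:
            fixed[rule.name] = [
                rule.default if v is None else v for v in fixed[rule.name]
            ]
        # Requirement 3: present values must be among the allowed terms.
        if rule.allowed is not None:
            unknown = sorted(
                {v for v in fixed[rule.name] if v not in rule.allowed}, key=str
            )
            if unknown:
                problems.append(f"{rule.name}: unknown terms {unknown}")
    return fixed, problems


rules = [
    ColumnRule("age"),
    ColumnRule("disease", default="normal", allowed=frozenset({"normal", "asthma"})),
]
table = {"disease": ["asthma", None, "Hypertension"], "patient_age": [12, 60, 65]}

fixed, problems = standardize_and_check(table, rules)
print(fixed["disease"])  # ['asthma', 'normal', 'Hypertension']
print(problems)
```

The rest of this guide expresses the same kind of rules with ln.Schema and ln.Feature, with the allowed terms coming from versioned ontologies instead of a hand-written set.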
Define a schema
Let us first define the ontology versions we want to use.
disease_ontology = bt.Source.get(
entity="bionty.Disease", name="mondo", version="2023-04-04"
)
developmental_stage_ontology = bt.Source.get(
entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10"
)
phenotype_ontology = bt.Source.get(
entity="bionty.Phenotype",
name="hp",
version="2023-06-17",
organism="human",
)
Let us now create a schema by defining the features that it measures. The ontology versions are captured via their uid.
schema = ln.Schema(
    name="My EHR schema",
    features=[
        ln.Feature(name="age", dtype=int).save(),
        ln.Feature(
            name="disease",
            dtype=bt.Disease,
            default_value="normal",
            nullable=False,
            cat_filters={"source__uid": disease_ontology.uid},
        ).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": developmental_stage_ontology.uid},
        ).save(),
        ln.Feature(
            name="phenotype",
            dtype=bt.Phenotype,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": phenotype_ontology.uid},
        ).save(),
    ],
).save()
# look at a dataframe of the features that are part of the schema
schema.features.df()
id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | MDqWMXLb5IaZ | age | int | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.653000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1
2 | Zt0PCVFDkg5i | disease | cat[bionty.Disease[source__uid='Hgw08Vk3']] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.659000+00:00 | 1 | {'af': {'0': 'normal', '1': False}} | 1
3 | a1egkOJDZ7jc | developmental_stage | cat[bionty.DevelopmentalStage[source__uid='7Zm... | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.665000+00:00 | 1 | {'af': {'0': 'unknown', '1': False}} | 1
4 | ICQ9masISOdM | phenotype | cat[bionty.Phenotype[source__uid='451W7iJS']] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.670000+00:00 | 1 | {'af': {'0': 'unknown', '1': False}} | 1
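The dtype column in the output above shows how each categorical feature records the registry it maps to and the source uid that pins the ontology version. As a reading aid, the following sketch splits such a string into its parts. The layout `cat[<entity>[<key>='<value>']]` is an assumption inferred from the printed output, and `parse_cat_dtype` is a hypothetical helper, not a lamindb function.

```python
import re

# Assumed layout, inferred from the printed dtype strings only.
DTYPE_PATTERN = re.compile(
    r"^cat\[(?P<entity>[\w.]+)\[(?P<key>\w+)='(?P<value>[^']*)'\]\]$"
)


def parse_cat_dtype(dtype):
    """Split a categorical dtype string into (entity, {filter_key: filter_value})."""
    match = DTYPE_PATTERN.match(dtype)
    if match is None:
        raise ValueError(f"unrecognized dtype string: {dtype!r}")
    return match["entity"], {match["key"]: match["value"]}


entity, filters = parse_cat_dtype("cat[bionty.Disease[source__uid='Hgw08Vk3']]")
print(entity)   # bionty.Disease
print(filters)  # {'source__uid': 'Hgw08Vk3'}
```

Read this way, the disease feature accepts only bionty.Disease terms whose source has the uid shown, which is how the schema stays tied to one ontology version.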
Curate an example dataset
Create an example DataFrame that has all required columns, one of which is misnamed.
dataset = {
    "disease": pd.Categorical(
        [
            "Alzheimer disease",
            "diabetes mellitus",
            pd.NA,
            "Hypertension",
            "asthma",
        ]
    ),
    "phenotype": pd.Categorical(
        [
            "Mental deterioration",
            "Hyperglycemia",
            "Tumor growth",
            "Increased blood pressure",
            "Airway inflammation",
        ]
    ),
    "developmental_stage": pd.Categorical(
        ["Adult", "Adult", "Adult", "Adult", "Child"]
    ),
    "patient_age": [70, 55, 60, 65, 12],
}
df = pd.DataFrame(dataset)
df
  | disease | phenotype | developmental_stage | patient_age
---|---|---|---|---
0 | Alzheimer disease | Mental deterioration | Adult | 70
1 | diabetes mellitus | Hyperglycemia | Adult | 55
2 | NaN | Tumor growth | Adult | 60
3 | Hypertension | Increased blood pressure | Adult | 65
4 | asthma | Airway inflammation | Child | 12
Let’s validate it.
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("column 'age' not in dataframe")
    print(e)
column 'age' not in dataframe. Columns in dataframe: ['disease', 'phenotype', 'developmental_stage', 'patient_age']
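The error names the missing column and lists the columns that are present. In a wide table, finding the misnamed column by eye gets tedious. Here is a minimal sketch of one way to shortlist candidates, where `rename_candidates` is a hypothetical helper (not part of lamindb) that looks for the expected name as an underscore-separated token:

```python
def rename_candidates(expected, actual):
    """For each expected column that is absent, list actual columns
    that contain its name as a '_'-separated token."""
    present = set(actual)
    candidates = {}
    for name in expected:
        if name in present:
            continue
        candidates[name] = [col for col in actual if name in col.split("_")]
    return candidates


expected = ["age", "disease", "developmental_stage", "phenotype"]
actual = ["disease", "phenotype", "developmental_stage", "patient_age"]

print(rename_candidates(expected, actual))  # {'age': ['patient_age']}
```

Matching on whole tokens rather than substrings keeps developmental_stage out of the result even though it ends in "age".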
Fix the name of the patient_age column to be age.
df.columns = df.columns.str.replace("patient_age", "age")
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
    print(e)
non-nullable series 'disease' contains null values:
2 NaN
Name: disease, dtype: category
Categories (4, object): ['Alzheimer disease', 'Hypertension', 'asthma', 'diabetes mellitus']
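A side note on the renaming step above: it replaces a substring in every column name, which is fine for this small table but can catch more than intended in a larger one. The plain-Python sketch below illustrates the difference, assuming the pandas call performs the same literal substring replacement as Python's `str.replace`. The column `patient_age_group` is hypothetical and only there to show the effect.

```python
columns = ["disease", "patient_age", "patient_age_group"]  # last name is hypothetical

# Substring replacement rewrites every column that contains the old name.
by_substring = [c.replace("patient_age", "age") for c in columns]

# An exact-match mapping only touches the intended column.
mapping = {"patient_age": "age"}
by_mapping = [mapping.get(c, c) for c in columns]

print(by_substring)  # ['disease', 'age', 'age_group']
print(by_mapping)    # ['disease', 'age', 'patient_age_group']
```

In pandas, DataFrame.rename with a columns mapping gives the exact-match behavior.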
Standardize the dataframe so that the missing value gets populated with the default value.
curator.standardize()
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith(
        "2 terms are not validated: 'Tumor growth', 'Airway inflammation'"
    )
    print(e)
• saving validated records of 'disease'
✓ added 3 records from public with Disease.name for "disease": 'diabetes mellitus', 'asthma', 'Alzheimer disease'
• saving validated records of 'phenotype'
✓ added 3 records from public with Phenotype.name for "phenotype": 'Hyperglycemia', 'Increased blood pressure', 'Mental deterioration'
• mapping "disease" on Disease.name
! 2 terms are not validated: 'normal', 'Hypertension'
→ fix typos, remove non-existent values, or save terms via .add_new_from("disease")
• mapping "developmental_stage" on DevelopmentalStage.name
! 2 terms are not validated: 'Adult', 'Child'
→ fix typos, remove non-existent values, or save terms via .add_new_from("developmental_stage")
• mapping "phenotype" on Phenotype.name
! 2 terms are not validated: 'Tumor growth', 'Airway inflammation'
→ fix typos, remove non-existent values, or save terms via .add_new_from("phenotype")
2 terms are not validated: 'Tumor growth', 'Airway inflammation'
→ fix typos, remove non-existent values, or save terms via .add_new_from("phenotype")
Add the ‘normal’ term to the disease registry.
bt.Disease(name="normal", description="Healthy condition").save()
Disease(uid='7kTPatVd', name='normal', description='Healthy condition', created_by_id=1, run_id=1, space_id=1, created_at=2025-02-20 07:29:48 UTC)
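The log above reports 'normal' as not validated even though it is the schema's own default value. The toy model below suggests why. It assumes, based only on the log messages, that values found in the public ontology are imported into the local registry and that everything else is flagged until a matching record is saved. The term sets are illustrative, not the real Mondo content, and `validate` is a hypothetical stand-in.

```python
# Illustrative stand-ins for the public ontology and the local registry.
public_terms = {
    "asthma",
    "diabetes mellitus",
    "Alzheimer disease",
    "hypertensive disorder",
}
registry = set()  # terms saved in the local instance


def validate(values):
    """Import values found in the public ontology, then return the
    values that are still unknown to the registry."""
    registry.update(v for v in values if v in public_terms)
    return [v for v in values if v not in registry]


column = ["Alzheimer disease", "diabetes mellitus", "normal", "Hypertension", "asthma"]
first_pass = validate(column)
print(first_pass)   # ['normal', 'Hypertension']

registry.add("normal")  # analogous to saving a custom Disease record
second_pass = validate(column)
print(second_pass)  # ['Hypertension']
```

Under that assumption, a default value is treated like any other term: it only validates once a record with that name exists, which is what saving the 'normal' Disease record achieves.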
Curate the remaining mismatches manually.
diseases = bt.Disease.public().lookup()
phenotypes = bt.Phenotype.public().lookup()
developmental_stages = bt.DevelopmentalStage.public().lookup()
df["disease"] = df["disease"].cat.rename_categories(
    {"Hypertension": diseases.hypertensive_disorder.name}
)
df["phenotype"] = df["phenotype"].cat.rename_categories(
    {
        "Tumor growth": phenotypes.neoplasm.name,
        "Airway inflammation": phenotypes.bronchitis.name,
    }
)
df["developmental_stage"] = df["developmental_stage"].cat.rename_categories(
    {
        "Adult": developmental_stages.adolescent_stage.name,
        "Child": developmental_stages.child_stage.name,
    }
)
curator.validate()
• saving validated records of 'disease'
✓ added 1 record from public with Disease.name for "disease": 'hypertensive disorder'
• saving validated records of 'developmental_stage'
✓ added 2 records from public with DevelopmentalStage.name for "developmental_stage": 'child stage', 'adolescent stage'
• saving validated records of 'phenotype'
✓ added 2 records from public with Phenotype.name for "phenotype": 'Bronchitis', 'Neoplasm'
✓ "disease" is validated against Disease.name
✓ "developmental_stage" is validated against DevelopmentalStage.name
✓ "phenotype" is validated against Phenotype.name
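Manual mappings like the ones above are easy to get wrong: when rename_categories is given a dict, keys that match no existing category are ignored, so a typo in a key leaves the original value in place. A minimal plain-Python sketch of a stricter mapping step follows. `apply_mapping` is a hypothetical helper, and the target names are taken from the log output above.

```python
def apply_mapping(values, mapping):
    """Replace values according to mapping; fail loudly if a mapping key
    does not occur in the data (usually a typo)."""
    unused = sorted(set(mapping) - set(values))
    if unused:
        raise KeyError(f"mapping keys not found in data: {unused}")
    return [mapping.get(v, v) for v in values]


phenotype = [
    "Mental deterioration",
    "Hyperglycemia",
    "Tumor growth",
    "Increased blood pressure",
    "Airway inflammation",
]
curated = apply_mapping(
    phenotype,
    {"Tumor growth": "Neoplasm", "Airway inflammation": "Bronchitis"},
)
print(curated)
```

Values without a mapping entry pass through unchanged, so already valid terms such as "Hyperglycemia" are left alone.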
!rm -rf test-ehrschema
!lamin delete --force test-ehrschema
• deleting instance testuser1/test-ehrschema