Curate dataframes with an EHR schema

In a previous guide, you defined generic schemas for DataFrames and other objects. This guide walks through an example schema for electronic health record (EHR) data.

For a comparable schema for scRNA-seq data, see the guide "Curate AnnData based on the CELLxGENE schema".

# pip install 'lamindb[bionty]'
!lamin init --storage ./test-ehrschema --modules bionty
 initialized lamindb: testuser1/test-ehrschema
import lamindb as ln
import bionty as bt
import pandas as pd

ln.track("2XEr2IA4n1w40000")
 connected lamindb: testuser1/test-ehrschema
 created Transform('2XEr2IA4n1w40000'), started new Run('6G5Xm9AQ...') at 2025-02-20 07:29:38 UTC
 notebook imports: bionty==1.1.0 lamindb==1.1.0 pandas==2.2.3

We want to ensure that

  1. the dataframe has columns disease, phenotype, developmental_stage, and age

  2. if columns or values are missing, we standardize the dataframe with default values

  3. any values that are present map against specific versions of pre-defined ontologies
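The three goals above can be sketched without any library. This is only an illustration of the idea, not LaminDB code: the `DEFAULTS` and `ALLOWED` mappings, the helper names, and the toy vocabulary are assumptions made for this example, and plain Python containers stand in for the dataframe.

```python
from typing import Any

# Column name -> default value (None means "no default"); mirrors the goals above.
DEFAULTS: dict[str, Any] = {
    "disease": "normal",
    "phenotype": "unknown",
    "developmental_stage": "unknown",
    "age": None,
}
# Toy vocabulary standing in for a pinned ontology version (an assumption).
ALLOWED = {"disease": {"normal", "asthma"}}


def standardize(table: dict[str, list], n_rows: int) -> dict[str, list]:
    """Add missing schema columns and replace missing values with defaults.

    Only columns named in DEFAULTS are kept in the result.
    """
    result = {}
    for column, default in DEFAULTS.items():
        values = table.get(column, [None] * n_rows)
        result[column] = [default if value is None else value for value in values]
    return result


def unvalidated_terms(table: dict[str, list]) -> dict[str, set]:
    """Return, per column, the values that are not in the allowed vocabulary."""
    report = {}
    for column, allowed in ALLOWED.items():
        unknown = set(table[column]) - allowed
        if unknown:
            report[column] = unknown
    return report


table = {"disease": ["asthma", None, "Hypertension"], "age": [12, 60, 65]}
clean = standardize(table, n_rows=3)
print(clean["disease"])          # ['asthma', 'normal', 'Hypertension']
print(clean["phenotype"])        # ['unknown', 'unknown', 'unknown']
print(unvalidated_terms(clean))  # {'disease': {'Hypertension'}}
```

The rest of this guide expresses the same three checks declaratively through a schema instead of hand-written functions.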

Define a schema

Let us first define the ontology versions we want to use.

disease_ontology = bt.Source.get(
    entity="bionty.Disease", name="mondo", version="2023-04-04"
)
developmental_stage_ontology = bt.Source.get(
    entity="bionty.DevelopmentalStage", name="hsapdv", version="2020-03-10"
)
phenotype_ontology = bt.Source.get(
    entity="bionty.Phenotype",
    name="hp",
    version="2023-06-17",
    organism="human",
)

Let us now create a schema by defining the features that it measures. The ontology versions are captured via their uid.

schema = ln.Schema(
    name="My EHR schema",
    features=[
        ln.Feature(name="age", dtype=int).save(),
        ln.Feature(
            name="disease",
            dtype=bt.Disease,
            default_value="normal",
            nullable=False,
            cat_filters={"source__uid": disease_ontology.uid},
        ).save(),
        ln.Feature(
            name="developmental_stage",
            dtype=bt.DevelopmentalStage,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": developmental_stage_ontology.uid},
        ).save(),
        ln.Feature(
            name="phenotype",
            dtype=bt.Phenotype,
            default_value="unknown",
            nullable=False,
            cat_filters={"source__uid": phenotype_ontology.uid},
        ).save(),
    ],
).save()
# look at a dataframe of the features that are part of the schema
schema.features.df()
id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code
1 | MDqWMXLb5IaZ | age | int | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.653000+00:00 | 1 | {'af': {'0': None, '1': True}} | 1
2 | Zt0PCVFDkg5i | disease | cat[bionty.Disease[source__uid='Hgw08Vk3']] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.659000+00:00 | 1 | {'af': {'0': 'normal', '1': False}} | 1
3 | a1egkOJDZ7jc | developmental_stage | cat[bionty.DevelopmentalStage[source__uid='7Zm... | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.665000+00:00 | 1 | {'af': {'0': 'unknown', '1': False}} | 1
4 | ICQ9masISOdM | phenotype | cat[bionty.Phenotype[source__uid='451W7iJS']] | None | None | None | 0 | 0 | None | None | None | True | None | 1 | None | 1 | 2025-02-20 07:29:40.670000+00:00 | 1 | {'af': {'0': 'unknown', '1': False}} | 1
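In the output above, the `_aux` values line up with the `default_value` and `nullable` arguments passed to each feature: `age` has no default and allows nulls, while `disease` defaults to "normal" and does not. The sketch below reads that pair back out. The interpretation of key "0" as the default and key "1" as the nullable flag is an assumption inferred from this output only, `_aux` is a private field whose layout may change, and `FeatureConstraints` and `read_constraints` are hypothetical names.

```python
from typing import Any, NamedTuple


class FeatureConstraints(NamedTuple):
    """Hypothetical container for the two settings seen in `_aux`."""

    default_value: Any
    nullable: bool


def read_constraints(aux: dict) -> FeatureConstraints:
    # Assumption: "0" holds the default value and "1" the nullable flag,
    # as suggested by the rows printed above.
    af = aux["af"]
    return FeatureConstraints(default_value=af["0"], nullable=af["1"])


age = read_constraints({"af": {"0": None, "1": True}})
disease = read_constraints({"af": {"0": "normal", "1": False}})
print(age)      # FeatureConstraints(default_value=None, nullable=True)
print(disease)  # FeatureConstraints(default_value='normal', nullable=False)
```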

Curate an example dataset

Create an example DataFrame that contains all required columns, one of which is misnamed.

dataset = {
    "disease": pd.Categorical(
        [
            "Alzheimer disease",
            "diabetes mellitus",
            pd.NA,
            "Hypertension",
            "asthma",
        ]
    ),
    "phenotype": pd.Categorical(
        [
            "Mental deterioration",
            "Hyperglycemia",
            "Tumor growth",
            "Increased blood pressure",
            "Airway inflammation",
        ]
    ),
    "developmental_stage": pd.Categorical(
        ["Adult", "Adult", "Adult", "Adult", "Child"]
    ),
    "patient_age": [70, 55, 60, 65, 12],
}
df = pd.DataFrame(dataset)
df
             disease                 phenotype  developmental_stage  patient_age
0  Alzheimer disease      Mental deterioration                Adult           70
1  diabetes mellitus             Hyperglycemia                Adult           55
2                NaN              Tumor growth                Adult           60
3       Hypertension  Increased blood pressure                Adult           65
4             asthma       Airway inflammation                Child           12

Let’s validate it.

curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("column 'age' not in dataframe")
    print(e)
column 'age' not in dataframe. Columns in dataframe: ['disease', 'phenotype', 'developmental_stage', 'patient_age']
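The error above names one missing column at a time. A quick way to see all mismatches at once is to compare the column lists directly. The following is a minimal sketch using only the standard library; the loose `cutoff=0.4` is an assumption chosen so that the short name "age" still matches "patient_age".

```python
import difflib

required = ["age", "disease", "developmental_stage", "phenotype"]
present = ["disease", "phenotype", "developmental_stage", "patient_age"]

# Columns the schema expects but the dataframe lacks, and vice versa.
missing = [name for name in required if name not in present]
unexpected = [name for name in present if name not in required]
print(missing)     # ['age']
print(unexpected)  # ['patient_age']

# Suggest a likely rename for each missing column among the unexpected ones.
suggestions = {
    name: difflib.get_close_matches(name, unexpected, n=1, cutoff=0.4)
    for name in missing
}
print(suggestions)  # {'age': ['patient_age']}
```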

Rename the patient_age column to age.

df.columns = df.columns.str.replace("patient_age", "age")
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
    print(e)
non-nullable series 'disease' contains null values:
2    NaN
Name: disease, dtype: category
Categories (4, object): ['Alzheimer disease', 'Hypertension', 'asthma', 'diabetes mellitus']
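Note that the rename was applied by assigning to `df.columns`, and the curator created earlier then moved on to the next error. This suggests that the curator works on the same DataFrame object rather than a copy. The sketch below shows the underlying pandas behavior with a hypothetical `Holder` class standing in for any object that keeps a reference to a dataframe; it makes no claim about the curator's internals.

```python
import pandas as pd


class Holder:
    """Hypothetical stand-in for an object that stores a reference to a DataFrame."""

    def __init__(self, df: pd.DataFrame) -> None:
        self.df = df


df = pd.DataFrame({"patient_age": [70, 55], "disease": ["asthma", None]})
holder = Holder(df)

# Assigning to .columns relabels the existing object, so the holder sees it.
df.columns = df.columns.str.replace("patient_age", "age")
same_object = holder.df is df
columns_seen_by_holder = list(holder.df.columns)
print(same_object, columns_seen_by_holder)

# By contrast, rename() without inplace=True returns a new DataFrame and
# leaves the object referenced by the holder untouched.
copy = df.rename(columns={"age": "age_in_years"})
print(list(copy.columns), list(holder.df.columns))
```

If you prefer `rename`, pass `inplace=True` or create the curator from the renamed dataframe.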

Standardize the dataframe so that the missing value gets populated with the default value.

curator.standardize()
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith(
        "2 terms are not validated: 'Tumor growth', 'Airway inflammation'"
    )
    print(e)
 saving validated records of 'disease'
 added 3 records from public with Disease.name for "disease": 'diabetes mellitus', 'asthma', 'Alzheimer disease'
 saving validated records of 'phenotype'
 added 3 records from public with Phenotype.name for "phenotype": 'Hyperglycemia', 'Increased blood pressure', 'Mental deterioration'
 mapping "disease" on Disease.name
!   2 terms are not validated: 'normal', 'Hypertension'
    → fix typos, remove non-existent values, or save terms via .add_new_from("disease")
 mapping "developmental_stage" on DevelopmentalStage.name
!   2 terms are not validated: 'Adult', 'Child'
    → fix typos, remove non-existent values, or save terms via .add_new_from("developmental_stage")
 mapping "phenotype" on Phenotype.name
!   2 terms are not validated: 'Tumor growth', 'Airway inflammation'
    → fix typos, remove non-existent values, or save terms via .add_new_from("phenotype")
2 terms are not validated: 'Tumor growth', 'Airway inflammation'
    → fix typos, remove non-existent values, or save terms via .add_new_from("phenotype")
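The log shows that the missing disease value was replaced by the default "normal", which then shows up among the unvalidated terms. Filling a default into a categorical column takes one extra step in plain pandas, because a categorical only accepts values that are among its categories. The sketch below illustrates that step; it is not how `curator.standardize()` is implemented.

```python
import pandas as pd

disease = pd.Series(
    pd.Categorical(
        ["Alzheimer disease", "diabetes mellitus", None, "Hypertension", "asthma"]
    ),
    name="disease",
)
default_value = "normal"

# Register the default as a category first; otherwise fillna would reject it.
if default_value not in disease.cat.categories:
    disease = disease.cat.add_categories(default_value)
filled = disease.fillna(default_value)

print(filled.tolist())
# ['Alzheimer disease', 'diabetes mellitus', 'normal', 'Hypertension', 'asthma']
```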

The default value 'normal' does not validate against the pinned disease ontology, so add it to the disease registry manually.

bt.Disease(name="normal", description="Healthy condition").save()
Disease(uid='7kTPatVd', name='normal', description='Healthy condition', created_by_id=1, run_id=1, space_id=1, created_at=2025-02-20 07:29:48 UTC)

Curate the remaining mismatches manually by renaming the unvalidated categories to terms looked up in the public ontologies.

diseases = bt.Disease.public().lookup()
phenotypes = bt.Phenotype.public().lookup()
developmental_stages = bt.DevelopmentalStage.public().lookup()

df["disease"] = df["disease"].cat.rename_categories(
    {"Hypertension": diseases.hypertensive_disorder.name}
)
df["phenotype"] = df["phenotype"].cat.rename_categories(
    {
        "Tumor growth": phenotypes.neoplasm.name,
        "Airway inflammation": phenotypes.bronchitis.name,
    }
)
df["developmental_stage"] = df["developmental_stage"].cat.rename_categories(
    {
        "Adult": developmental_stages.adolescent_stage.name,
        "Child": developmental_stages.child_stage.name,
    }
)

curator.validate()
 saving validated records of 'disease'
 added 1 record from public with Disease.name for "disease": 'hypertensive disorder'
 saving validated records of 'developmental_stage'
 added 2 records from public with DevelopmentalStage.name for "developmental_stage": 'child stage', 'adolescent stage'
 saving validated records of 'phenotype'
 added 2 records from public with Phenotype.name for "phenotype": 'Bronchitis', 'Neoplasm'
 "disease" is validated against Disease.name
 "developmental_stage" is validated against DevelopmentalStage.name
 "phenotype" is validated against Phenotype.name
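The manual fixes above rely on pandas' `Series.cat.rename_categories`, which relabels categories without touching the per-row codes. A small self-contained sketch, using two of the phenotype replacements from this guide on a toy series:

```python
import pandas as pd

phenotype = pd.Series(
    pd.Categorical(["Tumor growth", "Hyperglycemia", "Airway inflammation"])
)

# Only the listed categories are relabeled; unlisted ones stay as they are.
renamed = phenotype.cat.rename_categories(
    {"Tumor growth": "Neoplasm", "Airway inflammation": "Bronchitis"}
)

print(renamed.tolist())  # ['Neoplasm', 'Hyperglycemia', 'Bronchitis']
# The integer codes behind each row are unchanged by the rename.
print((renamed.cat.codes == phenotype.cat.codes).all())  # True
```

Because the old labels are replaced rather than added alongside, no stale categories remain in the column after the rename.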
!rm -rf test-ehrschema
!lamin delete --force test-ehrschema
 deleting instance testuser1/test-ehrschema