Analysis flow

Here, we’ll track typical data transformations that occur during analysis, such as subsetting.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-analysis-flow --modules bionty

import lamindb as ln
import bionty as bt

→ connected lamindb: testuser1/test-analysis-flow

Save an initial dataset

register_example_file.py

import lamindb as ln
import bionty as bt

ln.track("K4wsS5DTYdFp0000")

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()

!python analysis-flow-scripts/register_example_file.py


→ connected lamindb: testuser1/test-analysis-flow

→ created Transform('K4wsS5DTYdFp0000'), started new Run('bZRMg33A...') at 2025-07-14 06:43:38 UTC

! organism is ignored, define it on the dtype level
! 4 terms not validated in feature 'columns': 'cell_type', 'cell_type_id', 'tissue', 'disease'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')

✓ added 4 records with Feature for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'

✓ added 3 records from_public with bionty.CellType for "cell_type": 'T cell', 'hematopoietic stem cell', 'hepatocyte'

! 1 term not validated in feature 'cell_type': 'my new cell type'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')
✓ added 1 record with bionty.CellType for "cell_type": 'my new cell type'
✓ "columns" is validated against Feature.name
✓ "cell_type" is validated against CellType.name
✓ "cell_type_id" is validated against CellType.ontology_id

✓ added 4 records from_public with bionty.Tissue for "tissue": 'kidney', 'liver', 'heart', 'brain'
✓ "tissue" is validated against Tissue.name

✓ added 4 records from_public with bionty.Disease for "disease": 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
✓ "disease" is validated against Disease.name

✓ created 1 Organism record from Bionty matching name: 'human'

✓ added 99 records from_public with bionty.Gene for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
✓ "var_index" is validated against Gene.ensembl_gene_id

✓ 99 unique terms (100.00%) are validated for ensembl_gene_id

✓ 4 unique terms (100.00%) are validated for name
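
The validation messages above boil down to a membership check: each unique value in a column is compared against the terms a registry knows, and anything unmatched is reported as not validated. The following is a minimal plain-Python sketch of that idea, not LaminDB's implementation; `split_validated` and `public_cell_types` are hypothetical names introduced only for illustration.

```python
def split_validated(values, known_terms):
    """Return (validated, not_validated) unique terms in first-seen order."""
    validated, not_validated = [], []
    for value in dict.fromkeys(values):  # de-duplicate while keeping order
        (validated if value in known_terms else not_validated).append(value)
    return validated, not_validated


# Hypothetical stand-in for a public ontology of cell type names
public_cell_types = {"T cell", "hematopoietic stem cell", "hepatocyte"}

observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]
validated, not_validated = split_validated(observed, public_cell_types)
print(validated)
print(not_validated)
```

In this toy model, a term such as 'my new cell type' stays in the non-validated list until it is explicitly added to the set of known terms, which mirrors the role of `curate.add_new_from("cell_type")` in the script.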

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track("eNef4Arw8nNM")

artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()


Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: 7W6s68YULuzFzbOh0000          hash: IJORtcQUSS11QBqD-nTD0A
│   ├── size: 45.9 KB                      n_observations: 40
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-14 06:43:42    created_by: testuser1 (Test User1)
│   ├── storage location / path: 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/7W6s68YULuzFzbOh0000.h5ad
│   ├── description: anndata with obs
│   └── transform: register_example_file.py
├── Dataset features
│   ├── var • 99                        [bionty.Gene]                                                              
│   │   TSPAN6                          float                                                                      
│   │   TNMD                            float                                                                      
│   │   DPM1                            float                                                                      
│   │   SCYL3                           float                                                                      
│   │   FIRRM                           float                                                                      
│   │   FGR                             float                                                                      
│   │   CFH                             float                                                                      
│   │   FUCA2                           float                                                                      
│   │   GCLC                            float                                                                      
│   │   NFYA                            float                                                                      
│   │   STPG1                           float                                                                      
│   │   NIPAL3                          float                                                                      
│   │   LAS1L                           float                                                                      
│   │   ENPP4                           float                                                                      
│   │   SEMA3F                          float                                                                      
│   │   CFTR                            float                                                                      
│   │   ANKIB1                          float                                                                      
│   │   CYP51A1                         float                                                                      
│   │   KRIT1                           float                                                                      
│   │   RAD52                           float                                                                      
│   └── obs • 4                         [Feature]                                                                  
│       cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell, hepato…
│       cell_type_id                    cat[bionty.CellType]               T cell, hematopoietic stem cell, hepato…
│       disease                         cat[bionty.Disease]                Alzheimer disease, cardiac ventricle di…
│       tissue                          cat[bionty.Tissue]                 brain, heart, kidney, liver             
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver, heart, brain             
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell, hepato…
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma,…

Get a backed AnnData object

adata = artifact.open()
adata

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
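
The lookup objects expose each label as an attribute, which is why `cell_types.t_cell` can be used below in place of the string "T cell". A rough sketch of how such an attribute-style lookup could be built is shown here. The helper `make_lookup` and its name-normalization rule are assumptions for illustration; LaminDB's actual normalization may differ.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an attribute-style lookup: 'T cell' becomes lookup.t_cell."""
    entries = {}
    for name in names:
        # Lower-case, then collapse runs of non-alphanumerics into "_"
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        entries[key] = name
    return SimpleNamespace(**entries)


toy_cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(toy_cell_types.t_cell)
print(toy_cell_types.hematopoietic_stem_cell)
```

Attribute access of this kind gives tab completion in a notebook and avoids typos in free-text label names.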

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))

adata_subset = adata[subset_obs]
adata_subset

adata_subset.obs[["cell_type", "disease"]].value_counts()
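
The mask above keeps a row only if both membership tests hold. The sketch below spells out the same logic in plain Python, without pandas or AnnData; the rows in `obs` are made-up stand-ins for `adata.obs`, not values from the example dataset.

```python
from collections import Counter

# Hypothetical per-cell annotations standing in for adata.obs
obs = [
    {"cell_type": "T cell", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
    {"cell_type": "hematopoietic stem cell", "disease": "chronic kidney disease"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row; both conditions must hold (the "&" in the mask)
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Counts per (cell_type, disease) pair, analogous to value_counts()
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(mask)
print(counts)
```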

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()

artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()


→ returning existing schema with same hash: Schema(uid='QoZQpRl4B9A9MPFZ', n=99, is_type=False, itype='bionty.Gene', dtype='float', hash='QogdpqbT704yi5K-Ag5zhg', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-14 06:43:43 UTC)

→ returning existing schema with same hash: Schema(uid='VakdjdC1fF5N5VBi', n=4, is_type=False, itype='Feature', otype='DataFrame', hash='CISEOEpq4uXGbUz3Nkoylw', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-14 06:43:43 UTC)

Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: Bib8KD4gfdxVZIGW0000          hash: RgGUx7ndRplZZSmalTAWiw
│   ├── size: 38.1 KB                      n_observations: 20
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-14 06:43:45    created_by: testuser1 (Test User1)
│   ├── storage location / path: 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/Bib8KD4gfdxVZIGW0000.h5ad
│   ├── description: anndata with obs subset
│   └── transform: analysis-flow.ipynb
├── Dataset features
│   ├── var • 99                        [bionty.Gene]                                                              
│   │   TSPAN6                          float                                                                      
│   │   TNMD                            float                                                                      
│   │   DPM1                            float                                                                      
│   │   SCYL3                           float                                                                      
│   │   FIRRM                           float                                                                      
│   │   FGR                             float                                                                      
│   │   CFH                             float                                                                      
│   │   FUCA2                           float                                                                      
│   │   GCLC                            float                                                                      
│   │   NFYA                            float                                                                      
│   │   STPG1                           float                                                                      
│   │   NIPAL3                          float                                                                      
│   │   LAS1L                           float                                                                      
│   │   ENPP4                           float                                                                      
│   │   SEMA3F                          float                                                                      
│   │   CFTR                            float                                                                      
│   │   ANKIB1                          float                                                                      
│   │   CYP51A1                         float                                                                      
│   │   KRIT1                           float                                                                      
│   │   RAD52                           float                                                                      
│   └── obs • 4                         [Feature]                                                                  
│       cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell         
│       disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma  
│       tissue                          cat[bionty.Tissue]                 kidney, liver                           
│       cell_type_id                    cat[bionty.CellType]                                                       
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver                           
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell         
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma
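
The "returning existing schema with same hash" messages above indicate that records with identical content are reused rather than duplicated. A minimal sketch of content-hash deduplication follows. It assumes that the 22-character hashes shown on this page are unpadded URL-safe base64 encodings of an MD5 digest, which fits their length and the `md5` hash type reported for the input artifact but is not confirmed here. `content_hash`, `get_or_create`, and `registry` are hypothetical names.

```python
import base64
import hashlib


def content_hash(data: bytes) -> str:
    """MD5 digest as unpadded URL-safe base64 (22 characters)."""
    digest = hashlib.md5(data).digest()
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


registry = {}  # hash -> name of the first record saved with that content


def get_or_create(name: str, data: bytes) -> str:
    """Return the existing record's name if the same content was seen before."""
    return registry.setdefault(content_hash(data), name)


first = get_or_create("schema-a", b"cell_type,disease,tissue")
second = get_or_create("schema-b", b"cell_type,disease,tissue")
print(first, second)
```

Because the second call hashes to the same key, it returns the record created by the first call instead of registering a new one.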

Examine data lineage

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

cell_types = bt.CellType.lookup()

my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
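
The keyword arguments in the query above follow the Django-style `field__operator=value` convention: `description__endswith` tests a string suffix, and `cell_types__in` matches when any linked label is in the given list. The toy evaluator below illustrates how such keywords can be interpreted. The function `matches` and the sample records are hypothetical, and this is not how LaminDB executes queries (those are translated to database queries).

```python
def matches(record: dict, **conditions) -> bool:
    """Evaluate 'field__op=value' conditions; all of them must hold."""
    ops = {
        "exact": lambda actual, expected: actual == expected,
        "endswith": lambda actual, expected: actual.endswith(expected),
        # For many-valued fields: any linked value is in the given list
        "in": lambda actual, expected: any(item in expected for item in actual),
    }
    for lookup, expected in conditions.items():
        field, _, op = lookup.partition("__")
        if not ops[op or "exact"](record[field], expected):
            return False
    return True


records = [
    {"suffix": ".h5ad", "description": "anndata with obs",
     "cell_types": ["T cell", "hepatocyte"]},
    {"suffix": ".h5ad", "description": "anndata with obs subset",
     "cell_types": ["T cell"]},
]

hits = [
    r["description"]
    for r in records
    if matches(r, suffix=".h5ad", description__endswith="subset",
               cell_types__in=["T cell"])
]
print(hits)
```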

Common questions that might arise are:

What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?

Let’s answer these questions using LaminDB:

artifact.features

Artifact .h5ad · AnnData · dataset
└── Dataset features
    ├── var • 99                        [bionty.Gene]                                                              
    │   TSPAN6                          float                                                                      
    │   TNMD                            float                                                                      
    │   DPM1                            float                                                                      
    │   SCYL3                           float                                                                      
    │   FIRRM                           float                                                                      
    │   FGR                             float                                                                      
    │   CFH                             float                                                                      
    │   FUCA2                           float                                                                      
    │   GCLC                            float                                                                      
    │   NFYA                            float                                                                      
    │   STPG1                           float                                                                      
    │   NIPAL3                          float                                                                      
    │   LAS1L                           float                                                                      
    │   ENPP4                           float                                                                      
    │   SEMA3F                          float                                                                      
    │   CFTR                            float                                                                      
    │   ANKIB1                          float                                                                      
    │   CYP51A1                         float                                                                      
    │   KRIT1                           float                                                                      
    │   RAD52                           float                                                                      
    └── obs • 4                         [Feature]                                                                  
        cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell         
        disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma  
        tissue                          cat[bionty.Tissue]                 kidney, liver                           
        cell_type_id                    cat[bionty.CellType]

print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())

--> What is the lineage of this artifact?

[Figure: lineage graph rendered by artifact.view_lineage()]

--> Which features and labels are associated with it?

Artifact .h5ad · AnnData · dataset
└── Dataset features
    ├── var • 99                        [bionty.Gene]                                                              
    │   TSPAN6                          float                                                                      
    │   TNMD                            float                                                                      
    │   DPM1                            float                                                                      
    │   SCYL3                           float                                                                      
    │   FIRRM                           float                                                                      
    │   FGR                             float                                                                      
    │   CFH                             float                                                                      
    │   FUCA2                           float                                                                      
    │   GCLC                            float                                                                      
    │   NFYA                            float                                                                      
    │   STPG1                           float                                                                      
    │   NIPAL3                          float                                                                      
    │   LAS1L                           float                                                                      
    │   ENPP4                           float                                                                      
    │   SEMA3F                          float                                                                      
    │   CFTR                            float                                                                      
    │   ANKIB1                          float                                                                      
    │   CYP51A1                         float                                                                      
    │   KRIT1                           float                                                                      
    │   RAD52                           float                                                                      
    └── obs • 4                         [Feature]                                                                  
        cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell         
        disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma  
        tissue                          cat[bionty.Tissue]                 kidney, liver                           
        cell_type_id                    cat[bionty.CellType]

Artifact .h5ad · AnnData · dataset
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver                           
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell         
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma

--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', branch_id=1, space_id=1, created_by_id=1, created_at=2025-07-14 06:43:44 UTC)

--> Who saved this artifact?

User object (1)

--> Which artifacts were inputs?

	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
2	7W6s68YULuzFzbOh0000	None	anndata with obs	.h5ad	dataset	AnnData	46992	IJORtcQUSS11QBqD-nTD0A	None	40	md5	True	False	1	1	None	None	True	1	2025-07-14 06:43:42.949000+00:00	1	{'af': {'0': True}}	1
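
The answers above can be read as a small graph: each run of a transform consumes input artifacts and produces output artifacts, and lineage is a walk upstream through those links. The sketch below models that walk with plain dictionaries, using the transform and artifact names from this page. The `runs` structure and `trace_lineage` are hypothetical and only illustrate the idea; they are not LaminDB's data model.

```python
# Hypothetical run records: each run reads input artifacts and writes outputs
runs = {
    "register_example_file.py": {"inputs": [], "outputs": ["anndata with obs"]},
    "analysis-flow.ipynb": {
        "inputs": ["anndata with obs"],
        "outputs": ["anndata with obs subset"],
    },
}


def trace_lineage(artifact: str) -> list[str]:
    """Walk upstream from an artifact, alternating artifacts and transforms."""
    path = [artifact]
    for transform, run in runs.items():
        if artifact in run["outputs"]:
            path.append(transform)
            for parent in run["inputs"]:
                path.extend(trace_lineage(parent))
    return path


lineage = trace_lineage("anndata with obs subset")
print(" <- ".join(lineage))
```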