Analysis flow¶
Here, we’ll track typical data transformations like subsetting that occur during analysis.
# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-analysis-flow --modules bionty
→ initialized lamindb: testuser1/test-analysis-flow
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/test-analysis-flow
Save an initial dataset¶
register_example_file.py¶
import lamindb as ln
import bionty as bt
ln.track("K4wsS5DTYdFp0000")
# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()
# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")
ln.finish()
!python analysis-flow-scripts/register_example_file.py
→ connected lamindb: testuser1/test-analysis-flow
→ created Transform('K4wsS5DTYdFp0000', key='register_example_file.py'), started new Run('J1AD0HwU1ILspQdA') at 2025-10-27 08:30:51 UTC
! organism is ignored, define it on the dtype level
! 4 terms not validated in feature 'columns': 'cell_type', 'cell_type_id', 'tissue', 'disease'
→ fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
✓ added 4 records with Feature for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
✓ added 3 records from_public with bionty.CellType for "cell_type": 'T cell', 'hematopoietic stem cell', 'hepatocyte'
! 1 term not validated in feature 'cell_type': 'my new cell type'
→ fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')
✓ added 1 record with bionty.CellType for "cell_type": 'my new cell type'
✓ "columns" is validated against Feature.name
✓ "cell_type" is validated against CellType.name
✓ "cell_type_id" is validated against CellType.ontology_id
✓ added 4 records from_public with bionty.Tissue for "tissue": 'kidney', 'liver', 'heart', 'brain'
✓ "tissue" is validated against Tissue.name
✓ added 4 records from_public with bionty.Disease for "disease": 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
✓ "disease" is validated against Disease.name
✓ created 1 Organism record from Bionty matching name: 'human'
✓ added 99 records from_public with bionty.Gene for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
✓ "var_index" is validated against Gene.ensembl_gene_id
→ writing the in-memory object into cache
✓ 99 unique terms (100.00%) are validated for ensembl_gene_id
✓ 4 unique terms (100.00%) are validated for name
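The log above follows a check-then-add pattern: each categorical value is compared against a registry, known terms pass, and unknown ones such as 'my new cell type' are reported until they are explicitly added. The following is a minimal, library-free sketch of that pattern. The names `validate_terms` and `registry` are hypothetical and chosen for illustration only; this is not how LaminDB implements validation.

```python
def validate_terms(values, registry):
    """Split unique values into validated and non-validated terms, keeping order."""
    unique = list(dict.fromkeys(values))  # de-duplicate, keep first-seen order
    validated = [v for v in unique if v in registry]
    not_validated = [v for v in unique if v not in registry]
    return validated, not_validated


# Hypothetical registry of known cell type names (stand-in for a public ontology)
registry = {"T cell", "hematopoietic stem cell", "hepatocyte"}
observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]

validated, not_validated = validate_terms(observed, registry)
print("validated:", validated)
print("not validated:", not_validated)

# Explicitly accept the new term, analogous in spirit to add_new_from("cell_type")
registry.update(not_validated)
_, remaining = validate_terms(observed, registry)
print("remaining after adding:", remaining)
```

Keeping the "add new terms" step explicit means a typo is never silently registered as a new label.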
Open a dataset, subset it, and register the result¶
Track the current notebook:
ln.track("eNef4Arw8nNM")
→ created Transform('eNef4Arw8nNM0000', key='analysis-flow.ipynb'), started new Run('8VX87x7UsimUJHYb') at 2025-10-27 08:30:58 UTC
→ notebook imports: bionty==1.8.1 lamindb==1.14a1
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact: (0000) | description: anndata with obs
├── uid: wea5XnlgkpjdwN6y0000            run: J1AD0Hw (register_example_file.py)
│   kind: dataset                        otype: AnnData
│   hash: IJORtcQUSS11QBqD-nTD0A         size: 45.9 KB
│   branch: main                         space: all
│   created_at: 2025-10-27 08:30:56 UTC  created_by: testuser1
│   n_observations: 40
├── storage/path:
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/wea5XnlgkpjdwN6y0000.h5ad
├── Dataset features
│   ├── var (99 bionty.Gene)
│   │   TSPAN6        float
│   │   TNMD          float
│   │   DPM1          float
│   │   SCYL3         float
│   │   FIRRM         float
│   │   FGR           float
│   │   CFH           float
│   │   FUCA2         float
│   │   GCLC          float
│   │   NFYA          float
│   │   STPG1         float
│   │   NIPAL3        float
│   │   LAS1L         float
│   │   ENPP4         float
│   │   SEMA3F        float
│   │   CFTR          float
│   │   ANKIB1        float
│   │   CYP51A1       float
│   │   KRIT1         float
│   │   RAD52         float
│   └── obs (4)
│       cell_type     bionty.CellType  T cell, hematopoietic stem cell, hepato…
│       cell_type_id  bionty.CellType  T cell, hematopoietic stem cell, hepato…
│       disease       bionty.Disease   Alzheimer disease, cardiac ventricle di…
│       tissue        bionty.Tissue    brain, heart, kidney, liver
└── Labels
    └── .tissues      bionty.Tissue    kidney, liver, heart, brain
        .cell_types   bionty.CellType  T cell, hematopoietic stem cell, hepato…
        .diseases     bionty.Disease   chronic kidney disease, liver lymphoma,…
Get a backed AnnData object¶
adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object wea5XnlgkpjdwN6y0000.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
Subset dataset to specific cell types and diseases¶
cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
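The lookup objects expose record names as attributes, so 'T cell' becomes `cell_types.t_cell` and tab completion can be used instead of typing strings. The sketch below shows one way such a mapping could be built from a list of names. The helpers `to_identifier` and `make_lookup` are hypothetical, and the normalization rule (lowercase, non-alphanumeric runs replaced by underscores) is an assumption that may differ from LaminDB's actual behavior.

```python
import re
from types import SimpleNamespace


def to_identifier(name):
    """Lowercase a name and replace runs of non-alphanumeric characters with '_'."""
    ident = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python identifiers cannot start with a digit, so prefix those with '_'
    return f"_{ident}" if ident[:1].isdigit() else ident


def make_lookup(names):
    """Build an attribute-accessible namespace mapping identifiers to names."""
    return SimpleNamespace(**{to_identifier(n): n for n in names})


# Illustrative names taken from the labels shown earlier on this page
cell_type_lookup = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(cell_type_lookup.t_cell)
print(cell_type_lookup.hematopoietic_stem_cell)
```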
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & adata.obs.disease.isin(
    [diseases.liver_lymphoma, diseases.chronic_kidney_disease]
)
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
Name: count, dtype: int64
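The subset combines two membership tests with a logical AND and then counts the surviving combinations. The same logic can be sketched on plain Python records without any dependencies. The rows below are made up for illustration and are not the contents of the dataset.

```python
from collections import Counter

# Made-up observation rows, for illustration only
rows = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row: both conditions must hold, as with `&` on two masks
mask = [
    r["cell_type"] in keep_cell_types and r["disease"] in keep_diseases
    for r in rows
]
subset = [r for r, keep in zip(rows, mask) if keep]

# Count (cell_type, disease) combinations, similar in spirit to value_counts()
counts = Counter((r["cell_type"], r["disease"]) for r in subset)
print(mask)
print(counts)
```

A row is kept only if it passes both tests, which is why a hepatocyte with liver lymphoma is dropped even though its disease is on the list.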
Register the subsetted AnnData:
curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
! organism is ignored, define it on the dtype level
/opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
→ writing the in-memory object into cache
→ returning schema with same hash: Schema(uid='wGhYTvvM7XSjprk3', name=None, description=None, n=99, is_type=False, itype='bionty.Gene', otype=None, dtype='float', hash='QogdpqbT704yi5K-Ag5zhg', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:30:57 UTC, is_locked=False)
→ returning schema with same hash: Schema(uid='Ri9T9VD2SZ7WFig3', name=None, description=None, n=4, is_type=False, itype='Feature', otype='DataFrame', dtype=None, hash='T11hosOLps0JnRkHxIy4gQ', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:30:57 UTC, is_locked=False)
Artifact: (0000) | description: anndata with obs subset
├── uid: aYPzvvDy5CtdEY000000            run: 8VX87x7 (analysis-flow.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: RgGUx7ndRplZZSmalTAWiw         size: 38.1 KB
│   branch: main                         space: all
│   created_at: 2025-10-27 08:30:59 UTC  created_by: testuser1
│   n_observations: 20
├── storage/path:
│   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/aYPzvvDy5CtdEY000000.h5ad
├── Dataset features
│   ├── var (99 bionty.Gene)
│   │   TSPAN6        float
│   │   TNMD          float
│   │   DPM1          float
│   │   SCYL3         float
│   │   FIRRM         float
│   │   FGR           float
│   │   CFH           float
│   │   FUCA2         float
│   │   GCLC          float
│   │   NFYA          float
│   │   STPG1         float
│   │   NIPAL3        float
│   │   LAS1L         float
│   │   ENPP4         float
│   │   SEMA3F        float
│   │   CFTR          float
│   │   ANKIB1        float
│   │   CYP51A1       float
│   │   KRIT1         float
│   │   RAD52         float
│   └── obs (4)
│       cell_type     bionty.CellType  T cell, hematopoietic stem cell
│       disease       bionty.Disease   chronic kidney disease, liver lymphoma
│       tissue        bionty.Tissue    kidney, liver
│       cell_type_id  bionty.CellType
└── Labels
    └── .tissues      bionty.Tissue    kidney, liver
        .cell_types   bionty.CellType  T cell, hematopoietic stem cell
        .diseases     bionty.Disease   chronic kidney disease, liver lymphoma
Examine data lineage¶
Query a subsetted .h5ad artifact annotated with “hematopoietic stem cell” or “T cell”:
cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='aYPzvvDy5CtdEY000000', version=None, is_latest=True, key=None, description='anndata with obs subset', suffix='.h5ad', kind='dataset', otype='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_files=None, n_observations=20, branch_id=1, space_id=1, storage_id=1, run_id=2, schema_id=None, created_by_id=1, created_at=2025-10-27 08:30:59 UTC, is_locked=False)
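The filter above uses keyword arguments with double-underscore suffixes, such as `description__endswith` and `cell_types__in`, a convention familiar from Django's ORM. The sketch below interprets a small subset of such lookups against plain dictionaries to show how the suffix selects the comparison. The function `matches` and the sample records are hypothetical, and a real ORM translates these lookups into database queries rather than Python loops.

```python
def matches(record, **lookups):
    """Check a dict against Django-style lookups (subset: exact, endswith, in)."""
    for key, expected in lookups.items():
        field_name, _, op = key.partition("__")
        value = record[field_name]
        if op == "":
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For list-valued fields, any overlap counts as a match
            values = value if isinstance(value, list) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Made-up records standing in for registered artifacts
records = [
    {"suffix": ".h5ad", "description": "anndata with obs", "cell_types": ["T cell", "hepatocyte"]},
    {"suffix": ".h5ad", "description": "anndata with obs subset", "cell_types": ["T cell"]},
]
hits = [
    r
    for r in records
    if matches(r, suffix=".h5ad", description__endswith="subset", cell_types__in=["T cell"])
]
print([r["description"] for r in hits])
```

All keyword arguments are combined with AND, while the `__in` lookup matches if any listed value is present.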
Common questions that might arise are:
What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?
Let’s answer these using LaminDB:
artifact.features
Artifact: (0000) | description: anndata with obs subset
└── Dataset features
    ├── var (99 bionty.Gene)
    │   TSPAN6        float
    │   TNMD          float
    │   DPM1          float
    │   SCYL3         float
    │   FIRRM         float
    │   FGR           float
    │   CFH           float
    │   FUCA2         float
    │   GCLC          float
    │   NFYA          float
    │   STPG1         float
    │   NIPAL3        float
    │   LAS1L         float
    │   ENPP4         float
    │   SEMA3F        float
    │   CFTR          float
    │   ANKIB1        float
    │   CYP51A1       float
    │   KRIT1         float
    │   RAD52         float
    └── obs (4)
        cell_type     bionty.CellType  T cell, hematopoietic stem cell
        disease       bionty.Disease   chronic kidney disease, liver lymphoma
        tissue        bionty.Tissue    kidney, liver
        cell_type_id  bionty.CellType
print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()
print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)
print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)
print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)
print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.to_dataframe())
--> What is the lineage of this artifact?
--> Which features and labels are associated with it?
Artifact: (0000) | description: anndata with obs subset
└── Dataset features
    ├── var (99 bionty.Gene)
    │   TSPAN6        float
    │   TNMD          float
    │   DPM1          float
    │   SCYL3         float
    │   FIRRM         float
    │   FGR           float
    │   CFH           float
    │   FUCA2         float
    │   GCLC          float
    │   NFYA          float
    │   STPG1         float
    │   NIPAL3        float
    │   LAS1L         float
    │   ENPP4         float
    │   SEMA3F        float
    │   CFTR          float
    │   ANKIB1        float
    │   CYP51A1       float
    │   KRIT1         float
    │   RAD52         float
    └── obs (4)
        cell_type     bionty.CellType  T cell, hematopoietic stem cell
        disease       bionty.Disease   chronic kidney disease, liver lymphoma
        tissue        bionty.Tissue    kidney, liver
        cell_type_id  bionty.CellType
Labels
└── .tissues      bionty.Tissue    kidney, liver
    .cell_types   bionty.CellType  T cell, hematopoietic stem cell
    .diseases     bionty.Disease   chronic kidney disease, liver lymphoma
--> Which notebook analyzed and saved this artifact?
Transform(uid='eNef4Arw8nNM0000', version=None, is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', hash=None, reference=None, reference_type=None, branch_id=1, space_id=1, created_by_id=1, created_at=2025-10-27 08:30:58 UTC, is_locked=False)
--> Who saved this artifact?
User object (1)
--> Which artifacts were inputs?
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | version | is_latest | is_locked | created_at | branch_id | space_id | storage_id | run_id | schema_id | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | wea5XnlgkpjdwN6y0000 | None | anndata with obs | .h5ad | dataset | AnnData | 46992 | IJORtcQUSS11QBqD-nTD0A | None | 40 | None | True | False | 2025-10-27 08:30:56.700000+00:00 | 1 | 1 | 1 | 1 | None | 1 |
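The answers above all come from following links between records: an artifact points to the run that created it, the run points to its transform and to its input artifacts. The toy model below sketches that structure and a simple upstream walk. The classes `ToyTransform`, `ToyRun`, and `ToyArtifact` and the function `lineage` are hypothetical stand-ins for illustration and are not LaminDB's classes or its lineage implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTransform:
    key: str  # e.g. a script or notebook name


@dataclass
class ToyRun:
    transform: ToyTransform
    input_artifacts: list = field(default_factory=list)


@dataclass
class ToyArtifact:
    description: str
    run: ToyRun  # the run that created this artifact


def lineage(artifact):
    """Walk upstream through runs and their inputs, starting at the artifact."""
    steps = []
    stack = [artifact]
    while stack:
        current = stack.pop()
        steps.append((current.description, current.run.transform.key))
        stack.extend(current.run.input_artifacts)
    return steps


# Mirror the two steps on this page: a registration script, then a notebook
register_run = ToyRun(ToyTransform("register_example_file.py"))
parent = ToyArtifact("anndata with obs", register_run)
analysis_run = ToyRun(ToyTransform("analysis-flow.ipynb"), input_artifacts=[parent])
child = ToyArtifact("anndata with obs subset", analysis_run)

for description, transform_key in lineage(child):
    print(f"{description} <- {transform_key}")
```

Because every artifact records its creating run, no separate lineage table is needed in this sketch; the graph is recovered by traversal.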
!rm -r ./test-analysis-flow
!lamin delete --force test-analysis-flow