Analysis flow

Here, we track typical data transformations that occur during analysis, such as subsetting a dataset, and then inspect the lineage of the results.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-analysis-flow --modules bionty
 initialized lamindb: testuser1/test-analysis-flow
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/test-analysis-flow

Save an initial dataset

register_example_file.py
import lamindb as ln
import bionty as bt

ln.track("K4wsS5DTYdFp0000")

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()

Run the script:

!python analysis-flow-scripts/register_example_file.py
 connected lamindb: testuser1/test-analysis-flow
 created Transform('K4wsS5DTYdFp0000', key='register_example_file.py'), started new Run('J1AD0HwU1ILspQdA') at 2025-10-27 08:30:51 UTC
! organism is ignored, define it on the dtype level
! 4 terms not validated in feature 'columns': 'cell_type', 'cell_type_id', 'tissue', 'disease'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
 added 4 records with Feature for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
 added 3 records from_public with bionty.CellType for "cell_type": 'T cell', 'hematopoietic stem cell', 'hepatocyte'
! 1 term not validated in feature 'cell_type': 'my new cell type'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')
 added 1 record with bionty.CellType for "cell_type": 'my new cell type'
 "columns" is validated against Feature.name
 "cell_type" is validated against CellType.name
 "cell_type_id" is validated against CellType.ontology_id
 added 4 records from_public with bionty.Tissue for "tissue": 'kidney', 'liver', 'heart', 'brain'
 "tissue" is validated against Tissue.name
 added 4 records from_public with bionty.Disease for "disease": 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
 "disease" is validated against Disease.name
 created 1 Organism record from Bionty matching name: 'human'
 added 99 records from_public with bionty.Gene for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
 "var_index" is validated against Gene.ensembl_gene_id
 writing the in-memory object into cache
 99 unique terms (100.00%) are validated for ensembl_gene_id
 4 unique terms (100.00%) are validated for name
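
The log above follows a validate-then-register pattern: values in a categorical column are checked against a registry, and any term that is not found (here 'my new cell type') has to be added explicitly before validation passes. The sketch below illustrates only that pattern in plain Python. `PUBLIC_CELL_TYPES`, `split_validated`, and `register_new_terms` are hypothetical names for illustration and are not part of LaminDB or Bionty.

```python
# Hypothetical stand-in for a public ontology; not LaminDB's or Bionty's data model.
PUBLIC_CELL_TYPES = {"T cell", "hematopoietic stem cell", "hepatocyte"}


def split_validated(values, registry):
    """Return (validated, non_validated) unique terms in first-seen order."""
    unique = list(dict.fromkeys(values))
    validated = [v for v in unique if v in registry]
    non_validated = [v for v in unique if v not in registry]
    return validated, non_validated


def register_new_terms(terms, registry):
    """Explicitly accept custom terms so that a later validation passes."""
    registry.update(terms)


observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]
registry = set(PUBLIC_CELL_TYPES)

validated, missing = split_validated(observed, registry)
print(missing)  # ['my new cell type']

# Accept the custom term, then validate again.
register_new_terms(missing, registry)
_, missing_after = split_validated(observed, registry)
print(missing_after)  # []
```

Keeping the registration step explicit means a typo in a label is reported instead of silently becoming a new term.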

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track("eNef4Arw8nNM")
 created Transform('eNef4Arw8nNM0000', key='analysis-flow.ipynb'), started new Run('8VX87x7UsimUJHYb') at 2025-10-27 08:30:58 UTC
 notebook imports: bionty==1.8.1 lamindb==1.14a1
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact:  (0000)
|   description: anndata with obs
├── uid: wea5XnlgkpjdwN6y0000            run: J1AD0Hw (register_example_file.py)
kind: dataset                        otype: AnnData                         
hash: IJORtcQUSS11QBqD-nTD0A         size: 45.9 KB                          
branch: main                         space: all                             
created_at: 2025-10-27 08:30:56 UTC  created_by: testuser1                  
n_observations: 40                                                          
├── storage/path: 
/home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/wea5XnlgkpjdwN6y0000.h5ad
├── Dataset features
├── var (99 bionty.Gene)                                                                                       
│   TSPAN6                          float                                                                      
│   TNMD                            float                                                                      
│   DPM1                            float                                                                      
│   SCYL3                           float                                                                      
│   FIRRM                           float                                                                      
│   FGR                             float                                                                      
│   CFH                             float                                                                      
│   FUCA2                           float                                                                      
│   GCLC                            float                                                                      
│   NFYA                            float                                                                      
│   STPG1                           float                                                                      
│   NIPAL3                          float                                                                      
│   LAS1L                           float                                                                      
│   ENPP4                           float                                                                      
│   SEMA3F                          float                                                                      
│   CFTR                            float                                                                      
│   ANKIB1                          float                                                                      
│   CYP51A1                         float                                                                      
│   KRIT1                           float                                                                      
│   RAD52                           float                                                                      
└── obs (4)                                                                                                    
    cell_type                       bionty.CellType                    T cell, hematopoietic stem cell, hepato…
    cell_type_id                    bionty.CellType                    T cell, hematopoietic stem cell, hepato…
    disease                         bionty.Disease                     Alzheimer disease, cardiac ventricle di…
    tissue                          bionty.Tissue                      brain, heart, kidney, liver             
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver, heart, brain             
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell, hepato…
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma,…

Get a backed AnnData object

adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object wea5XnlgkpjdwN6y0000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
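
A lookup object exposes record names as attributes, so `cell_types.t_cell` can be auto-completed and, with `return_field="name"`, yields the name string. The following is a rough plain-Python sketch of that idea, assuming names are turned into attribute keys by lower-casing and replacing non-alphanumeric characters with underscores. `make_lookup` is a hypothetical helper and does not reflect how LaminDB builds its lookups.

```python
import re
from collections import namedtuple


def make_lookup(names):
    """Build an attribute-style lookup from record names (illustrative only)."""

    def to_key(name):
        # Lower-case and replace runs of non-alphanumeric characters with "_".
        return re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")

    mapping = {to_key(name): name for name in names}
    Lookup = namedtuple("Lookup", list(mapping))
    return Lookup(**mapping)


cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(cell_types.t_cell)  # T cell
print(cell_types.hematopoietic_stem_cell)  # hematopoietic stem cell
```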

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
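
The mask used for subsetting combines two membership tests with `&`, so an observation is kept only if both its cell type and its disease are in the selected sets. The plain-Python sketch below mirrors that logic on a handful of made-up rows. The rows are hypothetical and much smaller than the 40-observation dataset above.

```python
from collections import Counter

# Hypothetical obs rows for illustration; not the actual example dataset.
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# A row is kept only if BOTH conditions hold (the `&` in the mask).
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) combinations, similar to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(mask)  # [True, False, True, False]
for combination, n in counts.items():
    print(combination, n)
```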

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
! organism is ignored, define it on the dtype level
/opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1793: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
 writing the in-memory object into cache
 returning schema with same hash: Schema(uid='wGhYTvvM7XSjprk3', name=None, description=None, n=99, is_type=False, itype='bionty.Gene', otype=None, dtype='float', hash='QogdpqbT704yi5K-Ag5zhg', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:30:57 UTC, is_locked=False)
 returning schema with same hash: Schema(uid='Ri9T9VD2SZ7WFig3', name=None, description=None, n=4, is_type=False, itype='Feature', otype='DataFrame', dtype=None, hash='T11hosOLps0JnRkHxIy4gQ', minimal_set=True, ordered_set=False, maximal_set=False, slot=None, branch_id=1, space_id=1, created_by_id=1, run_id=1, type_id=None, validated_by_id=None, composite_id=None, created_at=2025-10-27 08:30:57 UTC, is_locked=False)
Artifact:  (0000)
|   description: anndata with obs subset
├── uid: aYPzvvDy5CtdEY000000            run: 8VX87x7 (analysis-flow.ipynb)
kind: dataset                        otype: AnnData                    
hash: RgGUx7ndRplZZSmalTAWiw         size: 38.1 KB                     
branch: main                         space: all                        
created_at: 2025-10-27 08:30:59 UTC  created_by: testuser1             
n_observations: 20                                                     
├── storage/path: 
/home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/aYPzvvDy5CtdEY000000.h5ad
├── Dataset features
├── var (99 bionty.Gene)                                                                                       
│   TSPAN6                          float                                                                      
│   TNMD                            float                                                                      
│   DPM1                            float                                                                      
│   SCYL3                           float                                                                      
│   FIRRM                           float                                                                      
│   FGR                             float                                                                      
│   CFH                             float                                                                      
│   FUCA2                           float                                                                      
│   GCLC                            float                                                                      
│   NFYA                            float                                                                      
│   STPG1                           float                                                                      
│   NIPAL3                          float                                                                      
│   LAS1L                           float                                                                      
│   ENPP4                           float                                                                      
│   SEMA3F                          float                                                                      
│   CFTR                            float                                                                      
│   ANKIB1                          float                                                                      
│   CYP51A1                         float                                                                      
│   KRIT1                           float                                                                      
│   RAD52                           float                                                                      
└── obs (4)                                                                                                    
    cell_type                       bionty.CellType                    T cell, hematopoietic stem cell         
    disease                         bionty.Disease                     chronic kidney disease, liver lymphoma  
    tissue                          bionty.Tissue                      kidney, liver                           
    cell_type_id                    bionty.CellType                                                            
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver                           
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell         
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma  

Examine data lineage

Query a subsetted .h5ad artifact labeled with “hematopoietic stem cell” or “T cell” (the `__in` lookup matches artifacts linked to any of the listed cell types):

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='aYPzvvDy5CtdEY000000', version=None, is_latest=True, key=None, description='anndata with obs subset', suffix='.h5ad', kind='dataset', otype='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_files=None, n_observations=20, branch_id=1, space_id=1, storage_id=1, run_id=2, schema_id=None, created_by_id=1, created_at=2025-10-27 08:30:59 UTC, is_locked=False)
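
The keyword arguments in the query above follow Django's `field__lookup` convention: a bare field name means equality, `__endswith` is a suffix match, and `__in` on a multi-valued relation matches when at least one related value is in the list. As a minimal sketch of those semantics on in-memory dictionaries, consider the following. `matches` and the simplified `records` are hypothetical and stand in for a real database query.

```python
def matches(record, **lookups):
    """Return True if `record` (a dict) satisfies all `field__lookup=value` pairs."""
    for expression, expected in lookups.items():
        field, _, lookup = expression.partition("__")
        value = record[field]
        if lookup == "":
            ok = value == expected
        elif lookup == "endswith":
            ok = value.endswith(expected)
        elif lookup == "in":
            # For a multi-valued field, any overlap counts as a match.
            values = value if isinstance(value, (list, set, tuple)) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {lookup}")
        if not ok:
            return False
    return True


# Simplified, hypothetical records modeled on the two artifacts in this guide.
records = [
    {
        "description": "anndata with obs",
        "suffix": ".h5ad",
        "cell_types": ["T cell", "hematopoietic stem cell", "hepatocyte"],
    },
    {
        "description": "anndata with obs subset",
        "suffix": ".h5ad",
        "cell_types": ["T cell", "hematopoietic stem cell"],
    },
]

hits = [
    r
    for r in records
    if matches(
        r,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print([r["description"] for r in hits])  # ['anndata with obs subset']
```

Both records pass the `cell_types__in` condition here; it is the `description__endswith="subset"` condition that narrows the result to the subset.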

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • Who saved it?

  • Which artifacts were its inputs?

Let’s answer these questions using LaminDB:

artifact.features
Artifact:  (0000)
|   description: anndata with obs subset
└── Dataset features
    ├── var (99 bionty.Gene)                                                                                       
    │   TSPAN6                          float                                                                      
    │   TNMD                            float                                                                      
    │   DPM1                            float                                                                      
    │   SCYL3                           float                                                                      
    │   FIRRM                           float                                                                      
    │   FGR                             float                                                                      
    │   CFH                             float                                                                      
    │   FUCA2                           float                                                                      
    │   GCLC                            float                                                                      
    │   NFYA                            float                                                                      
    │   STPG1                           float                                                                      
    │   NIPAL3                          float                                                                      
    │   LAS1L                           float                                                                      
    │   ENPP4                           float                                                                      
    │   SEMA3F                          float                                                                      
    │   CFTR                            float                                                                      
    │   ANKIB1                          float                                                                      
    │   CYP51A1                         float                                                                      
    │   KRIT1                           float                                                                      
    │   RAD52                           float                                                                      
    └── obs (4)                                                                                                    
        cell_type                       bionty.CellType                    T cell, hematopoietic stem cell         
        disease                         bionty.Disease                     chronic kidney disease, liver lymphoma  
        tissue                          bionty.Tissue                      kidney, liver                           
        cell_type_id                    bionty.CellType                                                            

print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.to_dataframe())
--> What is the lineage of this artifact?
[Figure: lineage graph of the artifact, as rendered by artifact.view_lineage()]
--> Which features and labels are associated with it?
Artifact:  (0000)
|   description: anndata with obs subset
└── Dataset features
    ├── var (99 bionty.Gene)                                                                                       
    │   TSPAN6                          float                                                                      
    │   TNMD                            float                                                                      
    │   DPM1                            float                                                                      
    │   SCYL3                           float                                                                      
    │   FIRRM                           float                                                                      
    │   FGR                             float                                                                      
    │   CFH                             float                                                                      
    │   FUCA2                           float                                                                      
    │   GCLC                            float                                                                      
    │   NFYA                            float                                                                      
    │   STPG1                           float                                                                      
    │   NIPAL3                          float                                                                      
    │   LAS1L                           float                                                                      
    │   ENPP4                           float                                                                      
    │   SEMA3F                          float                                                                      
    │   CFTR                            float                                                                      
    │   ANKIB1                          float                                                                      
    │   CYP51A1                         float                                                                      
    │   KRIT1                           float                                                                      
    │   RAD52                           float                                                                      
    └── obs (4)                                                                                                    
        cell_type                       bionty.CellType                    T cell, hematopoietic stem cell         
        disease                         bionty.Disease                     chronic kidney disease, liver lymphoma  
        tissue                          bionty.Tissue                      kidney, liver                           
        cell_type_id                    bionty.CellType                                                            

Labels
└── .tissues                         bionty.Tissue                        kidney, liver                            
    .cell_types                      bionty.CellType                      T cell, hematopoietic stem cell          
    .diseases                        bionty.Disease                       chronic kidney disease, liver lymphoma   
--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', version=None, is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', hash=None, reference=None, reference_type=None, branch_id=1, space_id=1, created_by_id=1, created_at=2025-10-27 08:30:58 UTC, is_locked=False)


--> Who saved this artifact?

User object (1)


--> Which artifacts were inputs?
id                2
uid               wea5XnlgkpjdwN6y0000
key               None
description       anndata with obs
suffix            .h5ad
kind              dataset
otype             AnnData
size              46992
hash              IJORtcQUSS11QBqD-nTD0A
n_files           None
n_observations    40
version           None
is_latest         True
is_locked         False
created_at        2025-10-27 08:30:56.700000+00:00
branch_id         1
space_id          1
storage_id        1
run_id            1
schema_id         None
created_by_id     1
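
The answers above all come from the same chain of links: an artifact points to the run that created it, the run points to its transform (script or notebook) and to the artifacts it read as inputs. The sketch below models that chain with plain dataclasses and walks it upstream. The `Run` and `Artifact` classes and the `upstream` helper are hypothetical simplifications for illustration and are not LaminDB's classes.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    transform: str  # name of the script or notebook that was executed
    inputs: list = field(default_factory=list)  # artifacts read by the run


@dataclass
class Artifact:
    description: str
    run: Run  # the run that created this artifact


def upstream(artifact):
    """Yield (description, transform) pairs from `artifact` back to its origins."""
    stack = [artifact]
    seen = set()
    while stack:
        current = stack.pop()
        if id(current) in seen:
            continue
        seen.add(id(current))
        yield current.description, current.run.transform
        stack.extend(current.run.inputs)


# Rebuild the two-step flow of this guide with the simplified classes.
register_run = Run(transform="register_example_file.py")
full = Artifact(description="anndata with obs", run=register_run)

notebook_run = Run(transform="analysis-flow.ipynb", inputs=[full])
subset = Artifact(description="anndata with obs subset", run=notebook_run)

for description, transform in upstream(subset):
    print(f"{description!r} <- {transform}")
# 'anndata with obs subset' <- analysis-flow.ipynb
# 'anndata with obs' <- register_example_file.py
```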
!rm -r ./test-analysis-flow
!lamin delete --force test-analysis-flow