
Analysis flow

Here, we’ll track typical data transformations like subsetting that occur during analysis.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-analysis-flow --modules bionty
 initialized lamindb: testuser1/test-analysis-flow
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/test-analysis-flow

Save an initial dataset

register_example_file.py
import lamindb as ln
import bionty as bt

ln.track("K4wsS5DTYdFp0000")

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()
!python analysis-flow-scripts/register_example_file.py
 connected lamindb: testuser1/test-analysis-flow
 created Transform('K4wsS5DTYdFp0000'), started new Run('99emMAMD...') at 2025-07-08 11:05:47 UTC
! organism is ignored, define it on the dtype level
! 4 terms not validated in feature 'columns': 'cell_type', 'cell_type_id', 'tissue', 'disease'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('columns')
 added 4 records with Feature for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
 added 3 records from_public with bionty.CellType for "cell_type": 'T cell', 'hematopoietic stem cell', 'hepatocyte'
! 1 term not validated in feature 'cell_type': 'my new cell type'
    → fix typos, remove non-existent values, or save terms via: curator.cat.add_new_from('cell_type')
 added 1 record with bionty.CellType for "cell_type": 'my new cell type'
 "columns" is validated against Feature.name
 "cell_type" is validated against CellType.name
 "cell_type_id" is validated against CellType.ontology_id
 added 4 records from_public with bionty.Tissue for "tissue": 'kidney', 'liver', 'heart', 'brain'
 "tissue" is validated against Tissue.name
 added 4 records from_public with bionty.Disease for "disease": 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
 "disease" is validated against Disease.name
 created 1 Organism record from Bionty matching name: 'human'
 added 99 records from_public with bionty.Gene for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
 "var_index" is validated against Gene.ensembl_gene_id
 99 unique terms (100.00%) are validated for ensembl_gene_id
 4 unique terms (100.00%) are validated for name
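The log above shows the curator checking each categorical column against a registry and flagging terms it cannot find, such as 'my new cell type'. The following is a minimal, library-free sketch of that idea. The function `find_non_validated` and the `known_cell_types` set are hypothetical stand-ins for illustration; they are not LaminDB's API and do not reflect the actual ontology contents.

```python
def find_non_validated(values, known_terms):
    """Return the unique values that are missing from a set of known terms."""
    return sorted(set(values) - set(known_terms))


# Hypothetical registry content, standing in for bionty.CellType names.
known_cell_types = {"T cell", "hematopoietic stem cell", "hepatocyte"}

# Hypothetical column values, mirroring the terms in the log above.
observed = ["T cell", "hepatocyte", "my new cell type", "T cell"]

non_validated = find_non_validated(observed, known_cell_types)
print(non_validated)

# Registering the new term makes the column pass, analogous to add_new_from.
known_cell_types.update(non_validated)
print(find_non_validated(observed, known_cell_types))
```

In this sketch, validation is a set difference: a column passes once every unique value has a matching registry entry.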

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track("eNef4Arw8nNM")
 created Transform('eNef4Arw8nNM0000'), started new Run('b4aH3CnP...') at 2025-07-08 11:05:53 UTC
 notebook imports: bionty==1.6.0 lamindb==1.7.1
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: DGmQ34VdZFRcd7uT0000          hash: IJORtcQUSS11QBqD-nTD0A
│   ├── size: 45.9 KB                      n_observations: 40
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-08 11:05:52    created_by: testuser1 (Test User1)
│   ├── storage location / path: 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/DGmQ34VdZFRcd7uT0000.h5ad
│   ├── description: anndata with obs
│   └── transform: register_example_file.py
├── Dataset features
│   ├── var • 99                        [bionty.Gene]
│   │   TSPAN6                          float                                                                      
│   │   TNMD                            float                                                                      
│   │   DPM1                            float                                                                      
│   │   SCYL3                           float                                                                      
│   │   FIRRM                           float                                                                      
│   │   FGR                             float                                                                      
│   │   CFH                             float                                                                      
│   │   FUCA2                           float                                                                      
│   │   GCLC                            float                                                                      
│   │   NFYA                            float                                                                      
│   │   STPG1                           float                                                                      
│   │   NIPAL3                          float                                                                      
│   │   LAS1L                           float                                                                      
│   │   ENPP4                           float                                                                      
│   │   SEMA3F                          float                                                                      
│   │   CFTR                            float                                                                      
│   │   ANKIB1                          float                                                                      
│   │   CYP51A1                         float                                                                      
│   │   KRIT1                           float                                                                      
│   │   RAD52                           float                                                                      
│   └── obs • 4                         [Feature]
│       cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell, hepato…
│       cell_type_id                    cat[bionty.CellType]               T cell, hematopoietic stem cell, hepato…
│       disease                         cat[bionty.Disease]                Alzheimer disease, cardiac ventricle di…
│       tissue                          cat[bionty.Tissue]                 brain, heart, kidney, liver
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver, heart, brain             
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell, hepato…
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma,…

Get a backed AnnData object:

adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object DGmQ34VdZFRcd7uT0000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset the dataset to specific cell types and diseases:

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
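The lookup objects above expose each label as an attribute whose name is a snake_case form of the label, which is why 'T cell' is reachable as `cell_types.t_cell`. As a rough illustration of that naming convention only, here is a small standard-library sketch. `make_lookup` is a hypothetical helper, not part of LaminDB, and the real implementation may normalize names differently.

```python
import re
from collections import namedtuple


def make_lookup(names):
    """Build an object whose attributes are snake_case versions of the names."""
    keys = [re.sub(r"\W+", "_", name.strip().lower()) for name in names]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*names)


# Hypothetical label names, taken from the cell types shown earlier.
lookup_sketch = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(lookup_sketch.t_cell)
print(lookup_sketch.hematopoietic_stem_cell)
```

Attribute access gives tab completion in a notebook and avoids typos in free-text label names.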

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
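The subset above keeps an observation only when its cell type and its disease are both in the selected lists. The following standard-library sketch shows the same filtering and counting logic on a few hypothetical rows. It uses plain Python lists instead of the backed AnnData object, so it illustrates the logic rather than the actual data access.

```python
from collections import Counter

# Hypothetical toy annotations; the real dataset has 40 observations.
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# A row is kept only if BOTH conditions hold, like `isin(...) & isin(...)`.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count (cell_type, disease) pairs, similar in spirit to value_counts().
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for pair, n in counts.items():
    print(pair, n)
```

Rows that satisfy only one of the two conditions, such as a hepatocyte with liver lymphoma, are dropped.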

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
! organism is ignored, define it on the dtype level
/opt/hostedtoolcache/Python/3.12.11/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
 returning existing schema with same hash: Schema(uid='0CB04nWsWQkT63Rs', n=99, is_type=False, itype='bionty.Gene', dtype='float', hash='QogdpqbT704yi5K-Ag5zhg', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-08 11:05:52 UTC)
 returning existing schema with same hash: Schema(uid='Kfp8Ypf07kNRtuIY', n=4, is_type=False, itype='Feature', otype='DataFrame', hash='v6iWSSZ4b79X9o8Trk-SEg', minimal_set=True, ordered_set=False, maximal_set=False, branch_id=1, space_id=1, created_by_id=1, run_id=1, created_at=2025-07-08 11:05:52 UTC)
Artifact .h5ad · AnnData · dataset
├── General
│   ├── uid: Tx3neL0fptMPwNSo0000          hash: RgGUx7ndRplZZSmalTAWiw
│   ├── size: 38.1 KB                      n_observations: 20
│   ├── space: all                         branch: main
│   ├── created_at: 2025-07-08 11:05:54    created_by: testuser1 (Test User1)
│   ├── storage location / path: 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-analysis-flow/.lamindb/Tx3neL0fptMPwNSo0000.h5ad
│   ├── description: anndata with obs subset
│   └── transform: analysis-flow.ipynb
├── Dataset features
│   ├── var • 99                        [bionty.Gene]
│   │   TSPAN6                          float                                                                      
│   │   TNMD                            float                                                                      
│   │   DPM1                            float                                                                      
│   │   SCYL3                           float                                                                      
│   │   FIRRM                           float                                                                      
│   │   FGR                             float                                                                      
│   │   CFH                             float                                                                      
│   │   FUCA2                           float                                                                      
│   │   GCLC                            float                                                                      
│   │   NFYA                            float                                                                      
│   │   STPG1                           float                                                                      
│   │   NIPAL3                          float                                                                      
│   │   LAS1L                           float                                                                      
│   │   ENPP4                           float                                                                      
│   │   SEMA3F                          float                                                                      
│   │   CFTR                            float                                                                      
│   │   ANKIB1                          float                                                                      
│   │   CYP51A1                         float                                                                      
│   │   KRIT1                           float                                                                      
│   │   RAD52                           float                                                                      
│   └── obs • 4                         [Feature]
│       cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell
│       disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma
│       tissue                          cat[bionty.Tissue]                 kidney, liver
│       cell_type_id                    cat[bionty.CellType]
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver                           
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell         
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma  
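The "returning existing schema with same hash" messages in the output above indicate that a feature set is reused when an identical one was already saved. A minimal sketch of hash-based deduplication follows. The hashing scheme (MD5 over sorted names, URL-safe base64) and the names `hash_feature_set`, `schemas`, and `get_or_create_schema` are assumptions made for illustration; this document does not show how LaminDB computes its hashes.

```python
import base64
import hashlib


def hash_feature_set(feature_names):
    """Hash a set of feature names independent of their order."""
    joined = ",".join(sorted(feature_names)).encode()
    digest = hashlib.md5(joined).digest()
    # URL-safe base64 without padding gives a compact 22-character string.
    return base64.urlsafe_b64encode(digest).decode().rstrip("=")


schemas = {}  # hypothetical in-memory registry: hash -> feature names


def get_or_create_schema(feature_names):
    """Return (schema, created) and reuse an entry if the hash already exists."""
    key = hash_feature_set(feature_names)
    if key in schemas:
        return schemas[key], False
    schemas[key] = sorted(feature_names)
    return schemas[key], True


_, created_first = get_or_create_schema(
    ["cell_type", "disease", "tissue", "cell_type_id"]
)
_, created_second = get_or_create_schema(
    ["tissue", "disease", "cell_type_id", "cell_type"]
)
print(created_first, created_second)
```

Because the names are sorted before hashing, the second call with the same features in a different order finds the existing entry instead of creating a duplicate.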

Examine data lineage

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='Tx3neL0fptMPwNSo0000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', kind='dataset', otype='AnnData', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, branch_id=1, space_id=1, storage_id=1, run_id=2, created_by_id=1, created_at=2025-07-08 11:05:54 UTC)

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?

Let’s answer these questions using LaminDB:

artifact.features
Artifact .h5ad · AnnData · dataset
└── Dataset features
    ├── var • 99                        [bionty.Gene]
    │   TSPAN6                          float
    │   TNMD                            float
    │   DPM1                            float
    │   SCYL3                           float
    │   FIRRM                           float
    │   FGR                             float
    │   CFH                             float
    │   FUCA2                           float
    │   GCLC                            float
    │   NFYA                            float
    │   STPG1                           float
    │   NIPAL3                          float
    │   LAS1L                           float
    │   ENPP4                           float
    │   SEMA3F                          float
    │   CFTR                            float
    │   ANKIB1                          float
    │   CYP51A1                         float
    │   KRIT1                           float
    │   RAD52                           float
    └── obs • 4                         [Feature]
        cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell         
        disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma  
        tissue                          cat[bionty.Tissue]                 kidney, liver                           
        cell_type_id                    cat[bionty.CellType]                                                       

print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())
--> What is the lineage of this artifact?
[Image: data lineage graph for the artifact]
--> Which features and labels are associated with it?
Artifact .h5ad · AnnData · dataset
└── Dataset features
    ├── var • 99                        [bionty.Gene]
    │   TSPAN6                          float
    │   TNMD                            float
    │   DPM1                            float
    │   SCYL3                           float
    │   FIRRM                           float
    │   FGR                             float
    │   CFH                             float
    │   FUCA2                           float
    │   GCLC                            float
    │   NFYA                            float
    │   STPG1                           float
    │   NIPAL3                          float
    │   LAS1L                           float
    │   ENPP4                           float
    │   SEMA3F                          float
    │   CFTR                            float
    │   ANKIB1                          float
    │   CYP51A1                         float
    │   KRIT1                           float
    │   RAD52                           float
    └── obs • 4                         [Feature]
        cell_type                       cat[bionty.CellType]               T cell, hematopoietic stem cell         
        disease                         cat[bionty.Disease]                chronic kidney disease, liver lymphoma  
        tissue                          cat[bionty.Tissue]                 kidney, liver                           
        cell_type_id                    cat[bionty.CellType]                                                       

Artifact .h5ad · AnnData · dataset
└── Labels
    └── .tissues                        bionty.Tissue                      kidney, liver                           
        .cell_types                     bionty.CellType                    T cell, hematopoietic stem cell         
        .diseases                       bionty.Disease                     chronic kidney disease, liver lymphoma  
--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', is_latest=True, key='analysis-flow.ipynb', description='Analysis flow', type='notebook', branch_id=1, space_id=1, created_by_id=1, created_at=2025-07-08 11:05:53 UTC)


--> Who saved this artifact?

User object (1)


--> Which artifacts were inputs?
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | branch_id
2 | DGmQ34VdZFRcd7uT0000 | None | anndata with obs | .h5ad | dataset | AnnData | 46992 | IJORtcQUSS11QBqD-nTD0A | None | 40 | md5 | True | False | 1 | 1 | None | None | True | 1 | 2025-07-08 11:05:52.246000+00:00 | 1 | {'af': {'0': True}} | 1
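The answers above all follow from a few links: an artifact points to the run that created it, and a run points to its transform and its input artifacts. The sketch below models those links with plain dataclasses to show how parents can be found by walking backwards. The classes and the `upstream` function are simplified, hypothetical stand-ins that only borrow the attribute names seen above (`run`, `transform`, `input_artifacts`); they are not LaminDB's registries.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transform:
    key: str


@dataclass
class Run:
    transform: Transform
    input_artifacts: List["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that created this artifact


def upstream(artifact):
    """Yield (transform key, input description) pairs, walking back to the roots."""
    if artifact.run is None:
        return
    for parent in artifact.run.input_artifacts:
        yield artifact.run.transform.key, parent.description
        yield from upstream(parent)


# Recreate the two-step flow of this guide with the toy classes.
script_run = Run(Transform("register_example_file.py"))
full = Artifact("anndata with obs", run=script_run)
notebook_run = Run(Transform("analysis-flow.ipynb"), input_artifacts=[full])
subset = Artifact("anndata with obs subset", run=notebook_run)

for step in upstream(subset):
    print(step)
```

The subset has one parent, reached through the notebook run; the original dataset has none because the script that registered it read no input artifacts.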
!rm -r ./test-analysis-flow
!lamin delete --force test-analysis-flow