
Analysis flow

Here, we’ll track typical data transformations that occur during analysis, such as subsetting.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-flow --schema bionty
 connected lamindb: testuser1/analysis-flow
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/analysis-flow

Save an initial dataset

register_example_file.py
import lamindb as ln
import bionty as bt

ln.track("K4wsS5DTYdFp0000")

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()
!python analysis-flow-scripts/register_example_file.py
 connected lamindb: testuser1/analysis-flow
 created Transform('K4wsS5DT'), started new Run('wiH23FQe') at 2024-12-20 15:09:20 UTC
 added 4 records with Feature.name for "columns": 'cell_type', 'cell_type_id', 'tissue', 'disease'
 saving validated records of 'cell_type'
 added 3 records from public with CellType.name for "cell_type": 'T cell', 'hematopoietic stem cell', 'hepatocyte'
 added 1 record with CellType.name for "cell_type": 'my new cell type'
 saving validated records of 'var_index'
 added 99 records from public with Gene.ensembl_gene_id for "var_index": 'ENSG00000000003', 'ENSG00000000005', 'ENSG00000000419', 'ENSG00000000457', 'ENSG00000000460', 'ENSG00000000938', 'ENSG00000000971', 'ENSG00000001036', 'ENSG00000001084', 'ENSG00000001167', 'ENSG00000001460', 'ENSG00000001461', 'ENSG00000001497', 'ENSG00000001561', 'ENSG00000001617', 'ENSG00000001626', 'ENSG00000001629', 'ENSG00000001630', 'ENSG00000001631', 'ENSG00000002016', ...
 saving validated records of 'tissue'
 added 4 records from public with Tissue.name for "tissue": 'kidney', 'brain', 'liver', 'heart'
 saving validated records of 'disease'
 added 4 records from public with Disease.name for "disease": 'cardiac ventricle disorder', 'liver lymphoma', 'Alzheimer disease', 'chronic kidney disease'
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type" is validated against CellType.name
 "cell_type_id" is validated against CellType.ontology_id
 "tissue" is validated against Tissue.name
 "disease" is validated against Disease.name
 finished Run('wiH23FQe') after 0d 0h 0m 4s at 2024-12-20 15:09:24 UTC

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track("eNef4Arw8nNM0000")
 created Transform('eNef4Arw'), started new Run('S2TzrspR') at 2024-12-20 15:09:25 UTC
 notebook imports: bionty==0.53.2 lamindb==0.77.3
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'pUEdDHrOk3Ky3Z740000'
│   ├── .size = 46992
│   ├── .hash = 'IJORtcQUSS11QBqD-nTD0A'
│   ├── .n_observations = 40
│   ├── .path = 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/pUEdDHrOk3Ky3Z740000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:09:24
│   └── .transform = 'register_example_file.py'
├── Dataset features/.feature_sets
│   ├── var99                    [bionty.Gene]                                                       
│   │   TSPAN6                      float                                                               
│   │   TNMD                        float                                                               
│   │   DPM1                        float                                                               
│   │   SCYL3                       float                                                               
│   │   FIRRM                       float                                                               
│   │   FGR                         float                                                               
│   │   CFH                         float                                                               
│   │   FUCA2                       float                                                               
│   │   GCLC                        float                                                               
│   │   NFYA                        float                                                               
│   │   STPG1                       float                                                               
│   │   NIPAL3                      float                                                               
│   │   LAS1L                       float                                                               
│   │   ENPP4                       float                                                               
│   │   SEMA3F                      float                                                               
│   │   CFTR                        float                                                               
│   │   ANKIB1                      float                                                               
│   │   CYP51A1                     float                                                               
│   │   KRIT1                       float                                                               
│   │   RAD52                       float                                                               
│   └── obs4                     [Feature]                                                           
│       cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell, hepatoc…
│       cell_type_id                cat[bionty.CellType]       T cell, hematopoietic stem cell, hepatoc…
│       disease                     cat[bionty.Disease]        Alzheimer disease, cardiac ventricle dis…
│       tissue                      cat[bionty.Tissue]         brain, heart, kidney, liver
└── Labels
    └── .tissues                    bionty.Tissue              kidney, brain, liver, heart              
        .cell_types                 bionty.CellType            T cell, hematopoietic stem cell, hepatoc…
        .diseases                   bionty.Disease             cardiac ventricle disorder, liver lympho…

Get a backed AnnData object

adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object pUEdDHrOk3Ky3Z740000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
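
The lookup objects above expose each label under an attribute-friendly key, so that "T cell" becomes `cell_types.t_cell`. A minimal sketch of that idea follows, assuming a simple normalization rule; `make_lookup` is a hypothetical helper written for illustration and is not LaminDB's implementation.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Map each name to a lowercase, underscore-separated attribute (hypothetical helper)."""
    entries = {}
    for name in names:
        # "T cell" -> "t_cell": runs of non-alphanumeric characters become "_"
        key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
        entries[key] = name
    return SimpleNamespace(**entries)


cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(cell_types.t_cell)  # T cell
print(cell_types.hematopoietic_stem_cell)  # hematopoietic stem cell
```

Accessing labels through attributes gives tab completion in a notebook and avoids typos in free-text label names.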

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
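
The subsetting logic above combines two membership tests into one boolean mask and then counts the surviving combinations. The sketch below mirrors it with the standard library only; the four toy rows are illustrative assumptions, not the real dataset.

```python
from collections import Counter

# Toy observation table: one dict per cell (illustrative values only)
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# One boolean per row; both conditions must hold (the `&` in the notebook)
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Rough equivalent of value_counts() over the two columns
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
print(counts)
```

A row is kept only if it passes both tests, which is why a T cell with Alzheimer disease and a hepatocyte with liver lymphoma are both dropped.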

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
/opt/hostedtoolcache/Python/3.11.11/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
 "var_index" is validated against Gene.ensembl_gene_id
 "cell_type" is validated against CellType.name
 "disease" is validated against Disease.name
 "tissue" is validated against Tissue.name
True
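
Conceptually, validation checks that every value in a categorical column is a known term in the corresponding registry. The following sketch shows that check in plain Python; `validate_categoricals` and the toy vocabularies are hypothetical and only illustrate the idea, not how the curator is implemented.

```python
def validate_categoricals(records, vocabularies):
    """Return, per column, the values missing from its vocabulary (hypothetical helper)."""
    non_validated = {}
    for column, allowed in vocabularies.items():
        observed = {record[column] for record in records}
        missing = observed - set(allowed)
        if missing:
            non_validated[column] = missing
    return non_validated


records = [
    {"cell_type": "T cell", "tissue": "kidney"},
    {"cell_type": "my new cell type", "tissue": "liver"},
]
vocabularies = {
    "cell_type": ["T cell", "hematopoietic stem cell", "hepatocyte"],
    "tissue": ["kidney", "brain", "liver", "heart"],
}

problems = validate_categoricals(records, vocabularies)
print(problems)  # {'cell_type': {'my new cell type'}}
print(not problems)  # False: validation passes only if nothing is missing
```

In the registration script, the unknown term was first added to the registry with `curate.add_new_from("cell_type")`, which is why the subset validates here without further changes.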
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'M6rzRSXmPhQyy18z0000'
│   ├── .size = 38992
│   ├── .hash = 'RgGUx7ndRplZZSmalTAWiw'
│   ├── .n_observations = 20
│   ├── .path = 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow/.lamindb/M6rzRSXmPhQyy18z0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:09:26
│   └── .transform = 'Analysis flow'
├── Dataset features/.feature_sets
│   ├── var99                    [bionty.Gene]                                                       
│   │   TSPAN6                      float                                                               
│   │   TNMD                        float                                                               
│   │   DPM1                        float                                                               
│   │   SCYL3                       float                                                               
│   │   FIRRM                       float                                                               
│   │   FGR                         float                                                               
│   │   CFH                         float                                                               
│   │   FUCA2                       float                                                               
│   │   GCLC                        float                                                               
│   │   NFYA                        float                                                               
│   │   STPG1                       float                                                               
│   │   NIPAL3                      float                                                               
│   │   LAS1L                       float                                                               
│   │   ENPP4                       float                                                               
│   │   SEMA3F                      float                                                               
│   │   CFTR                        float                                                               
│   │   ANKIB1                      float                                                               
│   │   CYP51A1                     float                                                               
│   │   KRIT1                       float                                                               
│   │   RAD52                       float                                                               
│   └── obs4                     [Feature]                                                           
│       cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell
│       disease                     cat[bionty.Disease]        chronic kidney disease, liver lymphoma
│       tissue                      cat[bionty.Tissue]         kidney, liver
│       cell_type_id                cat[bionty.CellType]
└── Labels
    └── .tissues                    bionty.Tissue              kidney, liver                            
        .cell_types                 bionty.CellType            T cell, hematopoietic stem cell          
        .diseases                   bionty.Disease             liver lymphoma, chronic kidney disease   

Examine data lineage

Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='M6rzRSXmPhQyy18z0000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-12-20 15:09:26 UTC)

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?

Let’s answer these questions using LaminDB:

print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())
--> What is the lineage of this artifact?
[Figure: lineage graph rendered by artifact.view_lineage()]
--> Which features and labels are associated with it?
Artifact .h5ad/AnnData
└── Dataset features/.feature_sets
    ├── var99                    [bionty.Gene]
    │   TSPAN6                      float
    │   TNMD                        float
    │   DPM1                        float
    │   SCYL3                       float
    │   FIRRM                       float
    │   FGR                         float
    │   CFH                         float
    │   FUCA2                       float
    │   GCLC                        float
    │   NFYA                        float
    │   STPG1                       float
    │   NIPAL3                      float
    │   LAS1L                       float
    │   ENPP4                       float
    │   SEMA3F                      float
    │   CFTR                        float
    │   ANKIB1                      float
    │   CYP51A1                     float
    │   KRIT1                       float
    │   RAD52                       float
    └── obs4                     [Feature]
        cell_type                   cat[bionty.CellType]       T cell, hematopoietic stem cell
        disease                     cat[bionty.Disease]        chronic kidney disease, liver lymphoma
        tissue                      cat[bionty.Tissue]         kidney, liver
        cell_type_id                cat[bionty.CellType]

Artifact .h5ad/AnnData
└── Labels
    └── .tissues                    bionty.Tissue              kidney, liver
        .cell_types                 bionty.CellType            T cell, hematopoietic stem cell
        .diseases                   bionty.Disease             liver lymphoma, chronic kidney disease
--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', is_latest=True, name='Analysis flow', key='analysis-flow.ipynb', type='notebook', created_by_id=1, created_at=2024-12-20 15:09:25 UTC)


--> Who saved this artifact?

User(uid='DzTjkKse', handle='testuser1', name='Test User1', created_at=2024-12-20 15:09:17 UTC)


--> Which artifacts were inputs?
id                 1
uid                pUEdDHrOk3Ky3Z740000
key                None
description        anndata with obs
suffix             .h5ad
type               dataset
size               46992
hash               IJORtcQUSS11QBqD-nTD0A
n_objects          None
n_observations     40
_hash_type         md5
_accessor          AnnData
visibility         1
_key_is_virtual    True
storage_id         1
transform_id       1
version            None
is_latest          True
run_id             1
created_at         2024-12-20 15:09:24.476574+00:00
created_by_id      1
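
The lineage questions above reduce to following links from an artifact to the run that created it, and from that run to its input artifacts. Here is a minimal sketch of that data model, assuming simplified `Run` and `Artifact` classes that are hypothetical stand-ins for the LaminDB registries.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    transform: str   # name of the script or notebook that was executed
    created_by: str  # handle of the user who ran it
    inputs: list = field(default_factory=list)  # artifacts read during the run


@dataclass
class Artifact:
    description: str
    run: Run  # the run that produced this artifact


def ancestors(artifact):
    """Walk upstream through run inputs; return ancestor descriptions, nearest first."""
    found = []
    queue = list(artifact.run.inputs)
    while queue:
        parent = queue.pop(0)
        found.append(parent.description)
        queue.extend(parent.run.inputs)
    return found


raw = Artifact("anndata with obs", Run("register_example_file.py", "testuser1"))
subset = Artifact(
    "anndata with obs subset", Run("Analysis flow", "testuser1", inputs=[raw])
)

print(subset.run.transform)  # Analysis flow
print(ancestors(subset))  # ['anndata with obs']
```

Because each artifact stores only its producing run, and each run stores only its inputs, the full history can be reconstructed by repeating this one-step lookup.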
!rm -r ./analysis-flow
!lamin delete --force analysis-flow
 deleting instance testuser1/analysis-flow