
Analysis flow

This guide tracks typical data transformations that occur during analysis, such as subsetting, so that every derived dataset can be traced back to its source.

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-flow --schema bionty
→ connected lamindb: testuser1/analysis-flow
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/analysis-flow

Save an initial dataset

register_example_file.py
import lamindb as ln
import bionty as bt

ln.track("K4wsS5DTYdFp0000")

# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()

# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")

ln.finish()
!python analysis-flow-scripts/register_example_file.py
→ connected lamindb: testuser1/analysis-flow
→ created Transform('K4wsS5DT'), started new Run('6OGILuiH') at 2024-11-21 06:57:23 UTC
✓ added 4 records with Feature.name for columns: 'cell_type', 'cell_type_id', 'tissue', 'disease'
• saving validated records of 'cell_type'
✓ added 3 records from public with CellType.name for cell_type: 'hematopoietic stem cell', 'hepatocyte', 'T cell'
✓ added 1 record with CellType.name for cell_type: 'my new cell type'
• saving validated records of 'var_index'
• saving validated records of 'tissue'
• saving validated records of 'disease'
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'cell_type' is validated against CellType.name
✓ 'cell_type_id' is validated against CellType.ontology_id
✓ 'tissue' is validated against Tissue.name
✓ 'disease' is validated against Disease.name
→ finished Run('6OGILuiH') after 0d 0h 0m 4s at 2024-11-21 06:57:27 UTC
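
The log above shows a pattern: values found in a public ontology are validated and saved, while a value such as 'my new cell type' has to be added explicitly before validation passes. The plain-Python sketch below illustrates that idea only; `known_cell_types`, `validate_column`, and `add_new` are hypothetical names, and this is not how LaminDB implements curation.

```python
# Toy registry standing in for an ontology-backed table (hypothetical contents).
known_cell_types = {"hematopoietic stem cell", "hepatocyte", "T cell"}


def validate_column(values, registry):
    """Return the sorted values that are missing from the registry."""
    return sorted(set(values) - registry)


def add_new(values, registry):
    """Register every value that is not yet known; return what was added."""
    new_values = validate_column(values, registry)
    registry.update(new_values)
    return new_values


column = ["T cell", "hepatocyte", "my new cell type", "T cell"]

missing_before = validate_column(column, known_cell_types)
added = add_new(column, known_cell_types)
missing_after = validate_column(column, known_cell_types)

print("missing before:", missing_before)
print("added:", added)
print("missing after:", missing_after)
```

Adding unknown values first and validating afterwards mirrors the order of `curate.add_new_from("cell_type")` and `curate.validate()` in the script.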

Open a dataset, subset it, and register the result

Track the current notebook:

ln.track("eNef4Arw8nNM0000")
→ created Transform('eNef4Arw'), started new Run('6ZPUWM1L') at 2024-11-21 06:57:28 UTC
→ notebook imports: bionty==0.53.1 lamindb==0.76.16
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact(uid='3ur15azB5SSYkYcc0000', is_latest=True, description='anndata with obs', suffix='.h5ad', type='dataset', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', n_observations=40, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:27 UTC)
  Provenance
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow'
    .transform = 'register_example_file.py'
    .run = 2024-11-21 06:57:23 UTC
    .created_by = 'testuser1'
  Labels
    .tissues = 'heart', 'liver', 'kidney', 'brain'
    .cell_types = 'hematopoietic stem cell', 'hepatocyte', 'T cell', 'my new cell type'
    .diseases = 'cardiac ventricle disorder', 'liver lymphoma', 'chronic kidney disease', 'Alzheimer disease'
  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell', 'hepatocyte', 'my new cell type'
    'cell_type_id' = 'hematopoietic stem cell', 'T cell', 'hepatocyte'
    'disease' = 'Alzheimer disease', 'cardiac ventricle disorder', 'chronic kidney disease', 'liver lymphoma'
    'tissue' = 'brain', 'heart', 'kidney', 'liver'
  Feature sets
    'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
    'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'

Get a backed AnnData object

adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
  constructed for the AnnData object 3ur15azB5SSYkYcc0000.h5ad
    obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
    var: ['_index']

Subset dataset to specific cell types and diseases

cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
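
The lookup objects expose each record under an attribute-style key, which is why 'T cell' is reachable as `cell_types.t_cell` in the next cell. The sketch below imitates that behavior in plain Python; `to_key` and `make_lookup` are hypothetical helpers, and the exact normalization rules LaminDB applies are an assumption here.

```python
import re
from types import SimpleNamespace


def to_key(name):
    """Lowercase a name and replace runs of non-alphanumerics with '_'."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python attributes cannot start with a digit, so prefix those.
    return f"_{key}" if key[:1].isdigit() else key


def make_lookup(names):
    """Build an object whose attributes map normalized keys to names."""
    return SimpleNamespace(**{to_key(name): name for name in names})


cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
diseases = make_lookup(["liver lymphoma", "chronic kidney disease"])

print(cell_types.t_cell)
print(diseases.chronic_kidney_disease)
```

Attribute access of this kind lets an editor auto-complete the available values, which avoids typos in hand-written label strings.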

Create the subset:

subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
  obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
  var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease               
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
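
The mask keeps a row only when both membership tests pass, so rows that satisfy one condition but not the other are dropped. The plain-Python sketch below mirrors that logic on a handful of made-up rows (hypothetical data, not the dataset above), with `collections.Counter` standing in for `value_counts()`.

```python
from collections import Counter

# Made-up observations for illustration only.
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# Equivalent of `isin(...) & isin(...)`: one boolean per row.
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count each surviving (cell_type, disease) pair.
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for pair, n in counts.most_common():
    print(pair, n)
```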

Register the subsetted AnnData:

curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
  utils.warn_names_duplicates("var")
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'cell_type' is validated against CellType.name
✓ 'disease' is validated against Disease.name
✓ 'tissue' is validated against Tissue.name
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
Artifact(uid='yKHoRaAHydl0VWe90000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:28 UTC)
  Provenance
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow'
    .transform = 'Analysis flow'
    .run = 2024-11-21 06:57:28 UTC
    .created_by = 'testuser1'
  Labels
    .tissues = 'liver', 'kidney'
    .cell_types = 'hematopoietic stem cell', 'T cell'
    .diseases = 'liver lymphoma', 'chronic kidney disease'
  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell'
    'disease' = 'chronic kidney disease', 'liver lymphoma'
    'tissue' = 'kidney', 'liver'
  Feature sets
    'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
    'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'

Examine data lineage

Query for a subsetted .h5ad artifact labeled with “hematopoietic stem cell” or “T cell”:

cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='yKHoRaAHydl0VWe90000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-11-21 06:57:28 UTC)
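
The keyword arguments follow Django-style lookups: a bare name tests equality, `__endswith` tests a string suffix, and `__in` matches a record linked to any of the listed values. The in-memory sketch below illustrates those three rules; `matches` and the dictionary records are hypothetical and only approximate how a database backend evaluates the query.

```python
def matches(record, **conditions):
    """Check a dict record against Django-style keyword conditions."""
    for key, expected in conditions.items():
        field, _, op = key.partition("__")
        value = record[field]
        if op == "":
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For a multi-valued field, any overlap with `expected` counts.
            values = value if isinstance(value, (list, set, tuple)) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Hypothetical records mimicking the two artifacts in this guide.
artifacts = [
    {
        "description": "anndata with obs",
        "suffix": ".h5ad",
        "cell_types": [
            "hematopoietic stem cell",
            "hepatocyte",
            "T cell",
            "my new cell type",
        ],
    },
    {
        "description": "anndata with obs subset",
        "suffix": ".h5ad",
        "cell_types": ["hematopoietic stem cell", "T cell"],
    },
]

hits = [
    a
    for a in artifacts
    if matches(
        a,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print([a["description"] for a in hits])
```

In this sketch both records pass the `cell_types__in` test; it is the `description__endswith="subset"` condition that narrows the result to the subsetted artifact.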

Common questions that might arise are:

  • What is the history of this artifact?

  • Which features and labels are associated with it?

  • Which notebook analyzed and registered this artifact?

  • By whom?

  • And which artifact is its parent?

Let’s answer these using LaminDB:

print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()

print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)

print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)

print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)

print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())
--> What is the lineage of this artifact?
_images/e84698b7cbfd3d977e48777d45f55749e831c756bcdba70cd31b265bab02c51e.svg
--> Which features and labels are associated with it?
  Features
    'cell_type' = 'T cell', 'hematopoietic stem cell'
    'disease' = 'chronic kidney disease', 'liver lymphoma'
    'tissue' = 'kidney', 'liver'
  Feature sets
    'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
    'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'

  Labels
    .tissues = 'liver', 'kidney'
    .cell_types = 'hematopoietic stem cell', 'T cell'
    .diseases = 'liver lymphoma', 'chronic kidney disease'



--> Which notebook analyzed and saved this artifact?

Transform(uid='eNef4Arw8nNM0000', is_latest=True, name='Analysis flow', key='analysis-flow.ipynb', type='notebook', created_by_id=1, created_at=2024-11-21 06:57:28 UTC)


--> Who saved this artifact?

User(uid='DzTjkKse', handle='testuser1', name='Test User1', created_at=2024-11-21 06:57:19 UTC)


--> Which artifacts were inputs?
id                1
uid               3ur15azB5SSYkYcc0000
version           None
is_latest         True
description       anndata with obs
key               None
suffix            .h5ad
type              dataset
size              46992
hash              IJORtcQUSS11QBqD-nTD0A
n_objects         None
n_observations    40
_hash_type        md5
_accessor         AnnData
visibility        1
_key_is_virtual   True
storage_id        1
transform_id      1
run_id            1
created_at        2024-11-21 06:57:27.156126+00:00
created_by_id     1
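
The answers above all come from following links: an artifact points to the run that created it, and the run points to its transform and to its input artifacts. The sketch below models those links with small dataclasses and walks them recursively; the class and field names are simplified stand-ins chosen for illustration, not LaminDB's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    transform: str
    inputs: list["Artifact"] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Run


def lineage(artifact, depth=0):
    """Return indented lines tracing an artifact back through its inputs."""
    lines = [f"{'  ' * depth}{artifact.description} <- {artifact.run.transform}"]
    for parent in artifact.run.inputs:
        lines.extend(lineage(parent, depth + 1))
    return lines


# Mirror the two steps of this guide with toy objects.
source = Artifact("anndata with obs", Run("register_example_file.py"))
subset = Artifact("anndata with obs subset", Run("Analysis flow", inputs=[source]))

print("\n".join(lineage(subset)))
```

Because each artifact stores only a reference to its run, the full history can be recovered by recursion without duplicating provenance on every record.
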
!rm -r ./analysis-flow
!lamin delete --force analysis-flow
• deleting instance testuser1/analysis-flow