Analysis flow¶
Here, we’ll track typical data transformations, such as subsetting, that occur during analysis.
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-flow --schema bionty
→ connected lamindb: testuser1/analysis-flow
import lamindb as ln
import bionty as bt
→ connected lamindb: testuser1/analysis-flow
Save an initial dataset¶
import lamindb as ln
import bionty as bt
ln.track("K4wsS5DTYdFp0000")
# an example dataset that has a few cell type, tissue and disease annotations
adata = ln.core.datasets.anndata_with_obs()
# validate and register features
curate = ln.Curator.from_anndata(
    adata,
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "cell_type_id": bt.CellType.ontology_id,
        "tissue": bt.Tissue.name,
        "disease": bt.Disease.name,
    },
    organism="human",
)
curate.add_new_from("cell_type")
curate.validate()
curate.save_artifact(description="anndata with obs")
ln.finish()
!python analysis-flow-scripts/register_example_file.py
→ connected lamindb: testuser1/analysis-flow
→ created Transform('K4wsS5DT'), started new Run('6OGILuiH') at 2024-11-21 06:57:23 UTC
✓ added 4 records with Feature.name for columns: 'cell_type', 'cell_type_id', 'tissue', 'disease'
• saving validated records of 'cell_type'
✓ added 3 records from public with CellType.name for cell_type: 'hematopoietic stem cell', 'hepatocyte', 'T cell'
✓ added 1 record with CellType.name for cell_type: 'my new cell type'
• saving validated records of 'var_index'
• saving validated records of 'tissue'
• saving validated records of 'disease'
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'cell_type' is validated against CellType.name
✓ 'cell_type_id' is validated against CellType.ontology_id
✓ 'tissue' is validated against Tissue.name
✓ 'disease' is validated against Disease.name
→ finished Run('6OGILuiH') after 0d 0h 0m 4s at 2024-11-21 06:57:27 UTC
Open a dataset, subset it, and register the result¶
Track the current notebook:
ln.track("eNef4Arw8nNM0000")
→ created Transform('eNef4Arw'), started new Run('6ZPUWM1L') at 2024-11-21 06:57:28 UTC
→ notebook imports: bionty==0.53.1 lamindb==0.76.16
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact(uid='3ur15azB5SSYkYcc0000', is_latest=True, description='anndata with obs', suffix='.h5ad', type='dataset', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', n_observations=40, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:27 UTC)
Provenance
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow'
.transform = 'register_example_file.py'
.run = 2024-11-21 06:57:23 UTC
.created_by = 'testuser1'
Labels
.tissues = 'heart', 'liver', 'kidney', 'brain'
.cell_types = 'hematopoietic stem cell', 'hepatocyte', 'T cell', 'my new cell type'
.diseases = 'cardiac ventricle disorder', 'liver lymphoma', 'chronic kidney disease', 'Alzheimer disease'
Features
'cell_type' = 'T cell', 'hematopoietic stem cell', 'hepatocyte', 'my new cell type'
'cell_type_id' = 'hematopoietic stem cell', 'T cell', 'hepatocyte'
'disease' = 'Alzheimer disease', 'cardiac ventricle disorder', 'chronic kidney disease', 'liver lymphoma'
'tissue' = 'brain', 'heart', 'kidney', 'liver'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Get a backed AnnData object¶
adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object 3ur15azB5SSYkYcc0000.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
Subset dataset to specific cell types and diseases¶
cell_types = artifact.cell_types.all().distinct().lookup(return_field="name")
diseases = artifact.diseases.all().distinct().lookup(return_field="name")
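The lookup objects above expose each label under an attribute-friendly key (for example `t_cell` for "T cell") and, with `return_field="name"`, return the name string. The following is a minimal sketch of that idea in plain Python. `make_lookup` is a hypothetical helper written for illustration, and its key-normalization rule is an assumption, not LaminDB's actual implementation.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Build an attribute-style lookup from display names (hypothetical helper)."""
    entries = {}
    for name in names:
        # Lowercase, then collapse every run of non-alphanumeric characters into "_"
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        entries[key] = name
    return SimpleNamespace(**entries)


cell_types = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(cell_types.t_cell)
print(cell_types.hematopoietic_stem_cell)
```

Attribute access of this kind allows tab completion in a notebook and avoids typos in free-text label names.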
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type disease
T cell chronic kidney disease 10
hematopoietic stem cell liver lymphoma 10
Name: count, dtype: int64
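The subset keeps an observation only when its cell type and its disease are both in the selected sets, which is why only some combinations appear in the counts. A dependency-free sketch of that logic follows. The four toy rows are made up for illustration and are not the example dataset.

```python
from collections import Counter

# Toy observations (hypothetical values, not the example dataset itself)
obs = [
    {"cell_type": "T cell", "disease": "chronic kidney disease"},
    {"cell_type": "T cell", "disease": "Alzheimer disease"},
    {"cell_type": "hematopoietic stem cell", "disease": "liver lymphoma"},
    {"cell_type": "hepatocyte", "disease": "liver lymphoma"},
]

keep_cell_types = {"T cell", "hematopoietic stem cell"}
keep_diseases = {"liver lymphoma", "chronic kidney disease"}

# A row is kept only if BOTH membership tests pass (the `&` in the mask above)
mask = [
    row["cell_type"] in keep_cell_types and row["disease"] in keep_diseases
    for row in obs
]
subset = [row for row, keep in zip(obs, mask) if keep]

# Count the (cell_type, disease) combinations that survive the filter
counts = Counter((row["cell_type"], row["disease"]) for row in subset)
for combo, n in counts.items():
    print(combo, n)
```

The second and fourth rows are dropped because each passes only one of the two tests.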
Register the subsetted AnnData:
curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
✓ 'var_index' is validated against Gene.ensembl_gene_id
✓ 'cell_type' is validated against CellType.name
✓ 'disease' is validated against Disease.name
✓ 'tissue' is validated against Tissue.name
True
artifact = curate.save_artifact(description="anndata with obs subset")
artifact.describe()
Artifact(uid='yKHoRaAHydl0VWe90000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-11-21 06:57:28 UTC)
Provenance
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-flow'
.transform = 'Analysis flow'
.run = 2024-11-21 06:57:28 UTC
.created_by = 'testuser1'
Labels
.tissues = 'liver', 'kidney'
.cell_types = 'hematopoietic stem cell', 'T cell'
.diseases = 'liver lymphoma', 'chronic kidney disease'
Features
'cell_type' = 'T cell', 'hematopoietic stem cell'
'disease' = 'chronic kidney disease', 'liver lymphoma'
'tissue' = 'kidney', 'liver'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Examine data lineage¶
Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:
cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='yKHoRaAHydl0VWe90000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-11-21 06:57:28 UTC)
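The filter keywords in the query above use a double-underscore `field__operator` syntax, which appears to follow the Django field-lookup convention. As a rough illustration of how such keywords can be read, here is a toy matcher over plain dictionaries. `matches` is a hypothetical function, it supports only the three forms used above, and it is not how LaminDB or Django evaluate queries (they translate lookups into SQL).

```python
def matches(record, **lookups):
    """Return True if `record` satisfies every `field__op=value` lookup (toy version)."""
    for expr, expected in lookups.items():
        field, _, op = expr.partition("__")
        value = record[field]
        if op == "":  # plain `field=value` means equality
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For a list-valued (many-to-many-like) field, any overlap counts
            values = value if isinstance(value, list) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Hypothetical records standing in for registered artifacts
records = [
    {"suffix": ".h5ad", "description": "anndata with obs", "cell_types": ["hepatocyte", "T cell"]},
    {"suffix": ".h5ad", "description": "anndata with obs subset", "cell_types": ["T cell"]},
]

hits = [
    r
    for r in records
    if matches(r, suffix=".h5ad", description__endswith="subset", cell_types__in=["T cell"])
]
print([r["description"] for r in hits])
```

All keyword conditions are combined with AND, so the first record is excluded by `description__endswith` even though its cell types overlap.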
Common questions that might arise are:
What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?
Let’s answer these using LaminDB:
print("--> What is the lineage of this artifact?\n")
artifact.view_lineage()
print("\n\n--> Which features and labels are associated with it?\n")
print(artifact.features)
print(artifact.labels)
print("\n\n--> Which notebook analyzed and saved this artifact?\n")
print(artifact.transform)
print("\n\n--> Who saved this artifact?\n")
print(artifact.created_by)
print("\n\n--> Which artifacts were inputs?\n")
display(artifact.run.input_artifacts.df())
--> What is the lineage of this artifact?
--> Which features and labels are associated with it?
Features
'cell_type' = 'T cell', 'hematopoietic stem cell'
'disease' = 'chronic kidney disease', 'liver lymphoma'
'tissue' = 'kidney', 'liver'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Labels
.tissues = 'liver', 'kidney'
.cell_types = 'hematopoietic stem cell', 'T cell'
.diseases = 'liver lymphoma', 'chronic kidney disease'
--> Which notebook analyzed and saved this artifact?
Transform(uid='eNef4Arw8nNM0000', is_latest=True, name='Analysis flow', key='analysis-flow.ipynb', type='notebook', created_by_id=1, created_at=2024-11-21 06:57:28 UTC)
--> Who saved this artifact?
User(uid='DzTjkKse', handle='testuser1', name='Test User1', created_at=2024-11-21 06:57:19 UTC)
--> Which artifacts were inputs?
id | uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_at | created_by_id
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
1 | 3ur15azB5SSYkYcc0000 | None | True | anndata with obs | None | .h5ad | dataset | 46992 | IJORtcQUSS11QBqD-nTD0A | None | 40 | md5 | AnnData | 1 | True | 1 | 1 | 1 | 2024-11-21 06:57:27.156126+00:00 | 1
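The answers above all come from following links between records: an artifact points to the run that created it, the run points to its transform and user, and the run lists its input artifacts. A minimal sketch of that data model is shown below, using plain dataclasses. The class and field names mirror the attributes used above, but the classes and the `lineage` function are simplified stand-ins written for illustration, not LaminDB's schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Transform:
    name: str


@dataclass
class Run:
    transform: Transform
    created_by: str
    input_artifacts: list = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Run  # the run that created this artifact


def lineage(artifact):
    """Walk upstream through runs and their inputs, oldest step first."""
    steps = []
    for parent in artifact.run.input_artifacts:
        steps.extend(lineage(parent))
    steps.append(f"{artifact.run.transform.name} -> {artifact.description}")
    return steps


# The two steps of this guide, modeled by hand
register = Run(Transform("register_example_file.py"), created_by="testuser1")
full = Artifact("anndata with obs", run=register)

analysis = Run(Transform("Analysis flow"), created_by="testuser1", input_artifacts=[full])
subset = Artifact("anndata with obs subset", run=analysis)

for step in lineage(subset):
    print(step)
```

Because every artifact stores only its creating run, the full history can be recovered by recursion over `input_artifacts` without keeping a separate lineage table.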
!rm -r ./analysis-flow
!lamin delete --force analysis-flow
• deleting instance testuser1/analysis-flow