Analysis flow
This guide tracks typical data transformations, such as subsetting, that occur during analysis.
For a more general introduction, read Project flow first.
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./analysis-usecase --schema bionty
→ connected lamindb: testuser1/analysis-usecase
import lamindb as ln
import bionty as bt
from lamin_utils import logger
→ connected lamindb: testuser1/analysis-usecase
Register an initial dataset
First, register an initial artifact by running the pipeline script register_example_file.py.
!python analysis-flow-scripts/register_example_file.py
→ connected lamindb: testuser1/analysis-usecase
→ created Transform('K4wsS5DT'), started new Run('PIBd2SVj') at 2024-10-11 09:36:28 UTC
✓ added 4 records with Feature.name for columns: 'cell_type', 'cell_type_id', 'tissue', 'disease'
• saving labels for 'cell_type'
✓ added 3 records from public with CellType.name for cell_type: 'T cell', 'hematopoietic stem cell', 'hepatocyte'
! 1 non-validated values are not saved in CellType.name: ['my new cell type']!
→ to lookup values, use lookup().cell_type
→ to save, run .add_new_from('cell_type')
• saving labels for 'cell_type_id'
• saving labels for 'tissue'
• saving labels for 'disease'
✓ added 1 record with CellType.name for cell_type: 'my new cell type'
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
✓ cell_type_id is validated against CellType.ontology_id
✓ tissue is validated against Tissue.name
✓ disease is validated against Disease.name
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/63KLTuN2O0BwkXMQ0000.h5ad')
✓ storing artifact '63KLTuN2O0BwkXMQ0000' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/63KLTuN2O0BwkXMQ0000.h5ad'
• parsing feature names of X stored in slot 'var'
✓ 99 unique terms (100.00%) are validated for ensembl_gene_id
✓ linked: FeatureSet(uid='jxU3aGYtd64YKcwe4PwK', n=99, dtype='float', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7HA', created_by_id=1, run_id=1)
• parsing feature names of slot 'obs'
✓ 4 unique terms (100.00%) are validated for name
✓ linked: FeatureSet(uid='g4QSHc2bFlqKIAQV9rSG', n=4, registry='Feature', hash='L5lCL-O_lGJwzlnxvYy4Ag', created_by_id=1, run_id=1)
✓ saved 2 feature sets for slots: 'var','obs'
→ finished Run('PIBd2SVj') after 0:00:08.624079 at 2024-10-11 09:36:37 UTC
Pull the registered dataset, apply a transformation, and register the result
Track the current notebook:
ln.track("eNef4Arw8nNM0000")
→ notebook imports: bionty==0.51.2 lamin_utils==0.13.6 lamindb==0.76.12
→ created Transform('eNef4Arw'), started new Run('XeZaLG4u') at 2024-10-11 09:36:38 UTC
artifact = ln.Artifact.get(description="anndata with obs")
artifact.describe()
Artifact(uid='63KLTuN2O0BwkXMQ0000', is_latest=True, description='anndata with obs', suffix='.h5ad', type='dataset', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', n_observations=40, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-10-11 09:36:37 UTC)
Provenance
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase'
.transform = 'register_example_file.py'
.run = 2024-10-11 09:36:28 UTC
.created_by = 'testuser1'
Labels
.tissues = 'kidney', 'liver', 'heart', 'brain'
.cell_types = 'my new cell type', 'T cell', 'hematopoietic stem cell', 'hepatocyte'
.diseases = 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
Features
'cell_type' = 'my new cell type'
'cell_type_id' = 'T cell', 'hematopoietic stem cell', 'hepatocyte'
'disease' = 'chronic kidney disease', 'liver lymphoma', 'cardiac ventricle disorder', 'Alzheimer disease'
'tissue' = 'kidney', 'liver', 'heart', 'brain'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Get a backed AnnData object
adata = artifact.open()
adata
AnnDataAccessor object with n_obs × n_vars = 40 × 100
constructed for the AnnData object 63KLTuN2O0BwkXMQ0000.h5ad
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
Subset dataset to specific cell types and diseases
cell_types = artifact.cell_types.all().lookup(return_field="name")
diseases = artifact.diseases.all().lookup(return_field="name")
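The lookup objects expose each record under an attribute-friendly key, for example t_cell for 'T cell'. As a rough illustration only, the sketch below builds a similar namespace from a list of names. The helper make_lookup and its normalization rule are assumptions made for this example and may differ from what lookup() does internally.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Map each name to an attribute-style key (hypothetical helper)."""

    def to_key(name):
        # Lowercase, then collapse runs of non-word characters into "_".
        return re.sub(r"\W+", "_", name.strip().lower())

    return SimpleNamespace(**{to_key(name): name for name in names})


lookup = make_lookup(["T cell", "hematopoietic stem cell", "hepatocyte"])
print(lookup.t_cell)
print(lookup.hematopoietic_stem_cell)
```

Attribute access like this lets an editor autocomplete the available values and avoids typos in free-text labels.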
Create the subset:
subset_obs = adata.obs.cell_type.isin(
    [cell_types.t_cell, cell_types.hematopoietic_stem_cell]
) & (adata.obs.disease.isin([diseases.liver_lymphoma, diseases.chronic_kidney_disease]))
adata_subset = adata[subset_obs]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 20 × 100
obs: ['_index', 'cell_type', 'cell_type_id', 'disease', 'tissue']
var: ['_index']
adata_subset.obs[["cell_type", "disease"]].value_counts()
cell_type                disease
T cell                   chronic kidney disease    10
hematopoietic stem cell  liver lymphoma            10
Name: count, dtype: int64
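The selection above combines two boolean masks with the & operator. A minimal sketch of the same pattern follows, using a toy pandas DataFrame as a stand-in for adata.obs. The rows are made up for illustration and are not the tutorial's data.

```python
import pandas as pd

# Toy stand-in for adata.obs (illustrative values only).
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "T cell", "hepatocyte", "hematopoietic stem cell"],
        "disease": [
            "chronic kidney disease",
            "Alzheimer disease",
            "liver lymphoma",
            "liver lymphoma",
        ],
    }
)

keep_cell_types = ["T cell", "hematopoietic stem cell"]
keep_diseases = ["liver lymphoma", "chronic kidney disease"]

# Each isin() call yields a boolean Series; & combines them element-wise,
# so a row is kept only if both conditions hold.
mask = obs["cell_type"].isin(keep_cell_types) & obs["disease"].isin(keep_diseases)
subset = obs[mask]

# Count the remaining (cell_type, disease) combinations.
counts = subset[["cell_type", "disease"]].value_counts()
print(counts)
```

Counting the combinations after subsetting is a quick sanity check that only the intended groups remain.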
Register the subsetted AnnData:
curate = ln.Curator.from_anndata(
    adata_subset.to_memory(),
    var_index=bt.Gene.ensembl_gene_id,
    categoricals={
        "cell_type": bt.CellType.name,
        "disease": bt.Disease.name,
        "tissue": bt.Tissue.name,
    },
    organism="human",
)
curate.validate()
✓ var_index is validated against Gene.ensembl_gene_id
✓ cell_type is validated against CellType.name
✓ disease is validated against Disease.name
✓ tissue is validated against Tissue.name
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1756: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
True
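Conceptually, validation checks that every value in a categorical column belongs to a registry of known terms and reports the values that do not. The toy sketch below shows that idea with plain Python sets standing in for registries. The function find_non_validated and the example data are hypothetical, and this is not how LaminDB implements validation.

```python
from __future__ import annotations


def find_non_validated(
    columns: dict[str, list[str]], registries: dict[str, set[str]]
) -> dict[str, list[str]]:
    """Return the unknown values per column; an empty dict means all passed."""
    problems = {}
    for name, values in columns.items():
        known = registries.get(name, set())
        # Compare the column's unique values against the known terms.
        missing = sorted(set(values) - known)
        if missing:
            problems[name] = missing
    return problems


# Hypothetical registries and column values.
registries = {
    "cell_type": {"T cell", "hematopoietic stem cell", "hepatocyte"},
    "tissue": {"kidney", "liver", "heart", "brain"},
}
columns = {
    "cell_type": ["T cell", "my new cell type", "T cell"],
    "tissue": ["kidney", "liver"],
}

problems = find_non_validated(columns, registries)
is_valid = not problems
print(problems, is_valid)
```

In this toy setup, an unknown term such as 'my new cell type' has to be added to the registry before the column passes, similar in spirit to the registration log earlier on this page.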
artifact = curate.save_artifact(description="anndata with obs subset")
• path content will be copied to default storage upon `save()` with key `None` ('.lamindb/F7ZHPRtoVWYCUU110000.h5ad')
✓ storing artifact 'F7ZHPRtoVWYCUU110000' at '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase/.lamindb/F7ZHPRtoVWYCUU110000.h5ad'
• parsing feature names of X stored in slot 'var'
✓ 99 unique terms (100.00%) are validated for ensembl_gene_id
✓ linked: FeatureSet(uid='jxU3aGYtd64YKcwe4PwK', n=99, dtype='float', registry='bionty.Gene', hash='-frOq7J0bik-J7Ad9DX7HA', created_by_id=1, run_id=1, created_at=2024-10-11 09:36:37 UTC)
• parsing feature names of slot 'obs'
✓ 4 unique terms (100.00%) are validated for name
✓ linked: FeatureSet(uid='g4QSHc2bFlqKIAQV9rSG', n=4, registry='Feature', hash='L5lCL-O_lGJwzlnxvYy4Ag', created_by_id=1, run_id=1, created_at=2024-10-11 09:36:37 UTC)
artifact.describe()
Artifact(uid='F7ZHPRtoVWYCUU110000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, created_at=2024-10-11 09:36:39 UTC)
Provenance
.storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/analysis-usecase'
.transform = 'Analysis flow'
.run = 2024-10-11 09:36:38 UTC
.created_by = 'testuser1'
Labels
.tissues = 'kidney', 'liver'
.cell_types = 'T cell', 'hematopoietic stem cell'
.diseases = 'chronic kidney disease', 'liver lymphoma'
Features
'cell_type' = 'T cell', 'hematopoietic stem cell'
'disease' = 'chronic kidney disease', 'liver lymphoma'
'tissue' = 'kidney', 'liver'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Examine data flow
Query a subsetted .h5ad artifact containing “hematopoietic stem cell” and “T cell”:
cell_types = bt.CellType.lookup()
my_subset = ln.Artifact.filter(
    suffix=".h5ad",
    description__endswith="subset",
    cell_types__in=[
        cell_types.hematopoietic_stem_cell,
        cell_types.t_cell,
    ],
).first()
my_subset
Artifact(uid='F7ZHPRtoVWYCUU110000', is_latest=True, description='anndata with obs subset', suffix='.h5ad', type='dataset', size=38992, hash='RgGUx7ndRplZZSmalTAWiw', n_observations=20, _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-10-11 09:36:39 UTC)
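The keyword arguments in the query above follow the Django-style field__lookup convention: a bare field name means equality, while a suffix such as __endswith or __in selects a different comparison. The sketch below mimics this convention for plain dictionaries to show how the three conditions combine. The matches function and the records are hypothetical and cover only the lookups used here.

```python
def matches(record, **conditions):
    """Return True if the record satisfies every field__lookup condition."""
    for key, expected in conditions.items():
        # "description__endswith" -> field "description", lookup "endswith".
        field, _, op = key.partition("__")
        value = record[field]
        if op == "":
            ok = value == expected
        elif op == "endswith":
            ok = value.endswith(expected)
        elif op == "in":
            # For a list-valued field, any overlap counts as a match.
            values = value if isinstance(value, list) else [value]
            ok = any(v in expected for v in values)
        else:
            raise ValueError(f"unsupported lookup: {op}")
        if not ok:
            return False
    return True


# Hypothetical records standing in for the two registered artifacts.
records = [
    {
        "description": "anndata with obs",
        "suffix": ".h5ad",
        "cell_types": ["T cell", "hepatocyte", "hematopoietic stem cell"],
    },
    {
        "description": "anndata with obs subset",
        "suffix": ".h5ad",
        "cell_types": ["T cell", "hematopoietic stem cell"],
    },
]

hits = [
    r
    for r in records
    if matches(
        r,
        suffix=".h5ad",
        description__endswith="subset",
        cell_types__in=["hematopoietic stem cell", "T cell"],
    )
]
print([r["description"] for r in hits])
```

All conditions are combined with a logical AND, which is why only the record whose description ends in "subset" is returned even though both records contain T cells.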
Common questions that might arise are:
What is the history of this artifact?
Which features and labels are associated with it?
Which notebook analyzed and registered this artifact?
By whom?
And which artifact is its parent?
Let’s answer these questions using LaminDB:
print("--> What is the history of this artifact?\n")
artifact.view_lineage()
print("\n\n--> Which features and labels are associated with it?\n")
logger.print(artifact.features)
logger.print(artifact.labels)
print("\n\n--> Which notebook analyzed and registered this artifact\n")
logger.print(artifact.transform)
print("\n\n--> By whom\n")
logger.print(artifact.created_by)
print("\n\n--> And which artifact is its parent\n")
display(artifact.run.input_artifacts.df())
--> What is the history of this artifact?
--> Which features and labels are associated with it?
Features
'cell_type' = 'T cell', 'hematopoietic stem cell'
'disease' = 'chronic kidney disease', 'liver lymphoma'
'tissue' = 'kidney', 'liver'
Feature sets
'var' = 'TSPAN6', 'TNMD', 'DPM1', 'SCYL3', 'FIRRM', 'FGR', 'CFH', 'FUCA2', 'GCLC', 'NFYA', 'STPG1', 'NIPAL3', 'LAS1L', 'ENPP4', 'SEMA3F', 'CFTR', 'ANKIB1', 'CYP51A1', 'KRIT1', 'RAD52'
'obs' = 'cell_type', 'cell_type_id', 'tissue', 'disease'
Labels
.tissues = 'kidney', 'liver'
.cell_types = 'T cell', 'hematopoietic stem cell'
.diseases = 'chronic kidney disease', 'liver lymphoma'
--> Which notebook analyzed and registered this artifact
Transform(uid='eNef4Arw8nNM0000', is_latest=True, name='Analysis flow', key='analysis-flow.ipynb', type='notebook', created_by_id=1, created_at=2024-10-11 09:36:38 UTC)
--> By whom
User(uid='DzTjkKse', handle='testuser1', name='Test User1', created_at=2024-10-11 09:36:25 UTC)
--> And which artifact is its parent
id | uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 63KLTuN2O0BwkXMQ0000 | None | True | anndata with obs | None | .h5ad | dataset | 46992 | IJORtcQUSS11QBqD-nTD0A | None | 40 | md5 | AnnData | 1 | True | 1 | 1 | 1 | 2024-10-11 09:36:37.409335+00:00 | 1 |
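The parent lookup above walks from an artifact to the run that created it and then to that run's input artifacts. The sketch below models this relationship with plain dataclasses. The class and function names are hypothetical stand-ins, not LaminDB's models, and serve only to show how lineage can be traversed.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Run:
    transform: str
    input_artifacts: list[Artifact] = field(default_factory=list)


@dataclass
class Artifact:
    description: str
    run: Optional[Run] = None  # the run that produced this artifact


def ancestors(artifact: Artifact) -> list[str]:
    """Collect descriptions of all upstream artifacts, nearest first."""
    found = []
    if artifact.run is None:
        return found
    for parent in artifact.run.input_artifacts:
        found.append(parent.description)
        found.extend(ancestors(parent))
    return found


# Mirror the two steps of this guide: a script registers the raw dataset,
# then a notebook run reads it and registers the subset.
raw = Artifact("anndata with obs", run=Run("register_example_file.py"))
subset = Artifact(
    "anndata with obs subset",
    run=Run("Analysis flow", input_artifacts=[raw]),
)

print(ancestors(subset))
```

Storing inputs on the run rather than on the artifact means one run can link several inputs to several outputs without duplicating the relationship.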
!rm -r ./analysis-usecase
!lamin delete --force analysis-usecase
• deleting instance testuser1/analysis-usecase