Analyze a collection in memory

Here, we’ll analyze the growing collection by loading it into memory, which is only possible if the collection isn’t too large. If your data is large, you’ll likely want to iterate over the collection to train a model instead, which is the topic of the next page (scrna5/6).

import lamindb as ln
import bionty as bt

ln.track("mfWKm8OtAzp80000")
 connected lamindb: testuser1/test-scrna
 created Transform('mfWKm8OtAzp80000'), started new Run('JKWTqjn7...') at 2025-01-20 07:36:23 UTC
 notebook imports: bionty==1.0.0 lamindb==1.0.2 scanpy==1.10.4
ln.Collection.df()
| id | uid | key | description | hash | reference | reference_type | space_id | meta_artifact_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | CefCs4oPJxCIkP7o0001 | My versioned scRNA-seq collection | None | luH-jPb6eJLsXvc1TWGpUg | None | None | 1 | None | 2 | True | 2 | 2025-01-20 07:36:13.724000+00:00 | 1 | None | 1 |
| 1 | CefCs4oPJxCIkP7o0000 | My versioned scRNA-seq collection | None | DuyXxlMxwF92YehyBLbhKg | None | None | 1 | None | None | False | 1 | 2025-01-20 07:35:57.595000+00:00 | 1 | None | 1 |
collection = ln.Collection.get(name="My versioned scRNA-seq collection", version="2")
collection.artifacts.df()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | heySREilI158XLST0000 | None | Human immune cells from Conde22 | .h5ad | dataset | AnnData | 57612943 | t_YJQpYrAyAGhs7Ir68zKj | None | 1648 | sha1-fl | True | False | 1 | 1 | None | None | True | 1 | 2025-01-20 07:35:53.126000+00:00 | 1 | None | 1 |
| 3 | Xl9wc1xjhWAhOdVJ0001 | None | 10x reference adata, trusted cell type annotation | .h5ad | dataset | AnnData | 857336 | GK721a-L-fGDI8kXefKMtA | None | 70 | md5 | True | False | 1 | 1 | None | None | True | 2 | 2025-01-20 07:36:13.519000+00:00 | 1 | None | 1 |
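
Before loading, it can help to check how large the collection is on disk. The following is a minimal sketch that sums the `size` values (in bytes) listed for the two artifacts. The helper `fits_in_memory` and the 1 GiB budget are hypothetical choices for illustration, and the on-disk size of an `.h5ad` file can understate the memory needed once the data is loaded and concatenated.

```python
# Sizes in bytes, copied from the `size` column of the artifact listing.
artifact_sizes = {
    "heySREilI158XLST0000": 57_612_943,
    "Xl9wc1xjhWAhOdVJ0001": 857_336,
}

# Hypothetical budget; choose a value that matches your machine.
MEMORY_BUDGET_BYTES = 1 * 1024**3  # 1 GiB


def fits_in_memory(sizes: dict[str, int], budget: int) -> bool:
    """Return True if the summed on-disk size does not exceed the budget."""
    return sum(sizes.values()) <= budget


total_bytes = sum(artifact_sizes.values())
print(f"total on-disk size: {total_bytes / 1e6:.1f} MB")
print("within budget:", fits_in_memory(artifact_sizes, MEMORY_BUDGET_BYTES))
```

With a connected instance, the same total could presumably be computed from the `size` column of `collection.artifacts.df()` rather than from hard-coded values.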

If the collection isn’t too large, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading. How long this takes depends on factors such as the number and size of the artifacts.

If you load the collection often, consider storing a concatenated version of the collection rather than the individual pieces.

adata = collection.load()

By default, concatenation uses an outer join, as in pandas:

adata
AnnData object with n_obs × n_vars = 1718 × 36508
    obs: 'donor', 'tissue', 'cell_type', 'assay', 'cell_type_untrusted', 'n_genes', 'percent_mito', 'louvain', 'cell_type_untrusted_original', 'artifact_uid'
    obsm: 'X_umap', 'X_pca'
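
To make the join behavior concrete, here is a small sketch in plain pandas. An outer join keeps the union of columns (here standing in for genes), which is why the concatenated object can have more variables than either input; an inner join keeps only the shared ones. The toy tables, gene names, and values are made up, and how missing entries are filled in a concatenated AnnData may differ from the `NaN` fill shown here.

```python
import pandas as pd

# Two toy tables standing in for datasets measured on different gene sets.
# Rows are cells, columns are genes; all names and values are invented.
a = pd.DataFrame(
    {"geneA": [1.0, 2.0], "geneB": [3.0, 4.0]},
    index=["cell1", "cell2"],
)
b = pd.DataFrame({"geneB": [5.0], "geneC": [6.0]}, index=["cell3"])

# Outer join (the pandas default): keep the union of columns, fill gaps with NaN.
outer = pd.concat([a, b], join="outer")

# Inner join: keep only the columns present in both tables.
inner = pd.concat([a, b], join="inner")

print(outer)
print(inner)
```

In this toy case the outer join yields three columns (`geneA`, `geneB`, `geneC`), while the inner join keeps only `geneB`.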

The AnnData object references the individual artifacts in its .obs annotations:

adata.obs.artifact_uid.cat.categories
Index(['heySREilI158XLST0000', 'Xl9wc1xjhWAhOdVJ0001'], dtype='object')
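
Because every cell carries the uid of its source artifact, cells can be counted or selected per artifact. The sketch below uses a toy pandas DataFrame as a stand-in for `adata.obs`; the four cells and their cell types are invented for illustration, and only the two uids are taken from the output shown here.

```python
import pandas as pd

# Toy stand-in for `adata.obs`: one row per cell, tagged with its source artifact.
obs = pd.DataFrame(
    {
        "cell_type": ["T cell", "B cell", "T cell", "monocyte"],
        "artifact_uid": pd.Categorical(
            [
                "heySREilI158XLST0000",
                "heySREilI158XLST0000",
                "heySREilI158XLST0000",
                "Xl9wc1xjhWAhOdVJ0001",
            ]
        ),
    },
    index=["cell1", "cell2", "cell3", "cell4"],
)

# Count the cells contributed by each artifact.
cells_per_artifact = obs["artifact_uid"].value_counts()

# Boolean mask selecting the cells that came from one artifact.
mask = obs["artifact_uid"] == "Xl9wc1xjhWAhOdVJ0001"
subset = obs[mask]

print(cells_per_artifact)
print(subset)
```

On the real object, a mask built the same way from `adata.obs.artifact_uid` can be used to index the AnnData itself, for example `adata[adata.obs.artifact_uid == "Xl9wc1xjhWAhOdVJ0001"]`.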

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = bt.Gene.lookup(field="symbol")
genes.itm2b.ensembl_gene_id
'ENSG00000136156'
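
Conceptually, a lookup object exposes records as attributes named after a chosen field, which enables tab completion. The following is a rough stand-in written with the standard library only; it is not how bionty implements `lookup`, and `build_lookup` and `genes_sketch` are hypothetical names. The single record reuses the ITM2B Ensembl ID shown in the output here.

```python
from types import SimpleNamespace

# Illustrative records; the ITM2B Ensembl ID is taken from the notebook output.
records = [
    {"symbol": "ITM2B", "ensembl_gene_id": "ENSG00000136156"},
]


def build_lookup(records: list[dict], field: str) -> SimpleNamespace:
    """Expose each record as an attribute named after its `field` value."""
    lookup = SimpleNamespace()
    for record in records:
        # Replace non-alphanumeric characters with underscores and lowercase.
        attr = "".join(c if c.isalnum() else "_" for c in record[field]).lower()
        setattr(lookup, attr, SimpleNamespace(**record))
    return lookup


genes_sketch = build_lookup(records, field="symbol")
print(genes_sketch.itm2b.ensembl_gene_id)
```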

Let’s run a PCA and color the cells by ITM2B expression:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)
sc.pl.pca(
    adata,
    color=genes.itm2b.ensembl_gene_id,
    title=(
        f"{genes.itm2b.symbol} / {genes.itm2b.ensembl_gene_id} /"
        f" {genes.itm2b.description}"
    ),
    save="_itm2b",
)
WARNING: saving figure to file figures/pca_itm2b.pdf
[Figure: PCA plot of the cells, colored by ITM2B (ENSG00000136156) expression]

We can save the plot as a PDF artifact and then see it in the lineage graph:

artifact = ln.Artifact("./figures/pca_itm2b.pdf", description="My result on ITM2B")
artifact.save()
artifact.view_lineage()
[Figure: lineage graph of the saved PDF artifact]

But because the image is already part of the notebook, we can also rely on the report that is created when we save the notebook:

ln.finish()