Analyze a collection in memory

Here, we’ll analyze the growing collection by loading it into memory. This is only possible if it’s not too large. If your data is large, you’ll likely want to iterate over the collection to train a model, which is the topic of the next page.

import lamindb as ln
import bionty as bt

ln.track("mfWKm8OtAzp8")

ln.Collection.to_dataframe()


	uid	key	description	hash	reference	reference_type	version	is_latest	is_locked	created_at	branch_id	space_id	created_by_id	run_id	meta_artifact_id
id
2	vhcAXUbJAaL6cU3h0001	scrna/collection1	None	luH-jPb6eJLsXvc1TWGpUg	None	None	2	True	False	2025-11-14 00:12:16.013000+00:00	1	1	1	2	None
1	vhcAXUbJAaL6cU3h0000	scrna/collection1	None	DuyXxlMxwF92YehyBLbhKg	None	None	None	False	False	2025-11-14 00:11:55.486000+00:00	1	1	1	1	None

collection = ln.Collection.get(key="scrna/collection1", version="2")

collection.artifacts.to_dataframe()


	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	version	is_latest	is_locked	created_at	branch_id	space_id	storage_id	run_id	schema_id	created_by_id
id
5	kB3t4oXRQfdGWSeX0001	None	10x reference adata, trusted cell type annotation	.h5ad	dataset	AnnData	857336	GK721a-L-fGDI8kXefKMtA	None	70	None	True	False	2025-11-14 00:12:15.245000+00:00	1	1	1	2	NaN	1
1	Vlll9t2AZFhUwcyV0000	datasets/conde22.h5ad	None	.h5ad	dataset	AnnData	57612943	t_YJQpYrAyAGhs7Ir68zKj	None	1648	None	True	False	2025-11-14 00:11:55.263000+00:00	1	1	1	1	3.0	1

If the collection isn’t too large, we can now load it into memory.

Under the hood, the AnnData objects are concatenated during loading.

How long this takes depends on factors such as the number and size of the artifacts.

If you load the collection often, consider storing a concatenated version of it rather than the individual pieces.

adata = collection.load()
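To judge whether loading is slow enough to justify storing a concatenated version, it can help to time the call. The following is a minimal sketch using only the standard library; `timed` and `load_stand_in` are hypothetical names introduced here, and `load_stand_in` merely stands in for `collection.load()`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def load_stand_in():
    # Hypothetical placeholder for `collection.load()`; it only builds a list.
    return list(range(100_000))


timings = {}
with timed("load", timings):
    data = load_stand_in()

print(f"load took {timings['load']:.4f} s")
```

In the notebook, the body of the `with` block would be `adata = collection.load()`.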

The default is an outer join during concatenation as in pandas:

adata
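To make the join semantics concrete, here is a toy sketch in plain Python that stacks observations from two datasets and aligns their variables by union (outer) or intersection (inner). It is not how AnnData or LaminDB implement concatenation; `concat_rows`, the gene names, and the values are made up for illustration, and filling gaps with NaN is an assumption of this sketch.

```python
import math

# Toy datasets: observation ID -> {variable name: value}. All values are made up.
dataset_a = {"cell1": {"GeneA": 1.0, "GeneB": 2.0}}
dataset_b = {"cell2": {"GeneB": 3.0, "GeneC": 4.0}}


def concat_rows(datasets, join="outer"):
    """Stack observations and align variables across datasets."""
    var_sets = [
        {var for row in dataset.values() for var in row} for dataset in datasets
    ]
    if join == "outer":
        variables = sorted(set.union(*var_sets))  # keep every variable
    elif join == "inner":
        variables = sorted(set.intersection(*var_sets))  # keep shared variables only
    else:
        raise ValueError(f"unknown join: {join!r}")

    combined = {}
    for dataset in datasets:
        for obs_id, row in dataset.items():
            # Variables missing from a dataset are filled with NaN in this sketch.
            combined[obs_id] = [row.get(var, math.nan) for var in variables]
    return variables, combined


outer_vars, outer_rows = concat_rows([dataset_a, dataset_b], join="outer")
inner_vars, inner_rows = concat_rows([dataset_a, dataset_b], join="inner")
print(outer_vars)
print(inner_vars)
```

With an outer join no variable is dropped, at the cost of filled-in entries for variables that a dataset never measured.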

The AnnData object references the individual artifacts in its .obs annotations:

adata.obs.artifact_uid.cat.categories
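Such a provenance column labels every observation with the uid of the artifact it came from, so it can be used to count or filter observations per source. The sketch below mimics that column with a plain list, using the uids and `n_observations` values from the artifact table above; the list itself is a stand-in, not the real `.obs` column.

```python
from collections import Counter

# uid -> n_observations, taken from the artifact table above.
n_observations = {
    "kB3t4oXRQfdGWSeX0001": 70,
    "Vlll9t2AZFhUwcyV0000": 1648,
}

# Stand-in for the artifact_uid column: one uid per concatenated observation.
artifact_uid_column = [
    uid for uid, n in n_observations.items() for _ in range(n)
]

# Count how many observations each source artifact contributed.
counts = Counter(artifact_uid_column)
categories = sorted(counts)

print(len(artifact_uid_column))
print(counts)
```

On the loaded object, `adata.obs.artifact_uid.value_counts()` should give the equivalent per-artifact counts, since the column is a pandas categorical Series.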

We can easily obtain Ensembl IDs for gene symbols using the lookup object:

genes = bt.Gene.lookup(field="symbol")

genes.itm2b.ensembl_gene_id
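The idea behind a lookup object is attribute-style access to records, keyed by a normalized version of the chosen field. The following toy sketch illustrates the idea with the standard library; it is not bionty's implementation, `build_lookup` and `normalize` are hypothetical helpers, and the Ensembl IDs are placeholders rather than real identifiers.

```python
from types import SimpleNamespace

# Toy records; the IDs are placeholders, not real Ensembl gene IDs.
records = [
    {"symbol": "ITM2B", "ensembl_gene_id": "ENSG_PLACEHOLDER_1"},
    {"symbol": "HLA-A", "ensembl_gene_id": "ENSG_PLACEHOLDER_2"},
]


def normalize(name):
    """Turn a field value into a valid, lowercase attribute name."""
    key = "".join(ch if ch.isalnum() else "_" for ch in name.lower())
    # Attribute names cannot start with a digit, so prefix an underscore.
    return f"_{key}" if key[:1].isdigit() else key


def build_lookup(records, field):
    """Map normalized values of `field` to record objects."""
    return SimpleNamespace(
        **{normalize(record[field]): SimpleNamespace(**record) for record in records}
    )


genes_toy = build_lookup(records, field="symbol")
print(genes_toy.itm2b.ensembl_gene_id)
print(genes_toy.hla_a.symbol)
```

Attribute access of this kind lends itself to tab completion in an interactive session, which makes it convenient for exploring a registry.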

Let us create a plot:

import scanpy as sc

sc.pp.pca(adata, n_comps=2)

sc.pl.pca(
    adata,
    color=genes.itm2b.ensembl_gene_id,
    title=(
        f"{genes.itm2b.symbol} / {genes.itm2b.ensembl_gene_id} /"
        f" {genes.itm2b.description}"
    ),
    save="_itm2b",
)

WARNING: saving figure to file figures/pca_itm2b.pdf

_images/b489b461bad5331975e566938fbc2319e8a59fb0b26da339ae4fdb1354dc35cd.png

We could save the plot as a PDF artifact and then view it in the lineage diagram:

artifact = ln.Artifact(
    "./figures/pca_itm2b.pdf", description="My result on ITM2B"
).save()
artifact.view_lineage()

_images/c49b45cd87d4499b065ccdd3da094faac151b617e630a292986abcde6384a8a3.svg

But because the image is part of the notebook, we can also rely on the report that is created when we save the notebook:

ln.finish()