scRNA-seq

Here, you’ll learn how to manage a growing number of scRNA-seq datasets as a single queryable collection:

create a dataset (an Artifact) and seed a Collection
append a new dataset to the collection
query & analyze individual datasets
load the collection into memory
iterate over the collection to train an ML model
concatenate the collection to a single tiledbsoma array store

If you’re only interested in using a large curated scRNA-seq collection, see the CELLxGENE guide.

# pip install 'lamindb[jupyter,bionty]'
!lamin init --storage ./test-scrna --modules bionty

import lamindb as ln
import bionty as bt

ln.track("Nv48yAceNSh8")

Populate metadata registries based on an artifact

Let us look at the standardized data of Conde et al., Science (2022), available from CELLxGENE. anndata_human_immune_cells() loads a subsampled version:

adata = ln.core.datasets.anndata_human_immune_cells()
adata

To validate & annotate a dataset, we need to define valid features and a schema.

# define valid features
ln.Feature(name="donor", dtype=str).save()
ln.Feature(name="tissue", dtype=bt.Tissue).save()
ln.Feature(name="cell_type", dtype=bt.CellType).save()
ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()

# define a schema (or get via ln.examples.anndata.anndata_ensembl_gene_ids_and_valid_features_in_obs())
obs_schema = ln.Schema(
    itype=ln.Feature
).save()  # validate obs columns against the Feature registry
varT_schema = ln.Schema(
    itype=bt.Gene.ensembl_gene_id
).save()  # validate var.T columns against the Gene registry
schema = ln.Schema(
    name="anndata_ensembl_gene_ids_and_valid_features_in_obs",
    otype="AnnData",
    slots={"obs": obs_schema, "var.T": varT_schema},
).save()

Let’s attempt saving this dataset as a validated & annotated artifact.

try:
    artifact = ln.Artifact.from_anndata(adata, schema=schema).save()
except ln.errors.ValidationError as error:
    print(error)

One cell type isn’t validated because it’s not part of the CellType registry. Let’s create it.

bt.CellType(name="animal cell").save()
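To make the validation step concrete, here is a minimal, library-free sketch of checking categorical values against a registry. It is not how lamindb implements validation: the `validate_values` helper and the toy registry contents are hypothetical. It only illustrates why a label missing from the registry is reported, and why registering it lets validation pass.

```python
def validate_values(values, registry):
    """Return the sorted unique values that are missing from the registry."""
    return sorted(set(values) - set(registry))


# Toy stand-in for a CellType registry (hypothetical contents).
cell_type_registry = {"T cell", "B cell"}
observed = ["T cell", "animal cell", "B cell", "animal cell"]

# Labels that would fail validation against the toy registry.
missing = validate_values(observed, cell_type_registry)
print(missing)

# Registering the missing label mirrors creating the CellType record above.
cell_type_registry.update(missing)
print(validate_values(observed, cell_type_registry))
```

After the missing label is added to the toy registry, the second call reports nothing left to validate.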

We can now save the dataset.

# runs ~10sec because it imports 40k Ensembl gene IDs from a public ontology
artifact = ln.Artifact.from_anndata(
    adata, key="datasets/conde22.h5ad", schema=schema
).save()

Some Ensembl gene IDs don’t validate because they stem from an older version of Ensembl. To be 100% sure that all gene identifiers are valid Ensembl IDs, you can import the genes from the older Ensembl version into the Gene registry (see guide). You can also enforce this through the .var.T schema by setting schema.maximal_set=True, which prohibits any non-validated features in the dataframe.

artifact.describe()
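The difference between tolerating and prohibiting non-validated features can be sketched without lamindb. The `validate_columns` function below is a hypothetical illustration, not the library's implementation. Its `maximal_set` argument only mimics the behavior described above, and `ENSG_OUTDATED_ID` is a made-up placeholder for an identifier from an older Ensembl release.

```python
def validate_columns(columns, valid_features, maximal_set=False):
    """Split columns into validated and non-validated feature names.

    With maximal_set=True, any non-validated column raises an error
    instead of being tolerated.
    """
    validated = [c for c in columns if c in valid_features]
    non_validated = [c for c in columns if c not in valid_features]
    if maximal_set and non_validated:
        raise ValueError(f"columns not in registry: {non_validated}")
    return validated, non_validated


# Toy registry of gene IDs; the second column is a made-up outdated ID.
registry = {"ENSG00000000003", "ENSG00000000005"}
columns = ["ENSG00000000003", "ENSG_OUTDATED_ID"]

# Lenient mode: the unknown column is reported but tolerated.
print(validate_columns(columns, registry))

# Strict mode: the unknown column is rejected.
try:
    validate_columns(columns, registry, maximal_set=True)
except ValueError as error:
    print(error)
```

In lenient mode the dataset can still be saved with a partially validated feature set. In strict mode the same input is rejected until the registry covers every column.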

Seed a collection

Let’s create a first version of a collection that will encompass many h5ad files when more data is ingested.

Note

To see the result of the incremental growth, take a look at the CELLxGENE Census guide for an instance with ~1k h5ads and ~50 million cells.

collection = ln.Collection(artifact, key="scrna/collection1").save()

In this first version, the collection contains a single artifact, so collection and artifact match each other. But they’re independently tracked and queryable through their registries:

collection.describe()
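As a rough mental model of "independently tracked", the toy sketch below keeps artifacts and collections in two separate dictionaries that stand in for registries. `ToyArtifact`, `ToyCollection` and both dictionaries are hypothetical and say nothing about lamindb's internals. The keys and the observation count are taken from this guide.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyArtifact:
    key: str
    n_observations: int


@dataclass
class ToyCollection:
    key: str
    artifacts: list = field(default_factory=list)

    @property
    def n_observations(self):
        # A collection's size is the sum over the artifacts it references.
        return sum(a.n_observations for a in self.artifacts)


# Two separate toy registries, each queryable by key.
artifact_registry = {}
collection_registry = {}

artifact = ToyArtifact(key="datasets/conde22.h5ad", n_observations=1648)
artifact_registry[artifact.key] = artifact

collection = ToyCollection(key="scrna/collection1", artifacts=[artifact])
collection_registry[collection.key] = collection

# With a single artifact, collection and artifact describe the same data.
print(collection_registry["scrna/collection1"].n_observations)
print(artifact_registry["datasets/conde22.h5ad"].n_observations)
```

Because the collection only references the artifact, each can still be looked up on its own, and further artifacts could be added to the collection without touching this one.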

Access the underlying artifacts like so:

collection.artifacts.df()


	uid	key	description	suffix	kind	otype	size	hash	n_files	n_observations	_hash_type	_key_is_virtual	_overwrite_versions	space_id	storage_id	schema_id	version	is_latest	run_id	created_at	created_by_id	_aux	branch_id
id
1	vu2o1KhF4zmuvVZ80000	datasets/conde22.h5ad	None	.h5ad	dataset	AnnData	57612943	t_YJQpYrAyAGhs7Ir68zKj	None	1648	sha1-fl	True	False	1	1	3	None	True	1	2025-07-29 19:23:50.393000+00:00	1	{'af': {'0': True}}	1

See data lineage:

collection.view_lineage()

[Figure: data lineage graph of the collection]

Finish the run and save the notebook.

ln.finish()

→ finished Run('yDpa1QlC') after 28s at 2025-07-29 19:23:52 UTC