##### scRNA-seq

Here, you'll learn how to manage a growing number of scRNA-seq
datasets as a single queryable collection:

1. create a dataset (an "Artifact") and seed a "Collection" (scrna1/6)

2. append a new dataset to the collection (scrna2/6)

3. query & analyze individual datasets (scrna3/6)

4. load the collection into memory (scrna4/6)

5. iterate over the collection to train an ML model (scrna5/6)

6. concatenate the collection to a single "tiledbsoma" array store
 (scrna6/6)

If you're only interested in *using* a large curated scRNA-seq
collection, see the CELLxGENE guide.

 # pip install lamindb
 !lamin init --storage ./test-scrna --modules bionty

 import lamindb as ln
 import bionty as bt

 ln.track()

#### Populate metadata registries based on an artifact

Let us look at the standardized data of Conde *et al.*, Science
(2022), available from CELLxGENE. "anndata_human_immune_cells()" loads
a subsampled version:

 adata = ln.core.datasets.anndata_human_immune_cells()
 adata

To validate & annotate a dataset, we need to define valid features.

 ln.Feature(name="donor", dtype=str).save()
 ln.Feature(name="tissue", dtype=bt.Tissue).save()
 ln.Feature(name="cell_type", dtype=bt.CellType).save()
 ln.Feature(name="assay", dtype=bt.ExperimentalFactor).save()

Let's attempt saving this dataset as a validated & annotated artifact.

 try:
     artifact = ln.Artifact.from_anndata(
         adata, schema="ensembl_gene_ids_and_valid_features_in_obs"
     ).save()
 except ln.errors.ValidationError:
     pass

One cell type isn't validated because it's not part of the "CellType"
registry. Let's standardize the cell type names and then create the
missing record.

 adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
 bt.CellType(name="animal cell").save()
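To make the standardize-then-create step more concrete, here is a minimal sketch in plain Python. It is not how bionty works internally: the "REGISTRY" and "SYNONYMS" tables, the example labels, and the helpers "standardize" and "non_validated" are hypothetical stand-ins for a registry such as "CellType".

```python
# Hypothetical stand-ins for a registry of valid names and its synonym table.
REGISTRY = {"T cell", "B cell", "macrophage"}
SYNONYMS = {"T-cell": "T cell", "T lymphocyte": "T cell", "B-cell": "B cell"}


def standardize(values):
    """Map known synonyms to their registry name; leave other values untouched."""
    return [SYNONYMS.get(v, v) for v in values]


def non_validated(values, registry):
    """Return the sorted unique values that are missing from the registry."""
    return sorted(set(values) - registry)


# Made-up labels, as they might appear in an obs column.
labels = ["T-cell", "B cell", "animal cell", "T lymphocyte"]

standardized = standardize(labels)
missing = non_validated(standardized, REGISTRY)

# Names that are still missing after standardization have to be added
# explicitly, which mirrors creating the "animal cell" record above.
REGISTRY.update(missing)
remaining = non_validated(standardized, REGISTRY)
```

In this sketch, standardization resolves the synonyms, and only a genuinely unknown name is left over for explicit creation.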

We can now save the dataset.

 # runs ~10sec because it imports 40k Ensembl gene IDs from a public ontology
 artifact = ln.Artifact.from_anndata(
     adata,
     key="datasets/conde22.h5ad",
     schema="ensembl_gene_ids_and_valid_features_in_obs",
 ).save()

Some Ensembl gene IDs don't validate because they stem from an older
version of Ensembl. To be 100% sure that all gene identifiers are valid
Ensembl IDs, you can import the genes from the older Ensembl version
into the "Gene" registry (see ). You can also enforce this through the
".var.T" schema by setting "schema.maximal_set=True", which prohibits
any non-validated features in the dataframe.

 artifact.describe()
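The difference between tolerating and prohibiting non-validated features can be sketched in plain Python. This is an illustration of the idea only, not LaminDB's implementation: the function "validate_features", the local "ValidationError" class, and the gene identifiers are all made up for this example.

```python
class ValidationError(Exception):
    """Raised when a feature fails validation (hypothetical stand-in)."""


def validate_features(columns, registry, maximal_set=False):
    """Split columns into validated and non-validated identifiers.

    With maximal_set=True, any identifier outside the registry is an error.
    """
    validated = [c for c in columns if c in registry]
    unknown = [c for c in columns if c not in registry]
    if maximal_set and unknown:
        raise ValidationError(f"{len(unknown)} non-validated features: {unknown}")
    return validated, unknown


# Hypothetical gene registry and dataset columns; the IDs are placeholders.
gene_registry = {"ENSG_A", "ENSG_B", "ENSG_C"}
var_index = ["ENSG_A", "ENSG_B", "ENSG_OLD"]

# Default behavior in this sketch: unknown identifiers are reported, not rejected.
validated, unknown = validate_features(var_index, gene_registry)

# Strict behavior: the same input is rejected as a whole.
try:
    validate_features(var_index, gene_registry, maximal_set=True)
    strict_passed = True
except ValidationError:
    strict_passed = False
```

The lenient mode suits exploratory ingestion where a few outdated identifiers are acceptable. The strict mode is the safer choice when downstream queries rely on every feature being registered.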

#### Seed a collection

Let's create a first version of a collection that will encompass many
"h5ad" files when more data is ingested.

Note:

  To see the result of the incremental growth, take a look at the
  CELLxGENE Census guide for an instance with ~1k h5ads and ~50
  million cells.

 collection = ln.Collection(artifact, key="scrna/collection1").save()

In this first version, the collection holds a single artifact, so the
two match. They are nevertheless tracked independently and can each be
queried through their own registry:

 collection.describe()

Access the underlying artifacts like so:

 collection.artifacts.to_dataframe()
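As a mental model for how a collection relates to its artifacts as more data is ingested, consider the following plain-Python sketch. It does not use LaminDB: the "Artifact" and "Collection" dataclasses, the "append" method, the second dataset key, and the cell counts are hypothetical and only illustrate a versioned group of datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    key: str
    n_observations: int


@dataclass(frozen=True)
class Collection:
    key: str
    artifacts: tuple
    version: int = 1

    def append(self, artifact):
        """Return a new version that also contains `artifact`."""
        return Collection(self.key, self.artifacts + (artifact,), self.version + 1)

    @property
    def n_observations(self):
        """Total number of observations across all artifacts."""
        return sum(a.n_observations for a in self.artifacts)


# Cell counts are made up for illustration.
first = Artifact(key="datasets/conde22.h5ad", n_observations=1000)
v1 = Collection(key="scrna/collection1", artifacts=(first,))

# Appending yields a new version; the first version stays unchanged.
second = Artifact(key="datasets/hypothetical_batch2.h5ad", n_observations=500)
v2 = v1.append(second)
```

Keeping each version immutable means an analysis that referenced version 1 still sees exactly one dataset, even after the collection has grown.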

See data lineage:

 collection.view_lineage()

Finish the run and save the notebook.

 ln.finish()