Concatenate datasets to a single array store¶
In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.
Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices for arbitrary metadata.
This is also what CELLxGENE does to create Census: a number of .h5ad
files are concatenated to give rise to a single tiledbsoma
array store (CELLxGENE: scRNA-seq).
import lamindb as ln
import pandas as pd
import scanpy as sc
import tiledbsoma.io
from functools import reduce
ln.track("oJN8WmVrxI8m0000")
Show code cell output
→ connected lamindb: testuser1/test-scrna
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/utils.py:429: FutureWarning: Importing CSCDataset from `anndata.experimental` is deprecated. Import anndata.abc.CSCDataset instead.
warnings.warn(msg, FutureWarning)
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/utils.py:429: FutureWarning: Importing CSRDataset from `anndata.experimental` is deprecated. Import anndata.abc.CSRDataset instead.
warnings.warn(msg, FutureWarning)
→ created Transform('oJN8WmVr'), started new Run('yO7idoWZ') at 2024-11-21 06:54:39 UTC
→ notebook imports: lamindb==0.76.16 pandas==2.2.3 scanpy==1.10.4 tiledbsoma==1.15.0rc3
Query the collection of h5ad
files that we’d like to concatenate into a single array.
collection = ln.Collection.get(name="My versioned scRNA-seq collection", version="2")
collection.describe()
Show code cell output
Collection(uid='6HBgmdrV4xz24AVj0001', version='2', is_latest=True, name='My versioned scRNA-seq collection', hash='PPKf6IQ4SPZoY0yMQQfI6w', visibility=1, created_at=2024-11-21 06:54:05 UTC)
Provenance
.created_by = 'testuser1'
.transform = 'Standardize and append a dataset'
.run = 2024-11-21 06:53:45 UTC
Usage
.input_of_runs = 2024-11-21 06:54:15 UTC, 2024-11-21 06:54:32 UTC
Prepare the AnnData objects¶
To concatenate the AnnData
objects into a single tiledbsoma.Experiment
, they need to have the same .var
and .obs
columns.
# load a number of AnnData objects that's small enough to fit into memory
adatas = [artifact.load() for artifact in collection.ordered_artifacts]
# compute the intersection of columns for these objects
var_columns = reduce(
pd.Index.intersection, [adata.var.columns for adata in adatas]
) # this only affects metadata columns of features (say, gene annotations)
var_raw_columns = reduce(
pd.Index.intersection, [adata.raw.var.columns for adata in adatas]
)
obs_columns = reduce(
pd.Index.intersection, [adata.obs.columns for adata in adatas]
) # this actually subsets features (dataset dimensions)
Prepare the AnnData
objects for concatenation. Prepare id fields, sanitize index
names, intersect columns, drop .obsp
, .uns
and columns that aren’t part of the intersection.
for i, adata in enumerate(adatas):
del adata.obsp # not supported by tiledbsoma
del adata.uns # not supported by tiledbsoma
adata.obs = adata.obs.filter(obs_columns) # filter columns to intersection
adata.obs["obs_id"] = (
adata.obs.index
) # prepare a column for tiledbsoma to use as an index
adata.obs["dataset"] = i
adata.obs.index.name = None
adata.var = adata.var.filter(var_columns) # filter columns to intersection
adata.var["var_id"] = adata.var.index
adata.var.index.name = None
drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
adata.raw.var["var_id"] = adata.raw.var.index
adata.raw.var.index.name = None
Create the array store¶
Save the AnnData
objects in one array store referenced by an Artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
adatas,
description="tiledbsoma experiment",
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id",
append_obsm_varm=True,
)
Note
Provenance is tracked by writing the current run.uid
to tiledbsoma.Experiment.obs
as lamin_run_uid
.
If you know tiledbsoma
API, then note that save_tiledbsoma_experiment()
abstracts over both tiledbsoma.io.register_anndatas
and tiledbsoma.io.from_anndata
.
Query the array store¶
Here we query the obs
from the array store.
with soma_artifact.open() as soma_store:
obs = soma_store["obs"]
var = soma_store["ms"]["RNA"]["var"]
obs_columns_store = obs.schema.names
var_columns_store = var.schema.names
obs_store_df = obs.read().concat().to_pandas()
display(obs_store_df)
Show code cell output
soma_joinid | cell_type | obs_id | dataset | lamin_run_uid | |
---|---|---|---|---|---|
0 | 0 | classical monocyte | CZINY-0109_CTGGTCTAGTCTGTAC | 0 | yO7idoWZDECgHGxgeXhV |
1 | 1 | T follicular helper cell | CZI-IA10244332+CZI-IA10244434_CCTTCGACATACTCTT | 0 | yO7idoWZDECgHGxgeXhV |
2 | 2 | memory B cell | Pan_T7935491_CTGGTCTGTACATGTC | 0 | yO7idoWZDECgHGxgeXhV |
3 | 3 | alveolar macrophage | Pan_T7980367_GGGCATCCAGGTGGAT | 0 | yO7idoWZDECgHGxgeXhV |
4 | 4 | naive thymus-derived CD4-positive, alpha-beta ... | Pan_T7935494_ATCATGGTCTACCTGC | 0 | yO7idoWZDECgHGxgeXhV |
... | ... | ... | ... | ... | ... |
1713 | 1713 | CD8-positive, CD25-positive, alpha-beta regula... | CAGACAACAAAACG-7 | 1 | yO7idoWZDECgHGxgeXhV |
1714 | 1714 | effector memory CD4-positive, alpha-beta T cel... | ACAGTGTGTACTGG-3 | 1 | yO7idoWZDECgHGxgeXhV |
1715 | 1715 | CD8-positive, CD25-positive, alpha-beta regula... | GAGGGTGACCTATT-1 | 1 | yO7idoWZDECgHGxgeXhV |
1716 | 1716 | CD16-positive, CD56-dim natural killer cell, h... | AGTAATTGGCTTAG-3 | 1 | yO7idoWZDECgHGxgeXhV |
1717 | 1717 | CD38-high pre-BCR positive cell | AATGAGGATGGTTG-4 | 1 | yO7idoWZDECgHGxgeXhV |
1718 rows × 5 columns
Append to the array store¶
Prepare a new AnnData
object to be appended to the store.
ln.core.datasets.anndata_with_obs().write_h5ad("adata_to_append.h5ad")
!lamin save adata_to_append.h5ad --description "adata to append"
Show code cell output
→ connected lamindb: testuser1/test-scrna
→ saved: Artifact(uid='xwzXDVx0UJhrno0U0000', is_latest=True, description='adata to append', suffix='.h5ad', size=46992, hash='IJORtcQUSS11QBqD-nTD0A', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, created_by_id=1, created_at=2024-11-21 06:54:47 UTC)
→ storage path: /home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna/.lamindb/xwzXDVx0UJhrno0U0000.h5ad
adata = ln.Artifact.filter(description="adata to append").one().load()
adata.obs_names_make_unique()
adata.var_names_make_unique()
adata.obs["obs_id"] = adata.obs.index
adata.var["var_id"] = adata.var.index
adata.obs["dataset"] = obs_store_df["dataset"].max()
obs_columns_same = [
obs_col for obs_col in adata.obs.columns if obs_col in obs_columns_store
]
adata.obs = adata.obs[obs_columns_same]
var_columns_same = [
var_col for var_col in adata.var.columns if var_col in var_columns_store
]
adata.var = adata.var[var_columns_same]
adata.write_h5ad("adata_to_append.h5ad")
Show code cell output
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/anndata.py:1758: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
utils.warn_names_duplicates("var")
Append the AnnData
object from disk by revising soma_artifact
.
soma_artifact = ln.integrations.save_tiledbsoma_experiment(
["adata_to_append.h5ad"],
revises=soma_artifact,
measurement_name="RNA",
obs_id_name="obs_id",
var_id_name="var_id",
)
Update the array store¶
Add a new embedding to the existing array store.
# read the data matrix
with soma_artifact.open() as soma_store:
ms_rna = soma_store["ms"]["RNA"]
n_obs = len(soma_store["obs"])
n_var = len(ms_rna["var"])
X = ms_rna["X"]["data"].read().coos((n_obs, n_var)).concat().to_scipy()
# calculate PCA embedding from the queried `X`
pca_array = sc.pp.pca(X, n_comps=2)
Show code cell output
/opt/hostedtoolcache/Python/3.11.10/x64/lib/python3.11/site-packages/anndata/_core/storage.py:48: FutureWarning: AnnData previously had undefined behavior around matrices of type <class 'scipy.sparse._coo.coo_matrix'>.In 0.12, passing in this type will throw an error. Please convert to a supported type.Continue using for this minor version at your own risk.
warnings.warn(msg, FutureWarning)
Open the array store in write mode and add PCA.
with soma_artifact.open(mode="w") as soma_store:
tiledbsoma.io.add_matrix_to_collection(
exp=soma_store,
measurement_name="RNA",
collection_name="obsm",
matrix_name="pca",
matrix_data=pca_array,
)
See array store mutations¶
During the append-to and update operations, the data in the array store was changed. LaminDB automatically tracks these revisions recording the number of objects, hashes, and provenance.
soma_artifact.versions.df()
Show code cell output
uid | version | is_latest | description | key | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | run_id | created_at | created_by_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
5 | PO4E9vTgdvKGDq3v0000 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15088000 | cbYls2PUR3eAC54UVdhaYg | 151 | 1718.0 | md5-d | tiledbsoma | 1 | True | 1 | 6 | 6 | 2024-11-21 06:54:44.388179+00:00 | 1 |
7 | PO4E9vTgdvKGDq3v0001 | None | False | tiledbsoma experiment | None | .tiledbsoma | None | 15099818 | A21F6305SkYKSFNlLP8-eA | 175 | 1758.0 | md5-d | tiledbsoma | 1 | True | 1 | 6 | 6 | 2024-11-21 06:54:48.547062+00:00 | 1 |
8 | PO4E9vTgdvKGDq3v0002 | None | True | tiledbsoma experiment | None | .tiledbsoma | None | 15120626 | AMj-jOV3r4lUAmDY-myNZg | 184 | NaN | md5-d | None | 1 | True | 1 | 6 | 6 | 2024-11-21 06:54:49.526081+00:00 | 1 |
View lineage of the array store¶
Check the generating flow of the array store.
soma_artifact.view_lineage()
Note
For the underlying API, see the tiledbsoma documentation.