
Concatenate datasets to a single array store

In the previous notebooks, we’ve seen how to incrementally create a collection of scRNA-seq datasets and train models on it.

Sometimes we want to concatenate all datasets into one big array to speed up ad-hoc queries for slices defined by arbitrary metadata (see this blog post). This is what CELLxGENE does to create Census: a number of .h5ad files are concatenated into a single tiledbsoma array store (CELLxGENE: scRNA-seq).

Note

This notebook is based on the tiledbsoma documentation.

import lamindb as ln
import pandas as pd
import tiledbsoma
import tiledbsoma.io
from functools import reduce
💡 connected lamindb: testuser1/test-scrna
ln.settings.transform.stem_uid = "oJN8WmVrxI8m"
ln.settings.transform.version = "1"
ln.track()
💡 notebook imports: lamindb==0.74.3 pandas==2.2.2 tiledbsoma==1.12.3
💡 saved: Transform(uid='oJN8WmVrxI8m5zKv', version='1', name='Concatenate datasets to a single array store', key='scrna6', type='notebook', created_by_id=1, updated_at='2024-07-26 12:22:57 UTC')
💡 saved: Run(uid='C292giAASAHi2Cvjh1Vk', transform_id=6, created_by_id=1)
Run(uid='C292giAASAHi2Cvjh1Vk', started_at='2024-07-26 12:22:57 UTC', is_consecutive=True, transform_id=6, created_by_id=1)

Query the collection of h5ad files that we’d like to convert into a single array.

collection = ln.Collection.filter(
    name="My versioned scRNA-seq collection", version="2"
).one()
collection.describe()
Collection(uid='7hRDhzTTUWLHntihz8tr', version='2', name='My versioned scRNA-seq collection', hash='Umjxg4HR1wkZqKROsyz1', visibility=1, updated_at='2024-07-26 12:22:23 UTC')
  Provenance
    .created_by = 'testuser1'
    .transform = 'Standardize and append a batch of data'
    .run = '2024-07-26 12:21:55 UTC'
    .input_of = ["'2024-07-26 12:22:33 UTC'", "'2024-07-26 12:22:50 UTC'"]
  Feature sets
    'obs' = 'donor', 'tissue', 'cell_type', 'assay'
    'var' = 'MIR1302-2HG', 'FAM138A', 'OR4F5', 'None', 'OR4F29', 'OR4F16', 'LINC01409', 'FAM87B', 'LINC01128', 'LINC00115', 'FAM41C'

Prepare the array store

Prepare a path and a context for a new tiledbsoma.Experiment.

We will create our array store at the LaminDB instance root with folder name "scrna.tiledbsoma".

storage_settings = ln.setup.settings.storage
soma_path = (storage_settings.root / "scrna.tiledbsoma").as_posix()  # any AWS S3 path would also work here

If our path is on AWS S3, we need to create a context with region information (exception: us-east-1). You can find more about tiledb configuration parameters in the tiledb documentation.

if storage_settings.type == "s3":  # if the storage location is on AWS S3
    storage_region = storage_settings.region
    ctx = tiledbsoma.SOMATileDBContext(tiledb_config={"vfs.s3.region": storage_region})
else:
    ctx = None

Prepare the AnnData objects

We need to prepare the AnnData objects in the collection so that they can be concatenated into one tiledbsoma.Experiment. They need to have the same .var and .obs columns, and .uns and .obsp should be removed.

adatas = [artifact.load() for artifact in collection.artifacts]

Compute the intersection of all columns. All AnnData objects should have the same columns in their .obs, .var, and .raw.var to be ingested into one tiledbsoma.Experiment.

obs_columns = reduce(pd.Index.intersection, [adata.obs.columns for adata in adatas])
var_columns = reduce(pd.Index.intersection, [adata.var.columns for adata in adatas])
var_raw_columns = reduce(pd.Index.intersection, [adata.raw.var.columns for adata in adatas])
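To make the reduce pattern above concrete, here is a minimal sketch on toy column indices. The column names are made up for illustration and are not taken from the collection.

```python
from functools import reduce

import pandas as pd

# Hypothetical column indices of three datasets (illustration only)
columns_per_dataset = [
    pd.Index(["cell_type", "donor", "tissue"]),
    pd.Index(["cell_type", "tissue", "assay"]),
    pd.Index(["tissue", "cell_type"]),
]

# reduce applies pd.Index.intersection pairwise, from left to right,
# so only columns present in every dataset survive
shared_columns = reduce(pd.Index.intersection, columns_per_dataset)
print(sorted(shared_columns))
```

Only cell_type and tissue remain, because they are the only columns that occur in all three indices.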

Prepare the AnnData objects for concatenation: prepare id fields, sanitize index names, intersect columns, and drop slots. Here we have to drop .obsp and .uns as well as the dataframe columns that are not in the intersections obtained above, otherwise the ingestion will fail. We will need to provide obs and var names in tiledbsoma.io.register_anndatas, so we create these fields (obs_id, var_id) from the dataframe indices.

for i, adata in enumerate(adatas):
    del adata.obsp
    del adata.uns
    
    adata.obs = adata.obs.filter(obs_columns)
    adata.obs["obs_id"] = adata.obs.index
    adata.obs["dataset"] = i
    adata.obs.index.name = None
    
    adata.var = adata.var.filter(var_columns)
    adata.var["var_id"] = adata.var.index
    adata.var.index.name = None
    
    drop_raw_var_columns = adata.raw.var.columns.difference(var_raw_columns)
    adata.raw.var.drop(columns=drop_raw_var_columns, inplace=True)
    adata.raw.var["var_id"] = adata.raw.var.index
    adata.raw.var.index.name = None
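The effect of the dataframe steps in the loop above can be seen on a small standalone example. This is a sketch on a toy pandas DataFrame standing in for adata.obs; the barcodes, the batch column, and the index name are hypothetical.

```python
import pandas as pd

# Toy stand-in for adata.obs with a named index and one extra column
obs = pd.DataFrame(
    {"cell_type": ["B cell", "T cell"], "batch": [1, 2]},
    index=pd.Index(["AAAC-1", "AAAG-1"], name="barcode"),
)
shared_columns = pd.Index(["cell_type"])  # assumed result of the intersection

obs = obs.filter(shared_columns)  # keep only the shared columns
obs["obs_id"] = obs.index  # expose the index as a regular column
obs.index.name = None  # sanitize the index name

print(obs)
```

After these steps the batch column is gone, obs_id holds the former index values, and the index no longer carries a name.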

Create the array store

Register all the AnnData objects. Pass experiment_uri=None because the tiledbsoma.Experiment doesn’t exist yet:

registration_mapping = tiledbsoma.io.register_anndatas(
    experiment_uri=None,
    adatas=adatas,
    measurement_name="RNA",
    obs_field_name="obs_id",
    var_field_name="var_id",
    append_obsm_varm=True
)
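Conceptually, registration determines ahead of ingestion which integer join id each obs and var label will receive in the combined store. The following toy sketch illustrates that idea in plain Python. It is not tiledbsoma's implementation, and the function name assign_join_ids is hypothetical.

```python
from itertools import count


def assign_join_ids(obs_ids_per_dataset):
    """Map every obs_id to one integer that is unique across all datasets."""
    next_id = count()
    mapping = {}
    for obs_ids in obs_ids_per_dataset:
        for obs_id in obs_ids:
            # a label seen before keeps the id it was given first
            if obs_id not in mapping:
                mapping[obs_id] = next(next_id)
    return mapping


mapping = assign_join_ids(
    [
        ["GCAGGGCTGGATTC-1", "CTTTAGTGGTTACG-6"],
        ["Pan_T7991594_CTCACACTCCAGGGCT"],
    ]
)
print(mapping)
```

Because the ids are fixed before any data is written, each dataset can afterwards be ingested on its own without colliding with the others.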

Ingest the AnnData objects sequentially, providing the context. This saves the AnnData objects in one array store.

for adata in adatas:
    tiledbsoma.io.from_anndata(
        experiment_uri=soma_path,
        anndata=adata,
        measurement_name="RNA",
        registration_mapping=registration_mapping,
        context=ctx
    )
/opt/hostedtoolcache/Python/3.10.14/x64/lib/python3.10/abc.py:119: FutureWarning: SparseDataset is deprecated and will be removed in late 2024. It has been replaced by the public classes CSRDataset and CSCDataset.

For instance checks, use `isinstance(X, (anndata.experimental.CSRDataset, anndata.experimental.CSCDataset))` instead.

For creation, use `anndata.experimental.sparse_dataset(X)` instead.

  return _abc_instancecheck(cls, instance)

Register the array store

Register the created tiledbsoma.Experiment store in lamindb:

soma_artifact = ln.Artifact(soma_path, description="My scRNA-seq SOMA Experiment").save()
soma_artifact.describe()
Artifact(uid='sA23vMamY3JGlwYeeWg8', description='My scRNA-seq SOMA Experiment', key='scrna.tiledbsoma', suffix='.tiledbsoma', type='dataset', size=15060274, hash='cml-6ov9zMntdrEG-5wddw', hash_type='md5-d', n_objects=143, visibility=1, key_is_virtual=False, updated_at='2024-07-26 12:23:04 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/lamin-usecases/lamin-usecases/docs/test-scrna'
    .transform = 'Concatenate datasets to a single array store'
    .run = '2024-07-26 12:22:57 UTC'

Query the array

Open and query the experiment. We can use the registered Artifact.

with soma_artifact.open() as soma_array:
    print(soma_array["obs"].read().concat().to_pandas())
      soma_joinid                                          cell_type  \
0               0                                     dendritic cell   
1               1                              B cell, CD19-positive   
2               2                                     dendritic cell   
3               3                              B cell, CD19-positive   
4               4  effector memory CD4-positive, alpha-beta T cel...   
...           ...                                                ...   
1713         1713  naive thymus-derived CD4-positive, alpha-beta ...   
1714         1714  naive thymus-derived CD4-positive, alpha-beta ...   
1715         1715  naive thymus-derived CD4-positive, alpha-beta ...   
1716         1716             CD8-positive, alpha-beta memory T cell   
1717         1717  naive thymus-derived CD4-positive, alpha-beta ...   

                             obs_id  dataset  
0                  GCAGGGCTGGATTC-1        0  
1                  CTTTAGTGGTTACG-6        0  
2                  TGACTGGAACCATG-7        0  
3                  TCAATCACCCTTCG-8        0  
4                  CGTTATACAGTACC-8        0  
...                             ...      ...  
1713  Pan_T7991594_CTCACACTCCAGGGCT        1  
1714  Pan_T7980358_CGAGCACAGAAGATTC        1  
1715    CZINY-0064_AGACCATCACGCTGCA        1  
1716    CZINY-0050_TCGATTTAGATGTTGA        1  
1717    CZINY-0064_AGTGTTGTCCGAGCTG        1  

[1718 rows x 4 columns]
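Reading the full obs dataframe is rarely needed. The read method of a tiledbsoma DataFrame also accepts value_filter and column_names arguments, so a slice can be selected inside the store. Below is a minimal sketch; the helper name read_obs is hypothetical, and the filter expression assumes the dataset column created during preparation.

```python
def read_obs(experiment, value_filter=None, column_names=None):
    """Read (a slice of) the obs dataframe of an open SOMA experiment into pandas."""
    return (
        experiment["obs"]
        .read(value_filter=value_filter, column_names=column_names)
        .concat()
        .to_pandas()
    )


# Usage with the artifact registered above (not executed here):
# with soma_artifact.open() as experiment:
#     first_dataset = read_obs(
#         experiment,
#         value_filter="dataset == 0",
#         column_names=["obs_id", "cell_type", "dataset"],
#     )
```

Pushing the filter into the read call avoids loading rows and columns that are discarded right away.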