Query cellxgene-census using TileDB-SOMA

The first guide queried metadata and h5ad artifacts directly through LaminDB.

This guide uses the TileDB-SOMA API to run similar queries.

Setup

Initialize a LaminDB instance to store the queried data:

!lamin init --storage ./test-cellxgene --schema bionty
💡 connected lamindb: testuser1/test-cellxgene
import lamindb as ln
import bionty as bt
import cellxgene_census

census_version = "2023-07-25"
💡 connected lamindb: testuser1/test-cellxgene

Create lookup objects

We use metadata records in the laminlabs/cellxgene instance to generate lookups:

source = "laminlabs/cellxgene"
human = "homo_sapiens"

features = ln.Feature.using(source).lookup(return_field="name")
assays = bt.ExperimentalFactor.using(source).lookup(return_field="name")
cell_types = bt.CellType.using(source).lookup(return_field="name")
tissues = bt.Tissue.using(source).lookup(return_field="name")
ulabels = ln.ULabel.using(source).lookup()
suspension_types = ulabels.is_suspension_type.children.all().lookup(return_field="name")

Query data

value_filter = (
    f'{features.tissue} == "{tissues.brain}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell}", "{cell_types.neuron}"] and'
    f' {features.suspension_type} == "{suspension_types.cell}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
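The filter above is an ordinary string, so it can also be assembled from plain values. The following is a minimal sketch under stated assumptions: `build_value_filter` is a hypothetical helper (it is not part of cellxgene-census or LaminDB), it only covers equality and membership tests joined by `and`, and it assumes values contain no double quotes.

```python
def build_value_filter(conditions):
    """Join column conditions into a SOMA value_filter string.

    Hypothetical helper for illustration only.
    A str value becomes `column == "value"`; a list or tuple becomes
    `column in ["a", "b"]`. Values are assumed to contain no double quotes.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            quoted = ", ".join(f'"{item}"' for item in value)
            clauses.append(f"{column} in [{quoted}]")
        else:
            clauses.append(f'{column} == "{value}"')
    return " and ".join(clauses)


# Same conditions as the lookup-based filter, written as literal strings
example_filter = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    }
)
print(example_filter)
```

Using the lookup objects, as in the cell above, has the advantage that a misspelled label fails at attribute access instead of silently returning no cells.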
%%time

with cellxgene_census.open_soma(census_version=census_version) as census:
    # Reads SOMADataFrame as a slice
    cell_metadata = census["census_data"][human].obs.read(value_filter=value_filter)

    # Concatenates results to pyarrow.Table
    cell_metadata = cell_metadata.concat()

    # Converts to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()
CPU times: user 4.38 s, sys: 2.46 s, total: 6.84 s
Wall time: 10.5 s
cell_metadata.shape
(66418, 21)
cell_metadata.head()
   soma_joinid  dataset_id                            assay      assay_ontology_term_id  cell_type        cell_type_ontology_term_id  development_stage        development_stage_ontology_term_id  disease       disease_ontology_term_id  ...  is_primary_data  self_reported_ethnicity  self_reported_ethnicity_ontology_term_id  sex     sex_ontology_term_id  suspension_type  tissue  tissue_ontology_term_id  tissue_general  tissue_general_ontology_term_id
0  29071956     c888b684-6c51-431f-972a-6c963044cef0  10x 3' v3  EFO:0009922             microglial cell  CL:0000129                  68-year-old human stage  HsapDv:0000162                      glioblastoma  MONDO:0018177             ...  False            unknown                  unknown                                   female  PATO:0000383          cell             brain   UBERON:0000955           brain           UBERON:0000955
1  29071957     c888b684-6c51-431f-972a-6c963044cef0  10x 3' v3  EFO:0009922             microglial cell  CL:0000129                  68-year-old human stage  HsapDv:0000162                      glioblastoma  MONDO:0018177             ...  False            unknown                  unknown                                   female  PATO:0000383          cell             brain   UBERON:0000955           brain           UBERON:0000955
2  29071964     c888b684-6c51-431f-972a-6c963044cef0  10x 3' v3  EFO:0009922             microglial cell  CL:0000129                  68-year-old human stage  HsapDv:0000162                      glioblastoma  MONDO:0018177             ...  False            unknown                  unknown                                   female  PATO:0000383          cell             brain   UBERON:0000955           brain           UBERON:0000955
3  29071966     c888b684-6c51-431f-972a-6c963044cef0  10x 3' v3  EFO:0009922             microglial cell  CL:0000129                  68-year-old human stage  HsapDv:0000162                      glioblastoma  MONDO:0018177             ...  False            unknown                  unknown                                   female  PATO:0000383          cell             brain   UBERON:0000955           brain           UBERON:0000955
4  29071967     c888b684-6c51-431f-972a-6c963044cef0  10x 3' v3  EFO:0009922             microglial cell  CL:0000129                  68-year-old human stage  HsapDv:0000162                      glioblastoma  MONDO:0018177             ...  False            unknown                  unknown                                   female  PATO:0000383          cell             brain   UBERON:0000955           brain           UBERON:0000955

5 rows × 21 columns

Create AnnData

%%time

with cellxgene_census.open_soma(census_version=census_version) as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism=human,
        obs_value_filter=value_filter,
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )
CPU times: user 41.1 s, sys: 17.6 s, total: 58.6 s
Wall time: 51.6 s
adata.var = adata.var.set_index("feature_id")
adata
AnnData object with n_obs × n_vars = 66418 × 60664
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_length'
adata.var.head()
feature_id       soma_joinid  feature_name  feature_length
ENSG00000121410  0            A1BG          3999
ENSG00000268895  1            A1BG-AS1      3374
ENSG00000148584  2            A1CF          9603
ENSG00000175899  3            A2M           6318
ENSG00000245105  4            A2M-AS1       2948
adata.obs.head()
   assay      cell_type        tissue  disease       suspension_type
0  10x 3' v3  microglial cell  brain   glioblastoma  cell
1  10x 3' v3  microglial cell  brain   glioblastoma  cell
2  10x 3' v3  microglial cell  brain   glioblastoma  cell
3  10x 3' v3  microglial cell  brain   glioblastoma  cell
4  10x 3' v3  microglial cell  brain   glioblastoma  cell

Register the queried AnnData

ln.transform.stem_uid = "6oq3VJy5yxIU"
ln.transform.version = "0"
ln.track()
💡 notebook imports: bionty==0.44.0 cellxgene-census==1.14.1 lamindb==0.74.0
💡 saved: Transform(uid='6oq3VJy5yxIU6K79', version='0', name='Query cellxgene-census using TileDB-SOMA', key='query-census', type='notebook', created_by_id=1, updated_at='2024-06-20 13:51:20 UTC')
💡 saved: Run(uid='N6yIamHx6C5Um4KuESgN', transform_id=1, created_by_id=1)
Run(uid='N6yIamHx6C5Um4KuESgN', started_at='2024-06-20 13:51:20 UTC', is_consecutive=True, transform_id=1, created_by_id=1)

Register genes and features:

bt.settings.organism = "human"
genes = bt.Gene.from_values(adata.var_names, field=bt.Gene.ensembl_gene_id)
ln.save(genes)

features = ln.Feature.from_df(adata.obs)
ln.save(features)
did not create Gene records for 289 non-validated ensembl_gene_ids: 'ENSG00000112096', 'ENSG00000137808', 'ENSG00000161149', 'ENSG00000182230', 'ENSG00000203441', 'ENSG00000203812', 'ENSG00000204092', 'ENSG00000205485', 'ENSG00000212951', 'ENSG00000214783', 'ENSG00000214970', 'ENSG00000215067', 'ENSG00000215271', 'ENSG00000221995', 'ENSG00000223458', 'ENSG00000223797', 'ENSG00000224167', 'ENSG00000224247', 'ENSG00000224739', 'ENSG00000224745', ...
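The message above lists identifiers that could not be matched against the gene registry, so no Gene records were created for them. As a purely conceptual illustration, the split into validated and non-validated values can be sketched in plain Python. Everything here is an assumption for illustration: `split_validated` is a hypothetical helper, the three-entry `reference_ids` set is a toy stand-in for the registry, and this is not how `bt.Gene.from_values` is implemented.

```python
def split_validated(values, reference):
    """Partition values into (validated, non_validated), preserving order.

    Hypothetical helper for illustration only.
    """
    reference = set(reference)
    validated, non_validated = [], []
    for value in values:
        if value in reference:
            validated.append(value)
        else:
            non_validated.append(value)
    return validated, non_validated


# Toy stand-in for a gene registry (the real one is backed by an Ensembl release)
reference_ids = {"ENSG00000121410", "ENSG00000268895", "ENSG00000148584"}

# Two identifiers from adata.var and one reported as non-validated above
var_names = ["ENSG00000121410", "ENSG00000268895", "ENSG00000112096"]

validated, non_validated = split_validated(var_names, reference_ids)
share = 100 * len(non_validated) / len(var_names)
print(f"{len(non_validated)} of {len(var_names)} ids ({share:.2f}%) not validated: {non_validated}")
```

Identifiers typically fail validation when they are absent from the Ensembl release behind the registry, which is why a small fraction of Census genes may be skipped.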

Register the AnnData object:

artifact = ln.Artifact.from_anndata(
    adata,
    description=(
        "microglial and neuron cell data from 10x 3' v3 in brain queried from Census"
    ),
)
artifact.save()
Artifact(uid='wb3X7jueU9EFyC1Gq39t', description='microglial and neuron cell data from 10x 3' v3 in brain queried from Census', suffix='.h5ad', type='dataset', accessor='AnnData', size=674995866, hash='v8QkSfHA4jUocUskUyBSzl', hash_type='sha1-fl', visibility=1, key_is_virtual=True, created_by_id=1, storage_id=1, transform_id=1, run_id=1, updated_at='2024-06-20 13:51:40 UTC')

Link validated metadata:

artifact.features._add_set_from_anndata(var_field=bt.Gene.ensembl_gene_id)
289 terms (0.50%) are not validated for ensembl_gene_id: ENSG00000263388, ENSG00000285162, ENSG00000276814, ENSG00000228434, ENSG00000283517, ENSG00000264067, ENSG00000273388, ENSG00000282080, ENSG00000283504, ENSG00000283648, ENSG00000237513, ENSG00000239467, ENSG00000236886, ENSG00000262292, ENSG00000226747, ENSG00000248103, ENSG00000260060, ENSG00000273576, ENSG00000232411, ENSG00000256427, ...
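# Look up the Feature records registered above, keyed by name
features = ln.Feature.lookup().dict()

# Map each obs column to the registry that holds its labels
for col, orm in {
    "assay": bt.ExperimentalFactor,
    "cell_type": bt.CellType,
    "tissue": bt.Tissue,
    "disease": bt.Disease,
    "suspension_type": ln.ULabel,
}.items():
    # Create records for the values that validate against the registry
    labels = orm.from_values(adata.obs[col])
    if len(labels) == 0:
        # Nothing validated (e.g. the ULabel "cell"): create records from the raw names
        labels = [orm(name=name) for name in adata.obs[col].unique()]
    ln.save(labels)
    # Link the labels to the artifact under the matching feature
    artifact.labels.add(labels, features.get(col))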
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
did not create ULabel record for 1 non-validated name: 'cell'
artifact.describe()
Artifact(uid='wb3X7jueU9EFyC1Gq39t', description='microglial and neuron cell data from 10x 3' v3 in brain queried from Census', suffix='.h5ad', type='dataset', accessor='AnnData', size=674995866, hash='v8QkSfHA4jUocUskUyBSzl', hash_type='sha1-fl', visibility=1, key_is_virtual=True, updated_at='2024-06-20 13:51:47 UTC')
  Provenance
    .created_by = 'testuser1'
    .storage = '/home/runner/work/cellxgene-lamin/cellxgene-lamin/docs/test-cellxgene'
    .transform = 'Query cellxgene-census using TileDB-SOMA'
    .run = '2024-06-20 13:51:20 UTC'
  Labels
    .tissues = 'brain'
    .cell_types = 'microglial cell', 'neuron'
    .diseases = 'glioblastoma'
    .experimental_factors = '10x 3' v3'
    .ulabels = 'cell'
  Features
    'suspension_type' = 'cell'
    'tissue' = 'brain'
    'cell_type' = 'microglial cell', 'neuron'
    'disease' = 'glioblastoma'
    'assay' = '10x 3' v3'
  Feature sets
    'var' = 'A1BG', 'A1BG-AS1', 'A1CF', 'A2M', 'A2M-AS1', 'A2ML1', 'A2ML1-AS1', 'A3GALT2', 'A4GALT', 'A4GNT', 'AAAS', 'AACS', 'AADAC', 'AADACL2', 'AADACL2-AS1', 'AADACL3', 'AADACL4', 'AADAT', 'PRXL2C', 'AAGAB'
    'obs' = 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
artifact.view_lineage()
[Figure: lineage graph of the artifact, rendered by artifact.view_lineage()]
# clean up test instance
# delete the artifact first: `lamin delete` refuses to remove an instance
# whose storage still contains files other than `_is_initialized`
artifact.delete(permanent=True)
!lamin delete --force test-cellxgene
!rm -r ./test-cellxgene