
CELLxGENE: scRNA-seq

CZ CELLxGENE hosts one of the largest standardized collections of scRNA-seq data. LaminDB provides a streamlined interface to query and load it.

You can use the CELLxGENE data in two ways:

  1. Query collections of AnnData objects.

  2. Query slices from a single concatenated dataset without downloading everything via TileDB-SOMA.

To build similar data assets in-house:

  1. See the transfer guide to zero-copy data to your own LaminDB instance.

  2. See the scRNA guide to create a growing, standardized & versioned scRNA-seq dataset collection.

# pip install lamindb
!lamin init --modules bionty --storage ./test-cellxgene
 initialized lamindb: testuser1/test-cellxgene
import lamindb as ln
import bionty as bt
 connected lamindb: testuser1/test-cellxgene

Create the central query object for this instance:

db = ln.DB("laminlabs/cellxgene")

Query for individual datasets

Every individual dataset in CELLxGENE is an .h5ad file that is stored as an artifact in LaminDB. Here is an example query:

users = db.User.lookup()
cell_types = db.bionty.CellType.lookup()

db.Artifact.filter(
    suffix=".h5ad",
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__name__in=["B cell", "T cell"],  # cell types measured in AnnData
    created_by__handle="sunnyosun",  # match the user handle, as in the include below
).order_by("created_at").to_dataframe(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
uid id key
What happens under the hood?

The ln.Artifact registry has a cell_types field, ln.Artifact.cell_types, which relates artifacts with bt.CellType.

The expression cell_types__name__in performs the join of the underlying registries and matches bt.CellType.name to ["B cell", "T cell"].

Similarly, created_by relates artifacts with ln.User.
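To make the double-underscore lookup concrete, here is a minimal pure-Python sketch of how an expression such as cell_types__name__in can be read: follow the relation, pick a field, apply the operator. The toy records and the matches helper are hypothetical and only illustrate the idea; this is not how LaminDB evaluates the query, which runs as a join in the database.

```python
# Toy stand-ins for the Artifact and CellType registries (hypothetical data).
artifacts = [
    {"description": "immune atlas", "cell_types": [{"name": "B cell"}, {"name": "T cell"}]},
    {"description": "retina atlas", "cell_types": [{"name": "rod cell"}]},
    {"description": "blood immune", "cell_types": [{"name": "T cell"}]},
]


def matches(record, lookup, value):
    """Evaluate one `relation__field__in` lookup against a toy record."""
    relation, field, operator = lookup.split("__")
    if operator != "in":
        raise ValueError(f"unsupported operator: {operator}")
    # The "join": follow the relation to the linked records, then compare the field.
    return any(related[field] in value for related in record[relation])


hits = [
    artifact["description"]
    for artifact in artifacts
    if matches(artifact, "cell_types__name__in", ["B cell", "T cell"])
]
print(hits)
```

An artifact matches as soon as one of its linked cell types has a name in the list, which is why both the first and the third toy record are returned.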

Slice an individual dataset

Let’s look at a CELLxGENE artifact and show its metadata using .describe().

artifact = db.Artifact.get(description="Mature kidney dataset: immune", is_latest=True)
artifact.describe()
Artifact: cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad (2025-01-30)
|   description: Mature kidney dataset: immune
├── uid: WwmBIhBNLTlRcSoBDt77            run: o9WY9Nh (annotate_2025_30_01_LTS.py)
kind: None                           otype: AnnData                           
hash: yUEFmGNLKpPqf-VNn0flzg         size: 43.1 MB                            
branch: main                         space: all                               
created_at: 2025-07-30 09:51:09 UTC  created_by: zethson                      
n_observations: 7803                                                          
├── storage/path: s3://cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad
├── Dataset features
├── obs (20)                                                                                                   
│   assay_ontology_term_id          bionty.ExperimentalFactor.ontolo…  EFO:0009899                             
│   cell_type_ontology_term_id      bionty.CellType.ontology_id[sour…  CL:0000097, CL:0000236, CL:0000451, CL:…
│   development_stage_ontology_te…  bionty.DevelopmentalStage.ontolo…  HsapDv:0000096, HsapDv:0000098, HsapDv:…
│   disease_ontology_term_id        bionty.Disease.ontology_id[sourc…  PATO:0000461                            
│   organism_ontology_term_id       bionty.Organism.ontology_id[sour…  NCBITaxon:9606                          
│   self_reported_ethnicity_ontol…  bionty.Ethnicity.ontology_id[sou…  unknown                                 
│   sex_ontology_term_id            bionty.Phenotype.ontology_id[sou…  PATO:0000383, PATO:0000384              
│   suspension_type                 ULabel                             cell                                    
│   tissue_ontology_term_id         bionty.Tissue.ontology_id[source…  UBERON:0000362, UBERON:0001224, UBERON:…
│   tissue_type                     ULabel                             tissue                                  
│   assay                           bionty.ExperimentalFactor[source…                                          
│   cell_type                       bionty.CellType[source__uid='3Uw…                                          
│   development_stage               bionty.DevelopmentalStage[source…                                          
│   disease                         bionty.Disease[source__uid='4a3e…                                          
│   donor_id                        str                                                                        
│   self_reported_ethnicity         bionty.Ethnicity[source__uid='MJ…                                          
│   sex                             bionty.Phenotype[source__uid='3o…                                          
│   tissue                          bionty.Tissue[source__uid='MUtAG…                                          
│   organism                        bionty.Organism.scientific_name[…                                          
│   is_primary_data                 ULabel                                                                     
└── var (2)                                                                                                    
    var_index                       bionty.Gene.ensembl_gene_id[sour…                                          
    feature_is_filtered             bool                                                                       
├── External features
└── n_of_donors                     int                                13                                      
└── Labels
    └── .ulabels                        ULabel                             cell, tissue                            
        .references                     Reference                          Spatiotemporal immune zonation of the h…
        .organisms                      bionty.Organism                    human                                   
        .tissues                        bionty.Tissue                      kidney blood vessel, renal pelvis, cort…
        .cell_types                     bionty.CellType                    natural killer cell, CD8-positive, alph…
        .diseases                       bionty.Disease                     normal                                  
        .phenotypes                     bionty.Phenotype                   female, male                            
        .experimental_factors           bionty.ExperimentalFactor          10x 3' v2                               
        .developmental_stages           bionty.DevelopmentalStage          63-year-old stage, 2-year-old stage, 4-…
        .ethnicities                    bionty.Ethnicity                   unknown                                 
More ways of accessing metadata

Access just features:

artifact.features

Or get labels given a feature:

artifact.labels.get(features.tissue).to_dataframe()

To query & load a slice of the array data, you have several options:

  1. Cache the artifact on disk and return the path to the cached data like: artifact.cache() -> Path

  2. Cache & load the entire artifact into memory via artifact.load() -> AnnData

  3. Stream the array using a (cloud-backed) accessor artifact.open() -> AnnDataAccessor

All of these options run much faster in the AWS us-west-2 data center.

Cache:

cache_path = artifact.cache()
cache_path
! run input wasn't tracked, call `ln.track()` and re-run
PosixUPath('/home/runner/.cache/lamindb/cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad')

Cache & load:

adata = artifact.load()
adata
! run input wasn't tracked, call `ln.track()` and re-run
AnnData object with n_obs × n_vars = 7803 × 32839
    obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

We now have an AnnData object whose .obs slot stores observation annotations matching our artifact-level query, so we can reuse almost the same query at the array level.

See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query
# same lookups and filter as in the collection query later in this guide
collection = db.Collection.get(key="cellxgene-census", version="2025-01-30")
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

AnnData uses pandas to manage metadata, so the syntax differs slightly, but the same metadata records are used.

Stream, slice and load the slice into memory:

with artifact.open() as adata_backed:
    display(adata_backed)
! run input wasn't tracked, call `ln.track()` and re-run
AnnDataAccessor object with n_obs × n_vars = 7803 × 32839
  constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
    obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'observation_joinid', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id', 'tissue_type']
    obsm: ['X_umap']
    raw: ['X', 'var', 'varm']
    uns: ['citation', 'default_embedding', 'schema_reference', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_length', 'feature_name', 'feature_reference', 'feature_type']

We now have an AnnDataAccessor object, which behaves much like an AnnData, and slicing looks similar to the query above.

See the slicing operation
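# re-open the accessor: the one from the previous cell is closed when its `with` block ends
with artifact.open() as adata_backed:
    adata_backed_slice = adata_backed[
        adata_backed.obs.cell_type.isin(
            [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
        )
        & (adata_backed.obs.tissue == tissues.kidney.name)
        & (adata_backed.obs.suspension_type == suspension_types.cell.name)
        & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
    ]
    # load only the selected slice into memory
    adata_slice_in_memory = adata_backed_slice.to_memory()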

Query collections of datasets

Let’s search collections from CELLxGENE within the 2025-01-30 release:

db.Collection.filter(version="2025-01-30").search("human retina", limit=3)
<QuerySet [Collection(uid='8ohRJQq8e3F7pdlBZbi0', version='2025-01-30', is_latest=True, key='Single cell atlas of the human retina', description='10.1101/2023.11.07.566105', hash='_S609Z-_U4hnq1MxfIv23g', reference='4c6eaf5c-6d57-4c76-b1e9-60df8c655f1e', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=8, run_id=44, created_at=2025-08-04 14:53:28 UTC, is_locked=False), Collection(uid='tZYmzwfh0bIYzKBQVurp', version='2025-01-30', is_latest=True, key='Cell Types of the Human Retina and Its Organoids at Single-Cell Resolution', description='10.1016/j.cell.2020.08.013', hash='kzC9nzkfigH0xy1BWOWYnQ', reference='2f4c738f-e2f3-4553-9db2-0582a38ea4dc', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=8, run_id=44, created_at=2025-08-04 14:53:29 UTC, is_locked=False), Collection(uid='quQDnLsMLkP3JRsC8gp5', version='2025-01-30', is_latest=True, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='_bg3b6SweW6v1TJX7NgHCw', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=8, run_id=44, created_at=2025-08-04 14:53:38 UTC, is_locked=False)]>

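Let’s get the record of an earlier version (2024-07-01) of one of these hits, “Single-cell transcriptomic atlas for adult human retina”, by its uid: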

collection = db.Collection.get("quQDnLsMLkP3JRsC8gp4")
collection
Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:51 UTC, is_locked=False)

It’s a Cell Genomics paper, and we can find more information on it using the DOI or the CELLxGENE collection ID. There are multiple versions of this collection:

collection.versions.to_dataframe()
uid key description hash reference reference_type version is_latest is_locked created_at branch_id space_id created_by_id run_id meta_artifact_id
id
767 quQDnLsMLkP3JRsC8gp5 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 _bg3b6SweW6v1TJX7NgHCw af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2025-01-30 True False 2025-08-04 14:53:38.098405+00:00 1 1 8 44.0 None
606 quQDnLsMLkP3JRsC8gp4 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 NIo8G6_reJTEqMzW2nMc af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2024-07-01 False False 2024-07-16 12:21:51.449109+00:00 1 1 1 27.0 None
291 quQDnLsMLkP3JRsCJNGB Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 FsD52kpR7dF2h78-P3ka af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2023-12-15 False False 2024-01-11 13:41:01.880382+00:00 1 1 1 22.0 None
134 quQDnLsMLkP3JRsC6WWz Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 xhfSShX8lypXPx00zevx af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2023-07-25 False False 2024-01-08 12:22:12.891930+00:00 1 1 1 NaN None
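As a small illustration of how to read this versions table, the sketch below works on the uid, version, and is_latest values copied from the rows above. That all four uids share a 16-character prefix is an observation from this table, not a guarantee stated in this guide, and the variable names are hypothetical.

```python
import os

# Rows copied from the versions table above: (uid, version, is_latest).
versions = [
    ("quQDnLsMLkP3JRsC8gp5", "2025-01-30", True),
    ("quQDnLsMLkP3JRsC8gp4", "2024-07-01", False),
    ("quQDnLsMLkP3JRsCJNGB", "2023-12-15", False),
    ("quQDnLsMLkP3JRsC6WWz", "2023-07-25", False),
]

# Longest prefix shared by all uids of this collection.
shared_prefix = os.path.commonprefix([uid for uid, _, _ in versions])

# The record flagged as latest, and the record with the newest version tag.
# ISO-formatted dates sort correctly as plain strings.
flagged_latest = [uid for uid, _, is_latest in versions if is_latest]
newest_by_tag = max(versions, key=lambda row: row[1])[0]

print(shared_prefix, flagged_latest, newest_by_tag)
```

In this table, the is_latest flag and the newest version tag point to the same record.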

The collection groups artifacts.

collection.artifacts.to_dataframe()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
3600 80xlsVmayPPBCCEZ7aBc cell-census/2024-07-01/h5ads/ed419b4e-db9b-40f... Non-neuronal cells in human retina .h5ad dataset AnnData 1070671504 slN6j-9aSrYFw-IPL-wv-A None 18011 2024-07-01 False False 2024-07-12 12:34:10.255394+00:00 1 1 2 27 None 1
3378 Ce4Mqe4X2vUhwkwnh5YQ cell-census/2024-07-01/h5ads/aad97cb5-f375-45e... Retinal ganglion cells in human retina .h5ad dataset AnnData 784580498 w-_LJDfBv7vsZqw-9Jt72g None 11617 2024-07-01 False False 2024-07-12 12:34:09.816906+00:00 1 1 2 27 None 1
3273 1OyQQLNfu1nzvVADODND cell-census/2024-07-01/h5ads/8f10185b-e0b3-46a... Bipolar cells in human retina .h5ad dataset AnnData 3075818557 1GQwZcymSrr7d2Xit-5Deg None 53040 2024-07-01 False False 2024-07-12 12:34:09.644258+00:00 1 1 2 27 None 1
3018 QpuY5RsGTBBMN61QGY4t cell-census/2024-07-01/h5ads/359f7af4-87d4-411... Amacrine cells in human retina .h5ad dataset AnnData 3382221253 S7gXlC-cJ362BOqYZFxMOA None 56507 2024-07-01 False False 2024-07-12 12:34:09.160201+00:00 1 1 2 27 None 1
2919 GA2BXWwoJlcRfzNp3iyQ cell-census/2024-07-01/h5ads/11ef37ee-2173-458... Horizontal cells in human retina .h5ad dataset AnnData 404987285 fR0O7fSUHxmAfEDC8J7Ipw None 7348 2024-07-01 False False 2024-07-12 12:34:08.949267+00:00 1 1 2 27 None 1
2855 wYiUe9hn4TJijpoX90Mr cell-census/2024-07-01/h5ads/0129dbd9-a7d3-4f6... All major cell types in adult human retina .h5ad dataset AnnData 14638089351 bXxaz_quQ4mIbVlarLZZKQ None 244474 2024-07-01 False False 2024-07-12 12:34:08.826175+00:00 1 1 2 27 None 1
2852 Oc6ANFJ0FgOW1B70mNIq cell-census/2024-07-01/h5ads/00e5dedd-b9b7-43b... Photoreceptor cells in human retina (rod cells... .h5ad dataset AnnData 990594324 qFT65q6_k30pki8-1_2HoQ None 21422 2024-07-01 False False 2024-07-12 12:34:08.813762+00:00 1 1 2 27 None 1

Let’s now look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts.

collection = db.Collection.get(key="cellxgene-census", version="2025-01-30")
collection
Collection(uid='dMyEX3NTfKOEYXyMKDD8', version='2025-01-30', is_latest=True, key='cellxgene-census', description=None, hash='NjqvY0g6hlzgyVXTYer0Ng', reference=None, reference_type=None, meta_artifact=None, branch_id=1, space_id=1, created_by_id=8, run_id=44, created_at=2025-08-04 14:53:10 UTC, is_locked=False)

You can query across artifacts by arbitrary metadata combinations, for instance:

organisms = db.bionty.Organism.lookup()
experimental_factors = db.bionty.ExperimentalFactor.lookup()
tissues = db.bionty.Tissue.lookup()
suspension_types = db.ULabel.filter(type__name="SuspensionType").lookup()
# here we choose to return .name directly
features = db.Feature.lookup(return_field="name")
assays = db.bionty.ExperimentalFactor.lookup(return_field="name")
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")
query.to_dataframe().head()
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
3915 WwmBIhBNLTlRcSoBDt77 cell-census/2025-01-30/h5ads/20d87640-4be8-487... Mature kidney dataset: immune .h5ad None AnnData 45157740 yUEFmGNLKpPqf-VNn0flzg None 7803 2025-01-30 True False 2025-07-30 09:51:09.740656+00:00 1 1 2 43 1569 8
3987 gHlQ5Muwu3G9pvFCx3x9 cell-census/2025-01-30/h5ads/2d31c0ca-0233-41c... Fetal kidney dataset: immune .h5ad None AnnData 64549172 rHzhFDLOZMcBWmBWR1_cXA None 6847 2025-01-30 True False 2025-07-30 09:51:09.859933+00:00 1 1 2 43 1569 8
4662 P4Oai3OLGAzRwoicHfLN cell-census/2025-01-30/h5ads/9ea768a2-87ab-46b... Mature kidney dataset: full .h5ad None AnnData 194046649 1Eae2AxiVQAxcTF030UurQ None 40268 2025-01-30 True False 2025-07-30 09:51:10.987003+00:00 1 1 2 43 1569 8
3811 DSpevwaIl5E2jIWHbui5 cell-census/2025-01-30/h5ads/105c7dad-0468-462... mature .h5ad None AnnData 233914560 1yRecaV0oKr9BE62eeTTsw None 40268 2025-01-30 True False 2025-07-30 09:51:09.568062+00:00 1 1 2 43 1569 8
5010 11HQaMeIUaOwyHoOkqqN cell-census/2025-01-30/h5ads/d7dcfd8f-2ee7-438... Fetal kidney dataset: full .h5ad None AnnData 342427224 3NFL6OtWZAgkxJY0uKtL3w None 27197 2025-01-30 True False 2025-07-30 09:51:11.580779+00:00 1 1 2 43 1569 8

Slice a concatenated array

Let us now use the concatenated version of the Census collection: a tiledbsoma array that concatenates all AnnData arrays present in the collection we just explored. Slicing tiledbsoma works similarly to slicing a DataFrame or an AnnData object.

value_filter = (
    f'{features.tissue} == "{tissues.brain.name}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell.name}", "{cell_types.neuron.name}"] and'
    f' {features.suspension_type} == "{suspension_types.cell.name}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
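The filter is just a string, so it can also be assembled from plain column/value pairs. The following is a minimal sketch with a hypothetical helper, build_value_filter, that reproduces the string shown above from literal values rather than lookup objects. It assumes that values contain no double quotes and does no escaping.

```python
def build_value_filter(conditions):
    """Join (column, value) pairs into a value_filter string.

    A single value becomes `column == "value"`;
    a list or tuple becomes `column in ["a", "b"]`.
    """
    parts = []
    for column, value in conditions:
        if isinstance(value, (list, tuple)):
            quoted = ", ".join(f'"{item}"' for item in value)
            parts.append(f"{column} in [{quoted}]")
        else:
            parts.append(f'{column} == "{value}"')
    return " and ".join(parts)


value_filter = build_value_filter(
    [
        ("tissue", "brain"),
        ("cell_type", ["microglial cell", "neuron"]),
        ("suspension_type", "cell"),
        ("assay", "10x 3' v3"),
    ]
)
print(value_filter)
```

Using the lookup objects as in the cell above has the advantage that names are auto-completed and validated against the registries, whereas literal strings are not checked.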

Query for the tiledbsoma array store that contains all concatenated expression data. It’s a new dataset produced by concatenating all AnnData-like artifacts in the Census collection.

census_artifact = db.Artifact.get(description="Census 2025-01-30")

Run the slicing operation.

human = "homo_sapiens"  # subset to human data

# open the array store for queries
with census_artifact.open() as store:
    # read SOMADataFrame as a slice
    cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
    # concatenate results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # convert to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.head()
! run input wasn't tracked, call `ln.track()` and re-run
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id donor_id is_primary_data observation_joinid self_reported_ethnicity self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type tissue tissue_ontology_term_id tissue_type tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
0 46791195 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False 4kz&dTc~he unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 8231.0 3617 2.275643 27.214100 59229
1 46791196 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False ?--VrK%il{ unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 6513.0 2905 2.241997 41.835701 59229
2 46791197 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False _ws9=bPvtV unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 10514.0 3864 2.721014 75.146584 59229
3 46791198 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False +_|VI+u9-j unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 9979.0 3863 2.583225 44.252976 59229
4 46791199 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False i-geD_>1A2 unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 13689.0 4737 2.889804 53.114966 59229

Create an AnnData object.

from tiledbsoma import AxisQuery

with census_artifact.open() as store:
    experiment = store["census_data"][human]
    adata = experiment.axis_query(
        "RNA", obs_query=AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )

adata.var = adata.var.set_index("feature_id")
adata
! run input wasn't tracked, call `ln.track()` and re-run
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs × n_vars = 117660 × 61888
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'

Train ML models

You can either directly train ML models on very large collections of AnnData-like artifacts or on a single concatenated tiledbsoma-like artifact. For pros & cons of these approaches, see this blog post.

On a collection of arrays

mapped() caches AnnData objects on disk and creates a map-style dataset that performs a virtual join of the features of the underlying AnnData objects.

from torch.utils.data import DataLoader

census_collection = db.Collection.get(key="cellxgene-census", version="2025-01-30")

dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")

dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for batch in dataloader:
    pass

dataset.close()

For more background, see Train a machine learning model on a collection.
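To illustrate what a map-style dataset with a virtual join of features means, here is a minimal pure-Python sketch. ToyMappedDataset and its toy tables are hypothetical and are not the implementation behind mapped(); in particular, filling features that a table lacks with 0 is an assumption of this sketch.

```python
from bisect import bisect_right


class ToyMappedDataset:
    """Map-style dataset over several small 'AnnData-like' tables."""

    def __init__(self, tables, join="outer"):
        # Each table is (var_names, rows); every row has len(var_names) values.
        self.tables = tables
        name_sets = [set(var_names) for var_names, _ in tables]
        if join == "outer":
            joined = set.union(*name_sets)
        else:
            joined = set.intersection(*name_sets)
        self.var_names = sorted(joined)
        # Cumulative row counts map a global index to (table, local row).
        self.offsets = []
        total = 0
        for _, rows in tables:
            total += len(rows)
            self.offsets.append(total)

    def __len__(self):
        return self.offsets[-1]

    def __getitem__(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        table_idx = bisect_right(self.offsets, index)
        start = self.offsets[table_idx - 1] if table_idx else 0
        var_names, rows = self.tables[table_idx]
        values = dict(zip(var_names, rows[index - start]))
        # Features missing from this table are filled with 0 in the joined view.
        return [values.get(name, 0) for name in self.var_names]


tables = [
    (["geneA", "geneB"], [[1, 2], [3, 4]]),
    (["geneB", "geneC"], [[5, 6]]),
]
toy_dataset = ToyMappedDataset(tables, join="outer")
print(len(toy_dataset), toy_dataset.var_names, toy_dataset[2])
```

The tables are never concatenated: each item is aligned to the joined feature list only when it is requested, which is what keeps the join virtual. With join="inner", only the features shared by all tables would be kept.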

On a concatenated array

You can create streaming PyTorch dataloaders from tiledbsoma stores using the cellxgene_census package.

import cellxgene_census.experimental.ml as census_ml

store = census_artifact.open()

experiment = store["census_data"][human]
experiment_datapipe = census_ml.ExperimentDataPipe(
    experiment,
    measurement_name="RNA",
    X_name="raw",
    obs_query=AxisQuery(value_filter=value_filter),
    obs_column_names=[features.cell_type],
    batch_size=128,
    shuffle=True,
    soma_chunk_size=10000,
)
experiment_dataloader = census_ml.experiment_dataloader(experiment_datapipe)

for batch in experiment_dataloader:
    pass

store.close()

For more background see this guide.