
CELLxGENE: scRNA-seq

CZ CELLxGENE hosts the world's largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).

You can use the CELLxGENE data in two ways:

  1. Query collections of AnnData objects.

  2. Slice a big array store produced by concatenating AnnData objects via tiledbsoma.

If you are interested in building similar data assets in-house:

  1. See the transfer guide to zero-copy data to your own LaminDB instance.

  2. See the scRNA guide to create a growing, standardized & versioned scRNA-seq dataset collection.


Connect to the public LaminDB instance that mirrors CELLxGENE:

# pip install 'lamindb[bionty,jupyter]'
!lamin connect laminlabs/cellxgene
 connected lamindb: laminlabs/cellxgene
 to map a local dev directory, call: lamin settings set dev-dir .
import lamindb as ln
import bionty as bt
 connected lamindb: laminlabs/cellxgene

Query & understand metadata

Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', abbr=None, synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', branch_id=1, space_id=1, created_by_id=1, run_id=None, source_id=48, created_at=2023-11-28 22:30:57 UTC, is_locked=False)

You can also arbitrarily chain filters and create lookups from them:

users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(type__name="SuspensionType").lookup()
# here we choose to return .name directly
features = ln.Feature.lookup(return_field="name")
assays = bt.ExperimentalFactor.lookup(return_field="name")
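
To make the auto-complete keys less mysterious, here is a minimal sketch of how a display name could be turned into an attribute name. It is a toy model and not LaminDB's implementation: the functions `to_attr_name` and `make_lookup` are hypothetical, and the normalization rule (lowercase, non-alphanumeric runs become `_`, an `ln_` prefix for names that start with a digit) is only inferred from attribute names that appear in this guide, such as `effector_t_cell` and `ln_10x_3_v2`.

```python
import re
from types import SimpleNamespace


def to_attr_name(name: str, prefix: str = "ln_") -> str:
    """Turn a display name into a lowercase, attribute-safe key (assumed rule)."""
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python identifiers cannot start with a digit, so prepend a prefix (assumption)
    if key[:1].isdigit():
        key = prefix + key
    return key


def make_lookup(names):
    """Build a simple object whose attributes map back to the original names."""
    return SimpleNamespace(**{to_attr_name(n): n for n in names})


lookup = make_lookup(["effector T cell", "B cell", "10x 3' v2"])
print(lookup.effector_t_cell)
print(lookup.ln_10x_3_v2)
```

The real look-up objects return full records (or a chosen field via return_field) rather than plain strings.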

Search & filter metadata

We can use search & filters for metadata:

bt.CellType.search("effector T cell").to_dataframe().head()
uid name ontology_id abbr synonyms description is_locked created_at branch_id space_id created_by_id run_id source_id
id
1623 3nfZTVV4 effector T cell CL:0000911 None effector T-cell|effector T-lymphocyte|effector... A Differentiated T Cell With Ability To Traffi... False 2023-11-28 22:30:57.481760+00:00 1 1 1 None 48
1169 6JD5JCZC CD8-positive, alpha-beta cytokine secreting ef... CL:0000908 None CD8-positive, alpha-beta cytokine secreting ef... A Cd8-Positive, Alpha-Beta T Cell With The Phe... False 2023-11-28 22:27:55.571572+00:00 1 1 1 None 48
1229 69TEBGqb exhausted T cell CL:0011025 None Tex cell|An effector T cell that displays impa... None False 2023-11-28 22:27:55.572880+00:00 1 1 1 None 48
1331 43cBCa7s helper T cell CL:0000912 None helper T-lymphocyte|T-helper cell|helper T lym... A Effector T Cell That Provides Help In The Fo... False 2023-11-28 22:27:55.575949+00:00 1 1 1 None 48
1503 1oa5G2Mq memory T cell CL:0000813 None memory T-cell|memory T lymphocyte|memory T-lym... A Long-Lived, Antigen-Experienced T Cell That ... False 2023-11-28 22:27:55.580286+00:00 1 1 1 None 48

Use a uid to get exactly one metadata record:

effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', abbr=None, synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', branch_id=1, space_id=1, created_by_id=1, run_id=None, source_id=48, created_at=2023-11-28 22:30:57 UTC, is_locked=False)

Understand ontologies

View the related ontology terms:

effector_t_cell.view_parents(distance=2, with_children=True)
_images/b7eb4410024c573f14ac551e15aed82372f722ffab5b9fdf59b7b721376e3c12.svg

Or access them programmatically:

effector_t_cell.children.to_dataframe()
uid name ontology_id abbr synonyms description is_locked created_at branch_id space_id created_by_id run_id source_id
id
1331 43cBCa7s helper T cell CL:0000912 None helper T-lymphocyte|T-helper cell|helper T lym... A Effector T Cell That Provides Help In The Fo... False 2023-11-28 22:27:55.575949+00:00 1 1 1 None 48
1309 5s4gCMdn cytotoxic T cell CL:0000910 None cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... A Mature T Cell That Differentiated And Acquir... False 2023-11-28 22:27:55.575440+00:00 1 1 1 None 48
1229 69TEBGqb exhausted T cell CL:0011025 None Tex cell|An effector T cell that displays impa... None False 2023-11-28 22:27:55.572880+00:00 1 1 1 None 48
1088 490Xhb24 effector CD4-positive, alpha-beta T cell CL:0001044 None effector CD4-positive, alpha-beta T lymphocyte... A Cd4-Positive, Alpha-Beta T Cell With The Phe... False 2023-11-28 22:27:55.569828+00:00 1 1 1 None 48
931 2VQirdSp effector CD8-positive, alpha-beta T cell CL:0001050 None effector CD8-positive, alpha-beta T lymphocyte... A Cd8-Positive, Alpha-Beta T Cell With The Phe... False 2023-11-28 22:27:55.565976+00:00 1 1 1 None 48
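
The idea of walking an ontology up to a given distance can be sketched with a plain breadth-first search. This is a toy illustration, not how `view_parents` or `.children` are implemented: the `descendants` function and the `CHILDREN` mapping are hypothetical, the first-level edges are copied from the table above, and the second-level edge is invented purely to show what a distance of 2 adds.

```python
from collections import deque

# parent -> children; first level taken from the table above
CHILDREN = {
    "effector T cell": [
        "helper T cell",
        "cytotoxic T cell",
        "exhausted T cell",
        "effector CD4-positive, alpha-beta T cell",
        "effector CD8-positive, alpha-beta T cell",
    ],
    # hypothetical edge, for illustration only
    "helper T cell": ["hypothetical helper subtype"],
}


def descendants(term, children, distance):
    """Breadth-first walk returning {term: depth} for descendants within `distance`."""
    found = {}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for child in children.get(current, []):
            if child not in found:
                found[child] = depth + 1
                queue.append((child, depth + 1))
    return found


one_hop = descendants("effector T cell", CHILDREN, distance=1)
two_hops = descendants("effector T cell", CHILDREN, distance=2)
print(len(one_hop), len(two_hops))
```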

Query for individual datasets

Every individual dataset in CELLxGENE is an .h5ad file that is stored as an artifact in LaminDB. Here is an example query:

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[
        cell_types.b_cell,
        cell_types.t_cell,
    ],  # cell types measured in AnnData
    created_by=users.sunnyosun,  # creator
).order_by("created_at").to_dataframe(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
uid key cell_types__name created_by__handle
id
879 BCutg5cxmqLmy2Z5SS8J cell-census/2023-07-25/h5ads/01ad3cd7-3929-465... {CD4-positive, alpha-beta T cell, CD8-positive... sunnyosun
1106 3xdOASXuAxxJtSchJO3D cell-census/2023-07-25/h5ads/48101fa2-1a63-451... {double-positive, alpha-beta thymocyte, regula... sunnyosun
1174 wt7eD72sTzwL3rfYaZr2 cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0... {stromal cell, innate lymphoid cell, plasma ce... sunnyosun
1377 znTBqWgfYgFlLjdQ6Ba7 cell-census/2023-07-25/h5ads/9dbab10c-118d-496... {mature gamma-delta T cell, mast cell, ciliate... sunnyosun
1482 dEP0dZ8UxLgwnkLjz6Iq cell-census/2023-07-25/h5ads/bd65a70f-b274-413... {mature NK T cell, mast cell, endothelial cell... sunnyosun
What happens under the hood?

ln.Artifact.cell_types relates artifacts with bt.CellType.

The expression cell_types__in performs the join of the underlying registries and matches the related bt.CellType records against the B cell and T cell records passed in the list.

The same holds for created_by, which relates artifacts with ln.User.
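
The double-underscore expressions used in the query can be mimicked on in-memory records. The sketch below is a toy model under stated assumptions, not LaminDB's or Django's query engine: `ToyArtifact`, `OPERATORS` and `toy_filter` are hypothetical names, related records are represented by a plain set of names, and only a single `field__operator` hop is supported.

```python
from dataclasses import dataclass, field


@dataclass
class ToyArtifact:
    description: str
    suffix: str
    size: int
    cell_types: set = field(default_factory=set)  # stands in for a relation


def _in(value, candidates):
    # for a relation (a set), match if any related record is among the candidates
    if isinstance(value, (set, frozenset)):
        return bool(value & set(candidates))
    return value in candidates


OPERATORS = {
    "exact": lambda value, arg: value == arg,
    "contains": lambda value, arg: arg in value,
    "gt": lambda value, arg: value > arg,
    "in": _in,
}


def toy_filter(records, **lookups):
    """Keep records that satisfy every `field__operator=arg` expression."""
    selected = []
    for record in records:
        for expression, arg in lookups.items():
            field_name, _, operator = expression.partition("__")
            value = getattr(record, field_name)
            if not OPERATORS[operator or "exact"](value, arg):
                break
        else:
            selected.append(record)  # no expression failed
    return selected


artifacts = [
    ToyArtifact("immune cells, blood", ".h5ad", 2_500_000_000, {"B cell", "T cell"}),
    ToyArtifact("immune cells, kidney", ".h5ad", 45_000_000, {"T cell"}),
    ToyArtifact("retina atlas", ".h5ad", 3_000_000_000, {"amacrine cell"}),
]
hits = toy_filter(
    artifacts,
    suffix=".h5ad",
    description__contains="immune",
    size__gt=1e9,
    cell_types__in=["B cell", "T cell"],
)
print([a.description for a in hits])
```

In the real registry the join is executed by the database, which is what makes the same expression work across millions of records.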

To see what you can query for, look at the registry representation.

ln.Artifact
Artifact
  Simple fields
    .uid: CharField
    .key: CharField
    .description: TextField
    .suffix: CharField
    .kind: CharField
    .otype: CharField
    .size: BigIntegerField
    .hash: CharField
    .n_files: BigIntegerField
    .n_observations: BigIntegerField
    .version: CharField
    .is_latest: BooleanField
    .is_locked: BooleanField
    .created_at: DateTimeField
    .updated_at: DateTimeField
  Relational fields
    .branch: Branch
    .space: Space
    .storage: Storage
    .run: Run
    .schema: Schema
    .created_by: User
    .input_of_runs: Run
    .feature_sets: Schema
    .ulabels: ULabel
    .users: User
    .collections: Collection
    .records: Record
    .linked_in_records: Record
    .references: Reference
    .projects: Project
    .blocks: Block
  Bionty fields
    .organisms: bionty.Organism
    .genes: bionty.Gene
    .proteins: bionty.Protein
    .cell_markers: bionty.CellMarker
    .tissues: bionty.Tissue
    .cell_types: bionty.CellType
    .diseases: bionty.Disease
    .cell_lines: bionty.CellLine
    .phenotypes: bionty.Phenotype
    .pathways: bionty.Pathway
    .experimental_factors: bionty.ExperimentalFactor
    .developmental_stages: bionty.DevelopmentalStage
    .ethnicities: bionty.Ethnicity

Slice an individual dataset

Let’s look at an artifact and show its metadata using .describe().

artifact = ln.Artifact.get(description="Mature kidney dataset: immune", is_latest=True)
artifact.describe()
Artifact: cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad (2025-01-30)
|   description: Mature kidney dataset: immune
├── uid: WwmBIhBNLTlRcSoBDt77            run: o9WY9Nh (annotate_2025_30_01_LTS.py)
kind: None                           otype: AnnData                           
hash: yUEFmGNLKpPqf-VNn0flzg         size: 43.1 MB                            
branch: main                         space: all                               
created_at: 2025-07-30 09:51:09 UTC  created_by: zethson                      
n_observations: 7803                                                          
├── storage/path: s3://cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad
├── Dataset features
├── obs (20)                                                                                                   
│   assay_ontology_term_id          bionty.ExperimentalFactor.ontolo…  EFO:0009899                             
│   cell_type_ontology_term_id      bionty.CellType.ontology_id[sour…  CL:0000097, CL:0000236, CL:0000451, CL:…
│   development_stage_ontology_te…  bionty.DevelopmentalStage.ontolo…  HsapDv:0000096, HsapDv:0000098, HsapDv:…
│   disease_ontology_term_id        bionty.Disease.ontology_id[sourc…  PATO:0000461                            
│   organism_ontology_term_id       bionty.Organism.ontology_id[sour…  NCBITaxon:9606                          
│   self_reported_ethnicity_ontol…  bionty.Ethnicity.ontology_id[sou…  unknown                                 
│   sex_ontology_term_id            bionty.Phenotype.ontology_id[sou…  PATO:0000383, PATO:0000384              
│   suspension_type                 ULabel                             cell                                    
│   tissue_ontology_term_id         bionty.Tissue.ontology_id[source…  UBERON:0000362, UBERON:0001224, UBERON:…
│   tissue_type                     ULabel                             tissue                                  
│   assay                           bionty.ExperimentalFactor[source…                                          
│   cell_type                       bionty.CellType[source__uid='3Uw…                                          
│   development_stage               bionty.DevelopmentalStage[source…                                          
│   disease                         bionty.Disease[source__uid='4a3e…                                          
│   donor_id                        str                                                                        
│   self_reported_ethnicity         bionty.Ethnicity[source__uid='MJ…                                          
│   sex                             bionty.Phenotype[source__uid='3o…                                          
│   tissue                          bionty.Tissue[source__uid='MUtAG…                                          
│   organism                        bionty.Organism.scientific_name[…                                          
│   is_primary_data                 ULabel                                                                     
└── var (2)                                                                                                    
    var_index                       bionty.Gene.ensembl_gene_id[sour…                                          
    feature_is_filtered             bool                                                                       
├── External features
└── n_of_donors                     int                                13                                      
└── Labels
    └── .ulabels                        ULabel                             cell, tissue                            
        .references                     Reference                          Spatiotemporal immune zonation of the h…
        .organisms                      bionty.Organism                    human                                   
        .tissues                        bionty.Tissue                      kidney blood vessel, renal pelvis, cort…
        .cell_types                     bionty.CellType                    natural killer cell, CD8-positive, alph…
        .diseases                       bionty.Disease                     normal                                  
        .phenotypes                     bionty.Phenotype                   female, male                            
        .experimental_factors           bionty.ExperimentalFactor          10x 3' v2                               
        .developmental_stages           bionty.DevelopmentalStage          63-year-old stage, 2-year-old stage, 4-…
        .ethnicities                    bionty.Ethnicity                   unknown                                 
More ways of accessing metadata

Access just features:

artifact.features

Or get labels given a feature:

artifact.labels.get(features.tissue).to_dataframe()

If you want to query a slice of the array data, you have three options:

  1. Cache the artifact on disk and return the path to the cached data via artifact.cache(). This doesn’t download anything if the artifact is already in the cache.

  2. Cache & load the entire artifact into memory via artifact.load() -> AnnData

  3. Stream the array using a (cloud-backed) accessor artifact.open() -> AnnDataAccessor

All three will run much faster in the AWS us-west-2 data center.
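
The "download only if not already cached" behavior of the first option can be sketched as follows. This is a minimal illustration of the caching idea and not LaminDB's implementation: the `cache` function, the `fake_fetch` stand-in for a cloud download, and the example key are all hypothetical.

```python
import tempfile
from pathlib import Path


def cache(key: str, fetch, cache_dir: Path) -> Path:
    """Return the local path for `key`, calling `fetch` only on a cache miss."""
    local_path = cache_dir / key
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(fetch(key))
    return local_path


downloads = []  # records every simulated download


def fake_fetch(key: str) -> bytes:
    # hypothetical stand-in for fetching an object from cloud storage
    downloads.append(key)
    return b"toy payload"


cache_dir = Path(tempfile.mkdtemp())
first = cache("h5ads/example.h5ad", fake_fetch, cache_dir)
second = cache("h5ads/example.h5ad", fake_fetch, cache_dir)  # served from disk
print(first == second, len(downloads))
```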

Cache:

cache_path = artifact.cache()
cache_path
! run input wasn't tracked, call `ln.track()` and re-run
PosixUPath('/home/runner/.cache/lamindb/cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad')

Cache & load:

adata = artifact.load()
adata
! run input wasn't tracked, call `ln.track()` and re-run
AnnData object with n_obs × n_vars = 7803 × 32839
    obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Now we have an AnnData object, which stores observation annotations matching our artifact-level query in the .obs slot, and we can re-use almost the same query on the array-level.

See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query
collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.

Stream, slice and load the slice into memory:

with artifact.open() as adata_backed:
    display(adata_backed)
! run input wasn't tracked, call `ln.track()` and re-run
AnnDataAccessor object with n_obs × n_vars = 7803 × 32839
  constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
    obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'observation_joinid', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id', 'tissue_type']
    obsm: ['X_umap']
    raw: ['X', 'var', 'varm']
    uns: ['citation', 'default_embedding', 'schema_reference', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_length', 'feature_name', 'feature_reference', 'feature_type']

We now have an AnnDataAccessor object, which behaves much like an AnnData, and slicing looks similar to the query above.

See the slicing operation
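# re-open the accessor: the `with` block above closes it on exit
with artifact.open() as adata_backed:
    adata_backed_slice = adata_backed[
        adata_backed.obs.cell_type.isin(
            [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
        )
        & (adata_backed.obs.tissue == tissues.kidney.name)
        & (adata_backed.obs.suspension_type == suspension_types.cell.name)
        & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
    ]
    # load only the slice into memory while the accessor is still open
    adata_slice = adata_backed_slice.to_memory()

adata_slice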

Query collections of datasets

Let’s search collections from CELLxGENE within the 2024-07-01 release:

ln.Collection.filter(version="2024-07-01").search("human retina", limit=10)
<QuerySet [
Collection(uid='2gBKIwx8AtCHc4nfcQqc', version='2024-07-01', is_latest=False, key='A single-cell transcriptome atlas of the adult human retina', description='10.15252/embj.2018100811', hash='sCh4gUTJJJjECsp1dj0q', reference='3472f32d-4a33-48e2-aad5-666d4631bf4c', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:20:47 UTC, is_locked=False),
Collection(uid='tZYmzwfh0bIYzKBQVuro', version='2024-07-01', is_latest=False, key='Cell Types of the Human Retina and Its Organoids at Single-Cell Resolution', description='10.1016/j.cell.2020.08.013', hash='nGcCV4HJONcma2SExXw2', reference='2f4c738f-e2f3-4553-9db2-0582a38ea4dc', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:19:55 UTC, is_locked=False),
Collection(uid='zZLyhpo1aDdxdbULFbVT', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration', description='10.1038/s41467-019-12780-8', hash='1B0m9_FahAvefSTM8_AV', reference='1a486c4c-c115-4721-8c9f-f9f096e10857', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:20:25 UTC, is_locked=False),
Collection(uid='8ohRJQq8e3F7pdlBZbhz', version='2024-07-01', is_latest=False, key='Single cell atlas of the human retina', description='10.1101/2023.11.07.566105', hash='_vU7tll3t-0NCuJL-fm0', reference='4c6eaf5c-6d57-4c76-b1e9-60df8c655f1e', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:19:25 UTC, is_locked=False),
Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:51 UTC, is_locked=False),
Collection(uid='Yxth0JJgMb2VVOCfSgWj', version='2024-07-01', is_latest=False, key='Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration', description='10.1073/pnas.1914143116', hash='j2LqihaaNawOtEFysl3c', reference='f8057c47-fcd8-4fcf-88b0-e2f930080f6e', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:55 UTC, is_locked=False)
]>

Let’s get the record of one of these hits by its uid:

collection = ln.Collection.get("quQDnLsMLkP3JRsC8gp4")
collection
Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:51 UTC, is_locked=False)

It’s a Cell Genomics paper (DOI 10.1016/j.xgen.2023.100298), and we can find more information on it using the DOI or CELLxGENE collection id. There are multiple versions of this collection.

collection.versions.to_dataframe()
uid key description hash reference reference_type version is_latest is_locked created_at branch_id space_id created_by_id run_id meta_artifact_id
id
767 quQDnLsMLkP3JRsC8gp5 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 _bg3b6SweW6v1TJX7NgHCw af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2025-01-30 True False 2025-08-04 14:53:38.098405+00:00 1 1 8 44.0 None
606 quQDnLsMLkP3JRsC8gp4 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 NIo8G6_reJTEqMzW2nMc af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2024-07-01 False False 2024-07-16 12:21:51.449109+00:00 1 1 1 27.0 None
291 quQDnLsMLkP3JRsCJNGB Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 FsD52kpR7dF2h78-P3ka af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2023-12-15 False False 2024-01-11 13:41:01.880382+00:00 1 1 1 22.0 None
134 quQDnLsMLkP3JRsC6WWz Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 xhfSShX8lypXPx00zevx af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 2023-07-25 False False 2024-01-08 12:22:12.891930+00:00 1 1 1 NaN None

The collection groups artifacts.

collection.artifacts.to_dataframe()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
3600 80xlsVmayPPBCCEZ7aBc cell-census/2024-07-01/h5ads/ed419b4e-db9b-40f... Non-neuronal cells in human retina .h5ad dataset AnnData 1070671504 slN6j-9aSrYFw-IPL-wv-A None 18011 2024-07-01 False False 2024-07-12 12:34:10.255394+00:00 1 1 2 27 None 1
3378 Ce4Mqe4X2vUhwkwnh5YQ cell-census/2024-07-01/h5ads/aad97cb5-f375-45e... Retinal ganglion cells in human retina .h5ad dataset AnnData 784580498 w-_LJDfBv7vsZqw-9Jt72g None 11617 2024-07-01 False False 2024-07-12 12:34:09.816906+00:00 1 1 2 27 None 1
3273 1OyQQLNfu1nzvVADODND cell-census/2024-07-01/h5ads/8f10185b-e0b3-46a... Bipolar cells in human retina .h5ad dataset AnnData 3075818557 1GQwZcymSrr7d2Xit-5Deg None 53040 2024-07-01 False False 2024-07-12 12:34:09.644258+00:00 1 1 2 27 None 1
3018 QpuY5RsGTBBMN61QGY4t cell-census/2024-07-01/h5ads/359f7af4-87d4-411... Amacrine cells in human retina .h5ad dataset AnnData 3382221253 S7gXlC-cJ362BOqYZFxMOA None 56507 2024-07-01 False False 2024-07-12 12:34:09.160201+00:00 1 1 2 27 None 1
2919 GA2BXWwoJlcRfzNp3iyQ cell-census/2024-07-01/h5ads/11ef37ee-2173-458... Horizontal cells in human retina .h5ad dataset AnnData 404987285 fR0O7fSUHxmAfEDC8J7Ipw None 7348 2024-07-01 False False 2024-07-12 12:34:08.949267+00:00 1 1 2 27 None 1
2855 wYiUe9hn4TJijpoX90Mr cell-census/2024-07-01/h5ads/0129dbd9-a7d3-4f6... All major cell types in adult human retina .h5ad dataset AnnData 14638089351 bXxaz_quQ4mIbVlarLZZKQ None 244474 2024-07-01 False False 2024-07-12 12:34:08.826175+00:00 1 1 2 27 None 1
2852 Oc6ANFJ0FgOW1B70mNIq cell-census/2024-07-01/h5ads/00e5dedd-b9b7-43b... Photoreceptor cells in human retina (rod cells... .h5ad dataset AnnData 990594324 qFT65q6_k30pki8-1_2HoQ None 21422 2024-07-01 False False 2024-07-12 12:34:08.813762+00:00 1 1 2 27 None 1

Let’s now look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts.

collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")
collection
Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=False, key='cellxgene-census', description=None, hash='nI8Ag-HANeOpZOz-8CSn', reference=None, reference_type=None, meta_artifact=None, branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:14:38 UTC, is_locked=False)

You can count all contained artifacts or get them as a dataframe.

collection.artifacts.count()
812
collection.artifacts.to_dataframe().head()  # not tracking run & transform because read-only instance
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
3661 TJWkpg366cGCnTAR9t8I cell-census/2024-07-01/h5ads/ff7d15fa-f4b6-4a0... Dissection: Cerebral cortex (Cx) - Cuneus, ros... .h5ad dataset AnnData 340176061 RZu8Sv7H-XawvbQy74MKLg None 28051 2024-07-01 False False 2024-07-12 12:34:10.348961+00:00 1 1 2 27 None 1
3660 ds2ArOPrb4AA8WBGkITP cell-census/2024-07-01/h5ads/ff45e623-7f5f-46e... Tabula Sapiens - Pancreas .h5ad dataset AnnData 361538340 yqnnnEGdaeWby_4EVEpg8g None 13497 2024-07-01 False False 2024-07-12 12:34:10.347293+00:00 1 1 2 27 None 1
3659 KBW89Mf7IGcekja2hADu cell-census/2024-07-01/h5ads/fe52003e-1460-4a6... Myeloid compartment .h5ad dataset AnnData 691757462 SZ5tB0T4YKfiUuUkAL09ZA None 51552 2024-07-01 False False 2024-07-12 12:34:10.345829+00:00 1 1 2 27 None 1
3658 J7Ni7YzRM9R94RhmShk0 cell-census/2024-07-01/h5ads/fe4b89d5-461e-440... TI epithelial .h5ad dataset AnnData 938253823 dtbgyvXPKfqLIiB7uQEh-A None 154136 2024-07-01 False False 2024-07-12 12:34:10.344331+00:00 1 1 2 27 None 1
3657 g0RcSSYe5vQKzSWYkhMc cell-census/2024-07-01/h5ads/fe1a73ab-a203-45f... Dissection: Amygdaloid complex (AMY) - basolat... .h5ad dataset AnnData 391552151 1V_lPFFOF51ioRTSVWx9Mg None 35285 2024-07-01 False False 2024-07-12 12:34:10.341733+00:00 1 1 2 27 None 1

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.to_dataframe().head()  # convert to DataFrame
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
2961 WwmBIhBNLTlRcSoBDt76 cell-census/2024-07-01/h5ads/20d87640-4be8-487... Mature kidney dataset: immune .h5ad dataset AnnData 45158726 GCMHkdQSTeXxRVF7gMZFIA None 7803 2024-07-01 False False 2024-07-12 12:34:09.039540+00:00 1 1 2 27 None 1
3000 gHlQ5Muwu3G9pvFCx3x8 cell-census/2024-07-01/h5ads/2d31c0ca-0233-41c... Fetal kidney dataset: immune .h5ad dataset AnnData 64546349 2qy8uy-65Sd_XcBU-nrPgA None 6847 2024-07-01 False False 2024-07-12 12:34:09.128217+00:00 1 1 2 27 None 1
3324 P4Oai3OLGAzRwoicHfLM cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b... Mature kidney dataset: full .h5ad dataset AnnData 194047623 aZVpGZwAfMCziff_5ow2bg None 40268 2024-07-01 False False 2024-07-12 12:34:09.732579+00:00 1 1 2 27 None 1
2914 DSpevwaIl5E2jIWHbui4 cell-census/2024-07-01/h5ads/105c7dad-0468-462... mature .h5ad dataset AnnData 233914522 pz2wn0GB8pcRRupfY03gKQ None 40268 2024-07-01 False False 2024-07-12 12:34:08.941671+00:00 1 1 2 27 None 1
3519 11HQaMeIUaOwyHoOkqqM cell-census/2024-07-01/h5ads/d7dcfd8f-2ee7-438... Fetal kidney dataset: full .h5ad dataset AnnData 342398936 CzNBRaQGupXRxF5IntjWBg None 27197 2024-07-01 False False 2024-07-12 12:34:10.101903+00:00 1 1 2 27 None 1

Slice a concatenated array

Let us now use the concatenated version of the Census collection: a tiledbsoma array that concatenates all AnnData arrays present in the collection we just explored. Slicing tiledbsoma works similarly to slicing a DataFrame or an AnnData object.

value_filter = (
    f'{features.tissue} == "{tissues.brain.name}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell.name}", "{cell_types.neuron.name}"] and'
    f' {features.suspension_type} == "{suspension_types.cell.name}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
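
Writing such filter strings by hand gets error-prone once values contain quotes. As a small sketch, the hypothetical helper `build_value_filter` below assembles the same kind of expression from a dictionary; it relies on `json.dumps` to produce double-quoted literals, which matches the quoting shown in the output above but is otherwise an assumption about what the filter syntax accepts (for example, non-ASCII values would be escaped by `json.dumps`).

```python
import json


def build_value_filter(conditions: dict) -> str:
    """Join `column == value` and `column in [values]` clauses with `and`."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            # membership test against a list of double-quoted literals
            clauses.append(f"{column} in {json.dumps(list(value))}")
        else:
            # equality test against a single double-quoted literal
            clauses.append(f"{column} == {json.dumps(value)}")
    return " and ".join(clauses)


example_filter = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    }
)
print(example_filter)
```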

Query for the tiledbsoma array store that contains all concatenated expression data. It’s a new dataset produced by concatenating all AnnData-like artifacts in the Census collection.

census_artifact = ln.Artifact.get(description="Census 2025-01-30")

Run the slicing operation.

human = "homo_sapiens"  # subset to human data

# open the array store for queries
with census_artifact.open() as store:
    # read SOMADataFrame as a slice
    cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
    # concatenate results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # convert to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.head()
! run input wasn't tracked, call `ln.track()` and re-run
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id donor_id is_primary_data observation_joinid self_reported_ethnicity self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type tissue tissue_ontology_term_id tissue_type tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
0 46791195 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False 4kz&dTc~he unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 8231.0 3617 2.275643 27.214100 59229
1 46791196 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False ?--VrK%il{ unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 6513.0 2905 2.241997 41.835701 59229
2 46791197 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False _ws9=bPvtV unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 10514.0 3864 2.721014 75.146584 59229
3 46791198 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False +_|VI+u9-j unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 9979.0 3863 2.583225 44.252976 59229
4 46791199 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False i-geD_>1A2 unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 13689.0 4737 2.889804 53.114966 59229

Create an AnnData object.

from tiledbsoma import AxisQuery

with census_artifact.open() as store:
    experiment = store["census_data"][human]
    adata = experiment.axis_query(
        "RNA", obs_query=AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )

adata.var = adata.var.set_index("feature_id")
adata
! run input wasn't tracked, call `ln.track()` and re-run
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs × n_vars = 117660 × 61888
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'

Train ML models

You can either directly train ML models on very large collections of AnnData-like artifacts or on a single concatenated tiledbsoma-like artifact. For pros & cons of these approaches, see this blog post.

On a collection of arrays

mapped() caches AnnData objects on disk and creates a map-style dataset that performs a virtual join of the features of the underlying AnnData objects.

from torch.utils.data import DataLoader

census_collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")

dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")

dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for batch in dataloader:
    pass

dataset.close()

For more background, see Train a machine learning model on a collection.
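
What a map-style dataset with a virtual outer join means can be illustrated with plain Python lists. This is a toy sketch, not the implementation behind mapped(): the class `ToyMappedDataset` is hypothetical, the arrays and gene names are invented, and filling features that an array did not measure with `0.0` is an assumption made for the example.

```python
class ToyMappedDataset:
    """Map-style dataset over several small arrays with an outer join of features."""

    def __init__(self, arrays, fill_value=0.0):
        self.arrays = arrays
        self.fill_value = fill_value  # assumed value for unmeasured features
        # outer join: union of all feature names, in a stable sorted order
        self.var_names = sorted({name for a in arrays for name in a["var_names"]})
        # global row index -> (array index, row index within that array)
        self.index = [(i, j) for i, a in enumerate(arrays) for j in range(len(a["X"]))]

    def __len__(self):
        return len(self.index)

    def __getitem__(self, idx):
        i, j = self.index[idx]
        array = self.arrays[i]
        row = dict(zip(array["var_names"], array["X"][j]))
        # align the row to the joined feature space without copying the arrays
        return [row.get(name, self.fill_value) for name in self.var_names]


arrays = [
    {"var_names": ["geneA", "geneB"], "X": [[1.0, 2.0], [3.0, 4.0]]},
    {"var_names": ["geneB", "geneC"], "X": [[5.0, 6.0]]},
]
toy_dataset = ToyMappedDataset(arrays)
print(len(toy_dataset), toy_dataset.var_names, toy_dataset[2])
```

Because the class only implements `__len__` and `__getitem__`, an object of this shape is what a map-style PyTorch DataLoader expects to index into.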

On a concatenated array

You can create streaming PyTorch dataloaders from tiledbsoma stores using the cellxgene_census package.

import cellxgene_census.experimental.ml as census_ml

store = census_artifact.open()

experiment = store["census_data"][human]
experiment_datapipe = census_ml.ExperimentDataPipe(
    experiment,
    measurement_name="RNA",
    X_name="raw",
    obs_query=AxisQuery(value_filter=value_filter),
    obs_column_names=[features.cell_type],
    batch_size=128,
    shuffle=True,
    soma_chunk_size=10000,
)
experiment_dataloader = census_ml.experiment_dataloader(experiment_datapipe)

for batch in experiment_dataloader:
    pass

store.close()

For more background, see this guide.
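
To build intuition for how a chunk size and a batch size interact when streaming with shuffling, here is a loose sketch. It is not the algorithm used by cellxgene_census: the function `chunked_shuffled_batches` is hypothetical, rows are represented by integers, and shuffling is done only within each chunk, which is a simplifying assumption.

```python
import random
from typing import Iterator, List, Sequence


def chunked_shuffled_batches(
    rows: Sequence[int], chunk_size: int, batch_size: int, seed: int = 0
) -> Iterator[List[int]]:
    """Read rows chunk by chunk, shuffle within each chunk, then yield batches."""
    rng = random.Random(seed)
    for start in range(0, len(rows), chunk_size):
        chunk = list(rows[start : start + chunk_size])  # only one chunk in memory
        rng.shuffle(chunk)
        for i in range(0, len(chunk), batch_size):
            yield chunk[i : i + batch_size]


batches = list(chunked_shuffled_batches(range(25), chunk_size=10, batch_size=4))
print(len(batches))
```

A larger chunk size mixes rows from a wider range at the cost of holding more data in memory at once.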