
CELLxGENE: scRNA-seq

CZ CELLxGENE hosts the world’s largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).

You can use the CELLxGENE data in two ways:

  1. Query collections of AnnData objects.

  2. Slice a big array store produced by concatenated AnnData objects via tiledbsoma.

If you are interested in building similar data assets in-house:

  1. See the transfer guide to zero-copy data to your own LaminDB instance.

  2. See the scRNA guide to create a growing, standardized & versioned scRNA-seq dataset collection.


Connect to the public LaminDB instance that mirrors cellxgene:

# pip install 'lamindb[bionty,jupyter]'
!lamin connect laminlabs/cellxgene
 connected lamindb: laminlabs/cellxgene
import lamindb as ln
import bionty as bt
 connected lamindb: laminlabs/cellxgene

Query & understand metadata

Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', branch_id=1, space_id=1, created_by_id=1, source_id=48, created_at=2023-11-28 22:30:57 UTC)
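
To give an intuition for how a free-text name such as "effector T cell" can become an auto-completable attribute such as effector_t_cell, here is a minimal sketch in plain Python. The normalization rule and the helper names to_lookup_key and make_lookup are assumptions made for illustration, inferred from the attribute names that appear in this guide; they are not LaminDB's actual implementation.

```python
import re
from types import SimpleNamespace


def to_lookup_key(name: str) -> str:
    """Turn a free-text name into a valid Python attribute name (assumed rule)."""
    # lowercase, then collapse every run of non-alphanumeric characters into "_"
    key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
    # identifiers cannot start with a digit; the "ln_" prefix mirrors keys
    # such as ln_10x_3_v2 that are used later in this guide
    if key and key[0].isdigit():
        key = f"ln_{key}"
    return key


def make_lookup(names: list[str]) -> SimpleNamespace:
    """Build a namespace whose attributes can be tab-completed in a notebook."""
    return SimpleNamespace(**{to_lookup_key(n): n for n in names})


lookup = make_lookup(["effector T cell", "B cell", "10x 3' v2"])
print(lookup.effector_t_cell)
print(lookup.ln_10x_3_v2)
```

A real lookup object returns full records rather than plain strings, but the idea is the same: typing the object name followed by a dot lists all available entries.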

You can also arbitrarily chain filters and create lookups from them:

users = ln.User.lookup()
organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(type__name="SuspensionType").lookup()
# here we choose to return .name directly
features = ln.Feature.lookup(return_field="name")
assays = bt.ExperimentalFactor.lookup(return_field="name")

Search & filter metadata

We can use search & filters for metadata:

bt.CellType.search("effector T cell").to_dataframe().head()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux branch_id
id
1623 3nfZTVV4 effector T cell CL:0000911 None effector T-cell|effector T-lymphocyte|effector... A Differentiated T Cell With Ability To Traffi... 1 48 None 2023-11-28 22:30:57.481760+00:00 1 None 1
1169 6JD5JCZC CD8-positive, alpha-beta cytokine secreting ef... CL:0000908 None CD8-positive, alpha-beta cytokine secreting ef... A Cd8-Positive, Alpha-Beta T Cell With The Phe... 1 48 None 2023-11-28 22:27:55.571572+00:00 1 None 1
1229 69TEBGqb exhausted T cell CL:0011025 None Tex cell|An effector T cell that displays impa... None 1 48 None 2023-11-28 22:27:55.572880+00:00 1 None 1
1331 43cBCa7s helper T cell CL:0000912 None helper T-lymphocyte|T-helper cell|helper T lym... A Effector T Cell That Provides Help In The Fo... 1 48 None 2023-11-28 22:27:55.575949+00:00 1 None 1
1503 1oa5G2Mq memory T cell CL:0000813 None memory T-cell|memory T lymphocyte|memory T-lym... A Long-Lived, Antigen-Experienced T Cell That ... 1 48 None 2023-11-28 22:27:55.580286+00:00 1 None 1

Use a uid to get exactly one metadata record:

effector_t_cell = bt.CellType.get("3nfZTVV4")
effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', branch_id=1, space_id=1, created_by_id=1, source_id=48, created_at=2023-11-28 22:30:57 UTC)

Understand ontologies

View the related ontology terms:

effector_t_cell.view_parents(distance=2, with_children=True)
[Figure: graph of the ontology terms related to effector T cell (parents up to distance 2, plus children)]

Or access them programmatically:

effector_t_cell.children.to_dataframe()
uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux branch_id
id
931 2VQirdSp effector CD8-positive, alpha-beta T cell CL:0001050 None effector CD8-positive, alpha-beta T lymphocyte... A Cd8-Positive, Alpha-Beta T Cell With The Phe... 1 48 None 2023-11-28 22:27:55.565976+00:00 1 None 1
1088 490Xhb24 effector CD4-positive, alpha-beta T cell CL:0001044 None effector CD4-positive, alpha-beta T lymphocyte... A Cd4-Positive, Alpha-Beta T Cell With The Phe... 1 48 None 2023-11-28 22:27:55.569828+00:00 1 None 1
1229 69TEBGqb exhausted T cell CL:0011025 None Tex cell|An effector T cell that displays impa... None 1 48 None 2023-11-28 22:27:55.572880+00:00 1 None 1
1309 5s4gCMdn cytotoxic T cell CL:0000910 None cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... A Mature T Cell That Differentiated And Acquir... 1 48 None 2023-11-28 22:27:55.575440+00:00 1 None 1
1331 43cBCa7s helper T cell CL:0000912 None helper T-lymphocyte|T-helper cell|helper T lym... A Effector T Cell That Provides Help In The Fo... 1 48 None 2023-11-28 22:27:55.575949+00:00 1 None 1

Query for individual datasets

Every individual dataset in CELLxGENE is an .h5ad file that is stored as an artifact in LaminDB. Here is an example query:

ln.Artifact.filter(
    suffix=".h5ad",  # filename suffix
    description__contains="immune",
    size__gt=1e9,  # size > 1GB
    cell_types__in=[
        cell_types.b_cell,
        cell_types.t_cell,
    ],  # cell types measured in AnnData
    created_by=users.sunnyosun,  # creator
).order_by("created_at").to_dataframe(
    include=["cell_types__name", "created_by__handle"]  # join with additional info
).head()
id    uid                   key                                                 cell_types__name                                    created_by__handle
879   BCutg5cxmqLmy2Z5SS8J  cell-census/2023-07-25/h5ads/01ad3cd7-3929-465...  {CD4-positive, alpha-beta T cell, classical mo...  sunnyosun
1106  3xdOASXuAxxJtSchJO3D  cell-census/2023-07-25/h5ads/48101fa2-1a63-451...  {hematopoietic stem cell, neutrophil, mature B...  sunnyosun
1174  wt7eD72sTzwL3rfYaZr2  cell-census/2023-07-25/h5ads/58b01044-c5e5-4b0...  {innate lymphoid cell, CD4-positive, alpha-bet...  sunnyosun
1377  znTBqWgfYgFlLjdQ6Ba7  cell-census/2023-07-25/h5ads/9dbab10c-118d-496...  {neutrophil, monocyte, dendritic cell, mast ce...  sunnyosun
1482  dEP0dZ8UxLgwnkLjz6Iq  cell-census/2023-07-25/h5ads/bd65a70f-b274-413...  {CD4-positive, alpha-beta T cell, naive T cell...  sunnyosun
What happens under the hood?

As the registry representation of ln.Artifact (printed below) shows, ln.Artifact.cell_types relates artifacts with bt.CellType.

The expression cell_types__in performs the join of the underlying registries and matches the bt.CellType records for B cell and T cell.

Similarly, created_by relates artifacts with ln.User.
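
To make the join behind the cell type filter in the query above more concrete, here is a minimal, self-contained sketch using an in-memory SQLite database. The table names, column names, and rows are hypothetical and chosen only for illustration; they do not reflect the actual LaminDB schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE artifact (id INTEGER PRIMARY KEY, description TEXT);
    CREATE TABLE celltype (id INTEGER PRIMARY KEY, name TEXT);
    -- link table for the many-to-many relation artifact <-> celltype
    CREATE TABLE artifact_celltype (artifact_id INTEGER, celltype_id INTEGER);
    INSERT INTO artifact VALUES (1, 'immune atlas'), (2, 'retina atlas'), (3, 'kidney immune');
    INSERT INTO celltype VALUES (10, 'B cell'), (11, 'T cell'), (12, 'rod cell');
    INSERT INTO artifact_celltype VALUES (1, 10), (1, 11), (2, 12), (3, 11);
    """
)

wanted = [10, 11]  # ids of the looked-up "B cell" and "T cell" records
placeholders = ", ".join("?" for _ in wanted)
rows = con.execute(
    f"""
    SELECT DISTINCT a.id, a.description
    FROM artifact AS a
    JOIN artifact_celltype AS link ON link.artifact_id = a.id
    WHERE link.celltype_id IN ({placeholders})
    ORDER BY a.id
    """,
    wanted,
).fetchall()
print(rows)
```

An artifact matches as soon as one of its linked cell types is in the list, so the filter has "any of" semantics. The DISTINCT keeps an artifact from appearing twice when it is linked to both cell types.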

To see what you can query for, look at the registry representation.

ln.Artifact
Artifact
  Simple fields
    .uid: CharField
    .key: CharField
    .description: CharField
    .suffix: CharField
    .kind: CharField
    .otype: CharField
    .size: BigIntegerField
    .hash: CharField
    .n_files: BigIntegerField
    .n_observations: BigIntegerField
    .version: CharField
    .is_latest: BooleanField
    .created_at: DateTimeField
    .updated_at: DateTimeField
  Relational fields
    .branch: Branch
    .space: Space
    .storage: Storage
    .run: Run
    .schema: Schema
    .created_by: User
    .ulabels: ULabel
    .input_of_runs: Run
    .feature_sets: Schema
    .collections: Collection
    .linked_in_records: Record
    .records: Record
    .references: Reference
    .projects: Project
  Bionty fields
    .organisms: bionty.Organism
    .genes: bionty.Gene
    .proteins: bionty.Protein
    .cell_markers: bionty.CellMarker
    .tissues: bionty.Tissue
    .cell_types: bionty.CellType
    .diseases: bionty.Disease
    .cell_lines: bionty.CellLine
    .phenotypes: bionty.Phenotype
    .pathways: bionty.Pathway
    .experimental_factors: bionty.ExperimentalFactor
    .developmental_stages: bionty.DevelopmentalStage
    .ethnicities: bionty.Ethnicity

Slice an individual dataset

Let’s look at an artifact and show its metadata using .describe().

artifact = ln.Artifact.get(description="Mature kidney dataset: immune", is_latest=True)
artifact.describe()
Artifact .h5ad · AnnData
├── General
│   ├── key: cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad
│   ├── description: Mature kidney dataset: immune
│   ├── uid: WwmBIhBNLTlRcSoBDt77          hash: yUEFmGNLKpPqf-VNn0flzg
│   ├── size: 43.1 MB                      transform: annotate_2025_30_01_LTS.py
│   ├── space: all                         branch: main
│   ├── created_by: zethson                created_at: 2025-07-30 09:51:09
│   ├── n_observations: 7803               version: 2025-01-30
│   └── storage path: s3://cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad
├── Dataset features
│   ├── obs • 20                     [Feature]
│   │   assay_ontology_term_id          cat[bionty.ExperimentalFactor.on…  10x 3' v2                               
│   │   cell_type_ontology_term_id      cat[bionty.CellType.ontology_id[…  B cell, CD4-positive, alpha-beta T cell…
│   │   development_stage_ontology_te…  cat[bionty.DevelopmentalStage.on…  12-year-old stage, 19-month-old stage, …
│   │   disease_ontology_term_id        cat[bionty.Disease.ontology_id[s…  normal                                  
│   │   organism_ontology_term_id       cat[bionty.Organism.ontology_id[…  human                                   
│   │   self_reported_ethnicity_ontol…  cat[bionty.Ethnicity.ontology_id…  unknown                                 
│   │   sex_ontology_term_id            cat[bionty.Phenotype.ontology_id…  female, male                            
│   │   suspension_type                 cat[ULabel]                        cell                                    
│   │   tissue_ontology_term_id         cat[bionty.Tissue.ontology_id[so…  cortex of kidney, kidney, kidney blood …
│   │   tissue_type                     cat[ULabel]                        tissue                                  
│   │   assay                           cat[bionty.ExperimentalFactor[so…                                          
│   │   cell_type                       cat[bionty.CellType[source__uid=…                                          
│   │   development_stage               cat[bionty.DevelopmentalStage[so…                                          
│   │   disease                         cat[bionty.Disease[source__uid='…                                          
│   │   donor_id                        str                                                                        
│   │   self_reported_ethnicity         cat[bionty.Ethnicity[source__uid…                                          
│   │   sex                             cat[bionty.Phenotype[source__uid…                                          
│   │   tissue                          cat[bionty.Tissue[source__uid='M…                                          
│   │   organism                        cat[bionty.Organism.scientific_n…                                          
│   │   is_primary_data                 cat[ULabel]                                                                
│   └── var • 2                      [Feature]
│       var_index                       cat[bionty.Gene.ensembl_gene_id[…
│       feature_is_filtered             bool
├── Linked features
│   └── n_of_donors                     int                                13                                      
└── Labels
    └── .ulabels                        ULabel                             cell, tissue                            
        .organisms                      bionty.Organism                    human                                   
        .tissues                        bionty.Tissue                      kidney blood vessel, renal pelvis, cort…
        .cell_types                     bionty.CellType                    natural killer cell, CD8-positive, alph…
        .diseases                       bionty.Disease                     normal                                  
        .phenotypes                     bionty.Phenotype                   female, male                            
        .experimental_factors           bionty.ExperimentalFactor          10x 3' v2                               
        .developmental_stages           bionty.DevelopmentalStage          63-year-old stage, 2-year-old stage, 4-…
        .ethnicities                    bionty.Ethnicity                   unknown                                 
More ways of accessing metadata

Access just features:

artifact.features

Or get labels given a feature:

artifact.labels.get(features.tissue).to_dataframe()

If you want to query a slice of the array data, you have three options:

  1. Cache the artifact on disk and return the path to the cached data via artifact.cache() -> Path. Doesn’t download anything if the artifact is already in the cache.

  2. Cache & load the entire artifact into memory via artifact.load() -> AnnData

  3. Stream the array using a (cloud-backed) accessor via artifact.open() -> AnnDataAccessor

All three will run much faster in the AWS us-west-2 data center.

Cache:

cache_path = artifact.cache()
cache_path
! run input wasn't tracked, call `ln.track()` and re-run
PosixUPath('/home/runner/.cache/lamindb/cellxgene-data-public/cell-census/2025-01-30/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad')

Cache & load:

adata = artifact.load()
adata
! run input wasn't tracked, call `ln.track()` and re-run
AnnData object with n_obs × n_vars = 7803 × 32839
    obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_umap'

Now we have an AnnData object, which stores observation annotations matching our artifact-level query in the .obs slot, and we can re-use almost the same query at the array level.

See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query
collection = ln.Collection.filter(key="cellxgene-census", version="2024-07-01").one()
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.

Stream, slice and load the slice into memory:

with artifact.open() as adata_backed:
    display(adata_backed)
! run input wasn't tracked, call `ln.track()` and re-run
AnnDataAccessor object with n_obs × n_vars = 7803 × 32839
  constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
    obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'observation_joinid', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id', 'tissue_type']
    obsm: ['X_umap']
    raw: ['X', 'var', 'varm']
    uns: ['citation', 'default_embedding', 'schema_reference', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_length', 'feature_name', 'feature_reference', 'feature_type']

We now have an AnnDataAccessor object, which behaves much like an AnnData, and slicing looks similar to the query above.

See the slicing operation
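# re-open the accessor: the handle from the `with` block above is closed on exit
with artifact.open() as adata_backed:
    adata_backed_slice = adata_backed[
        adata_backed.obs.cell_type.isin(
            [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
        )
        & (adata_backed.obs.tissue == tissues.kidney.name)
        & (adata_backed.obs.suspension_type == suspension_types.cell.name)
        & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
    ]
    # load only the selected observations into memory while the store is open
    adata_slice = adata_backed_slice.to_memory()

adata_slice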

Query collections of datasets

Let’s search collections from CELLxGENE within the 2024-07-01 release:

ln.Collection.filter(version="2024-07-01").search("human retina", limit=10)
<QuerySet [Collection(uid='2gBKIwx8AtCHc4nfcQqc', version='2024-07-01', is_latest=False, key='A single-cell transcriptome atlas of the adult human retina', description='10.15252/embj.2018100811', hash='sCh4gUTJJJjECsp1dj0q', reference='3472f32d-4a33-48e2-aad5-666d4631bf4c', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:20:47 UTC), Collection(uid='tZYmzwfh0bIYzKBQVuro', version='2024-07-01', is_latest=False, key='Cell Types of the Human Retina and Its Organoids at Single-Cell Resolution', description='10.1016/j.cell.2020.08.013', hash='nGcCV4HJONcma2SExXw2', reference='2f4c738f-e2f3-4553-9db2-0582a38ea4dc', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:19:55 UTC), Collection(uid='zZLyhpo1aDdxdbULFbVT', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas of the human retina identifies cell types associated with age-related macular degeneration', description='10.1038/s41467-019-12780-8', hash='1B0m9_FahAvefSTM8_AV', reference='1a486c4c-c115-4721-8c9f-f9f096e10857', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:20:25 UTC), Collection(uid='8ohRJQq8e3F7pdlBZbhz', version='2024-07-01', is_latest=False, key='Single cell atlas of the human retina', description='10.1101/2023.11.07.566105', hash='_vU7tll3t-0NCuJL-fm0', reference='4c6eaf5c-6d57-4c76-b1e9-60df8c655f1e', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:19:25 UTC), Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, 
created_by_id=1, run_id=27, created_at=2024-07-16 12:21:51 UTC), Collection(uid='Yxth0JJgMb2VVOCfSgWj', version='2024-07-01', is_latest=False, key='Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration', description='10.1073/pnas.1914143116', hash='j2LqihaaNawOtEFysl3c', reference='f8057c47-fcd8-4fcf-88b0-e2f930080f6e', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:55 UTC)]>

Let’s get the record of one of the hits, the atlas for adult human retina:

collection = ln.Collection.get("quQDnLsMLkP3JRsC8gp4")
collection
Collection(uid='quQDnLsMLkP3JRsC8gp4', version='2024-07-01', is_latest=False, key='Single-cell transcriptomic atlas for adult human retina', description='10.1016/j.xgen.2023.100298', hash='NIo8G6_reJTEqMzW2nMc', reference='af893e86-8e9f-41f1-a474-ef05359b1fb7', reference_type='CELLxGENE Collection ID', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:21:51 UTC)

It’s a Cell Genomics paper (see the DOI prefix 10.1016/j.xgen) and we can find more information on it using the DOI or CELLxGENE collection id. There are multiple versions of this collection.

collection.versions.to_dataframe()
uid key description hash reference reference_type space_id meta_artifact_id version is_latest run_id created_at created_by_id _aux branch_id
id
767 quQDnLsMLkP3JRsC8gp5 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 _bg3b6SweW6v1TJX7NgHCw af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 1 None 2025-01-30 True 44.0 2025-08-04 14:53:38.098405+00:00 8 None 1
606 quQDnLsMLkP3JRsC8gp4 Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 NIo8G6_reJTEqMzW2nMc af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 1 None 2024-07-01 False 27.0 2024-07-16 12:21:51.449109+00:00 1 None 1
291 quQDnLsMLkP3JRsCJNGB Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 FsD52kpR7dF2h78-P3ka af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 1 None 2023-12-15 False 22.0 2024-01-11 13:41:01.880382+00:00 1 None 1
134 quQDnLsMLkP3JRsC6WWz Single-cell transcriptomic atlas for adult hum... 10.1016/j.xgen.2023.100298 xhfSShX8lypXPx00zevx af893e86-8e9f-41f1-a474-ef05359b1fb7 CELLxGENE Collection ID 1 None 2023-07-25 False NaN 2024-01-08 12:22:12.891930+00:00 1 None 1
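
The version table above can be reasoned about with a few lines of plain Python. The sketch below copies the uid, version, and is_latest columns from the output and picks either a specific release or the newest one; the helper name pick_version is hypothetical and not part of LaminDB.

```python
versions = [
    {"uid": "quQDnLsMLkP3JRsC8gp5", "version": "2025-01-30", "is_latest": True},
    {"uid": "quQDnLsMLkP3JRsC8gp4", "version": "2024-07-01", "is_latest": False},
    {"uid": "quQDnLsMLkP3JRsCJNGB", "version": "2023-12-15", "is_latest": False},
    {"uid": "quQDnLsMLkP3JRsC6WWz", "version": "2023-07-25", "is_latest": False},
]


def pick_version(records, version=None):
    """Return the record for a release tag, or the newest one if version is None."""
    if version is not None:
        matches = [r for r in records if r["version"] == version]
        if not matches:
            raise KeyError(f"no record with version {version!r}")
        return matches[0]
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    return max(records, key=lambda r: r["version"])


newest = pick_version(versions)
pinned = pick_version(versions, "2024-07-01")
print(newest["uid"], pinned["uid"])
```

In the table, all versions share the first 16 characters of the uid and differ only in the last four, and the newest release tag coincides with the record flagged is_latest.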

The collection groups artifacts.

collection.artifacts.to_dataframe()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux branch_id
id
2855 wYiUe9hn4TJijpoX90Mr cell-census/2024-07-01/h5ads/0129dbd9-a7d3-4f6... All major cell types in adult human retina .h5ad dataset AnnData 14638089351 bXxaz_quQ4mIbVlarLZZKQ None 244474 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:08.826175+00:00 1 None 1
3018 QpuY5RsGTBBMN61QGY4t cell-census/2024-07-01/h5ads/359f7af4-87d4-411... Amacrine cells in human retina .h5ad dataset AnnData 3382221253 S7gXlC-cJ362BOqYZFxMOA None 56507 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.160201+00:00 1 None 1
2852 Oc6ANFJ0FgOW1B70mNIq cell-census/2024-07-01/h5ads/00e5dedd-b9b7-43b... Photoreceptor cells in human retina (rod cells... .h5ad dataset AnnData 990594324 qFT65q6_k30pki8-1_2HoQ None 21422 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:08.813762+00:00 1 None 1
2919 GA2BXWwoJlcRfzNp3iyQ cell-census/2024-07-01/h5ads/11ef37ee-2173-458... Horizontal cells in human retina .h5ad dataset AnnData 404987285 fR0O7fSUHxmAfEDC8J7Ipw None 7348 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:08.949267+00:00 1 None 1
3273 1OyQQLNfu1nzvVADODND cell-census/2024-07-01/h5ads/8f10185b-e0b3-46a... Bipolar cells in human retina .h5ad dataset AnnData 3075818557 1GQwZcymSrr7d2Xit-5Deg None 53040 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.644258+00:00 1 None 1
3378 Ce4Mqe4X2vUhwkwnh5YQ cell-census/2024-07-01/h5ads/aad97cb5-f375-45e... Retinal ganglion cells in human retina .h5ad dataset AnnData 784580498 w-_LJDfBv7vsZqw-9Jt72g None 11617 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.816906+00:00 1 None 1
3600 80xlsVmayPPBCCEZ7aBc cell-census/2024-07-01/h5ads/ed419b4e-db9b-40f... Non-neuronal cells in human retina .h5ad dataset AnnData 1070671504 slN6j-9aSrYFw-IPL-wv-A None 18011 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:10.255394+00:00 1 None 1

Let’s now look at the collection that corresponds to the cellxgene-census release of .h5ad artifacts.

collection = ln.Collection.get(key="cellxgene-census", version="2024-07-01")
collection
Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=False, key='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', branch_id=1, space_id=1, created_by_id=1, run_id=27, created_at=2024-07-16 12:14:38 UTC)

You can count all contained artifacts or get them as a dataframe.

collection.artifacts.count()
812
collection.artifacts.to_dataframe().head()  # not tracking run & transform because read-only instance
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux branch_id
id
3305 1BNWhcCqu1CMSJaHxpbn cell-census/2024-07-01/h5ads/98e5ea9f-16d6-47e... All - A single-cell transcriptomic atlas chara... .h5ad dataset AnnData 2578203515 k-aZJBIjuvnO5Vek3JK-Mg None 110824 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.696918+00:00 1 None 1
3301 aJTH55LW2CTIWu306YiY cell-census/2024-07-01/h5ads/98113e7e-f586-406... Supercluster: Deep-layer intratelencephalic .h5ad dataset AnnData 3521994530 B8cjeVHgg9Q9Rr-JGaUjfg None 228467 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.690020+00:00 1 None 1
3313 pnQX4jvkj3eFWGOzDxbW cell-census/2024-07-01/h5ads/9b686bb6-1427-4e1... Evolution of cellular diversity in primary mot... .h5ad dataset AnnData 107509355 Z-uGNA6tRhMB1q46A3R8yg None 10739 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.715809+00:00 1 None 1
3566 2bF2gDSwbNbDsFVg2KQf cell-census/2024-07-01/h5ads/e4ddac12-f48f-445... Supercluster: CGE-derived interneurons .h5ad dataset AnnData 2586217727 8IDdkinp07n9AgQaWH9yUw None 129495 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:10.200783+00:00 1 None 1
2879 Pvhx7GAmAt4SYg03sE0M cell-census/2024-07-01/h5ads/06ef6b36-6c9b-4e1... Single nucleus transcriptomic profiling of hum... .h5ad dataset AnnData 92790726 V9KkecqXGqQJRF1lluo6Kg None 10533 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:08.881323+00:00 1 None 1

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.to_dataframe().head()  # convert to DataFrame
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux branch_id
id
2961 WwmBIhBNLTlRcSoBDt76 cell-census/2024-07-01/h5ads/20d87640-4be8-487... Mature kidney dataset: immune .h5ad dataset AnnData 45158726 GCMHkdQSTeXxRVF7gMZFIA None 7803 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.039540+00:00 1 None 1
3000 gHlQ5Muwu3G9pvFCx3x8 cell-census/2024-07-01/h5ads/2d31c0ca-0233-41c... Fetal kidney dataset: immune .h5ad dataset AnnData 64546349 2qy8uy-65Sd_XcBU-nrPgA None 6847 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.128217+00:00 1 None 1
3324 P4Oai3OLGAzRwoicHfLM cell-census/2024-07-01/h5ads/9ea768a2-87ab-46b... Mature kidney dataset: full .h5ad dataset AnnData 194047623 aZVpGZwAfMCziff_5ow2bg None 40268 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:09.732579+00:00 1 None 1
2914 DSpevwaIl5E2jIWHbui4 cell-census/2024-07-01/h5ads/105c7dad-0468-462... mature .h5ad dataset AnnData 233914522 pz2wn0GB8pcRRupfY03gKQ None 40268 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:08.941671+00:00 1 None 1
3519 11HQaMeIUaOwyHoOkqqM cell-census/2024-07-01/h5ads/d7dcfd8f-2ee7-438... Fetal kidney dataset: full .h5ad dataset AnnData 342398936 CzNBRaQGupXRxF5IntjWBg None 27197 md5-n False False 1 2 None 2024-07-01 False 27 2024-07-12 12:34:10.101903+00:00 1 None 1

Slice a concatenated array

Let us now use the concatenated version of the Census collection: a tiledbsoma array that concatenates all AnnData arrays present in the collection we just explored. Slicing tiledbsoma works similarly to slicing a DataFrame or AnnData.

value_filter = (
    f'{features.tissue} == "{tissues.brain.name}" and {features.cell_type} in'
    f' ["{cell_types.microglial_cell.name}", "{cell_types.neuron.name}"] and'
    f' {features.suspension_type} == "{suspension_types.cell.name}" and {features.assay} =='
    f' "{assays.ln_10x_3_v3}"'
)
value_filter
'tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell" and assay == "10x 3\' v3"'
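
The filter string above is assembled by hand from f-strings. As a minimal sketch, the same string can be generated from a dictionary of conditions; the helpers format_value and build_value_filter are hypothetical, not part of LaminDB or tiledbsoma, and cover only the `==`, `in`, and `and` forms used in this guide.

```python
def format_value(value):
    """Render a Python value as a literal for a value-filter string."""
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(format_value(v) for v in value) + "]"
    if isinstance(value, str):
        if '"' in value:
            raise ValueError(f"double quotes are not supported in this sketch: {value!r}")
        return f'"{value}"'
    return repr(value)


def build_value_filter(conditions):
    """Join column conditions with `and`; lists become `in` tests, scalars `==` tests."""
    parts = []
    for column, value in conditions.items():
        op = "in" if isinstance(value, (list, tuple)) else "=="
        parts.append(f"{column} {op} {format_value(value)}")
    return " and ".join(parts)


value_filter = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
        "assay": "10x 3' v3",
    }
)
print(value_filter)
```

Wrapping values in double quotes keeps names that contain an apostrophe, such as 10x 3' v3, intact without extra escaping.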

Query for the tiledbsoma array store that contains all concatenated expression data. It’s a new dataset produced by concatenating all AnnData-like artifacts in the Census collection.

census_artifact = ln.Artifact.get(description="Census 2025-01-30")

Run the slicing operation.

human = "homo_sapiens"  # subset to human data

# open the array store for queries
with census_artifact.open() as store:
    # read SOMADataFrame as a slice
    cell_metadata = store["census_data"][human].obs.read(value_filter=value_filter)
    # concatenate results to pyarrow.Table
    cell_metadata = cell_metadata.concat()
    # convert to pandas.DataFrame
    cell_metadata = cell_metadata.to_pandas()

cell_metadata.head()
! run input wasn't tracked, call `ln.track()` and re-run
soma_joinid dataset_id assay assay_ontology_term_id cell_type cell_type_ontology_term_id development_stage development_stage_ontology_term_id disease disease_ontology_term_id donor_id is_primary_data observation_joinid self_reported_ethnicity self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type tissue tissue_ontology_term_id tissue_type tissue_general tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
0 46791195 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False 4kz&dTc~he unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 8231.0 3617 2.275643 27.214100 59229
1 46791196 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False ?--VrK%il{ unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 6513.0 2905 2.241997 41.835701 59229
2 46791197 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False _ws9=bPvtV unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 10514.0 3864 2.721014 75.146584 59229
3 46791198 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False +_|VI+u9-j unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 9979.0 3863 2.583225 44.252976 59229
4 46791199 1b52ee22-c71a-4a0f-a4d2-c1be36d0b82a 10x 3' v3 EFO:0009922 neuron CL:0000540 Carnegie stage 18 HsapDv:0000025 normal PATO:0000461 XDD:395 False i-geD_>1A2 unknown unknown unknown unknown cell brain UBERON:0000955 tissue brain UBERON:0000955 13689.0 4737 2.889804 53.114966 59229

Create an AnnData object.

from tiledbsoma import AxisQuery

with census_artifact.open() as store:
    experiment = store["census_data"][human]
    adata = experiment.axis_query(
        "RNA", obs_query=AxisQuery(value_filter=value_filter)
    ).to_anndata(
        X_name="raw",
        column_names={
            "obs": [
                features.assay,
                features.cell_type,
                features.tissue,
                features.disease,
                features.suspension_type,
            ]
        },
    )

adata.var = adata.var.set_index("feature_id")
adata
! run input wasn't tracked, call `ln.track()` and re-run
AnnData object with n_obs × n_vars = 117660 × 61888
    obs: 'assay', 'cell_type', 'tissue', 'disease', 'suspension_type'
    var: 'soma_joinid', 'feature_name', 'feature_type', 'feature_length', 'nnz', 'n_measured_obs'

Train ML models

You can train ML models either directly on very large collections of AnnData-like artifacts or on a single concatenated tiledbsoma-like artifact. For pros & cons of these approaches, see this blog post.

On a collection of arrays

mapped() caches AnnData objects on disk and creates a map-style dataset that performs a virtual join of the features of the underlying AnnData objects.

from torch.utils.data import DataLoader

census_collection = ln.Collection.get(name="cellxgene-census", version="2024-07-01")

dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")

dataloader = DataLoader(dataset, batch_size=128, shuffle=True)

for batch in dataloader:
    pass

dataset.close()

For more background, see Train a machine learning model on a collection.
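To illustrate what a map-style dataset with a virtual outer join means, here is a toy sketch in plain Python. It is not how `mapped()` is implemented: `ToyMappedDataset` is a hypothetical class, the arrays live in memory rather than on disk, and filling absent features with `0.0` is an assumption of this sketch.

```python
from bisect import bisect_right
from itertools import accumulate


class ToyMappedDataset:
    """Map-style dataset over several small arrays with differing features."""

    def __init__(self, arrays, join="outer"):
        # arrays: list of (feature_names, rows); each row is a list of values
        self.arrays = arrays
        name_lists = [names for names, _ in arrays]
        if join == "outer":
            # union of all features, in first-seen order
            self.var_names = list(dict.fromkeys(n for names in name_lists for n in names))
        elif join == "inner":
            # features shared by every array, in the order of the first array
            common = set(name_lists[0]).intersection(*name_lists[1:])
            self.var_names = [n for n in name_lists[0] if n in common]
        else:
            raise ValueError("join must be 'outer' or 'inner'")
        # cumulative row counts map a global index to (array, local row)
        self.offsets = list(accumulate(len(rows) for _, rows in arrays))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        array_idx = bisect_right(self.offsets, idx)
        start = self.offsets[array_idx - 1] if array_idx else 0
        names, rows = self.arrays[array_idx]
        row = dict(zip(names, rows[idx - start]))
        # align the row to the joined feature space; absent features become 0.0
        return [row.get(name, 0.0) for name in self.var_names]


toy_arrays = [
    (["A", "B"], [[1.0, 2.0], [3.0, 4.0]]),
    (["B", "C"], [[5.0, 6.0]]),
]
toy_dataset = ToyMappedDataset(toy_arrays, join="outer")
print(toy_dataset.var_names, len(toy_dataset), toy_dataset[2])
```

The arrays are never concatenated: each row is aligned to the joined feature space only when it is requested, which is the sense in which the join is virtual.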

On a concatenated array

You can create streaming PyTorch dataloaders from tiledbsoma stores using the cellxgene_census package.

import cellxgene_census.experimental.ml as census_ml

store = census_artifact.open()

experiment = store["census_data"][human]
experiment_datapipe = census_ml.ExperimentDataPipe(
    experiment,
    measurement_name="RNA",
    X_name="raw",
    obs_query=AxisQuery(value_filter=value_filter),
    obs_column_names=[features.cell_type],
    batch_size=128,
    shuffle=True,
    soma_chunk_size=10000,
)
experiment_dataloader = census_ml.experiment_dataloader(experiment_datapipe)

for batch in experiment_dataloader:
    pass

store.close()
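To get an intuition for how a chunk size, a batch size, and shuffling can interact in a streaming loader, consider the simplified sketch below. It is an illustration only and not the algorithm that cellxgene_census implements: `stream_batches` is a hypothetical function that works on row indices, visits chunks in random order, and shuffles rows within each chunk.

```python
import random


def stream_batches(n_rows, chunk_size, batch_size, shuffle=True, seed=0):
    """Yield lists of row indices, reading one chunk of rows at a time."""
    rng = random.Random(seed)
    chunk_starts = list(range(0, n_rows, chunk_size))
    if shuffle:
        # visit chunks in random order
        rng.shuffle(chunk_starts)
    for start in chunk_starts:
        rows = list(range(start, min(start + chunk_size, n_rows)))
        if shuffle:
            # randomize row order within the chunk held in memory
            rng.shuffle(rows)
        # split the chunk into batches; the last batch of a chunk may be smaller
        for i in range(0, len(rows), batch_size):
            yield rows[i : i + batch_size]


batches = list(stream_batches(n_rows=25, chunk_size=10, batch_size=4))
print(len(batches), sum(len(b) for b in batches))
```

In this sketch only one chunk is held in memory at a time, so a larger chunk size trades memory for a more thorough shuffle.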

For more background, see this guide.