CELLxGENE: scRNA-seq

CZ CELLxGENE hosts the world’s largest standardized collection of scRNA-seq datasets.

LaminDB makes it easy to query the CELLxGENE data and integrate it with in-house data of any kind (omics, phenotypes, PDFs, notebooks, ML models, …).

You can use the CELLxGENE data in three ways:

  1. In the current guide, you’ll see how to query metadata and data based on AnnData objects.

  2. If you want to use these datasets in your own LaminDB instance, see the transfer guide.

  3. If you’d like to leverage the TileDB-SOMA API for the data subset of CELLxGENE Census, see the Census guide.

If you are interested in building similar data assets in-house:

  1. See the scRNA guide for how to create a growing versioned queryable scRNA-seq dataset.

  2. See the Annotate guide for validating, curating, and registering your own AnnData objects.

  3. Reach out if you are interested in a full zero-copy clone of laminlabs/cellxgene to accelerate building your in-house LaminDB instances.

Setup

On the CLI, load the public LaminDB instance that mirrors CELLxGENE:

!lamin load laminlabs/cellxgene
💡 connected lamindb: laminlabs/cellxgene
import lamindb as ln
import bionty as bt
💡 connected lamindb: laminlabs/cellxgene
❗ Full backed capabilities are not available for this version of anndata, please install anndata>=0.9.1.

Query & understand metadata

Auto-complete metadata

You can create look-up objects for any registry in LaminDB, including basic biological entities and things like users or storage locations.

Let’s use auto-complete to look up cell types:

cell_types = bt.CellType.lookup()
cell_types.effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:30:57 UTC')
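
To make the idea of a look-up object concrete, here is a minimal sketch in plain Python. It is not LaminDB’s implementation: the helper names `to_attribute_name` and `build_lookup` are hypothetical, and the `ln_` prefix for names that start with a digit is an assumption based on the attribute `ln_10x_3_v2` used later in this guide.

```python
import re
from types import SimpleNamespace


def to_attribute_name(name: str, prefix: str = "ln_") -> str:
    """Turn a free-text name into a lower-case, attribute-safe key."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python identifiers cannot start with a digit, so prefix those keys.
    if key and key[0].isdigit():
        key = prefix + key
    return key


def build_lookup(records: list[dict]) -> SimpleNamespace:
    """Build a toy look-up object with one attribute per record name."""
    return SimpleNamespace(**{to_attribute_name(r["name"]): r for r in records})


# Names and ontology IDs are taken from the outputs shown in this guide.
records = [
    {"name": "effector T cell", "ontology_id": "CL:0000911"},
    {"name": "helper T cell", "ontology_id": "CL:0000912"},
]
lookup = build_lookup(records)

print(lookup.effector_t_cell["ontology_id"])
print(to_attribute_name("10x 3' v2"))
```

Because each record becomes an attribute, an editor or notebook can offer tab completion on `lookup.`, which is what makes look-ups convenient for long ontology names.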

You can also arbitrarily chain filters and create lookups from them:

organisms = bt.Organism.lookup()
experimental_factors = bt.ExperimentalFactor.lookup()  # labels for experimental factors
tissues = bt.Tissue.lookup()  # tissue labels
suspension_types = ln.ULabel.filter(name="is_suspension_type").one().children.lookup()  # suspension types

Search & filter metadata

We can use search & filters for metadata:

bt.CellType.search("effector T cell")
<QuerySet [CellType(uid='1oa5G2Mq', name='memory T cell', ontology_id='CL:0000813', synonyms='memory T-cell|memory T lymphocyte|memory T-lymphocyte', description='A Long-Lived, Antigen-Experienced T Cell That Has Acquired A Memory Phenotype Including Distinct Surface Markers And The Ability To Differentiate Into An Effector T Cell Upon Antigen Reexposure.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='6JD5JCZC', name='CD8-positive, alpha-beta cytokine secreting effector T cell', ontology_id='CL:0000908', synonyms='CD8-positive, alpha-beta cytokine secreting effector T-cell|CD8-positive, alpha-beta cytokine secreting effector T lymphocyte|CD8-positive, alpha-beta cytokine secreting effector T-lymphocyte', description='A Cd8-Positive, Alpha-Beta T Cell With The Phenotype Cd69-Positive, Cd62L-Negative, Cd127-Negative, And Cd25-Positive, That Secretes Cytokines.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='69TEBGqb', name='exhausted T cell', ontology_id='CL:0011025', synonyms='Tex cell|An effector T cell that displays impaired effector functions (e.g., rapid production of effector cytokines, cytotoxicity) and has limited proliferative potential.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:30:57 UTC'), CellType(uid='43cBCa7s', name='helper T cell', ontology_id='CL:0000912', synonyms='helper T-lymphocyte|T-helper cell|helper T lymphocyte|helper T-cell', description='A Effector T Cell That Provides Help In The Form Of Secreted Cytokines To Other Immune Cells.', created_by_id=1, public_source_id=48, 
updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='3MAa89sT', name='common lymphoid progenitor', ontology_id='CL:0000051', synonyms='common lymphocyte precursor|common lymphoid precursor|common lymphocyte progenitor', description='A Oligopotent Progenitor Cell Committed To The Lymphoid Lineage.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='jUWIrbAm', name='mammary gland glandular cell', ontology_id='CL:1001586', description='Glandular Cell Of Mammary Epithelium. Example: Glandular Cells Of Large And Intermediate Ducts, Glandular Cells In Terminal Ducts.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:34:38 UTC'), CellType(uid='621YTlYS', name='keratin accumulating cell', ontology_id='CL:0000311', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:33:29 UTC'), CellType(uid='wdLgwUXo', name='fat cell', ontology_id='CL:0000136', synonyms='adipocyte|adipose cell', description='A Fat-Storing Cell Found Mostly In The Abdominal Cavity And Subcutaneous Tissue Of Mammals. 
Fat Is Usually Stored In The Form Of Triglycerides.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='431Fq1jG', name='H1 horizontal cell', ontology_id='CL:0004217', synonyms='A Horizontal Cell', description='A Horizontal Cell With A Large Cell Body, Thick Dendrites, And A Large Dendritic Arbor.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='2k3xeGbT', name='primary cultured cell', ontology_id='CL:0000001', synonyms='unpassaged cultured cell|primary cell culture cell', description='A Cultured Cell That Is Freshly Isolated From A Organismal Source, Or Derives In Culture From Such A Cell Prior To The Culture Being Passaged.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='42yeagNb', name='sphincter associated smooth muscle cell', ontology_id='CL:0000358', description='A Smooth Muscle Cell That Is Part Of A Sphincter. A Sphincter Is A Typically Circular Muscle That Normally Maintains Constriction Of A Natural Body Passage Or Orifice And Which Relaxes As Required By Normal Physiological Functioning.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:33:01 UTC'), CellType(uid='7Cf5qQl6', name='rod bipolar cell', ontology_id='CL:0000751', description='A Bipolar Neuron Found In The Retina That Is Synapsed By Rod Photoreceptor Cells But Not By Cone Photoreceptor Cells.  
These Neurons Depolarize In Response To Light.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='4qnxelJp', name='retinal cell', ontology_id='CL:0009004', description='Any Cell In The Retina, The Innermost Layer Or Coating At The Back Of The Eyeball, Which Is Sensitive To Light And In Which The Optic Nerve Terminates.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:30:54 UTC'), CellType(uid='2fwQAWrK', name='type 8 cone bipolar cell (sensu Mus)', ontology_id='CL:0000760', synonyms='DB6 cone bipolar cell', description='An On-Bipolar Neuron Found In The Retina And Having Connections With Cone Photoreceptors Cells And Neurons In The Inner Half Of The Inner Plexiform Layer. This Cell Has The Widest Dendritic Field And The Widest Axon Terminal Of All Retinal Bipolar Cells. The Axon Terminal Is Delicate And Stratified Through Sublaminae 4 And 5 Of The Inner Plexiform Layer.', created_by_id=1, public_source_id=48, updated_at='2024-01-15 07:18:43 UTC'), CellType(uid='5UfW6dqp', name='epithelial cell of pancreas', ontology_id='CL:0000083', synonyms='pancreas epithelial cell|pancreatic epithelial cell', description='An Epithelial Cell Of The Pancreas.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:39:40 UTC'), CellType(uid='3NcX0Le6', name='basal epithelial cell of tracheobronchial tree', ontology_id='CL:0002329', description='An Epithelial Cell Type That Lacks The Columnar Shape Typical For Other Respiratory Epithelial Cells. 
This Cell Type Is Able To Differentiate Into Other Respiratory Epithelial Cells In Response To Injury.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='23E2AdkD', name='microfold cell of epithelium of small intestine', ontology_id='CL:1000353', description='A M Cell That Is Part Of The Epithelium Of Small Intestine.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='jhb9sEn7', name='stratified epithelial cell', ontology_id='CL:0000079', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='66xlGEYm', name='intestinal crypt stem cell of large intestine', ontology_id='CL:0009016', synonyms='stem cell of large intestine crypt of Lieberkuhn', description='An Intestinal Stem Cell That Is Located In The Large Intestine Crypt Of Liberkuhn. These Stem Cells Reside At The Bottom Of Crypts In The Large Intestine And Are Highly Proliferative. They Either Differentiate Into Transit Amplifying Cells Or Self-Renew To Form New Stem Cells.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC')]>
bt.CellType.search("CD8-positive cytokine effector T cell")
<QuerySet [CellType(uid='59W4YfOa', name='kidney proximal convoluted tubule epithelial cell', ontology_id='CL:1000838', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='4qrbhCCl', name='respiratory ciliated cell', ontology_id='CL:4030034', synonyms='ciliated cell of the respiratory tract', description='A Ciliated Cell Of The Respiratory System. Ciliated Cells Are Present In Airway Epithelium.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:33:39 UTC'), CellType(uid='621YTlYS', name='keratin accumulating cell', ontology_id='CL:0000311', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:33:29 UTC'), CellType(uid='vkNV6lFu', name='endothelial cell of coronary artery', ontology_id='CL:2000018', description='Any Endothelial Cell Of Artery That Is Part Of A Coronary Artery.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='2fwQAWrK', name='type 8 cone bipolar cell (sensu Mus)', ontology_id='CL:0000760', synonyms='DB6 cone bipolar cell', description='An On-Bipolar Neuron Found In The Retina And Having Connections With Cone Photoreceptors Cells And Neurons In The Inner Half Of The Inner Plexiform Layer. This Cell Has The Widest Dendritic Field And The Widest Axon Terminal Of All Retinal Bipolar Cells. The Axon Terminal Is Delicate And Stratified Through Sublaminae 4 And 5 Of The Inner Plexiform Layer.', created_by_id=1, public_source_id=48, updated_at='2024-01-15 07:18:43 UTC'), CellType(uid='69TEBGqb', name='exhausted T cell', ontology_id='CL:0011025', synonyms='Tex cell|An effector T cell that displays impaired effector functions (e.g., rapid production of effector cytokines, cytotoxicity) and has limited proliferative potential.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='jUWIrbAm', name='mammary gland glandular cell', ontology_id='CL:1001586', description='Glandular Cell Of Mammary Epithelium. 
Example: Glandular Cells Of Large And Intermediate Ducts, Glandular Cells In Terminal Ducts.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:34:38 UTC'), CellType(uid='3MAa89sT', name='common lymphoid progenitor', ontology_id='CL:0000051', synonyms='common lymphocyte precursor|common lymphoid precursor|common lymphocyte progenitor', description='A Oligopotent Progenitor Cell Committed To The Lymphoid Lineage.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='6QGSdksS', name='erythroid progenitor cell, mammalian', ontology_id='CL:0001066', description='A Progenitor Cell Committed To The Erythroid Lineage. This Cell Is Ter119-Positive But Lacks Expression Of Other Hematopoietic Lineage Markers (Lin-Negative).', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='23E2AdkD', name='microfold cell of epithelium of small intestine', ontology_id='CL:1000353', description='A M Cell That Is Part Of The Epithelium Of Small Intestine.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='431Fq1jG', name='H1 horizontal cell', ontology_id='CL:0004217', synonyms='A Horizontal Cell', description='A Horizontal Cell With A Large Cell Body, Thick Dendrites, And A Large Dendritic Arbor.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='wdLgwUXo', name='fat cell', ontology_id='CL:0000136', synonyms='adipocyte|adipose cell', description='A Fat-Storing Cell Found Mostly In The Abdominal Cavity And Subcutaneous Tissue Of Mammals. 
Fat Is Usually Stored In The Form Of Triglycerides.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='2k3xeGbT', name='primary cultured cell', ontology_id='CL:0000001', synonyms='unpassaged cultured cell|primary cell culture cell', description='A Cultured Cell That Is Freshly Isolated From A Organismal Source, Or Derives In Culture From Such A Cell Prior To The Culture Being Passaged.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='42yeagNb', name='sphincter associated smooth muscle cell', ontology_id='CL:0000358', description='A Smooth Muscle Cell That Is Part Of A Sphincter. A Sphincter Is A Typically Circular Muscle That Normally Maintains Constriction Of A Natural Body Passage Or Orifice And Which Relaxes As Required By Normal Physiological Functioning.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:33:01 UTC'), CellType(uid='7Cf5qQl6', name='rod bipolar cell', ontology_id='CL:0000751', description='A Bipolar Neuron Found In The Retina That Is Synapsed By Rod Photoreceptor Cells But Not By Cone Photoreceptor Cells.  
These Neurons Depolarize In Response To Light.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='4qnxelJp', name='retinal cell', ontology_id='CL:0009004', description='Any Cell In The Retina, The Innermost Layer Or Coating At The Back Of The Eyeball, Which Is Sensitive To Light And In Which The Optic Nerve Terminates.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:30:54 UTC'), CellType(uid='4IRoehoY', name='preadipocyte', ontology_id='CL:0002334', description='An Undifferentiated Fibroblast That Can Be Stimulated To Form A Fat Cell.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='66xlGEYm', name='intestinal crypt stem cell of large intestine', ontology_id='CL:0009016', synonyms='stem cell of large intestine crypt of Lieberkuhn', description='An Intestinal Stem Cell That Is Located In The Large Intestine Crypt Of Liberkuhn. These Stem Cells Reside At The Bottom Of Crypts In The Large Intestine And Are Highly Proliferative. They Either Differentiate Into Transit Amplifying Cells Or Self-Renew To Form New Stem Cells.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='3NcX0Le6', name='basal epithelial cell of tracheobronchial tree', ontology_id='CL:0002329', description='An Epithelial Cell Type That Lacks The Columnar Shape Typical For Other Respiratory Epithelial Cells. 
This Cell Type Is Able To Differentiate Into Other Respiratory Epithelial Cells In Response To Injury.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC'), CellType(uid='1oa5G2Mq', name='memory T cell', ontology_id='CL:0000813', synonyms='memory T-cell|memory T lymphocyte|memory T-lymphocyte', description='A Long-Lived, Antigen-Experienced T Cell That Has Acquired A Memory Phenotype Including Distinct Surface Markers And The Ability To Differentiate Into An Effector T Cell Upon Antigen Reexposure.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:27:55 UTC')]>

And use a uid to filter exactly one metadata record:

effector_t_cell = bt.CellType.filter(uid="3nfZTVV4").one()
effector_t_cell
CellType(uid='3nfZTVV4', name='effector T cell', ontology_id='CL:0000911', synonyms='effector T-cell|effector T-lymphocyte|effector T lymphocyte', description='A Differentiated T Cell With Ability To Traffic To Peripheral Tissues And Is Capable Of Mounting A Specific Immune Response.', created_by_id=1, public_source_id=48, updated_at='2023-11-28 22:30:57 UTC')
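
The pattern of filtering and then calling `.one()` can be illustrated without a database. The sketch below is a toy model in plain Python, not LaminDB code: `CellTypeRecord`, `filter_records`, and `one` are hypothetical names, and the assumption is that `.one()` is meant to return exactly one match and fail otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CellTypeRecord:
    uid: str
    name: str
    ontology_id: str


# Values copied from the search output above.
RECORDS = [
    CellTypeRecord("3nfZTVV4", "effector T cell", "CL:0000911"),
    CellTypeRecord("43cBCa7s", "helper T cell", "CL:0000912"),
    CellTypeRecord("69TEBGqb", "exhausted T cell", "CL:0011025"),
]


def filter_records(records, **conditions):
    """Keep the records whose attributes equal every given condition."""
    return [
        record
        for record in records
        if all(getattr(record, key) == value for key, value in conditions.items())
    ]


def one(records):
    """Return the single match, or raise if there are zero or several."""
    if len(records) != 1:
        raise ValueError(f"expected exactly one record, got {len(records)}")
    return records[0]


effector = one(filter_records(RECORDS, uid="3nfZTVV4"))
print(effector.name)
```

Filtering on a unique field such as `uid` is what makes the exactly-one expectation safe; filtering on a non-unique field could match several records and would raise in this sketch.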

Understand ontologies

View the related ontology terms:

effector_t_cell.view_parents(distance=2, with_children=True)
_images/6cdfc2f61da5a14e92b8512c8b1af5865ee670a550a55ae2659acf11ebca5fbc.svg

Or access them programmatically:

effector_t_cell.children.df()
uid name ontology_id abbr synonyms description public_source_id run_id created_by_id updated_at
id
931 2VQirdSp effector CD8-positive, alpha-beta T cell CL:0001050 None effector CD8-positive, alpha-beta T lymphocyte... A Cd8-Positive, Alpha-Beta T Cell With The Phe... 48 None 1 2023-11-28 22:27:55.565981+00:00
1088 490Xhb24 effector CD4-positive, alpha-beta T cell CL:0001044 None effector CD4-positive, alpha-beta T lymphocyte... A Cd4-Positive, Alpha-Beta T Cell With The Phe... 48 None 1 2023-11-28 22:27:55.569832+00:00
1229 69TEBGqb exhausted T cell CL:0011025 None Tex cell|An effector T cell that displays impa... None 48 None 1 2023-11-28 22:27:55.572884+00:00
1309 5s4gCMdn cytotoxic T cell CL:0000910 None cytotoxic T lymphocyte|cytotoxic T-lymphocyte|... A Mature T Cell That Differentiated And Acquir... 48 None 1 2023-11-28 22:27:55.575444+00:00
1331 43cBCa7s helper T cell CL:0000912 None helper T-lymphocyte|T-helper cell|helper T lym... A Effector T Cell That Provides Help In The Fo... 48 None 1 2023-11-28 22:27:55.575955+00:00
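
As a rough illustration of what walking an ontology by distance means, the sketch below performs a breadth-first traversal over a small child mapping. It is plain Python rather than the LaminDB API; `CHILDREN`, `descendants`, and `parents` are hypothetical names, and the mapping only contains the direct children listed in the table above.

```python
from collections import deque

# Direct children of "effector T cell", copied from the table above.
CHILDREN = {
    "effector T cell": [
        "effector CD8-positive, alpha-beta T cell",
        "effector CD4-positive, alpha-beta T cell",
        "exhausted T cell",
        "cytotoxic T cell",
        "helper T cell",
    ],
}


def descendants(term: str, distance: int) -> dict[str, int]:
    """Breadth-first walk: map each descendant to its distance from `term`."""
    found: dict[str, int] = {}
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == distance:
            continue  # do not expand beyond the requested distance
        for child in CHILDREN.get(current, []):
            if child not in found:
                found[child] = depth + 1
                queue.append((child, depth + 1))
    return found


def parents(term: str) -> list[str]:
    """Invert the child mapping to find the direct parents of `term`."""
    return [parent for parent, kids in CHILDREN.items() if term in kids]


print(len(descendants("effector T cell", distance=2)))
print(parents("helper T cell"))
```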

Query artifacts

Unlike in the SOMA guide, here we’ll query sets of .h5ad files, each of which corresponds to an AnnData object.

To access them, we query the Collection record that links the latest LTS set of .h5ad artifacts:

collection = ln.Collection.filter(name="cellxgene-census", version="2023-12-15").one()
collection
Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC')

You can get all linked artifacts as a DataFrame; there are more than 1,000 .h5ad files in cellxgene-census version 2023-12-15.

collection.artifacts.count()
1113
collection.artifacts.df().head()  # not tracking run & transform because read-only instance
uid version description key suffix accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
2825 OoktqBIu8jCoGOJlaQPo 2023-12-15 Sst Chodl - DLPFC: Seattle Alzheimer's Disease... cell-census/2023-12-15/h5ads/fc0ceb80-d2d9-47c... .h5ad AnnData 73375840 DqV7FraZIIP_l2DJuvHk_g-9 md5-n None 1877 1 False 2 16 22 1 2024-01-24 07:18:54.197599+00:00
2031 n33nFE2kXSNzNhIAtS3S 2023-12-15 L5 IT - DLPFC: Seattle Alzheimer's Disease Atl... cell-census/2023-12-15/h5ads/44c83972-e5d2-485... .h5ad AnnData 4605202922 ztuPyGXWH_OyCq1OyPlNkw-549 md5-n None 104106 1 False 2 16 22 1 2024-01-24 07:19:02.027481+00:00
1813 mtoOxeGG0Rg3NPH1AlwD 2023-12-15 Microglia-PVM - DLPFC: Seattle Alzheimer's Dis... cell-census/2023-12-15/h5ads/100c6145-7b0e-4ba... .h5ad AnnData 634716733 -B96CrmiOANuzE3xU78WsQ-76 md5-n None 42486 1 False 2 16 22 1 2024-01-24 07:19:04.190720+00:00
1804 V0tqrgE1z1NY2eUUKKQE 2023-12-15 Lamp5 - DLPFC: Seattle Alzheimer's Disease Atl... cell-census/2023-12-15/h5ads/0ed60482-a34f-426... .h5ad AnnData 1580667477 xRTDQGA4iOC4r8sSgz53vQ-189 md5-n None 55968 1 False 2 16 22 1 2024-01-24 07:19:04.646675+00:00
2532 dEP0dZ8UxLgwnkLjHssX 2023-12-15 Single-cell sequencing links multiregional imm... cell-census/2023-12-15/h5ads/bd65a70f-b274-413... .h5ad AnnData 1204103287 5hUwdflh_erDK-U2bEzfvw-144 md5-n None 167283 1 False 2 16 22 1 2024-01-29 07:49:54.125887+00:00

You can query across artifacts by arbitrary metadata combinations, for instance:

query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)
query = query.order_by("size")  # order by size
query.df().head()  # convert to DataFrame
uid version description key suffix accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
1880 WwmBIhBNLTlRcSoBpatT 2023-12-15 Mature kidney dataset: immune cell-census/2023-12-15/h5ads/20d87640-4be8-487... .h5ad AnnData 44647761 hSLF-GPhLXaC2tVIOJEdXA-6 md5-n None 7803 1 False 2 16 22 1 2024-01-29 07:46:33.152678+00:00
1880 WwmBIhBNLTlRcSoBpatT 2023-12-15 Mature kidney dataset: immune cell-census/2023-12-15/h5ads/20d87640-4be8-487... .h5ad AnnData 44647761 hSLF-GPhLXaC2tVIOJEdXA-6 md5-n None 7803 1 False 2 16 22 1 2024-01-29 07:46:33.152678+00:00
1930 gHlQ5Muwu3G9pvFC7egT 2023-12-15 Fetal kidney dataset: immune cell-census/2023-12-15/h5ads/2d31c0ca-0233-41c... .h5ad AnnData 64056560 jENeQIq0JdoHl5PyfY-sjA-8 md5-n None 6847 1 False 2 16 22 1 2024-01-29 07:46:37.205210+00:00
2405 P4Oai3OLGAzRwoicaxCB 2023-12-15 Mature kidney dataset: full cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b... .h5ad AnnData 192484358 yghldeu2bOC5jtvnqZH8Og-23 md5-n None 40268 1 False 2 16 22 1 2024-01-29 07:49:11.905786+00:00
2405 P4Oai3OLGAzRwoicaxCB 2023-12-15 Mature kidney dataset: full cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b... .h5ad AnnData 192484358 yghldeu2bOC5jtvnqZH8Og-23 md5-n None 40268 1 False 2 16 22 1 2024-01-29 07:49:11.905786+00:00
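
Some artifacts appear twice in the result above. One plausible explanation, assumed here rather than confirmed by the source, is that a filter on a many-to-many label such as `cell_types__in=[...]` yields one row per matching label. The toy sketch below reproduces that effect in plain Python; the second artifact is hypothetical.

```python
# Link table between artifacts and cell-type labels (one row per link).
links = [
    ("Mature kidney dataset: immune", "dendritic cell"),
    ("Mature kidney dataset: immune", "neutrophil"),
    ("Mature kidney dataset: immune", "B cell"),
    ("hypothetical artifact", "neutrophil"),
]

wanted = {"dendritic cell", "neutrophil"}

# A join-style filter returns one row per matching link, hence duplicates.
matches = [artifact for artifact, cell_type in links if cell_type in wanted]

# dict.fromkeys keeps the first occurrence of each artifact, in order.
unique_matches = list(dict.fromkeys(matches))

print(matches)
print(unique_matches)
```

If the query object behaves like a Django QuerySet, calling `.distinct()` on it would be the usual way to drop such duplicates.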

Query arrays

Each artifact stores an array in the form of an annotated data matrix, an AnnData object.

Let’s look at the first array in the artifact query and show metadata using .describe():

artifact = query.first()
artifact.describe()
Artifact(uid='WwmBIhBNLTlRcSoBpatT', version='2023-12-15', description='Mature kidney dataset: immune', key='cell-census/2023-12-15/h5ads/20d87640-4be8-487f-93d4-dce38378d00f.h5ad', suffix='.h5ad', accessor='AnnData', size=44647761, hash='hSLF-GPhLXaC2tVIOJEdXA-6', hash_type='md5-n', n_observations=7803, visibility=1, key_is_virtual=False, updated_at='2024-01-29 07:46:33 UTC')
  Provenance
    .created_by = 'sunnyosun'
    .storage = 's3://cellxgene-data-public'
    .transform = 'Census release 2023-12-15 (LTS)'
    .run = '2024-01-11 09:33:35 UTC'
    .input_of = ["'2024-01-30 09:07:36 UTC'"]
  Labels
    .organisms = 'human'
    .tissues = 'renal medulla', 'kidney blood vessel', 'renal pelvis', 'cortex of kidney', 'kidney'
    .cell_types = 'classical monocyte', 'plasmacytoid dendritic cell', 'natural killer cell', 'dendritic cell', 'CD4-positive, alpha-beta T cell', 'mast cell', 'neutrophil', 'non-classical monocyte', 'CD8-positive, alpha-beta T cell', 'B cell'
    .diseases = 'normal'
    .phenotypes = 'male', 'female'
    .experimental_factors = '10x 3' v2'
    .developmental_stages = '2-year-old human stage', '4-year-old human stage', '12-year-old human stage', '44-year-old human stage', '49-year-old human stage', '53-year-old human stage', '63-year-old human stage', '64-year-old human stage', '67-year-old human stage', '70-year-old human stage'
    .ethnicities = 'unknown'
    .ulabels = 'TxK2', 'Wilms1', 'TxK4', 'TTx', 'RCC3', 'RCC1', 'VHL', 'TxK3', 'TxK1', 'Wilms3'
  Features
    'donor_id' = 'Wilms3', 'TTx', 'pRCC', 'VHL', 'RCC3', 'TxK1', 'TxK4', 'TxK3', 'RCC2', 'Wilms2'
    'suspension_type' = 'cell'
  Feature sets
    'var' = 'None', 'EBF1', 'LINC02202', 'RNF145', 'LINC01932', 'UBLCP1', 'IL12B', 'LINC01845', 'LINC01847', 'ADRA1B', 'TTC1', 'PWWP2A', 'FABP6', 'FABP6-AS1', 'CCNJL', 'C1QTNF2'
    'obs' = 'assay', 'cell_type', 'development_stage', 'disease', 'donor_id', 'self_reported_ethnicity', 'sex', 'tissue', 'organism', 'tissue_type', 'suspension_type'

More ways of accessing metadata

Access just features:

artifact.features

Or get labels given a feature:

features = ln.Feature.lookup()  # look-up object for the Feature registry
artifact.labels.get(features.tissue).df()
artifact.labels.get(features.collection).one()

If you want to query a slice of the array data, you have two options:

  1. Cache & load the entire array into memory via artifact.load() -> AnnData (caches the h5ad on disk, so that you only download once)

  2. Stream the array from the cloud using a cloud-backed accessor artifact.backed() -> AnnDataAccessor

Both options run much faster close to the data, which is stored in AWS S3 on the US West Coast; consider logging into hosted compute in that region.
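
The download-once behavior of the first option can be sketched with a small on-disk cache. This is a toy illustration in plain Python, not how LaminDB implements caching; `load_with_cache` and `fake_download` are hypothetical names, and the key below is a made-up example path.

```python
import tempfile
from pathlib import Path
from typing import Callable


def load_with_cache(key: str, cache_dir: Path, download: Callable[[str], bytes]) -> bytes:
    """Return the content for `key`, downloading it only if it is not cached."""
    local_path = cache_dir / key
    if not local_path.exists():
        local_path.parent.mkdir(parents=True, exist_ok=True)
        local_path.write_bytes(download(key))  # first call: fetch and store
    return local_path.read_bytes()  # every call: read from local disk


download_calls = []


def fake_download(key: str) -> bytes:
    """Hypothetical downloader that records how often it is called."""
    download_calls.append(key)
    return b"payload for " + key.encode()


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp)
    first = load_with_cache("h5ads/example.h5ad", cache_dir, fake_download)
    second = load_with_cache("h5ads/example.h5ad", cache_dir, fake_download)

print(len(download_calls))
```

The second call never reaches the downloader, which is why repeated loads of the same artifact are cheap once the file is on disk.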

Cache & load:

adata = artifact.load()
adata
AnnData object with n_obs × n_vars = 7803 × 32922
    obs: 'donor_id', 'donor_age', 'self_reported_ethnicity_ontology_term_id', 'organism_ontology_term_id', 'sample_uuid', 'tissue_ontology_term_id', 'development_stage_ontology_term_id', 'suspension_uuid', 'suspension_type', 'library_uuid', 'assay_ontology_term_id', 'mapped_reference_annotation', 'is_primary_data', 'cell_type_ontology_term_id', 'author_cell_type', 'disease_ontology_term_id', 'reported_diseases', 'sex_ontology_term_id', 'compartment', 'Experiment', 'Project', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'default_embedding', 'schema_version', 'title'
    obsm: 'X_umap'

Now we have an AnnData object, which stores observation annotations matching our artifact-level query in the .obs slot, and we can re-use almost the same query at the array level.

See the array-level query
adata_slice = adata[
    adata.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata.obs.tissue == tissues.kidney.name)
    & (adata.obs.suspension_type == suspension_types.cell.name)
    & (adata.obs.assay == experimental_factors.ln_10x_3_v2.name)
]
adata_slice
See the artifact-level query for comparison
query = collection.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
)

AnnData uses pandas to manage metadata and the syntax differs slightly. However, the same metadata records are used.
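
To show the pandas side of this in isolation, the sketch below applies the same kind of boolean mask to a small hand-made table. The DataFrame is a hypothetical stand-in for `adata.obs`, and the cell IDs and rows are made up for illustration; it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical stand-in for `adata.obs`: one row per cell.
obs = pd.DataFrame(
    {
        "cell_type": ["dendritic cell", "neutrophil", "B cell", "neutrophil"],
        "tissue": ["kidney", "kidney", "kidney", "renal medulla"],
        "suspension_type": ["cell", "cell", "cell", "cell"],
    },
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Each comparison is wrapped in parentheses because `&` binds more
# tightly than `==` in Python.
mask = (
    obs["cell_type"].isin(["dendritic cell", "neutrophil"])
    & (obs["tissue"] == "kidney")
    & (obs["suspension_type"] == "cell")
)

selected = obs[mask]
print(selected.index.tolist())
```

Where the artifact-level query passes keyword arguments that are combined with a logical AND, the pandas version spells out each condition and joins them with `&`.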

Stream:

adata_backed = artifact.backed()
adata_backed
AnnDataAccessor object with n_obs × n_vars = 7803 × 32922
  constructed for the AnnData object 20d87640-4be8-487f-93d4-dce38378d00f.h5ad
    obs: ['Experiment', 'Project', '_index', 'assay', 'assay_ontology_term_id', 'author_cell_type', 'cell_type', 'cell_type_ontology_term_id', 'compartment', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_age', 'donor_id', 'is_primary_data', 'library_uuid', 'mapped_reference_annotation', 'organism', 'organism_ontology_term_id', 'reported_diseases', 'sample_uuid', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'suspension_uuid', 'tissue', 'tissue_ontology_term_id']
    obsm: ['X_umap']
    raw: ['X', 'var', 'varm']
    uns: ['default_embedding', 'schema_version', 'title']
    var: ['_index', 'feature_biotype', 'feature_is_filtered', 'feature_name', 'feature_reference']

We now have an AnnDataAccessor object, which behaves much like an AnnData, and the query looks the same.

See the query
adata_backed_slice = adata_backed[
    adata_backed.obs.cell_type.isin(
        [cell_types.dendritic_cell.name, cell_types.neutrophil.name]
    )
    & (adata_backed.obs.tissue == tissues.kidney.name)
    & (adata_backed.obs.suspension_type == suspension_types.cell.name)
    & (adata_backed.obs.assay == experimental_factors.ln_10x_3_v2.name)
]

adata_backed_slice.to_memory()

Train an ML model

You can directly train an ML model on the entire collection.

See Train a machine learning model on a collection.

Exploring data by collection

Alternatively, you can explore the data one collection at a time.

Let’s search the collections from CELLxGENE within the 2023-12-15 release:

ln.Collection.filter(version="2023-12-15").search("immune human kidney", limit=10)
<QuerySet [Collection(uid='TWZevdipvmWsuEiF7d3N', version='2023-12-15', name='Single cell transcriptional and chromatin accessibility profiling redefine cellular heterogeneity in the adult human kidney', description='10.1038/s41467-021-22368-w', hash='arqM23K0du5k3xHGMLE6', reference='9b02383a-9358-4f0f-9795-a891ec523bcc', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:29 UTC'), Collection(uid='gHBNw253Sj64W4wBfESc', version='2023-12-15', name='Sampling peripheral blood and matched nasal swabs from donors with prior immunodeficiencies and autoimmune conditions infected with SARS-CoV-2', description='10.1101/2020.11.20.20227355', hash='MTKb-3cXSy4ghRvrrZ2m', reference='eb735cc9-d0a7-48fa-b255-db726bf365af', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-11 13:41:05 UTC'), Collection(uid='Yxth0JJgMb2VVOCfUpzz', version='2023-12-15', name='Single-cell transcriptomics of the human retinal pigment epithelium and choroid in health and macular degeneration', description='10.1073/pnas.1914143116', hash='z_OvNfnI4UUjp0M80TK8', reference='f8057c47-fcd8-4fcf-88b0-e2f930080f6e', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-11 13:41:07 UTC'), Collection(uid='kqiPjpzpK9H9rdtnV67f', version='2023-12-15', name='Spatiotemporal immune zonation of the human kidney', description='10.1126/science.aat5031', hash='4wGcXeeqsjVdbRdU7ZuJ', reference='120e86b4-1195-48c5-845b-b98054105eec', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:33 UTC'), Collection(uid='WJLbdahJcDE8E9mzkkNk', version='2023-12-15', name='A molecular atlas of the human postmenopausal fallopian tube and ovary from single-cell RNA and ATAC sequencing', description='10.1016/j.celrep.2022.111838', 
hash='-MGoN6gWLKK15cNv_gsp', reference='d36ca85c-3e8b-444c-ba3e-a645040c6185', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:38 UTC'), Collection(uid='6tT3kRYI2c6slEpvtOWS', version='2023-12-15', name='Human developing neocortex by area', description='10.1038/s41586-021-03910-8', hash='la2gOvMhAPrzJ9kk5bPz', reference='c8565c6a-01a1-435b-a549-f11b452a83a8', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-11 13:41:08 UTC'), Collection(uid='NoRCcrtIjLnLdEjiMIET', version='2023-12-15', name='A spatially resolved single cell genomic atlas of the adult human breast', description='10.1038/s41586-023-06252-9', hash='CK1g1cnNvGvkhsy8ffz8', reference='4195ab4c-20bd-4cd3-8b3d-65601277e731', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:55:10 UTC'), Collection(uid='0H2X3A2FhWOgA7i8Jof5', version='2023-12-15', name='A single-cell transcriptional roadmap of the mouse and human lymph node lymphatic vasculature', description='10.3389/fcvm.2020.00052', hash='Rjmluzxs8OKViYgRoqB7', reference='9c8808ce-1138-4dbe-818c-171cff10e650', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:59 UTC'), Collection(uid='AdBAaHp3kZh8qvsvMno8', version='2023-12-15', name='Single-Cell DNA Methylation and 3D Genome Human Brain Atlas', description='10.1126/science.adf5357', hash='sDtsT1U7KnHR5Wg9YvZ7', reference='fdebfda9-bb9a-4b4b-97e5-651097ea07b0', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-11 13:41:07 UTC'), Collection(uid='DI60aiNNLqOpa8t3Gbqu', version='2023-12-15', name='Evolution of cellular diversity in primary motor cortex of human, marmoset monkey, and mouse', description='10.1038/s41586-021-03465-8', 
hash='wIZtFtiJtDwpVl9Ka_O1', reference='367d95c0-0eb0-4dae-8276-9407239421ee', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:11 UTC')]>

Let’s get the record of the top-hit collection by its uid:

collection = ln.Collection.filter(uid="kqiPjpzpK9H9rdtnV67f").one()
collection
Collection(uid='kqiPjpzpK9H9rdtnV67f', version='2023-12-15', name='Spatiotemporal immune zonation of the human kidney', description='10.1126/science.aat5031', hash='4wGcXeeqsjVdbRdU7ZuJ', reference='120e86b4-1195-48c5-845b-b98054105eec', reference_type='CELLxGENE Collection ID', visibility=1, created_by_id=1, transform_id=17, run_id=22, updated_at='2024-01-29 07:54:33 UTC')

The description holds the DOI of a Science paper and the reference holds the CELLxGENE collection ID; either can be used to find more information.
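As a minimal sketch of how the two identifiers could be turned into links: the values below are copied from the record above, the helper names `doi_url` and `cellxgene_collection_url` are hypothetical, and the CELLxGENE portal URL pattern is an assumption that should be checked before relying on it.

```python
# Values copied from the Collection record shown above.
doi = "10.1126/science.aat5031"  # the record's description field
collection_id = "120e86b4-1195-48c5-845b-b98054105eec"  # the record's reference field


def doi_url(doi: str) -> str:
    """Return the doi.org resolver link for a DOI (hypothetical helper)."""
    return f"https://doi.org/{doi}"


def cellxgene_collection_url(collection_id: str) -> str:
    """Return a portal link for a collection ID (hypothetical helper).

    The URL pattern is an assumption about the CELLxGENE Discover portal.
    """
    return f"https://cellxgene.cziscience.com/collections/{collection_id}"


print(doi_url(doi))
print(cellxgene_collection_url(collection_id))
```

With a loaded record, the same helpers could be called with `collection.description` and `collection.reference`.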

Check different versions of this collection:

collection.versions.df()
id | uid | version | name | description | hash | reference | reference_type | visibility | transform_id | artifact_id | run_id | created_by_id | updated_at
17 | kqiPjpzpK9H9rdtnHWas | 2023-07-25 | Spatiotemporal immune zonation of the human ki... | 10.1126/science.aat5031 | w_VZE7n841ktaA9FjdLh | 120e86b4-1195-48c5-845b-b98054105eec | CELLxGENE Collection ID | 1 | NaN | None | NaN | 1 | 2024-01-08 12:01:20.121095+00:00
365 | kqiPjpzpK9H9rdtnV67f | 2023-12-15 | Spatiotemporal immune zonation of the human ki... | 10.1126/science.aat5031 | 4wGcXeeqsjVdbRdU7ZuJ | 120e86b4-1195-48c5-845b-b98054105eec | CELLxGENE Collection ID | 1 | 17.0 | None | 22.0 | 1 | 2024-01-29 07:54:33.854515+00:00
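The two rows can be compared programmatically. The sketch below uses only the uid and version values copied from the table above and plain Python rather than the DataFrame itself. It picks the most recent version by parsing the version strings as ISO dates, and it computes the leading characters that the two uids share. That shared prefix is an observation on these two rows, not a documented guarantee.

```python
from datetime import date
from os.path import commonprefix

# uid and version copied from the two rows of the versions table above.
versions = [
    {"uid": "kqiPjpzpK9H9rdtnHWas", "version": "2023-07-25"},
    {"uid": "kqiPjpzpK9H9rdtnV67f", "version": "2023-12-15"},
]

# The version strings here are ISO dates, so parse them to compare chronologically.
latest = max(versions, key=lambda row: date.fromisoformat(row["version"]))
print(latest["uid"], latest["version"])

# Leading characters shared by the uids of all listed versions.
shared_prefix = commonprefix([row["uid"] for row in versions])
print(shared_prefix, len(shared_prefix))
```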

Each collection has at least one Artifact associated with it. Let’s get the associated artifacts:

collection.artifacts.df()
id | uid | version | description | key | suffix | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at
1778 | b2x19Eg28GGSNnXW1hAD | 2023-12-15 | Fetal kidney dataset: nephron | cell-census/2023-12-15/h5ads/08073b32-d389-41f... | .h5ad | AnnData | 159545411 | _JE59jFHDrOn0hj4i1yXSQ-20 | md5-n | None | 10790 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:46:06.497662+00:00
1880 | WwmBIhBNLTlRcSoBpatT | 2023-12-15 | Mature kidney dataset: immune | cell-census/2023-12-15/h5ads/20d87640-4be8-487... | .h5ad | AnnData | 44647761 | hSLF-GPhLXaC2tVIOJEdXA-6 | md5-n | None | 7803 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:46:33.152678+00:00
1930 | gHlQ5Muwu3G9pvFC7egT | 2023-12-15 | Fetal kidney dataset: immune | cell-census/2023-12-15/h5ads/2d31c0ca-0233-41c... | .h5ad | AnnData | 64056560 | jENeQIq0JdoHl5PyfY-sjA-8 | md5-n | None | 6847 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:46:37.205210+00:00
1944 | USUgRVwrCMquHiImhk5D | 2023-12-15 | Mature kidney dataset: non PT parenchyma | cell-census/2023-12-15/h5ads/2fc9c59f-3cfd-48d... | .h5ad | AnnData | 39294782 | 3l5iNnBmPFbYfR3-THYWNQ-5 | md5-n | None | 4620 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:46:52.173865+00:00
2405 | P4Oai3OLGAzRwoicaxCB | 2023-12-15 | Mature kidney dataset: full | cell-census/2023-12-15/h5ads/9ea768a2-87ab-46b... | .h5ad | AnnData | 192484358 | yghldeu2bOC5jtvnqZH8Og-23 | md5-n | None | 40268 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:49:11.905786+00:00
2570 | 6mnZ3SeQFhffr3wTdZZb | 2023-12-15 | Fetal kidney dataset: stroma | cell-census/2023-12-15/h5ads/c52de62a-058d-4d7... | .h5ad | AnnData | 109942751 | s24Q5-FNUNQPLZw9BuwOVg-14 | md5-n | None | 8345 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:50:01.866851+00:00
2652 | 11HQaMeIUaOwyHoOWVvA | 2023-12-15 | Fetal kidney dataset: full | cell-census/2023-12-15/h5ads/d7dcfd8f-2ee7-438... | .h5ad | AnnData | 341214674 | 2mnG5TiEpj0Wr5L19TTFRw-41 | md5-n | None | 27197 | 1 | False | 2 | 16 | 22 | 1 | 2024-01-29 07:50:28.610568+00:00
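Before downloading anything, it can help to see how much data the collection holds. The sketch below is a plain-Python illustration that uses the description, size and n_observations values copied from the table above. It assumes that size is given in bytes, and it only adds up rows per file: it does not say whether the "full" datasets overlap with the other files.

```python
# (description, size, n_observations), copied from the artifacts table above.
# Assumption: size is given in bytes.
artifacts = [
    ("Fetal kidney dataset: nephron", 159545411, 10790),
    ("Mature kidney dataset: immune", 44647761, 7803),
    ("Fetal kidney dataset: immune", 64056560, 6847),
    ("Mature kidney dataset: non PT parenchyma", 39294782, 4620),
    ("Mature kidney dataset: full", 192484358, 40268),
    ("Fetal kidney dataset: stroma", 109942751, 8345),
    ("Fetal kidney dataset: full", 341214674, 27197),
]

# Totals across all files of the collection.
total_bytes = sum(size for _, size, _ in artifacts)
total_observations = sum(n_obs for _, _, n_obs in artifacts)
print(f"{len(artifacts)} artifacts, {total_bytes / 1e6:.1f} MB, {total_observations} observations")

# List the files from largest to smallest.
by_size = sorted(artifacts, key=lambda row: row[1], reverse=True)
for description, size, n_obs in by_size:
    print(f"{size / 1e6:8.1f} MB  {n_obs:>6}  {description}")
```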