Arc Virtual Cell Atlas: scRNA-seq
The Arc Virtual Cell Atlas hosts one of the biggest collections of scRNA-seq datasets.
Lamin mirrors the dataset for simplified access here: laminlabs/arc-virtual-cell-atlas.
If you use the data academically, please cite the original publications, Youngblut et al. (2025) and Zhang et al. (2025).
If you’d like to transfer data into your own LaminDB instance, see the transfer guide.
# pip install 'lamindb[gcp]'
!lamin init --modules bionty,wetlab --storage ./test-arc-virtual-cell-atlas
→ initialized lamindb: testuser1/test-arc-virtual-cell-atlas
import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc
import anndata as ad
→ connected lamindb: testuser1/test-arc-virtual-cell-atlas
Create the central query object for this instance:
db = ln.DB("laminlabs/arc-virtual-cell-atlas")
Tahoe-100M
project_tahoe = db.Project.get(name="Tahoe-100M")
project_tahoe
Project(uid='H5MwZwyA62rG', name='Tahoe-100M', description=None, is_type=False, abbr=None, url='https://arcinstitute.org/tools/virtualcellatlas', start_date=None, end_date=None, branch_id=1, space_id=1, created_by_id=1, run_id=None, type_id=None, created_at=2025-02-26 16:03:40 UTC, is_locked=False)
# one collection in this project
project_tahoe.collections.to_dataframe()
| id | uid | key | description | hash | reference | reference_type | version | is_latest | is_locked | created_at | branch_id | space_id | created_by_id | run_id | meta_artifact_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | BpavRL4ntRTzWEE50000 | tahoe100 | None | GCLk4ZgQxgWspjmEUk3gIg | None | None | 2025-02-25 | True | False | 2025-02-26 13:51:22.787537+00:00 | 1 | 1 | 1 | 3 | None |
Every individual dataset in the atlas is an .h5ad file that is registered as an artifact in LaminDB.
Artifact-level metadata are registered as well and can be explored as follows:
# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection_tahoe = db.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to one plate
artifacts_tahoe = collection_tahoe.artifacts.distinct()
artifacts_tahoe.to_dataframe()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | version | is_latest | is_locked | created_at | branch_id | space_id | storage_id | run_id | schema_id | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1375 | BDttiuV3Te8VB0dU0000 | 2025-02-25/h5ad/plate9_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 18791302576 | 4kHbVbmreg6akW6ZgsjxaA | None | 5866669 | None | True | False | 2025-02-25 23:22:22.759201+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1374 | czC19UpUEszVH2bU0000 | 2025-02-25/h5ad/plate8_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 30390935958 | ilAzEPIh4FlDeTFaJ1dILw | None | 8880979 | None | True | False | 2025-02-25 23:22:22.387666+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1373 | DC5cacdJr1VoEXnl0000 | 2025-02-25/h5ad/plate7_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 16514746341 | NOS4MY6eYYPOnAB8ViyWYg | None | 5692117 | None | True | False | 2025-02-25 23:22:22.009157+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1372 | aAHQ3zbD7n1asyYr0000 | 2025-02-25/h5ad/plate6_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 28934897078 | NYvQEqVClziHm0ozWhOw1w | None | 7545393 | None | True | False | 2025-02-25 23:22:21.629962+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1371 | EZATJLC4jE7pmwo40000 | 2025-02-25/h5ad/plate5_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 19763140865 | VMBKFzOI5cj7UC1UDENP4A | None | 6419498 | None | True | False | 2025-02-25 23:22:21.255154+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1370 | tKTeff0ugWqAm4P70000 | 2025-02-25/h5ad/plate4_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 23292672278 | BkBXznbSovNWXtzPFITPcQ | None | 7004356 | None | True | False | 2025-02-25 23:22:20.879928+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1369 | XVSrkq9pyF1OBLgG0000 | 2025-02-25/h5ad/plate3_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 13173722269 | Jnrt7DaSUCGn8D8LS2itaw | None | 4705402 | None | True | False | 2025-02-25 23:22:20.497965+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1368 | ZFeVfd0ugAHeWCxm0000 | 2025-02-25/h5ad/plate2_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 29037152127 | usxviuqGbuw0RYnECCVCWw | None | 8064658 | None | True | False | 2025-02-25 23:22:20.113956+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1367 | aJIqo7bNyJAs9z0r0000 | 2025-02-25/h5ad/plate1_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 19070623904 | 9iCNcouMqfNS3HA/2GUWOA | None | 5481420 | None | True | False | 2025-02-25 23:22:19.737995+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1366 | vn5cUJCHbjpPPsZx0000 | 2025-02-25/h5ad/plate14_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 22427932564 | FrnStRehP16siRGG35ou+g | None | 6518806 | None | True | False | 2025-02-25 23:22:19.357999+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1365 | 9L9HZ55HqUL0aqaR0000 | 2025-02-25/h5ad/plate13_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 28071589885 | RKOiaay+CHvv+Ukk/N+28A | None | 8501658 | None | True | False | 2025-02-25 23:22:18.977981+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1364 | S2h2rPLCaUhZAM9u0000 | 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 37495736876 | VjAkWVFGVpzAMi9Innusuw | None | 10487057 | None | True | False | 2025-02-25 23:22:18.600910+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1363 | omn7JStfJMzy8m6O0000 | 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 23230802756 | N2mzoYlMLEl6PdecaYyDvw | None | 7435869 | None | True | False | 2025-02-25 23:22:18.229629+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1362 | 56uA9lPPmJ4zLUcr0000 | 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 26536400717 | j1FXsX7hs7u+eBqnWnmNHw | None | 8044908 | None | True | False | 2025-02-25 23:22:17.849980+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
The artifacts cover 50 cell lines:
artifacts_tahoe.to_list("cell_lines__name")[:5]
['A-172', 'A-427', 'A498', 'A549', 'AN3 CA']
They cover 380 compounds:
artifacts_tahoe.to_list("compounds__name")[:5]
['18β-Glycyrrhetinic acid',
'4EGI-1',
'5-Azacytidine',
'5-Fluorouracil',
'8-Hydroxyquinoline']
And 1,138 perturbations (compound, concentration, and unit):
artifacts_tahoe.to_list("compound_perturbations__name")[:5]
["[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
"[('18β-Glycyrrhetinic acid', 0.5, 'uM')]",
"[('18β-Glycyrrhetinic acid', 5.0, 'uM')]",
"[('4EGI-1', 0.05, 'uM')]",
"[('4EGI-1', 0.5, 'uM')]"]
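The perturbation names printed above look like Python literals: a list of `(compound, concentration, unit)` tuples rendered as a string. Assuming every name follows that convention (an inference from the five names shown, not a documented guarantee), a minimal sketch for turning a name back into structured values could use `ast.literal_eval`. `parse_perturbation` is a hypothetical helper, not part of LaminDB.

```python
import ast


def parse_perturbation(name: str) -> list[tuple[str, float, str]]:
    """Parse a name such as "[('4EGI-1', 0.5, 'uM')]" into typed tuples.

    Assumes the name is the string form of a list of 3-tuples.
    """
    parsed = ast.literal_eval(name)  # safe: evaluates literals only
    return [(str(compound), float(conc), str(unit)) for compound, conc, unit in parsed]


# names copied from the output above
names = [
    "[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
    "[('4EGI-1', 0.5, 'uM')]",
]
for name in names:
    for compound, conc, unit in parse_perturbation(name):
        print(f"{compound}: {conc} {unit}")
```

`ast.literal_eval` is used rather than `eval` because it only accepts literals, so a malformed name raises an error instead of executing code.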
# check the curated metadata of the first artifact
artifact1 = artifacts_tahoe[0]
artifact1.describe()
Artifact: 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad (0000)
├── uid: 56uA9lPPmJ4zLUcr0000            run: 0xj4zui (register-tahoe100.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: j1FXsX7hs7u+eBqnWnmNHw         size: 24.7 GB
│   branch: main                         space: all
│   created_at: 2025-02-25 23:22:17 UTC  created_by: sunnyosun
│   n_observations: 8044908
├── storage/path: gs://arc-ctc-tahoe100/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
├── Dataset features
│   ├── var (62710 bionty.Gene.stable…
│   │   TSPAN6             float
│   │   TNMD               float
│   │   DPM1               float
│   │   SCYL3              float
│   │   C1orf112           float
│   │   FGR                float
│   │   CFH                float
│   │   FUCA2              float
│   │   GCLC               float
│   │   NFYA               float
│   │   STPG1              float
│   │   NIPAL3             float
│   │   LAS1L              float
│   │   ENPP4              float
│   │   SEMA3F             float
│   │   CFTR               float
│   │   ANKIB1             float
│   │   CYP51A1            float
│   │   KRIT1              float
│   │   RAD52              float
│   └── obs (16)
│       cell_name          bionty.CellLine              A-172, A-427, A498, A549, AN3 CA, AsPC-…
│       drug               wetlab.Compound              5-Azacytidine, 5-Fluorouracil, Abirater…
│       drugname_drugconc  wetlab.CompoundPerturbation  [('5-Azacytidine', 0.05, 'uM')], [('5-F…
│       pass_filter        ULabel[PassFilter]           full, minimal
│       phase              ULabel[Phase]                G1, G2M, S
│       plate              ULabel[Plate]                plate10
│       sample             wetlab.Biosample             smp_2359, smp_2360, smp_2361, smp_2362,…
│       cell_line          bionty.CellLine.description
│       gene_count         int
│       tscp_count         int
│       mread_count        int
│       pcnt_mito          float
│       S_score            float
│       G2M_score          float
│       sublibrary         str
│       BARCODE            str
└── Labels
    └── .ulabels                 ULabel                       plate10, G1, G2M, S, full, minimal
        .projects                Project                      Tahoe-100M
        .references              Reference                    Tahoe-100M: A Giga-Scale Single-Cell Pe…
        .compounds               wetlab.Compound              Omeprazole (sodium), Ranolazine, Proglu…
        .compound_perturbations  wetlab.CompoundPerturbation  [('Bestatin (hydrochloride)', 0.05, 'uM…
        .biosamples              wetlab.Biosample             smp_2359, smp_2360, smp_2361, smp_2362,…
        .organisms               bionty.Organism              human
        .cell_lines              bionty.CellLine              NCI-H1573, NCI-H460, hTERT-HPNE, SW48, …
The `obs` slot holds 16 metadata features:
artifact1.features.slots["obs"].members.to_dataframe()
| id | uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | is_locked | created_at | branch_id | space_id | created_by_id | run_id | type_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 19 | gQE1h3fIBiSf | sample | cat[wetlab.Biosample] | None | None | Unique treatment identifier, distinguishes rep... | 0 | 0 | None | None | None | False | 2025-02-26 10:59:36.743558+00:00 | 1 | 1 | 1 | 3 | None |
| 5 | IjSP1lCY3Hyw | gene_count | int | None | None | Number of genes with at least one count | 0 | 0 | None | None | None | False | 2025-02-25 22:30:30.668750+00:00 | 1 | 1 | 1 | 3 | None |
| 6 | LHUmmYKjIGPl | tscp_count | int | None | None | Number of transcripts, aka UMI count | 0 | 0 | None | None | None | False | 2025-02-25 22:30:31.236532+00:00 | 1 | 1 | 1 | 3 | None |
| 7 | PZDiL36nJSFv | mread_count | int | None | None | Number of reads per cell | 0 | 0 | None | None | None | False | 2025-02-25 22:30:31.810331+00:00 | 1 | 1 | 1 | 3 | None |
| 18 | fLwdFKBUhBY9 | drugname_drugconc | cat[wetlab.CompoundPerturbation] | None | None | Drug name, concentration, and concentration unit | 0 | 0 | None | None | None | False | 2025-02-25 23:04:17.541812+00:00 | 1 | 1 | 1 | 3 | None |
| 17 | Q0cj2JR5Juwn | drug | cat[wetlab.Compound] | None | None | Drug name, parsed out from the drugname_drugco... | 0 | 0 | None | None | None | False | 2025-02-25 23:02:05.717794+00:00 | 1 | 1 | 1 | 3 | None |
| 4 | vshELphl73qp | cell_line | cat[bionty.CellLine.description] | None | None | Cell line information (if applicable) | 0 | 0 | None | None | None | False | 2025-02-25 22:27:22.393997+00:00 | 1 | 1 | 1 | 3 | None |
| 15 | 3X4d0QEUuprp | sublibrary | str | None | None | Sublibrary ID (related to library prep and seq... | 0 | 0 | None | None | None | False | 2025-02-25 22:35:14.673178+00:00 | 1 | 1 | 1 | 3 | None |
| 16 | dQELv2sIVnJX | BARCODE | str | None | None | Barcode ID | 0 | 0 | None | None | None | False | 2025-02-25 22:35:15.627971+00:00 | 1 | 1 | 1 | 3 | None |
| 8 | X640W5tBUPOQ | pcnt_mito | float | None | None | Percentage of mitochondrial reads | 0 | 0 | None | None | None | False | 2025-02-25 22:31:21.581885+00:00 | 1 | 1 | 1 | 3 | None |
| 9 | bujDkB4Nd1S5 | S_score | float | None | None | Inferred S phase score | 0 | 0 | None | None | None | False | 2025-02-25 22:31:22.144135+00:00 | 1 | 1 | 1 | 3 | None |
| 10 | CF0O0e0WZxFz | G2M_score | float | None | None | Inferred G2M score | 0 | 0 | None | None | None | False | 2025-02-25 22:31:22.708895+00:00 | 1 | 1 | 1 | 3 | None |
| 2 | QboQ1Q1Yxsjn | phase | cat[ULabel[Phase]] | None | None | Inferred cell cycle phase | 0 | 0 | None | None | None | False | 2025-02-25 22:21:56.935262+00:00 | 1 | 1 | 1 | 3 | None |
| 3 | PVpyJhciLdCQ | pass_filter | cat[ULabel[PassFilter]] | None | None | "Full" filters are more stringent on gene_coun... | 0 | 0 | None | None | None | False | 2025-02-25 22:25:30.918235+00:00 | 1 | 1 | 1 | 3 | None |
| 11 | KPT70T8xJLIt | cell_name | cat[bionty.CellLine] | None | None | Commonly-used cell name (related to the cell_l... | 0 | 0 | None | None | None | False | 2025-02-25 22:32:56.082195+00:00 | 1 | 1 | 1 | 3 | None |
| 1 | YRSYWdIiesqL | plate | cat[ULabel[Plate]] | None | None | Plate identifier | 0 | 0 | None | None | None | False | 2025-02-25 22:03:51.786985+00:00 | 1 | 1 | 1 | 3 | None |
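The `dtype` column above distinguishes plain types (`int`, `float`, `str`) from categoricals written as `cat[<registry>]`. As a rough sketch, assuming that `cat[...]` wrapper is the only convention in play (which holds for the 16 features listed here), the features can be split by kind with plain string handling. The dictionary is copied by hand from the table, and the variable names are illustrative.

```python
# feature name -> dtype, copied from the obs feature table above
obs_dtypes = {
    "sample": "cat[wetlab.Biosample]",
    "gene_count": "int",
    "tscp_count": "int",
    "mread_count": "int",
    "drugname_drugconc": "cat[wetlab.CompoundPerturbation]",
    "drug": "cat[wetlab.Compound]",
    "cell_line": "cat[bionty.CellLine.description]",
    "sublibrary": "str",
    "BARCODE": "str",
    "pcnt_mito": "float",
    "S_score": "float",
    "G2M_score": "float",
    "phase": "cat[ULabel[Phase]]",
    "pass_filter": "cat[ULabel[PassFilter]]",
    "cell_name": "cat[bionty.CellLine]",
    "plate": "cat[ULabel[Plate]]",
}

# categorical features mapped to the registry named inside cat[...]
categorical = {
    name: dtype[len("cat["):-1]
    for name, dtype in obs_dtypes.items()
    if dtype.startswith("cat[") and dtype.endswith("]")
}
# numeric features, useful as QC covariates
numeric = [name for name, dtype in obs_dtypes.items() if dtype in {"int", "float"}]

print(f"{len(categorical)} categorical, {len(numeric)} numeric features")
```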
Query artifacts of interest based on metadata
Since all metadata are registered in the SQL database, we can explore the datasets without opening the underlying files.
Let’s find which datasets contain A549 cells perturbed with Piroxicam.
# lookup objects give you pythonic access to the values
cell_lines = db.bionty.CellLine.lookup("ontology_id")
drugs = db.wetlab.Compound.lookup()
artifacts_a549_piroxicam = artifacts_tahoe.filter(
    cell_lines=cell_lines.cvcl_0023, compounds=drugs.piroxicam
)
artifacts_a549_piroxicam.to_dataframe()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | version | is_latest | is_locked | created_at | branch_id | space_id | storage_id | run_id | schema_id | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1364 | S2h2rPLCaUhZAM9u0000 | 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 37495736876 | VjAkWVFGVpzAMi9Innusuw | None | 10487057 | None | True | False | 2025-02-25 23:22:18.600910+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1363 | omn7JStfJMzy8m6O0000 | 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 23230802756 | N2mzoYlMLEl6PdecaYyDvw | None | 7435869 | None | True | False | 2025-02-25 23:22:18.229629+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
| 1362 | 56uA9lPPmJ4zLUcr0000 | 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 26536400717 | j1FXsX7hs7u+eBqnWnmNHw | None | 8044908 | None | True | False | 2025-02-25 23:22:17.849980+00:00 | 1 | 1 | 2 | 1 | 3 | 1 |
You can download an .h5ad into your local cache:
artifact1.cache()
Or stream it:
artifact1.open()
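Whether to cache or stream depends largely on file size. As a small sketch, the `size` values (in bytes) of the three plates that matched the A549/Piroxicam query can be tallied before downloading anything. The byte counts are copied from the table above, and `format_gib` is a hypothetical helper. Dividing by 1024³ reproduces the 24.7 GB that `describe()` printed for plate10, which suggests that figure is in binary units.

```python
# size in bytes, copied from the `size` column of the three matching plates
sizes = {
    "plate12": 37495736876,
    "plate11": 23230802756,
    "plate10": 26536400717,
}


def format_gib(n_bytes: int) -> str:
    """Render a byte count in binary gigabytes (GiB) with one decimal."""
    return f"{n_bytes / 1024**3:.1f} GiB"


total_bytes = sum(sizes.values())
for plate, n_bytes in sizes.items():
    print(f"{plate}: {format_gib(n_bytes)}")
print(f"total: {format_gib(total_bytes)}")
```

If the total exceeds the free disk space, streaming with `open()` and reading only the needed rows is likely the better fit.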
Open the obs metadata parquet file as a PyArrow Dataset
Open the obs metadata file (2.29 GB) as a PyArrow Dataset.
obs_metadata = db.Artifact.filter(
    key__endswith="obs_metadata.parquet", projects=project_tahoe
).one()
obs_metadata
Artifact(uid='y1TTR9wbrmZEwpOa0000', version=None, is_latest=True, key='2025-02-25/metadata/obs_metadata.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=2293981573, hash='qEWOpGw9CmQVzaElyMWT1Q', n_files=None, n_observations=100648790, branch_id=1, space_id=1, storage_id=2, run_id=1, schema_id=None, created_by_id=1, created_at=2025-02-25 19:33:42 UTC, is_locked=False)
obs_metadata_ds = obs_metadata.open()
obs_metadata_ds.schema
! run input wasn't tracked, call `ln.track()` and re-run
plate: string
BARCODE_SUB_LIB_ID: string
sample: string
gene_count: int64
tscp_count: int64
mread_count: int64
drugname_drugconc: string
drug: string
cell_line: dictionary<values=string, indices=int8, ordered=0>
sublibrary: string
BARCODE: string
pcnt_mito: float
S_score: double
G2M_score: double
phase: dictionary<values=string, indices=int8, ordered=0>
pass_filter: dictionary<values=string, indices=int8, ordered=0>
cell_name: dictionary<values=string, indices=int8, ordered=0>
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2487
Find the A549 cells that are perturbed with Piroxicam:
filter_expr = (pc.field("cell_name") == cell_lines.cvcl_0023.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
obs_metadata_df = obs_metadata_ds.scanner(filter=filter_expr).to_table().to_pandas()
obs_metadata_df.value_counts("plate")
Retrieve the corresponding cells from the .h5ad files:
# map each plate to the list of matching cell identifiers
plate_cells = obs_metadata_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)
adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    print(f"Loading {len(idxs)} cells from plate {plate}")
    with artifact.open() as store:
        adata = store[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)
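One detail of the loop above: `plate_cells.get(plate)` returns `None` when a plate has no matching rows, and `len(None)` would then raise a `TypeError`. The following dependency-free sketch shows the same plate-to-barcodes grouping together with a guard for that case. The barcode values are made up for illustration, and `cells_for` is a hypothetical helper.

```python
from collections import defaultdict

# (plate, BARCODE_SUB_LIB_ID) pairs; the barcode values are invented placeholders
rows = [
    ("plate10", "bc_01"),
    ("plate11", "bc_02"),
    ("plate10", "bc_03"),
]

# group identifiers by plate, mirroring groupby("plate")[...].apply(list)
grouped = defaultdict(list)
for plate, barcode in rows:
    grouped[plate].append(barcode)
plate_cells = dict(grouped)


def cells_for(plate: str) -> list[str]:
    """Return the identifiers for a plate, or an empty list if it has none."""
    return plate_cells.get(plate, [])


for plate in ["plate10", "plate11", "plate12"]:
    idxs = cells_for(plate)
    if not idxs:
        print(f"Skipping {plate}: no matching cells")
        continue
    print(f"Loading {len(idxs)} cells from plate {plate}")
```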
scBaseCount
project_scbasecount = db.Project.get(name="scBaseCount")
project_scbasecount
Project(uid='vdK00t9DGwHP', name='scBaseCount', description=None, is_type=False, abbr=None, url='https://arcinstitute.org/tools/virtualcellatlas', start_date=None, end_date=None, branch_id=1, space_id=1, created_by_id=1, run_id=None, type_id=None, created_at=2025-02-26 16:04:08 UTC, is_locked=False)
This project has 105 collections (21 organisms × 5 count features); the dataframe below displays 100 of them:
project_scbasecount.collections.to_dataframe()
| id | uid | key | description | hash | reference | reference_type | version | is_latest | is_locked | created_at | branch_id | space_id | created_by_id | run_id | meta_artifact_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 107 | wwvSKTeDmTri9Ppf0000 | scBaseCount/Velocyto/Mus_musculus | None | j3BeJyLuclN11yQpqHJj6Q | None | None | 2025-02-25 | True | False | 2025-03-03 11:09:45.776463+00:00 | 1 | 1 | 1 | 10 | None |
| 106 | wdVaulVvESgAWwtf0000 | scBaseCount/GeneFull_ExonOverIntron/Mus_musculus | None | Yr9AxC-eL10vVMuigJOlrg | None | None | 2025-02-25 | True | False | 2025-03-03 11:09:34.372387+00:00 | 1 | 1 | 1 | 10 | None |
| 105 | 83gTx3oxX5S4SxQ30000 | scBaseCount/GeneFull_Ex50pAS/Mus_musculus | None | x-Tm3VldcW71n3mYE2KknQ | None | None | 2025-02-25 | True | False | 2025-03-03 11:09:22.891607+00:00 | 1 | 1 | 1 | 10 | None |
| 104 | zLwr9k0TkiRt6ymZ0000 | scBaseCount/GeneFull/Mus_musculus | None | i30e5gnKklC8UBqSS0aVSA | None | None | 2025-02-25 | True | False | 2025-03-03 11:09:11.674645+00:00 | 1 | 1 | 1 | 10 | None |
| 103 | wQQNz6vrQeKuro540000 | scBaseCount/Gene/Mus_musculus | None | QeF9x4hTGYLw8MzFvLBCoQ | None | None | 2025-02-25 | True | False | 2025-03-03 11:09:00.351899+00:00 | 1 | 1 | 1 | 10 | None |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 12 | Aioyo5zHXzPUkSuT0000 | scBaseCount/Velocyto/Bos_taurus | None | HkQe255ahyg8xbV35eRs4Q | None | None | 2025-02-25 | True | False | 2025-03-03 11:00:28.668980+00:00 | 1 | 1 | 1 | 10 | None |
| 11 | gJrcdOm2sG7JUINS0000 | scBaseCount/GeneFull_ExonOverIntron/Bos_taurus | None | BFGras5oupzn4iVCjSjZ0A | None | None | 2025-02-25 | True | False | 2025-03-03 11:00:23.782698+00:00 | 1 | 1 | 1 | 10 | None |
| 10 | gY3xsMES4idjZb320000 | scBaseCount/GeneFull_Ex50pAS/Bos_taurus | None | 7E9sWxY48KZlzq0K9vT-rw | None | None | 2025-02-25 | True | False | 2025-03-03 11:00:18.903653+00:00 | 1 | 1 | 1 | 10 | None |
| 9 | owfF1Bfuq660eiDp0000 | scBaseCount/GeneFull/Bos_taurus | None | ionjx_HD9P6K9u5dJKgR3w | None | None | 2025-02-25 | True | False | 2025-03-03 11:00:14.013350+00:00 | 1 | 1 | 1 | 10 | None |
| 8 | ttGkPgXxLDO4sSXF0000 | scBaseCount/Gene/Bos_taurus | None | jn1Nhcdt0lpB1I3hQ4SgFw | None | None | 2025-02-25 | True | False | 2025-03-03 11:00:09.130314+00:00 | 1 | 1 | 1 | 10 | None |
100 rows × 15 columns
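The collection keys in the table follow the pattern `scBaseCount/<count feature>/<organism>`, which is where the 21 × 5 = 105 figure comes from. As a sketch, the keys can be enumerated and parsed with the standard library. Only the two organisms visible in the table are listed here; the other 19 are not shown in this section and are deliberately left out. `split_key` is a hypothetical helper.

```python
from itertools import product

# the five count features visible in the collection keys above
count_features = [
    "Gene",
    "GeneFull",
    "GeneFull_Ex50pAS",
    "GeneFull_ExonOverIntron",
    "Velocyto",
]
# only 2 of the 21 organisms appear in the truncated table
organisms_shown = ["Bos_taurus", "Mus_musculus"]

keys = [
    f"scBaseCount/{feature}/{organism}"
    for organism, feature in product(organisms_shown, count_features)
]


def split_key(key: str) -> dict[str, str]:
    """Split a collection key into its project, count feature, and organism."""
    project, feature, organism = key.split("/")
    return {"project": project, "feature": feature, "organism": organism}


n_expected = 21 * len(count_features)  # 21 organisms, per the text above
print(len(keys), "keys for the organisms shown;", n_expected, "expected in total")
```

A key built this way can be passed to `db.Collection.get(key=...)`, in the same way `key="tahoe100"` is used earlier on this page.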
Query artifacts of interest based on metadata
Often you might not want to access all the h5ads in a collection, but rather filter them by metadata:
organisms = db.bionty.Organism.lookup()
tissues = db.bionty.Tissue.lookup()
efos = db.bionty.ExperimentalFactor.lookup()
feature_counts = db.ULabel.filter(type__name="STARsolo count features").lookup()
h5ads_brain = db.Artifact.filter(
    suffix=".h5ad",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
    experiments__name__contains="CRISPRi",  # `perturbation` column is registered in `wetlab.Experiment`
).distinct()
h5ads_brain.to_dataframe()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | version | is_latest | is_locked | created_at | branch_id | space_id | storage_id | run_id | schema_id | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 114219 | KmwFfMZts5AaTWiz0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 5075135 | 6U8gpvtdL39AydWS2RF+mQ | None | 47000 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 114218 | P8yGlfAQ0wzDTsfl0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 322043307 | D4mrkCwFgr/GCHFBG/bpsw | None | 26839 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 114217 | C9AGAtLn0SycrD0H0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 4803205 | lI5E9UQl5BGLjXTF0tL0eg | None | 42081 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 114216 | PqczIL8HAmnqj3qD0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 5341115 | JE5IltnBicpHIl4+yIIMlw | None | 48937 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 114215 | GeyKZowZ0w8wjk860000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 310534680 | qJOKaNf4BfbK3oldqfhYyw | None | 25826 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 104170 | YqiNrGCXc1cM9Dg90000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 5494309 | kMbDZo5QMSt3WzLKZjsdCg | None | 7383 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 104169 | obSEgMzCzxBMajAG0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 31893717 | 7CxAkyanJAjL0oqRuKuOMQ | None | 8328 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 104166 | ZmSJbhRC4WeK1nyA0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 40518635 | gdcEf34j7wAVvxcUby9UDw | None | 7114 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 104165 | dsdwNB7SxJVms3RM0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 40558858 | pqOSB0P/86wxdtzWC+Y2Iw | None | 7740 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
| 104164 | HDcm6w76zhgllPPL0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 38875181 | FWv1CWlbr5a3hdzfgrztXQ | None | 7641 | None | True | False | 2025-02-28 16:46:25.771217+00:00 | 1 | 1 | 3 | 10 | 55 | 1 |
64 rows × 20 columns
Load the h5ad files with obs metadata
Load the h5ads as a single AnnData:
adatas = []
for artifact in h5ads_brain[:5]:  # only load the first 5 artifacts to save CI time
    adatas.append(artifact.load())
# the obs metadata is stored in the parquet files
adata_concat = ad.concat(adatas)
adata_concat
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1792: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 38206 × 36601
obs: 'gene_count', 'umi_count', 'SRX_accession'
Open the sample metadata:
sample_meta = db.Artifact.filter(
    key__endswith="sample_metadata.parquet",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
).one()
sample_meta
Artifact(uid='WCHkcyWN8L6pDI4E0000', version=None, is_latest=True, key='2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=531878, hash='4QrqW8DQVRl6bKNYiJhq3g', n_files=None, n_observations=16077, branch_id=1, space_id=1, storage_id=3, run_id=2, schema_id=None, created_by_id=1, created_at=2025-02-25 20:41:32 UTC, is_locked=False)
sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema
! run input wasn't tracked, call `ln.track()` and re-run
entrez_id: int64
srx_accession: string
file_path: string
obs_count: int64
lib_prep: string
tech_10x: string
cell_prep: string
organism: string
tissue: string
disease: string
perturbation: string
cell_line: string
czi_collection_id: string
czi_collection_name: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1755
Fetch the sample metadata for the accessions present in the concatenated AnnData:
filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()
Add the sample metadata to the AnnData:
adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs × n_vars = 38206 × 36601
obs: 'gene_count', 'umi_count', 'SRX_accession', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
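The merge above is a many-to-one join: every cell row picks up the metadata of the sample whose `srx_accession` equals the cell's `SRX_accession`. With pandas' default `how="inner"`, cells without a matching sample are dropped. The merged frame also carries a fresh integer index, which is presumably what triggers the "Transforming to str index" warning. The plain-Python sketch below illustrates the join semantics only. The field values are taken from the first rows of the merged table, except for the unmatched accession, which is a made-up placeholder.

```python
# one record per cell; the last accession is a hypothetical unmatched value
cells = [
    {"gene_count": 2748, "SRX_accession": "SRX10606628"},
    {"gene_count": 2351, "SRX_accession": "SRX10606628"},
    {"gene_count": 1000, "SRX_accession": "SRX_UNMATCHED_EXAMPLE"},
]

# one record per sample, keyed by srx_accession
samples = {
    "SRX10606628": {"tissue": "brain", "disease": "Down syndrome", "cell_line": "DS1"},
}

# inner join: keep a cell only if its accession has sample metadata
merged = [
    {**cell, **samples[cell["SRX_accession"]]}
    for cell in cells
    if cell["SRX_accession"] in samples
]

for row in merged:
    print(row)
print(f"{len(cells) - len(merged)} cell(s) dropped for lack of sample metadata")
```

Comparing the row count before and after the merge, as in the last line, is a cheap way to notice silently dropped cells.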
adata_concat.obs.head()
| | gene_count | umi_count | SRX_accession | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | disease | perturbation | cell_line | czi_collection_id | czi_collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2748 | 5134.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
| 1 | 2351 | 4639.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
| 2 | 2184 | 4293.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
| 3 | 2469 | 5307.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
| 4 | 4144 | 9340.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |