Arc Virtual Cell Atlas

With 2.5B expression profiles that map to about 600M cells, the Arc Virtual Cell Atlas is the world’s largest collection of uniformly processed scRNA-seq datasets. Arc distributes the atlas as 460k parquet and h5ad files totaling 41 TB on Google Cloud Storage (see github.com/ArcInstitute/arc-virtual-cell-atlas). Lamin mirrors the atlas in a database: lamin.ai/laminlabs/arc-virtual-cell-atlas.

If you use the data academically, please cite the original publications, Youngblut et al. (2025)[1] and Zhang et al. (2025).[2]

To query the atlas with lamindb, you have to install it with the GCP (Google Cloud Platform) extra. We also recommend configuring the bionty and pertdb modules.

# pip install 'lamindb[gcp]'
!lamin settings modules set bionty,pertdb

Create the central query object for this instance:

import lamindb as ln
import pyarrow.compute as pc

db = ln.DB("laminlabs/arc-virtual-cell-atlas")
! using anonymous user (to identify, call: lamin login)

Tahoe-100M

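Retrieve the .h5ad datasets of the Tahoe-100M project. The query returns the fourteen plate-level datasets plus a small 2,000-observation tutorial subset of plate 3: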

tahoe = db.Project.get(name="Tahoe-100M")
artifacts_tahoe = db.Artifact.filter(projects=tahoe, suffix=".h5ad")
artifacts_tahoe.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations ... is_latest is_locked created_at branch_id created_on_id space_id storage_id run_id schema_id created_by_id
id
1375 BDttiuV3Te8VB0dU0000 tahoe100M/2025-02-25/h5ad/plate9_filt_Vevo_Tah... None .h5ad dataset AnnData 18791302576 4kHbVbmreg6akW6ZgsjxaA None 5866669.0 ... True False 2025-02-25 23:22:22.759201+00:00 1 1 1 2 1 3 1
1374 czC19UpUEszVH2bU0000 tahoe100M/2025-02-25/h5ad/plate8_filt_Vevo_Tah... None .h5ad dataset AnnData 30390935958 ilAzEPIh4FlDeTFaJ1dILw None 8880979.0 ... True False 2025-02-25 23:22:22.387666+00:00 1 1 1 2 1 3 1
1373 DC5cacdJr1VoEXnl0000 tahoe100M/2025-02-25/h5ad/plate7_filt_Vevo_Tah... None .h5ad dataset AnnData 16514746341 NOS4MY6eYYPOnAB8ViyWYg None 5692117.0 ... True False 2025-02-25 23:22:22.009157+00:00 1 1 1 2 1 3 1
1372 aAHQ3zbD7n1asyYr0000 tahoe100M/2025-02-25/h5ad/plate6_filt_Vevo_Tah... None .h5ad dataset AnnData 28934897078 NYvQEqVClziHm0ozWhOw1w None 7545393.0 ... True False 2025-02-25 23:22:21.629962+00:00 1 1 1 2 1 3 1
1371 EZATJLC4jE7pmwo40000 tahoe100M/2025-02-25/h5ad/plate5_filt_Vevo_Tah... None .h5ad dataset AnnData 19763140865 VMBKFzOI5cj7UC1UDENP4A None 6419498.0 ... True False 2025-02-25 23:22:21.255154+00:00 1 1 1 2 1 3 1
1370 tKTeff0ugWqAm4P70000 tahoe100M/2025-02-25/h5ad/plate4_filt_Vevo_Tah... None .h5ad dataset AnnData 23292672278 BkBXznbSovNWXtzPFITPcQ None 7004356.0 ... True False 2025-02-25 23:22:20.879928+00:00 1 1 1 2 1 3 1
1369 XVSrkq9pyF1OBLgG0000 tahoe100M/2025-02-25/h5ad/plate3_filt_Vevo_Tah... None .h5ad dataset AnnData 13173722269 Jnrt7DaSUCGn8D8LS2itaw None 4705402.0 ... True False 2025-02-25 23:22:20.497965+00:00 1 1 1 2 1 3 1
1368 ZFeVfd0ugAHeWCxm0000 tahoe100M/2025-02-25/h5ad/plate2_filt_Vevo_Tah... None .h5ad dataset AnnData 29037152127 usxviuqGbuw0RYnECCVCWw None 8064658.0 ... True False 2025-02-25 23:22:20.113956+00:00 1 1 1 2 1 3 1
1367 aJIqo7bNyJAs9z0r0000 tahoe100M/2025-02-25/h5ad/plate1_filt_Vevo_Tah... None .h5ad dataset AnnData 19070623904 9iCNcouMqfNS3HA/2GUWOA None 5481420.0 ... True False 2025-02-25 23:22:19.737995+00:00 1 1 1 2 1 3 1
1366 vn5cUJCHbjpPPsZx0000 tahoe100M/2025-02-25/h5ad/plate14_filt_Vevo_Ta... None .h5ad dataset AnnData 22427932564 FrnStRehP16siRGG35ou+g None 6518806.0 ... True False 2025-02-25 23:22:19.357999+00:00 1 1 1 2 1 3 1
1365 9L9HZ55HqUL0aqaR0000 tahoe100M/2025-02-25/h5ad/plate13_filt_Vevo_Ta... None .h5ad dataset AnnData 28071589885 RKOiaay+CHvv+Ukk/N+28A None 8501658.0 ... True False 2025-02-25 23:22:18.977981+00:00 1 1 1 2 1 3 1
1364 S2h2rPLCaUhZAM9u0000 tahoe100M/2025-02-25/h5ad/plate12_filt_Vevo_Ta... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057.0 ... True False 2025-02-25 23:22:18.600910+00:00 1 1 1 2 1 3 1
1363 omn7JStfJMzy8m6O0000 tahoe100M/2025-02-25/h5ad/plate11_filt_Vevo_Ta... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869.0 ... True False 2025-02-25 23:22:18.229629+00:00 1 1 1 2 1 3 1
1362 56uA9lPPmJ4zLUcr0000 tahoe100M/2025-02-25/h5ad/plate10_filt_Vevo_Ta... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908.0 ... True False 2025-02-25 23:22:17.849980+00:00 1 1 1 2 1 3 1
15 3Yl20zyG926CkvP50000 tahoe100M/2025-02-25/tutorial/plate3_2k-obs.h5ad None .h5ad dataset AnnData 7253540 vv16qryJsVY98jDBqhkr9w None 2000.0 ... True False 2025-02-25 19:31:01.255128+00:00 1 1 1 2 1 3 1

15 rows × 22 columns
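
As a quick sanity check on this listing, the n_observations values of the fourteen plate files can be summed and compared with the n_observations reported for the project's obs_metadata.parquet artifact (100,648,790). The sketch below hard-codes the values from the displayed table instead of querying the database, and the dictionary name is made up for illustration.

```python
# n_observations per plate, copied from the table above.
# The 2,000-observation tutorial subset is deliberately left out.
n_obs_by_plate = {
    "plate1": 5_481_420,
    "plate2": 8_064_658,
    "plate3": 4_705_402,
    "plate4": 7_004_356,
    "plate5": 6_419_498,
    "plate6": 7_545_393,
    "plate7": 5_692_117,
    "plate8": 8_880_979,
    "plate9": 5_866_669,
    "plate10": 8_044_908,
    "plate11": 7_435_869,
    "plate12": 10_487_057,
    "plate13": 8_501_658,
    "plate14": 6_518_806,
}

total_cells = sum(n_obs_by_plate.values())
print(f"{len(n_obs_by_plate)} plates, {total_cells:,} cells in total")
```

The total equals the row count of the observation-level metadata file, which suggests that the fourteen plate files together cover every cell listed there.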

See the schema and annotations of the first dataset:

artifact1 = artifacts_tahoe[0]
artifact1.describe()
Artifact: tahoe100M/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad (0000)
├── uid: 56uA9lPPmJ4zLUcr0000            run: 0xj4zui (register-tahoe100.ipynb)
kind: dataset                        otype: AnnData                        
hash: j1FXsX7hs7u+eBqnWnmNHw         size: 24.7 GB                         
branch: main                         space: all                            
created_at: 2025-02-25 23:22:17 UTC  created_by: sunnyosun                 
n_observations: 8044908.0            schema: tahoe100_anndata_schema       
├── storage/path: gs://arc-institute-virtual-cell-atlas/tahoe100M/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
├── Dataset features
├── var (62710.0 bionty.Gene.sta…                                                                              
│   ANKIB1                         float                                                                       
│   C1orf112                       float                                                                       
│   CFH                            float                                                                       
│   CFTR                           float                                                                       
│   CYP51A1                        float                                                                       
│   DPM1                           float                                                                       
│   ENPP4                          float                                                                       
│   FGR                            float                                                                       
│   FUCA2                          float                                                                       
│   GCLC                           float                                                                       
│   KRIT1                          float                                                                       
│   LAS1L                          float                                                                       
│   NFYA                           float                                                                       
│   NIPAL3                         float                                                                       
│   RAD52                          float                                                                       
│   SCYL3                          float                                                                       
│   SEMA3F                         float                                                                       
│   STPG1                          float                                                                       
│   TNMD                           float                                                                       
│   TSPAN6                         float                                                                       
└── obs (16.0)                                                                                                 
    BARCODE                        str                                                                         
    G2M_score                      float                                                                       
    S_score                        float                                                                       
    cell_line                      bionty.CellLine                      A-172, A-427, A498, A549, AN3 CA, AsPC…
    cell_name                      bionty.CellLine                      A-172, A-427, A498, A549, AN3 CA, AsPC…
    drug                           pertdb.Compound                      5-Azacytidine, 5-Fluorouracil, Abirate…
    drugname_drugconc              pertdb.CompoundPerturbation          [('5-Azacytidine', 0.05, 'uM')], [('5-…
    gene_count                     int                                                                         
    mread_count                    int                                                                         
    pass_filter                    ULabel[yMABN5Dr]                     full, minimal                          
    pcnt_mito                      float                                                                       
    phase                          ULabel[kTzOKZ54]                     G1, G2M, S                             
    plate                          ULabel[SjVCuE2Q]                     plate10                                
    sample                         str                                                                         
    sublibrary                     str                                                                         
    tscp_count                     int                                                                         
└── Labels
    └── .ulabels                       ULabel                               plate10, G1, G2M, S, full, minimal     
        .projects                      Project                              Tahoe-100M                             
        .compounds                     pertdb.Compound                      Bestatin (hydrochloride), Ataluren, Ca…
        .compound_perturbations        pertdb.CompoundPerturbation          [('Bestatin (hydrochloride)', 0.05, 'u…
        .organisms                     bionty.Organism                      human                                  
        .cell_lines                    bionty.CellLine                      NCI-H1573, NCI-H460, hTERT-HPNE, SW48,…

You can download an .h5ad into your local cache, load it into memory, or open it for streaming:

local_filepath = artifact1.cache()  # sync into cache 
adata = artifact1.load()  # sync into cache and load into memory
with artifact1.open() as adata:  # open for streaming
    ...

You can query the CellLine ontology, the Compound, and the CompoundPerturbation registries via their relationship to Artifact. You’ll find 50 cell lines:

db.bionty.CellLine.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()
uid name ontology_id abbr synonyms description is_locked created_at branch_id created_on_id space_id created_by_id run_id source_id
id
50 7VaGVBNBdVEmdf NCI-H596 CVCL_1571 None H596|H-596|NCI-HUT-596|NCIH596 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
49 7dL2LJjx9iARSo NCI-H661 CVCL_1577 None H661|H-661|NCIH661 Part of: AKT genetic alteration cell panel (AT... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
48 J5Ylm8TVnGfDCh NCI-H2122 CVCL_1531 None H2122|H-2122|NCIH2122 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
47 6O2MPQMm2fYmHx A-427 CVCL_1055 None A427|A427N Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
46 219BOZMeuZbXcN SW 1088 CVCL_1715 None SW-1088|SW 1088 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
45 2eQosYlsjKld2O CHP-212 CVCL_1125 None CHP 212|CHP212|NB9|NB-9|Children's Hospital of... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
44 5JZNtoDJwHkHJl AsPC-1 CVCL_0152 None AsPc-1|Aspc-1|ASPC-1|As-PC1|ASPC1|AsPC1|Aspc1|... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
43 3Oz9gRsuI8mZjq HepG2/C3A CVCL_1098 None HepG2/C3A|Hep G2/C3A|C3A Group: Patented cell line. Part of: Cancer Dep... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
42 1K1CzNSiUR7v5M C32 CVCL_1097 None C-32|C32-mel|C32 mel|C32r Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
41 4QH2SpWAsqvVcp NCI-H2030 CVCL_1517 None H2030|H-2030|NCIH2030 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
40 vEfTp1Hk8gZOig A549 CVCL_0023 None A 549|A549|NCI-A549|A549/ATCC|A549 ATCC|A549AT... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
39 39rNVaPPNTJKGZ SK-MEL-2 CVCL_0069 None SK-Mel-2|SK-Mel 2|SK-mel-2|SK-MEL2|SK.MEL.2|SK... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
38 1lhqeW2vEMefxI RPMI-7951 CVCL_1666 None RPMI 7951|RPMI7951|Roswell Park Memorial Insti... Part of: BRAF genetic alteration cell panel (A... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
37 5lqReFKRVIe0Mh AN3 CA CVCL_0028 None AN3_CA|AN3 CA|AN3 Ca|AN3CA|AN-3|AN3|Acanthosis... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
36 4hyU9oFufh0bK1 BT-474 CVCL_0179 None Bt-474|BT474 Part of: AKT genetic alteration cell panel (AT... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
35 5XupQdHONULNf4 COLO 205 CVCL_0218 None Colo 205|CoLo 205|COLO-205|Colo-205|COLO.205|C... Part of: AstraZeneca Colorectal cell line (AZC... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
34 63ZWvcHVlrgadr HCT15 CVCL_0292 None HCT-15|HCT.15|HCT15 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
33 1mLuzzowHZrJ1x HEC-1-A CVCL_0293 None Hec-1-A|HEC-1A|HEC1-A|HEC1A|Hec1A Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
32 1CdEQ5dJ3wSnz0 LS 180 CVCL_0397 None LS-180|LS 180|Laboratory of Surgery 180 Group: Patented cell line. Part of: AstraZenec... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
31 5ewuYry0FV3ZfQ Panc 03.27 CVCL_1635 None Panc 3.27|Panc-03.27|PANC-03-27|Panc_03_27|Pan... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
30 bC8JbRlgoo7I8v NCI-H23 CVCL_1547 None NCI.H23|NCI H23|H-23|H23|NCIH23 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
29 VEd9akJoczffaq LOX-IMVI CVCL_1381 None LOX/IMVI|LOX IMVI|LOXIM-VI|LOXIMVI|LOX Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
28 6NWX3dtqrcuJGn NCI-H2347 CVCL_1550 None H2347|H-2347|NCIH2347 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
27 7QShig8FWwanTz A498 CVCL_1056 None A498 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
26 6GOSOOuitVyT5w HS-578T CVCL_0332 None HS 578T|Hs-578T|HS-578T|Hs_578t|Hs-578-T|HS-57... Group: Triple negative breast cancer (TNBC) ce... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
25 5N1doHnPUuqaUH SNU-423 CVCL_0366 None SNU423|NCI-SNU-423 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
24 NvPXo2HuAsQ6E7 SHP-77 CVCL_1693 None SHP77|Shadyside Hospital Pittsburgh-77 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
23 3PDtUj4smgl5Yt A-172 CVCL_0131 None A172|A 172|A-172 MG|A-172MG Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
22 4IfJB0Y2Yna96G J82 CVCL_0359 None J-82|J 82|J82COT|J82 COT|J82 CO'T|J82/WT Part of: BLA-40 bladder carcinoma cell line pa... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
21 18kaNqu03Lokac SNU-1 CVCL_0099 None SNU1|NCI-SNU-1 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
20 EtRJf7f9w2Ydzv C-33 A CVCL_1094 None C33A|C33a|C33-A|C-33-A|C-33A|C33 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
19 5Vjc1Ubr4Wmx3e KATO III CVCL_0371 None Kato III|Kato-III|KATO-III|KATOIII|KatoIII|KAT... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
18 5Jp9rqX7fq0WfO SW 900 CVCL_1731 None SW-900|SW 900 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
17 tdp1HNANxpdEL8 CFPAC-1 CVCL_1119 None CFPac-1|CF PAC-1|CF-PAC1|CF-Pac1|CF Pac1|CFPAC... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
16 HLBTHKPgnWwXFc SW1417 CVCL_1717 None SW-1417|SW 1417 Part of: AstraZeneca Colorectal cell line (AZC... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
15 1v7MehiuDq1fxi H4 CVCL_1239 None H-4 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
14 4n6gUGHYAgaMa7 RKO CVCL_0504 None None Part of: BRAF genetic alteration cell panel (A... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
13 7aEsdKjgrquDHM SW 1271 CVCL_1716 None SW-1271|SW 1271 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
12 6h8KcJYpcyGPMW MIA PaCa-2 CVCL_0428 None MIA-PaCa-2|MIA-PACA-2|MIA-Pa-Ca-2|MIA Paca2|MI... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
11 65gF96lUqmFSOS PANC-1 CVCL_0480 None Panc-1|PANC.1|Panc 1|PanC1|Panc1|PANC1|Panc-1-P Part of: AKT genetic alteration cell panel (AT... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
10 4Ch2fV9acNggnI Hs 766T CVCL_0334 None Hs 766.T|HS-766T|Hs-766T|HS 766T|HS-766-T|Hs-7... Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
9 7KUHx7VCg6T5IW LoVo CVCL_0399 None LOVO Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
8 1p59Uds7MHG1E8 HT-29 CVCL_0320 None HT 29|HT29 Part of: AstraZeneca Colorectal cell line (AZC... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
7 3Yy5mGISUvNXgf SW480 CVCL_0546 None SW-480|SW 480|SW480E Part of: AstraZeneca Colorectal cell line (AZC... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
6 39vFskbziFrczt NCI-H1792 CVCL_1495 None H1792|H-1792|NCIH1792 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
5 6SFnBlyJJ99Ed4 HOP62 CVCL_1285 None HOP 62|Hop 62|HOP.62|HOP62|Hop62|Hopkins-62 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
4 2MwkQgWOrXrJRa SW48 CVCL_1724 None SW-48|SW 48 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
3 729jQiCVI8G8Tt hTERT-HPNE CVCL_C466 None hTERT-Human Pancreatic Nestin-Expressing cells Doubling time: ~26 hours (PBCF). Genetic integ... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
2 2yrJ1RO9K5lUMm NCI-H460 CVCL_0459 None NCI.H460|H460|H-460|NCIH460|NCI-HUT-460|NCI-460 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120
1 505Oto0blKYUyu NCI-H1573 CVCL_1478 None H1573|H-1573|NCIH1573 Part of: Cancer Dependency Map project (DepMap... False 2025-02-25 22:20:20.217993+00:00 1 1 1 1 3 120

380 compounds:

db.pertdb.Compound.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()
! truncated query result to limit=100 Compound objects (will change to limit=20 in lamindb 2.7)
uid ontology_id abbr synonyms description name type chembl_id smiles canonical_smiles ... molformula moa is_locked created_at branch_id created_on_id space_id created_by_id run_id source_id
id
380 JRDV3CsZkr3Yuu None None None Cyclooxygenase inhibitor Tolmetin None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
379 x3BfjX6JwJhYJu None None None Retinoic receptor agonist Peretinoin None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
378 18PJ8Lu8ZS7C48 None None None None Niclosamide (olamine) None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
377 5yeFtKHyuB3qyk None None None Androgen receptor antagonist Apalutamide None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
376 2AEfoFfqFY4Tom None None None None Mifepristone None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
285 enh3vp0fY6eiEk None None None Proteasome inhibitor Ixazomib None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
284 4iDSuajqnzq19j None None None None Pimitespib None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
283 5y8FwPtAyVVNaS None None None Protein synthesis inhibitor 4EGI-1 None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
282 5lagGAeuQzHKB5 None None None MTOR inhibitor Torkinib None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None
281 1NkoYXbqmIuCb9 None None None Microtubule inhibitor Tubulin inhibitor 6 None None None None ... None None False 2025-02-25 22:48:58.568677+00:00 1 1 1 1 3 None

100 rows × 22 columns

1,138 perturbations:

db.pertdb.CompoundPerturbation.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()
! truncated query result to limit=100 CompoundPerturbation objects (will change to limit=20 in lamindb 2.7)
uid abbr synonyms description name concentration concentration_unit duration is_locked created_at branch_id created_on_id space_id created_by_id run_id source_id compound_id
id
1138 37rn9xx8Vzca2L None None None [('Tolmetin', 5.0, 'uM')] 5.0 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 380
1137 5FRNvnyvkZokCn None None None [('Peretinoin', 5.0, 'uM')] 5.0 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 379
1136 6Wdgxi67CDiFF1 None None None [('Niclosamide (olamine)', 5.0, 'uM')] 5.0 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 378
1135 ZYlNIuNh7HVszg None None None [('Apalutamide', 5.0, 'uM')] 5.0 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 377
1134 17Tb1OMJxUcytA None None None [('Mifepristone', 5.0, 'uM')] 5.0 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 376
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1043 1N1hH9TH6gCCAn None None None [('Imiquimod (maleate)', 0.5, 'uM')] 0.5 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 367
1042 7B5xJGikWaz8zw None None None [('Dapagliflozin ((2S)-1,2-propanediol, hydrat... 0.5 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 366
1041 19sIQab2xR60pl None None None [('Glycyrrhizic acid', 0.5, 'uM')] 0.5 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 365
1040 6EZluRoNfJ217T None None None [('Menadione', 0.5, 'uM')] 0.5 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 364
1039 2TUEvZYfeqtVeZ None None None [('Doxorubicin (hydrochloride)', 0.5, 'uM')] 0.5 uM None False 2025-02-25 22:59:15.764901+00:00 1 1 1 1 3 None 363

100 rows × 17 columns
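
The perturbation names shown above look like Python literals: a list of (compound, concentration, unit) tuples. If you need these fields separately, one possible approach is to parse the string with the standard library. This is only a sketch under the assumption that every name is a valid Python literal of this shape; `parse_perturbation` is a hypothetical helper, not part of lamindb or pertdb.

```python
import ast


def parse_perturbation(name: str) -> list[tuple[str, float, str]]:
    """Parse a name such as "[('Tolmetin', 5.0, 'uM')]" into typed tuples.

    Assumes the string is a Python literal holding a list of
    (compound, concentration, unit) tuples, as in the table above.
    """
    entries = ast.literal_eval(name)  # safe evaluation of literals only
    return [(str(c), float(conc), str(unit)) for c, conc, unit in entries]


parsed = parse_perturbation("[('Tolmetin', 5.0, 'uM')]")
compound, concentration, unit = parsed[0]
print(compound, concentration, unit)
```

In most cases the structured columns `concentration`, `concentration_unit`, and `compound_id` of the registry are the better source. String parsing is mainly useful for columns such as `drugname_drugconc`, which the parquet metadata stores as plain strings.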

Query artifacts based on metadata

Let’s find which datasets contain A549 cells perturbed with Piroxicam.

a549 = db.bionty.CellLine.get(name="A549")
piro = db.pertdb.Compound.get(name="Piroxicam")

artifacts_a549_piro = artifacts_tahoe.filter(compounds=piro, cell_lines=a549)
artifacts_a549_piro.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations ... is_latest is_locked created_at branch_id created_on_id space_id storage_id run_id schema_id created_by_id
id
1364 S2h2rPLCaUhZAM9u0000 tahoe100M/2025-02-25/h5ad/plate12_filt_Vevo_Ta... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057.0 ... True False 2025-02-25 23:22:18.600910+00:00 1 1 1 2 1 3 1
1363 omn7JStfJMzy8m6O0000 tahoe100M/2025-02-25/h5ad/plate11_filt_Vevo_Ta... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869.0 ... True False 2025-02-25 23:22:18.229629+00:00 1 1 1 2 1 3 1
1362 56uA9lPPmJ4zLUcr0000 tahoe100M/2025-02-25/h5ad/plate10_filt_Vevo_Ta... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908.0 ... True False 2025-02-25 23:22:17.849980+00:00 1 1 1 2 1 3 1

3 rows × 22 columns

Stream the dataset content

While the artifact metadata tells us which files contain A549 cells and Piroxicam, we need the observation-level metadata in a parquet file to find the exact cells within those files. To this end, we open the metadata file as a pyarrow Dataset:

obs_af = db.Artifact.get(key__endswith="obs_metadata.parquet", projects=tahoe)
obs_af.describe()
Artifact: tahoe100M/2025-02-25/metadata/obs_metadata.parquet (0000)
├── uid: y1TTR9wbrmZEwpOa0000            run: 0xj4zui (register-tahoe100.ipynb)
kind: dataset                        otype: DataFrame                      
hash: qEWOpGw9CmQVzaElyMWT1Q         size: 2.1 GB                          
branch: main                         space: all                            
created_at: 2025-02-25 19:33:42 UTC  created_by: sunnyosun                 
n_observations: 100648790.0                                                
├── storage/path: gs://arc-institute-virtual-cell-atlas/tahoe100M/2025-02-25/metadata/obs_metadata.parquet
└── Labels
    └── .ulabels                       ULabel                               metadata                               
        .projects                      Project                              Tahoe-100M                             
        .organisms                     bionty.Organism                      human                                  

The schema of the parquet file maps to the pyarrow schema:

obs_ds = obs_af.open()  # consider using with obs_af.open() as obs_ds
obs_ds.schema
plate: string
BARCODE_SUB_LIB_ID: string
sample: string
gene_count: int64
tscp_count: int64
mread_count: int64
drugname_drugconc: string
drug: string
cell_line: dictionary<values=string, indices=int8, ordered=0>
sublibrary: string
BARCODE: string
pcnt_mito: float
S_score: double
G2M_score: double
phase: dictionary<values=string, indices=int8, ordered=0>
pass_filter: dictionary<values=string, indices=int8, ordered=0>
cell_name: dictionary<values=string, indices=int8, ordered=0>
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2487

Streaming speed

The speed of streaming large parquet and h5ad files from cloud storage depends heavily on where you run your code. It is much faster if you run it in the data center that hosts the data, and typically prohibitively slow if you run it locally. The gs://arc-institute-virtual-cell-atlas storage location is accessible from any Google Cloud data center in the US with low latency and no egress fees.

If you want to run logic locally, consider caching datasets prior to opening them for streaming via .open():

local_filepath = obs_af.cache()  # subsequent obs_af.open() will automatically read from the cache
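
To get a feeling for what caching costs, a back-of-the-envelope estimate of the pure transfer time can help. The sketch below uses the byte size of the plate10 .h5ad from the listing above. The bandwidth values are assumptions chosen for illustration, not measurements, and `transfer_minutes` is a hypothetical helper.

```python
def transfer_minutes(size_bytes: int, bandwidth_mbit_s: float) -> float:
    """Lower bound on transfer time in minutes at a sustained bandwidth."""
    seconds = size_bytes * 8 / (bandwidth_mbit_s * 1_000_000)
    return seconds / 60


plate10_bytes = 26_536_400_717  # size of the plate10 .h5ad as listed above

# Assumed bandwidths: a home connection, a fast office link, a data-center link.
for mbit_s in (100, 1_000, 10_000):
    minutes = transfer_minutes(plate10_bytes, mbit_s)
    print(f"{mbit_s:>6} Mbit/s: {minutes:.1f} min")
```

This only accounts for bandwidth. Streaming issues many small range requests, so request latency adds to these numbers, which is one reason a one-time cache can pay off for repeated local access.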

Let us now build a filter expression on the columns of interest:

filter_expr = (pc.field("cell_name") == a549.name) & (pc.field("drug") == piro.name)

Retrieve the corresponding cells:

obs_df = obs_ds.to_table(filter=filter_expr).to_pandas()  # materialize only the matching rows
plate_cells = obs_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

And load their expression counts from the matching plates:

adatas = []
for artifact in artifacts_a549_piro:
    plate_name = artifact.features["plate"].name
    idxs = plate_cells.get(plate_name)
    print(f"loading {len(idxs)} cells from plate {plate_name}")
    with artifact.open() as astore:
        adata = astore[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)

# this will print something like this
#> loading 2812 cells from plate plate10
#> ...
# continue with concatenating or other processing of the AnnData objects

Train ML models

By applying fast data loaders such as annbatch[3] or scdataset[4] to locally cached arrays, one can achieve loading times of 50k - 80k vectors/second. This is much faster than cloud-based streaming of the array content.
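
To put these loading rates into perspective, the sketch below estimates how long one pass over all Tahoe-100M cells would take. It uses the 100,648,790 observations reported for obs_metadata.parquet and assumes the quoted rates are sustained for a whole epoch, which real training runs may not achieve. `epoch_minutes` is a hypothetical helper.

```python
def epoch_minutes(n_cells: int, vectors_per_second: float) -> float:
    """Time in minutes for one full pass over n_cells at a constant rate."""
    return n_cells / vectors_per_second / 60


n_cells = 100_648_790  # n_observations of the Tahoe-100M obs metadata

for rate in (50_000, 80_000):  # loading rates quoted in the text
    print(f"{rate:,} vectors/s: {epoch_minutes(n_cells, rate):.1f} min per epoch")
```

Under these assumptions, one epoch over the full dataset takes roughly 21 to 34 minutes of pure data loading.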

Here we zero-copy transferred the Tahoe-100M datasets into a database for benchmarking different ML data loaders:

[Figure: LaminHub example of lineage-aware syncing of Tahoe-100M datasets]

Here is an example of a data loading run that reads these Tahoe-100M datasets from a pre-shuffled .zarr store, obtained as a transformation of the original 14 .h5ad files.

scBaseCount

scbase = db.Project.get(name="scBaseCount")
scbase
Project(uid='vdK00t9DGwHP', is_type=False, name='scBaseCount', description=None, abbr=None, url='https://arcinstitute.org/tools/virtualcellatlas', start_date=None, end_date=None, branch_id=1, created_on_id=1, space_id=1, created_by_id=1, run_id=None, type_id=None, created_at=2025-02-26 16:04:08 UTC, is_locked=False)

Query artifacts based on metadata

An example query:

organisms = db.bionty.Organism.lookup()
tissues = db.bionty.Tissue.lookup()
efos = db.bionty.ExperimentalFactor.lookup()
feature_counts = db.ULabel.filter(type__name="STARsolo count features").lookup()

h5ads_brain = db.Artifact.filter(
    version_tag="2026-01-12",
    suffix=".h5ad",
    projects=scbase,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
).order_by("size").distinct()

h5ads_brain.to_dataframe()
! truncated query result to limit=100 Artifact objects (will change to limit=20 in lamindb 2.7)
uid key description suffix kind otype size hash n_files n_observations ... is_latest is_locked created_at branch_id created_on_id space_id storage_id run_id schema_id created_by_id
id
554817 qvSEhDQmKxucI3760000 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 2738016 6B9vkGpnXxI9Q23rTDyC9w None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
541733 eYQD5k3PzRRMstxW0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 2878065 eUQqSyIIYUHwABkh80+tPA None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
541705 SqUgBbn2N6boglmW0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 2972876 fGuBQlQ0LMWrnjGr4ZaD0Q None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
541726 KVWrMQWjAozuBHz20001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 2984182 UFWF5MNgROBmEWe5PnxbGQ None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
541706 RdS7w8hsDNyD8iKz0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 2998051 3ng4kuS5t/GepXCPYI2m6Q None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
526702 jZKuJtBsP748tMMC0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 132546552 3P2R0gdONgSQuHkKP67x8Q None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
548665 AZKlT0uOjnZA63Dd0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 133550258 dKQCfPar2QDujsIl6lxgBQ None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
549914 aRPEzHPWGrN84N220001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 137144381 66E/I/4m9Veuz8LNJ0HyCw None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
548671 mc3ktqZABaKIxn7F0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 138411052 Y7hQAs+p4h7An3OVNPoT+Q None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1
529745 Bnvwvu7V04bJLNsT0001 scbasecount/2026-01-12/h5ad/GeneFull_Ex50pAS/H... None .h5ad dataset AnnData 141506046 DMdLohhFH0mF0wHdWvuwhg None None ... True False 2026-05-20 20:13:55.494800+00:00 1 1 1 2 33 62 1

100 rows × 22 columns

Cache and load datasets into memory

Load the first five h5ads as a single AnnData. Calling `load()` caches the datasets, concatenates them, and loads the result into memory:

adata_concat = h5ads_brain[:5].load()
adata_concat
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1823: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 34765 × 36601
    obs: 'gene_count_Unique', 'umi_count_Unique', 'gene_count_UniqueAndMult-EM', 'umi_count_UniqueAndMult-EM', 'gene_count_UniqueAndMult-Uniform', 'umi_count_UniqueAndMult-Uniform', 'SRX_accession', 'cell_type', 'cell_ontology_term_id', 'artifact_uid'
    layers: 'UniqueAndMult-EM', 'UniqueAndMult-Uniform'

Open the sample metadata:

sample_meta = db.Artifact.get(
    version_tag="2026-01-12",
    key__endswith="sample_metadata.parquet",
    projects=scbase,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
)
sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema
entrez_id: int64
srx_accession: string
file_path: string
obs_count: int64
lib_prep: string
tech_10x: string
cell_prep: string
organism: string
tissue: string
tissue_ontology_term_id: string
disease: string
disease_ontology_term_id: string
perturbation: string
cell_line: string
antibody_derived_tag: string
czi_collection_id: string
czi_collection_name: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 2189

Query the sample metadata for the SRX accessions present in the AnnData:

filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()

Add the sample metadata to the AnnData object:

adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat
/opt/hostedtoolcache/Python/3.12.13/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs × n_vars = 34765 × 36601
    obs: 'gene_count_Unique', 'umi_count_Unique', 'gene_count_UniqueAndMult-EM', 'umi_count_UniqueAndMult-EM', 'gene_count_UniqueAndMult-Uniform', 'umi_count_UniqueAndMult-Uniform', 'SRX_accession', 'cell_type', 'cell_ontology_term_id', 'artifact_uid', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'tissue_ontology_term_id', 'disease', 'disease_ontology_term_id', 'perturbation', 'cell_line', 'antibody_derived_tag', 'czi_collection_id', 'czi_collection_name'
    layers: 'UniqueAndMult-EM', 'UniqueAndMult-Uniform'
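
The `ImplicitModificationWarning` above appears to arise because `DataFrame.merge` on columns returns a fresh integer index, which replaces the original observation names. A minimal sketch of one way to keep them, using toy data (the barcodes and metadata values are made up, and the `toy_*` names are hypothetical): a left merge preserves the row order of the left frame, so the original index can be restored afterwards, and `validate="many_to_one"` raises if duplicated accessions in the metadata would multiply rows.

```python
import pandas as pd

# Toy per-cell table indexed by barcode (hypothetical values)
toy_obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=["AAAC-1", "AAAG-1", "AAAT-1"],
)
# Toy per-sample metadata, one row per accession
toy_meta = pd.DataFrame(
    {"srx_accession": ["SRX1", "SRX2"], "tissue": ["brain", "liver"]}
)

toy_merged = toy_obs.merge(
    toy_meta,
    how="left",  # keep every cell, in the original order
    left_on="SRX_accession",
    right_on="srx_accession",
    validate="many_to_one",  # fail loudly if metadata has duplicate accessions
)
# The merge result has a new integer index; put the barcodes back
toy_merged.index = toy_obs.index
print(toy_merged)
```

Unlike the inner merge used above, a left merge keeps cells without a metadata match and fills their metadata columns with missing values.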

See the metadata in the AnnData:

adata_concat.obs.head()
gene_count_Unique umi_count_Unique gene_count_UniqueAndMult-EM umi_count_UniqueAndMult-EM gene_count_UniqueAndMult-Uniform umi_count_UniqueAndMult-Uniform SRX_accession cell_type cell_ontology_term_id artifact_uid ... organism tissue tissue_ontology_term_id disease disease_ontology_term_id perturbation cell_line antibody_derived_tag czi_collection_id czi_collection_name
0 1 1.0 1 1.0 1 1.0 SRX25506069 qvSEhDQmKxucI3760000 ... Homo sapiens brain UBERON:0000955 glioblastoma MONDO:0018177 not mentioned MGG23 maybe None None
1 2 2.0 2 2.0 2 2.0 SRX25506069 qvSEhDQmKxucI3760000 ... Homo sapiens brain UBERON:0000955 glioblastoma MONDO:0018177 not mentioned MGG23 maybe None None
2 7 7.0 7 7.0 7 7.0 SRX25506069 qvSEhDQmKxucI3760000 ... Homo sapiens brain UBERON:0000955 glioblastoma MONDO:0018177 not mentioned MGG23 maybe None None
3 1 1.0 1 1.0 1 1.0 SRX25506069 qvSEhDQmKxucI3760000 ... Homo sapiens brain UBERON:0000955 glioblastoma MONDO:0018177 not mentioned MGG23 maybe None None
4 2 2.0 2 2.0 2 2.0 SRX25506069 qvSEhDQmKxucI3760000 ... Homo sapiens brain UBERON:0000955 glioblastoma MONDO:0018177 not mentioned MGG23 maybe None None

5 rows × 27 columns

Explore collections

For the latest version, this project has 135 collections of artifacts (27 organisms × 5 count features):

db.Collection.filter(version_tag="2026-01-12", projects=scbase).to_dataframe()
! truncated query result to limit=100 Collection objects (will change to limit=20 in lamindb 2.7)
uid key description hash reference reference_type version_tag is_latest is_locked created_at branch_id created_on_id space_id created_by_id run_id meta_artifact_id
id
242 Aioyo5zHXzPUkSuT0001 scBaseCount/Velocyto/Bos_taurus None olJQs43iwGReP02Ig0bsMg None None 2026-01-12 True False 2026-05-21 11:26:48.999352+00:00 1 1 1 1 33 None
241 gJrcdOm2sG7JUINS0001 scBaseCount/GeneFull_ExonOverIntron/Bos_taurus None fHZY39j5Tl2HzuhH4JkbMA None None 2026-01-12 True False 2026-05-21 11:26:41.691858+00:00 1 1 1 1 33 None
240 gY3xsMES4idjZb320001 scBaseCount/GeneFull_Ex50pAS/Bos_taurus None Z-jbRmTLrqXa1OSnX0vXLg None None 2026-01-12 True False 2026-05-21 11:26:34.363847+00:00 1 1 1 1 33 None
239 owfF1Bfuq660eiDp0001 scBaseCount/GeneFull/Bos_taurus None z0dDiak_8xV-nqOAaOCMUQ None None 2026-01-12 True False 2026-05-21 11:26:27.041837+00:00 1 1 1 1 33 None
238 ttGkPgXxLDO4sSXF0001 scBaseCount/Gene/Bos_taurus None -FAJ3zwNRX34JZMFNWiGrQ None None 2026-01-12 True False 2026-05-21 11:26:19.701347+00:00 1 1 1 1 33 None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
147 B1HLE6Qepvw6L8310001 scBaseCount/Velocyto/Callithrix_jacchus None ZQxX6o_lRKBTAbxfeIpIBw None None 2026-01-12 True False 2026-05-21 11:13:46.627336+00:00 1 1 1 1 33 None
146 aTcybpWfhW2UDU1P0001 scBaseCount/GeneFull_ExonOverIntron/Callithrix... None zBDAIbMPR2QWkZ22t6t8Vw None None 2026-01-12 True False 2026-05-21 11:13:39.199856+00:00 1 1 1 1 33 None
145 bcFkDfopVBSBOzUP0001 scBaseCount/GeneFull_Ex50pAS/Callithrix_jacchus None FHqQHBRhxZrFJZU1zomVIQ None None 2026-01-12 True False 2026-05-21 11:13:31.732353+00:00 1 1 1 1 33 None
144 76ZAzqY4L1HSbjg90001 scBaseCount/GeneFull/Callithrix_jacchus None hTyo9bGHvQnIE0Pg-Vag5g None None 2026-01-12 True False 2026-05-21 11:13:24.182111+00:00 1 1 1 1 33 None
143 lGEHS62GtBIAjP560001 scBaseCount/Gene/Callithrix_jacchus None yOd-O77DppiRdAqzIr7CCA None None 2026-01-12 True False 2026-05-21 11:13:16.610080+00:00 1 1 1 1 33 None

100 rows × 16 columns
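
The collection keys in the table follow the pattern `scBaseCount/<count feature>/<organism>`, so the total of 135 is the size of the product of organisms and count features. As a small illustration, the sketch below enumerates the keys for the five count features and only the two organisms visible in the truncated output (the variable names are hypothetical):

```python
from itertools import product

# The five count features that appear in the collection keys above
count_features = [
    "Gene",
    "GeneFull",
    "GeneFull_Ex50pAS",
    "GeneFull_ExonOverIntron",
    "Velocyto",
]
# Only the two organisms visible in the truncated table
organisms_shown = ["Bos_taurus", "Callithrix_jacchus"]

keys = [
    f"scBaseCount/{feature}/{organism}"
    for organism, feature in product(organisms_shown, count_features)
]
print(len(keys))  # one key per (organism, count feature) pair
print(27 * len(count_features))  # same product for all 27 organisms
```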

Collections are immutable sets of artifacts. They are useful for model training or analytical workflows that need to rely on a fixed set of artifacts rather than on a mutable set that is grouped by a folder or a label annotation.

assert db.bionty.CellLine.filter(artifacts__in=artifacts_tahoe).distinct().count() == 50
assert db.pertdb.Compound.filter(artifacts__in=artifacts_tahoe).distinct().count() == 380
assert (
    db.pertdb.CompoundPerturbation.filter(artifacts__in=artifacts_tahoe)
    .distinct()
    .count()
    == 1138
)

References