
Arc Virtual Cell Atlas: scRNA-seq

The Arc Virtual Cell Atlas hosts one of the biggest collections of scRNA-seq datasets.

Lamin mirrors the dataset for simplified access here: laminlabs/arc-virtual-cell-atlas.

If you use the data academically, please cite the original publications, Youngblut et al. (2025) and Zhang et al. (2025).

If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

# pip install 'lamindb[gcp]'
!lamin init --modules bionty,wetlab --storage ./test-arc-virtual-cell-atlas
 initialized lamindb: testuser1/test-arc-virtual-cell-atlas
import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc
import anndata as ad
 connected lamindb: testuser1/test-arc-virtual-cell-atlas

Create the central query object for this instance:

db = ln.DB("laminlabs/arc-virtual-cell-atlas")

Tahoe-100M

project_tahoe = db.Project.get(name="Tahoe-100M")
project_tahoe
Project(uid='H5MwZwyA62rG', name='Tahoe-100M', description=None, is_type=False, abbr=None, url='https://arcinstitute.org/tools/virtualcellatlas', start_date=None, end_date=None, branch_id=1, space_id=1, created_by_id=1, run_id=None, type_id=None, created_at=2025-02-26 16:03:40 UTC, is_locked=False)
# one collection in this project
project_tahoe.collections.to_dataframe()
uid key description hash reference reference_type version is_latest is_locked created_at branch_id space_id created_by_id run_id meta_artifact_id
id
1 BpavRL4ntRTzWEE50000 tahoe100 None GCLk4ZgQxgWspjmEUk3gIg None None 2025-02-25 True False 2025-02-26 13:51:22.787537+00:00 1 1 1 3 None

Every individual dataset in the atlas is an .h5ad file that is registered as an artifact in LaminDB.

Artifact-level metadata are registered and can be explored as follows:

# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection_tahoe = db.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to a plate
artifacts_tahoe = collection_tahoe.artifacts.distinct()
artifacts_tahoe.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
1375 BDttiuV3Te8VB0dU0000 2025-02-25/h5ad/plate9_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 18791302576 4kHbVbmreg6akW6ZgsjxaA None 5866669 None True False 2025-02-25 23:22:22.759201+00:00 1 1 2 1 3 1
1374 czC19UpUEszVH2bU0000 2025-02-25/h5ad/plate8_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 30390935958 ilAzEPIh4FlDeTFaJ1dILw None 8880979 None True False 2025-02-25 23:22:22.387666+00:00 1 1 2 1 3 1
1373 DC5cacdJr1VoEXnl0000 2025-02-25/h5ad/plate7_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 16514746341 NOS4MY6eYYPOnAB8ViyWYg None 5692117 None True False 2025-02-25 23:22:22.009157+00:00 1 1 2 1 3 1
1372 aAHQ3zbD7n1asyYr0000 2025-02-25/h5ad/plate6_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 28934897078 NYvQEqVClziHm0ozWhOw1w None 7545393 None True False 2025-02-25 23:22:21.629962+00:00 1 1 2 1 3 1
1371 EZATJLC4jE7pmwo40000 2025-02-25/h5ad/plate5_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 19763140865 VMBKFzOI5cj7UC1UDENP4A None 6419498 None True False 2025-02-25 23:22:21.255154+00:00 1 1 2 1 3 1
1370 tKTeff0ugWqAm4P70000 2025-02-25/h5ad/plate4_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 23292672278 BkBXznbSovNWXtzPFITPcQ None 7004356 None True False 2025-02-25 23:22:20.879928+00:00 1 1 2 1 3 1
1369 XVSrkq9pyF1OBLgG0000 2025-02-25/h5ad/plate3_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 13173722269 Jnrt7DaSUCGn8D8LS2itaw None 4705402 None True False 2025-02-25 23:22:20.497965+00:00 1 1 2 1 3 1
1368 ZFeVfd0ugAHeWCxm0000 2025-02-25/h5ad/plate2_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 29037152127 usxviuqGbuw0RYnECCVCWw None 8064658 None True False 2025-02-25 23:22:20.113956+00:00 1 1 2 1 3 1
1367 aJIqo7bNyJAs9z0r0000 2025-02-25/h5ad/plate1_filt_Vevo_Tahoe100M_WSe... None .h5ad dataset AnnData 19070623904 9iCNcouMqfNS3HA/2GUWOA None 5481420 None True False 2025-02-25 23:22:19.737995+00:00 1 1 2 1 3 1
1366 vn5cUJCHbjpPPsZx0000 2025-02-25/h5ad/plate14_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 22427932564 FrnStRehP16siRGG35ou+g None 6518806 None True False 2025-02-25 23:22:19.357999+00:00 1 1 2 1 3 1
1365 9L9HZ55HqUL0aqaR0000 2025-02-25/h5ad/plate13_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 28071589885 RKOiaay+CHvv+Ukk/N+28A None 8501658 None True False 2025-02-25 23:22:18.977981+00:00 1 1 2 1 3 1
1364 S2h2rPLCaUhZAM9u0000 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057 None True False 2025-02-25 23:22:18.600910+00:00 1 1 2 1 3 1
1363 omn7JStfJMzy8m6O0000 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869 None True False 2025-02-25 23:22:18.229629+00:00 1 1 2 1 3 1
1362 56uA9lPPmJ4zLUcr0000 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908 None True False 2025-02-25 23:22:17.849980+00:00 1 1 2 1 3 1

50 cell lines.

artifacts_tahoe.to_list("cell_lines__name")[:5]
['A-172', 'A-427', 'A498', 'A549', 'AN3 CA']

380 compounds.

artifacts_tahoe.to_list("compounds__name")[:5]
['18β-Glycyrrhetinic acid',
 '4EGI-1',
 '5-Azacytidine',
 '5-Fluorouracil',
 '8-Hydroxyquinoline']

1,138 perturbations.

artifacts_tahoe.to_list("compound_perturbations__name")[:5]
["[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 0.5, 'uM')]",
 "[('18β-Glycyrrhetinic acid', 5.0, 'uM')]",
 "[('4EGI-1', 0.05, 'uM')]",
 "[('4EGI-1', 0.5, 'uM')]"]
# check the curated metadata of the first artifact
artifact1 = artifacts_tahoe[0]
artifact1.describe()
Artifact: 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad (0000)
├── uid: 56uA9lPPmJ4zLUcr0000            run: 0xj4zui (register-tahoe100.ipynb)
│   kind: dataset                        otype: AnnData
│   hash: j1FXsX7hs7u+eBqnWnmNHw         size: 24.7 GB
│   branch: main                         space: all
│   created_at: 2025-02-25 23:22:17 UTC  created_by: sunnyosun
│   n_observations: 8044908
├── storage/path: gs://arc-ctc-tahoe100/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
├── Dataset features
├── var (62710 bionty.Gene.stable…                                                                             
│   TSPAN6                          float                                                                      
│   TNMD                            float                                                                      
│   DPM1                            float                                                                      
│   SCYL3                           float                                                                      
│   C1orf112                        float                                                                      
│   FGR                             float                                                                      
│   CFH                             float                                                                      
│   FUCA2                           float                                                                      
│   GCLC                            float                                                                      
│   NFYA                            float                                                                      
│   STPG1                           float                                                                      
│   NIPAL3                          float                                                                      
│   LAS1L                           float                                                                      
│   ENPP4                           float                                                                      
│   SEMA3F                          float                                                                      
│   CFTR                            float                                                                      
│   ANKIB1                          float                                                                      
│   CYP51A1                         float                                                                      
│   KRIT1                           float                                                                      
│   RAD52                           float                                                                      
└── obs (16)                                                                                                   
    cell_name                       bionty.CellLine                    A-172, A-427, A498, A549, AN3 CA, AsPC-…
    drug                            wetlab.Compound                    5-Azacytidine, 5-Fluorouracil, Abirater…
    drugname_drugconc               wetlab.CompoundPerturbation        [('5-Azacytidine', 0.05, 'uM')], [('5-F…
    pass_filter                     ULabel[PassFilter]                 full, minimal                           
    phase                           ULabel[Phase]                      G1, G2M, S                              
    plate                           ULabel[Plate]                      plate10                                 
    sample                          wetlab.Biosample                   smp_2359, smp_2360, smp_2361, smp_2362,…
    cell_line                       bionty.CellLine.description                                                
    gene_count                      int                                                                        
    tscp_count                      int                                                                        
    mread_count                     int                                                                        
    pcnt_mito                       float                                                                      
    S_score                         float                                                                      
    G2M_score                       float                                                                      
    sublibrary                      str                                                                        
    BARCODE                         str                                                                        
└── Labels
    └── .ulabels                        ULabel                             plate10, G1, G2M, S, full, minimal      
        .projects                       Project                            Tahoe-100M                              
        .references                     Reference                          Tahoe-100M: A Giga-Scale Single-Cell Pe…
        .compounds                      wetlab.Compound                    Omeprazole (sodium), Ranolazine, Proglu…
        .compound_perturbations         wetlab.CompoundPerturbation        [('Bestatin (hydrochloride)', 0.05, 'uM…
        .biosamples                     wetlab.Biosample                   smp_2359, smp_2360, smp_2361, smp_2362,…
        .organisms                      bionty.Organism                    human                                   
        .cell_lines                     bionty.CellLine                    NCI-H1573, NCI-H460, hTERT-HPNE, SW48, …
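
Perturbation names such as "[('4EGI-1', 0.5, 'uM')]" are the string form of a Python list of (compound, concentration, unit) tuples. If you need the parts separately, one option is to parse the string with the standard library. The following is a minimal sketch: parse_perturbation is a hypothetical helper, not part of LaminDB, and it assumes every name follows the format shown in the outputs above.

```python
import ast


def parse_perturbation(name: str) -> list[tuple[str, float, str]]:
    """Parse a name like "[('4EGI-1', 0.5, 'uM')]" into (compound, concentration, unit) tuples.

    Hypothetical helper: assumes the name is a Python literal list of 3-tuples.
    """
    parsed = ast.literal_eval(name)
    return [(str(compound), float(conc), str(unit)) for compound, conc, unit in parsed]


# two names copied from the perturbation list above
names = [
    "[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
    "[('4EGI-1', 0.5, 'uM')]",
]
perturbations = [parse_perturbation(n) for n in names]

for perturbation in perturbations:
    for compound, conc, unit in perturbation:
        print(f"{compound}: {conc} {unit}")
```

ast.literal_eval only evaluates literals, which makes it a safer choice than eval for strings that come from a database.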

16 obs metadata features.

artifact1.features.slots["obs"].members.to_dataframe()
uid name dtype is_type unit description array_rank array_size array_shape proxy_dtype synonyms is_locked created_at branch_id space_id created_by_id run_id type_id
id
19 gQE1h3fIBiSf sample cat[wetlab.Biosample] None None Unique treatment identifier, distinguishes rep... 0 0 None None None False 2025-02-26 10:59:36.743558+00:00 1 1 1 3 None
5 IjSP1lCY3Hyw gene_count int None None Number of genes with at least one count 0 0 None None None False 2025-02-25 22:30:30.668750+00:00 1 1 1 3 None
6 LHUmmYKjIGPl tscp_count int None None Number of transcripts, aka UMI count 0 0 None None None False 2025-02-25 22:30:31.236532+00:00 1 1 1 3 None
7 PZDiL36nJSFv mread_count int None None Number of reads per cell 0 0 None None None False 2025-02-25 22:30:31.810331+00:00 1 1 1 3 None
18 fLwdFKBUhBY9 drugname_drugconc cat[wetlab.CompoundPerturbation] None None Drug name, concentration, and concentration unit 0 0 None None None False 2025-02-25 23:04:17.541812+00:00 1 1 1 3 None
17 Q0cj2JR5Juwn drug cat[wetlab.Compound] None None Drug name, parsed out from the drugname_drugco... 0 0 None None None False 2025-02-25 23:02:05.717794+00:00 1 1 1 3 None
4 vshELphl73qp cell_line cat[bionty.CellLine.description] None None Cell line information (if applicable) 0 0 None None None False 2025-02-25 22:27:22.393997+00:00 1 1 1 3 None
15 3X4d0QEUuprp sublibrary str None None Sublibrary ID (related to library prep and seq... 0 0 None None None False 2025-02-25 22:35:14.673178+00:00 1 1 1 3 None
16 dQELv2sIVnJX BARCODE str None None Barcode ID 0 0 None None None False 2025-02-25 22:35:15.627971+00:00 1 1 1 3 None
8 X640W5tBUPOQ pcnt_mito float None None Percentage of mitochondrial reads 0 0 None None None False 2025-02-25 22:31:21.581885+00:00 1 1 1 3 None
9 bujDkB4Nd1S5 S_score float None None Inferred S phase score 0 0 None None None False 2025-02-25 22:31:22.144135+00:00 1 1 1 3 None
10 CF0O0e0WZxFz G2M_score float None None Inferred G2M score 0 0 None None None False 2025-02-25 22:31:22.708895+00:00 1 1 1 3 None
2 QboQ1Q1Yxsjn phase cat[ULabel[Phase]] None None Inferred cell cycle phase 0 0 None None None False 2025-02-25 22:21:56.935262+00:00 1 1 1 3 None
3 PVpyJhciLdCQ pass_filter cat[ULabel[PassFilter]] None None "Full" filters are more stringent on gene_coun... 0 0 None None None False 2025-02-25 22:25:30.918235+00:00 1 1 1 3 None
11 KPT70T8xJLIt cell_name cat[bionty.CellLine] None None Commonly-used cell name (related to the cell_l... 0 0 None None None False 2025-02-25 22:32:56.082195+00:00 1 1 1 3 None
1 YRSYWdIiesqL plate cat[ULabel[Plate]] None None Plate identifier 0 0 None None None False 2025-02-25 22:03:51.786985+00:00 1 1 1 3 None

Query artifacts of interest based on metadata

Since all metadata are registered in the SQL database, we can explore the datasets without downloading or opening the files.

Let’s find which datasets contain A549 cells perturbed with Piroxicam.

# lookup objects give you pythonic access to the values
cell_lines = db.bionty.CellLine.lookup("ontology_id")
drugs = db.wetlab.Compound.lookup()

artifacts_a549_piroxicam = artifacts_tahoe.filter(
    cell_lines=cell_lines.cvcl_0023, compounds=drugs.piroxicam
)
artifacts_a549_piroxicam.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
1364 S2h2rPLCaUhZAM9u0000 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 37495736876 VjAkWVFGVpzAMi9Innusuw None 10487057 None True False 2025-02-25 23:22:18.600910+00:00 1 1 2 1 3 1
1363 omn7JStfJMzy8m6O0000 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 23230802756 N2mzoYlMLEl6PdecaYyDvw None 7435869 None True False 2025-02-25 23:22:18.229629+00:00 1 1 2 1 3 1
1362 56uA9lPPmJ4zLUcr0000 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... None .h5ad dataset AnnData 26536400717 j1FXsX7hs7u+eBqnWnmNHw None 8044908 None True False 2025-02-25 23:22:17.849980+00:00 1 1 2 1 3 1

You can download an .h5ad into your local cache:

artifact1.cache()

Or stream it:

artifact1.open()
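
Before choosing between caching and streaming, it can help to estimate how much data a query result would download. The sketch below sums the sizes of the three artifacts returned by the A549 and Piroxicam query, using the byte counts from the size column of that output. format_size is a hypothetical helper written for this example and uses binary units (1 GiB = 1024**3 bytes).

```python
def format_size(n_bytes: int) -> str:
    """Format a byte count in binary units (hypothetical helper for this sketch)."""
    size = float(n_bytes)
    units = ("B", "KiB", "MiB", "GiB", "TiB")
    for unit in units[:-1]:
        if size < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} {units[-1]}"


# byte counts copied from the `size` column of the query result above
sizes = {
    "plate12": 37_495_736_876,
    "plate11": 23_230_802_756,
    "plate10": 26_536_400_717,
}

for plate, n_bytes in sizes.items():
    print(plate, format_size(n_bytes))
print("total", format_size(sum(sizes.values())))
```

For plate10 this gives 24.7, the same figure that describe() reports as the artifact size. Caching all three files would need roughly 81 GiB of local disk, which is a reason to prefer streaming when only a subset of cells is needed.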

Open the obs metadata parquet file as a PyArrow Dataset

Open the obs metadata file (2.29 GB) as a PyArrow Dataset.

obs_metadata = db.Artifact.filter(
    key__endswith="obs_metadata.parquet", projects=project_tahoe
).one()
obs_metadata
Artifact(uid='y1TTR9wbrmZEwpOa0000', version=None, is_latest=True, key='2025-02-25/metadata/obs_metadata.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=2293981573, hash='qEWOpGw9CmQVzaElyMWT1Q', n_files=None, n_observations=100648790, branch_id=1, space_id=1, storage_id=2, run_id=1, schema_id=None, created_by_id=1, created_at=2025-02-25 19:33:42 UTC, is_locked=False)
obs_metadata_ds = obs_metadata.open()
obs_metadata_ds.schema
! run input wasn't tracked, call `ln.track()` and re-run
plate: string
BARCODE_SUB_LIB_ID: string
sample: string
gene_count: int64
tscp_count: int64
mread_count: int64
drugname_drugconc: string
drug: string
cell_line: dictionary<values=string, indices=int8, ordered=0>
sublibrary: string
BARCODE: string
pcnt_mito: float
S_score: double
G2M_score: double
phase: dictionary<values=string, indices=int8, ordered=0>
pass_filter: dictionary<values=string, indices=int8, ordered=0>
cell_name: dictionary<values=string, indices=int8, ordered=0>
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2487

Find the A549 cells that were perturbed with Piroxicam and count them per plate:

filter_expr = (pc.field("cell_name") == cell_lines.cvcl_0023.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
obs_metadata_df = obs_metadata_ds.scanner(filter=filter_expr).to_table().to_pandas()
obs_metadata_df.value_counts("plate")

Retrieve the corresponding cells from h5ad files.

# group the barcodes of the filtered cells by plate
plate_cells = obs_metadata_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    print(f"Loading {len(idxs)} cells from plate {plate}")
    with artifact.open() as store:
        adata = store[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)
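
In plain Python, the grouping step amounts to building a mapping from plate to a list of cell identifiers. The sketch below uses invented rows in place of the filtered obs metadata and adds one guard that the loop above does not have: .get returns None for a plate without matching cells, and calling len on None would raise a TypeError.

```python
from collections import defaultdict

# hypothetical rows standing in for the filtered obs metadata;
# the identifier values are invented for illustration
rows = [
    {"plate": "plate10", "BARCODE_SUB_LIB_ID": "cell_a"},
    {"plate": "plate10", "BARCODE_SUB_LIB_ID": "cell_b"},
    {"plate": "plate11", "BARCODE_SUB_LIB_ID": "cell_c"},
]

# group identifiers by plate, preserving row order within each plate
grouped = defaultdict(list)
for row in rows:
    grouped[row["plate"]].append(row["BARCODE_SUB_LIB_ID"])
plate_cells = dict(grouped)

# "plate12" has no matching rows in this toy example
for plate in ("plate10", "plate11", "plate12"):
    idxs = plate_cells.get(plate)
    if idxs is None:
        print(f"No matching cells on {plate}, skipping")
        continue
    print(f"Would load {len(idxs)} cells from {plate}")
```

Whether such a guard is needed in practice depends on whether the artifact query and the parquet filter always agree on the set of plates.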

scBaseCount

project_scbasecount = db.Project.get(name="scBaseCount")
project_scbasecount
Project(uid='vdK00t9DGwHP', name='scBaseCount', description=None, is_type=False, abbr=None, url='https://arcinstitute.org/tools/virtualcellatlas', start_date=None, end_date=None, branch_id=1, space_id=1, created_by_id=1, run_id=None, type_id=None, created_at=2025-02-26 16:04:08 UTC, is_locked=False)

This project has one collection per combination of organism and count feature (5 count features; 100 collections in the output below):

project_scbasecount.collections.to_dataframe()
uid key description hash reference reference_type version is_latest is_locked created_at branch_id space_id created_by_id run_id meta_artifact_id
id
107 wwvSKTeDmTri9Ppf0000 scBaseCount/Velocyto/Mus_musculus None j3BeJyLuclN11yQpqHJj6Q None None 2025-02-25 True False 2025-03-03 11:09:45.776463+00:00 1 1 1 10 None
106 wdVaulVvESgAWwtf0000 scBaseCount/GeneFull_ExonOverIntron/Mus_musculus None Yr9AxC-eL10vVMuigJOlrg None None 2025-02-25 True False 2025-03-03 11:09:34.372387+00:00 1 1 1 10 None
105 83gTx3oxX5S4SxQ30000 scBaseCount/GeneFull_Ex50pAS/Mus_musculus None x-Tm3VldcW71n3mYE2KknQ None None 2025-02-25 True False 2025-03-03 11:09:22.891607+00:00 1 1 1 10 None
104 zLwr9k0TkiRt6ymZ0000 scBaseCount/GeneFull/Mus_musculus None i30e5gnKklC8UBqSS0aVSA None None 2025-02-25 True False 2025-03-03 11:09:11.674645+00:00 1 1 1 10 None
103 wQQNz6vrQeKuro540000 scBaseCount/Gene/Mus_musculus None QeF9x4hTGYLw8MzFvLBCoQ None None 2025-02-25 True False 2025-03-03 11:09:00.351899+00:00 1 1 1 10 None
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12 Aioyo5zHXzPUkSuT0000 scBaseCount/Velocyto/Bos_taurus None HkQe255ahyg8xbV35eRs4Q None None 2025-02-25 True False 2025-03-03 11:00:28.668980+00:00 1 1 1 10 None
11 gJrcdOm2sG7JUINS0000 scBaseCount/GeneFull_ExonOverIntron/Bos_taurus None BFGras5oupzn4iVCjSjZ0A None None 2025-02-25 True False 2025-03-03 11:00:23.782698+00:00 1 1 1 10 None
10 gY3xsMES4idjZb320000 scBaseCount/GeneFull_Ex50pAS/Bos_taurus None 7E9sWxY48KZlzq0K9vT-rw None None 2025-02-25 True False 2025-03-03 11:00:18.903653+00:00 1 1 1 10 None
9 owfF1Bfuq660eiDp0000 scBaseCount/GeneFull/Bos_taurus None ionjx_HD9P6K9u5dJKgR3w None None 2025-02-25 True False 2025-03-03 11:00:14.013350+00:00 1 1 1 10 None
8 ttGkPgXxLDO4sSXF0000 scBaseCount/Gene/Bos_taurus None jn1Nhcdt0lpB1I3hQ4SgFw None None 2025-02-25 True False 2025-03-03 11:00:09.130314+00:00 1 1 1 10 None

100 rows × 15 columns
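
The collection keys in this table follow the pattern scBaseCount/<count feature>/<organism>. A short sketch of how to take such keys apart or build one is shown below; parse_collection_key is a hypothetical helper, and the pattern is inferred from the keys in the output rather than from a documented guarantee.

```python
from collections import defaultdict


def parse_collection_key(key: str) -> tuple[str, str, str]:
    """Split 'project/count_feature/organism' into its three parts (hypothetical helper)."""
    project, count_feature, organism = key.split("/")
    return project, count_feature, organism


# keys copied from the table above
keys = [
    "scBaseCount/Velocyto/Mus_musculus",
    "scBaseCount/GeneFull_Ex50pAS/Mus_musculus",
    "scBaseCount/Gene/Bos_taurus",
]

# collect the count features seen for each organism
features_by_organism = defaultdict(list)
for key in keys:
    _, count_feature, organism = parse_collection_key(key)
    features_by_organism[organism].append(count_feature)

for organism, features in features_by_organism.items():
    print(organism, features)

# build the key for a given count feature and organism
key = "/".join(["scBaseCount", "GeneFull", "Mus_musculus"])
print(key)
```

A key built this way could then be passed to db.Collection.get(key=...), in the same way the Tahoe-100M collection is fetched by its key.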

Query artifacts of interest based on metadata

Often you might not want to access all the h5ads in a collection, but rather filter them by metadata:

organisms = db.bionty.Organism.lookup()
tissues = db.bionty.Tissue.lookup()
efos = db.bionty.ExperimentalFactor.lookup()
feature_counts = db.ULabel.filter(type__name="STARsolo count features").lookup()
h5ads_brain = db.Artifact.filter(
    suffix=".h5ad",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
    experiments__name__contains="CRISPRi",  # `perturbation` column is registered in `wetlab.Experiment`
).distinct()

h5ads_brain.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations version is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
114219 KmwFfMZts5AaTWiz0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 5075135 6U8gpvtdL39AydWS2RF+mQ None 47000 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
114218 P8yGlfAQ0wzDTsfl0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 322043307 D4mrkCwFgr/GCHFBG/bpsw None 26839 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
114217 C9AGAtLn0SycrD0H0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 4803205 lI5E9UQl5BGLjXTF0tL0eg None 42081 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
114216 PqczIL8HAmnqj3qD0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 5341115 JE5IltnBicpHIl4+yIIMlw None 48937 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
114215 GeyKZowZ0w8wjk860000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 310534680 qJOKaNf4BfbK3oldqfhYyw None 25826 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
104170 YqiNrGCXc1cM9Dg90000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 5494309 kMbDZo5QMSt3WzLKZjsdCg None 7383 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
104169 obSEgMzCzxBMajAG0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 31893717 7CxAkyanJAjL0oqRuKuOMQ None 8328 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
104166 ZmSJbhRC4WeK1nyA0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 40518635 gdcEf34j7wAVvxcUby9UDw None 7114 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
104165 dsdwNB7SxJVms3RM0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 40558858 pqOSB0P/86wxdtzWC+Y2Iw None 7740 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1
104164 HDcm6w76zhgllPPL0000 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... None .h5ad dataset AnnData 38875181 FWv1CWlbr5a3hdzfgrztXQ None 7641 None True False 2025-02-28 16:46:25.771217+00:00 1 1 3 10 55 1

64 rows × 20 columns

Load the h5ad files with obs metadata

Load the h5ads as a single AnnData:

adatas = []
for artifact in h5ads_brain[:5]:  # only load the first 5 artifacts to save CI time
    adatas.append(artifact.load())

# concatenate the cells; sample-level obs metadata live in separate parquet files
adata_concat = ad.concat(adatas)
adata_concat
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
! run input wasn't tracked, call `ln.track()` and re-run
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1792: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
  utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 38206 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession'
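
The UserWarning in this output says that observation names are not unique after concatenation, most likely because the same names occur in more than one sample. One way to get unique names is to combine each name with its sample accession. The sketch below shows the idea in plain Python; the observation names and the second accession are invented for illustration.

```python
# invented (accession, observation name) pairs; the name "AAAC" repeats across samples
cells = [
    ("SRX10606628", "AAAC"),
    ("SRX10606628", "AAAG"),
    ("SRX00000001", "AAAC"),  # hypothetical accession
]

names = [name for _, name in cells]
unique_names = [f"{name}-{accession}" for accession, name in cells]

print("original names unique:", len(set(names)) == len(names))
print("combined names unique:", len(set(unique_names)) == len(unique_names))
```

With AnnData itself, ad.concat accepts keys and index_unique arguments that append a per-dataset key to each observation name, which should achieve the same effect at concatenation time.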

Open the sample metadata:

sample_meta = db.Artifact.filter(
    key__endswith="sample_metadata.parquet",
    projects=project_scbasecount,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
).one()
sample_meta
Artifact(uid='WCHkcyWN8L6pDI4E0000', version=None, is_latest=True, key='2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet', description=None, suffix='.parquet', kind='dataset', otype='DataFrame', size=531878, hash='4QrqW8DQVRl6bKNYiJhq3g', n_files=None, n_observations=16077, branch_id=1, space_id=1, storage_id=3, run_id=2, schema_id=None, created_by_id=1, created_at=2025-02-25 20:41:32 UTC, is_locked=False)
sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema
! run input wasn't tracked, call `ln.track()` and re-run
entrez_id: int64
srx_accession: string
file_path: string
obs_count: int64
lib_prep: string
tech_10x: string
cell_prep: string
organism: string
tissue: string
disease: string
perturbation: string
cell_line: string
czi_collection_id: string
czi_collection_name: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1755

Fetch corresponding sample metadata:

filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()

Add the sample metadata to the AnnData:

adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat
/opt/hostedtoolcache/Python/3.12.12/x64/lib/python3.12/functools.py:912: ImplicitModificationWarning: Transforming to str index.
  return dispatch(args[0].__class__)(*args, **kw)
AnnData object with n_obs × n_vars = 38206 × 36601
    obs: 'gene_count', 'umi_count', 'SRX_accession', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
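
The merge is a many-to-one join: every cell row picks up the single sample row that shares its accession. Because pandas merge defaults to an inner join, cells without a matching sample row would be dropped. The plain-Python sketch below makes that behavior explicit, using a few values from this page plus one hypothetical accession that has no sample row.

```python
# per-cell rows (many per accession)
cells = [
    {"gene_count": 2748, "SRX_accession": "SRX10606628"},
    {"gene_count": 2351, "SRX_accession": "SRX10606628"},
    {"gene_count": 1000, "SRX_accession": "SRX_missing"},  # hypothetical, no sample row
]

# per-sample rows (one per accession)
samples = {
    "SRX10606628": {"tissue": "brain", "disease": "Down syndrome"},
}

merged, unmatched = [], []
for cell in cells:
    sample = samples.get(cell["SRX_accession"])
    if sample is None:
        unmatched.append(cell)  # an inner join drops these rows
    else:
        merged.append({**cell, **sample})  # cell columns plus sample columns

print(f"{len(merged)} cells merged, {len(unmatched)} without sample metadata")
```

In the output above, n_obs is 38206 both before and after the merge, so no cells were dropped in this case.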
adata_concat.obs.head()
gene_count umi_count SRX_accession entrez_id srx_accession file_path obs_count lib_prep tech_10x cell_prep organism tissue disease perturbation cell_line czi_collection_id czi_collection_name
0 2748 5134.0 SRX10606628 14083632 SRX10606628 gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
1 2351 4639.0 SRX10606628 14083632 SRX10606628 gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
2 2184 4293.0 SRX10606628 14083632 SRX10606628 gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
3 2469 5307.0 SRX10606628 14083632 SRX10606628 gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None
4 4144 9340.0 SRX10606628 14083632 SRX10606628 gs://arc-scbasecount/2025-02-25/h5ad/GeneFull_... 7641 10x_Genomics 3_prime_gex single_cell Homo sapiens brain Down syndrome CRISPR/Cas9, CRISPRi, or small-molecule inhibi... DS1 None None