##### Arc Virtual Cell Atlas

With 2.5B expression profiles that map to about 600M cells, the Arc
Virtual Cell Atlas is the world's largest collection of uniformly
processed scRNA-seq datasets. Arc distributes the atlas as 460k
parquet and h5ad files totaling 41TB on Google Cloud Storage; see
github.com/ArcInstitute/arc-virtual-cell-atlas. Lamin mirrors the
atlas in a database: lamin.ai/laminlabs/arc-virtual-cell-atlas.

If you use the data academically, please cite the original
publications, Youngblut *et al.* (2025) and Zhang *et al.* (2025).

To query the atlas with "lamindb", you have to install it with the GCP
(Google Cloud Platform) extra. We also recommend configuring the
"bionty" and "pertdb" modules.

 # pip install 'lamindb[gcp]'
 !lamin settings modules set bionty,pertdb

Create the central query object for this instance:

 import lamindb as ln
 import pyarrow.compute as pc

 db = ln.DB("laminlabs/arc-virtual-cell-atlas")

#### Tahoe-100M

Retrieve the fourteen ".h5ad" datasets of the "Tahoe-100M" project:

 tahoe = db.Project.get(name="Tahoe-100M")
 artifacts_tahoe = db.Artifact.filter(projects=tahoe, suffix=".h5ad")
 artifacts_tahoe.to_dataframe()

See the schema and annotations of the first dataset:

 artifact1 = artifacts_tahoe[0]
 artifact1.describe()

You can download an ".h5ad" into your local cache, load it into
memory, or open it for streaming:

 local_filepath = artifact1.cache()  # sync into cache
 adata = artifact1.load()  # sync into cache and load into memory
 with artifact1.open() as adata:  # open for streaming
     ...

You can query the "CellLine" ontology, the "Compound", and the
"CompoundPerturbation" registries via their relationship to
"Artifact". You'll find 50 cell lines:

 db.bionty.CellLine.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()

380 compounds:

 db.pertdb.Compound.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()

1,138 perturbations:

 db.pertdb.CompoundPerturbation.filter(artifacts__in=artifacts_tahoe).distinct().to_dataframe()

###### Query artifacts based on metadata

Let's find which datasets contain A549 cells perturbed with Piroxicam.

 a549 = db.bionty.CellLine.get(name="A549")
 piro = db.pertdb.Compound.get(name="Piroxicam")

 artifacts_a549_piro = artifacts_tahoe.filter(compounds=piro, cell_lines=a549)
 artifacts_a549_piro.to_dataframe()

###### Stream the dataset content

While the artifact metadata tells us which files contain A549 cells
and Piroxicam, we use a parquet file to find the exact cells within
those files. To this end, we open the metadata file with
"pyarrow.Dataset":

 obs_af = db.Artifact.get(key__endswith="obs_metadata.parquet", projects=tahoe)
 obs_af.describe()

The schema of the parquet file maps to the "pyarrow" schema:

 obs_ds = obs_af.open()  # consider using with obs_af.open() as obs_ds
 obs_ds.schema

Streaming speed: Streaming large parquet and h5ad files from cloud
storage crucially depends on where you run your code. It'll be *much*
faster if you run it in the data center that hosts the data. It'll
typically be prohibitively slow if you run it locally. The
"gs://arc-institute-virtual-cell-atlas" storage location is accessible
from any Google Cloud data center in the US with low latency and no
egress fees. If you want to run logic locally, consider caching
datasets prior to opening them for streaming via ".open()":

 local_filepath = obs_af.cache()  # subsequent obs_af.open() will automatically read from the cache
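
To judge whether streaming is practical from where your code runs, it can help to time a small operation before committing to a large scan. The following is a minimal sketch using only the standard library. "timed" is a hypothetical helper, not part of "lamindb", and the toy workload stands in for a call such as "obs_af.open()".

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Hypothetical helper: report wall-clock seconds spent inside the block."""
    result = {}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.2f} s")


# Toy workload; replace it with the call you want to measure,
# for example opening an artifact or running a scanner.
with timed("toy workload") as t:
    total = sum(range(100_000))
```

Comparing the reported time for the same call before and after caching gives a rough idea of how much of the cost is network transfer.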

Let us now query the columns of interest:

 filter_expr = (pc.field("cell_name") == a549.name) & (pc.field("drug") == piro.name)
 obs_df = obs_ds.scanner(filter=filter_expr).to_table().to_pandas()  # materialize the matching rows

Retrieve the corresponding cells:

 plate_cells = obs_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

And their counts:

 adatas = []
 for artifact in artifacts_a549_piro:
     plate_name = artifact.features["plate"].name
     idxs = plate_cells.get(plate_name)
     print(f"loading {len(idxs)} cells from plate {plate_name}")
     with artifact.open() as astore:
         adata = astore[idxs].to_memory()  # can also subset genes here
     adatas.append(adata)

 # this will print something like this
 #> loading 2812 cells from plate plate10
 #> ...
 # continue with concatenating or other processing of the AnnData objects

#### scBaseCount

 scbase = db.Project.get(name="scBaseCount")
 scbase

###### Query artifacts based on metadata

An example query:

 organisms = db.bionty.Organism.lookup()
 tissues = db.bionty.Tissue.lookup()
 efos = db.bionty.ExperimentalFactor.lookup()
 feature_counts = db.ULabel.filter(type__name="STARsolo count features").lookup()

 h5ads_brain = db.Artifact.filter(
     version_tag="2026-01-12",
     suffix=".h5ad",
     projects=scbase,
     organisms=organisms.human,
     ulabels=feature_counts.genefull_ex50pas,
     tissues=tissues.brain,
     experimental_factors=efos.single_cell,
 ).order_by("size").distinct()

 h5ads_brain.to_dataframe()

###### Cache and load datasets into memory

Load the h5ads as a single "AnnData" by caching the datasets,
concatenating them, and loading them into memory:

 adata_concat = h5ads_brain[:5].load()
 adata_concat

Open the sample metadata:

 sample_meta = db.Artifact.get(
     version_tag="2026-01-12",
     key__endswith="sample_metadata.parquet",
     projects=scbase,
     organisms=organisms.human,
     ulabels=feature_counts.genefull_ex50pas,
 )
 sample_meta_dataset = sample_meta.open()
 sample_meta_dataset.schema

Query the corresponding sample metadata:

 filter_expr = pc.field("srx_accession").isin(
     adata_concat.obs["SRX_accession"].astype(str)
 )
 df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()

Add the sample metadata to the "AnnData" object:

 adata_concat.obs = adata_concat.obs.merge(
     df, left_on="SRX_accession", right_on="srx_accession"
 )
 adata_concat
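
One thing to watch with "DataFrame.merge" on columns: pandas returns a fresh integer index, so the barcodes in the index of ".obs" are not carried over, and the default inner join drops rows whose accession has no match. The sketch below shows the behavior on a made-up table and one possible way to keep the index. The accessions, barcodes, and the "tissue" column are hypothetical.

```python
import pandas as pd

# Toy stand-in for adata.obs: barcodes in the index, accession in a column.
obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=pd.Index(["cellA", "cellB", "cellC"], name="barcode"),
)
# Toy stand-in for the sample metadata: one row per accession.
sample_df = pd.DataFrame(
    {"srx_accession": ["SRX1", "SRX2"], "tissue": ["brain", "brain"]}
)

# Plain column merge: the result is indexed 0..n-1, the barcodes are gone.
merged = obs.merge(sample_df, left_on="SRX_accession", right_on="srx_accession")

# Keeping the index: move it into a column, merge, then restore it.
# how="left" keeps every cell in its original order, and validate raises
# if an accession appears more than once in the sample metadata.
merged_keep = (
    obs.reset_index()  # an unnamed index would become a column called "index"
    .merge(
        sample_df,
        how="left",
        left_on="SRX_accession",
        right_on="srx_accession",
        validate="many_to_one",
    )
    .set_index("barcode")
)
```

Whether this matters depends on whether later steps rely on the original observation names.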

See the metadata in the "AnnData":

 adata_concat.obs.head()

###### Explore collections

This project has 135 collections of artifacts (27 organisms x 5 count
features) for the latest version:

 db.Collection.filter(version_tag="2026-01-12", projects=scbase).to_dataframe()

A collection is an immutable set of artifacts. This is useful for
model training or analytical workflows that need a fixed set of
inputs rather than a mutable set of artifacts grouped by a folder or
label annotation.

As a final check, confirm that the Tahoe-100M registries hold the
counts quoted in the Tahoe-100M section:

 assert db.bionty.CellLine.filter(artifacts__in=artifacts_tahoe).distinct().count() == 50
 assert db.pertdb.Compound.filter(artifacts__in=artifacts_tahoe).distinct().count() == 380
 assert (
     db.pertdb.CompoundPerturbation.filter(artifacts__in=artifacts_tahoe)
     .distinct()
     .count()
     == 1138
 )