RxRx: cell imaging
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.
In this guide, you’ll see how to query some of these data using LaminDB.
If you’d like to transfer data into your own LaminDB instance, see the transfer guide.
# !pip install 'lamindb[bionty,jupyter,gcp]' wetlab
!lamin load laminlabs/lamindata
→ connected lamindb: laminlabs/lamindata
import lamindb as ln
import bionty as bt
import wetlab as wl
→ connected lamindb: laminlabs/lamindata
Search & look up metadata
We’ll find all genetic treatments in the GeneticTreatment registry:
df = wl.GeneticTreatment.df()
df.shape
(100, 9)
Let us create a lookup object for siRNAs so that we can auto-complete queries involving them:
sirnas = wl.GeneticTreatment.filter(system="siRNA").lookup(return_field="name")
We’re also interested in cell lines & wells:
cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
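To illustrate what a lookup object provides, here is a minimal sketch. It is not LaminDB's implementation: it only shows the idea of turning record names into valid Python attribute names and mapping each one to the value of a chosen return field. The helper `make_lookup` and the toy records are hypothetical.

```python
import re
from types import SimpleNamespace


def make_lookup(records, key_field, return_field):
    """Hypothetical helper: build an auto-completable namespace from records.

    Each record's `key_field` is normalized into a Python identifier
    (lowercased, runs of non-word characters replaced by underscores)
    and mapped to the record's `return_field` value.
    """
    entries = {}
    for record in records:
        attr = re.sub(r"\W+", "_", record[key_field].strip().lower()).strip("_")
        if attr and attr[0].isdigit():
            attr = f"_{attr}"  # identifiers cannot start with a digit
        entries[attr] = record[return_field]
    return SimpleNamespace(**entries)


# Illustrative records only; the real registries live in the LaminDB instance.
toy_cell_lines = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "U-2 OS cell", "abbr": "U2OS"},
]
toy_lookup = make_lookup(toy_cell_lines, key_field="name", return_field="abbr")
print(toy_lookup.hep_g2_cell)
```

With a namespace like this, typing `toy_lookup.` in an interactive session lets tab completion list the available entries, which is the convenience the lookups above rely on.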
Load the collection
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; each image has dimensions 512x512x6.
Let us get the corresponding object and some information about it:
collection = ln.Collection.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()
Collection(uid='Br2Z1lVSQBAkkbbt7ILu', version='1', is_latest=True, name='Annotated RxRx1 images', hash='dycM8ypgnRRF9zXLSeD_', meta_artifact=Artifact(uid='hOK9CAKy93bwcVjLyM8r', is_latest=True, description='Metadata with file paths for each RxRx1 image.', key='rxrx1/metadata.parquet', suffix='.parquet', size=5722212, hash='hsO3u4SA5AnttXTxwxttzg', _hash_type='md5', _accessor='DataFrame', visibility=1, _key_is_virtual=True, storage_id=2, transform_id=105, run_id=227, created_by_id=2, created_at=2024-06-26 09:59:21 UTC), visibility=1, created_at=2024-06-17 12:43:02 UTC)
Provenance
.created_by = 'sunnyosun'
.transform = 'Ingest the RxRx1 dataset'
.run = '2024-06-17 12:31:43 UTC'
.meta_artifact = 'Metadata with file paths for each RxRx1 image.'
The dataset consists of a metadata file and a folder path pointing to the image files:
collection.meta_artifact.load().head()
! run input wasn't tracked, call `ln.track()` and re-run
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
| 1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
| 2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
| 3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
| 4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
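The `path` column appears to encode the split, experiment, plate, well, site, and channel of each image. Here is a small sketch of parsing it, assuming the layout `images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png` inferred from the rows above holds for every row. `parse_image_path` is a hypothetical helper, not part of LaminDB.

```python
import re

# Pattern inferred from the rows shown above; an assumption, not a documented spec.
PATH_PATTERN = re.compile(
    r"^images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]+\d+)_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_path(path: str) -> dict:
    """Split an image path into its metadata fields, raising on unknown layouts."""
    match = PATH_PATTERN.match(path)
    if match is None:
        raise ValueError(f"unexpected path layout: {path!r}")
    fields = match.groupdict()
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


example = parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w3.png")
# Rebuild the site identifier in the form used by the `site_id` column.
site_id = "{experiment}_{plate}_{well}_{site}".format(**example)
print(example, site_id)
```

Parsing the path this way gives the channel index (`w1` to `w6`), which is not available as a separate column in the metadata table.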
Query image files
Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the dataset's metadata file:
df = collection.meta_artifact.load()
! run input wasn't tracked, call `ln.track()` and re-run
We can query a subset of images using metadata registries & pandas query syntax:
query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3066 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1.png |
| 3067 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w2.png |
| 3068 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w3.png |
| 3069 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w4.png |
| 3070 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w5.png |
| 3071 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w6.png |
To access the individual images based on this query result:
collection.data_artifact.storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
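The f-string join above works because the storage root has no trailing slash and the keys have no leading slash. If that cannot be guaranteed, a slightly more defensive join could look like the sketch below; `join_storage_key` is a hypothetical helper, not a LaminDB function.

```python
def join_storage_key(root: str, key: str) -> str:
    """Join a storage root and a relative key with exactly one slash between them."""
    return f"{root.rstrip('/')}/{key.lstrip('/')}"


root = "gs://rxrx1-europe-west4"
key = "images/test/HEPG2-08/Plate1/M15_s2_w1.png"

# A trailing slash on the root or a leading slash on the key gives the same URL.
url = join_storage_key(root + "/", key)
same_url = join_storage_key(root, "/" + key)
print(url)
```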
Download an image to disk:
path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image
Image(f"./{path.name}")
Use DuckDB to query metadata
As an alternative to pandas, we could use DuckDB to query image metadata.
import duckdb  # pip install duckdb
features = ln.Feature.lookup(return_field="name")
# Build a SQL filter expression from the looked-up feature names and values
# (named filter_expr to avoid shadowing the built-in `filter`)
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)
region = ln.setup.settings.storage.region
# Open the metadata parquet file as a DuckDB relation
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + f"?s3_region={region}"
)
parquet_data.filter(filter_expr)
┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│ site_id │ well_id │ cell_line │ split │ experiment │ plate │ well │ site │ well_type │ sirna │ sirna_id │ path │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int64 │ varchar │ int64 │ varchar │ varchar │ int64 │ varchar │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ HEPG2-08 │ 1 │ M15 │ 2 │ positive_control │ s15652 │ 1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
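The filter expression above is assembled by hand from f-strings. As a sketch of an alternative, the hypothetical helper `build_filter` below builds the same kind of expression from a dictionary, quoting column names and escaping string values. It is an illustration under those assumptions, not part of DuckDB or LaminDB.

```python
def build_filter(conditions: dict) -> str:
    """Hypothetical helper: AND together `column = value` comparisons as SQL text."""
    clauses = []
    for column, value in conditions.items():
        # Double-quote the identifier, doubling any embedded double quotes.
        identifier = '"' + column.replace('"', '""') + '"'
        if isinstance(value, bool):
            literal = "TRUE" if value else "FALSE"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Escape single quotes by doubling them, as in standard SQL.
            literal = "'" + str(value).replace("'", "''") + "'"
        clauses.append(f"{identifier} = {literal}")
    return " AND ".join(clauses)


filter_sql = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(filter_sql)
```

The resulting string could then be passed to `parquet_data.filter(...)` in place of the hand-written expression. Here the integer columns `plate` and `site` are compared against numeric literals rather than quoted strings.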