RxRx: cell imaging¶
rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.
Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro based on morphology and protein expression (5-10 stains) across a range of conditions.
In this guide, you’ll see how to query some of these data using LaminDB.
If you’d like to transfer data into your own LaminDB instance, see the transfer guide.
Setup¶
import lamindb as ln
import bionty as bt
import wetlab as wl
ln.connect("laminlabs/lamindata")
Search & look up metadata¶
We’ll find all treatments in the Treatment registry:
df = wl.Treatment.df()
df.shape
(1139, 12)
Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:
sirnas = wl.Treatment.filter(system="siRNA").lookup(return_field="name")
We’re also interested in cell lines & wells:
cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
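To make the idea of a lookup object concrete, here is a minimal sketch in plain Python. It is not LaminDB's implementation: `to_attribute`, `make_lookup`, and the two records are hypothetical, and the normalization rule (lowercase, non-alphanumerics replaced by underscores) is an assumption chosen so that a name like "Hep G2 cell" becomes the attribute `hep_g2_cell`.

```python
import re
from types import SimpleNamespace


def to_attribute(name: str) -> str:
    """Turn an arbitrary name into a lowercase, identifier-like attribute name."""
    # Replace every run of non-alphanumeric characters with a single underscore
    attr = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()
    # Python identifiers cannot start with a digit, so prefix an underscore
    return f"_{attr}" if attr[:1].isdigit() else attr


def make_lookup(records, key_field, return_field):
    """Map normalized `key_field` values to the matching `return_field` values."""
    mapping = {to_attribute(r[key_field]): r[return_field] for r in records}
    return SimpleNamespace(**mapping)


# Hypothetical records for illustration, not pulled from the registry
cell_line_records = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "HUVEC cell", "abbr": "HUVEC"},
]
cell_lines_demo = make_lookup(cell_line_records, key_field="name", return_field="abbr")
print(cell_lines_demo.hep_g2_cell)
```

Because the values are plain attributes, an editor or notebook can tab-complete them, which is the convenience the lookup objects above provide.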
Load the collection¶
This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains. The image dimension is 512x512x6.
Let us get the corresponding object and some information about it:
collection = ln.Collection.filter(uid="Br2Z1lVSQBAkkbbt7ILu").one()
collection.view_lineage()
collection.describe()
Collection(uid='Br2Z1lVSQBAkkbbt7ILu', version='1', name='Annotated RxRx1 images', hash='dycM8ypgnRRF9zXLSeD_', visibility=1, updated_at='2024-06-17 12:43:02 UTC')
Provenance
.created_by = 'sunnyosun'
.transform = 'Ingest the RxRx1 dataset'
.run = '2024-06-17 11:33:27 UTC'
.artifact = None
Feature sets
'columns' = 'path', 'well_id', 'plate', 'well', 'site', 'well_type', 'sirna', 'sirna_id', 'experiment', 'cell_line', 'split'
The dataset consists of a metadata file and a folder path pointing to the image files:
collection.artifact.load().head()
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w1.png |
| 1 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w2.png |
| 2 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w3.png |
| 3 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w4.png |
| 4 | HEPG2-08_1_B02_1 | HEPG2-08_1_B02 | HEPG2 | test | HEPG2-08 | 1 | B02 | 1 | negative_control | EMPTY | 1138 | images/test/HEPG2-08/Plate1/B02_s1_w5.png |
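The path column appears to encode several of the other columns. The sketch below parses one path with a regular expression and rebuilds the site_id from its parts. The layout `images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png` is inferred from the rows shown here, and reading the `w` suffix as a channel index is an assumption; `parse_image_path` is a hypothetical helper.

```python
import re

# Assumed layout, inferred from the rows shown in the metadata table:
# images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]+\d+)_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_image_path(path: str) -> dict:
    """Split an image path into its metadata fields."""
    match = PATH_PATTERN.search(path)
    if match is None:
        raise ValueError(f"Unexpected path layout: {path}")
    fields = match.groupdict()
    # Numeric fields are easier to compare as integers
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


fields = parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
# Rebuild the site_id column from the parsed parts
site_id = f"{fields['experiment']}_{fields['plate']}_{fields['well']}_{fields['site']}"
print(site_id)  # HEPG2-08_1_B02_1
```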
Query image files¶
Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:
df = collection.artifact.load()
We can query a subset of images using the lookup objects & pandas boolean indexing:
query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
| | site_id | well_id | cell_line | split | experiment | plate | well | site | well_type | sirna | sirna_id | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3066 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w1.png |
| 3067 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w2.png |
| 3068 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w3.png |
| 3069 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w4.png |
| 3070 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w5.png |
| 3071 | HEPG2-08_1_M15_2 | HEPG2-08_1_M15 | HEPG2 | test | HEPG2-08 | 1 | M15 | 2 | positive_control | s15652 | 1114 | images/test/HEPG2-08/Plate1/M15_s2_w6.png |
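The six rows of this result share one site_id and differ only in the w1 to w6 suffix of path, so a site can be treated as a group of files. The following sketch groups paths by site_id using plain dictionaries. `group_paths_by_site` is a hypothetical helper, and the rows are rebuilt by hand from the result above rather than loaded from the metadata file.

```python
from collections import defaultdict

# Rows rebuilt by hand from the query result above (site_id and path only)
rows = [
    {
        "site_id": "HEPG2-08_1_M15_2",
        "path": f"images/test/HEPG2-08/Plate1/M15_s2_w{index}.png",
    }
    for index in range(1, 7)
]


def group_paths_by_site(rows):
    """Collect the image paths that belong to each site_id."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["site_id"]].append(row["path"])
    # Sort so that files appear in w1..w6 order within each site
    return {site_id: sorted(paths) for site_id, paths in grouped.items()}


paths_by_site = group_paths_by_site(rows)
print(len(paths_by_site["HEPG2-08_1_M15_2"]))  # 6
```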
To access the individual images based on this query result:
collection.artifacts.df()
| id | uid | version | description | key | suffix | type | accessor | size | hash | hash_type | n_objects | n_observations | visibility | key_is_virtual | storage_id | transform_id | run_id | created_by_id | updated_at |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 780 | WACLWuJySSYLBFTMSYG1 | None | RxRx1 image files | images/test/HEPG2-08 | | None | None | 994441606 | 6r5Hkce0UTy7X6gLeaqzBA | md5-d | 14772 | None | 1 | False | 4 | 105 | 227 | 9 | 2024-06-17 12:33:01.392706+00:00 |
collection.artifacts[0].storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.artifacts[0].storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
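The list comprehension above joins the storage root and each key with a single slash, which works because the root has no trailing slash and the keys have no leading one. A slightly more defensive variant, shown here as a sketch with a hypothetical helper `join_storage_url`, normalizes both sides first:

```python
def join_storage_url(root: str, key: str) -> str:
    """Join a storage root and a relative key with exactly one slash."""
    return f"{root.rstrip('/')}/{key.lstrip('/')}"


root = "gs://rxrx1-europe-west4"
key = "images/test/HEPG2-08/Plate1/M15_s2_w1.png"
print(join_storage_url(root, key))
# gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png
```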
Download an image to disk:
path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image
Image(f"./{path.name}")
![_images/e9ab80eeba21bdcf86c18651e2665c5a5406cd56b4860eaa76eb961fa3a225fd.png](_images/e9ab80eeba21bdcf86c18651e2665c5a5406cd56b4860eaa76eb961fa3a225fd.png)
Use DuckDB to query metadata¶
As an alternative to pandas, we could use DuckDB to query image metadata.
import duckdb # pip install duckdb
features = ln.Feature.lookup(return_field="name")
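# Build a SQL filter expression from the lookup values
# (named filter_expr to avoid shadowing Python's built-in filter)
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)
# Read the metadata Parquet file directly from cloud storage
region = ln.setup.settings.storage.region
parquet_data = duckdb.from_parquet(
    collection.artifact.path.as_posix() + f"?s3_region={region}"
)
parquet_data.filter(filter_expr)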
┌──────────────────┬────────────────┬───────────┬─────────┬───┬─────────┬──────────┬──────────────────────┐
│ site_id │ well_id │ cell_line │ split │ … │ sirna │ sirna_id │ path │
│ varchar │ varchar │ varchar │ varchar │ │ varchar │ int64 │ varchar │
├──────────────────┼────────────────┼───────────┼─────────┼───┼─────────┼──────────┼──────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2 │ test │ … │ s15652 │ 1114 │ images/test/HEPG2-… │
├──────────────────┴────────────────┴───────────┴─────────┴───┴─────────┴──────────┴──────────────────────┤
│ 6 rows 12 columns (7 shown) │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────┘
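Writing the filter expression by hand gets error-prone as the number of conditions grows. As a sketch, the expression could instead be assembled from a dictionary of column-value pairs. `sql_literal` and `build_filter` are hypothetical helpers, not part of DuckDB or LaminDB, and the values below are typed in by hand rather than taken from the lookup objects.

```python
def sql_literal(value) -> str:
    """Render a Python value as a SQL literal (numbers bare, the rest quoted)."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return str(value)
    # Escape embedded single quotes by doubling them, as SQL requires
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(conditions: dict) -> str:
    """Combine column == value pairs into one AND-ed SQL expression."""
    parts = []
    for column, value in conditions.items():
        # Double-quote the identifier and escape embedded double quotes
        identifier = '"' + column.replace('"', '""') + '"'
        parts.append(f"{identifier} = {sql_literal(value)}")
    return " AND ".join(parts)


expression = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(expression)
# "cell_line" = 'HEPG2' AND "sirna" = 's15652' AND "well" = 'M15' AND "plate" = 1 AND "site" = 2
```

The resulting string can be passed to the relation's `filter` method in the same way as the hand-written expression.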