RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

A large number of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

In this guide, you’ll see how to query some of these data using LaminDB.
If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

# !pip install 'lamindb[bionty,jupyter,gcp]' wetlab
!lamin connect laminlabs/lamindata

→ connected lamindb: laminlabs/lamindata

import lamindb as ln
import bionty as bt
import wetlab as wl

→ connected lamindb: laminlabs/lamindata

Search & look up metadata

We’ll find all genetic treatments in the GeneticPerturbation registry:

df = wl.GeneticPerturbation.df()
df.shape

(100, 13)

Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:

sirnas = wl.GeneticPerturbation.filter(system="siRNA").lookup(return_field="name")

We’re also interested in cell lines & wells:

cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
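To make the idea of a lookup object concrete, here is a toy sketch of how attribute-style access with a `return_field` could behave. It is not LaminDB's implementation: the helper `make_lookup`, the dictionary records, and the record name "Hep G2 cell" are assumptions made up for illustration.

```python
import re
from types import SimpleNamespace


def make_lookup(records, key_field, return_field):
    """Toy stand-in for a lookup object (hypothetical helper).

    Attribute names are derived from `key_field`; attribute values
    are taken from `return_field`.
    """
    entries = {}
    for record in records:
        # Lowercase the key and collapse non-alphanumeric runs into "_"
        attr = re.sub(r"[^0-9a-z]+", "_", record[key_field].lower()).strip("_")
        entries[attr] = record[return_field]
    return SimpleNamespace(**entries)


# Hypothetical record; the "name" value is assumed for illustration only
records = [{"name": "Hep G2 cell", "abbr": "HEPG2"}]
cell_lines_demo = make_lookup(records, key_field="name", return_field="abbr")
print(cell_lines_demo.hep_g2_cell)
```

The point of such an object is that an editor can tab-complete `cell_lines_demo.` while the expression evaluates to the plain string stored in the metadata.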

Load the collection

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains. The image dimension is 512x512x6.

Let us get the corresponding object and some information about it:

collection = ln.Collection.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.meta_artifact.load().head()

! run input wasn't tracked, call `ln.track()` and re-run

	site_id	well_id	cell_line	split	experiment	plate	well	site	well_type	sirna	sirna_id	path
0	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w1.png
1	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w2.png
2	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w3.png
3	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w4.png
4	HEPG2-08_1_B02_1	HEPG2-08_1_B02	HEPG2	test	HEPG2-08	1	B02	1	negative_control	EMPTY	1138	images/test/HEPG2-08/Plate1/B02_s1_w5.png
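In the rows above, each `path` appears to encode the same fields as the other columns (split, experiment, plate, well, site) plus a channel index after `_w`. The sketch below parses a path back into those fields. The pattern is inferred only from the rows shown here, so treat it as an assumption; `PATH_PATTERN` and `parse_image_path` are hypothetical names.

```python
import re

# Layout inferred from the example rows; an assumption, not a documented spec
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png"
)


def parse_image_path(path):
    """Split an image path into its metadata fields (hypothetical helper)."""
    match = PATH_PATTERN.fullmatch(path)
    if match is None:
        raise ValueError(f"unexpected path layout: {path}")
    fields = match.groupdict()
    # Numeric parts are stored as integers, like the plate and site columns
    for key in ("plate", "site", "channel"):
        fields[key] = int(fields[key])
    return fields


parsed = parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w1.png")
# Rebuild the site_id in the form seen in the first column of the table
site_id = f"{parsed['experiment']}_{parsed['plate']}_{parsed['well']}_{parsed['site']}"
print(parsed, site_id)
```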

Query image files

Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:

df = collection.meta_artifact.load()

! run input wasn't tracked, call `ln.track()` and re-run

We can query a subset of images using values from the metadata registries & pandas boolean indexing:

query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query

	site_id	well_id	cell_line	split	experiment	plate	well	site	well_type	sirna	sirna_id	path
3066	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w1.png
3067	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w2.png
3068	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w3.png
3069	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w4.png
3070	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w5.png
3071	HEPG2-08_1_M15_2	HEPG2-08_1_M15	HEPG2	test	HEPG2-08	1	M15	2	positive_control	s15652	1114	images/test/HEPG2-08/Plate1/M15_s2_w6.png
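The same kind of selection can also be written as a string expression with `DataFrame.query`. The sketch below runs on a small toy frame rather than the real metadata file: the three rows are made up to mirror the columns above, and the filter values are hard-coded instead of coming from the lookup objects.

```python
import pandas as pd

# Toy stand-in for the loaded metadata; rows are illustrative only
toy = pd.DataFrame(
    {
        "cell_line": ["HEPG2", "HEPG2", "HEPG2"],
        "sirna": ["s15652", "s15652", "EMPTY"],
        "well": ["M15", "M15", "B02"],
        "plate": [1, 1, 1],
        "site": [2, 1, 1],
    }
)

# Boolean-mask selection, as in the cell above
by_mask = toy[
    (toy.cell_line == "HEPG2")
    & (toy.sirna == "s15652")
    & (toy.well == "M15")
    & (toy.plate == 1)
    & (toy.site == 2)
]

# Equivalent selection written as a query string
by_query = toy.query(
    "cell_line == 'HEPG2' and sirna == 's15652' and well == 'M15' "
    "and plate == 1 and site == 2"
)

print(by_mask.equals(by_query))
```

Both forms select the same rows; the string form can be easier to read when many conditions are combined.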

To access the individual images from this query result, prepend the storage root to each path:

collection.data_artifact.storage.root

'gs://rxrx1-europe-west4'

images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images

['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
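Since each site is stored as one file per channel, it can be handy to group URLs by site and order them by channel index before loading them. The following is a minimal sketch, assuming the `_w<channel>.png` suffix seen in the list above; `group_channels` is a hypothetical helper, not part of LaminDB.

```python
import re
from collections import defaultdict


def group_channels(urls):
    """Map each site prefix to its channel URLs, sorted by channel index."""
    groups = defaultdict(dict)
    for url in urls:
        # Assumes the file name ends in _w<channel>.png, as in the list above
        match = re.fullmatch(r"(?P<site>.+)_w(?P<channel>\d+)\.png", url)
        if match is None:
            raise ValueError(f"unexpected file name: {url}")
        groups[match["site"]][int(match["channel"])] = url
    return {
        site: [channels[index] for index in sorted(channels)]
        for site, channels in groups.items()
    }


root = "gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1"
# Deliberately out of order to show the sorting
urls = [f"{root}/M15_s2_w3.png", f"{root}/M15_s2_w1.png", f"{root}/M15_s2_w2.png"]
grouped = group_channels(urls)
print(grouped)
```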

Download an image to disk:

path = ln.UPath(images[1])
path.download_to(".")

from IPython.display import Image

Image(f"./{path.name}")


Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb  # pip install duckdb

features = ln.Feature.lookup(return_field="name")

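# Build the filter expression from registry lookups
# (named filter_expr to avoid shadowing Python's built-in `filter`)
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)

# Read the parquet metadata file directly from cloud storage
region = ln.setup.settings.storage.region
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + f"?s3_region={region}"
)

parquet_data.filter(filter_expr)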

┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│     site_id      │    well_id     │ cell_line │  split  │ experiment │ plate │  well   │ site  │    well_type     │  sirna  │ sirna_id │                   path                    │
│     varchar      │    varchar     │  varchar  │ varchar │  varchar   │ int64 │ varchar │ int64 │     varchar      │ varchar │  int64   │                  varchar                  │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
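When the filter has many conditions, assembling the expression from a dictionary can reduce quoting mistakes. The sketch below is one possible way to do that: `build_filter` is a hypothetical helper, and the quoting rules (double-quoted column names, single-quoted strings with doubled quotes, unquoted numbers) follow standard SQL conventions rather than anything specific to this guide.

```python
def build_filter(conditions):
    """Join column/value pairs into a SQL filter expression (hypothetical helper)."""
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, bool):
            raise TypeError("boolean values are not handled in this sketch")
        if isinstance(value, (int, float)):
            # Numbers are emitted without quotes
            literal = str(value)
        else:
            # Strings are single-quoted; embedded quotes are doubled
            literal = "'" + str(value).replace("'", "''") + "'"
        # Column names are double-quoted, with embedded quotes doubled
        identifier = '"' + column.replace('"', '""') + '"'
        clauses.append(f"{identifier} = {literal}")
    return " AND ".join(clauses)


expression = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(expression)
```

The resulting string is meant to be passed to a relation's `filter` method in place of a hand-written f-string.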