RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

  • In this guide, you’ll see how to query some of these data using LaminDB: laminlabs/rxrx.

  • If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

  • If you’d like to understand how the laminlabs/rxrx instance was curated, see this repository.

Setup

import lamindb as ln
import bionty as bt
import wetlab as wl

ln.connect("laminlabs/lamindata")

Search & look up metadata

We’ll find all treatments in the Treatment registry:

df = wl.Treatment.df()
df.shape
(1139, 12)

Let us create a lookup object for siRNAs so that we can auto-complete queries that involve them:

sirnas = wl.Treatment.filter(system="siRNA").lookup(return_field="name")

We’re also interested in cell lines & wells:

cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
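
To illustrate what a lookup object provides, here is a minimal sketch in plain Python. It is not LaminDB's implementation: the helper `make_lookup`, the normalization rule, and the toy records are assumptions chosen only to mimic attribute access such as `cell_lines.hep_g2_cell`.

```python
import re
from types import SimpleNamespace


def make_lookup(records, key_field, return_field):
    """Toy lookup (hypothetical helper): normalized key values become attributes."""
    entries = {}
    for record in records:
        # Lowercase, then replace runs of non-alphanumeric characters with "_"
        attr = re.sub(r"[^0-9a-z]+", "_", record[key_field].lower()).strip("_")
        entries[attr] = record[return_field]
    return SimpleNamespace(**entries)


# Toy records for illustration only, not fetched from any registry
toy_cell_lines = [
    {"name": "Hep G2 cell", "abbr": "HEPG2"},
    {"name": "U2OS cell", "abbr": "U2OS"},
]

cell_lines_demo = make_lookup(toy_cell_lines, key_field="name", return_field="abbr")
print(cell_lines_demo.hep_g2_cell)
```

The attribute name is what an editor can auto-complete; the returned value is what ends up in the query.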

Load the collection

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimension is 512x512x6.

Let us get the corresponding object and some information about it:

collection = ln.Collection.filter(uid="Br2Z1lVSQBAkkbbt7ILu").one()
collection.view_lineage()
collection.describe()
Collection(uid='Br2Z1lVSQBAkkbbt7ILu', version='1', name='Annotated RxRx1 images', hash='dycM8ypgnRRF9zXLSeD_', visibility=1, updated_at='2024-06-17 12:43:02 UTC')
  Provenance
    .created_by = 'sunnyosun'
    .transform = 'Ingest the RxRx1 dataset'
    .run = '2024-06-17 11:33:27 UTC'
    .artifact = None
  Feature sets
    'columns' = 'path', 'well_id', 'plate', 'well', 'site', 'well_type', 'sirna', 'sirna_id', 'experiment', 'cell_line', 'split'

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.artifact.load().head()
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id path
0 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w1.png
1 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w2.png
2 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w3.png
3 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w4.png
4 HEPG2-08_1_B02_1 HEPG2-08_1_B02 HEPG2 test HEPG2-08 1 B02 1 negative_control EMPTY 1138 images/test/HEPG2-08/Plate1/B02_s1_w5.png
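
The path column encodes most of the other columns. As a rough sketch, assuming every path follows the pattern visible in the rows above (`images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png`), the components can be recovered with a regular expression. The function name `parse_image_path` and the reading of the `w` suffix as the imaging channel are assumptions.

```python
import re

# Pattern inferred from the example rows; other layouts are not handled
PATH_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png"
)


def parse_image_path(path):
    """Split a metadata path into its components (hypothetical helper)."""
    match = PATH_PATTERN.fullmatch(path)
    if match is None:
        raise ValueError(f"Unexpected path layout: {path}")
    parts = match.groupdict()
    # Convert the numeric components from strings to integers
    for field in ("plate", "site", "channel"):
        parts[field] = int(parts[field])
    return parts


print(parse_image_path("images/test/HEPG2-08/Plate1/B02_s1_w3.png"))
```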

Query image files

Because we chose not to register each image as a record in the Artifact registry, we have to query the images through the metadata file of the dataset:

df = collection.artifact.load()

We can query a subset of images using metadata registries & pandas query syntax:

query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
site_id well_id cell_line split experiment plate well site well_type sirna sirna_id path
3066 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w1.png
3067 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w2.png
3068 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w3.png
3069 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w4.png
3070 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w5.png
3071 HEPG2-08_1_M15_2 HEPG2-08_1_M15 HEPG2 test HEPG2-08 1 M15 2 positive_control s15652 1114 images/test/HEPG2-08/Plate1/M15_s2_w6.png
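
The same kind of filter can be tried on a toy table without connecting to the instance. The sketch below builds a small DataFrame with made-up rows (the column names follow the metadata table, the values are illustrative only) and shows that a boolean mask and `DataFrame.query` select the same rows.

```python
import pandas as pd

# Toy metadata, not taken from the real dataset
toy = pd.DataFrame(
    {
        "cell_line": ["HEPG2", "HEPG2", "HEPG2", "U2OS"],
        "sirna": ["s15652", "s15652", "EMPTY", "s15652"],
        "well": ["M15", "M15", "B02", "M15"],
        "plate": [1, 1, 1, 1],
        "site": [2, 1, 2, 2],
    }
)

# Boolean-mask filtering, as used in the guide
mask = (
    (toy.cell_line == "HEPG2")
    & (toy.sirna == "s15652")
    & (toy.well == "M15")
    & (toy.plate == 1)
    & (toy.site == 2)
)
by_mask = toy[mask]

# Equivalent filter written as a query string
by_query = toy.query(
    "cell_line == 'HEPG2' and sirna == 's15652' and well == 'M15' "
    "and plate == 1 and site == 2"
)
print(by_query)
```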

To access the individual images based on this query result:

collection.artifacts.df()
uid version description key suffix type accessor size hash hash_type n_objects n_observations visibility key_is_virtual storage_id transform_id run_id created_by_id updated_at
id
780 WACLWuJySSYLBFTMSYG1 None RxRx1 image files images/test/HEPG2-08 None None 994441606 6r5Hkce0UTy7X6gLeaqzBA md5-d 14772 None 1 False 4 105 227 9 2024-06-17 12:33:01.392706+00:00

The collection links a single folder artifact. Because the values in the path column already start with images/test/, we join them with the storage root rather than with the parent of the artifact's folder path, which would duplicate that prefix:

root = ln.UPath(collection.artifacts[0].storage.root)
images = [root / key for key in query.path]
images
[GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png'),
 GCSPath('gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png')]

Download an image to disk:

from IPython.display import Image

path = ln.UPath(images[1])
path.download_to(".")  # download into the current working directory
Image(f"./{path.name}")

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb

# Assumption: the metadata file is the collection's own artifact, as loaded into df above
artifact = collection.artifact

# Column names follow the metadata table; plate and site are numeric columns
condition = (
    f"cell_line = '{cell_lines.hep_g2_cell}' AND sirna = '{sirnas.s15652}' "
    f"AND well = '{wells.m15}' AND plate = 1 AND site = 2"
)

parquet_data = duckdb.from_parquet(artifact.path.as_posix())

parquet_data.filter(condition)