RxRx: cell imaging

rxrx.ai hosts high-throughput cell imaging datasets generated by Recursion.

Large numbers of fluorescence microscopy images characterize cellular phenotypes in vitro, based on morphology and protein expression (5-10 stains), across a range of conditions.

  • In this guide, you’ll see how to query some of these data using LaminDB.

  • If you’d like to transfer data into your own LaminDB instance, see the transfer guide.

# !pip install 'lamindb[bionty,jupyter,gcp]' wetlab
!lamin connect laminlabs/lamindata
 connected lamindb: laminlabs/lamindata
import lamindb as ln
import bionty as bt
import wetlab as wl
 connected lamindb: laminlabs/lamindata

Search & look up metadata

We’ll find all genetic treatments in the GeneticPerturbation registry:

df = wl.GeneticPerturbation.df()
df.shape
(100, 13)

Let us create a lookup object for siRNAs so that we can easily auto-complete queries involving them:

sirnas = wl.GeneticPerturbation.filter(system="siRNA").lookup(return_field="name")

We’re also interested in cell lines & wells:

cell_lines = bt.CellLine.lookup(return_field="abbr")
wells = wl.Well.lookup(return_field="name")
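
To make the idea of a lookup object concrete, here is a minimal pure-Python sketch of the general technique: record names are normalized into valid attribute names so that an editor can auto-complete them. The helper `make_lookup` and its normalization rule are assumptions for illustration; this is not LaminDB's implementation, and the attribute names it produces may differ from the ones LaminDB generates.

```python
import re
from types import SimpleNamespace


def make_lookup(names):
    """Map normalized attribute names to the original names.

    Hypothetical helper: the normalization rule below is an assumption.
    """
    fields = {}
    for name in names:
        # Lowercase and replace every run of non-alphanumeric characters with "_".
        key = re.sub(r"[^0-9a-z]+", "_", name.lower()).strip("_")
        # Attribute names cannot be empty or start with a digit.
        if not key or key[0].isdigit():
            key = f"n_{key}"
        fields[key] = name
    return SimpleNamespace(**fields)


wells = make_lookup(["B02", "M15"])
sirnas = make_lookup(["s15652", "EMPTY"])
print(wells.m15, sirnas.s15652)
```

With such an object, typing `wells.` followed by Tab lists the available wells, which avoids typos in hand-written string literals.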

Load the collection

This is RxRx1: 125k images for 1138 siRNA perturbations across 4 cell lines, reading out 5 stains; the image dimension is 512x512x6.

Let us get the corresponding object and some information about it:

collection = ln.Collection.get("Br2Z1lVSQBAkkbbt7ILu")
collection.view_lineage()
collection.describe()
[Figure: data lineage graph of the collection]
Collection 
└── General
    ├── .uid = 'Br2Z1lVSQBAkkbbt7ILu'
    ├── .key = 'Annotated RxRx1 images'
    ├── .hash = 'dycM8ypgnRRF9zXLSeD_'
    ├── .version = '1'
    ├── .created_by = sunnyosun (Sunny Sun)
    ├── .created_at = 2024-06-17 12:43:02
    └── .transform = 'Ingest the RxRx1 dataset'

The dataset consists of a metadata file and a folder path pointing to the image files:

collection.meta_artifact.load().head()
! run input wasn't tracked, call `ln.track()` and re-run
            site_id         well_id  cell_line  split  experiment  plate  well  site         well_type  sirna  sirna_id                                       path
0  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w1.png
1  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w2.png
2  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w3.png
3  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w4.png
4  HEPG2-08_1_B02_1  HEPG2-08_1_B02      HEPG2   test    HEPG2-08      1   B02     1  negative_control  EMPTY      1138  images/test/HEPG2-08/Plate1/B02_s1_w5.png

Query image files

Because we chose not to register each image as a record in the Artifact registry, we query the images through the dataset's metadata file:

df = collection.meta_artifact.load()
! run input wasn't tracked, call `ln.track()` and re-run

We can query a subset of images using metadata registries & pandas query syntax:

query = df[
    (df.cell_line == cell_lines.hep_g2_cell)
    & (df.sirna == sirnas.s15652)
    & (df.well == wells.m15)
    & (df.plate == 1)
    & (df.site == 2)
]
query
               site_id         well_id  cell_line  split  experiment  plate  well  site         well_type   sirna  sirna_id                                       path
3066  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w1.png
3067  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w2.png
3068  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w3.png
3069  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w4.png
3070  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w5.png
3071  HEPG2-08_1_M15_2  HEPG2-08_1_M15      HEPG2   test    HEPG2-08      1   M15     2  positive_control  s15652      1114  images/test/HEPG2-08/Plate1/M15_s2_w6.png

To access the individual images based on this query result:

collection.data_artifact.storage.root
'gs://rxrx1-europe-west4'
images = [f"{collection.data_artifact.storage.root}/{key}" for key in query.path]
images
['gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w1.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w2.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w3.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w4.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w5.png',
 'gs://rxrx1-europe-west4/images/test/HEPG2-08/Plate1/M15_s2_w6.png']
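
The image keys appear to encode the same metadata as the table columns. The sketch below parses a key back into its fields and joins it with a storage root. The naming convention `images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png` is inferred from the paths shown above rather than taken from a specification, and `parse_key` and `to_uri` are hypothetical helpers.

```python
import re

# Assumed naming convention, inferred from the paths above:
# images/<split>/<experiment>/Plate<plate>/<well>_s<site>_w<channel>.png
KEY_PATTERN = re.compile(
    r"images/(?P<split>[^/]+)/(?P<experiment>[^/]+)/Plate(?P<plate>\d+)/"
    r"(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d+)\.png$"
)


def parse_key(key):
    """Split an image key into its metadata fields (hypothetical helper)."""
    match = KEY_PATTERN.search(key)
    if match is None:
        raise ValueError(f"unexpected image key: {key}")
    fields = match.groupdict()
    # Convert the numeric fields from strings to integers.
    for name in ("plate", "site", "channel"):
        fields[name] = int(fields[name])
    return fields


def to_uri(root, key):
    """Join a storage root and a key with exactly one slash between them."""
    return f"{root.rstrip('/')}/{key.lstrip('/')}"


key = "images/test/HEPG2-08/Plate1/M15_s2_w1.png"
info = parse_key(key)
uri = to_uri("gs://rxrx1-europe-west4", key)
print(info["well"], info["site"], info["channel"])
print(uri)
```

Parsing keys this way can serve as a sanity check that a path and its metadata row agree, for example that the `w<channel>` suffix runs from 1 to 6 for each site.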

Download an image to disk:

path = ln.UPath(images[1])
path.download_to(".")
from IPython.display import Image

Image(f"./{path.name}")
[Figure: the downloaded image, M15_s2_w2.png]

Use DuckDB to query metadata

As an alternative to pandas, we could use DuckDB to query image metadata.

import duckdb  # pip install duckdb

features = ln.Feature.lookup(return_field="name")

# Build the filter expression from registry lookups.
# Named `filter_expr` to avoid shadowing Python's built-in `filter`.
filter_expr = (
    f"{features.cell_line} == '{cell_lines.hep_g2_cell}' and {features.sirna} =="
    f" '{sirnas.s15652}' and {features.well} == '{wells.m15}' and "
    f"{features.plate} == '1' and {features.site} == '2'"
)

# Read the metadata parquet file directly from storage.
region = ln.setup.settings.storage.region
parquet_data = duckdb.from_parquet(
    collection.meta_artifact.path.as_posix() + f"?s3_region={region}"
)

parquet_data.filter(filter_expr)
┌──────────────────┬────────────────┬───────────┬─────────┬────────────┬───────┬─────────┬───────┬──────────────────┬─────────┬──────────┬───────────────────────────────────────────┐
│     site_id      │    well_id     │ cell_line │  split  │ experiment │ plate │  well   │ site  │    well_type     │  sirna  │ sirna_id │                   path                    │
│     varchar      │    varchar     │  varchar  │ varchar │  varchar   │ int64 │ varchar │ int64 │     varchar      │ varchar │  int64   │                  varchar                  │
├──────────────────┼────────────────┼───────────┼─────────┼────────────┼───────┼─────────┼───────┼──────────────────┼─────────┼──────────┼───────────────────────────────────────────┤
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w1.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w2.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w3.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w4.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w5.png │
│ HEPG2-08_1_M15_2 │ HEPG2-08_1_M15 │ HEPG2     │ test    │ HEPG2-08   │     1 │ M15     │     2 │ positive_control │ s15652  │     1114 │ images/test/HEPG2-08/Plate1/M15_s2_w6.png │
└──────────────────┴────────────────┴───────────┴─────────┴────────────┴───────┴─────────┴───────┴──────────────────┴─────────┴──────────┴───────────────────────────────────────────┘
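
When the filter values come from user input rather than curated registries, assembling the expression by hand invites quoting mistakes. Below is a minimal sketch of a helper that builds an equivalent AND-ed equality expression from a dictionary. `sql_literal` and `build_filter` are hypothetical names, and the quoting rules (single quotes doubled inside string literals, column names wrapped in double quotes) follow standard SQL conventions rather than anything specific to this guide.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal (strings quoted, quotes doubled)."""
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_filter(conditions):
    """Combine column/value pairs into an AND-ed equality expression.

    Hypothetical helper; column names are wrapped in double quotes.
    """
    parts = []
    for column, value in conditions.items():
        quoted_column = '"' + column.replace('"', '""') + '"'
        parts.append(f"{quoted_column} = {sql_literal(value)}")
    return " AND ".join(parts)


filter_expr = build_filter(
    {"cell_line": "HEPG2", "sirna": "s15652", "well": "M15", "plate": 1, "site": 2}
)
print(filter_expr)
```

The resulting string could be passed to the relation's `filter` method in place of the hand-written expression. Passing `plate` and `site` as integers keeps the comparison in the columns' native `int64` type.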