Project flow

LaminDB allows tracking data lineage at the level of an entire project.

Here, we walk through example app uploads, pipelines & notebooks, following Schmidt et al., 2022.

A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.

These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.

More specifically: Why should I care about data flow?

Data flow tracks data sources & transformations to trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research and optimize the feedback loop of team-wide learning iterations.

While tracking data flow is easier when it’s governed by deterministic pipelines, it becomes hard when it’s governed by interactive human-driven analyses.

LaminDB interfaces with workflow managers for the former and embraces the latter.

# !pip install 'lamindb[jupyter,bionty,aws]'
!lamin init --storage ./mydata
→ connected lamindb: testuser1/mydata

Import lamindb:

import lamindb as ln
from IPython.display import Image, display
→ connected lamindb: testuser1/mydata

Steps

In the following, we walk through example steps covering different types of transforms (Transform).

Note

The full notebooks are in this repository.

App upload of phenotypic data

Register data through app upload from wetlab by testuser1:

# This function mimics the upload of artifacts via the UI
# In reality, you simply drag and drop files into the UI
def mock_upload_crispra_result_app():
    ln.setup.login("testuser1")
    transform = ln.Transform(name="Upload GWS CRISPRa result", type="upload")
    ln.track(transform=transform)
    output_path = ln.core.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage.root)
    output_file = ln.Artifact(
        output_path, description="Raw data of schmidt22 crispra GWS"
    )
    output_file.save()


mock_upload_crispra_result_app()
→ created Transform('AJZ8h6nV'), started new Run('3DI7SoTx') at 2024-11-21 06:57:35 UTC

Hit identification in notebook

Access, transform & register data in drylab by testuser2 in notebook hit-identification.

# The following mimics running the hit-identification notebook
# In reality, you would execute the cells inside the notebook
import nbproject_test
from pathlib import Path

cwd = Path.cwd()
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/hit-identification.ipynb", write=True
)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/hit-identification.ipynb
Scheduled: ['hit-identification']
hit-identification 
✓ (3.992s)
Total time: 3.993s

Inspect data flow:

artifact = ln.Artifact.get(description="hits from schmidt22 crispra GWS")
artifact.view_lineage()
_images/210967d358a290c0e1ba04aed2051a7df5383f08e5f0c250b95065c4088ae977.svg
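
To illustrate what a lineage view conveys, here is a minimal pure-Python sketch that walks from an artifact back to its sources. It is not how LaminDB implements view_lineage(): the `lineage` dictionary and the `upstream` function are hypothetical, and the edge from the raw data to the hits artifact is an assumption inferred from the steps described above.

```python
# Hypothetical toy graph: artifact -> (creating transform, input artifacts).
# The input edge of the hits artifact is assumed from the steps in this guide.
lineage = {
    "Raw data of schmidt22 crispra GWS": ("Upload GWS CRISPRa result", []),
    "hits from schmidt22 crispra GWS": (
        "GWS CRIPSRa analysis",
        ["Raw data of schmidt22 crispra GWS"],
    ),
}


def upstream(artifact, graph):
    """Return (artifact, creating transform) pairs from `artifact` back to its sources."""
    seen = set()
    steps = []
    stack = [artifact]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against visiting shared inputs twice
        seen.add(current)
        transform, inputs = graph[current]
        steps.append((current, transform))
        stack.extend(inputs)
    return steps


for artifact_name, transform_name in upstream("hits from schmidt22 crispra GWS", lineage):
    print(f"{artifact_name} <- {transform_name}")
```

Each printed line pairs an artifact with the transform that produced it, which is the same information the lineage graph encodes visually.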

Sequencer upload

Upload files from sequencer via script chromium_10x_upload.py:

!python project-flow-scripts/chromium_10x_upload.py
→ connected lamindb: testuser1/mydata
→ created Transform('qCJPkOuZ'), started new Run('j74MkhOW') at 2024-11-21 06:57:42 UTC
→ finished Run('j74MkhOW') after 0d 0h 0m 0s at 2024-11-21 06:57:42 UTC

scRNA-seq bioinformatics pipeline

Process the uploaded files using a script or workflow manager (see Pipelines – workflow managers) and obtain 3 output files in a directory filtered_feature_bc_matrix/:

cellranger.py

!python project-flow-scripts/cellranger.py
→ connected lamindb: testuser1/mydata
→ created Transform('HLC0QrPl'), started new Run('MV9eckYa') at 2024-11-21 06:57:44 UTC

postprocess_cellranger.py

!python project-flow-scripts/postprocess_cellranger.py
→ connected lamindb: testuser1/mydata
→ created Transform('YqmbO6oM'), started new Run('jUmfaV3c') at 2024-11-21 06:57:46 UTC

Inspect data flow:

output_file = ln.Artifact.get(description="perturbseq counts")
output_file.view_lineage()
_images/b5230f0c0ccf3bf7c98584038e2fb46340750c47f3bb593c755a248c77d07e20.svg

Integrate scRNA-seq & phenotypic data

Integrate data in notebook integrated-analysis.

# the following mimics the integrated analysis notebook
# In reality, you would execute inside the notebook
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/integrated-analysis.ipynb", write=True
)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/integrated-analysis.ipynb
Scheduled: ['integrated-analysis']
integrated-analysis 
✓ (4.243s)
Total time: 4.245s

Review results

Let’s load one of the plots:

# track the current notebook as transform
ln.track("1LCd8kco9lZU0000")
→ created Transform('1LCd8kco'), started new Run('7VmIewKn') at 2024-11-21 06:57:52 UTC
→ notebook imports: ipython==8.29.0 lamindb==0.76.16 nbproject_test==0.5.1
artifact = ln.Artifact.get(key__contains="figures/matrixplot")
artifact.cache()
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/.lamindb/dYVQxC2rVPKtyTzz0000.png')
display(Image(filename=artifact.path))
_images/28b5626f0a00bc85fcf72624b858433cdfde27a5e67d133573789865113fa491.png

We see that the image artifact is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

artifact.view_lineage()
_images/933a2818c4738be295215298dd7fda71a6a1937034becf9c30a4ed30d74471b0.svg

Alternatively, we can also look at the sequence of transforms:

transform = ln.Transform.search("Project flow").first()
transform.predecessors.df()
uid version is_latest name key description type source_code hash reference reference_type _source_code_artifact_id created_at created_by_id
id
6 lB3IyPLQSmvt0000 None True Perform single cell analysis, integrate with C... integrated-analysis.ipynb None notebook None None None None None 2024-11-21 06:57:50.671567+00:00 2
transform.view_lineage()
_images/812155aa9ab1746bc9493a03281da07e812f297258fdb365f5e561181e8310d4.svg

Understand runs

We tracked pipeline and notebook runs through track(), which stores a Transform and a Run record within a global context.

Artifact objects are the inputs and outputs of runs.
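
The relationship between these three record types can be sketched in plain Python. This is a conceptual model only, not LaminDB's implementation: `ToyTransform`, `ToyRun`, `ToyArtifact`, `toy_track` and `toy_create_artifact` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class ToyTransform:
    """Hypothetical stand-in for a piece of code (notebook, script, pipeline, upload)."""
    name: str
    type: str


@dataclass(eq=False)
class ToyRun:
    """Hypothetical stand-in for one execution of a transform."""
    transform: ToyTransform
    outputs: List["ToyArtifact"] = field(default_factory=list)


@dataclass(eq=False)
class ToyArtifact:
    """Hypothetical stand-in for a stored file or dataset."""
    description: str
    run: Optional[ToyRun] = None  # the run that created this artifact


# A module-level variable plays the role of the global context.
_current_run: Optional[ToyRun] = None


def toy_track(transform: ToyTransform) -> ToyRun:
    """Start a new run of `transform` and make it the current run."""
    global _current_run
    _current_run = ToyRun(transform=transform)
    return _current_run


def toy_create_artifact(description: str, run: Optional[ToyRun] = None) -> ToyArtifact:
    """Create an artifact as an output of `run`, or of the current run if none is passed."""
    creating_run = run if run is not None else _current_run
    artifact = ToyArtifact(description=description, run=creating_run)
    if creating_run is not None:
        creating_run.outputs.append(artifact)
    return artifact


upload = ToyTransform(name="Upload GWS CRISPRa result", type="upload")
upload_run = toy_track(upload)
raw = toy_create_artifact("Raw data of schmidt22 crispra GWS")
print(raw.run.transform.name)
```

In this sketch, an artifact created without an explicit run is attached to whichever run was started last, which is the convenience a global context provides.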

What if I don’t want a global context?

Sometimes, we don’t want to create a global run context but manually pass a run when creating an artifact:

run = ln.Run(transform=transform)
ln.Artifact(filepath, run=run)

When does an artifact appear as a run input?

When accessing an artifact via cache(), load() or open(), two things happen:

  1. The current run gets added to artifact.input_of

  2. The transform of that artifact gets added as a parent of the current transform

You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False (see: Can I disable tracking run inputs?).

You can also track run inputs on a case-by-case basis via is_run_input=True, e.g.:

artifact.load(is_run_input=True)
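
The input-tracking behavior described in this section can be sketched in plain Python. This is a toy model under stated assumptions, not LaminDB's code: the classes, the `toy_settings` dictionary, the `toy_access` function and the artifact named "some other artifact" are all hypothetical.

```python
class ToyTransform:
    def __init__(self, name):
        self.name = name
        self.predecessors = []  # transforms whose outputs this transform consumed


class ToyRun:
    def __init__(self, transform):
        self.transform = transform


class ToyArtifact:
    def __init__(self, description, run):
        self.description = description
        self.run = run  # the run that created the artifact
        self.input_of = []  # runs that consumed the artifact


# Hypothetical global switch for automatic input tracking.
toy_settings = {"track_run_inputs": True}


def toy_access(artifact, current_run, is_run_input=None):
    """Mimic accessing `artifact` from within `current_run`.

    `is_run_input=None` defers to the global switch; True or False overrides it.
    """
    track = toy_settings["track_run_inputs"] if is_run_input is None else is_run_input
    if track:
        # 1. Register the current run as a consumer of the artifact.
        if current_run not in artifact.input_of:
            artifact.input_of.append(current_run)
        # 2. Register the artifact's transform as a predecessor of the current transform.
        upstream_transform = artifact.run.transform
        if upstream_transform not in current_run.transform.predecessors:
            current_run.transform.predecessors.append(upstream_transform)
    return artifact


figure = ToyArtifact("figures/matrixplot", ToyRun(ToyTransform("integrated-analysis.ipynb")))
other = ToyArtifact("some other artifact", ToyRun(ToyTransform("some-script.py")))
review_run = ToyRun(ToyTransform("Project flow"))

toy_access(figure, review_run)  # tracked: the global switch is on
toy_settings["track_run_inputs"] = False
toy_access(other, review_run)  # not tracked: the global switch is off
tracked_after_global_off = list(other.input_of)
toy_access(other, review_run, is_run_input=True)  # tracked: per-call override

print([t.name for t in review_run.transform.predecessors])
```

The per-call argument takes precedence over the global switch, so tracking can stay off by default and still be enabled for individual accesses.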

Query by provenance

We can query or search for the notebook that created the artifact:

transform = ln.Transform.search("GWS CRIPSRa analysis").first()

And then find all the artifacts created by that notebook:

ln.Artifact.filter(transform=transform).df()
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_at created_by_id
id
2 bl5jvh5CXGRS2UQ60000 None True hits from schmidt22 crispra GWS None .parquet dataset 16957 zADdl2UTm8XXA-fT1oaWwg None None md5 DataFrame 1 True 1 2 2 2024-11-21 06:57:40.070377+00:00 2

Which transform ingested a given artifact?

artifact = ln.Artifact.filter().first()
artifact.transform
Transform(uid='AJZ8h6nVjicM0000', is_latest=True, name='Upload GWS CRISPRa result', type='upload', created_by_id=1, created_at=2024-11-21 06:57:35 UTC)

And which user?

artifact.created_by
User(uid='DzTjkKse', handle='testuser1', name='Test User1', created_at=2024-11-21 06:57:33 UTC)

Which transforms were created by a given user?

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).df()
uid version is_latest name key description type source_code hash reference reference_type _source_code_artifact_id created_at created_by_id
id
1 AJZ8h6nVjicM0000 None True Upload GWS CRISPRa result None None upload None None None None None 2024-11-21 06:57:35.928048+00:00 1
3 qCJPkOuZAi9q0000 None True chromium_10x_upload.py chromium_10x_upload.py None script import lamindb as ln\n\nln.setup.login("testus... nXWdh475QhVKuoAfToWZTw None None None 2024-11-21 06:57:42.125478+00:00 1
7 1LCd8kco9lZU0000 None True Project flow project-flow.ipynb None notebook None None None None None 2024-11-21 06:57:52.322096+00:00 1

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser1, type="notebook").df()
uid version is_latest name key description type source_code hash reference reference_type _source_code_artifact_id created_at created_by_id
id
7 1LCd8kco9lZU0000 None True Project flow project-flow.ipynb None notebook None None None None None 2024-11-21 06:57:52.322096+00:00 1
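
Conceptually, these filters select the records whose fields equal the given values. The following pure-Python sketch mimics that on a handful of rows; it is not LaminDB's query engine. The function `toy_filter` is hypothetical, and the rows are copied from the Transform table printed by ln.view() in this guide, restricted to four columns (one name is truncated in that output and kept as-is).

```python
# Rows copied from the Transform table shown in this guide (four columns only).
transforms = [
    {"id": 1, "name": "Upload GWS CRISPRa result", "type": "upload", "created_by_id": 1},
    {"id": 2, "name": "GWS CRIPSRa analysis", "type": "notebook", "created_by_id": 2},
    {"id": 3, "name": "chromium_10x_upload.py", "type": "script", "created_by_id": 1},
    {"id": 4, "name": "Cell Ranger", "type": "pipeline", "created_by_id": 2},
    {"id": 5, "name": "postprocess_cellranger.py", "type": "script", "created_by_id": 2},
    # The name below is truncated in the printed output and left that way here.
    {"id": 6, "name": "Perform single cell analysis, integrate with C...",
     "type": "notebook", "created_by_id": 2},
    {"id": 7, "name": "Project flow", "type": "notebook", "created_by_id": 1},
]


def toy_filter(records, **conditions):
    """Keep the records whose fields equal every given keyword condition."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in conditions.items())
    ]


by_user = toy_filter(transforms, created_by_id=1)
notebooks_by_user = toy_filter(transforms, created_by_id=1, type="notebook")
print([record["id"] for record in by_user])
print([record["id"] for record in notebooks_by_user])
```

Adding a second keyword narrows the result, which mirrors how the notebook query above returns a subset of the per-user query.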

We can also view all recent additions to the entire database:

ln.view()
Artifact
uid version is_latest description key suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id run_id created_at created_by_id
id
11 dYVQxC2rVPKtyTzz0000 None True None figures/matrixplot_fig2_score-wgs-hits-per-clu... .png None 28814 Uqpe3rI2qCa6KHvUIxadLw None None md5 None 1 True 1 6 6 2024-11-21 06:57:51.410115+00:00 2
10 8hBRQF1m7HaDjTIA0000 None True None figures/umap_fig1_score-wgs-hits.png .png None 118999 JMFvnvCcQzIdtXM8Y12MKg None None md5 None 1 True 1 6 6 2024-11-21 06:57:51.245134+00:00 2
9 edaRMctu3g5Luk8x0000 None True perturbseq counts schmidt22_perturbseq.h5ad .h5ad None 20659936 la7EvqEUMDlug9-rpw-udA None None md5 AnnData 1 False 1 5 5 2024-11-21 06:57:47.408373+00:00 2
8 PodAvbPc65m0XEC20000 None True None perturbseq/filtered_feature_bc_matrix/barcodes... .tsv.gz None 6 KNPlrLqurDujk_DabUqp1Q None None md5 None 1 False 1 4 4 2024-11-21 06:57:44.857688+00:00 2
7 O488EmMhgAYTXE5b0000 None True None perturbseq/filtered_feature_bc_matrix/matrix.m... .mtx.gz None 6 sLNulgDMeVu7f7xHhP1B3g None None md5 None 1 False 1 4 4 2024-11-21 06:57:44.857188+00:00 2
6 cZZ7C7vkg4VSqNCQ0000 None True None perturbseq/filtered_feature_bc_matrix/features... .tsv.gz None 6 FwlR22zmW4Yrz8qlousDgA None None md5 None 1 False 1 4 4 2024-11-21 06:57:44.856434+00:00 2
4 nIQG6iWhGM2reqCl0000 None True None fastq/perturbseq_R2_001.fastq.gz .fastq.gz None 6 USuj1EIt2Xc_ShM1FGyKAA None None md5 None 1 False 1 3 3 2024-11-21 06:57:42.486239+00:00 1
! No records found
! No records found
! No records found
Run
uid started_at finished_at is_consecutive reference reference_type transform_id report_id environment_id parent_id created_at created_by_id
id
1 3DI7SoTxhlNpvA32fd9q 2024-11-21 06:57:35.930351+00:00 NaT True None None 1 None NaN None 2024-11-21 06:57:35.930380+00:00 1
2 kZJl0bxhUmehm5sJxiau 2024-11-21 06:57:39.668776+00:00 NaT True None None 2 None NaN None 2024-11-21 06:57:39.668804+00:00 2
3 j74MkhOWybVgBOWWwtMs 2024-11-21 06:57:42.129908+00:00 2024-11-21 06:57:42.496360+00:00 True None None 3 None 5.0 None 2024-11-21 06:57:42.129942+00:00 1
4 MV9eckYa9rg3dUeIT4PL 2024-11-21 06:57:44.488360+00:00 NaT None None None 4 None NaN None 2024-11-21 06:57:44.488385+00:00 2
5 jUmfaV3ca10E5XkzU4wM 2024-11-21 06:57:46.391938+00:00 NaT None None None 5 None NaN None 2024-11-21 06:57:46.391966+00:00 2
6 t9uMtAYjfnyjaXuCHHek 2024-11-21 06:57:50.675092+00:00 NaT True None None 6 None NaN None 2024-11-21 06:57:50.675123+00:00 2
7 7VmIewKnPbUi9L0lhuYy 2024-11-21 06:57:52.325178+00:00 NaT True None None 7 None NaN None 2024-11-21 06:57:52.325208+00:00 1
Storage
uid root description type region instance_uid run_id created_at created_by_id
id
1 m4tZsf9MaZSI /home/runner/work/lamin-usecases/lamin-usecase... None local None 54ZGqgkROOFf None 2024-11-21 06:57:33.943240+00:00 1
Transform
uid version is_latest name key description type source_code hash reference reference_type _source_code_artifact_id created_at created_by_id
id
7 1LCd8kco9lZU0000 None True Project flow project-flow.ipynb None notebook None None None None None 2024-11-21 06:57:52.322096+00:00 1
6 lB3IyPLQSmvt0000 None True Perform single cell analysis, integrate with C... integrated-analysis.ipynb None notebook None None None None None 2024-11-21 06:57:50.671567+00:00 2
5 YqmbO6oMXjRj0000 None True postprocess_cellranger.py postprocess_cellranger.py None script None None None None None 2024-11-21 06:57:46.389626+00:00 2
4 HLC0QrPlN6f70000 7.2.0 True Cell Ranger None None pipeline None None https://www.10xgenomics.com/support/software/c... None None 2024-11-21 06:57:44.486028+00:00 2
3 qCJPkOuZAi9q0000 None True chromium_10x_upload.py chromium_10x_upload.py None script import lamindb as ln\n\nln.setup.login("testus... nXWdh475QhVKuoAfToWZTw None None None 2024-11-21 06:57:42.125478+00:00 1
2 T0T28btuB0PG0000 None True GWS CRIPSRa analysis hit-identification.ipynb None notebook None None None None None 2024-11-21 06:57:39.665468+00:00 2
1 AJZ8h6nVjicM0000 None True Upload GWS CRISPRa result None None upload None None None None None 2024-11-21 06:57:35.928048+00:00 1
! No records found
User
uid handle name created_at
id
2 bKeW4T6E testuser2 Test User2 2024-11-21 06:57:39.662472+00:00
1 DzTjkKse testuser1 Test User1 2024-11-21 06:57:33.940079+00:00
!lamin login testuser1
!rm -r ./mydata
!lamin delete --force mydata
✓ logged in with email [email protected] (uid: DzTjkKse)
• deleting instance testuser1/mydata