
Project flow

LaminDB allows tracking data lineage at the level of an entire project.

Here, we walk through exemplified app uploads, pipelines & notebooks following Schmidt et al., 2022.

A CRISPR screen reading out a phenotypic endpoint on T cells is paired with scRNA-seq to generate insights into IFN-γ production.

These insights get linked back to the original data through the steps taken in the project to provide context for interpretation & future decision making.

More specifically: Why should I care about data flow?

Data flow tracks data sources & transformations to trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research and optimize the feedback loop of team-wide learning iterations.

While tracking data flow is easier when it’s governed by deterministic pipelines, it becomes hard when it’s governed by interactive human-driven analyses.

LaminDB interfaces with workflow managers for the former and embraces the latter.

# !pip install 'lamindb[jupyter,bionty,aws]'
!lamin init --storage ./mydata
 initialized lamindb: testuser1/mydata

Import lamindb:

import lamindb as ln
from IPython.display import Image, display
 connected lamindb: testuser1/mydata

Steps

In the following, we walk through exemplified steps covering different types of transforms (Transform).

Note

The full notebooks are in this repository.

App upload of phenotypic data

Register data through app upload from wetlab by testuser1:

# This function mimics the upload of artifacts via the UI
# In reality, you simply drag and drop files into the UI
def mock_upload_crispra_result_app():
    ln.setup.login("testuser1")
    transform = ln.Transform(key="Upload GWS CRISPRa result", type="upload")
    ln.track(transform=transform)
    output_path = ln.core.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage.root)
    output_file = ln.Artifact(
        output_path, description="Raw data of schmidt22 crispra GWS"
    )
    output_file.save()


mock_upload_crispra_result_app()
 created Transform('WZklSqXmxf2O0000'), started new Run('1vHdXxQU...') at 2025-01-20 07:41:03 UTC

Hit identification in notebook

Access, transform & register data in drylab by testuser2 in notebook hit-identification.

# The following mimics the hit identification notebook
# In reality, you would execute the code inside the notebook
import nbproject_test
from pathlib import Path

cwd = Path.cwd()
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/hit-identification.ipynb", write=True
)

Inspect data flow:

artifact = ln.Artifact.get(description="hits from schmidt22 crispra GWS")
artifact.view_lineage()
[Figure: lineage graph of the artifact "hits from schmidt22 crispra GWS"]
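
To make concrete what a lineage view walks over, here is a minimal pure-Python sketch that does not use LaminDB. It assumes a simplified model in which every artifact records the transform that produced it and every transform records the artifacts it read. The names `produced_by`, `inputs_of` and `upstream` are hypothetical, and the assumption that the hit-identification notebook reads the uploaded raw data is made for illustration.

```python
# Toy lineage model (illustrative only, not LaminDB's data model).

# Maps each artifact to the transform that produced it.
produced_by = {
    "Raw data of schmidt22 crispra GWS": "Upload GWS CRISPRa result",
    "hits from schmidt22 crispra GWS": "hit-identification.ipynb",
}

# Maps each transform to the artifacts it read as inputs (assumed here).
inputs_of = {
    "Upload GWS CRISPRa result": [],
    "hit-identification.ipynb": ["Raw data of schmidt22 crispra GWS"],
}


def upstream(artifact, seen=None):
    """Collect every transform and artifact that `artifact` depends on."""
    seen = set() if seen is None else seen
    nodes = []
    transform = produced_by.get(artifact)
    # Stop at artifacts without a known producer or at transforms already visited.
    if transform is None or transform in seen:
        return nodes
    seen.add(transform)
    nodes.append(transform)
    for parent in inputs_of.get(transform, []):
        nodes.append(parent)
        nodes.extend(upstream(parent, seen))
    return nodes


lineage = upstream("hits from schmidt22 crispra GWS")
for node in lineage:
    print(node)
```

Walking backwards from the hits artifact visits the notebook first, then the raw data, then the upload transform, which is the same chain a lineage graph draws from right to left.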

Sequencer upload

Upload files from sequencer via script chromium_10x_upload.py:

!python project-flow-scripts/chromium_10x_upload.py

scRNA-seq bioinformatics pipeline

Process the uploaded files using a script or workflow manager (see Pipelines – workflow managers) and obtain 3 output files in a directory filtered_feature_bc_matrix/:

cellranger.py

!python project-flow-scripts/cellranger.py

postprocess_cellranger.py

!python project-flow-scripts/postprocess_cellranger.py

Inspect data flow:

output_file = ln.Artifact.get(description="perturbseq counts")
output_file.view_lineage()
[Figure: lineage graph of the artifact "perturbseq counts"]

Integrate scRNA-seq & phenotypic data

Integrate data in notebook integrated-analysis.

# the following mimics the integrated analysis notebook
# In reality, you would execute inside the notebook
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/integrated-analysis.ipynb", write=True
)

Review results

Let’s load one of the plots:

# track the current notebook as transform
ln.track("1LCd8kco9lZU0000")
artifact = ln.Artifact.get(key__contains="figures/matrixplot")
artifact.cache()
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/.lamindb/x208sxnBUIp915Zw0000.png')

display(Image(filename=artifact.path))
[Figure: matrix plot loaded from the image artifact]

We see that the image artifact is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

artifact.view_lineage()
[Figure: lineage graph with the image artifact highlighted as an input of the current notebook]

Alternatively, we can also look at the sequence of transforms:

transform = ln.Transform.search("Project flow").first()
transform.predecessors.df()
| id | uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | lB3IyPLQSmvt0000 | integrated-analysis.ipynb | Perform single cell analysis, integrate with C... | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:20.491000+00:00 | 2 | None | 1 |

transform.view_lineage()
[Figure: lineage graph of the "Project flow" transform]

Understand runs

We tracked pipeline and notebook runs through track(), which stores a Transform and a Run record within a global context.

Artifact objects are the inputs and outputs of runs.

What if I don’t want a global context?

Sometimes, we don’t want to create a global run context but manually pass a run when creating an artifact:

run = ln.Run(transform=transform)
ln.Artifact(filepath, run=run)

When does an artifact appear as a run input?

When accessing an artifact via cache(), load() or open(), two things happen:

  1. The current run gets added to artifact.input_of

  2. The transform of that artifact gets added as a parent of the current transform
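
The two bookkeeping steps above can be sketched with a toy model in plain Python. This is not LaminDB code: the classes `ToyTransform`, `ToyRun` and `ToyArtifact` are hypothetical stand-ins, and passing the current run explicitly to `load()` is a simplification of the global run context.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class ToyTransform:
    key: str
    predecessors: list = field(default_factory=list)  # upstream transforms


@dataclass(eq=False)
class ToyRun:
    transform: ToyTransform


@dataclass(eq=False)
class ToyArtifact:
    description: str
    transform: ToyTransform  # transform that created this artifact
    input_of: list = field(default_factory=list)  # runs that read this artifact

    def load(self, current_run: ToyRun) -> str:
        # Step 1: register the current run as a consumer of this artifact.
        if current_run not in self.input_of:
            self.input_of.append(current_run)
        # Step 2: register the creating transform as a predecessor
        # of the transform behind the current run.
        current = current_run.transform
        if self.transform not in current.predecessors:
            current.predecessors.append(self.transform)
        return f"<content of {self.description}>"


analysis = ToyTransform(key="integrated-analysis.ipynb")
review = ToyTransform(key="project-flow.ipynb")
plot = ToyArtifact(description="matrix plot", transform=analysis)
run = ToyRun(transform=review)

plot.load(run)
print([r.transform.key for r in plot.input_of])
print([t.key for t in review.predecessors])
```

In this sketch, loading the plot from the review notebook makes `integrated-analysis.ipynb` a predecessor of `project-flow.ipynb`, mirroring the predecessor table shown earlier.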

You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False (see: Can I disable tracking run inputs?).

You can also track run inputs on a case-by-case basis via is_run_input=True, e.g., here:

artifact.load(is_run_input=True)
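
How a global switch and a per-call flag can interact is sketched below in plain Python. The precedence rule (an explicit per-call value wins, otherwise the global setting applies) is an assumption made for illustration, and `ToySettings` and `should_track` are hypothetical names rather than LaminDB API.

```python
class ToySettings:
    """Stand-in for a global settings object with a tracking switch."""

    track_run_inputs: bool = True


settings = ToySettings()


def should_track(is_run_input=None):
    """Decide whether an artifact access is recorded as a run input."""
    # An explicit per-call value takes precedence over the global switch.
    if is_run_input is not None:
        return is_run_input
    return settings.track_run_inputs


# Switch off automatic tracking globally, then opt in for a single call.
settings.track_run_inputs = False
decisions = {
    "global default": should_track(),
    "explicit True": should_track(is_run_input=True),
    "explicit False": should_track(is_run_input=False),
}
print(decisions)
```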

Query by provenance

We can query or search for the notebook that created the artifact:

transform = ln.Transform.search("GWS CRIPSRa analysis").first()

And then find all the artifacts created by that notebook:

ln.Artifact.filter(transform=transform).df()
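| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2 | Y7YPZKd2MtLHfC6P0000 | None | hits from schmidt22 crispra GWS | .parquet | dataset | DataFrame | 16948 | ANdKDt5h3CqV4Bfi4KGCEQ | None | None | md5 | True | False | 1 | 1 | None | None | True | 2 | 2025-01-20 07:41:08.212000+00:00 | 2 | None | 1 |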

Which transform ingested a given artifact?

artifact = ln.Artifact.filter().first()
artifact.transform
Transform(uid='WZklSqXmxf2O0000', is_latest=True, key='Upload GWS CRISPRa result', type='upload', space_id=1, created_by_id=1, created_at=2025-01-20 07:41:03 UTC)

And which user?

artifact.created_by
<User: User object (1)>

Which transforms were created by a given user?

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).df()
| id | uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | WZklSqXmxf2O0000 | Upload GWS CRISPRa result | None | upload | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:03.374000+00:00 | 1 | None | 1 |
| 3 | qCJPkOuZAi9q0000 | chromium_10x_upload.py | None | script | import lamindb as ln\n\nln.setup.login("testus... | nXWdh475QhVKuoAfToWZTw | None | None | 1 | None | None | True | 2025-01-20 07:41:11.033000+00:00 | 1 | None | 1 |
| 7 | 1LCd8kco9lZU0000 | project-flow.ipynb | Project flow | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:21.986000+00:00 | 1 | None | 1 |

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser1, type="notebook").df()
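| id | uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 1LCd8kco9lZU0000 | project-flow.ipynb | Project flow | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:21.986000+00:00 | 1 | None | 1 |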
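
The logic of these provenance filters can be illustrated without a database. The sketch below is plain Python, not LaminDB: it copies the `id`, `key`, `type` and `created_by_id` values of the transform records from this walkthrough into dictionaries and applies keyword filters to them. The helper `filter_records` is hypothetical, and mapping testuser1 to `created_by_id` 1 is an assumption based on the outputs shown here.

```python
# Transform records as plain dictionaries (values copied from this walkthrough).
transforms = [
    {"id": 1, "key": "Upload GWS CRISPRa result", "type": "upload", "created_by_id": 1},
    {"id": 2, "key": "hit-identification.ipynb", "type": "notebook", "created_by_id": 2},
    {"id": 3, "key": "chromium_10x_upload.py", "type": "script", "created_by_id": 1},
    {"id": 4, "key": "Cell Ranger", "type": "pipeline", "created_by_id": 2},
    {"id": 5, "key": "postprocess_cellranger.py", "type": "script", "created_by_id": 2},
    {"id": 6, "key": "integrated-analysis.ipynb", "type": "notebook", "created_by_id": 2},
    {"id": 7, "key": "project-flow.ipynb", "type": "notebook", "created_by_id": 1},
]


def filter_records(records, **conditions):
    """Keep the records whose fields equal every given keyword condition."""
    return [
        record
        for record in records
        if all(record.get(name) == value for name, value in conditions.items())
    ]


# All transforms created by the user with id 1 (assumed to be testuser1).
by_user = filter_records(transforms, created_by_id=1)
# Only the notebooks among them.
notebooks_by_user = filter_records(transforms, created_by_id=1, type="notebook")

print([t["key"] for t in by_user])
print([t["key"] for t in notebooks_by_user])
```

Each additional keyword narrows the result, which is why adding the `type` condition reduces three transforms to a single notebook.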

We can also view all recent additions to the entire database:

ln.view()
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12 | x208sxnBUIp915Zw0000 | figures/matrixplot_fig2_score-wgs-hits-per-clu... | None | .png | None | None | 28815 | DT0-ldk79wZs-ZraQT_ipQ | None | None | md5 | True | False | 1 | 1 | None | None | True | 6 | 2025-01-20 07:41:21.195000+00:00 | 2 | None | 1 |
| 11 | IKmpeQOizVGA77My0000 | figures/umap_fig1_score-wgs-hits.png | None | .png | None | None | 119000 | x5pcU-MOhyFKeAQaE8YYOQ | None | None | md5 | True | False | 1 | 1 | None | None | True | 6 | 2025-01-20 07:41:21.041000+00:00 | 2 | None | 1 |
| 10 | c7RIaDkBBQ91ln5B0000 | schmidt22_perturbseq.h5ad | perturbseq counts | .h5ad | None | AnnData | 20659936 | la7EvqEUMDlug9-rpw-udA | None | None | md5 | False | False | 1 | 1 | None | None | True | 5 | 2025-01-20 07:41:17.184000+00:00 | 2 | None | 1 |
| 8 | c2zr6G9JukgPt3iK0000 | perturbseq/filtered_feature_bc_matrix/barcodes... | None | .tsv.gz | None | None | 6 | gKzk3SEpwT8VFYdqu6PpkA | None | None | md5 | False | False | 1 | 1 | None | None | True | 4 | 2025-01-20 07:41:14.382000+00:00 | 2 | None | 1 |
| 9 | NxE5sIshWDcuYH5c0000 | perturbseq/filtered_feature_bc_matrix/features... | None | .tsv.gz | None | None | 6 | nNmYAX8lsQmW1N1tvg_YSg | None | None | md5 | False | False | 1 | 1 | None | None | True | 4 | 2025-01-20 07:41:14.382000+00:00 | 2 | None | 1 |
| 7 | f62SL4JhgLExx9qH0000 | perturbseq/filtered_feature_bc_matrix/matrix.m... | None | .mtx.gz | None | None | 6 | Wj5HuFqLoLnqfZJejN5l-Q | None | None | md5 | False | False | 1 | 1 | None | None | True | 4 | 2025-01-20 07:41:14.381000+00:00 | 2 | None | 1 |
| 4 | iy75f5dh8ryzAI5J0000 | fastq/perturbseq_R2_001.fastq.gz | None | .fastq.gz | None | None | 6 | 0kD07C20cgTsFIE9mHMj2A | None | None | md5 | False | False | 1 | 1 | None | None | True | 3 | 2025-01-20 07:41:11.372000+00:00 | 1 | None | 1 |

| id | uid | name | started_at | finished_at | reference | reference_type | _is_consecutive | _status_code | space_id | transform_id | report_id | _logfile_id | environment_id | initiated_by_run_id | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1vHdXxQUc5NI7M1QlBzh | None | 2025-01-20 07:41:03.377131+00:00 | NaT | None | None | None | 0 | 1 | 1 | NaN | None | NaN | None | 2025-01-20 07:41:03.377000+00:00 | 1 | None | 1 |
| 2 | 7N6WTAC51Cl808VYcDev | None | 2025-01-20 07:41:07.823999+00:00 | NaT | None | None | None | 0 | 1 | 2 | NaN | None | NaN | None | 2025-01-20 07:41:07.824000+00:00 | 2 | None | 1 |
| 3 | 9ltfAjWqDItj1VuNn37X | None | 2025-01-20 07:41:11.035451+00:00 | 2025-01-20 07:41:11.381045+00:00 | None | None | None | 0 | 1 | 3 | 6.0 | None | 5.0 | None | 2025-01-20 07:41:11.036000+00:00 | 1 | None | 1 |
| 4 | h8lyJVq00yxbfdlOm3db | None | 2025-01-20 07:41:14.031049+00:00 | NaT | None | None | None | 0 | 1 | 4 | NaN | None | NaN | None | 2025-01-20 07:41:14.031000+00:00 | 2 | None | 1 |
| 5 | 6L2pYbEpu0OXCf2NBmud | None | 2025-01-20 07:41:15.827913+00:00 | NaT | None | None | None | 0 | 1 | 5 | NaN | None | NaN | None | 2025-01-20 07:41:15.828000+00:00 | 2 | None | 1 |
| 6 | kBTIq73uBwlxd5a4Gvtb | None | 2025-01-20 07:41:20.495118+00:00 | NaT | None | None | None | 0 | 1 | 6 | NaN | None | NaN | None | 2025-01-20 07:41:20.495000+00:00 | 2 | None | 1 |
| 7 | 60pjBNuaRRyg0cKW1PD8 | None | 2025-01-20 07:41:21.989821+00:00 | NaT | None | None | None | 0 | 1 | 7 | NaN | None | NaN | None | 2025-01-20 07:41:21.990000+00:00 | 1 | None | 1 |

| id | uid | root | description | type | region | instance_uid | space_id | run_id | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | tdCuq1DdQMBU | /home/runner/work/lamin-usecases/lamin-usecase... | None | local | None | 54ZGqgkROOFf | 1 | None | 2025-01-20 07:41:00.935000+00:00 | 1 | None | 1 |

| id | uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7 | 1LCd8kco9lZU0000 | project-flow.ipynb | Project flow | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:21.986000+00:00 | 1 | None | 1 |
| 6 | lB3IyPLQSmvt0000 | integrated-analysis.ipynb | Perform single cell analysis, integrate with C... | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:20.491000+00:00 | 2 | None | 1 |
| 5 | YqmbO6oMXjRj0000 | postprocess_cellranger.py | None | script | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:15.824000+00:00 | 2 | None | 1 |
| 4 | 1J4PZg6IDDOy0000 | Cell Ranger | None | pipeline | None | None | https://www.10xgenomics.com/support/software/c... | None | 1 | None | 7.2.0 | True | 2025-01-20 07:41:14.028000+00:00 | 2 | None | 1 |
| 3 | qCJPkOuZAi9q0000 | chromium_10x_upload.py | None | script | import lamindb as ln\n\nln.setup.login("testus... | nXWdh475QhVKuoAfToWZTw | None | None | 1 | None | None | True | 2025-01-20 07:41:11.033000+00:00 | 1 | None | 1 |
| 2 | T0T28btuB0PG0000 | hit-identification.ipynb | GWS CRIPSRa analysis | notebook | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:07.820000+00:00 | 2 | None | 1 |
| 1 | WZklSqXmxf2O0000 | Upload GWS CRISPRa result | None | upload | None | None | None | None | 1 | None | None | True | 2025-01-20 07:41:03.374000+00:00 | 1 | None | 1 |

!lamin login testuser1
!rm -r ./mydata
!lamin delete --force mydata