Trace data & code in a project: Schmidt et al. (2022)

LaminDB lets you trace data lineage across an entire project.

Here we follow Schmidt et al. (2022) through pipelines & notebooks.

A CRISPR screen reading out a phenotypic endpoint is paired with scRNA-seq to generate insights into IFN-γ production.

Through data lineage, these insights can be traced back through each preceding step to the original data to provide context for interpretation & future decision making.

More specifically: Why should I care about data lineage?

Data lineage tracks data sources & transformations. It helps you trace biological insights, verify experimental outcomes, meet regulatory standards, increase the robustness of research, and optimize the feedback loop of team-wide learning iterations.

# pip install lamindb
!lamin init --storage ./mydata
 initialized lamindb: testuser1/mydata

Import lamindb:

import lamindb as ln
from IPython.display import Image, display
 connected lamindb: testuser1/mydata

Steps

In the following, we walk through example steps that cover different types of transforms (Transform).

Upload of phenotypic data

Register data through app upload from wetlab by testuser1:

# this function mimics the upload of artifacts via the UI
# in reality, one would drag and drop a file
def mock_upload_crispra_result_app():
    ln.setup.login("testuser1")
    output_path = ln.core.datasets.schmidt22_crispra_gws_IFNG(ln.settings.storage.root)
    output_file = ln.Artifact(
        output_path, description="Raw data of schmidt22 crispra GWS"
    )
    output_file.save()


mock_upload_crispra_result_app()
! no run & transform got linked, call `ln.track()` & re-run

Hit identification in notebook

Access, transform & register data in drylab by testuser2 in notebook hit-identification.

# the following mimics running the hit-identification notebook
# in reality, you would execute the code inside the notebook itself
import nbproject_test
from pathlib import Path

cwd = Path.cwd()
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/hit-identification.ipynb", write=True
)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/hit-identification.ipynb
Scheduled: ['hit-identification']
hit-identification ✓ (5.302s)
Total time: 5.303s

Inspect data lineage:

artifact = ln.Artifact.get(description="hits from schmidt22 crispra GWS")
artifact.view_lineage()
[Figure: lineage graph of the artifact "hits from schmidt22 crispra GWS"]
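
To make the idea behind such a lineage graph concrete, here is a minimal pure-Python sketch. It is not how LaminDB implements view_lineage(); it only assumes a hypothetical PROVENANCE mapping from each artifact to the transform that produced it and the artifacts that transform consumed. The helper trace_lineage and the mapping are made up for illustration; the artifact descriptions echo the steps above, and the raw upload has no transform because none was linked.

```python
from collections import deque

# Hypothetical provenance records: artifact -> (producing transform, inputs).
# The raw upload has no transform, mirroring the "no run & transform got linked" warning.
PROVENANCE = {
    "hits from schmidt22 crispra GWS": (
        "hit-identification.ipynb",
        ["Raw data of schmidt22 crispra GWS"],
    ),
    "Raw data of schmidt22 crispra GWS": (None, []),
}


def trace_lineage(artifact):
    """Walk upstream breadth-first and return (artifact, transform) pairs."""
    seen = set()
    steps = []
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        if current in seen or current not in PROVENANCE:
            continue
        seen.add(current)
        transform, inputs = PROVENANCE[current]
        steps.append((current, transform))
        queue.extend(inputs)  # visit the inputs of this step next
    return steps


lineage = trace_lineage("hits from schmidt22 crispra GWS")
for artifact_name, transform_name in lineage:
    print(f"{artifact_name!r} <- {transform_name or 'no transform linked'}")
```

Each step only needs to record what it produced and what it consumed; the full history can then be recovered by walking those links backwards.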

Sequencer upload

Upload files from sequencer via script chromium_10x_upload.py:

!python project-flow-scripts/chromium_10x_upload.py
 connected lamindb: testuser1/mydata
 created Transform('qCJPkOuZAi9q0000'), started new Run('vJVToxWY...') at 2025-10-07 12:39:44 UTC

scRNA-seq bioinformatics pipeline

Process uploaded files using a script or workflow manager (see Pipelines – workflow managers) and obtain 3 output files in the directory filtered_feature_bc_matrix/:

cellranger.py

!python project-flow-scripts/cellranger.py
 connected lamindb: testuser1/mydata
/home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/cellranger.py:7: FutureWarning: `name` will be removed soon, please pass 'Cell Ranger' to `key` instead
  transform = ln.Transform(
 created Transform('3fkyE88RrVHT0000'), started new Run('qoMzg2vR...') at 2025-10-07 12:39:48 UTC
 recommendation: to identify the script across renames, pass the uid: ln.track("3fkyE88RrVHT")

postprocess_cellranger.py

!python project-flow-scripts/postprocess_cellranger.py
 connected lamindb: testuser1/mydata
 created Transform('YqmbO6oMXjRj0000'), started new Run('33o5MBq5...') at 2025-10-07 12:39:50 UTC

Inspect data lineage:

output_file = ln.Artifact.get(description="perturbseq counts")
output_file.view_lineage()
[Figure: lineage graph of the artifact "perturbseq counts"]

Integrate scRNA-seq & phenotypic data

Integrate data in notebook integrated-analysis.

# the following mimics the integrated analysis notebook
# In reality, you would execute inside the notebook
nbproject_test.execute_notebooks(
    cwd / "project-flow-scripts/integrated-analysis.ipynb", write=True
)
Executing notebooks in /home/runner/work/lamin-usecases/lamin-usecases/docs/project-flow-scripts/integrated-analysis.ipynb
Scheduled: ['integrated-analysis']
integrated-analysis ✓ (5.737s)
Total time: 5.738s

Review results

Let’s load one of the plots:

# track the current notebook as transform
ln.track("1LCd8kco9lZU0000")
 created Transform('1LCd8kco9lZU0000'), started new Run('10myNhXL...') at 2025-10-07 12:39:58 UTC
 notebook imports: ipython==9.6.0 lamindb==1.12.1 nbproject_test==0.6.0
artifact = ln.Artifact.get(key__contains="figures/matrixplot")
artifact.cache()
PosixUPath('/home/runner/work/lamin-usecases/lamin-usecases/docs/mydata/.lamindb/K3FMar9XKoKEIyq50000.png')
display(Image(filename=artifact.path))
[Figure: the matrix plot stored under a key containing "figures/matrixplot"]

We see that the image artifact is tracked as an input of the current notebook. The input is highlighted, the notebook follows at the bottom:

artifact.view_lineage()
[Figure: lineage graph of the matrix plot artifact, including the current notebook]

Understand runs

We tracked pipeline and notebook runs through track(), which stores a Transform and a Run record within a global context.

Artifact objects are the inputs and outputs of runs.

What if I don’t want a global context?

Sometimes we don’t want to create a global run context and instead pass a run manually when creating an artifact:

run = ln.Run(transform=transform)
ln.Artifact(filepath, run=run)
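
The difference between a global run context and an explicitly passed run can be sketched in plain Python. This is a toy model under stated assumptions, not LaminDB's implementation: ToyRun, ToyArtifact, ToyContext, toy_track and make_artifact are hypothetical names, and the file names are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRun:
    transform: str  # name of the script or notebook that is running


@dataclass
class ToyArtifact:
    path: str
    run: Optional[ToyRun] = None


class ToyContext:
    """Holds at most one 'current' run, like a module-level global."""

    run: Optional[ToyRun] = None


def toy_track(transform: str) -> ToyRun:
    """Start a run and store it in the global context."""
    ToyContext.run = ToyRun(transform)
    return ToyContext.run


def make_artifact(path: str, run: Optional[ToyRun] = None) -> ToyArtifact:
    """An explicitly passed run wins; otherwise fall back to the global context."""
    return ToyArtifact(path, run if run is not None else ToyContext.run)


untracked = make_artifact("raw.csv")  # no context yet: no run is linked
toy_track("schmidt22.ipynb")
implicit = make_artifact("plot.png")  # picks up the global run
explicit = make_artifact("counts.h5ad", run=ToyRun("cellranger.py"))

print(untracked.run, implicit.run, explicit.run)
```

Passing the run explicitly keeps the link between artifact and run local to one call, which can be easier to reason about when several transforms are handled in the same process.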

When does an artifact appear as a run input?

When accessing an artifact via cache(), load() or open(), two things happen:

  1. The current run gets added to artifact.input_of_runs

  2. The transform of that artifact gets added as a parent of the current transform

You can switch off auto-tracking of run inputs by setting ln.settings.track_run_inputs = False (see: Can I disable tracking run inputs?).

You can also track run inputs on a case-by-case basis via is_run_input=True, e.g.:

artifact.load(is_run_input=True)
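
The bookkeeping described above can be illustrated with a small toy model. It is a sketch, not LaminDB's code: the classes, ToySettings and toy_load are hypothetical, and the rule that an explicit is_run_input takes precedence over the global setting is an assumption made for this example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity, like database records
class ToyTransform:
    name: str
    parents: list = field(default_factory=list)


@dataclass(eq=False)
class ToyRun:
    transform: ToyTransform


@dataclass(eq=False)
class ToyArtifact:
    name: str
    transform: ToyTransform  # the transform that created the artifact
    input_of_runs: list = field(default_factory=list)


class ToySettings:
    track_run_inputs = True  # stand-in for a global on/off switch


def toy_load(artifact, current_run, is_run_input=None):
    """Pretend to load an artifact and record it as a run input if requested."""
    # Assumption: an explicit is_run_input overrides the global setting.
    should_track = ToySettings.track_run_inputs if is_run_input is None else is_run_input
    if should_track and current_run is not None:
        if current_run not in artifact.input_of_runs:
            artifact.input_of_runs.append(current_run)  # effect 1
        if artifact.transform not in current_run.transform.parents:
            current_run.transform.parents.append(artifact.transform)  # effect 2
    return f"contents of {artifact.name}"


upstream = ToyTransform("integrated-analysis.ipynb")
run = ToyRun(ToyTransform("schmidt22.ipynb"))

figure = ToyArtifact("figures/matrixplot", upstream)
toy_load(figure, run)  # auto-tracked
tracked_auto = list(figure.input_of_runs)

ToySettings.track_run_inputs = False
other = ToyArtifact("figures/umap", upstream)
toy_load(other, run)  # not tracked while the switch is off
tracked_when_disabled = list(other.input_of_runs)

toy_load(other, run, is_run_input=True)  # tracked for this call only
tracked_explicit = list(other.input_of_runs)
```

In this model the parent link between transforms is added only once, however many artifacts from the same upstream transform are loaded.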

Query by lineage

We can query or search for the notebook that created the artifact:

transform = ln.Transform.search("GWS CRIPSRa analysis").first()

And then find all the artifacts created by that notebook:

ln.Artifact.filter(transform=transform).to_dataframe()
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest is_locked run_id created_at created_by_id _aux _real_key branch_id
id
2 iT45r6HUPm9bvYSU0000 None hits from schmidt22 crispra GWS .parquet dataset DataFrame 17075 vEIP8xCoCrRM_mnLRuTywQ None 123 md5 True False 1 1 None None True False 1 2025-10-07 12:39:41.937000+00:00 2 {'af': {'0': True}} None 1

Which transform ingested a given artifact?

artifact = ln.Artifact.filter().first()
artifact.transform

And which user?

artifact.created_by.handle
'testuser1'

Which transforms were created by a given user?

users = ln.User.lookup()
ln.Transform.filter(created_by=users.testuser1).to_dataframe()
uid key description type source_code hash reference reference_type space_id _template_id version is_latest is_locked created_at created_by_id _aux branch_id
id
2 qCJPkOuZAi9q0000 chromium_10x_upload.py None script import lamindb as ln\n\nln.setup.login("testus... nXWdh475QhVKuoAfToWZTw None None 1 None None True False 2025-10-07 12:39:44.907000+00:00 1 None 1
6 1LCd8kco9lZU0000 schmidt22.ipynb Trace data & code in a project: Schmidt _et al... notebook None None None None 1 None None True False 2025-10-07 12:39:58.504000+00:00 1 None 1

Which notebooks were created by a given user?

ln.Transform.filter(created_by=users.testuser1, type="notebook").to_dataframe()
uid key description type source_code hash reference reference_type space_id _template_id version is_latest is_locked created_at created_by_id _aux branch_id
id
6 1LCd8kco9lZU0000 schmidt22.ipynb Trace data & code in a project: Schmidt _et al... notebook None None None None 1 None None True False 2025-10-07 12:39:58.504000+00:00 1 None 1
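
The two queries above differ only in one extra keyword condition. As a rough illustration of how keyword conditions narrow a result set, here is a pure-Python sketch over a few fields copied from the Transform table in this guide. The helper toy_filter is hypothetical and says nothing about how LaminDB executes queries; it only assumes that conditions are combined with a logical AND.

```python
# A few fields of the Transform records shown in this guide.
TRANSFORMS = [
    {"key": "hit-identification.ipynb", "type": "notebook", "created_by_id": 2},
    {"key": "chromium_10x_upload.py", "type": "script", "created_by_id": 1},
    {"key": "Cell Ranger", "type": "pipeline", "created_by_id": 2},
    {"key": "postprocess_cellranger.py", "type": "script", "created_by_id": 2},
    {"key": "integrated-analysis.ipynb", "type": "notebook", "created_by_id": 2},
    {"key": "schmidt22.ipynb", "type": "notebook", "created_by_id": 1},
]


def toy_filter(records, **conditions):
    """Keep the records that match every keyword condition (logical AND)."""
    return [
        record
        for record in records
        if all(record.get(name) == value for name, value in conditions.items())
    ]


by_user = toy_filter(TRANSFORMS, created_by_id=1)
notebooks_by_user = toy_filter(TRANSFORMS, created_by_id=1, type="notebook")

print([record["key"] for record in by_user])
print([record["key"] for record in notebooks_by_user])
```

Adding a condition can only shrink the result, which is why the notebook query returns a subset of the rows returned for the user alone.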

We can also view all recent additions to the entire database:

ln.view()
Artifact
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest is_locked run_id created_at created_by_id _aux _real_key branch_id
id
12 K3FMar9XKoKEIyq50000 figures/matrixplot_fig2_score-wgs-hits-per-clu... None .png None None 28815 6y5M7YitZ_3sZ0UpnY6BCQ None None md5 True False 1 1 None None True False 5 2025-10-07 12:39:57.039000+00:00 2 {'af': {'0': True}} None 1
11 Lzcgzl7l5ixxj3x20000 figures/umap_fig1_score-wgs-hits.png None .png None None 119000 CALKcuqlsOyvDoapy6YniQ None None md5 True False 1 1 None None True False 5 2025-10-07 12:39:56.868000+00:00 2 {'af': {'0': True}} None 1
10 rijyudaJCVv62loG0000 schmidt22_perturbseq.h5ad perturbseq counts .h5ad None AnnData 20659936 la7EvqEUMDlug9-rpw-udA None None md5 False False 1 1 None None True False 4 2025-10-07 12:39:51.652000+00:00 2 None None 1
9 VLZV9CKKqQHiEpMv0000 perturbseq/filtered_feature_bc_matrix/features... None .tsv.gz None None 6 KHJi9hdhkGHX1bYu-tDxLA None None md5 False False 1 1 None None True False 3 2025-10-07 12:39:48.528000+00:00 2 None None 1
7 AAozwyeEtyoZYfTv0000 perturbseq/filtered_feature_bc_matrix/barcodes... None .tsv.gz None None 6 PRcJzE1p1d7ia9gvlAEEzA None None md5 False False 1 1 None None True False 3 2025-10-07 12:39:48.527000+00:00 2 None None 1
8 kZbUNuQBhLntahRL0000 perturbseq/filtered_feature_bc_matrix/matrix.m... None .mtx.gz None None 6 DPMOCY_78j-LWWsHBEXyUw None None md5 False False 1 1 None None True False 3 2025-10-07 12:39:48.527000+00:00 2 None None 1
5 zAVZ5AyMbq5iMmEX0000 fastq/perturbseq_R2_001.fastq.gz None .fastq.gz None None 6 Zy3mwKgUeT6CTNT7-o25RA None None md5 False False 1 1 None None True False 2 2025-10-07 12:39:45.517000+00:00 1 None None 1
Run
uid name started_at finished_at reference reference_type _is_consecutive _status_code space_id transform_id report_id _logfile_id environment_id initiated_by_run_id is_locked created_at created_by_id _aux branch_id
id
1 RdZNZc49Lb2KBEcf None 2025-10-07 12:39:41.532313+00:00 NaT None None None -1 1 1 NaN None NaN None False 2025-10-07 12:39:41.532000+00:00 2 None 1
2 vJVToxWYnU0ywBmh None 2025-10-07 12:39:44.910459+00:00 2025-10-07 12:39:45.518785+00:00 None None True 0 1 2 6.0 None 3.0 None False 2025-10-07 12:39:44.911000+00:00 1 None 1
3 qoMzg2vRCifqEQV7 None 2025-10-07 12:39:48.149665+00:00 NaT None None None -1 1 3 NaN None NaN None False 2025-10-07 12:39:48.150000+00:00 2 None 1
4 33o5MBq5OPKuKiGM None 2025-10-07 12:39:50.469358+00:00 NaT None None True -1 1 4 NaN None 3.0 None False 2025-10-07 12:39:50.470000+00:00 2 None 1
5 B0YeeUaHjeyhqFwa None 2025-10-07 12:39:56.012313+00:00 NaT None None None -1 1 5 NaN None NaN None False 2025-10-07 12:39:56.012000+00:00 2 None 1
6 10myNhXL7zV978Jx None 2025-10-07 12:39:58.507249+00:00 NaT None None None -1 1 6 NaN None NaN None False 2025-10-07 12:39:58.507000+00:00 1 None 1
Storage
uid root description type region instance_uid space_id is_locked run_id created_at created_by_id _aux branch_id
id
1 wxZQVDeBmaaz /home/runner/work/lamin-usecases/lamin-usecase... None local None 54ZGqgkROOFf 1 False None 2025-10-07 12:39:34.150000+00:00 1 None 1
Transform
uid key description type source_code hash reference reference_type space_id _template_id version is_latest is_locked created_at created_by_id _aux branch_id
id
6 1LCd8kco9lZU0000 schmidt22.ipynb Trace data & code in a project: Schmidt _et al... notebook None None None None 1 None None True False 2025-10-07 12:39:58.504000+00:00 1 None 1
5 lB3IyPLQSmvt0000 integrated-analysis.ipynb Perform single cell analysis, integrate with C... notebook None None None None 1 None None True False 2025-10-07 12:39:56.007000+00:00 2 None 1
4 YqmbO6oMXjRj0000 postprocess_cellranger.py None script import lamindb as ln\n\n\n# Post-process 3 cel... A7Fg8VHna7y4ZnGWGPEwVw None None 1 None None True False 2025-10-07 12:39:50.467000+00:00 2 None 1
3 3fkyE88RrVHT0000 Cell Ranger None pipeline None None https://www.10xgenomics.com/support/software/c... None 1 None 7.2.0 True False 2025-10-07 12:39:48.147000+00:00 2 None 1
2 qCJPkOuZAi9q0000 chromium_10x_upload.py None script import lamindb as ln\n\nln.setup.login("testus... nXWdh475QhVKuoAfToWZTw None None 1 None None True False 2025-10-07 12:39:44.907000+00:00 1 None 1
1 T0T28btuB0PG0000 hit-identification.ipynb GWS CRIPSRa analysis notebook None None None None 1 None None True False 2025-10-07 12:39:41.529000+00:00 2 None 1