Nextflow
Nextflow is the most widely used workflow manager in bioinformatics.
This guide shows how to register a Nextflow run, together with its inputs and outputs, in LaminDB by running a Python script, using the nf-core/scrnaseq pipeline as an example.
The approach could be automated by deploying the script via a serverless environment trigger (e.g., AWS Lambda) or as a post-run script on the Seqera Platform.
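A post-run hook along these lines could invoke the registration step once the pipeline has exited. This is only a minimal sketch: the function names `build_registration_command` and `register_after_run` are hypothetical, while the script name `register_scrnaseq_run.py` and its `--input`/`--output` flags are taken from the registration step later in this guide. How the hook is wired into AWS Lambda or the Seqera Platform is not covered here.

```python
import subprocess
import sys
from typing import List


def build_registration_command(
    input_dir: str,
    output_dir: str,
    script: str = "register_scrnaseq_run.py",
) -> List[str]:
    """Assemble the command line for the registration script."""
    return [sys.executable, script, "--input", input_dir, "--output", output_dir]


def register_after_run(input_dir: str, output_dir: str) -> None:
    """Run the registration script and raise if it exits with an error."""
    command = build_registration_command(input_dir, output_dir)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_registration_command("scrnaseq_input", "scrnaseq_output")))
```

Passing `check=True` makes a failed registration surface as an exception, which is usually what a trigger or post-run step needs in order to report the failure.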
What steps are executed by the nf-core/scrnaseq pipeline?
!lamin init --storage ./test-nextflow --name test-nextflow
→ initialized lamindb: testuser1/test-nextflow
Run the pipeline
Let’s download the input data from an S3 bucket.
import lamindb as ln
input_path = ln.UPath("s3://lamindb-test/scrnaseq_input")
input_path.download_to("scrnaseq_input")
→ connected lamindb: testuser1/test-nextflow
Then run the nf-core/scrnaseq pipeline.
# the test profile uses all downloaded input files as an input
!nextflow run nf-core/scrnaseq -r 2.7.1 -profile docker,test -resume --outdir scrnaseq_output
N E X T F L O W ~ version 24.10.3
Pulling nf-core/scrnaseq ...
downloaded from https://github.com/nf-core/scrnaseq.git
WARN: It appears you have never run this project before -- Option `-resume` is ignored
Launching `https://github.com/nf-core/scrnaseq` [determined_snyder] DSL2 - revision: 4171377f40 [2.7.1]
Downloading plugin [email protected]
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/scrnaseq v2.7.1-g4171377
------------------------------------------------------
Core Nextflow options
revision : 2.7.1
runName : determined_snyder
containerEngine : docker
launchDir : /home/runner/work/nextflow-lamin/nextflow-lamin/docs
workDir : /home/runner/work/nextflow-lamin/nextflow-lamin/docs/work
projectDir : /home/runner/.nextflow/assets/nf-core/scrnaseq
userName : runner
profile : docker,test
configFiles :
Input/output options
input : https://github.com/nf-core/test-datasets/raw/scrnaseq/samplesheet-2-0.csv
outdir : scrnaseq_output
Mandatory arguments
aligner : star
protocol : 10XV2
Skip Tools
skip_emptydrops : true
Reference genome options
fasta : https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/GRCm38.p6.genome.chr19.fa
gtf : https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/gencode.vM19.annotation.chr19.gtf
Institutional config options
config_profile_name : Test profile
config_profile_description: Minimal test dataset to check pipeline function
Max job request options
max_cpus : 2
max_memory : 6.GB
max_time : 6.h
!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/scrnaseq for your analysis please cite:
* The pipeline
https://doi.org/10.5281/zenodo.3568187
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/scrnaseq/blob/master/CITATIONS.md
------------------------------------------------------
[a2/75b68c] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (Sample_X)
[1f/0a5c0c] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (Sample_Y)
[0c/2506ad] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:GTF_GENE_FILTER (GRCm38.p6.genome.chr19.fa)
[51/87435a] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_GENOMEGENERATE (GRCm38.p6.genome.chr19.fa)
[a4/053c8f] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN (Sample_X)
[9b/ba3134] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN (Sample_Y)
[d6/5dd858] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_X)
[97/2d7363] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_X)
[f9/e653c0] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_X)
[d1/64e020] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_X)
[13/5adbce] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_Y)
[0b/deeecf] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_Y)
[ae/7ce54d] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_Y)
[3d/b0f249] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_Y)
[03/4213ee] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC
[69/269d6d] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD (2)
[52/b14e18] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD (1)
-[nf-core/scrnaseq] Pipeline completed successfully-
What is the full run command for the test profile?
nextflow run nf-core/scrnaseq -r 2.7.1 \
    -profile docker \
    -resume \
    --outdir scrnaseq_output \
    --input 'scrnaseq_input/samplesheet-2-0.csv' \
    --skip_emptydrops \
    --fasta 'https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/GRCm38.p6.genome.chr19.fa' \
    --gtf 'https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/gencode.vM19.annotation.chr19.gtf' \
    --aligner 'star' \
    --protocol '10XV2' \
    --max_cpus 2 \
    --max_memory '6.GB' \
    --max_time '6.h'
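When the run is launched from a script rather than typed by hand, the same command can be assembled from a parameter dictionary. The following is a minimal sketch under that assumption; `build_nextflow_command` is a hypothetical helper, not part of Nextflow or nf-core. It relies only on the convention visible in the command above: single-dash options belong to Nextflow itself, double-dash options are pipeline parameters.

```python
from typing import Dict, List, Union

ParamValue = Union[str, int, bool]


def build_nextflow_command(
    pipeline: str,
    revision: str,
    profile: str,
    params: Dict[str, ParamValue],
    resume: bool = True,
) -> List[str]:
    """Assemble a `nextflow run` command from pipeline parameters."""
    command = ["nextflow", "run", pipeline, "-r", revision, "-profile", profile]
    if resume:
        command.append("-resume")
    for name, value in params.items():
        if value is True:
            # Boolean flags such as --skip_emptydrops are passed without a value.
            command.append(f"--{name}")
        elif value is False:
            # Disabled flags are left out entirely.
            continue
        else:
            command.extend([f"--{name}", str(value)])
    return command


command = build_nextflow_command(
    "nf-core/scrnaseq",
    "2.7.1",
    "docker",
    {"outdir": "scrnaseq_output", "skip_emptydrops": True, "max_cpus": 2},
)
print(" ".join(command))
```

Returning a list instead of a single string keeps each argument separate, so the result can be passed to `subprocess.run` without shell quoting.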
Run the registration script
After the pipeline has completed, a Python script (register_scrnaseq_run.py) registers the inputs and outputs in LaminDB.
import argparse
import lamindb as ln
import json
import re
from pathlib import Path


def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True)
    parser.add_argument("--output", type=str, required=True)
    return parser.parse_args()


def register_pipeline_io(input_dir: str, output_dir: str, run: ln.Run) -> None:
    """Register input and output artifacts for an `nf-core/scrnaseq` run."""
    input_artifacts = ln.Artifact.from_dir(input_dir, run=False)
    ln.save(input_artifacts)
    run.input_artifacts.set(input_artifacts)
    ln.Artifact(f"{output_dir}/multiqc", description="multiqc report", run=run).save()
    ln.Artifact(
        f"{output_dir}/star/mtx_conversions/combined_filtered_matrix.h5ad",
        description="filtered count matrix",
        run=run,
    ).save()


def register_pipeline_metadata(output_dir: str, run: ln.Run) -> None:
    """Register nf-core run metadata stored in the 'pipeline_info' folder."""
    ulabel = ln.ULabel(name="nextflow").save()
    run.transform.ulabels.add(ulabel)

    # nextflow run id
    content = next(Path(f"{output_dir}/pipeline_info").glob("execution_report_*.html")).read_text()
    match = re.search(r"run id \[([^\]]+)\]", content)
    nextflow_id = match.group(1) if match else ""
    run.reference = nextflow_id
    run.reference_type = "nextflow_id"

    # execution report and software versions
    for file_pattern, description, run_attr in [
        ("execution_report*", "execution report", "report"),
        ("nf_core_pipeline_software*", "software versions", "environment"),
    ]:
        artifact = ln.Artifact(
            next(Path(f"{output_dir}/pipeline_info").glob(file_pattern)),
            description=f"nextflow run {description} of {nextflow_id}",
            visibility=0,
            run=False,
        ).save()
        setattr(run, run_attr, artifact)

    # nextflow run parameters
    params_path = next(Path(f"{output_dir}/pipeline_info").glob("params*"))
    with params_path.open() as params_file:
        params = json.load(params_file)
    ln.Param(name="params", dtype="dict").save()
    run.params.add_values({"params": params})
    run.save()


args = parse_arguments()
scrnaseq_transform = ln.Transform(
    name="scrna-seq",
    version="2.7.1",
    type="pipeline",
    reference="https://github.com/nf-core/scrnaseq",
).save()
run = ln.Run(transform=scrnaseq_transform).save()
register_pipeline_io(args.input, args.output, run)
register_pipeline_metadata(args.output, run)
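The script reads the Nextflow run name out of the execution report with the regular expression `run id \[([^\]]+)\]`. The standalone sketch below isolates that step so its behavior is easy to check. The helper name `extract_nextflow_run_id` and the sample HTML fragment are illustrative assumptions; a real execution report is a much larger file.

```python
import re

# Same pattern as in the registration script: capture the text between
# the square brackets that follow "run id ".
RUN_ID_PATTERN = re.compile(r"run id \[([^\]]+)\]")


def extract_nextflow_run_id(report_html: str) -> str:
    """Return the run name found in the report text, or "" if absent."""
    match = RUN_ID_PATTERN.search(report_html)
    return match.group(1) if match else ""


# Hypothetical report fragment, made up for this example.
sample = "<p>Workflow session (run id [determined_snyder])</p>"
print(extract_nextflow_run_id(sample))  # determined_snyder
print(repr(extract_nextflow_run_id("<p>no identifier here</p>")))  # ''
```

Falling back to an empty string mirrors the script, which stores whatever was found in `run.reference` instead of failing when the report layout changes.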
!python register_scrnaseq_run.py --input scrnaseq_input --output scrnaseq_output
→ connected lamindb: testuser1/test-nextflow
/home/runner/work/nextflow-lamin/nextflow-lamin/docs/register_scrnaseq_run.py:63: FutureWarning: `name` will be removed soon, please pass 'scrna-seq' to `key` instead
scrnaseq_transform = ln.Transform(
! folder is outside existing storage location, will copy files from scrnaseq_input to /home/runner/work/nextflow-lamin/nextflow-lamin/docs/test-nextflow/scrnaseq_input
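The registration script locates files in `pipeline_info` with `next(Path(...).glob(...))`, which raises a bare `StopIteration` if the pipeline did not write the expected file. A more defensive variant might look like the sketch below. The helpers `find_first` and `load_params` are hypothetical names, and sorting the matches is an added assumption to make the choice deterministic when several files match.

```python
import json
import tempfile
from pathlib import Path


def find_first(directory: Path, pattern: str) -> Path:
    """Return the first path matching `pattern`, with a clear error if none exists."""
    match = next(iter(sorted(directory.glob(pattern))), None)
    if match is None:
        raise FileNotFoundError(f"no file matching {pattern!r} in {directory}")
    return match


def load_params(pipeline_info: Path) -> dict:
    """Load the run parameters from the first `params*` file in the folder."""
    with find_first(pipeline_info, "params*").open() as params_file:
        return json.load(params_file)


if __name__ == "__main__":
    # Build a throwaway folder that mimics `pipeline_info`; the file name is made up.
    with tempfile.TemporaryDirectory() as tmp:
        info = Path(tmp)
        (info / "params_example.json").write_text(json.dumps({"aligner": "star"}))
        print(load_params(info))
```

A `FileNotFoundError` that names the pattern and the folder is easier to diagnose in an automated post-run step than an unexplained `StopIteration`.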
Data lineage
The output data can now be accessed (in a different notebook or script) for analysis with full lineage.
matrix_af = ln.Artifact.get(description__icontains="filtered count matrix")
matrix_af.view_lineage()
View transforms & runs on the hub
View the database content
ln.view()
Artifact
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | r4OXD1sMTJdwKzYM0000 | None | filtered count matrix | .h5ad | None | AnnData | 659821 | nGaGBdvaUeiwNN9MLmiRqg | NaN | None | md5 | True | False | 1 | 1 | None | None | True | 1.0 | 2025-01-20 07:41:08.060000+00:00 | 1 | None | 1 |
4 | DKgdPKI1JtO4oYgi0000 | None | multiqc report | | None | None | 9676748 | cea20bdMyw7zCoRzbSpgxA | 59.0 | None | md5-d | True | True | 1 | 1 | None | None | True | 1.0 | 2025-01-20 07:41:08.039000+00:00 | 1 | None | 1 |
3 | IwTHrDe8vik5CzDF0000 | scrnaseq_input/samplesheet.csv | None | .csv | None | None | 236 | QXMVrT5ZucmidIxbYJ9KHA | NaN | None | md5 | True | False | 1 | 1 | None | None | True | NaN | 2025-01-20 07:41:07.993000+00:00 | 1 | None | 1 |
2 | k0gY7NxouzcIOl6H0000 | scrnaseq_input/S10_L001_R2_001.fastq.gz | None | .fastq.gz | None | None | 4259756 | W-dGV6rDQMWXfGSAv_xL0g | NaN | None | md5 | True | False | 1 | 1 | None | None | True | NaN | 2025-01-20 07:41:07.992000+00:00 | 1 | None | 1 |
1 | rgxkZjbRhAf8XV3s0000 | scrnaseq_input/S10_L001_R1_001.fastq.gz | None | .fastq.gz | None | None | 1727503 | UrpdRtwcAhl3QV7xfzI29w | NaN | None | md5 | True | False | 1 | 1 | None | None | True | NaN | 2025-01-20 07:41:07.991000+00:00 | 1 | None | 1 |
Param
id | name | dtype | is_type | _expect_many | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | params | dict | None | False | 1 | None | None | 2025-01-20 07:41:08.101000+00:00 | 1 | None | 1 |
ParamValue
id | value | hash | space_id | param_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|
1 | {'aligner': 'star', 'input': 'https://github.c... | None | 1 | 1 | 2025-01-20 07:41:08.111000+00:00 | 1 | None | 1 |
Run
id | uid | name | started_at | finished_at | reference | reference_type | _is_consecutive | _status_code | space_id | transform_id | report_id | _logfile_id | environment_id | initiated_by_run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | n0WdUGC6dDL5OXPZ5552 | None | 2025-01-20 07:41:07.959000+00:00 | None | determined_snyder | nextflow_id | None | 0 | 1 | 1 | 6 | None | 7 | None | 2025-01-20 07:41:07.959000+00:00 | 1 | None | 1 |
Storage
id | uid | root | description | type | region | instance_uid | space_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | qLUz5lvZxISn | /home/runner/work/nextflow-lamin/nextflow-lami... | None | local | None | 7JUvfoPu6nFp | 1 | None | 2025-01-20 07:33:55.893000+00:00 | 1 | None | 1 |
Transform
id | uid | key | description | type | source_code | hash | reference | reference_type | space_id | _template_id | version | is_latest | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | ogKfwRyhj7sW0000 | scrna-seq | None | pipeline | None | None | https://github.com/nf-core/scrnaseq | None | 1 | None | 2.7.1 | True | 2025-01-20 07:41:07.954000+00:00 | 1 | None | 1 |
ULabel
id | uid | name | is_type | description | reference | reference_type | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2HsL2eZI | nextflow | None | None | None | None | 1 | None | None | 2025-01-20 07:41:08.067000+00:00 | 1 | None | 1 |
# clean up the test instance:
!rm -rf test-nextflow
!lamin delete --force test-nextflow
• deleting instance testuser1/test-nextflow