Nextflow

Nextflow is the most widely used workflow manager in bioinformatics.

This guide shows how to register a Nextflow run, together with its inputs and outputs, in LaminDB by running a Python script. It uses the nf-core/scrnaseq pipeline as an example.

The approach could be automated by deploying the script via:

  1. a serverless environment trigger (e.g., AWS Lambda)

  2. a post-run script on the Seqera Platform
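
For the serverless option, one possible shape is a small handler that turns a completion event into a call to the registration script. The sketch below is an assumption, not part of Nextflow, the Seqera Platform, or LaminDB: the event keys `input_dir` and `output_dir`, the function names, and the injectable `runner` argument are hypothetical. Only the script name `register_scrnaseq_run.py` and its `--input`/`--output` flags come from this guide.

```python
import subprocess
import sys
from typing import Any, Callable, Dict, List, Optional, Sequence


def build_registration_command(input_dir: str, output_dir: str) -> List[str]:
    """Build the CLI call for the registration script used in this guide."""
    return [
        sys.executable,
        "register_scrnaseq_run.py",
        "--input", input_dir,
        "--output", output_dir,
    ]


def handler(
    event: Dict[str, Any],
    context: Any = None,
    runner: Optional[Callable[[Sequence[str]], Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical entry point, e.g. for an AWS Lambda-style trigger."""
    # `input_dir` and `output_dir` are assumed event keys, not a fixed schema
    command = build_registration_command(event["input_dir"], event["output_dir"])
    if runner is None:
        # default: run the script and raise if it exits with a non-zero status
        runner = lambda cmd: subprocess.run(cmd, check=True)
    runner(command)
    return {"status": "registered", "command": command}
```

Passing a custom `runner` makes it possible to exercise the handler without executing the script, which is convenient when trying it out locally.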


!lamin init --storage ./test-nextflow --name test-nextflow
 initialized lamindb: testuser1/test-nextflow

Run the pipeline

Let’s download the input data from an S3 bucket.

import lamindb as ln

input_path = ln.UPath("s3://lamindb-test/scrnaseq_input")
input_path.download_to("scrnaseq_input")
 connected lamindb: testuser1/test-nextflow

Then run the nf-core/scrnaseq pipeline.

# the test profile uses all downloaded input files as input
!nextflow run nf-core/scrnaseq -r 2.7.1 -profile docker,test -resume --outdir scrnaseq_output
N E X T F L O W  ~  version 24.10.5
Pulling nf-core/scrnaseq ...
 downloaded from https://github.com/nf-core/scrnaseq.git
WARN: It appears you have never run this project before -- Option `-resume` is ignored
Launching `https://github.com/nf-core/scrnaseq` [soggy_bhaskara] DSL2 - revision: 4171377f40 [2.7.1]
Downloading plugin [email protected]
------------------------------------------------------
                                        ,--./,-.
        ___     __   __   __   ___     /,-._.--~'
  |\ | |__  __ /  ` /  \ |__) |__         }  {
  | \| |       \__, \__/ |  \ |___     \`-._,-`-,
                                        `._,._,'
  nf-core/scrnaseq v2.7.1-g4171377
------------------------------------------------------
Core Nextflow options
  revision                  : 2.7.1
  runName                   : soggy_bhaskara
  containerEngine           : docker
  launchDir                 : /home/runner/work/nextflow-lamin/nextflow-lamin/docs
  workDir                   : /home/runner/work/nextflow-lamin/nextflow-lamin/docs/work
  projectDir                : /home/runner/.nextflow/assets/nf-core/scrnaseq
  userName                  : runner
  profile                   : docker,test
  configFiles               : 

Input/output options
  input                     : https://github.com/nf-core/test-datasets/raw/scrnaseq/samplesheet-2-0.csv
  outdir                    : scrnaseq_output

Mandatory arguments
  aligner                   : star
  protocol                  : 10XV2

Skip Tools
  skip_emptydrops           : true

Reference genome options
  fasta                     : https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/GRCm38.p6.genome.chr19.fa
  gtf                       : https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/gencode.vM19.annotation.chr19.gtf

Institutional config options
  config_profile_name       : Test profile
  config_profile_description: Minimal test dataset to check pipeline function

Max job request options
  max_cpus                  : 2
  max_memory                : 6.GB
  max_time                  : 6.h

!! Only displaying parameters that differ from the pipeline defaults !!
------------------------------------------------------
If you use nf-core/scrnaseq for your analysis please cite:

* The pipeline
  https://doi.org/10.5281/zenodo.3568187

* The nf-core framework
  https://doi.org/10.1038/s41587-020-0439-x

* Software dependencies
  https://github.com/nf-core/scrnaseq/blob/master/CITATIONS.md
------------------------------------------------------
[92/b6afd4] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (Sample_Y)
[c1/58d6c2] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:FASTQC_CHECK:FASTQC (Sample_X)
[16/67862e] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:GTF_GENE_FILTER (GRCm38.p6.genome.chr19.fa)
[3b/d06051] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_GENOMEGENERATE (GRCm38.p6.genome.chr19.fa)
[e6/ebf2c6] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN (Sample_X)
[bf/d1bc50] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:STARSOLO:STAR_ALIGN (Sample_Y)
[f8/249553] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_X)
[5f/e023c2] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_X)
[84/741659] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_X)
[e1/061162] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_X)
[72/ed0d3d] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_Y)
[09/ed34a9] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_Y)
[9f/f9fd89] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_SEURAT (Sample_Y)
[3e/7de3b6] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:MTX_TO_H5AD (Sample_Y)
[0d/17dc23] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MULTIQC
[cf/846bf7] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD (1)
[bd/cdbd1a] Submitted process > NFCORE_SCRNASEQ:SCRNASEQ:MTX_CONVERSION:CONCAT_H5AD (2)
-[nf-core/scrnaseq] Pipeline completed successfully-
What is the full run command for the test profile?
nextflow run nf-core/scrnaseq -r 2.7.1 \
    -profile docker \
    -resume \
    --outdir scrnaseq_output \
    --input 'scrnaseq_input/samplesheet-2-0.csv' \
    --skip_emptydrops \
    --fasta 'https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/GRCm38.p6.genome.chr19.fa' \
    --gtf 'https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/gencode.vM19.annotation.chr19.gtf' \
    --aligner 'star' \
    --protocol '10XV2' \
    --max_cpus 2 \
    --max_memory '6.GB' \
    --max_time '6.h'
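
The expanded command is the test profile's parameters spelled out as `--name value` flags. As a rough illustration of that mapping, the sketch below assembles the same argument list from a dictionary. `build_nextflow_command` is a hypothetical helper written for this guide, not part of Nextflow or nf-core.

```python
import shlex
from typing import List, Mapping, Union

ParamValue = Union[str, int, bool]


def build_nextflow_command(
    pipeline: str,
    revision: str,
    profile: str,
    params: Mapping[str, ParamValue],
    resume: bool = True,
) -> List[str]:
    """Assemble a `nextflow run` argument list from pipeline parameters."""
    # single-dash options are handled by Nextflow itself
    command = ["nextflow", "run", pipeline, "-r", revision, "-profile", profile]
    if resume:
        command.append("-resume")
    # double-dash options are passed on to the pipeline as parameters
    for name, value in params.items():
        if value is True:
            command.append(f"--{name}")  # boolean switch, e.g. --skip_emptydrops
        elif value is not False:
            command.extend([f"--{name}", str(value)])
    return command


test_profile_params = {
    "outdir": "scrnaseq_output",
    "input": "scrnaseq_input/samplesheet-2-0.csv",
    "skip_emptydrops": True,
    "fasta": "https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/GRCm38.p6.genome.chr19.fa",
    "gtf": "https://github.com/nf-core/test-datasets/raw/scrnaseq/reference/gencode.vM19.annotation.chr19.gtf",
    "aligner": "star",
    "protocol": "10XV2",
    "max_cpus": 2,
    "max_memory": "6.GB",
    "max_time": "6.h",
}
command = build_nextflow_command("nf-core/scrnaseq", "2.7.1", "docker", test_profile_params)
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell-quoting issues if it is later passed to `subprocess.run`.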

Run the registration script

After the pipeline has completed, a Python script registers inputs & outputs in LaminDB.

nf-core/scrnaseq run registration
import argparse
import lamindb as ln
import json
import re
from pathlib import Path


def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", type=str, required=True)
    parser.add_argument("--output", type=str, required=True)
    return parser.parse_args()


def register_pipeline_io(input_dir: str, output_dir: str, run: ln.Run) -> None:
    """Register input and output artifacts for an `nf-core/scrnaseq` run."""
    input_artifacts = ln.Artifact.from_dir(input_dir, run=False)
    ln.save(input_artifacts)
    run.input_artifacts.set(input_artifacts)
    ln.Artifact(f"{output_dir}/multiqc", key="multiqc report", run=run).save()
    ln.Artifact(
        f"{output_dir}/star/mtx_conversions/combined_filtered_matrix.h5ad",
        key="filtered_count_matrix.h5ad",
        run=run,
    ).save()


def register_pipeline_metadata(output_dir: str, run: ln.Run) -> None:
    """Register nf-core run metadata stored in the 'pipeline_info' folder."""
    ulabel = ln.ULabel(name="nextflow").save()
    run.transform.ulabels.add(ulabel)

    # nextflow run id
    content = next(Path(f"{output_dir}/pipeline_info").glob("execution_report_*.html")).read_text()
    match = re.search(r"run id \[([^\]]+)\]", content)
    nextflow_id = match.group(1) if match else ""
    run.reference = nextflow_id
    run.reference_type = "nextflow_id"

    # completed at
    completion_match = re.search(r'<span id="workflow_complete">([^<]+)</span>', content)
    if completion_match:
        from datetime import datetime

        timestamp_str = completion_match.group(1).strip()
        run.finished_at = datetime.strptime(timestamp_str, "%d-%b-%Y %H:%M:%S")

    # execution report and software versions
    for file_pattern, description, run_attr in [
        ("execution_report*", "execution report", "report"),
        ("nf_core_pipeline_software*", "software versions", "environment"),
    ]:
        artifact = ln.Artifact(
            next(Path(f"{output_dir}/pipeline_info").glob(file_pattern)),
            # pass the label as `description`, not `key`: a key without a suffix
            # conflicts with the suffix of the file (e.g. `.html`) and raises InvalidArgument
            description=f"nextflow run {description} of {nextflow_id}",
            visibility=0,
            run=False,
        ).save()
        setattr(run, run_attr, artifact)

    # nextflow run parameters
    params_path = next(Path(f"{output_dir}/pipeline_info").glob("params*"))
    with params_path.open() as params_file:
        params = json.load(params_file)
    ln.Param(name="params", dtype="dict").save()
    run.params.add_values({"params": params})
    run.save()


args = parse_arguments()
scrnaseq_transform = ln.Transform(
    key="scrna-seq",
    version="2.7.1",
    type="pipeline",
    reference="https://github.com/nf-core/scrnaseq",
).save()
run = ln.Run(transform=scrnaseq_transform).save()
register_pipeline_io(args.input, args.output, run)
register_pipeline_metadata(args.output, run)

!python register_scrnaseq_run.py --input scrnaseq_input --output scrnaseq_output
 connected lamindb: testuser1/test-nextflow
! folder is outside existing storage location, will copy files from scrnaseq_input to /home/runner/work/nextflow-lamin/nextflow-lamin/docs/test-nextflow/scrnaseq_input

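The metadata step of the registration script relies on two regular expressions applied to the Nextflow execution report. To make that step easier to follow, here is a standalone sketch that applies the same patterns to a made-up HTML fragment. The fragment, the run id value, and the helper name `parse_execution_report` are illustrative assumptions; a real report contains far more markup.

```python
import re
from datetime import datetime
from typing import Optional, Tuple


def parse_execution_report(html: str) -> Tuple[str, Optional[datetime]]:
    """Extract the run id and completion time with the patterns used in the script."""
    id_match = re.search(r"run id \[([^\]]+)\]", html)
    run_id = id_match.group(1) if id_match else ""
    done_match = re.search(r'<span id="workflow_complete">([^<]+)</span>', html)
    finished_at = None
    if done_match:
        # same timestamp layout the script expects, e.g. 16-Mar-2025 21:37:00
        finished_at = datetime.strptime(done_match.group(1).strip(), "%d-%b-%Y %H:%M:%S")
    return run_id, finished_at


# made-up fragment for illustration only, not copied from a real report
sample_html = """
<p>run id [00000000-aaaa-bbbb-cccc-000000000000]</p>
<span id="workflow_complete"> 16-Mar-2025 21:37:00 </span>
"""
run_id, finished_at = parse_execution_report(sample_html)
print(run_id, finished_at)
```

When neither marker is present, the helper returns an empty run id and `None`, mirroring how the script leaves `finished_at` unset if the completion time cannot be found.
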
Data lineage

The output data can now be accessed (in a different notebook or script) for analysis with full lineage.

matrix_af = ln.Artifact.get(key__icontains="filtered_count_matrix.h5ad")
matrix_af.view_lineage()
[Figure: data lineage graph of the filtered count matrix, as rendered by view_lineage()]

View transforms & runs on the hub

hub

View the database content

ln.view()
Artifact
uid key description suffix kind otype size hash n_files n_observations _hash_type _key_is_virtual _overwrite_versions space_id storage_id schema_id version is_latest run_id created_at created_by_id _aux _branch_code
id
5 i97q7x29alGeLGqX0000 filtered_count_matrix.h5ad None .h5ad None AnnData 659819 f3z2Jit0LftSkQ2DyB3dBQ NaN None md5 True False 1 1 None None True 1.0 2025-03-16 21:37:38.596000+00:00 1 None 1
4 nrlIvAXdqgalPzhG0000 multiqc report None None None 9676733 qfhOczg-smjLjmQTIQJoHg 59.0 None md5-d True True 1 1 None None True 1.0 2025-03-16 21:37:38.574000+00:00 1 None 1
3 r7nW3kkAmtQc9uAH0000 scrnaseq_input/S10_L001_R2_001.fastq.gz None .fastq.gz None None 4259756 W-dGV6rDQMWXfGSAv_xL0g NaN None md5 True False 1 1 None None True NaN 2025-03-16 21:37:38.529000+00:00 1 None 1
2 FW1TI9KhqsFZ7ssO0000 scrnaseq_input/S10_L001_R1_001.fastq.gz None .fastq.gz None None 1727503 UrpdRtwcAhl3QV7xfzI29w NaN None md5 True False 1 1 None None True NaN 2025-03-16 21:37:38.528000+00:00 1 None 1
1 Bvk8jIR4j3tbneuy0000 scrnaseq_input/samplesheet.csv None .csv None None 236 QXMVrT5ZucmidIxbYJ9KHA NaN None md5 True False 1 1 None None True NaN 2025-03-16 21:37:38.527000+00:00 1 None 1
Run
uid name started_at finished_at reference reference_type _is_consecutive _status_code space_id transform_id report_id _logfile_id environment_id initiated_by_run_id created_at created_by_id _aux _branch_code
id
1 ttit7y6ZxJ2WB6m6yARl None 2025-03-16 21:37:38.497000+00:00 None None None None 0 1 1 None None None None 2025-03-16 21:37:38.497000+00:00 1 None 1
Storage
uid root description type region instance_uid space_id run_id created_at created_by_id _aux _branch_code
id
1 dXarAV3Cjhjo /home/runner/work/nextflow-lamin/nextflow-lami... None local None 7JUvfoPu6nFp 1 None 2025-03-16 21:30:13.256000+00:00 1 None 1
Transform
uid key description type source_code hash reference reference_type space_id _template_id version is_latest created_at created_by_id _aux _branch_code
id
1 reYtJe6MdPGZ0000 scrna-seq None pipeline None None https://github.com/nf-core/scrnaseq None 1 None 2.7.1 True 2025-03-16 21:37:38.494000+00:00 1 None 1
ULabel
uid name is_type description reference reference_type space_id type_id run_id created_at created_by_id _aux _branch_code
id
1 WY90PaxS nextflow False None None None 1 None None 2025-03-16 21:37:38.739000+00:00 1 None 1
# clean up the test instance:
!rm -rf test-nextflow
!lamin delete --force test-nextflow
 deleting instance testuser1/test-nextflow