Bulk RNA-seq

Note

More comprehensive examples are provided for these data types:

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage test-bulkrna --modules bionty
 initialized lamindb: testuser1/test-bulkrna
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
from pathlib import Path
 connected lamindb: testuser1/test-bulkrna

Ingest data

Access

We start by simulating an nf-core RNA-seq run, which yields a count matrix artifact.

(See Nextflow for running this with Nextflow.)

# pretend we're running a bulk RNA-seq pipeline
ln.track(
    transform=ln.Transform(key="nf-core RNA-seq", reference="https://nf-co.re/rnaseq")
)
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move the count matrix into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register the count matrix
ln.Artifact(path, description="Merged Bulk RNA counts").save()
 created Transform('dVZgYtmnYqkn0000', key='nf-core RNA-seq'), started new Run('sBQjDkyLNCANbViM') at 2026-02-11 19:58:18 UTC
Artifact(uid='d4qTrRfRBtg3OjWz0000', version_tag=None, is_latest=True, key='output_dir/salmon.merged.gene_counts.tsv', description='Merged Bulk RNA counts', suffix='.tsv', kind=None, otype=None, size=3787, hash='xxw0k3au3KtxFcgtbEr4eQ', n_files=None, n_observations=None, branch_id=1, space_id=1, storage_id=3, run_id=1, schema_id=None, created_by_id=3, created_at=2026-02-11 19:58:20 UTC, is_locked=False)

Transform

ln.track("s5V0dNMVwL9i0000")
 created Transform('s5V0dNMVwL9i0000', key='bulkrna.ipynb'), started new Run('mGAoLmODd1ml3D35') at 2026-02-11 19:58:21 UTC
 notebook imports: anndata==0.12.7 bionty==2.1.0 lamindb==2.1.2 pandas==2.3.3

Let’s query the artifact:

artifact = ln.Artifact.get(description="Merged Bulk RNA counts")
df = artifact.load()

Looking at it, we see that it deviates from the tidy data standard (Wickham14), from the conventions of statistics and machine learning (Hastie09, Murphy12), and from the layout expected by the major Python and R data packages.

Variables are not in columns and observations are not in rows:

df
gene_id gene_name RAP1_IAA_30M_REP1 RAP1_UNINDUCED_REP1 RAP1_UNINDUCED_REP2 WT_REP1 WT_REP2
0 Gfp_transgene_gene Gfp_transgene_gene 0.0 0.000 0.0 0.0 0.0
1 HRA1 HRA1 0.0 8.572 0.0 0.0 0.0
2 snR18 snR18 3.0 8.000 4.0 8.0 8.0
3 tA(UGC)A TGA1 0.0 0.000 0.0 0.0 0.0
4 tL(CAA)A SUP56 0.0 0.000 0.0 0.0 0.0
... ... ... ... ... ... ... ...
120 YAR064W YAR064W 0.0 2.000 0.0 0.0 0.0
121 YAR066W YAR066W 3.0 13.000 8.0 5.0 11.0
122 YAR068W YAR068W 9.0 28.000 24.0 5.0 7.0
123 YAR069C YAR069C 0.0 0.000 0.0 0.0 1.0
124 YAR070C YAR070C 0.0 0.000 0.0 0.0 0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T
df
0 1 2 3 4 5 6 7 8 9 ... 115 116 117 118 119 120 121 122 123 124
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
gene_name Gfp_transgene_gene HRA1 snR18 TGA1 SUP56 TRN1 tS(AGA)A TFC3 VPS8 EFB1 ... FLO1 YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.0 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.72 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.0 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.0 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.0 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

7 rows × 125 columns

Now it's clear that the first two rows are not observations but descriptions of the variables (or features) themselves.
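To make this concrete, here is a minimal sketch on a hypothetical three-gene frame (the names `toy`, `g1`, `S1` and all values are made up for illustration and are not part of the dataset). It shows that after transposing, the identifier rows sit alongside the sample rows and that every column ends up with the object dtype because strings and numbers are mixed:

```python
import pandas as pd

# hypothetical toy frame mimicking the layout of the merged gene counts table
toy = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3"],
        "gene_name": ["A", "B", "C"],
        "S1": [0.0, 3.0, 9.0],
        "S2": [1.0, 8.0, 7.0],
    }
)

toy_t = toy.T

# rows that describe the variables rather than hold measurements
meta_rows = ["gene_id", "gene_name"]
obs_rows = [row for row in toy_t.index if row not in meta_rows]

# each transposed column now mixes strings and floats, so pandas falls back to object
all_object = bool((toy_t.dtypes == object).all())
print(obs_rows, all_object)
```

Only the rows in `obs_rows` are observations; the two metadata rows belong in a separate table that annotates the variables.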

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name
gene_id
Gfp_transgene_gene Gfp_transgene_gene
HRA1 HRA1
snR18 snR18
tA(UGC)A TGA1
tL(CAA)A SUP56

Now, let’s create an AnnData object:

# cast the counts to float32: transposing the mixed-type frame left them with object dtype
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs × n_vars = 5 × 125
    var: 'gene_name'
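As an alternative to transposing first and casting afterwards, one could move `gene_id` into the index and drop `gene_name` before transposing, so that only numeric values are transposed. The following is a sketch of that idea on a hypothetical toy frame (`toy`, `g1`, `S1` and the values are assumptions for illustration); it uses plain pandas only and produces the two tables that would then be passed to `ad.AnnData`:

```python
import pandas as pd

# hypothetical toy frame with the same layout as the count matrix
toy = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3"],
        "gene_name": ["A", "B", "C"],
        "S1": [0.0, 3.0, 9.0],
        "S2": [1.0, 8.0, 7.0],
    }
)

indexed = toy.set_index("gene_id")

# variable annotations: one row per gene
var = indexed[["gene_name"]]

# observations in rows, genes in columns; the values stay numeric throughout
counts = indexed.drop(columns="gene_name").T.astype("float32")

print(counts.shape, list(counts.columns) == list(var.index))
```

Because the string columns never enter the transposed matrix, the cast to `float32` only changes precision instead of repairing an object dtype.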

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W ... YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.000 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 ... 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 ... 15.720 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.000 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 ... 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.000 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 ... 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.000 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 ... 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

5 rows × 125 columns

Curate

We define a simple Schema for bulk RNA-seq datasets that only expects genes identified by stable IDs. Later, we can annotate the curated dataset with additional metadata such as the assay or the organism.

bulk_schema = ln.Schema(itype=bt.Gene.stable_id, otype="AnnData").save()

# set the organism to map to saccharomyces cerevisiae genes
bt.settings.organism = "saccharomyces cerevisiae"

curator = ln.curators.AnnDataCurator(adata, bulk_schema)
curator.validate()
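Conceptually, validating variables against an identifier field amounts to a membership check of the variable names against a set of known identifiers. The sketch below illustrates only that idea with plain pandas; it is not how the curator is implemented, and `reference_ids` is a hypothetical stand-in for a gene registry rather than something queried from bionty:

```python
import pandas as pd

# hypothetical reference set standing in for a gene registry
reference_ids = {"YAL001C", "YAL002W", "YAL003W"}

# variable identifiers as they might appear in adata.var_names
var_names = pd.Index(["YAL001C", "YAL002W", "Gfp_transgene_gene"])

# boolean mask marking identifiers found in the reference set
is_known = var_names.isin(list(reference_ids))

validated = var_names[is_known].tolist()
non_validated = var_names[~is_known].tolist()
print(validated, non_validated)
```

Identifiers that fail such a check (here the made-up transgene entry) are the ones that need attention before the dataset counts as curated.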

Let’s create and save the artifact:

curated_af = curator.save_artifact(description="Curated bulk RNA counts")
 writing the in-memory object into cache

Link additional metadata records:

efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()
curated_af.labels.add(efs.rna_seq, features.assay)
curated_af.labels.add(organism.saccharomyces_cerevisiae, features.organism)
curated_af.describe()
Artifact:  (0000)
|   description: Curated bulk RNA counts
├── uid: 9CgpjRaZ3Lb5jdIc0000            run: mGAoLmO (bulkrna.ipynb)
kind: dataset                        otype: AnnData              
hash: 6bieh8XjOCCz6bJToN4u1g         size: 27.5 KB               
branch: main                         space: all                  
created_at: 2026-02-11 19:58:21 UTC  created_by: testuser1       
n_observations: 5                                                
├── storage/path: 
/home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb/9CgpjRaZ3Lb5jdIc0000.h5ad
├── Features
└── assay                          bionty.ExperimentalFactor            RNA-Seq                                
    organism                       bionty.Organism                      saccharomyces cerevisiae               
└── Labels
    └── .organisms                     bionty.Organism                      saccharomyces cerevisiae               
        .experimental_factors          bionty.ExperimentalFactor            RNA-Seq                                

Query data

The artifact registry now contains two artifacts, the raw count table and the curated AnnData object:

ln.Artifact.to_dataframe()
uid key description suffix kind otype size hash n_files n_observations version_tag is_latest is_locked created_at branch_id space_id storage_id run_id schema_id created_by_id
id
2 9CgpjRaZ3Lb5jdIc0000 None Curated bulk RNA counts .h5ad dataset AnnData 28180 6bieh8XjOCCz6bJToN4u1g None 5.0 None True False 2026-02-11 19:58:21.864000+00:00 1 1 3 2 1.0 3
1 d4qTrRfRBtg3OjWz0000 output_dir/salmon.merged.gene_counts.tsv Merged Bulk RNA counts .tsv None None 3787 xxw0k3au3KtxFcgtbEr4eQ None NaN None True False 2026-02-11 19:58:20.697000+00:00 1 1 3 1 NaN 3
curated_af.view_lineage()
[Figure: lineage graph of the curated artifact]
# clean up test instance
!rm -r test-bulkrna
!lamin delete --force test-bulkrna