Bulk RNA-seq¶
# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage test-bulkrna --schema bionty
→ connected lamindb: testuser1/test-bulkrna
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
from pathlib import Path
→ connected lamindb: testuser1/test-bulkrna
Ingest data¶
Access ¶
We start by simulating an nf-core RNA-seq run that yields a count matrix artifact.
(See Nextflow for running this with Nextflow.)
# pretend we're running a bulk RNA-seq pipeline
ln.track(
    transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq")
)
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()
→ created Transform('N4EXhlNt'), started new Run('BIxRQyjO') at 2024-12-20 15:06:12 UTC
Artifact(uid='pwJcYYc2Vp3qVJE20000', is_latest=True, key='output_dir/salmon.merged.gene_counts.tsv', description='Merged Bulk RNA counts', suffix='.tsv', size=3787, hash='xxw0k3au3KtxFcgtbEr4eQ', _hash_type='md5', visibility=1, _key_is_virtual=False, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:06:13 UTC)
Transform ¶
ln.track("s5V0dNMVwL9i0000")
→ created Transform('s5V0dNMV'), started new Run('cUAnUA5c') at 2024-12-20 15:06:14 UTC
→ notebook imports: anndata==0.11.1 bionty==0.53.2 lamindb==0.77.3 pandas==2.2.3
Let’s query the artifact:
artifact = ln.Artifact.get(description="Merged Bulk RNA counts")
df = artifact.load()
Looking at it, we see that it deviates from the tidy data standard (Wickham14), from the conventions of statistics and machine learning (Hastie09, Murphy12), and from those of the major Python and R data packages.
Variables are not in columns and observations are not in rows:
df
 | gene_id | gene_name | RAP1_IAA_30M_REP1 | RAP1_UNINDUCED_REP1 | RAP1_UNINDUCED_REP2 | WT_REP1 | WT_REP2 |
---|---|---|---|---|---|---|---|
0 | Gfp_transgene_gene | Gfp_transgene_gene | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
1 | HRA1 | HRA1 | 0.0 | 8.572 | 0.0 | 0.0 | 0.0 |
2 | snR18 | snR18 | 3.0 | 8.000 | 4.0 | 8.0 | 8.0 |
3 | tA(UGC)A | TGA1 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
4 | tL(CAA)A | SUP56 | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... |
120 | YAR064W | YAR064W | 0.0 | 2.000 | 0.0 | 0.0 | 0.0 |
121 | YAR066W | YAR066W | 3.0 | 13.000 | 8.0 | 5.0 | 11.0 |
122 | YAR068W | YAR068W | 9.0 | 28.000 | 24.0 | 5.0 | 7.0 |
123 | YAR069C | YAR069C | 0.0 | 0.000 | 0.0 | 0.0 | 1.0 |
124 | YAR070C | YAR070C | 0.0 | 0.000 | 0.0 | 0.0 | 0.0 |
125 rows × 7 columns
Let’s change that and move observations into rows:
df = df.T
df
 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | 64 | 65 | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 | 76 | 77 | 78 | 79 | 80 | 81 | 82 | 83 | 84 | 85 | 86 | 87 | 88 | 89 | 90 | 91 | 92 | 93 | 94 | 95 | 96 | 97 | 98 | 99 | 100 | 101 | 102 | 103 | 104 | 105 | 106 | 107 | 108 | 109 | 110 | 111 | 112 | 113 | 114 | 115 | 116 | 117 | 118 | 119 | 120 | 121 | 122 | 123 | 124 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | YAL004W | YAL005C | YAL007C | YAL008W | YAL009W | YAL010C | YAL011W | YAL012W | YAL013W | YAL014C | YAL015C | YAL016C-A | YAL016C-B | YAL016W | YAL017W | YAL018C | YAL019W | YAL019W-A | YAL020C | YAL021C | YAL022C | YAL023C | YAL024C | YAL025C | YAL026C | YAL026C-A | YAL027W | YAL028W | YAL029C | YAL030W | YAL031C | YAL031W-A | YAL032C | YAL033W | YAL034C | YAL034C-B | YAL034W-A | YAL035W | YAL036C | YAL037C-A | YAL037C-B | YAL037W | YAL038W | YAL039C | YAL040C | YAL041W | YAL042C-A | YAL042W | YAL043C | YAL044C | YAL044W-A | YAL045C | YAL046C | YAL047C | YAL047W-A | YAL048C | YAL049C | YAL051W | YAL053W | YAL054C | YAL055W | YAL056C-A | YAL056W | YAL058W | YAL059C-A | YAL059W | YAL060W | YAL061W | YAL062W | YAL063C | YAL063C-A | YAL064C-A | YAL064W | YAL064W-B | YAL065C | YAL066W | YAL067C | YAL067W-A | YAL068C | YAL068W-A | YAL069W | YAR002C-A | YAR002W | YAR003W | YAR007C | YAR008W | YAR009C | YAR010C | YAR014C | YAR015W | YAR018C | YAR019C | YAR019W-A | YAR020C | YAR023C | YAR027W | YAR028W | YAR029W | YAR030C | YAR031W | YAR033W | YAR035C-A | YAR035W | YAR042W | YAR047C | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
gene_name | Gfp_transgene_gene | HRA1 | snR18 | TGA1 | SUP56 | TRN1 | tS(AGA)A | TFC3 | VPS8 | EFB1 | YAL004W | SSA1 | ERP2 | FUN14 | SPO7 | MDM10 | SWC3 | CYS3 | DEP1 | SYN8 | NTG1 | YAL016C-A | YAL016C-B | TPD3 | PSK1 | LDS1 | FUN30 | YAL019W-A | ATS1 | CCR4 | FUN26 | PMT2 | LTE1 | MAK16 | DRS2 | YAL026C-A | SAW1 | FRT2 | MYO4 | SNC1 | GIP4 | YAL031W-A | PRP45 | POP5 | FUN19 | YAL034C-B | MTW1 | FUN12 | RBG1 | YAL037C-A | YAL037C-B | YAL037W | CDC19 | CYC3 | CLN3 | CDC24 | YAL042C-A | ERV46 | PTA1 | GCV3 | YAL044W-A | YAL045C | AIM1 | SPC72 | YAL047W-A | GEM1 | AIM2 | OAF1 | FLC2 | ACS1 | PEX22 | YAL056C-A | GPB2 | CNE1 | YAL059C-A | ECM1 | BDH1 | BDH2 | GDH3 | FLO9 | YAL063C-A | TDA8 | YAL064W | YAL064W-B | YAL065C | YAL066W | SEO1 | YAL067W-A | PAU8 | YAL068W-A | YAL069W | ERP1 | NUP60 | SWD1 | RFA1 | SEN34 | YAR009C | YAR010C | BUD14 | ADE1 | KIN3 | CDC15 | YAR019W-A | PAU7 | YAR023C | UIP3 | YAR028W | YAR029W | YAR030C | PRM9 | MST28 | YAR035C-A | YAT1 | SWH1 | YAR047C | FLO1 | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
RAP1_IAA_30M_REP1 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | 1.0 | 6174.0 | 46.0 | 14.0 | 14.0 | 11.0 | 10.0 | 247.0 | 8.0 | 16.0 | 12.0 | 0.0 | 0.0 | 148.0 | 100.0 | 0.0 | 101.0 | 0.0 | 12.0 | 105.0 | 42.0 | 302.0 | 49.0 | 19.0 | 122.0 | 12.0 | 10.0 | 9.0 | 178.0 | 14.0 | 38.0 | 0.0 | 14.0 | 13.0 | 16.0 | 0.0 | 8.0 | 409.0 | 49.0 | 0.0 | 1.0 | 4.0 | 5710.0 | 9.0 | 34.0 | 61.0 | 0.0 | 141.0 | 63.0 | 33.0 | 5.0 | 0.0 | 2.0 | 20.0 | 0.0 | 18.0 | 26.0 | 38.0 | 116.0 | 3.0 | 6.0 | 0.0 | 61.0 | 23.0 | 0.0 | 17.0 | 49.0 | 30.0 | 5.0 | 4.643 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 2.0 | 64.0 | 57.0 | 27.0 | 56.0 | 8.0 | 16523.0 | 5741.0 | 55.0 | 53.0 | 20.0 | 21.0 | 0.0 | 0.0 | 1.0 | 25.0 | 18.0 | 1.0 | 2.0 | 17.0 | 1.0 | 0.0 | 2.0 | 104.0 | 0.0 | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | 345.089 | 6000.911 | 56.0 | 17.0 | 15.0 | 12.0 | 25.0 | 232.0 | 18.0 | 20.0 | 12.0 | 13.999 | 1.0 | 154.001 | 114.0 | 0.0 | 111.0 | 2.901 | 5.099 | 102.0 | 36.0 | 323.0 | 52.0 | 29.0 | 159.428 | 12.0 | 11.0 | 8.0 | 177.0 | 14.0 | 40.0 | 0.0 | 14.0 | 12.0 | 16.0 | 0.0 | 2.0 | 482.0 | 59.0 | 0.0 | 264.768 | 5.0 | 6162.232 | 23.0 | 44.0 | 65.0 | 12.533 | 133.467 | 67.0 | 58.0 | 0.0 | 3.0 | 3.0 | 23.0 | 1.0 | 20.0 | 37.0 | 60.0 | 129.0 | 13.0 | 2.0 | 1.0 | 49.0 | 27.0 | 0.0 | 23.0 | 53.0 | 19.0 | 27.0 | 23.28 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 78.0 | 60.0 | 17.0 | 67.0 | 14.0 | 17154.0 | 6178.0 | 61.0 | 63.0 | 14.0 | 35.0 | 0.0 | 0.0 | 1.0 | 34.0 | 13.0 | 2.0 | 0.0 | 15.0 | 17.0 | 0.0 | 5.0 | 105.0 | 0.0 | 15.72 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | 1.0 | 13355.0 | 132.0 | 24.0 | 19.0 | 36.0 | 36.0 | 536.0 | 35.0 | 43.0 | 28.0 | 0.0 | 0.0 | 326.0 | 210.0 | 0.0 | 238.0 | 0.0 | 18.0 | 203.0 | 86.0 | 659.0 | 99.0 | 56.0 | 314.989 | 19.011 | 20.0 | 19.0 | 359.0 | 40.0 | 72.0 | 0.0 | 37.0 | 24.0 | 44.0 | 0.0 | 25.0 | 872.0 | 147.0 | 0.0 | 3.0 | 10.0 | 13457.0 | 39.0 | 92.0 | 140.0 | 0.0 | 291.0 | 135.0 | 123.0 | 14.0 | 1.0 | 7.0 | 40.0 | 0.0 | 46.0 | 65.0 | 119.0 | 262.0 | 7.0 | 5.0 | 0.0 | 133.0 | 49.0 | 0.0 | 42.0 | 114.0 | 82.0 | 37.0 | 13.228 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 15.0 | 1.0 | 1.0 | 156.0 | 121.0 | 35.0 | 126.0 | 18.0 | 33244.0 | 11826.0 | 168.0 | 151.0 | 28.0 | 72.0 | 0.0 | 0.0 | 5.0 | 74.0 | 56.0 | 4.0 | 2.0 | 42.0 | 20.0 | 0.0 | 4.0 | 198.0 | 2.0 | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | 0.0 | 8218.0 | 61.0 | 10.0 | 9.0 | 30.0 | 19.0 | 385.0 | 21.0 | 28.0 | 8.0 | 0.0 | 0.0 | 194.0 | 101.0 | 0.0 | 186.0 | 0.0 | 12.0 | 136.0 | 54.0 | 432.0 | 69.0 | 50.0 | 180.41 | 14.59 | 5.0 | 4.0 | 241.0 | 15.0 | 36.0 | 0.0 | 18.0 | 18.0 | 11.0 | 0.0 | 5.0 | 760.0 | 82.0 | 0.0 | 0.0 | 4.0 | 10313.0 | 12.0 | 50.0 | 89.0 | 0.0 | 168.0 | 77.0 | 47.0 | 7.0 | 0.0 | 8.0 | 19.0 | 0.0 | 26.0 | 34.0 | 82.0 | 133.0 | 5.0 | 1.0 | 0.0 | 72.0 | 39.0 | 0.0 | 33.0 | 43.0 | 9.0 | 13.0 | 4.535 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 85.0 | 61.0 | 22.0 | 95.0 | 4.0 | 36435.0 | 13470.0 | 68.0 | 91.0 | 25.0 | 26.0 | 0.0 | 0.0 | 2.0 | 26.0 | 16.0 | 0.0 | 1.0 | 12.0 | 5.0 | 0.0 | 12.0 | 127.0 | 1.0 | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.0 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | 1.0 | 4279.0 | 44.0 | 3.0 | 5.0 | 10.0 | 17.0 | 230.0 | 10.0 | 17.0 | 6.0 | 0.0 | 0.0 | 104.0 | 60.0 | 0.0 | 84.0 | 0.0 | 2.0 | 79.0 | 27.0 | 244.0 | 46.0 | 21.0 | 123.638 | 5.362 | 2.0 | 4.0 | 139.0 | 8.0 | 15.0 | 0.0 | 13.0 | 7.0 | 6.0 | 0.0 | 6.0 | 390.0 | 48.0 | 0.0 | 0.0 | 1.0 | 6339.0 | 5.0 | 36.0 | 50.0 | 0.0 | 102.0 | 44.0 | 25.0 | 3.0 | 0.0 | 6.0 | 14.0 | 0.0 | 10.0 | 12.0 | 39.0 | 68.0 | 1.0 | 2.0 | 0.0 | 38.0 | 27.0 | 0.0 | 20.0 | 29.0 | 5.0 | 4.0 | 1.109 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 64.0 | 40.0 | 14.0 | 45.0 | 5.0 | 17184.0 | 6132.0 | 50.0 | 39.0 | 14.0 | 18.0 | 0.0 | 0.0 | 0.0 | 19.0 | 11.0 | 0.0 | 0.0 | 9.0 | 4.0 | 0.0 | 2.0 | 75.0 | 0.0 | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
Now it’s clear that the first two rows are not observations, but descriptions of the variables (or features) themselves.
Let’s create an AnnData object to model this. First, create a dataframe for the variables:
var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_id | gene_name |
---|---|
Gfp_transgene_gene | Gfp_transgene_gene |
HRA1 | HRA1 |
snR18 | snR18 |
tA(UGC)A | TGA1 |
tL(CAA)A | SUP56 |
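The same split can be tried on a toy table. The sketch below uses made-up gene IDs, gene names, and sample names (`g1`, `A`, `s1`, and so on are assumptions for illustration, not values from the dataset). It mirrors the steps so far: transpose, move the two annotation rows into a variable table, and keep the remaining rows as numeric observations.

```python
import pandas as pd

# toy genes-by-samples table shaped like the pipeline output (hypothetical values)
toy = pd.DataFrame(
    {
        "gene_id": ["g1", "g2", "g3"],
        "gene_name": ["A", "B", "C"],
        "s1": [0.0, 3.0, 9.0],
        "s2": [1.0, 8.0, 28.0],
    }
)

# observations (samples) move into rows, variables (genes) into columns
toy_t = toy.T

# the first two rows describe the variables, so they go into their own table
toy_var = pd.DataFrame(
    {"gene_name": toy_t.loc["gene_name"].values},
    index=pd.Index(toy_t.loc["gene_id"].values, name="gene_id"),
)

# the remaining rows are measurements; the transposed frame has object dtype, so cast
toy_counts = toy_t.iloc[2:].astype("float32")
toy_counts.columns = toy_var.index

print(toy_var)
print(toy_counts)
```

Labeling the count columns with the gene IDs, as in the last assignment, keeps the measurement table and the variable table aligned on the same index.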
Now, let’s create an AnnData object:
# we're also fixing the datatype here: after transposing, the counts are stored with object dtype
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs × n_vars = 5 × 125
var: 'gene_name'
The AnnData object is in tidy form and complies with conventions of statistics and machine learning:
adata.to_df()
gene_id | Gfp_transgene_gene | HRA1 | snR18 | tA(UGC)A | tL(CAA)A | tP(UGG)A | tS(AGA)A | YAL001C | YAL002W | YAL003W | YAL004W | YAL005C | YAL007C | YAL008W | YAL009W | YAL010C | YAL011W | YAL012W | YAL013W | YAL014C | YAL015C | YAL016C-A | YAL016C-B | YAL016W | YAL017W | YAL018C | YAL019W | YAL019W-A | YAL020C | YAL021C | YAL022C | YAL023C | YAL024C | YAL025C | YAL026C | YAL026C-A | YAL027W | YAL028W | YAL029C | YAL030W | YAL031C | YAL031W-A | YAL032C | YAL033W | YAL034C | YAL034C-B | YAL034W-A | YAL035W | YAL036C | YAL037C-A | YAL037C-B | YAL037W | YAL038W | YAL039C | YAL040C | YAL041W | YAL042C-A | YAL042W | YAL043C | YAL044C | YAL044W-A | YAL045C | YAL046C | YAL047C | YAL047W-A | YAL048C | YAL049C | YAL051W | YAL053W | YAL054C | YAL055W | YAL056C-A | YAL056W | YAL058W | YAL059C-A | YAL059W | YAL060W | YAL061W | YAL062W | YAL063C | YAL063C-A | YAL064C-A | YAL064W | YAL064W-B | YAL065C | YAL066W | YAL067C | YAL067W-A | YAL068C | YAL068W-A | YAL069W | YAR002C-A | YAR002W | YAR003W | YAR007C | YAR008W | YAR009C | YAR010C | YAR014C | YAR015W | YAR018C | YAR019C | YAR019W-A | YAR020C | YAR023C | YAR027W | YAR028W | YAR029W | YAR030C | YAR031W | YAR033W | YAR035C-A | YAR035W | YAR042W | YAR047C | YAR050W | YAR053W | YAR060C | YAR061W | YAR062W | YAR064W | YAR066W | YAR068W | YAR069C | YAR070C |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
RAP1_IAA_30M_REP1 | 0.0 | 0.000 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 55.0 | 36.0 | 632.0 | 1.000000 | 6174.000000 | 46.0 | 14.0 | 14.0 | 11.0 | 10.0 | 247.0 | 8.0 | 16.0 | 12.0 | 0.000 | 0.0 | 148.000000 | 100.0 | 0.0 | 101.0 | 0.000 | 12.000 | 105.0 | 42.0 | 302.0 | 49.0 | 19.0 | 122.000000 | 12.000 | 10.0 | 9.0 | 178.0 | 14.0 | 38.0 | 0.0 | 14.0 | 13.0 | 16.0 | 0.0 | 8.0 | 409.0 | 49.0 | 0.0 | 1.000000 | 4.0 | 5710.000000 | 9.0 | 34.0 | 61.0 | 0.000 | 141.000000 | 63.0 | 33.0 | 5.0 | 0.0 | 2.0 | 20.0 | 0.0 | 18.0 | 26.0 | 38.0 | 116.0 | 3.0 | 6.0 | 0.0 | 61.0 | 23.0 | 0.0 | 17.0 | 49.0 | 30.0 | 5.0 | 4.643000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 2.0 | 64.0 | 57.0 | 27.0 | 56.0 | 8.0 | 16523.0 | 5741.0 | 55.0 | 53.0 | 20.0 | 21.0 | 0.0 | 0.0 | 1.0 | 25.0 | 18.0 | 1.0 | 2.0 | 17.0 | 1.0 | 0.0 | 2.0 | 104.0 | 0.0 | 4.357 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 9.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP1 | 0.0 | 8.572 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72.0 | 33.0 | 810.0 | 345.088989 | 6000.911133 | 56.0 | 17.0 | 15.0 | 12.0 | 25.0 | 232.0 | 18.0 | 20.0 | 12.0 | 13.999 | 1.0 | 154.001007 | 114.0 | 0.0 | 111.0 | 2.901 | 5.099 | 102.0 | 36.0 | 323.0 | 52.0 | 29.0 | 159.427994 | 12.000 | 11.0 | 8.0 | 177.0 | 14.0 | 40.0 | 0.0 | 14.0 | 12.0 | 16.0 | 0.0 | 2.0 | 482.0 | 59.0 | 0.0 | 264.768005 | 5.0 | 6162.231934 | 23.0 | 44.0 | 65.0 | 12.533 | 133.466995 | 67.0 | 58.0 | 0.0 | 3.0 | 3.0 | 23.0 | 1.0 | 20.0 | 37.0 | 60.0 | 129.0 | 13.0 | 2.0 | 1.0 | 49.0 | 27.0 | 0.0 | 23.0 | 53.0 | 19.0 | 27.0 | 23.280001 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 | 0.0 | 5.0 | 0.0 | 1.0 | 78.0 | 60.0 | 17.0 | 67.0 | 14.0 | 17154.0 | 6178.0 | 61.0 | 63.0 | 14.0 | 35.0 | 0.0 | 0.0 | 1.0 | 34.0 | 13.0 | 2.0 | 0.0 | 15.0 | 17.0 | 0.0 | 5.0 | 105.0 | 0.0 | 15.720 | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 | 13.0 | 28.0 | 0.0 | 0.0 |
RAP1_UNINDUCED_REP2 | 0.0 | 0.000 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 | 82.0 | 1693.0 | 1.000000 | 13355.000000 | 132.0 | 24.0 | 19.0 | 36.0 | 36.0 | 536.0 | 35.0 | 43.0 | 28.0 | 0.000 | 0.0 | 326.000000 | 210.0 | 0.0 | 238.0 | 0.000 | 18.000 | 203.0 | 86.0 | 659.0 | 99.0 | 56.0 | 314.989014 | 19.011 | 20.0 | 19.0 | 359.0 | 40.0 | 72.0 | 0.0 | 37.0 | 24.0 | 44.0 | 0.0 | 25.0 | 872.0 | 147.0 | 0.0 | 3.000000 | 10.0 | 13457.000000 | 39.0 | 92.0 | 140.0 | 0.000 | 291.000000 | 135.0 | 123.0 | 14.0 | 1.0 | 7.0 | 40.0 | 0.0 | 46.0 | 65.0 | 119.0 | 262.0 | 7.0 | 5.0 | 0.0 | 133.0 | 49.0 | 0.0 | 42.0 | 114.0 | 82.0 | 37.0 | 13.228000 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | 0.0 | 4.0 | 0.0 | 15.0 | 1.0 | 1.0 | 156.0 | 121.0 | 35.0 | 126.0 | 18.0 | 33244.0 | 11826.0 | 168.0 | 151.0 | 28.0 | 72.0 | 0.0 | 0.0 | 5.0 | 74.0 | 56.0 | 4.0 | 2.0 | 42.0 | 20.0 | 0.0 | 4.0 | 198.0 | 2.0 | 13.772 | 0.0 | 4.0 | 0.0 | 2.0 | 0.0 | 8.0 | 24.0 | 0.0 | 0.0 |
WT_REP1 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 1.0 | 0.0 | 60.0 | 63.0 | 1115.0 | 0.000000 | 8218.000000 | 61.0 | 10.0 | 9.0 | 30.0 | 19.0 | 385.0 | 21.0 | 28.0 | 8.0 | 0.000 | 0.0 | 194.000000 | 101.0 | 0.0 | 186.0 | 0.000 | 12.000 | 136.0 | 54.0 | 432.0 | 69.0 | 50.0 | 180.410004 | 14.590 | 5.0 | 4.0 | 241.0 | 15.0 | 36.0 | 0.0 | 18.0 | 18.0 | 11.0 | 0.0 | 5.0 | 760.0 | 82.0 | 0.0 | 0.000000 | 4.0 | 10313.000000 | 12.0 | 50.0 | 89.0 | 0.000 | 168.000000 | 77.0 | 47.0 | 7.0 | 0.0 | 8.0 | 19.0 | 0.0 | 26.0 | 34.0 | 82.0 | 133.0 | 5.0 | 1.0 | 0.0 | 72.0 | 39.0 | 0.0 | 33.0 | 43.0 | 9.0 | 13.0 | 4.535000 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 0.0 | 0.0 | 85.0 | 61.0 | 22.0 | 95.0 | 4.0 | 36435.0 | 13470.0 | 68.0 | 91.0 | 25.0 | 26.0 | 0.0 | 0.0 | 2.0 | 26.0 | 16.0 | 0.0 | 1.0 | 12.0 | 5.0 | 0.0 | 12.0 | 127.0 | 1.0 | 13.465 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 5.0 | 0.0 | 0.0 |
WT_REP2 | 0.0 | 0.000 | 8.0 | 0.0 | 0.0 | 0.0 | 0.0 | 30.0 | 25.0 | 704.0 | 1.000000 | 4279.000000 | 44.0 | 3.0 | 5.0 | 10.0 | 17.0 | 230.0 | 10.0 | 17.0 | 6.0 | 0.000 | 0.0 | 104.000000 | 60.0 | 0.0 | 84.0 | 0.000 | 2.000 | 79.0 | 27.0 | 244.0 | 46.0 | 21.0 | 123.638000 | 5.362 | 2.0 | 4.0 | 139.0 | 8.0 | 15.0 | 0.0 | 13.0 | 7.0 | 6.0 | 0.0 | 6.0 | 390.0 | 48.0 | 0.0 | 0.000000 | 1.0 | 6339.000000 | 5.0 | 36.0 | 50.0 | 0.000 | 102.000000 | 44.0 | 25.0 | 3.0 | 0.0 | 6.0 | 14.0 | 0.0 | 10.0 | 12.0 | 39.0 | 68.0 | 1.0 | 2.0 | 0.0 | 38.0 | 27.0 | 0.0 | 20.0 | 29.0 | 5.0 | 4.0 | 1.109000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 64.0 | 40.0 | 14.0 | 45.0 | 5.0 | 17184.0 | 6132.0 | 50.0 | 39.0 | 14.0 | 18.0 | 0.0 | 0.0 | 0.0 | 19.0 | 11.0 | 0.0 | 0.0 | 9.0 | 4.0 | 0.0 | 2.0 | 75.0 | 0.0 | 6.891 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 11.0 | 7.0 | 1.0 | 0.0 |
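Some fractional counts change slightly between the transposed table and `adata.to_df()`; for example, 345.089 is shown as 345.088989. This is expected from the cast to `float32`, which keeps roughly seven significant decimal digits. A minimal sketch using only the standard library reproduces the effect (the helper name `to_float32` is made up for illustration):

```python
import struct


def to_float32(x: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# a fractional count from the table is not exactly representable in float32
print(to_float32(345.089))

# whole-number counts survive unchanged: float32 is exact for integers up to 2**24
print(to_float32(36435.0) == 36435.0)
```

If exact fractional values matter downstream, casting to `float64` instead is an option, at the cost of a larger file.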
Validate ¶
Let’s create an Artifact object from this AnnData.
Almost all gene IDs are validated:
genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)
! did not create Gene records for 2 non-validated stable_ids: 'Gfp_transgene_gene', 'YAR062W'
# save the validated gene records obtained from Bionty (no records were created for the 2 non-validated IDs)
ln.save(genes)
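Conceptually, validating identifiers against a registry is a membership check. The following sketch is not how `bionty` implements it; it uses a small hypothetical reference set and a made-up helper `split_validated` to show how a list of IDs separates into validated and non-validated ones, and what share of the 125 genes the two non-validated IDs make up.

```python
def split_validated(values, reference):
    """Partition values into those found in the reference set and those not found."""
    validated = [v for v in values if v in reference]
    non_validated = [v for v in values if v not in reference]
    return validated, non_validated


# hypothetical excerpt of a reference gene registry (assumption for illustration)
reference_ids = {"HRA1", "snR18", "YAL009W"}

# a few IDs from the count matrix, including the two that were not validated
ids = ["Gfp_transgene_gene", "HRA1", "snR18", "YAL009W", "YAR062W"]

validated, non_validated = split_validated(ids, reference_ids)
print(non_validated)

# share of non-validated IDs in the full matrix: 2 out of 125 genes
print(f"{2 / 125:.2%}")
```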
Register ¶
efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()
curated_file = ln.Artifact.from_anndata(adata, description="Curated bulk RNA counts")
Let’s save this artifact:
curated_file.save()
Artifact(uid='TdIKeU6wpoAWaSkM0000', is_latest=True, description='Curated bulk RNA counts', suffix='.h5ad', type='dataset', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-12-20 15:06:15 UTC)
Link to validated metadata records:
curated_file.features._add_set_from_anndata(
    var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae"
)
! 2 unique terms (1.60%) are not validated for stable_id: 'Gfp_transgene_gene', 'YAR062W'
curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)
curated_file.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'TdIKeU6wpoAWaSkM0000'
│   ├── .size = 28180
│   ├── .hash = '6bieh8XjOCCz6bJToN4u1g'
│   ├── .path =
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb/TdIKeU6wpoAWaSkM0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:06:15
│   └── .transform = 'Bulk RNA-seq'
├── Dataset features/.feature_sets
│   └── var • 123 [bionty.Gene]
│       TGA1 float
│       SUP56 float
│       TRN1 float
│       TFC3 float
│       VPS8 float
│       EFB1 float
│       SSA1 float
│       ERP2 float
│       FUN14 float
│       SPO7 float
│       MDM10 float
│       SWC3 float
│       CYS3 float
│       DEP1 float
│       SYN8 float
│       NTG1 float
├── Linked features
│   └── assay cat[bionty.ExperimentalF… RNA-Seq
│       organism cat[bionty.Organism] saccharomyces cerevisiae
└── Labels
    └── .organisms bionty.Organism saccharomyces cerevisiae
        .experimental_factors bionty.ExperimentalFactor RNA-Seq
Query data¶
We now have two artifacts in the artifact registry:
ln.Artifact.df()
id | uid | key | description | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | version | is_latest | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | TdIKeU6wpoAWaSkM0000 | None | Curated bulk RNA counts | .h5ad | dataset | 28180 | 6bieh8XjOCCz6bJToN4u1g | None | None | md5 | AnnData | 1 | True | 1 | 2 | None | True | 2 | 2024-12-20 15:06:15.292588+00:00 | 1 |
1 | pwJcYYc2Vp3qVJE20000 | output_dir/salmon.merged.gene_counts.tsv | Merged Bulk RNA counts | .tsv | None | 3787 | xxw0k3au3KtxFcgtbEr4eQ | None | None | md5 | None | 1 | False | 1 | 1 | None | True | 1 | 2024-12-20 15:06:13.830029+00:00 | 1 |
curated_file.view_lineage()
Let’s query by gene:
genes = bt.Gene.lookup()
genes.spo7
Gene(uid='2pkcLeMEB6aS', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', synonyms='', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation ', created_by_id=1, run_id=2, source_id=19, organism_id=1, created_at=2024-12-20 15:06:15 UTC)
# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()
# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()
id | uid | key | description | suffix | type | size | hash | n_objects | n_observations | _hash_type | _accessor | visibility | _key_is_virtual | storage_id | transform_id | version | is_latest | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | TdIKeU6wpoAWaSkM0000 | None | Curated bulk RNA counts | .h5ad | dataset | 28180 | 6bieh8XjOCCz6bJToN4u1g | None | None | md5 | AnnData | 1 | True | 1 | 2 | None | True | 2 | 2024-12-20 15:06:15.292588+00:00 | 1 |
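The two-step query above follows links between registries: from a gene to the feature sets that contain it, and from a feature set to the artifacts that carry it. A toy sketch with plain dictionaries illustrates the traversal; all names and the dictionary layout are hypothetical and unrelated to lamindb's actual schema.

```python
# hypothetical link tables: feature set -> genes, artifact -> feature sets
feature_set_genes = {
    "var_genes": {"YAL009W", "YAL001C", "YAL002W"},
}
artifact_feature_sets = {
    "Curated bulk RNA counts": {"var_genes"},
    "Merged Bulk RNA counts": set(),  # the raw TSV has no linked feature set
}


def artifacts_measuring(gene_id):
    """Return descriptions of artifacts linked to a feature set that contains gene_id."""
    matching_sets = {
        name for name, genes in feature_set_genes.items() if gene_id in genes
    }
    return sorted(
        description
        for description, sets in artifact_feature_sets.items()
        if sets & matching_sets
    )


# YAL009W is the stable ID of SPO7
print(artifacts_measuring("YAL009W"))
```

As in the real query, only the curated artifact is returned, because the raw count matrix was never linked to a gene set.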
# clean up test instance
!rm -r test-bulkrna
!lamin delete --force test-bulkrna
• deleting instance testuser1/test-bulkrna