
Bulk RNA-seq

Note

More comprehensive examples are provided for these data types:

# !pip install 'lamindb[jupyter,bionty]'
!lamin init --storage test-bulkrna --schema bionty
 connected lamindb: testuser1/test-bulkrna
import lamindb as ln
import bionty as bt
import pandas as pd
import anndata as ad
from pathlib import Path
 connected lamindb: testuser1/test-bulkrna

Ingest data

Access

We start by simulating an nf-core RNA-seq run, which yields a count matrix artifact.

(See Nextflow for running this with Nextflow.)

# pretend we're running a bulk RNA-seq pipeline
ln.track(
    transform=ln.Transform(name="nf-core RNA-seq", reference="https://nf-co.re/rnaseq")
)
# create a directory for its output
Path("./test-bulkrna/output_dir").mkdir(exist_ok=True)
# get the count matrix
path = ln.core.datasets.file_tsv_rnaseq_nfcore_salmon_merged_gene_counts(
    populate_registries=True
)
# move it into the output directory
path = path.rename(f"./test-bulkrna/output_dir/{path.name}")
# register it
ln.Artifact(path, description="Merged Bulk RNA counts").save()
 created Transform('N4EXhlNt'), started new Run('BIxRQyjO') at 2024-12-20 15:06:12 UTC
Artifact(uid='pwJcYYc2Vp3qVJE20000', is_latest=True, key='output_dir/salmon.merged.gene_counts.tsv', description='Merged Bulk RNA counts', suffix='.tsv', size=3787, hash='xxw0k3au3KtxFcgtbEr4eQ', _hash_type='md5', visibility=1, _key_is_virtual=False, storage_id=1, transform_id=1, run_id=1, created_by_id=1, created_at=2024-12-20 15:06:13 UTC)
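
The directory-and-move step above can be tried in isolation. What follows is a minimal sketch using only the standard library; the temporary directory and the placeholder file content are assumptions made for illustration and are not part of the pipeline. It passes `parents=True` as well, which the notebook can presumably omit because `lamin init` has already created `test-bulkrna`.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # placeholder standing in for the downloaded count matrix (assumption)
    source = root / "salmon.merged.gene_counts.tsv"
    source.write_text("gene_id\tgene_name\n")

    # parents=True also creates the missing "test-bulkrna" level
    output_dir = root / "test-bulkrna" / "output_dir"
    output_dir.mkdir(parents=True, exist_ok=True)

    # Path.rename returns the new path (Python 3.8+), so it can be reassigned
    moved = source.rename(output_dir / source.name)

    moved_exists = moved.exists()
    source_exists = source.exists()
```

Reassigning the result of `rename`, as the notebook does with `path`, keeps the variable pointing at the file's new location.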

Transform

ln.track("s5V0dNMVwL9i0000")
 created Transform('s5V0dNMV'), started new Run('cUAnUA5c') at 2024-12-20 15:06:14 UTC
 notebook imports: anndata==0.11.1 bionty==0.53.2 lamindb==0.77.3 pandas==2.2.3

Let’s query the artifact:

artifact = ln.Artifact.get(description="Merged Bulk RNA counts")
df = artifact.load()

If we look at it, we see that it deviates from the tidy data standard [Wickham14], from the conventions of statistics & machine learning [Hastie09, Murphy12], and from the major Python & R data packages.

Variables are not in columns and observations are not in rows:

df
gene_id gene_name RAP1_IAA_30M_REP1 RAP1_UNINDUCED_REP1 RAP1_UNINDUCED_REP2 WT_REP1 WT_REP2
0 Gfp_transgene_gene Gfp_transgene_gene 0.0 0.000 0.0 0.0 0.0
1 HRA1 HRA1 0.0 8.572 0.0 0.0 0.0
2 snR18 snR18 3.0 8.000 4.0 8.0 8.0
3 tA(UGC)A TGA1 0.0 0.000 0.0 0.0 0.0
4 tL(CAA)A SUP56 0.0 0.000 0.0 0.0 0.0
... ... ... ... ... ... ... ...
120 YAR064W YAR064W 0.0 2.000 0.0 0.0 0.0
121 YAR066W YAR066W 3.0 13.000 8.0 5.0 11.0
122 YAR068W YAR068W 9.0 28.000 24.0 5.0 7.0
123 YAR069C YAR069C 0.0 0.000 0.0 0.0 1.0
124 YAR070C YAR070C 0.0 0.000 0.0 0.0 0.0

125 rows × 7 columns

Let’s change that and move observations into rows:

df = df.T
df
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C YAL008W YAL009W YAL010C YAL011W YAL012W YAL013W YAL014C YAL015C YAL016C-A YAL016C-B YAL016W YAL017W YAL018C YAL019W YAL019W-A YAL020C YAL021C YAL022C YAL023C YAL024C YAL025C YAL026C YAL026C-A YAL027W YAL028W YAL029C YAL030W YAL031C YAL031W-A YAL032C YAL033W YAL034C YAL034C-B YAL034W-A YAL035W YAL036C YAL037C-A YAL037C-B YAL037W YAL038W YAL039C YAL040C YAL041W YAL042C-A YAL042W YAL043C YAL044C YAL044W-A YAL045C YAL046C YAL047C YAL047W-A YAL048C YAL049C YAL051W YAL053W YAL054C YAL055W YAL056C-A YAL056W YAL058W YAL059C-A YAL059W YAL060W YAL061W YAL062W YAL063C YAL063C-A YAL064C-A YAL064W YAL064W-B YAL065C YAL066W YAL067C YAL067W-A YAL068C YAL068W-A YAL069W YAR002C-A YAR002W YAR003W YAR007C YAR008W YAR009C YAR010C YAR014C YAR015W YAR018C YAR019C YAR019W-A YAR020C YAR023C YAR027W YAR028W YAR029W YAR030C YAR031W YAR033W YAR035C-A YAR035W YAR042W YAR047C YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
gene_name Gfp_transgene_gene HRA1 snR18 TGA1 SUP56 TRN1 tS(AGA)A TFC3 VPS8 EFB1 YAL004W SSA1 ERP2 FUN14 SPO7 MDM10 SWC3 CYS3 DEP1 SYN8 NTG1 YAL016C-A YAL016C-B TPD3 PSK1 LDS1 FUN30 YAL019W-A ATS1 CCR4 FUN26 PMT2 LTE1 MAK16 DRS2 YAL026C-A SAW1 FRT2 MYO4 SNC1 GIP4 YAL031W-A PRP45 POP5 FUN19 YAL034C-B MTW1 FUN12 RBG1 YAL037C-A YAL037C-B YAL037W CDC19 CYC3 CLN3 CDC24 YAL042C-A ERV46 PTA1 GCV3 YAL044W-A YAL045C AIM1 SPC72 YAL047W-A GEM1 AIM2 OAF1 FLC2 ACS1 PEX22 YAL056C-A GPB2 CNE1 YAL059C-A ECM1 BDH1 BDH2 GDH3 FLO9 YAL063C-A TDA8 YAL064W YAL064W-B YAL065C YAL066W SEO1 YAL067W-A PAU8 YAL068W-A YAL069W ERP1 NUP60 SWD1 RFA1 SEN34 YAR009C YAR010C BUD14 ADE1 KIN3 CDC15 YAR019W-A PAU7 YAR023C UIP3 YAR028W YAR029W YAR030C PRM9 MST28 YAR035C-A YAT1 SWH1 YAR047C FLO1 YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.0 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 1.0 6174.0 46.0 14.0 14.0 11.0 10.0 247.0 8.0 16.0 12.0 0.0 0.0 148.0 100.0 0.0 101.0 0.0 12.0 105.0 42.0 302.0 49.0 19.0 122.0 12.0 10.0 9.0 178.0 14.0 38.0 0.0 14.0 13.0 16.0 0.0 8.0 409.0 49.0 0.0 1.0 4.0 5710.0 9.0 34.0 61.0 0.0 141.0 63.0 33.0 5.0 0.0 2.0 20.0 0.0 18.0 26.0 38.0 116.0 3.0 6.0 0.0 61.0 23.0 0.0 17.0 49.0 30.0 5.0 4.643 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 1.0 2.0 64.0 57.0 27.0 56.0 8.0 16523.0 5741.0 55.0 53.0 20.0 21.0 0.0 0.0 1.0 25.0 18.0 1.0 2.0 17.0 1.0 0.0 2.0 104.0 0.0 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 345.089 6000.911 56.0 17.0 15.0 12.0 25.0 232.0 18.0 20.0 12.0 13.999 1.0 154.001 114.0 0.0 111.0 2.901 5.099 102.0 36.0 323.0 52.0 29.0 159.428 12.0 11.0 8.0 177.0 14.0 40.0 0.0 14.0 12.0 16.0 0.0 2.0 482.0 59.0 0.0 264.768 5.0 6162.232 23.0 44.0 65.0 12.533 133.467 67.0 58.0 0.0 3.0 3.0 23.0 1.0 20.0 37.0 60.0 129.0 13.0 2.0 1.0 49.0 27.0 0.0 23.0 53.0 19.0 27.0 23.28 1.0 0.0 0.0 0.0 2.0 0.0 3.0 0.0 5.0 0.0 1.0 78.0 60.0 17.0 67.0 14.0 17154.0 6178.0 61.0 63.0 14.0 35.0 0.0 0.0 1.0 34.0 13.0 2.0 0.0 15.0 17.0 0.0 5.0 105.0 0.0 15.72 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.0 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 1.0 13355.0 132.0 24.0 19.0 36.0 36.0 536.0 35.0 43.0 28.0 0.0 0.0 326.0 210.0 0.0 238.0 0.0 18.0 203.0 86.0 659.0 99.0 56.0 314.989 19.011 20.0 19.0 359.0 40.0 72.0 0.0 37.0 24.0 44.0 0.0 25.0 872.0 147.0 0.0 3.0 10.0 13457.0 39.0 92.0 140.0 0.0 291.0 135.0 123.0 14.0 1.0 7.0 40.0 0.0 46.0 65.0 119.0 262.0 7.0 5.0 0.0 133.0 49.0 0.0 42.0 114.0 82.0 37.0 13.228 1.0 0.0 2.0 1.0 0.0 0.0 4.0 0.0 15.0 1.0 1.0 156.0 121.0 35.0 126.0 18.0 33244.0 11826.0 168.0 151.0 28.0 72.0 0.0 0.0 5.0 74.0 56.0 4.0 2.0 42.0 20.0 0.0 4.0 198.0 2.0 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.0 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 0.0 8218.0 61.0 10.0 9.0 30.0 19.0 385.0 21.0 28.0 8.0 0.0 0.0 194.0 101.0 0.0 186.0 0.0 12.0 136.0 54.0 432.0 69.0 50.0 180.41 14.59 5.0 4.0 241.0 15.0 36.0 0.0 18.0 18.0 11.0 0.0 5.0 760.0 82.0 0.0 0.0 4.0 10313.0 12.0 50.0 89.0 0.0 168.0 77.0 47.0 7.0 0.0 8.0 19.0 0.0 26.0 34.0 82.0 133.0 5.0 1.0 0.0 72.0 39.0 0.0 33.0 43.0 9.0 13.0 4.535 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 3.0 0.0 0.0 85.0 61.0 22.0 95.0 4.0 36435.0 13470.0 68.0 91.0 25.0 26.0 0.0 0.0 2.0 26.0 16.0 0.0 1.0 12.0 5.0 0.0 12.0 127.0 1.0 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.0 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 1.0 4279.0 44.0 3.0 5.0 10.0 17.0 230.0 10.0 17.0 6.0 0.0 0.0 104.0 60.0 0.0 84.0 0.0 2.0 79.0 27.0 244.0 46.0 21.0 123.638 5.362 2.0 4.0 139.0 8.0 15.0 0.0 13.0 7.0 6.0 0.0 6.0 390.0 48.0 0.0 0.0 1.0 6339.0 5.0 36.0 50.0 0.0 102.0 44.0 25.0 3.0 0.0 6.0 14.0 0.0 10.0 12.0 39.0 68.0 1.0 2.0 0.0 38.0 27.0 0.0 20.0 29.0 5.0 4.0 1.109 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 64.0 40.0 14.0 45.0 5.0 17184.0 6132.0 50.0 39.0 14.0 18.0 0.0 0.0 0.0 19.0 11.0 0.0 0.0 9.0 4.0 0.0 2.0 75.0 0.0 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0

Now it’s clear that the first two rows are not observations at all, but descriptions of the variables (or features) themselves.
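
The effect of the transpose can be reproduced on a toy frame. The sketch below is a hedged illustration, not the notebook's method: it uses three genes and three samples copied from the table above, and the names `toy`, `var_toy`, and `counts` are hypothetical. It shows that a plain `.T` mixes strings and numbers, and it outlines one alternative that separates the annotation columns before transposing.

```python
import io

import pandas as pd

# three genes and three samples copied from the count matrix shown above
tsv = (
    "gene_id\tgene_name\tRAP1_IAA_30M_REP1\tRAP1_UNINDUCED_REP1\tWT_REP1\n"
    "HRA1\tHRA1\t0.0\t8.572\t0.0\n"
    "snR18\tsnR18\t3.0\t8.0\t8.0\n"
    "tA(UGC)A\tTGA1\t0.0\t0.0\t0.0\n"
)
toy = pd.read_csv(io.StringIO(tsv), sep="\t")

# a plain transpose puts strings and numbers into the same columns,
# so every column of the result has dtype object
transposed = toy.T

# alternative: split off the gene annotations first ...
var_toy = toy.set_index("gene_id")[["gene_name"]]
# ... and transpose only the numeric part, which stays float64
counts = toy.set_index("gene_id").drop(columns="gene_name").T
```

With this split, `counts` has samples in rows and genes in columns, and `var_toy` carries the per-gene annotations, which is the same separation the AnnData object makes.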

Let’s create an AnnData object to model this. First, create a dataframe for the variables:

var = pd.DataFrame({"gene_name": df.loc["gene_name"].values}, index=df.loc["gene_id"])
var.head()
gene_name
gene_id
Gfp_transgene_gene Gfp_transgene_gene
HRA1 HRA1
snR18 snR18
tA(UGC)A TGA1
tL(CAA)A SUP56

Now, let’s create an AnnData object:

# we're also fixing the datatype here, which was string in the tsv
adata = ad.AnnData(df.iloc[2:].astype("float32"), var=var)
adata
AnnData object with n_obs × n_vars = 5 × 125
    var: 'gene_name'

The AnnData object is in tidy form and complies with conventions of statistics and machine learning:

adata.to_df()
gene_id Gfp_transgene_gene HRA1 snR18 tA(UGC)A tL(CAA)A tP(UGG)A tS(AGA)A YAL001C YAL002W YAL003W YAL004W YAL005C YAL007C YAL008W YAL009W YAL010C YAL011W YAL012W YAL013W YAL014C YAL015C YAL016C-A YAL016C-B YAL016W YAL017W YAL018C YAL019W YAL019W-A YAL020C YAL021C YAL022C YAL023C YAL024C YAL025C YAL026C YAL026C-A YAL027W YAL028W YAL029C YAL030W YAL031C YAL031W-A YAL032C YAL033W YAL034C YAL034C-B YAL034W-A YAL035W YAL036C YAL037C-A YAL037C-B YAL037W YAL038W YAL039C YAL040C YAL041W YAL042C-A YAL042W YAL043C YAL044C YAL044W-A YAL045C YAL046C YAL047C YAL047W-A YAL048C YAL049C YAL051W YAL053W YAL054C YAL055W YAL056C-A YAL056W YAL058W YAL059C-A YAL059W YAL060W YAL061W YAL062W YAL063C YAL063C-A YAL064C-A YAL064W YAL064W-B YAL065C YAL066W YAL067C YAL067W-A YAL068C YAL068W-A YAL069W YAR002C-A YAR002W YAR003W YAR007C YAR008W YAR009C YAR010C YAR014C YAR015W YAR018C YAR019C YAR019W-A YAR020C YAR023C YAR027W YAR028W YAR029W YAR030C YAR031W YAR033W YAR035C-A YAR035W YAR042W YAR047C YAR050W YAR053W YAR060C YAR061W YAR062W YAR064W YAR066W YAR068W YAR069C YAR070C
RAP1_IAA_30M_REP1 0.0 0.000 3.0 0.0 0.0 0.0 1.0 55.0 36.0 632.0 1.000000 6174.000000 46.0 14.0 14.0 11.0 10.0 247.0 8.0 16.0 12.0 0.000 0.0 148.000000 100.0 0.0 101.0 0.000 12.000 105.0 42.0 302.0 49.0 19.0 122.000000 12.000 10.0 9.0 178.0 14.0 38.0 0.0 14.0 13.0 16.0 0.0 8.0 409.0 49.0 0.0 1.000000 4.0 5710.000000 9.0 34.0 61.0 0.000 141.000000 63.0 33.0 5.0 0.0 2.0 20.0 0.0 18.0 26.0 38.0 116.0 3.0 6.0 0.0 61.0 23.0 0.0 17.0 49.0 30.0 5.0 4.643000 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 2.0 1.0 2.0 64.0 57.0 27.0 56.0 8.0 16523.0 5741.0 55.0 53.0 20.0 21.0 0.0 0.0 1.0 25.0 18.0 1.0 2.0 17.0 1.0 0.0 2.0 104.0 0.0 4.357 0.0 1.0 0.0 1.0 0.0 3.0 9.0 0.0 0.0
RAP1_UNINDUCED_REP1 0.0 8.572 8.0 0.0 0.0 0.0 0.0 72.0 33.0 810.0 345.088989 6000.911133 56.0 17.0 15.0 12.0 25.0 232.0 18.0 20.0 12.0 13.999 1.0 154.001007 114.0 0.0 111.0 2.901 5.099 102.0 36.0 323.0 52.0 29.0 159.427994 12.000 11.0 8.0 177.0 14.0 40.0 0.0 14.0 12.0 16.0 0.0 2.0 482.0 59.0 0.0 264.768005 5.0 6162.231934 23.0 44.0 65.0 12.533 133.466995 67.0 58.0 0.0 3.0 3.0 23.0 1.0 20.0 37.0 60.0 129.0 13.0 2.0 1.0 49.0 27.0 0.0 23.0 53.0 19.0 27.0 23.280001 1.0 0.0 0.0 0.0 2.0 0.0 3.0 0.0 5.0 0.0 1.0 78.0 60.0 17.0 67.0 14.0 17154.0 6178.0 61.0 63.0 14.0 35.0 0.0 0.0 1.0 34.0 13.0 2.0 0.0 15.0 17.0 0.0 5.0 105.0 0.0 15.720 0.0 0.0 0.0 3.0 2.0 13.0 28.0 0.0 0.0
RAP1_UNINDUCED_REP2 0.0 0.000 4.0 0.0 0.0 0.0 0.0 115.0 82.0 1693.0 1.000000 13355.000000 132.0 24.0 19.0 36.0 36.0 536.0 35.0 43.0 28.0 0.000 0.0 326.000000 210.0 0.0 238.0 0.000 18.000 203.0 86.0 659.0 99.0 56.0 314.989014 19.011 20.0 19.0 359.0 40.0 72.0 0.0 37.0 24.0 44.0 0.0 25.0 872.0 147.0 0.0 3.000000 10.0 13457.000000 39.0 92.0 140.0 0.000 291.000000 135.0 123.0 14.0 1.0 7.0 40.0 0.0 46.0 65.0 119.0 262.0 7.0 5.0 0.0 133.0 49.0 0.0 42.0 114.0 82.0 37.0 13.228000 1.0 0.0 2.0 1.0 0.0 0.0 4.0 0.0 15.0 1.0 1.0 156.0 121.0 35.0 126.0 18.0 33244.0 11826.0 168.0 151.0 28.0 72.0 0.0 0.0 5.0 74.0 56.0 4.0 2.0 42.0 20.0 0.0 4.0 198.0 2.0 13.772 0.0 4.0 0.0 2.0 0.0 8.0 24.0 0.0 0.0
WT_REP1 0.0 0.000 8.0 0.0 0.0 1.0 0.0 60.0 63.0 1115.0 0.000000 8218.000000 61.0 10.0 9.0 30.0 19.0 385.0 21.0 28.0 8.0 0.000 0.0 194.000000 101.0 0.0 186.0 0.000 12.000 136.0 54.0 432.0 69.0 50.0 180.410004 14.590 5.0 4.0 241.0 15.0 36.0 0.0 18.0 18.0 11.0 0.0 5.0 760.0 82.0 0.0 0.000000 4.0 10313.000000 12.0 50.0 89.0 0.000 168.000000 77.0 47.0 7.0 0.0 8.0 19.0 0.0 26.0 34.0 82.0 133.0 5.0 1.0 0.0 72.0 39.0 0.0 33.0 43.0 9.0 13.0 4.535000 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 3.0 0.0 0.0 85.0 61.0 22.0 95.0 4.0 36435.0 13470.0 68.0 91.0 25.0 26.0 0.0 0.0 2.0 26.0 16.0 0.0 1.0 12.0 5.0 0.0 12.0 127.0 1.0 13.465 0.0 0.0 0.0 1.0 0.0 5.0 5.0 0.0 0.0
WT_REP2 0.0 0.000 8.0 0.0 0.0 0.0 0.0 30.0 25.0 704.0 1.000000 4279.000000 44.0 3.0 5.0 10.0 17.0 230.0 10.0 17.0 6.0 0.000 0.0 104.000000 60.0 0.0 84.0 0.000 2.000 79.0 27.0 244.0 46.0 21.0 123.638000 5.362 2.0 4.0 139.0 8.0 15.0 0.0 13.0 7.0 6.0 0.0 6.0 390.0 48.0 0.0 0.000000 1.0 6339.000000 5.0 36.0 50.0 0.000 102.000000 44.0 25.0 3.0 0.0 6.0 14.0 0.0 10.0 12.0 39.0 68.0 1.0 2.0 0.0 38.0 27.0 0.0 20.0 29.0 5.0 4.0 1.109000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 64.0 40.0 14.0 45.0 5.0 17184.0 6132.0 50.0 39.0 14.0 18.0 0.0 0.0 0.0 19.0 11.0 0.0 0.0 9.0 4.0 0.0 2.0 75.0 0.0 6.891 0.0 1.0 0.0 0.0 0.0 11.0 7.0 1.0 0.0
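
Some values in this table differ slightly from the TSV, for example 345.089 now appears as 345.088989. This is a consequence of the cast to `float32`. A minimal sketch with NumPy, where the variable names are hypothetical:

```python
import numpy as np

# a fractional estimate from the RAP1_UNINDUCED_REP1 row of the TSV
raw = 345.089
as_f32 = np.float32(raw)

# float32 has a 24-bit significand, so 345.089 is rounded to the nearest
# representable value; the rounding error is tiny but not zero
error = abs(float(as_f32) - raw)

# whole-number counts below 2**24 are stored exactly
exact = float(np.float32(6174.0)) == 6174.0
```

If the fractional Salmon estimates need to be preserved more precisely, casting to `float64` instead is an option, at the cost of twice the memory.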

Validate

Let’s create an Artifact object from this AnnData.

Almost all gene IDs are validated:

genes = bt.Gene.from_values(
    adata.var.index,
    bt.Gene.stable_id,
    organism="saccharomyces cerevisiae",  # or set globally with bt.settings.organism
)
! did not create Gene records for 2 non-validated stable_ids: 'Gfp_transgene_gene', 'YAR062W'
# save the validated gene records created from Bionty; the 2 non-validated IDs were skipped
ln.save(genes)
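
Conceptually, validation partitions the identifiers into those found in a reference and those missing from it. The following is a plain-Python sketch of that idea only; `split_validated` and the tiny `reference` set are hypothetical and do not reflect how bionty implements the lookup.

```python
def split_validated(ids, reference):
    """Partition ids into those present in the reference and those missing from it."""
    validated = [i for i in ids if i in reference]
    non_validated = [i for i in ids if i not in reference]
    return validated, non_validated


# hypothetical reference; in the notebook this role is played by the gene registry
reference = {"HRA1", "snR18", "YAL001C"}
ids = ["Gfp_transgene_gene", "HRA1", "snR18", "YAL001C"]

validated, non_validated = split_validated(ids, reference)

# share of identifiers that could not be validated, in percent
share = 100 * len(non_validated) / len(ids)
```

The same arithmetic gives 1.60% for 2 unmatched IDs out of 125. A transgene such as `Gfp_transgene_gene` is not part of the yeast reference annotation, so it is expected to fall into the non-validated group.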

Register

efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
features = ln.Feature.lookup()
curated_file = ln.Artifact.from_anndata(adata, description="Curated bulk RNA counts")

Let’s save this artifact:

curated_file.save()
Artifact(uid='TdIKeU6wpoAWaSkM0000', is_latest=True, description='Curated bulk RNA counts', suffix='.h5ad', type='dataset', size=28180, hash='6bieh8XjOCCz6bJToN4u1g', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=True, storage_id=1, transform_id=2, run_id=2, created_by_id=1, created_at=2024-12-20 15:06:15 UTC)

Link to validated metadata records:

curated_file.features._add_set_from_anndata(
    var_field=bt.Gene.stable_id, organism="saccharomyces cerevisiae"
)
!    2 unique terms (1.60%) are not validated for stable_id: 'Gfp_transgene_gene', 'YAR062W'
curated_file.labels.add(efs.rna_seq, features.assay)
curated_file.labels.add(organism.saccharomyces_cerevisiae, features.organism)
curated_file.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = 'TdIKeU6wpoAWaSkM0000'
│   ├── .size = 28180
│   ├── .hash = '6bieh8XjOCCz6bJToN4u1g'
│   ├── .path = 
│   │   /home/runner/work/lamin-usecases/lamin-usecases/docs/test-bulkrna/.lamindb/TdIKeU6wpoAWaSkM0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2024-12-20 15:06:15
│   └── .transform = 'Bulk RNA-seq'
├── Dataset features/.feature_sets
│   └── var • 123                   [bionty.Gene]
│       TGA1                        float
│       SUP56                       float
│       TRN1                        float
│       TFC3                        float
│       VPS8                        float
│       EFB1                        float
│       SSA1                        float
│       ERP2                        float
│       FUN14                       float
│       SPO7                        float
│       MDM10                       float
│       SWC3                        float
│       CYS3                        float
│       DEP1                        float
│       SYN8                        float
│       NTG1                        float
├── Linked features
│   └── assay                       cat[bionty.ExperimentalF…  RNA-Seq                                  
│       organism                    cat[bionty.Organism]       saccharomyces cerevisiae
└── Labels
    └── .organisms                  bionty.Organism            saccharomyces cerevisiae                 
        .experimental_factors       bionty.ExperimentalFactor  RNA-Seq                                  

Query data

We have two files in the artifact registry:

ln.Artifact.df()
uid key description suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id version is_latest run_id created_at created_by_id
id
2 TdIKeU6wpoAWaSkM0000 None Curated bulk RNA counts .h5ad dataset 28180 6bieh8XjOCCz6bJToN4u1g None None md5 AnnData 1 True 1 2 None True 2 2024-12-20 15:06:15.292588+00:00 1
1 pwJcYYc2Vp3qVJE20000 output_dir/salmon.merged.gene_counts.tsv Merged Bulk RNA counts .tsv None 3787 xxw0k3au3KtxFcgtbEr4eQ None None md5 None 1 False 1 1 None True 1 2024-12-20 15:06:13.830029+00:00 1
curated_file.view_lineage()
[Image: lineage graph of the curated artifact]

Let’s query by gene:

genes = bt.Gene.lookup()
genes.spo7
Gene(uid='2pkcLeMEB6aS', symbol='SPO7', stable_id='YAL009W', ncbi_gene_ids='851224', biotype='protein_coding', synonyms='', description='Putative regulatory subunit of Nem1p-Spo7p phosphatase holoenzyme; regulates nuclear growth by controlling phospholipid biosynthesis, required for normal nuclear envelope morphology, premeiotic replication, and sporulation ', created_by_id=1, run_id=2, source_id=19, organism_id=1, created_at=2024-12-20 15:06:15 UTC)
# a gene set containing SPO7
feature_set = ln.FeatureSet.filter(genes=genes.spo7).first()
# artifacts that link to this feature set
ln.Artifact.filter(feature_sets=feature_set).df()
uid key description suffix type size hash n_objects n_observations _hash_type _accessor visibility _key_is_virtual storage_id transform_id version is_latest run_id created_at created_by_id
id
2 TdIKeU6wpoAWaSkM0000 None Curated bulk RNA counts .h5ad dataset 28180 6bieh8XjOCCz6bJToN4u1g None None md5 AnnData 1 True 1 2 None True 2 2024-12-20 15:06:15.292588+00:00 1
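
The two-step query above goes from a gene to the feature sets containing it, and from those feature sets to the artifacts linked to them. A rough in-memory sketch of that traversal follows; the dictionaries and the function `artifacts_for_gene` are hypothetical stand-ins and not LaminDB objects or API.

```python
# hypothetical stand-ins for the FeatureSet and Artifact registries
feature_sets = {
    "fs_curated": {"SPO7", "TFC3", "VPS8"},
}
artifact_feature_sets = {
    "Curated bulk RNA counts": ["fs_curated"],
    "Merged Bulk RNA counts": [],  # raw TSV, no feature set linked
}


def artifacts_for_gene(symbol):
    """Return descriptions of artifacts linked to a feature set containing the gene."""
    # step 1: feature sets that contain the gene
    matching_sets = {name for name, genes in feature_sets.items() if symbol in genes}
    # step 2: artifacts linked to at least one of those feature sets
    return [
        description
        for description, linked in artifact_feature_sets.items()
        if matching_sets.intersection(linked)
    ]


hits = artifacts_for_gene("SPO7")
misses = artifacts_for_gene("FLO1")
```

Only the curated artifact is returned for SPO7, which mirrors the query result above: the raw TSV was registered without linked features, so it cannot be found by gene.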
# clean up test instance
!rm -r test-bulkrna
!lamin delete --force test-bulkrna
 deleting instance testuser1/test-bulkrna