Append a new dataset¶
We have one dataset in storage and are about to receive a new one.
In this notebook, we’ll validate and register the new dataset, then append it to the existing collection as a new version.
import lamindb as ln
import bionty as bt
import readfcs
bt.settings.organism = "human"
ln.track("SmQmhrhigFPL0000")
→ connected lamindb: testuser1/test-facs
→ created Transform('SmQmhrhigFPL0000'), started new Run('Ww93zihY...') at 2025-01-20 07:39:06 UTC
→ notebook imports: bionty==1.0.0 lamindb==1.0.2 pytometry==0.1.6 readfcs==1.1.9 scanpy==1.10.4
Ingest a new artifact¶
Access ¶
Let us validate and register another .fcs file from Oetjen18:
filepath = readfcs.datasets.Oetjen18_t1()
adata = readfcs.read(filepath)
adata
AnnData object with n_obs × n_vars = 241552 × 20
var: 'n', 'channel', 'marker', '$PnR', '$PnB', '$PnE', '$PnV', '$PnG'
uns: 'meta'
Transform: normalize ¶
import pytometry as pm
pm.pp.split_signal(adata, var_key="channel")
pm.pp.compensate(adata)
pm.tl.normalize_biExp(adata)
adata = adata[  # subset to rows that do not have NaN values
    adata.to_df().isna().sum(axis=1) == 0
]
adata.to_df().describe()
| | CD95 | CD8 | CD27 | CXCR4 | CCR7 | LIVE/DEAD | CD4 | CD45RA | CD3 | CD49B | CD14/19 | CD69 | CD103 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 | 241552.000000 |
mean | 887.579860 | 1302.985717 | 1221.257257 | 877.533482 | 977.505533 | 1883.358298 | 556.687953 | 929.493316 | 941.166747 | 966.012244 | 1210.769935 | 741.523184 | 1003.064857 |
std | 573.549695 | 827.850302 | 672.851319 | 411.966073 | 584.217139 | 932.113729 | 480.875917 | 795.550133 | 658.984751 | 456.437094 | 694.622980 | 473.287558 | 642.728024 |
min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 462.757715 | 493.413744 | 605.463427 | 588.047798 | 495.437303 | 1063.670965 | 240.623098 | 404.087640 | 477.932659 | 592.294399 | 575.401173 | 380.247262 | 475.108131 |
50% | 774.350833 | 1207.624048 | 1110.367681 | 782.939692 | 782.981430 | 1951.855099 | 484.355203 | 557.904360 | 655.909639 | 800.280049 | 1124.574275 | 705.802991 | 775.101973 |
75% | 1327.792103 | 2036.849496 | 1721.730010 | 1070.479036 | 1453.929567 | 2623.975657 | 729.754419 | 1345.771633 | 1218.445208 | 1347.042403 | 1742.288464 | 1069.175380 | 1420.744291 |
max | 4053.903716 | 4065.495666 | 4095.351322 | 4025.827267 | 3999.075551 | 4096.000000 | 4088.719985 | 3961.255364 | 3940.061146 | 4089.445928 | 3982.769373 | 3810.774988 | 4023.968008 |
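The row filter used in the normalization step can be checked on a toy table. The following is a minimal sketch, assuming a hypothetical three-row DataFrame with made-up values standing in for `adata.to_df()`. It suggests that counting NaNs per row and keeping rows with a count of zero selects the same rows as requiring every value to be present.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for adata.to_df(); values are made up.
df = pd.DataFrame(
    {"CD8": [1.0, np.nan, 3.0], "CD3": [4.0, 5.0, 6.0]},
    index=["cell0", "cell1", "cell2"],
)

# Same mask as in the notebook: True for rows without any NaN.
mask = df.isna().sum(axis=1) == 0

# Equivalent, slightly more direct formulation.
mask_alt = df.notna().all(axis=1)

# Only rows where the mask is True are kept.
clean = df[mask]
```

In this toy case, `cell1` is dropped because its `CD8` value is missing.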
Validate cell markers ¶
Let’s see how many markers validate:
validated = bt.CellMarker.validate(adata.var.index)
! 9 unique terms (69.20%) are not validated for name: 'CD95', 'CXCR4', 'CCR7', 'LIVE/DEAD', 'CD4', 'CD49B', 'CD14/19', 'CD69', 'CD103'
Let’s standardize and re-validate:
adata.var.index = bt.CellMarker.standardize(adata.var.index)
validated = bt.CellMarker.validate(adata.var.index)
! 7 unique terms (53.80%) are not validated for name: 'CD95', 'CXCR4', 'LIVE/DEAD', 'CD49B', 'CD14/19', 'CD69', 'CD103'
/tmp/ipykernel_3660/92294437.py:1: ImplicitModificationWarning: Trying to modify index of attribute `.var` of view, initializing view as actual.
adata.var.index = bt.CellMarker.standardize(adata.var.index)
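To make the two validation messages easier to read, here is a minimal pure-Python sketch of the idea behind validating and standardizing names. The reference set, the synonym map, and the helper names `toy_validate`, `toy_standardize`, and `percent_not_validated` are all hypothetical; they are reconstructed from the outputs shown in this notebook and are not how bionty stores or matches terms.

```python
# Toy reference vocabulary and synonym map (assumptions for illustration only).
REFERENCE = {"CD8", "CD27", "CD45RA", "CD3", "Cd4", "Ccr7"}
SYNONYMS = {"CD4": "Cd4", "CCR7": "Ccr7"}

# The 13 marker names from the table above.
markers = [
    "CD95", "CD8", "CD27", "CXCR4", "CCR7", "LIVE/DEAD", "CD4",
    "CD45RA", "CD3", "CD49B", "CD14/19", "CD69", "CD103",
]


def toy_validate(names):
    """Return one boolean per name: True if it is in the reference set."""
    return [name in REFERENCE for name in names]


def toy_standardize(names):
    """Map known synonyms to their reference name; leave other names as is."""
    return [SYNONYMS.get(name, name) for name in names]


def percent_not_validated(names):
    """Share of names that fail validation, in percent, rounded to 1 decimal."""
    flags = toy_validate(names)
    return round(100 * flags.count(False) / len(flags), 1)


before = percent_not_validated(markers)                  # 9 of 13 names fail
after = percent_not_validated(toy_standardize(markers))  # 7 of 13 names fail
```

With these toy inputs, 9 of 13 names fail before standardization and 7 of 13 fail afterwards, which matches the counts reported in the two messages.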
Next, register non-validated markers from Bionty:
records = bt.CellMarker.from_values(adata.var.index[~validated])
ln.save(records)
! did not create CellMarker records for 2 non-validated names: 'CD14/19', 'LIVE/DEAD'
Manually create the remaining marker, CD14/19:
bt.CellMarker(name="CD14/19").save()
CellMarker(uid='3ZFziy5ims8J', name='CD14/19', created_by_id=1, run_id=2, space_id=1, organism_id=1, created_at=2025-01-20 07:39:09 UTC)
Move the non-validated column (LIVE/DEAD) to obs and keep only validated markers in var:
validated = bt.CellMarker.validate(adata.var.index)
adata.obs = adata[:, ~validated].to_df()
adata = adata[:, validated].copy()
! 1 unique term (7.70%) is not validated for name: 'LIVE/DEAD'
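The column split performed above can be illustrated without AnnData. This is a minimal sketch, assuming a hypothetical two-row DataFrame and a hand-written boolean array in place of the validation result; it only mirrors the masking logic, not the AnnData assignment itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy matrix: rows are cells, columns are channels.
df = pd.DataFrame(
    {"CD8": [1.0, 2.0], "LIVE/DEAD": [0.1, 0.9], "Cd4": [3.0, 4.0]}
)

# Stand-in for the boolean array returned by validation.
validated = np.array([True, False, True])

# Non-validated columns become per-cell metadata.
obs = df.loc[:, ~validated]

# Validated columns stay as measured variables; copy to detach from df.
kept = df.loc[:, validated].copy()
```

Both resulting tables keep one row per cell, so the metadata stays aligned with the measurements.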
Now all markers pass validation:
validated = bt.CellMarker.validate(adata.var.index)
assert all(validated)
Register ¶
curate = ln.Curator.from_anndata(adata, var_index=bt.CellMarker.name, categoricals={})
curate.validate()
✓ "var_index" is validated against CellMarker.name
True
artifact = curate.save_artifact(description="Oetjen18_t1")
! 1 unique term (100.00%) is not validated for name: 'LIVE/DEAD'
! skip linking features to artifact in slot 'obs'
Annotate with more labels:
efs = bt.ExperimentalFactor.lookup()
organism = bt.Organism.lookup()
artifact.labels.add(efs.fluorescence_activated_cell_sorting)
artifact.labels.add(organism.human)
artifact.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '5s56cWrWhpiAwh5h0000'
│   ├── .size = 46506448
│   ├── .hash = 'WbPHGIMM_5GT68rC8ZydHA'
│   ├── .n_observations = 241552
│   ├── .path = /home/runner/work/lamin-usecases/lamin-usecases/docs/test-facs/.lamindb/5s56cWrWhpiAwh5h0000.h5ad
│   ├── .created_by = testuser1 (Test User1)
│   ├── .created_at = 2025-01-20 07:39:09
│   └── .transform = 'Append a new dataset'
├── Dataset features/._schemas_m2m
│   └── var • 12 [bionty.CellMarker]
│       Cd4 float
│       CD8 float
│       CD3 float
│       CD27 float
│       Ccr7 float
│       CD45RA float
│       CD95 float
│       CXCR4 float
│       CD49B float
│       CD69 float
│       CD103 float
│       CD14/19 float
└── Labels
    └── .organisms bionty.Organism human
        .experimental_factors bionty.ExperimentalFactor fluorescence-activated cell sorting
Inspect a PCA for QC. This collection looks much like noise:
import scanpy as sc
markers = bt.CellMarker.lookup()
sc.pp.pca(adata)
sc.pl.pca(adata, color=markers.cd8.name)
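For readers who want to see what a PCA computes, the following is a minimal NumPy sketch on hypothetical random data. It uses a plain SVD of the centered matrix, which is an assumption made for illustration; scanpy uses its own solvers and options, so this is not its implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 cells x 5 markers of pure noise.
X = rng.normal(size=(200, 5))

# Center each marker, then decompose the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Coordinates of each cell on the principal components.
scores = U * S

# Variance captured by each component, and its share of the total.
explained = S**2 / (X.shape[0] - 1)
ratio = explained / explained.sum()
```

The array `ratio` gives each component's share of the total variance; comparing these shares is one simple way to judge whether any component stands out.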
Create a new version of the collection by appending an artifact¶
Query the old version:
collection_v1 = ln.Collection.get(name="My versioned cytometry collection")
collection_v2 = ln.Collection(
    [artifact, collection_v1.ordered_artifacts[0]],
    revises=collection_v1,
    version="2",
)
collection_v2.describe()
• adding collection ids [1] as inputs for run 2, adding parent transform 1
• adding artifact ids [1] as inputs for run 2, adding parent transform 1
Collection
└── General
    ├── .uid = '10GVN0SBTT6Cqg0n0001'
    ├── .key = 'My versioned cytometry collection'
    ├── .hash = 'aIyjTZDm9LEyi4udLlQ-FA'
    ├── .version = '2'
    ├── .created_by = testuser1 (Test User1)
    ├── .created_at = timestamp of unsaved record not available
    └── .transform = 'Append a new dataset'
collection_v2.save()
Collection(uid='10GVN0SBTT6Cqg0n0001', version='2', is_latest=True, key='My versioned cytometry collection', hash='aIyjTZDm9LEyi4udLlQ-FA', created_by_id=1, space_id=1, run_id=2, created_at=2025-01-20 07:39:10 UTC)