#### Tutorial [image: .md][image]

This tutorial complements the quickstart: there, you skim through core
features; here, you learn the core operations end-to-end with
conceptual context.

### Track changes

If you don't have a LaminDB instance, create one using the shell:

 !lamin init --storage ./lamindb-tutorial --modules bionty

-[ Via the R shell ]-

 library(laminr)
 lc <- import_module("lamin_cli")
 lc$init(storage = "./lamin-tutorial", modules = "bionty")

-[ What else can I configure during setup? ]-

1. You can pass a cloud storage location to "--storage" (S3, GCP, R2,
 HF, etc.)

 --storage s3://my-bucket

2. Instead of the default SQLite database pass a Postgres connection
 string to "--db":

 --db postgresql://<user>:<pwd>@<hostname>:<port>/<dbname>

3. Instead of a default instance name derived from the storage
 location or the Postgres database, provide a custom name:

 --name my-name

4. Mount additional modules:

 --modules bionty,pertdb,custom1

For more info: Install & setup

##### Track a notebook or script

Let's now track the notebook or script that is running.

-[ Py ]-

 import lamindb as ln

 ln.track()  # track the current notebook or script

-[ R ]-

 library(laminr)
 ln <- import_module("lamindb")

 ln$track()  # track the current notebook or script

Calling "track()" links this execution to the data you'll save in this
run. You can inspect your scripts & notebooks in the "Transform"
registry and their runs in the "Run" registry.

-[ Py ]-

 ln.Transform.to_dataframe()

-[ R ]-

 ln$Transform$to_dataframe()

-[ Py ]-

 ln.Run.to_dataframe()

-[ R ]-

 ln$Run$to_dataframe()

-[ What happened under the hood? ]-

1. The full run environment and imported package versions of current
 notebook were detected

2. Notebook metadata was detected and stored in a "Transform" record
 with a unique id

3. Run metadata was detected and stored in a "Run" record with a
 unique id

The "Transform" registry stores data transformations: scripts,
notebooks, pipelines, functions.

The "Run" registry stores executions of transforms. Many runs can be
linked to the same transform if executed with different context (time,
user, input data, etc.).

For more info: Track notebooks, scripts & workflows

-[ How do I track a workflow or a pipeline instead of a notebook or
script? ]-

Use "flow()":

 import lamindb as ln

 @ln.flow()
 def ingest_dataset(key: str) -> ln.Artifact:
 df = ln.examples.datasets.mini_immuno.get_dataset1()
 return ln.Artifact.from_dataframe(df, key=key).save()

 ingest_dataset(key="my_analysis/dataset.parquet")

For more info: Track notebooks, scripts & workflows

Leverage a pipeline integration, see: Manage computational pipelines.

You can also manually create a transform and manage its runs with your
custom logic:

 transform = ln.Transform(key="my_pipeline", version="1.2.0")

-[ Why should I care about emergent data lineage? ]-

When humans & agents are in the loop, most mistakes happen outside
deterministic computational pipelines. They happen in between
pipelines and when running notebooks & scripts.

"track()" makes notebooks & derived results reproducible & auditable,
enabling to learn from mistakes.

-[ Is this compliant with OpenLineage? ]-

Yes. What OpenLineage calls a "job", LaminDB calls a "transform". What
OpenLineage calls a "run", LaminDB calls a "run".

##### Manage changes with branches

When you need more control you can also use git-inspired change
management via contribution branches that you merge into main.

 lamin switch -c my_branch
 # ... make changes ...
 lamin switch main
 lamin merge my_branch

For more info: "Branch"

##### Manage agent plans

If you're working with an agent plan, here is a way to save it:

 lamin save /path/to/.cursor/plans/my-agent-plan.md

It you switched to a branch, the plan will be linked to that branch.
Here is how to link a plan to a run:

 ln.track(plan=".plans/my-agent-plan.md")  # link plan artifact to this run

### Manage datasets

"Artifact" is the core object for datasets and models in LaminDB,
whether they are files, folders, tables, or array-like data.

It gives you one interface for registration, access, annotation,
lineage, and queries across local storage, AWS S3 ("s3://"), Google
Cloud ("gs://"), Hugging Face ("hf://"), and any other file system
supported by "fsspec".

##### Create an artifact

Let's first look at an exemplary "DataFrame". We'll cover simple
files, folders, and the "AnnData" format later in this tutorial.

-[ Py ]-

 df = ln.examples.datasets.mini_immuno.get_dataset1(with_typo=True)
 df

-[ R ]-

 df <- ln$examples$datasets$mini_immuno$get_dataset1(with_typo = TRUE)
 df

This is how you create an artifact from a "DataFrame".

-[ Py ]-

 artifact = ln.Artifact.from_dataframe(df, key="my_datasets/rnaseq1.parquet").save()
 artifact.describe()

-[ R ]-

 artifact <- ln$Artifact$from_dataframe(df, key = "my_datasets/rnaseq1.parquet")$save()
 artifact$describe()

-[ Which fields are populated when creating an artifact? ]-

Basic fields:

* "uid": universal ID

* "key": a (virtual) relative path of the artifact in "storage"

* "description": an optional string description

* "storage": the storage location (the root, say, an S3 bucket or a
  local directory)

* "suffix": an optional file/path suffix

* "size": the artifact size in bytes

* "hash": a hash useful to check for integrity and collisions (is this
  artifact already stored?)

* "created_at": time of creation

* "updated_at": time of last update

Provenance-related fields:

* "created_by": the "User" who created the artifact

* "run": the "Run" of the "Transform" that created the artifact

For a full reference, see "Artifact".

-[ How does LaminDB compare to a AWS S3? ]-

LaminDB provides a database on top of AWS S3 (or GCP storage, file
systems, etc.).

Similar to organizing files with paths, you can organize artifacts
using the "key" parameter of "Artifact".

However, you'll see that you can more conveniently query data by
entities you care about: people, code, experiments, genes, proteins,
cell types, etc.

-[ What exactly happens during save? ]-

In the database: A SQL record is inserted into the "Artifact"
registry. If the SQL record exists already based on comparing its
"hash", it's returned.

In storage:

* If the default storage is in the cloud, ".save()" triggers an upload
  for a local artifact.

* If the artifact is in a registered storage location, only the
  metadata of the artifact is saved as a SQL record to the "Artifact"
  registry.

##### Access artifacts

Get the artifact by "key":

-[ Py ]-

 artifact = ln.Artifact.get(key="my_datasets/rnaseq1.parquet")

-[ R ]-

 artifact <- ln$Artifact$get(key = "my_datasets/rnaseq1.parquet")

Load it into memory:

-[ Py ]-

 artifact.load()

-[ R ]-

 artifact$load()

Typically your artifact is in a cloud storage location. To get a local
file path to it, call "cache()":

-[ Py ]-

 artifact.cache()

-[ R ]-

 artifact$cache()

If the data is large, you might not want to cache but stream it via
"open()". For more: Stream datasets from storage

##### Update & delete artifacts

Here is how to update & delete an artifact:

-[ Py ]-

 artifact.description = "My updated description"
 artifact.save()  # persist metadata changes

 artifact.delete()  # move to trash
 artifact.restore()  # restore from trash

 # artifact.delete(permanent=True)  # permanently delete

-[ R ]-

 artifact$description = "My updated description"
 artifact$save()  # persist metadata changes

 artifact$delete()  # move to trash
 artifact$restore()  # restore from trash

 # artifact$delete(permanent = TRUE)  # permanently delete

-[ What happens when I delete an artifact? ]-

By default, deleting moves an artifact into the trash so it no longer
appears in standard queries.

To restore it, call "artifact.restore()". For archive/trash semantics
and query patterns, see How do I trash or archive objects?.

##### Trace data lineage

You can understand where an artifact comes from by looking at its
"Transform" & "Run" attributes:

-[ Py ]-

 artifact.transform

-[ R ]-

 artifact$transform

-[ Py ]-

 artifact.run

-[ R ]-

 artifact$run

Or visualize deeper data lineage with the "view_lineage()" method.
Here we're only one step deep:

-[ Py ]-

 artifact.view_lineage()

-[ R ]-

 artifact$view_lineage()

-[ Show me a more interesting example, please! ]-

-[ Py ]-

Explore and load the notebook here.

-[ Hub ]-

Explore data lineage interactively here.

Once you're done, at the end of your notebook or script, call
"finish()". Here, we're not yet done so we're commenting it out.

-[ Py ]-

 # ln.finish()  # mark run as finished, save execution report, source code & environment

-[ R ]-

 # ln$finish()  # mark run as finished, save execution report, source code & environment

-[ Saving reports when using R ]-

If you did *not* use RStudio's notebook mode, you have to render an
HTML externally.

1. Render the notebook to HTML via one of:

 * In RStudio, click the "Knit" button

 * From the command line, run

 Rscript -e 'rmarkdown::render("tutorial.Rmd")'

 * Use the "rmarkdown" package in R

 rmarkdown::render("tutorial.Rmd")

2. Save it via one of:

 * the "save()" command in the "lamin_cli" module from R

 lc <- import_module("lamin_cli")
 lc$save("tutorial.Rmd")

 * the "lamin" CLI

 lamin save tutorial.Rmd

To create a new version of a notebook or script, run "lamin load" on
the terminal, e.g.,

 $ lamin load https://lamin.ai/laminlabs/lamindata/transform/13VINnFk89PE0004
 → notebook is here: mcfarland_2020_preparation.ipynb

Note that data lineage also helps to understand what a dataset is
being used for, not only where it comes from. Many datasets are being
used over and over for different purposes and it's often useful to
understand how.

##### Annotate an artifact

You can annotate artifacts with features and labels. Features are
measurement dimensions (e.g. ""organism"", ""temperature"") and labels
are measured categories (e.g. ""human"", ""mouse"").

Let's annotate an artifact with a "ULabel", a built-in universal label
registry.

-[ Py ]-

 # create & save a universal label
 my_experiment = ln.ULabel(name="My experiment").save()

 # annotate the artifact with a universal label
 artifact.ulabels.add(my_experiment)

 # describe the artifact
 artifact.describe()

-[ R ]-

 # create & save a universal label
 my_experiment <- ln$ULabel(name = "My experiment")$save()

 # annotate the artifact with a universal label
 artifact$ulabels$add(my_experiment)

 # describe the artifact
 artifact$describe()

This is how you query artifacts based on the annotation.

-[ Py ]-

 ln.Artifact.filter(ulabels=my_experiment).to_dataframe()

-[ R ]-

 ln$Artifact$filter(ulabels = my_experiment)$to_dataframe()

You can also annotate with labels from other registries, e.g., the
biological ontologies in "bionty".

-[ Py ]-

 import bionty as bt

 # create a cell type label from the source ontology
 cell_type = bt.CellType.from_source(name="effector T cell").save()

 # annotate the artifact with a cell type
 artifact.cell_types.add(cell_type)

 # describe the artifact
 artifact.describe()

-[ R ]-

 bt <- import_module("bionty")

 # create a cell type label from the source ontology
 cell_type <- bt$CellType$from_source(name = "effector T cell")$save()

 # annotate the artifact with a cell type
 artifact$cell_types$add(cell_type)

 # describe the artifact
 artifact$describe()

This is how you query artifacts by cell type annotations.

-[ Py ]-

 ln.Artifact.filter(cell_types=cell_type).to_dataframe()

-[ R ]-

 ln$Artifact$filter(cell_types = cell_type)$to_dataframe()

Here is how to update or remove annotations:

-[ Py ]-

 # add labels
 artifact.ulabels.add(my_experiment)
 artifact.cell_types.add(cell_type)

 # remove labels
 artifact.ulabels.remove(my_experiment)
 artifact.cell_types.remove(cell_type)

-[ R ]-

 # add labels
 artifact$ulabels$add(my_experiment)
 artifact$cell_types$add(cell_type)

 # remove labels
 artifact$ulabels$remove(my_experiment)
 artifact$cell_types$remove(cell_type)

If you want to annotate by non-categorical metadata or indicate the
feature for a label, annotate via features.

-[ Py ]-

 # define the "temperature" & "experiment" features
 ln.Feature(name="temperature", dtype=float).save()
 ln.Feature(name="experiment", dtype=ln.ULabel).save()

 # annotate the artifact
 artifact.features.set_values({"temperature": 21.6, "experiment": "My experiment"})

 # describe the artifact
 artifact.describe()

-[ R ]-

 # define the "temperature" & "experiment" features
 ln$Feature(name = "temperature", dtype = "float")$save()
 ln$Feature(name = "experiment", dtype = ln$ULabel)$save()

 # annotate the artifact
 artifact$features$set_values(list(temperature = 21.6, experiment = "My experiment"))

 # describe the artifact
 artifact$describe()

This is how you query artifacts by features.

-[ Py ]-

 ln.Artifact.filter(temperature=21.6).to_dataframe()

-[ R ]-

 ln$Artifact$filter(temperature = 21.6)$to_dataframe()

Optionally, after querying, this is how you clean up feature values
from the artifact and remove the feature definitions.

-[ Py ]-

 # remove feature values from this artifact
 artifact.features.remove_values(["temperature", "experiment"])

 # optionally remove feature definitions from the registry
 ln.Feature.get(name="temperature").delete()
 ln.Feature.get(name="experiment").delete()

-[ R ]-

 # remove feature values from this artifact
 artifact$features$remove_values(list("temperature", "experiment"))

 # optionally remove feature definitions from the registry
 ln$Feature$get(name = "temperature")$delete()
 ln$Feature$get(name = "experiment")$delete()

##### Validate an artifact

Validated datasets are more re-usable by analysts and machine learning
models. You can define what a valid artifact should look like by
defining a schema. If you do that, LaminDB will also automatically
annotate with validated metadata which helps with finding the
artifact.

-[ Give me examples for findability and usability. ]-

1. Findability: Which datasets measured expression of cell marker
 "CD14"? Which characterized cell line "K562"? Which have a test &
 train split? Etc.

2. Usability: Are there typos in feature names? Are there typos in
 labels? Are types and units of features consistent? Etc.

-[ Py ]-

 import bionty as bt  # <-- use bionty to access registries with imported public ontologies

 # define a few more valid labels
 ln.ULabel(name="DMSO").save()
 ln.ULabel(name="IFNG").save()

 # define a few more valid features
 ln.Feature(name="perturbation", dtype=ln.ULabel).save()
 ln.Feature(name="cell_type_by_model", dtype=bt.CellType).save()
 ln.Feature(name="cell_type_by_expert", dtype=bt.CellType).save()
 ln.Feature(name="assay_oid", dtype=bt.ExperimentalFactor.ontology_id).save()
 ln.Feature(name="donor", dtype=str, nullable=True).save()
 ln.Feature(name="sample_note", dtype=str, nullable=True).save()
 ln.Feature(name="concentration", dtype=str).save()
 ln.Feature(name="treatment_time_h", dtype="num", coerce=True).save()

 # define a schema that merely enforces a feature identifier type
 schema = ln.Schema(itype=ln.Feature).save()

-[ R ]-

 bt <- import_module("bionty")

 # define a few more valid labels
 ln$ULabel(name = "DMSO")$save()
 ln$ULabel(name = "IFNG")$save()

 # define a few more valid features
 ln$Feature(name = "perturbation", dtype = ln$ULabel)$save()
 ln$Feature(name = "cell_type_by_model", dtype = bt$CellType)$save()
 ln$Feature(name = "cell_type_by_expert", dtype = bt$CellType)$save()
 ln$Feature(name = "assay_oid", dtype = bt$ExperimentalFactor$ontology_id)$save()
 ln$Feature(name = "donor", dtype = "str", nullable = TRUE)$save()
 ln$Feature(name = "sample_note", dtype = "str", nullable = TRUE)$save()
 ln$Feature(name = "concentration", dtype = "str")$save()
 ln$Feature(name = "treatment_time_h", dtype = "num", coerce = TRUE)$save()

 # define a schema that merely enforces a feature identifier type
 schema <- ln$Schema(itype = ln$Feature)$save()

If you pass a "schema" object to the "Artifact" constructor, the
artifact will be validated & annotated. Let's try this.

-[ Py ]-

 try:
 artifact = ln.Artifact.from_dataframe(
 df, key="my_datasets/rnaseq1.parquet", schema=schema
 )
 except ln.errors.ValidationError as error:
 print(str(error))

-[ R ]-

 tryCatch({
 artifact <- ln$Artifact$from_dataframe(
 df, key = "my_datasets/rnaseq1.parquet", schema = schema
 )
 }, error = function(error) {
 print(str(error))
 })

Because there is a typo in the "perturbation" column, validation
fails. Let's fix it by making a new version.

##### Make a new version of an artifact

-[ Py ]-

 # fix the "IFNJ" typo
 df["perturbation"] = df["perturbation"].cat.rename_categories({"IFNJ": "IFNG"})

 # create a new version
 artifact = ln.Artifact.from_dataframe(
 df, key="my_datasets/rnaseq1.parquet", schema=schema
 ).save()

 # see the annotations
 artifact.describe()

 # see all versions of the artifact
 artifact.versions.to_dataframe()

-[ R ]-

 # fix the "IFNJ" typo
 df[["perturbation"]] <- df[["perturbation"]]$cat$rename_categories(list(IFNJ = "IFNG"))

 # create a new version
 artifact <- ln$Artifact$from_dataframe(
 df, key = "my_datasets/rnaseq1.parquet", schema = schema
 )$save()

 # see the annotations
 artifact$describe()

 # see all versions of the artifact
 artifact$versions$to_dataframe()

The content of the dataset is now validated and the dataset is richly
annotated and queryable by all entities that you defined.

-[ Can I also create new versions without passing "key"? ]-

That works, too, you can use "revises":

 artifact_v1 = ln.Artifact.from_dataframe(df, description="Just a description").save()
 # below revises artifact_v1
 artifact_v2 = ln.Artifact.from_dataframe(df_updated, revises=artifact_v1).save()

The good thing about passing "revises" is that you don't need to worry
about coming up with naming conventions for paths. The good thing
about versioning based on "key" is that it's how all data versioning
tools are doing it and you can build a file hierarchy.

### Query & search registries

We've already seen a few queries. Let's now walk through the topic
systematically.

To get an overview over all artifacts in your instance, call "df".

-[ Py ]-

 ln.Artifact.to_dataframe()

-[ R ]-

 ln$Artifact$to_dataframe()

The "Artifact" registry additionally supports seeing all feature
annotations of an artifact.

-[ Py ]-

 ln.Artifact.to_dataframe(include="features")

-[ R ]-

 ln$Artifact$to_dataframe(include = "features")

LaminDB's central classes are registries that store "SQLRecord"
objects. If you want to see the fields of a registry, call
".describe()" on the class or auto-complete:

-[ Py ]-

 ln.Artifact.describe()

-[ R ]-

 ln$Artifact$describe()

Each registry is a table in the relational schema of the underlying
database. With "view()", you can see the latest additions to the
database:

-[ Py ]-

 ln.view()

-[ R ]-

 ln$view()

-[ Which registries have I already learned about? ]-

* "Artifact": datasets & models stored as files, folders, or arrays

* "Transform": transforms of artifacts

* "Run": runs of transforms

* "User": users

* "Storage": local or cloud storage locations

Every registry supports arbitrary relational queries using the class
methods "get" and "filter". The syntax for it is Django's query
syntax.

Here are some simple query examples.

-[ Py ]-

 # get a single record (here the current notebook)
 transform = ln.Transform.get(key="tutorial.ipynb")

 # get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
 ln.Artifact.filter(key__startswith="my_datasets/").to_dataframe()

 # query all artifacts ingested from a transform
 artifacts = ln.Artifact.filter(transform=transform).all()

 # query all artifacts ingested from a notebook with "tutor" in the description
 artifacts = ln.Artifact.filter(
 transform__description__icontains="tutor",
 ).all()

-[ R ]-

 # get a single record (here the current notebook)
 transform <- ln$Transform$get(key = "tutorial.ipynb")

 # get a set of records by filtering for a directory (LaminDB treats directories like AWS S3, as the prefix of the storage key)
 ln$Artifact$filter(key__startswith = "my_datasets/")$to_dataframe()

 # query all artifacts ingested from a transform
 artifacts <- ln$Artifact$filter(transform = transform)$all()

 # query all artifacts ingested from a notebook with "tutor" in the description
 artifacts <- ln$Artifact$filter(
 transform__description__icontains = "tutor",
 )$all()

-[ What does a double underscore mean? ]-

For any field, the double underscore defines a comparator, e.g.,

* "name__icontains="Martha"": "name" contains ""Martha"" when ignoring
  case

* "name__startswith="Martha"": "name" starts with ""Martha"

* "name__in=["Martha", "John"]": "name" is ""John"" or ""Martha""

For more info: Query & search registries

-[ Can I chain filters and searches? ]-

Yes: "ln.Artifact.filter(suffix=".jpg").search("my image")"

The class methods "search" and "lookup" help with approximate matches.

-[ Py ]-

 # search artifacts
 ln.Artifact.search("iris").to_dataframe().head()

 # search transforms
 ln.Transform.search("tutor").to_dataframe()

 # look up records with auto-complete
 records = ln.Record.lookup()

-[ R ]-

 # search artifacts
 ln$Artifact$search("iris")$to_dataframe()$head()

 # search transforms
 ln$Transform$search("tutor")$to_dataframe()

 # look up records with auto-complete
 records <- ln$Record$lookup()

-[ Show me a screenshot ]-

For more info: Query & search registries

### Manage folders & storage locations

Let's look at a folder in the cloud that contains 3 sub-folders
storing images & metadata of Iris flowers, generated in 3 subsequent
studies.

-[ Py ]-

 # we use anon=True here in case no aws credentials are configured
 ln.UPath("s3://lamindata/iris_studies", anon=True).view_tree()

-[ R ]-

 # we use anon = TRUE here in case no aws credentials are configured
 ln$UPath("s3://lamindata/iris_studies", anon = TRUE)$view_tree()

Let's create an artifact for the first sub-folder.

-[ Py ]-

 artifact = ln.Artifact("s3://lamindata/iris_studies/study0_raw_images").save()
 artifact

-[ R ]-

 artifact <- ln$Artifact("s3://lamindata/iris_studies/study0_raw_images")$save()
 artifact

As you see from "path", the folder was merely registered in its
present storage location without copying it.

-[ Py ]-

 artifact.path

-[ R ]-

 artifact$path

LaminDB keeps track of all your storage locations.

-[ Py ]-

 ln.Storage.to_dataframe()

-[ R ]-

 ln$Storage$to_dataframe()

-[ How do I create an artifact for a local file or folder? ]-

Source path is local:

 ln.Artifact("./my_data.fcs", key="my_dataset.fcs")
 ln.Artifact("./my_images/", key="my_images")

Upon "artifact.save()", the source path will be copied or uploaded
into your instance's current storage, visible & changeable via
"ln.settings.storage".

If the source path is remote *or* already in a registered storage
location (one that's registered in "ln.Storage"), "artifact.save()"
will *not* trigger a copy or upload but register the existing path.

 ln.Artifact("s3://my-bucket/my_dataset.fcs")  # key is auto-populated from S3, you can optionally pass a description
 ln.Artifact("s3://my-bucket/my_images/")  # key is auto-populated from S3, you can optionally pass a description

-[ Are artifacts aware of array-like data? ]-

Yes.

You can make artifacts from paths referencing array-like objects:

 ln.Artifact("./my_anndata.h5ad", key="my_anndata.h5ad")
 ln.Artifact("./my_zarr_array/", key="my_zarr_array")

Or from in-memory objects:

 ln.Artifact.from_dataframe(df, key="my_dataframe.parquet")
 ln.Artifact.from_anndata(adata, key="my_anndata.h5ad")

You can open large artifacts for slicing from the cloud or load small
artifacts directly into memory via:

 artifact.open()

### Manage biological ontologies

Every "bionty" registry is based on configurable public ontologies
(>20 of them) that are automatically leveraged during validation &
annotation. Sometimes you want to access the public ontology directly.

-[ Py ]-

 import bionty as bt

 cell_type_ontology = bt.CellType.public()
 cell_type_ontology

-[ R ]-

 bt <- import_module("bionty")

 cell_type_ontology <- bt$CellType$public()
 cell_type_ontology

The returned object can be searched like you can search a registry.

-[ Py ]-

 cell_type_ontology.search("gamma-delta T cell").head(2)

-[ R ]-

 cell_type_ontology$search("gamma-delta T cell")$head(2)

Because you can't update an external public ontology, you update the
content of the corresponding registry. Here, you create a new cell
type.

-[ Py ]-

 # create an ontology-coupled cell type record and save it
 neuron = bt.CellType.from_source(name="neuron").save()

 # create a record to track a new cell state
 new_cell_state = bt.CellType(
 name="my neuron cell state", description="explains X"
 ).save()

 # express that it's a neuron state
 new_cell_state.parents.add(neuron)

 # view ontological hierarchy
 new_cell_state.view_parents(distance=2)

-[ R ]-

 # create an ontology-coupled cell type record and save it
 neuron <- bt$CellType$from_source(name = "neuron")$save()

 # create a record to track a new cell state
 new_cell_state <- bt$CellType(
 name = "my neuron cell state", description = "explains X"
 )$save()

 # express that it's a neuron state
 new_cell_state$parents$add(neuron)

 # view ontological hierarchy
 new_cell_state$view_parents(distance = 2)

For more info: Manage biological ontologies

### Manage AnnData artifacts

LaminDB supports a growing number of data structures: "DataFrame",
"AnnData", "MuData", "SpatialData", and "Tiledbsoma" with their
corresponding storage formats.

Let's go through the example of the quickstart, but store the dataset
in an AnnData this time.

-[ Py ]-

 # define var schema
 var_schema = ln.Schema(itype=bt.Gene.ensembl_gene_id, dtype=int).save()

 # define composite schema
 anndata_schema = ln.Schema(
 otype="AnnData", slots={"obs": schema, "var.T": var_schema}
 ).save()

-[ R ]-

 # define var schema
 var_schema <- ln$Schema(itype = bt$Gene$ensembl_gene_id, dtype = "int")$save()

 # define composite schema
 anndata_schema <- ln$Schema(
 otype = "AnnData", slots = list(obs = schema, var$T = var_schema)
 )$save()

Validate & annotate an "AnnData".

-[ Py ]-

 import anndata as ad

 # store the dataset as an AnnData object to distinguish data from metadata
 adata = ad.AnnData(df.iloc[:, :3], obs=df.iloc[:, 3:-1])

 # save curated artifact
 artifact = ln.Artifact.from_anndata(
 adata, key="my_datasets/my_rnaseq1.h5ad", schema=anndata_schema
 ).save()
 artifact.describe()

-[ R ]-

 ad <- import_module("anndata")

 # store the dataset as an AnnData object to distinguish data from metadata
 adata <- anndata::AnnData(df$iloc[:, :3], obs = df$iloc[:, 3:-1])

 # save curated artifact
 artifact <- ln$Artifact$from_anndata(
 adata, key = "my_datasets/my_rnaseq1.h5ad", schema = anndata_schema
 )$save()
 artifact$describe()

Because "AnnData" separates the high-dimensional count matrix that's
typically indexed with Ensembl gene ids from the metadata, we're now
working with two types of feature sets ("bt.Gene" for the counts and
"ln.Feature" for the metadata). These correspond to the "obs" and the
"var" schema in the "anndata_schema".

If you want to find a dataset by whether it measured "CD8A", you can
do so as as follows.

-[ Py ]-

 # query for all feature sets that contain CD8A
 feature_sets = ln.Schema.filter(genes__symbol="CD8A").all()

 # query for all artifacts linked to these feature sets
 ln.Artifact.filter(feature_sets__in=feature_sets).to_dataframe()

-[ R ]-

 # query for all feature sets that contain CD8A
 feature_sets <- ln$Schema$filter(genes__symbol = "CD8A")$all()

 # query for all artifacts linked to these feature sets
 ln$Artifact$filter(feature_sets__in = feature_sets)$to_dataframe()

### Manage collections of datasets

How do you integrate new datasets with your existing datasets?
Leverage "Collection".

-[ Py ]-

 # a new dataset
 df2 = ln.examples.datasets.mini_immuno.get_dataset2(otype="DataFrame")
 adata = ad.AnnData(df2.iloc[:, :3], obs=df2.iloc[:, 3:-1])
 artifact2 = ln.Artifact.from_anndata(
 adata, key="my_datasets/my_rnaseq2.h5ad", schema=anndata_schema
 ).save()

-[ R ]-

 # a new dataset
 df2 <- ln$examples$datasets$mini_immuno$get_dataset2(otype = "DataFrame")
 adata <- anndata::AnnData(df2$iloc[:, :3], obs = df2$iloc[:, 3:-1])
 artifact2 <- ln$Artifact$from_anndata(
 adata, key = "my_datasets/my_rnaseq2.h5ad", schema = anndata_schema
 )$save()

Create a collection using "Collection".

-[ Py ]-

 collection = ln.Collection([artifact, artifact2], key="my-RNA-seq-collection").save()
 collection.describe()
 collection.view_lineage()

-[ R ]-

 collection <- ln$Collection(list(artifact, artifact2), key = "my-RNA-seq-collection")$save()
 collection$describe()
 collection$view_lineage()

-[ Py ]-

 # if it's small enough, you can load the entire collection into memory as if it was one
 collection.load()

 # typically, it's too big, hence, open it for streaming (if the backend allows it)
 # collection.open()

 # or iterate over its artifacts
 collection.ordered_artifacts.all()

 # or look at a DataFrame listing the artifacts
 collection.ordered_artifacts.to_dataframe()

-[ R ]-

 # if it's small enough, you can load the entire collection into memory as if it was one
 collection$load()

 # typically, it's too big, hence, open it for streaming (if the backend allows it)
 # collection$open()

 # or iterate over its artifacts
 collection$ordered_artifacts$all()

 # or look at a DataFrame listing the artifacts
 collection$ordered_artifacts$to_dataframe()

Directly train models on collections of "AnnData".

 # to train models, batch iterate through the collection as if it was one array
 from torch.utils.data import DataLoader, WeightedRandomSampler
 dataset = collection.mapped(obs_keys=["cell_medium"])
 sampler = WeightedRandomSampler(
 weights=dataset.get_label_weights("cell_medium"), num_samples=len(dataset)
 )
 data_loader = DataLoader(dataset, batch_size=2, sampler=sampler)
 for batch in data_loader:
 pass

For more: Stream datasets from storage

Or this blog post for training models on distributed datasets.