lamindb.Artifact

class lamindb.Artifact(data: UPathStr, type: Literal['dataset', 'model', 'code'] = 'dataset', key: str | None = None, description: str | None = None, version: str | None = None, is_new_version_of: Artifact | None = None, run: Run | None = None)

Bases: Record, HasFeatures, HasParams, IsVersioned, TracksRun, TracksUpdates

Artifacts: datasets & models stored as files, folders, or arrays.

Artifacts manage data in local or remote storage.

An artifact stores a dataset or model as either a file or a folder.

Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • data (UPathStr) – A path to a local or remote folder or file.

  • type (Literal['dataset', 'model', 'code'], default: 'dataset') – The artifact type.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – A previous version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Typical storage formats & their API accessors

Arrays:

  • Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table

  • Curated matrix: .h5ad, .h5mu, .zarr ⟷ AnnData, MuData

  • Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders

Non-arrays:

  • Image: .jpg, .png ⟷ np.ndarray, …

  • Fastq: .fastq ⟷ /

  • VCF: .vcf ⟷ /

  • QC: .html ⟷ /

You’ll find these values in the suffix & accessor fields.

LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
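
For example (a minimal sketch, assuming a DataFrame df is in scope; the exact accessor string is an assumption based on the field description below), the chosen format surfaces in the suffix & accessor fields:

>>> artifact = ln.Artifact.from_df(df, description="My dataframe")
>>> artifact.suffix
'.parquet'
>>> artifact.accessor
'DataFrame'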

See also

Storage

Storage locations for artifacts.

Collection

Collections of artifacts.

from_df()

Create an artifact from a DataFrame.

from_anndata()

Create an artifact from an AnnData.

from_dir()

Bulk create file-like artifacts from a directory.

Examples

Create an artifact from a file in the cloud:

>>> artifact = ln.Artifact("s3://my-bucket/my-folder/my-file.csv", description="My file")
>>> artifact.save()  # only metadata is saved

Create an artifact from a local filepath:

>>> artifact = ln.Artifact("./my_file.jpg", description="My image")
>>> artifact.save()

Why does the API look this way?

It’s inspired by APIs building on AWS S3.

Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.

In boto3:

# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')

In quilt3:

# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
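
The LaminDB analogue, for comparison (a hedged sketch: the instance's default storage plays the role of the bucket, and key sets the relative path within it):

# signature: Artifact(data, key=...)
import lamindb as ln
artifact = ln.Artifact('/tmp/hello.txt', key='hello.txt')
artifact.save()  # stores the file under key 'hello.txt' in default storage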

Make a new version of an artifact:

>>> # a non-versioned artifact
>>> artifact = ln.Artifact(df1, description="My dataframe")
>>> artifact.save()
>>> # version an artifact
>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> assert new_artifact.stem_uid == artifact.stem_uid
>>> assert artifact.version == "1"
>>> assert new_artifact.version == "2"

Attributes

features: FeatureManager

Feature manager.

Features denote dataset dimensions, i.e., the variables that measure labels & numbers.

Curate with features & values:

artifact.features.add_values({
    "species": organism,  # here, organism is an Organism record
    "scientist": ['Barbara McClintock', 'Edgar Anderson'],
    "temperature": 27.6,
    "study": "Study 0: initial plant gathering"
})

Query for features & values:

ln.Artifact.features.filter(scientist="Barbara McClintock")

Features may or may not be part of the artifact content in storage. For instance, the Curate flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.

params: ParamManager

Param manager.

What .features is to dataset-like artifacts, .params is to model-like artifacts.

Curate with params & values:

artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32
})
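
By analogy with the features query shown above (an assumption; this mirror of the query API is not spelled out here), param values should be queryable the same way:

ln.Artifact.params.filter(batch_size=32)
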
property path: Path | UPath

Path.

File in cloud storage, here AWS S3:

>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')

File in local storage:

>>> ln.Artifact("./myfile.csv", description="myfile").save()
>>> artifact = ln.Artifact.filter(description="myfile").one()
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')

Fields

version: str

Version (default None).

Defines version of a family of records characterized by the same stem_uid.

Consider using semantic versioning with Python versioning.
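
A minimal sketch of passing an explicit version string at creation (per the version parameter documented above):

>>> artifact = ln.Artifact("./myfile.csv", description="myfile", version="1.0.0")
>>> artifact.save()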

id: int

Internal id, valid only in one DB instance.

uid: str

A universal random id (20-char base62 ~ UUID), valid across DB instances.

description: str

A description.

storage: Storage

Storage location (Storage), e.g., an S3 or GCP bucket or a local directory.

key: str

Storage key, the relative path within the storage location.

suffix: str

Path suffix or empty string if no canonical suffix exists.

This is either a file suffix (".csv", ".h5ad", etc.) or the empty string “”.

type: str

Artifact type (default None).

accessor: str

Default backed or memory accessor, e.g., DataFrame, AnnData.

size: int

Size in bytes.

Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12, etc.

hash: str

Hash or pseudo-hash of artifact content.

Useful to ascertain integrity and avoid duplication.
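
Because identity is content-hash-based, re-creating an artifact from identical content should return the existing record rather than a duplicate (a hedged sketch inferred from this field's purpose, given an artifact previously saved from ./my_file.jpg):

>>> artifact2 = ln.Artifact("./my_file.jpg", description="My image")
>>> assert artifact2.uid == artifact.uid  # same hash, same record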

hash_type: str

Type of hash.

n_objects: int

Number of objects.

Typically, this denotes the number of files in an artifact.

n_observations: int

Number of observations.

Typically, this denotes the first array dimension.

transform: Transform

Transform whose run created the artifact.

run: Run

Run that created the artifact.

visibility: int

Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).

key_is_virtual: bool

Indicates whether key is virtual or part of an actual file path.

ulabels: ULabel

The ulabels measured in the artifact (ULabel).

input_of: Run

Runs that use this artifact as an input.

previous_runs: Run

Sequence of runs that created or updated the record.

feature_sets: FeatureSet

The feature sets measured in the artifact (FeatureSet).

feature_values: FeatureValue

Non-categorical feature values for annotation.

param_values: ParamValue

Parameter values.

created_at: datetime

Time of creation of record.

created_by: User

Creator of record.

updated_at: datetime

Time of last update to record.

Methods

backed(is_run_input=None)

Return a cloud-backed data object (see open()).

Return type:

AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment

cache(is_run_input=None)

Download cloud artifact to local cache.

Follows syncing logic: only caches an artifact if it's outdated in the local cache.

Returns a path to a locally cached on-disk object (say, a .jpg file).

Return type:

Path

Examples

Sync file from cloud and return the local path of the cache:

>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')

delete(permanent=None, storage=None, using_key=None)

Delete.

A first call to .delete() puts an artifact into the trash (sets visibility to -1).

A second call permanently deletes the artifact.

FAQ: Storage FAQ

Parameters:
  • permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).

  • storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.

Return type:

None

Examples

For an Artifact object artifact, call:

>>> artifact.delete()
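
To skip the trash and also remove the file in storage, combine the flags documented above (a hedged sketch):

>>> artifact.delete(permanent=True, storage=True)
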
classmethod from_anndata(adata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from AnnData, validate & link features.

Parameters:
  • adata (AnnData | UPathStr) – An AnnData object or a path to an AnnData-like object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()

classmethod from_df(df, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from DataFrame, validate & link features.

For more info, see tutorial: Tutorial: Artifacts.

Parameters:
  • df (DataFrame) – A DataFrame object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
  sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()

classmethod from_dir(path, key=None, *, run=None)

Create a list of artifact objects from a directory.

Hint

If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.
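
A hedged sketch of that pattern (folder paths are supported per the data parameter of the constructor):

>>> artifact = ln.Artifact("s3://my-bucket/my-imaging-folder", description="raw imaging data")
>>> artifact.save()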

Parameters:
  • path (lamindb.core.types.UPathStr) – Source path of folder.

  • key (str | None, default: None) – Key for storage destination. If None and the directory is in a registered storage location, an inferred key will reflect the relative position. If None and the directory is outside of a registered storage location, the inferred key defaults to path.name.

  • run (Run | None, default: None) – A Run object.

Return type:

list[Artifact]

Examples

>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)

classmethod from_mudata(mdata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)

Create from MuData, validate & link features.

Parameters:
  • mdata (MuData) – A MuData object.

  • key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".

  • description (str | None, default: None) – A description.

  • version (str | None, default: None) – A version string.

  • is_new_version_of (Artifact | None, default: None) – An old version of the artifact.

  • run (Run | None, default: None) – The run that creates the artifact.

Return type:

Artifact

See also

Collection()

Track collections.

Feature

Track features.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()

load(is_run_input=None, stream=False, **kwargs)

Stage and load to memory.

Returns an in-memory representation if possible, e.g., an AnnData object for an .h5ad file.

Return type:

Any

Examples

Load as a DataFrame:

>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> ln.Artifact.from_df(df, description="iris").save()
>>> artifact = ln.Artifact.filter(description="iris").one()
>>> artifact.load().head()
sepal_length sepal_width petal_length petal_width iris_organism_code
0        0.051       0.035        0.014       0.002                 0
1        0.049       0.030        0.014       0.002                 0
2        0.047       0.032        0.013       0.002                 0
3        0.046       0.031        0.015       0.002                 0
4        0.050       0.036        0.014       0.002                 0

Load as an AnnData:

>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765

Fall back to cache() if no in-memory representation is configured:

>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')

open(is_run_input=None)

Return a cloud-backed data object.

Return type:

AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment

Notes

For more info, see tutorial: Query arrays.

Examples

Read AnnData in backed mode from cloud:

>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.open()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'

replace(data, run=None, format=None)

Replace artifact content.

Parameters:
  • data (lamindb.core.types.UPathStr) – A file path.

  • run (Run | None, default: None) – The run that replaces the artifact content; auto-linked if ln.track() was called.

Return type:

None

Examples

Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.

This is how we replace the old file in storage with the new file:

>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()

Note that this neither changes the storage key nor the filename.

However, the suffix is updated if the new file's suffix differs.

restore()

Restore from trash.

Return type:

None

Examples

For any Artifact object artifact, call:

>>> artifact.restore()

save(upload=None, **kwargs)

Save to database & storage.

Parameters:

upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.

Return type:

Artifact

Examples

>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
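
In an instance with hybrid storage mode, the upload flag documented above can trigger the upload to cloud storage explicitly (a hedged sketch):

>>> artifact.save(upload=True)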