lamindb.Artifact¶
- class lamindb.Artifact(data: UPathStr, type: ArtifactKind | None = None, key: str | None = None, description: str | None = None, revises: Artifact | None = None, run: Run | None = None)¶
Bases: Record, IsVersioned, TracksRun, TracksUpdates
Datasets & models stored as files, folders, or arrays.
Artifacts manage data in local or remote storage.
Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.
- Parameters:
data – UPathStr
A path to a local or remote folder or file.
type – Literal["dataset", "model"] | None = None
The artifact type.
key – str | None = None
A path-like key to reference artifact in default storage, e.g., "myfolder/myfile.fcs". Artifacts with the same key form a revision family.
description – str | None = None
A description.
revises – Artifact | None = None
Previous version of the artifact. Triggers a revision.
run – Run | None = None
The run that creates the artifact.
Typical storage formats & their API accessors
Arrays:
Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table
Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData
Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders
Non-arrays:
Image: .jpg, .png ⟷ np.ndarray, …
Fastq: .fastq ⟷ /
VCF: .vcf ⟷ /
QC: .html ⟷ /
You’ll find these values in the suffix & accessor fields.
LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
See also
Storage
Storage locations for artifacts.
Collection
Collections of artifacts.
from_df()
Create an artifact from a DataFrame.
from_anndata()
Create an artifact from an AnnData.
Examples
Create an artifact from a file path and pass description:
>>> artifact = ln.Artifact("s3://my_bucket/my_folder/my_file.csv", description="My file")
>>> artifact = ln.Artifact("./my_local_file.jpg", description="My image")
You can also pass key to create a virtual filepath hierarchy:
>>> artifact = ln.Artifact("./my_local_file.jpg", key="example_datasets/dataset1.jpg")
What works for files also works for folders:
>>> artifact = ln.Artifact("s3://my_bucket/my_folder", description="My folder")
>>> artifact = ln.Artifact("./my_local_folder", description="My local folder")
>>> artifact = ln.Artifact("./my_local_folder", key="project1/my_target_folder")
Why does the API look this way?
It’s inspired by APIs building on AWS S3.
Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.
In boto3:
# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')
In quilt3:
# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
Make a new version of an artifact:
>>> artifact = ln.Artifact.from_df(df, key="example_datasets/dataset1.parquet").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, key="example_datasets/dataset1.parquet").save()
Alternatively, if you don’t want to provide a value for key, you can use revises:
>>> artifact = ln.Artifact.from_df(df, description="My dataframe").save()
>>> artifact_v2 = ln.Artifact.from_df(df_updated, revises=artifact).save()
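To make the versioning rule above concrete, here is a minimal in-memory sketch of how saving under an existing key could extend a version family and move the is_latest flag. ToyArtifact and ToyRegistry are hypothetical names used only for illustration; this is a toy model, not LaminDB's implementation.

```python
from dataclasses import dataclass


@dataclass
class ToyArtifact:
    """Hypothetical stand-in for an artifact record (not a LaminDB class)."""
    key: str
    content: str
    version_index: int
    is_latest: bool = True


class ToyRegistry:
    """Hypothetical in-memory registry that groups records by key."""

    def __init__(self):
        self.families = {}  # maps key -> list of ToyArtifact, oldest first

    def save(self, key, content):
        family = self.families.setdefault(key, [])
        if family:
            # The previous head of the family is no longer the latest version.
            family[-1].is_latest = False
        artifact = ToyArtifact(key, content, version_index=len(family))
        family.append(artifact)
        return artifact


registry = ToyRegistry()
v1 = registry.save("example_datasets/dataset1.parquet", "first upload")
v2 = registry.save("example_datasets/dataset1.parquet", "updated upload")
print(v1.is_latest, v2.is_latest)
```

Both records share the same key, so they land in one family, and only the most recent one keeps is_latest set.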
Attributes¶
- features: FeatureManager¶
Feature manager.
Features denote dataset dimensions, i.e., the variables that measure labels & numbers.
Annotate with features & values:
artifact.features.add_values({
    "species": organism,  # here, organism is an Organism record
    "scientist": ['Barbara McClintock', 'Edgar Anderson'],
    "temperature": 27.6,
    "study": "Candidate marker study"
})
Query for features & values:
ln.Artifact.features.filter(scientist="Barbara McClintock")
Features may or may not be part of the artifact content in storage. For instance, the Curator flow validates the columns of a DataFrame-like artifact and annotates it with features corresponding to these columns. artifact.features.add_values, by contrast, does not validate the content of the artifact.
- property labels: LabelManager¶
Label manager.
To annotate with labels, you typically use the registry-specific accessors, for instance ulabels:
candidate_marker_study = ln.ULabel(name="Candidate marker study").save()
artifact.ulabels.add(candidate_marker_study)
Similarly, you query based on these accessors:
ln.Artifact.filter(ulabels__name="Candidate marker study").all()
Unlike the registry-specific accessors, the .labels accessor provides a way of associating labels with features:
study = ln.Feature(name="study", dtype="cat").save()
artifact.labels.add(candidate_marker_study, feature=study)
Note that the above is equivalent to:
artifact.features.add_values({"study": candidate_marker_study})
- property n_objects: int¶
- params: ParamManager¶
Param manager.
Example:
artifact.params.add_values({
    "hidden_size": 32,
    "bottleneck_size": 16,
    "batch_size": 32,
    "preprocess_params": {
        "normalization_type": "cool",
        "subset_highlyvariable": True,
    },
})
- property path: Path | UPath¶
Path.
File in cloud storage, here AWS S3:
>>> artifact = ln.Artifact("s3://my-bucket/my-file.csv").save()
>>> artifact.path
S3Path('s3://my-bucket/my-file.csv')
File in local storage:
>>> ln.Artifact("./myfile.csv", key="myfile").save()
>>> artifact = ln.Artifact.get(key="myfile")
>>> artifact.path
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/myfile.csv')
- property stem_uid: str¶
Universal id characterizing the version family.
The full uid of a record is obtained via concatenating the stem uid and version information:
stem_uid = random_base62(n_char)  # a random base62 sequence of length 12 (transform) or 16 (artifact, collection)
version_uid = "0000"  # an auto-incrementing 4-digit base62 number
uid = f"{stem_uid}{version_uid}"  # concatenate the stem_uid & version_uid
- property type: str¶
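The uid composition described under stem_uid can be sketched in plain Python. This is an illustrative sketch only: the alphabet ordering (digits, then lowercase, then uppercase) and the helper encode_base62 are assumptions, not taken from LaminDB's source.

```python
import secrets
import string

# Assumed alphabet order for this sketch; LaminDB's actual ordering may differ.
BASE62 = string.digits + string.ascii_letters


def random_base62(n_char: int) -> str:
    """Draw n_char random characters from the base62 alphabet."""
    return "".join(secrets.choice(BASE62) for _ in range(n_char))


def encode_base62(number: int, width: int = 4) -> str:
    """Encode a non-negative integer as a zero-padded base62 string (hypothetical helper)."""
    digits = []
    while number:
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits)).rjust(width, BASE62[0])


stem_uid = random_base62(16)             # 16 characters for an artifact
uid_v1 = f"{stem_uid}{encode_base62(0)}"  # first version ends in "0000"
uid_v2 = f"{stem_uid}{encode_base62(1)}"  # next version shares the stem
print(len(uid_v1), uid_v1[16:], uid_v2[16:])
```

All versions in a family share the first 16 characters, so the stem alone identifies the family while the trailing 4 characters distinguish versions.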
Simple fields¶
- uid: str¶
A universal random id.
- key: str | None¶
A (virtual) relative file path within the artifact’s storage location.
Setting a key is useful to automatically group artifacts into a version family.
LaminDB defaults to a virtual file path to make renaming of data in object storage easy.
If you register existing files in a storage location, the key equals the actual filepath on the underlying filesystem or object store.
- description: str | None¶
A description.
- suffix: str¶
Path suffix or empty string if no canonical suffix exists.
This is either a file suffix (".csv", ".h5ad", etc.) or the empty string "".
- kind: Literal['dataset', 'model'] | None¶
ArtifactKind (default None).
- otype: str | None¶
Default Python object type, e.g., DataFrame, AnnData.
- size: int | None¶
Size in bytes.
Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
- hash: str | None¶
Hash or pseudo-hash of artifact content.
Useful to ascertain integrity and avoid duplication.
- n_files: int | None¶
Number of files for folder-like artifacts, None for file-like artifacts.
Note that some arrays are also stored as folders, e.g., .zarr or .tiledbsoma.
Changed in version 1.0: Renamed from n_objects to n_files.
- n_observations: int | None¶
Number of observations.
Typically, this denotes the first array dimension.
- version: str | None¶
Version (default None).
Defines version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
- is_latest: bool¶
Boolean flag that indicates whether a record is the latest in its version family.
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
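The suffix, size, and hash fields can be illustrated with the standard library. This is a rough sketch under stated assumptions: path_suffix, format_size, and content_hash are hypothetical helpers, SHA-256 is an arbitrary choice, and LaminDB's own suffix handling and hashing scheme may differ (for instance, this sketch only returns the last suffix of a path).

```python
import hashlib
import tempfile
from pathlib import Path, PurePosixPath


def path_suffix(key: str) -> str:
    """Return the last file suffix of a key, or "" if there is none."""
    return PurePosixPath(key).suffix


def format_size(n_bytes: int) -> str:
    """Format a byte count with decimal units (1KB = 1e3 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


def content_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash file content in chunks; SHA-256 is an assumption of this sketch."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.csv"
    duplicate = Path(tmp) / "b.csv"
    first.write_text("x,y\n1,2\n")
    duplicate.write_text("x,y\n1,2\n")
    # Equal hashes flag the second file as a duplicate of the first.
    same_content = content_hash(first) == content_hash(duplicate)

print(path_suffix("example_datasets/dataset1.parquet"))
print(format_size(1_000_000))
print(same_content)
```

Comparing hashes rather than file names is what makes duplicate detection independent of the key under which the content is stored.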
Relational fields¶
- space: Space¶
The space in which the record lives.
- collections: Collection¶
The collections that this artifact is part of.
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
QuerySet
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
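For readers unfamiliar with Django query expressions, the following toy sketch shows how a keyword such as name__startswith splits into a field name and a lookup. The matches function is hypothetical and works on plain dictionaries; it supports only three lookups and does not traverse relations (as ulabels__name does), so it illustrates the naming convention rather than Django's or LaminDB's query engine.

```python
def matches(record: dict, **expressions) -> bool:
    """Check a dict against Django-style `field__lookup=value` expressions (toy version)."""
    for expression, expected in expressions.items():
        # "name__startswith" -> field "name", lookup "startswith"; no "__" means exact match.
        field, _, lookup = expression.partition("__")
        value = record[field]
        if lookup in ("", "exact"):
            ok = value == expected
        elif lookup == "startswith":
            ok = str(value).startswith(expected)
        elif lookup == "contains":
            ok = expected in str(value)
        else:
            raise ValueError(f"unsupported lookup in this sketch: {lookup}")
        if not ok:
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
hits = [record for record in records if matches(record, name__startswith="my")]
print(hits)
```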
- classmethod from_anndata(adata, key=None, description=None, run=None, revises=None, **kwargs)¶
Create from AnnData, validate & link features.
- Parameters:
adata (AnnData | UPathStr) – An AnnData object or a path to an AnnData-like object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()
- classmethod from_df(df, key=None, description=None, run=None, revises=None, **kwargs)¶
Create from DataFrame, validate & link features.
- Parameters:
df (DataFrame) – A DataFrame object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()
- classmethod from_dir(path, key=None, *, run=None)¶
Create a list of artifact objects from a directory.
Hint
If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.
- Parameters:
path (lamindb.core.types.UPathStr) – Source path of folder.
key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, the inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.
run (Run | None, default: None) – A Run object.
- Return type:
list[Artifact]
Examples
>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
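The key inference rule described for the key parameter can be sketched as follows. infer_key is a hypothetical helper written for this illustration, and the storage root is passed in explicitly; how LaminDB actually resolves registered storage locations is not shown here.

```python
from pathlib import PurePosixPath


def infer_key(path: str, storage_root: str, key=None) -> str:
    """Mimic the documented rule: explicit key, else relative position, else path.name."""
    if key is not None:
        return key
    directory = PurePosixPath(path)
    try:
        # Inside the registered location: the key reflects the relative position.
        return str(directory.relative_to(PurePosixPath(storage_root)))
    except ValueError:
        # Outside the registered location: fall back to the directory name.
        return directory.name


inside = infer_key("/data/store/runs/sample_001", "/data/store")
outside = infer_key("/tmp/sample_001", "/data/store")
explicit = infer_key("/tmp/sample_001", "/data/store", key="project1/sample_001")
print(inside, outside, explicit)
```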
- classmethod from_mudata(mdata, key=None, description=None, run=None, revises=None, **kwargs)¶
Create from MuData, validate & link features.
- Parameters:
mdata (MuData) – A MuData object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".
description (str | None, default: None) – A description.
revises (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
Artifact
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A record.
- Raises:
lamindb.core.exceptions.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
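The idea of an auto-complete object can be approximated with collections.namedtuple: field values are normalized into valid attribute names, and each attribute returns either the whole record or a chosen return field. build_lookup and the gene records below are hypothetical (the Ensembl-style ids are placeholders, not real identifiers); this is only a sketch of the concept, not the object that lookup() returns.

```python
import re
from collections import namedtuple


def build_lookup(records, field, return_field=None):
    """Build a namedtuple whose attributes are normalized values of `field` (toy version)."""
    entries = {}
    for record in records:
        # "ADGB-DT" -> "adgb_dt": replace non-word characters, then lowercase.
        attribute = re.sub(r"\W+", "_", record[field]).lower()
        entries[attribute] = record if return_field is None else record[return_field]
    Lookup = namedtuple("Lookup", list(entries))
    return Lookup(**entries)


# Placeholder records for illustration only.
genes = [
    {"symbol": "ADGB-DT", "ensembl_gene_id": "ENSG_PLACEHOLDER_1"},
    {"symbol": "TP53", "ensembl_gene_id": "ENSG_PLACEHOLDER_2"},
]

lookup = build_lookup(genes, field="symbol")
lookup_symbols = build_lookup(genes, field="ensembl_gene_id", return_field="symbol")
print(lookup.adgb_dt["symbol"], lookup_symbols.ensg_placeholder_1)
```

In this sketch, namedtuple's own _asdict() plays the role of a dictionary converter, although it is keyed by the normalized attribute names rather than by the original field values.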
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
QuerySet
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
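To give a feel for ranked search with limit and case_sensitive, here is a small sketch that scores candidates with difflib.SequenceMatcher from the standard library. The toy_search function and its 0 to 100 score are assumptions made for illustration; the scoring used by search() itself is not documented here and is likely different.

```python
from difflib import SequenceMatcher


def toy_search(string, candidates, limit=20, case_sensitive=False):
    """Rank candidates by similarity to `string` and return the top `limit` (toy version)."""

    def score(candidate):
        query, target = string, candidate
        if not case_sensitive:
            query, target = query.lower(), target.lower()
        # ratio() is 1.0 for identical strings; scale to a 0-100 score.
        return round(100 * SequenceMatcher(None, query, target).ratio(), 1)

    ranked = sorted(
        ((candidate, score(candidate)) for candidate in candidates),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:limit]


results = toy_search("ULabel2", ["ULabel1", "ULabel2", "ULabel3"], limit=2)
print(results[0])
```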
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
QuerySet
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
Methods¶
- cache(is_run_input=None)¶
Download cloud artifact to local cache.
Follows syncing logic: only caches an artifact if it’s outdated in the local cache.
Returns a path to a locally cached on-disk object (say a .jpg file).
- Return type:
Path
Examples
Sync file from cloud and return the local path of the cache:
>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
- delete(permanent=None, storage=None, using_key=None)¶
Trash or permanently delete.
A first call to .delete() puts an artifact into the trash (sets _branch_code to -1). A second call permanently deletes the artifact. If it is a folder artifact with multiple versions, deleting a non-latest version will not delete the underlying storage by default (if storage=True is not specified). Deleting the latest version will delete all the versions for folder artifacts.
FAQ: Storage FAQ
- Parameters:
permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).
storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.
- Return type:
None
Examples
For an Artifact object artifact, call:
>>> artifact = ln.Artifact.filter(key="some.csv").one()
>>> artifact.delete()  # delete a single file artifact
>>> artifact = ln.Artifact.filter(key="some.tiledbsoma", is_latest=False).first()
>>> artifact.delete()  # delete an old version, the data will not be deleted
>>> artifact = ln.Artifact.filter(key="some.tiledbsoma", is_latest=True).one()
>>> artifact.delete()  # delete all versions, the data will be deleted or prompted for deletion.
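The two-step deletion described above (trash first, then permanent deletion) behaves like a small state machine. The sketch below models it with a hypothetical ToyRecord class; the value 1 for an active record is an assumption, only the trash value -1 comes from the text, and storage handling is left out entirely.

```python
class ToyRecord:
    """Hypothetical record with trash semantics (not a LaminDB class)."""

    ACTIVE = 1    # assumed value for a visible record
    TRASHED = -1  # the trash value named in the documentation

    def __init__(self):
        self.branch_code = self.ACTIVE
        self.exists = True

    def delete(self, permanent=None):
        if permanent or self.branch_code == self.TRASHED:
            # Already in the trash, or the caller asked to skip it.
            self.exists = False
        else:
            self.branch_code = self.TRASHED

    def restore(self):
        # Mirrors restore(): move the record out of the trash.
        self.branch_code = self.ACTIVE


record = ToyRecord()
record.delete()                      # first call: moved to trash
in_trash = record.branch_code == -1
record.restore()                     # back to active
record.delete()
record.delete()                      # second call on a trashed record: gone
print(in_trash, record.exists)
```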
- describe(print_types=False)¶
Describe relations of record.
Examples
>>> artifact.describe()
- load(is_run_input=None, **kwargs)¶
Cache and load into memory.
See all loaders.
- Return type:
Any
Examples
Load a DataFrame-like artifact:
>>> artifact.load().head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
Load an AnnData-like artifact:
>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765
Fall back to cache() if no in-memory representation is configured:
>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
- open(mode='r', is_run_input=None)¶
Return a cloud-backed data object.
Works for AnnData (.h5ad and .zarr), generic hdf5 and zarr, tiledbsoma objects (.tiledbsoma), and pyarrow-compatible formats.
- Parameters:
mode (str, default: 'r') – Can be "w" (write mode) only for tiledbsoma stores; otherwise it should always be "r" (read-only mode).
- Return type:
AnnDataAccessor | BackedAccessor | SOMACollection | SOMAExperiment | PyArrowDataset
Notes
For more info, see tutorial: Slice arrays.
Examples
Read AnnData in backed mode from cloud:
>>> artifact = ln.Artifact.get(key="lndb-storage/pbmc68k.h5ad")
>>> artifact.open()
AnnDataAccessor object with n_obs × n_vars = 70 × 765
constructed for the AnnData object pbmc68k.h5ad
...
- replace(data, run=None, format=None)¶
Replace artifact content.
- Parameters:
data (lamindb.core.types.UPathStr) – A file path.
run (Run | None, default: None) – The run that created the artifact gets auto-linked if ln.track() was called.
- Return type:
None
Examples
Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.
This is how we replace the old file in storage with the new file:
>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()
Note that this neither changes the storage key nor the filename.
However, it will update the suffix if it changes.
- restore()¶
Restore from trash.
- Return type:
None
Examples
For any Artifact object artifact, call:
>>> artifact.restore()
- save(upload=None, **kwargs)¶
Save to database & storage.
- Parameters:
upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.
- Return type:
Artifact
Examples
>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()
- view_lineage(with_children=True)¶
Graph of data flow.
- Return type:
None
Notes
For more info, see use cases: Data lineage.
Examples
>>> collection.view_lineage()
>>> artifact.view_lineage()