HubMAP
The HubMAP (Human BioMolecular Atlas Program) consortium is an initiative mapping human cells to create a comprehensive atlas, with its Data Portal serving as the platform where researchers can access, visualize, and download tissue data.
Lamin mirrors most of the datasets for simplified access here: laminlabs/hubmap.
If you use the data academically, please cite the original publication Jain et al. 2023.
Here, we show how the HubMAP instance is structured and how datasets can be queried and accessed. If you'd like to transfer data into your own LaminDB instance, see the transfer guide.
# pip install lamindb
!lamin init --modules bionty --storage ./test-hubmap
→ initialized lamindb: testuser1/test-hubmap
import lamindb as ln
import h5py
import anndata as ad
import pandas as pd
→ connected lamindb: testuser1/test-hubmap
Create the central query object for this instance:
db = ln.DB("laminlabs/hubmap")
Getting HubMAP datasets and data products
HubMAP groups several data products, i.e. the individual raw datasets, into higher-level datasets. For example, the dataset HBM983.LKMP.544 has four data products.
The laminlabs/hubmap instance registers these data products as Artifacts that jointly form a Collection.
The key attribute of ln.Artifact and ln.Collection corresponds to the IDs in the HubMAP URLs.
For example, the id in the URL https://portal.hubmapconsortium.org/browse/dataset/20ee458e5ee361717b68ca72caf6044e is the key of the corresponding collection:
small_intenstine_collection = db.Collection.get(key="20ee458e5ee361717b68ca72caf6044e")
small_intenstine_collection
Collection(uid='xvmP4QeSH584JUbg0000', version=None, is_latest=True, key='20ee458e5ee361717b68ca72caf6044e', description='RNAseq data from the small intestine of a 67-year-old white female', hash='bxpInd96BItVhxWNhgQStw', reference=None, reference_type=None, meta_artifact=None, branch_id=1, space_id=1, created_by_id=5, run_id=35, created_at=2025-05-21 11:15:36 UTC, is_locked=False)
We can list all associated data products:
small_intenstine_collection.artifacts.all().to_dataframe()
! no run & transform got linked, call `ln.track()` & re-run
! run input wasn't tracked, call `ln.track()` and re-run
| id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | version | is_latest | is_locked | created_at | branch_id | space_id | storage_id | run_id | schema_id | created_by_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 876 | dYhDR2fx8dccLWer0000 | f6eb890063d13698feb11d39fa61e45a/scvelo_annota... | RNAseq data from the small intestine of a 67-y... | .h5ad | None | AnnData | 641007602 | HxvPzL_Pkx6ncEJJcS_GWw | None | NaN | None | True | False | 2025-05-21 11:15:19.475249+00:00 | 1 | 1 | 2 | 35 | None | 5 |
| 30 | enXVzwjw4voS8UCb0000 | f6eb890063d13698feb11d39fa61e45a/expr.h5ad | RNAseq data from the small intestine of a 67-y... | .h5ad | None | AnnData | 139737320 | kR476u81gwXI6rEbXzNBvQ | None | 6000.0 | None | True | False | 2025-01-28 14:16:43.385980+00:00 | 1 | 1 | 2 | 11 | None | 3 |
| 29 | fWN781TxuZibkBOR0000 | f6eb890063d13698feb11d39fa61e45a/secondary_ana... | RNAseq data from the small intestine of a 67-y... | .h5ad | None | AnnData | 888111371 | ian3P5CN68AAvoDMC6sZLw | None | 5956.0 | None | True | False | 2025-01-28 14:16:39.348589+00:00 | 1 | 1 | 2 | 11 | None | 3 |
| 28 | AzqCWQAKLMV3iTMA0000 | f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad | RNAseq data from the small intestine of a 67-y... | .h5ad | None | AnnData | 67867992 | of_TeLP6cet2JBj3o_kZmQ | None | 6000.0 | None | True | False | 2025-01-28 14:16:35.355582+00:00 | 1 | 1 | 2 | 11 | None | 3 |
Note that the key of these Artifacts corresponds to the assets URL.
For example, https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad is the direct URL to the expr.h5ad data product.
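Because an artifact's key is exactly the `<dataset_id>/<filename>` part of that URL, the direct download URL can be reconstructed from the key alone. A minimal sketch (the helper name `assets_url` is made up for illustration):

```python
# Hypothetical helper: an artifact's key appended to the HubMAP assets host
# yields the direct download URL for that data product.
def assets_url(key: str) -> str:
    return f"https://assets.hubmapconsortium.org/{key}"

url = assets_url("f6eb890063d13698feb11d39fa61e45a/expr.h5ad")
print(url)
# → https://assets.hubmapconsortium.org/f6eb890063d13698feb11d39fa61e45a/expr.h5ad
```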
Artifacts can be loaded directly:
small_intenstine_artifact = small_intenstine_collection.artifacts.get(
    key__icontains="raw_expr.h5ad"
)
adata = small_intenstine_artifact.load()
adata
Querying single-cell RNA sequencing datasets
The artifacts corresponding to the raw_expr.h5ad data products are labeled with metadata.
The available metadata includes ln.Reference, bt.Tissue, bt.Disease, bt.ExperimentalFactor, and many more.
Please have a look at the instance for more details.
# Get one dataset with a specific type of heart failure
heart_failure_artifact = db.Artifact.filter(
    diseases__name="heart failure with reduced ejection fraction"
).first()
heart_failure_artifact
Artifact(uid='ZXUwnz25dyr8QaJd0000', version=None, is_latest=True, key='2d0deacd8be70eefbdc33ac107d97e58/expr.h5ad', description='RNAseq data from the heart of a 25-year-old white female', suffix='.h5ad', kind=None, otype='AnnData', size=54174128, hash='KISfOkWKEi6-RNn6--I-TQ', n_files=None, n_observations=52534, branch_id=1, space_id=1, storage_id=2, run_id=11, schema_id=None, created_by_id=3, created_at=2025-01-28 15:57:03 UTC, is_locked=False)
Querying bulk RNA sequencing datasets
Bulk datasets contain a single file, expression_matrices.h5, an HDF5 file containing transcript-by-sample matrices of TPM values and read counts.
These files are labeled with metadata, including ln.Reference, bt.Tissue, bt.Disease, bt.ExperimentalFactor, and many more.
To make the expression data usable with standard analysis workflows, we first read the TPM and raw count matrices from the file and then convert them into a single AnnData object.
In this object, raw read counts are stored in .X, and TPM values are added as a separate layer under .layers["tpm"].
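The groups inside expression_matrices.h5 follow a pandas-HDF-style layout: each group holds the matrix as `block0_values`, the column (sample) names as `block0_items`, and the row (transcript) names as `axis1`. A toy sketch that writes and reads back a file in this layout (all names and values are illustrative, not real HubMAP data):

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd

# Write a tiny file mimicking the layout described above (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "expression_matrices.h5")
with h5py.File(path, "w") as f:
    for name in ("tpm", "num_reads"):
        g = f.create_group(name)
        g.create_dataset("block0_values", data=np.arange(6, dtype="f8").reshape(3, 2))
        g.create_dataset("block0_items", data=np.array([b"sample1", b"sample2"]))
        g.create_dataset("axis1", data=np.array([b"tx1", b"tx2", b"tx3"]))

# Read one group back into a transcript-by-sample DataFrame.
with h5py.File(path, "r") as f:
    g = f["tpm"]
    df = pd.DataFrame(
        g["block0_values"][:],
        index=g["axis1"][:].astype(str),
        columns=g["block0_items"][:].astype(str),
    )
print(df.shape)  # → (3, 2)
```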
# Get one placenta tissue dataset:
placenta_data = db.Artifact.filter(tissues__name="placenta").first().cache()
def load_matrix(group):
    # Reassemble a transcript-by-sample DataFrame from the pandas-HDF-style
    # group layout: matrix values, column (sample) names, row (transcript) names
    values = group["block0_values"][:]
    columns = group["block0_items"][:].astype(str)
    index = group["axis1"][:].astype(str)
    return pd.DataFrame(values, index=index, columns=columns)
with h5py.File(placenta_data, "r") as f:
    tpm_df = load_matrix(f["tpm"])
    reads_df = load_matrix(f["num_reads"])
# Use raw read counts as the main matrix
placenta_adata = ad.AnnData(X=reads_df.values)
placenta_adata.obs_names = reads_df.index
placenta_adata.var_names = reads_df.columns
# Store TPM normalized values in a layer
placenta_adata.layers["tpm"] = tpm_df.values
placenta_adata
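As a sanity check on the semantics of the tpm layer: TPM normalizes read counts by transcript length and sequencing depth, so each sample's TPM values sum to one million. A toy numpy sketch with made-up counts and transcript lengths (not HubMAP data) showing the relationship:

```python
import numpy as np

# Hypothetical transcript-by-sample raw counts and per-transcript lengths.
counts = np.array([[10.0, 0.0], [20.0, 5.0], [30.0, 45.0]])
lengths = np.array([1000.0, 2000.0, 500.0])  # length in bases

# Reads per kilobase, then rescale each sample (column) to sum to 1e6.
rpk = counts / (lengths[:, None] / 1000.0)
tpm = rpk / rpk.sum(axis=0) * 1e6

print(tpm.sum(axis=0))  # → [1000000. 1000000.]
```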