RDF export & SPARQL queries
SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.
import warnings
warnings.filterwarnings("ignore")
# !pip install 'lamindb[aws,bionty]'
!lamin load laminlabs/cellxgene
→ connected lamindb: laminlabs/cellxgene
import bionty as bt
from rdflib import Graph, Literal, RDF, URIRef
In general, we need to build a directed RDF graph composed of triples, also called statements. Each triple is represented by:
a node for the subject
an arc from the subject to the object for the predicate
a node for the object.
Each of the three parts can be identified by a URI; the object can alternatively be a literal value, such as a string or a number.
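To make the triple structure concrete, here is a minimal sketch in plain Python, independent of rdflib. The `Triple` class and the `match` helper are hypothetical names introduced only for this illustration; `None` in a pattern plays the role of a SPARQL variable.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One RDF-style statement: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


NS = "http://sparql-example.org/"
# Standard URI of the rdf:type predicate
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

triples = [
    Triple(NS + "MONDO:0019340", RDF_TYPE, NS + "Disease"),
    Triple(NS + "MONDO:0019340", NS + "name", "scleroderma"),
    Triple(NS + "MONDO:0100542", RDF_TYPE, NS + "Disease"),
    Triple(NS + "MONDO:0100542", NS + "name", "clonal hematopoiesis"),
]


def match(
    statements: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return all statements matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in statements
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# "Which names are stated in the graph?" corresponds to the pattern (?, name, ?)
names = [t.obj for t in match(triples, predicate=NS + "name")]
print(names)  # ['scleroderma', 'clonal hematopoiesis']
```

A real triple store indexes statements rather than scanning a list, but the pattern-matching idea is the same one that a SPARQL `WHERE` clause expresses.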
We can use the DataFrame representation of lamindb registries to build an RDF graph.
Building an RDF graph
diseases = bt.Disease.df()
diseases.head()
→ connected lamindb: laminlabs/cellxgene
id | uid | name | ontology_id | abbr | synonyms | description | source_id | run_id | created_at | created_by_id |
---|---|---|---|---|---|---|---|---|---|---|
769 | 1sfLCprL | clonal hematopoiesis | MONDO:0100542 | None | None | None | NaN | 27.0 | 2024-07-12 14:20:56.067237+00:00 | 1 |
767 | 7Zxvmde6 | hereditary motor neuron disease | MONDO:0024257 | None | genetic anterior horn cell disease|hereditary ... | An Instance Of Motor Neuron Disease That Is Ca... | 49.0 | 27.0 | 2024-07-12 14:01:02.371130+00:00 | 1 |
766 | 3nbMYxeu | familial amyotrophic lateral sclerosis | MONDO:0005144 | None | hereditary amyotrophic lateral sclerosis | An Instance Of Amyotrophic Lateral Sclerosis T... | 49.0 | 27.0 | 2024-07-12 14:01:00.473714+00:00 | 1 |
765 | 4wI2szgu | basal cell neoplasm | MONDO:0020799 | None | basal cell tumor | A Neoplastic Proliferation Of Basal Cells In T... | 49.0 | 27.0 | 2024-07-12 14:00:59.087086+00:00 | 1 |
764 | g0eRt9m2 | scleroderma | MONDO:0019340 | None | dermatosclerosis|scleroderma (disease)|sclerod... | Scleroderma Is A Rare Autoimmune Connective Ti... | 49.0 | 27.0 | 2024-07-12 14:00:57.255644+00:00 | 1 |
We convert the DataFrame to RDF by generating triples.
rdf_graph = Graph()
# Base URI under which all example resources and predicates are created
namespace = URIRef("http://sparql-example.org/")

for _, row in diseases.iterrows():
    # The ontology ID identifies the disease node
    subject = URIRef(namespace + str(row["ontology_id"]))
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row["name"])))
    rdf_graph.add(
        (subject, URIRef(namespace + "description"), Literal(row["description"]))
    )
rdf_graph
<Graph identifier=N19fc43a14dff4a518bf7e771fc209ef7 (<class 'rdflib.graph.Graph'>)>
Now we can query the RDF graph using SPARQL for the name and associated description:
query = """
SELECT ?name ?description
WHERE {
    ?disease a <http://sparql-example.org/Disease> .
    ?disease <http://sparql-example.org/name> ?name .
    ?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""

for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: clonal hematopoiesis, Description: None
Name: hereditary motor neuron disease, Description: An Instance Of Motor Neuron Disease That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: familial amyotrophic lateral sclerosis, Description: An Instance Of Amyotrophic Lateral Sclerosis That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: basal cell neoplasm, Description: A Neoplastic Proliferation Of Basal Cells In The Epidermis (Part Of The Skin) Or Other Anatomic Sites (Most Frequently The Salivary Glands). The Basal Cell Neoplastic Proliferation In The Epidermis Results In Basal Cell Carcinomas. The Basal Cell Neoplastic Proliferation In The Salivary Glands Can Be Benign, Resulting In Basal Cell Adenomas Or Malignant, Resulting In Basal Cell Adenocarcinomas.
Name: scleroderma, Description: Scleroderma Is A Rare Autoimmune Connective Tissue Disorder Characterized By Abnormal Hardening Of The Skin And, Sometimes, Other Organs. It Is Classified Into Two Main Forms: Localized Scleroderma And Systemic Sclerosis (Ssc), The Latter Comprising Three Subsets; Diffuse Cutaneous Ssc (Dcssc), Limited Cutaneous Ssc (Lcssc) And Limited Ssc (Lssc).