RDF export & SPARQL queries

SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.

import warnings

warnings.filterwarnings("ignore")
# pip install 'lamindb[bionty]' rdflib
!lamin connect laminlabs/lamindata
 connected lamindb: laminlabs/lamindata
import bionty as bt

from rdflib import Graph, Literal, RDF, URIRef

To query data with SPARQL, we first need to build a directed RDF graph composed of triple statements. Each statement consists of:

  1. a node for the subject

  2. an arc for the predicate, pointing from the subject to the object

  3. a node for the object.

Each of the three parts can be identified by a URI; the object can alternatively be a literal value such as a string.

We can use the DataFrame representation of lamindb registries to build an RDF graph.

Building an RDF graph

diseases = bt.Disease.df()
diseases.head()
 connected lamindb: laminlabs/lamindata
id uid name ontology_id abbr synonyms description space_id source_id run_id created_at created_by_id _aux _branch_code
172 4c6NK4On acute disease MONDO:0020683 None disease, acute|acute disease|acute diseases Disease Having A Short And Relatively Severe C... 1 76 NaN 2025-01-08 13:33:31.718016+00:00 2 None 1
171 IfbzfDzV non-Hodgkin lymphoma MONDO:0018908 None non-Hodgkin lymphoma|non-Hodgkin's lymphoma|no... Distinct From Hodgkin Lymphoma Both Morphologi... 1 76 NaN 2025-01-08 13:33:31.717997+00:00 2 None 1
170 2AhKtWA4 lymphoid hemopathy MONDO:0015757 None None None 1 76 NaN 2025-01-08 13:33:31.717978+00:00 2 None 1
169 7EIZsogb acute leukemia MONDO:0010643 None acute leukaemia (disease)|acute leukemia|acute... A Clonal (Malignant) Hematopoietic Disorder Wi... 1 76 NaN 2025-01-08 13:33:31.717959+00:00 2 None 1
168 5fuD5lYR T-cell leukemia MONDO:0005525 None leukaemia (disease) of T cell|T cell leukemia ... A Malignant Disease Of The T-Lymphocytes In Th... 1 76 NaN 2025-01-08 13:33:31.717940+00:00 2 None 1

We convert the DataFrame to RDF by generating triples.

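rdf_graph = Graph()

# Example base URI under which all subjects and predicates are created
namespace = URIRef("http://sparql-example.org/")

for _, row in diseases.iterrows():
    # One subject node per disease, identified by its ontology ID
    subject = URIRef(namespace + str(row["ontology_id"]))
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row["name"])))
    # Note: a missing description (None) is stored as the literal string "None"
    rdf_graph.add(
        (subject, URIRef(namespace + "description"), Literal(row["description"]))
    )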

rdf_graph
<Graph identifier=Nde160a2346d74052afc693948898f0bd (<class 'rdflib.graph.Graph'>)>

Now we can query the RDF graph using SPARQL for the name and associated description:

query = """
SELECT ?name ?description
WHERE {
  ?disease a <http://sparql-example.org/Disease> .
  ?disease <http://sparql-example.org/name> ?name .
  ?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""

for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: acute disease, Description: Disease Having A Short And Relatively Severe Course.
Name: non-Hodgkin lymphoma, Description: Distinct From Hodgkin Lymphoma Both Morphologically And Biologically, Non-Hodgkin Lymphoma (Nhl) Is Characterized By The Absence Of Reed-Sternberg Cells, Can Occur At Any Age, And Usually Presents As A Localized Or Generalized Lymphadenopathy Associated With Fever And Weight Loss. The Clinical Course Varies According To The Morphologic Type. Nhl Is Clinically Classified As Indolent, Aggressive, Or Having A Variable Clinical Course. Nhl Can Be Of B-Or T-/Nk-Cell Lineage.
Name: lymphoid hemopathy, Description: None
Name: acute leukemia, Description: A Clonal (Malignant) Hematopoietic Disorder With An Acute Onset, Affecting The Bone Marrow And The Peripheral Blood. The Malignant Cells Show Minimal Differentiation And Are Called Blasts, Either Myeloid Blasts (Myeloblasts) Or Lymphoid Blasts (Lymphoblasts).
Name: T-cell leukemia, Description: A Malignant Disease Of The T-Lymphocytes In The Bone Marrow, Thymus, And/Or Blood.