RDF export & SPARQL queries

SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.

import warnings

warnings.filterwarnings("ignore")
# !pip install 'lamindb[aws,bionty]'
!lamin load laminlabs/cellxgene
 connected lamindb: laminlabs/cellxgene
import bionty as bt

from rdflib import Graph, Literal, RDF, URIRef

Generally, we need to build a directed RDF graph composed of triple statements. Each statement consists of:

  1. a node for the subject

  2. an arc for the predicate, pointing from the subject to the object

  3. a node for the object.

Each of the three parts can be identified by a URI; the object may alternatively be a literal value such as a string.

We can use the DataFrame representation of lamindb registries to build an RDF graph.

Building an RDF graph

diseases = bt.Disease.df()
diseases.head()
 connected lamindb: laminlabs/cellxgene
uid name ontology_id abbr synonyms description source_id run_id created_at created_by_id
id
770 1tVxb3df premalignant hematological system disease MONDO:0060782 None premalignant hematologic condition A Hematologic Disorder Which Does Not Display ... 49.0 NaN 2024-12-04 14:02:04.904910+00:00 6
769 1sfLCprL clonal hematopoiesis MONDO:0100542 None None None NaN 27.0 2024-07-12 14:20:56.067237+00:00 1
767 7Zxvmde6 hereditary motor neuron disease MONDO:0024257 None genetic anterior horn cell disease|hereditary ... An Instance Of Motor Neuron Disease That Is Ca... 49.0 27.0 2024-07-12 14:01:02.371130+00:00 1
766 3nbMYxeu familial amyotrophic lateral sclerosis MONDO:0005144 None hereditary amyotrophic lateral sclerosis An Instance Of Amyotrophic Lateral Sclerosis T... 49.0 27.0 2024-07-12 14:01:00.473714+00:00 1
765 4wI2szgu basal cell neoplasm MONDO:0020799 None basal cell tumor A Neoplastic Proliferation Of Basal Cells In T... 49.0 27.0 2024-07-12 14:00:59.087086+00:00 1

We convert the DataFrame to RDF by generating three triples per row: one for the type, one for the name, and one for the description.

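rdf_graph = Graph()

# Base URI under which all subjects, predicates, and classes are created
namespace = URIRef("http://sparql-example.org/")

for _, row in diseases.iterrows():
    # One subject per disease, identified by its ontology ID
    subject = URIRef(namespace + str(row["ontology_id"]))
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row["name"])))
    # A missing description (None) is stored as the literal string "None"
    rdf_graph.add(
        (subject, URIRef(namespace + "description"), Literal(row["description"]))
    )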

rdf_graph
<Graph identifier=N8826cbe23c50453d95b12f34d24634ee (<class 'rdflib.graph.Graph'>)>

Now we can query the RDF graph with SPARQL for the name and description of each disease, limiting the output to five results:

query = """
SELECT ?name ?description
WHERE {
  ?disease a <http://sparql-example.org/Disease> .
  ?disease <http://sparql-example.org/name> ?name .
  ?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""

for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: premalignant hematological system disease, Description: A Hematologic Disorder Which Does Not Display The Morphologic And/Or Clinical Characteristics Of An Overt Malignancy. Representative Examples Include Atypical Lymphoproliferative Disorders And Myelodysplastic Syndromes.
Name: clonal hematopoiesis, Description: None
Name: hereditary motor neuron disease, Description: An Instance Of Motor Neuron Disease That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: familial amyotrophic lateral sclerosis, Description: An Instance Of Amyotrophic Lateral Sclerosis That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: basal cell neoplasm, Description: A Neoplastic Proliferation Of Basal Cells In The Epidermis (Part Of The Skin) Or Other Anatomic Sites (Most Frequently The Salivary Glands). The Basal Cell Neoplastic Proliferation In The Epidermis Results In Basal Cell Carcinomas. The Basal Cell Neoplastic Proliferation In The Salivary Glands Can Be Benign, Resulting In Basal Cell Adenomas Or Malignant, Resulting In Basal Cell Adenocarcinomas.