RDF export & SPARQL queries

SPARQL is a query language used to retrieve and manipulate data stored in Resource Description Framework (RDF) format. In this tutorial, we demonstrate how lamindb registries can be queried with SPARQL.

import warnings
warnings.filterwarnings("ignore")

Install the lamindb Python package:

pip install 'lamindb[aws,bionty]'
!lamin load laminlabs/cellxgene
💡 connected lamindb: laminlabs/cellxgene
import bionty as bt

from rdflib import Graph, Literal, RDF, URIRef
💡 connected lamindb: laminlabs/cellxgene

To query with SPARQL, we first need to build a directed RDF graph composed of triple statements. Each statement consists of:

  1. a node for the subject

  2. an arc for the predicate, pointing from the subject to the object

  3. a node for the object.

Each of the three parts can be identified by a URI; the object can alternatively be a literal value such as a string.
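To make the triple structure concrete, here is a minimal sketch in plain Python, without rdflib. The `Triple` type and the `match` helper are hypothetical names introduced only for illustration; they mimic, in a very simplified way, how a single triple pattern in a query selects statements from a graph.

```python
from typing import NamedTuple

# Illustrative namespace, the same one used later in this tutorial.
EX = "http://sparql-example.org/"


class Triple(NamedTuple):
    """One statement: subject --predicate--> object."""
    subject: str
    predicate: str
    obj: str


# A tiny hand-written graph: two diseases, each with a type and a name.
triples = [
    Triple(EX + "MONDO:0019340", EX + "type", EX + "Disease"),
    Triple(EX + "MONDO:0019340", EX + "name", "scleroderma"),
    Triple(EX + "MONDO:0100542", EX + "type", EX + "Disease"),
    Triple(EX + "MONDO:0100542", EX + "name", "clonal hematopoiesis"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Roughly "?disease <.../name> ?name" in SPARQL terms.
names = [t.obj for t in match(triples, predicate=EX + "name")]
print(names)
```

A real triple store adds indexing, typed literals, and joins across several patterns, which is what rdflib and SPARQL provide in the rest of this tutorial.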

We can use the DataFrame representation of lamindb registries to build an RDF graph.

Building an RDF graph

diseases = bt.Disease.df()
diseases.head()
uid name ontology_id abbr synonyms description public_source_id run_id created_by_id updated_at
id
769 1sfLCprL clonal hematopoiesis MONDO:0100542 None None None NaN 27.0 1 2024-07-12 14:20:56.067341+00:00
767 7Zxvmde6 hereditary motor neuron disease MONDO:0024257 None genetic anterior horn cell disease|hereditary ... An Instance Of Motor Neuron Disease That Is Ca... 49.0 27.0 1 2024-07-12 14:01:02.371548+00:00
766 3nbMYxeu familial amyotrophic lateral sclerosis MONDO:0005144 None hereditary amyotrophic lateral sclerosis An Instance Of Amyotrophic Lateral Sclerosis T... 49.0 27.0 1 2024-07-12 14:01:00.473774+00:00
765 4wI2szgu basal cell neoplasm MONDO:0020799 None basal cell tumor A Neoplastic Proliferation Of Basal Cells In T... 49.0 27.0 1 2024-07-12 14:00:59.087143+00:00
764 g0eRt9m2 scleroderma MONDO:0019340 None dermatosclerosis|scleroderma (disease)|sclerod... Scleroderma Is A Rare Autoimmune Connective Ti... 49.0 27.0 1 2024-07-12 14:00:57.255706+00:00

We convert the DataFrame to RDF by generating triples.

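rdf_graph = Graph()

# Base URI under which all subjects and predicates of this example live
namespace = URIRef("http://sparql-example.org/")

for _, row in diseases.iterrows():
    # Each disease becomes a subject identified by its ontology ID
    subject = URIRef(namespace + str(row['ontology_id']))
    # One triple for the type, one for the name, one for the description
    rdf_graph.add((subject, RDF.type, URIRef(namespace + "Disease")))
    rdf_graph.add((subject, URIRef(namespace + "name"), Literal(row['name'])))
    rdf_graph.add((subject, URIRef(namespace + "description"), Literal(row['description'])))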

rdf_graph
<Graph identifier=Nc2bda3da11884237a94b5ccea28d02b8 (<class 'rdflib.graph.Graph'>)>

Now we can query the RDF graph using SPARQL for the name and associated description:

query = """
SELECT ?name ?description
WHERE {
  ?disease a <http://sparql-example.org/Disease> .
  ?disease <http://sparql-example.org/name> ?name .
  ?disease <http://sparql-example.org/description> ?description .
}
LIMIT 5
"""

for row in rdf_graph.query(query):
    print(f"Name: {row.name}, Description: {row.description}")
Name: clonal hematopoiesis, Description: None
Name: hereditary motor neuron disease, Description: An Instance Of Motor Neuron Disease That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: familial amyotrophic lateral sclerosis, Description: An Instance Of Amyotrophic Lateral Sclerosis That Is Caused By An Inherited Modification Of The Individual'S Genome.
Name: basal cell neoplasm, Description: A Neoplastic Proliferation Of Basal Cells In The Epidermis (Part Of The Skin) Or Other Anatomic Sites (Most Frequently The Salivary Glands). The Basal Cell Neoplastic Proliferation In The Epidermis Results In Basal Cell Carcinomas. The Basal Cell Neoplastic Proliferation In The Salivary Glands Can Be Benign, Resulting In Basal Cell Adenomas Or Malignant, Resulting In Basal Cell Adenocarcinomas.
Name: scleroderma, Description: Scleroderma Is A Rare Autoimmune Connective Tissue Disorder Characterized By Abnormal Hardening Of The Skin And, Sometimes, Other Organs. It Is Classified Into Two Main Forms: Localized Scleroderma And Systemic Sclerosis (Ssc), The Latter Comprising Three Subsets; Diffuse Cutaneous Ssc (Dcssc), Limited Cutaneous Ssc (Lcssc) And Limited Ssc (Lssc).