A Common SPARQL Extension

I have often heard people lament the lack of federated systems in the Semantic Web. Certainly, the use of URIs allows data from many different sources to link together seamlessly, and the judicious use of assertions like owl:sameAs can help fill in the gaps, but if a developer wants to do something interesting with data from more than one SPARQL endpoint, they must draw it all in from various locations before merging and linking it all locally. Where is the “webiness” in this data? Wasn’t SPARQL supposed to do all of the processing work for us? Why do we need to do this work manually?

The SPARQL Working Group does intend to address this issue, and has just accepted a note on federated queries. This still has a way to go to publication, so expect to see changes. Once the proposal has been finalized there is still a long way to go before we see widespread deployment, but it is good to see that the question is being addressed. In the meantime, there are alternative approaches.

When graphs are named in SPARQL queries, they are referred to by URI. This follows the standard model of RDF in which every possible resource can have a URI associated with it. URIs allow graphs to interact seamlessly with the rest of the data in the database, and indeed, the rest of the data in the Semantic Web at large. Because the RDF data model is open, and the Semantic Web extends far beyond the local database, it seems reasonable that a SPARQL endpoint could be asked about a graph that it has no prior knowledge of. What should happen in this situation? A traditional database in a closed world would report an error, and this used to be the behavior in some RDF stores as well. Once SPARQL was standardized, RDF stores were required to return an empty result set when asked for an unknown graph in a dataset. But a number of RDF stores choose to interpret the dataset as something somewhat broader than just the local database.

Graph Locations

URIs contain a scheme, and this typically identifies a protocol. URIs (Uniform Resource Identifiers) quite often identify a location on the internet, forming a URL (Uniform Resource Locator), most often starting with the ubiquitous http://. Because location-defining URIs allow applications to identify and link with more data, these are by far the most popular form of URI used in the Semantic Web. Like all other resources, the URIs used for graphs often form perfectly good locators on the net.

The specification for RDF says that an RDF document defines a graph. So when a database with a SPARQL endpoint is asked for a graph that it knows nothing about, instead of returning an empty set of data it may check to see if it can use the graph URI to locate an RDF document on the net. If an RDF document can be retrieved from the given location, then the datastore will treat this as the graph specified in the original query. In this way, it is treating the World Wide Web of documents as part of the dataset that it resolves against.
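The lookup order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the GraphStore class, its resolve method, and the fetch hook are hypothetical names, not the API of any particular RDF store, and the fetcher used here is a stand-in dictionary rather than a real HTTP client and RDF parser.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = Set[Triple]


class GraphStore:
    """Toy store that falls back to dereferencing unknown graph URIs."""

    def __init__(self, fetch: Callable[[str], Optional[Graph]]):
        self.local: Dict[str, Graph] = {}  # graphs already held locally
        self.fetch = fetch                 # hypothetical retrieve-and-parse hook

    def resolve(self, graph_uri: str) -> Graph:
        # 1. A graph the store already knows about always wins.
        if graph_uri in self.local:
            return self.local[graph_uri]
        # 2. Otherwise, try to use the graph URI as a locator for an RDF document.
        if graph_uri.startswith(("http://", "https://")):
            remote = self.fetch(graph_uri)
            if remote is not None:
                return remote
        # 3. Standard behavior: an unknown graph is simply empty.
        return set()


# Stand-in for "the web": maps document URLs to already-parsed graphs.
FAKE_WEB: Dict[str, Graph] = {
    "http://example.org/people.rdf": {
        ("http://example.org/people/Fred#me",
         "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
         "http://xmlns.com/foaf/0.1/Person"),
    }
}


def fake_fetch(uri: str) -> Optional[Graph]:
    return FAKE_WEB.get(uri)


store = GraphStore(fake_fetch)
remote_graph = store.resolve("http://example.org/people.rdf")
missing_graph = store.resolve("http://example.org/nothing-here.rdf")
```

A store without the extension would behave as if step 2 were absent and go straight to the empty graph.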

Not every SPARQL endpoint implements this extension, but it is reasonably common. For instance, among the open source projects Sesame and Mulgara both support this behavior, and Joseki has an option to enable it. Accessing a document this way naturally makes a database slower when compared to a system that has fully indexed its RDF, but it works correctly and can be a useful method of processing data that is available elsewhere.

CONSTRUCT as a Graph

While the most common type of SPARQL query is the SELECT query, CONSTRUCT queries are common as well. Both use an identical form of WHERE clause to find and process data, but they return results in very different formats. In place of a list of variables, a CONSTRUCT query provides a template whose terms are grouped in threes, so that each group forms a new triple. These templates can also include explicit URIs and literals, which are not permitted in the SELECT clause of SPARQL 1.0. The end result is a set of triples that forms a new RDF graph.

SELECT queries provide a great deal of flexibility, in that they can have an arbitrary number of variables in a single binding (or “row” to use the parlance of relational database results). While less common, it is even possible to have different sets of variables bound in each binding (this may happen when using OPTIONAL and UNION expressions). By contrast, CONSTRUCT queries always result in something that is three columns wide, representing the subject, predicate, and object of a triple. They overcome the limitation in width by being able to generate multiple triples for each set of bindings that they process from their WHERE clauses. To facilitate this, they can also manufacture new blank nodes, fresh for each set of bindings, to link the triples generated from that binding together.
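To make the relationship between bindings and triples concrete, here is a rough Python sketch of how a CONSTRUCT template might be applied to the rows produced by a WHERE clause. The construct function, the string encoding of terms ("?x" for variables, "_:x" for blank nodes), and the ex: and vcard: names are assumptions made for this illustration, not part of any SPARQL implementation.

```python
from itertools import count
from typing import Dict, Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


def _instantiate(term: str, row: Dict[str, str],
                 bnodes: Dict[str, str], fresh: Iterator[int]) -> Optional[str]:
    if term.startswith("?"):
        return row.get(term[1:])           # None when the variable is unbound
    if term.startswith("_:"):
        if term not in bnodes:             # one new blank node per label, per row
            bnodes[term] = f"_:b{next(fresh)}"
        return bnodes[term]
    return term                            # explicit URI or literal


def construct(template: List[Triple], rows: List[Dict[str, str]]) -> Set[Triple]:
    fresh = count()
    graph: Set[Triple] = set()
    for row in rows:
        bnodes: Dict[str, str] = {}        # blank node labels are scoped to this row
        for pattern in template:
            s, p, o = (_instantiate(t, row, bnodes, fresh) for t in pattern)
            # A triple that would contain an unbound variable is left out.
            if s is not None and p is not None and o is not None:
                graph.add((s, p, o))
    return graph


TEMPLATE: List[Triple] = [
    ("?person", "foaf:name", "?name"),
    ("?person", "ex:contact", "_:c"),
    ("_:c", "vcard:email", "?email"),
]
ROWS = [
    {"person": "ex:fred", "name": "Fred", "email": "mailto:fred@example.com"},
    {"person": "ex:wilma", "name": "Wilma", "email": "mailto:wilma@example.com"},
]

graph = construct(TEMPLATE, ROWS)
```

Each three-variable row becomes three triples, and the blank node generated for a row ties that row's contact and email triples together.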

Regardless of the form of query, both SELECT and CONSTRUCT provide a representation of data that has been obtained through their common form of WHERE clause. There are some minor differences, such as the ability to generate blank nodes, but in general these queries can be used to answer the same sorts of questions. This is important to realize, as it means that responses to most queries can be represented as a constructible graph.

CONSTRUCT as a URI

Clients communicate with a SPARQL endpoint using the SPARQL Protocol. This describes how a client connects to an endpoint, how to send the query, and what the response will look like. The protocol specification describes both SOAP and HTTP bindings, but over time SOAP has become less important, and receives little consideration in the proposed SPARQL 1.1. The HTTP bindings use the most common web operations, providing a similar access pattern to the popular REST architecture, allowing many SPARQL endpoints to enhance their functionality with an adjunct REST API. This REST style of access has become important enough that SPARQL 1.1 will formalize the approach in a new document, the SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs.

Using HTTP to send a query to a SPARQL endpoint is relatively simple. The endpoint is represented with a URL, and all requests are made up of HTTP GET or POST methods to that location. The query itself is passed as a parameter named “query”. There are other parameters that can also be set, but this is all that is needed for a basic query.

To illustrate this, we can consider a SPARQL endpoint at the location http://example.com/sparql/, and a query we want to issue which asks to see all of the people in the default graph of a store. One query that will do this is:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?person WHERE { ?person a foaf:Person }

Once this query has been encoded for URLs it can be added to the end of the endpoint URL, providing the final URL of:

http://example.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0ASELECT+%3Fperson+WHERE+%7B+%3Fperson+a+foaf%3APerson+%7D
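The encoding step is mechanical, and can be sketched with nothing more than the Python standard library. The query_url helper is a name invented for this example; the endpoint is the example.com location used in this article, and no request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.com/sparql/"

QUERY = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "SELECT ?person WHERE { ?person a foaf:Person }"
)


def query_url(endpoint: str, query: str) -> str:
    # urlencode percent-encodes reserved characters and turns spaces into "+",
    # producing the form expected for the "query" parameter of a GET request.
    return endpoint + "?" + urlencode({"query": query})


url = query_url(ENDPOINT, QUERY)
print(url)
```

The resulting string could then be handed to any HTTP client, for example urllib.request.urlopen, to issue the GET request.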

While this URL may appear to be somewhat unwieldy, it can be used in an HTTP GET request just like any other URL. By default, the response to this should be an XML document containing the appropriate SPARQL result set. If there were just one person in the result, then it might look like this:

<?xml version="1.0" encoding="UTF-8"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="person"/>
</head>
<results>
<result>
<binding name="person">
<uri>http://example.com/people/Fred#me</uri>
</binding>
</result>
</results>
</sparql>
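A client will usually want to turn a result document like this back into rows of bindings. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the parse_bindings function is a hypothetical helper, and it keeps only the text of each bound value, ignoring details such as datatypes and language tags that a fuller client would preserve.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

NS = {"sr": "http://www.w3.org/2005/sparql-results#"}

RESULT_XML = """<?xml version="1.0" encoding="UTF-8"?>
<sparql xmlns="http://www.w3.org/2005/sparql-results#">
<head>
<variable name="person"/>
</head>
<results>
<result>
<binding name="person">
<uri>http://example.com/people/Fred#me</uri>
</binding>
</result>
</results>
</sparql>
"""


def parse_bindings(xml_text: str) -> List[Dict[str, str]]:
    root = ET.fromstring(xml_text.encode("utf-8"))
    rows = []
    for result in root.findall("sr:results/sr:result", NS):
        row = {}
        for binding in result.findall("sr:binding", NS):
            value = binding[0]  # the single <uri>, <literal>, or <bnode> child
            row[binding.get("name")] = value.text
        rows.append(row)
    return rows


rows = parse_bindings(RESULT_XML)
```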

Exactly the same information can be retrieved as a graph using a CONSTRUCT query. In this case the query is:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { ?person a foaf:Person } WHERE { ?person a foaf:Person }

Which leads to a request URL of:

http://example.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0ACONSTRUCT+%7B+%3Fperson+a+foaf%3APerson+%7D+WHERE+%7B+%3Fperson+a+foaf%3APerson+%7D

However, this time instead of getting back a result set, we see the information represented in an RDF document:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://example.com/people/Fred#me">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
</rdf:Description>
</rdf:RDF>
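To see that this document really does carry a triple, the sketch below pulls it out using only the Python standard library. It is deliberately limited: the flat_triples function is a hypothetical helper that understands only the flat rdf:Description, rdf:about, and rdf:resource shape shown here, and a real application would be better served by a proper RDF/XML parser.

```python
import xml.etree.ElementTree as ET
from typing import Set, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://example.com/people/Fred#me">
<rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person"/>
</rdf:Description>
</rdf:RDF>
"""


def flat_triples(rdf_xml: str) -> Set[Tuple[str, str, str]]:
    root = ET.fromstring(rdf_xml.encode("utf-8"))
    triples = set()
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local"; RDF/XML
            # concatenates namespace and local name to form the predicate URI.
            predicate = prop.tag[1:].replace("}", "", 1)
            obj = prop.get(f"{{{RDF_NS}}}resource")
            if subject is not None and obj is not None:
                triples.add((subject, predicate, obj))
    return triples


triples = flat_triples(RDF_XML)
```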

Gluing it Together

The interesting thing to note about the last query is that a single URI was provided which could be resolved to a location that in turn yielded an RDF graph. This is exactly the condition required for the common graph location extension described earlier. Therefore, any URL that describes a CONSTRUCT query against a SPARQL endpoint can be used as a graph URI in a SPARQL query. In other words, URIs of this form can be used in FROM clauses to specify the dataset for a query, or in a GRAPH modifier inside the WHERE clause.

To see this in action, we could ask an arbitrary SPARQL endpoint to return all the email addresses for people found at the example.com endpoint. A query that expresses this is:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
SELECT ?person ?email WHERE {
  ?person vcard:email ?email .
  GRAPH <http://example.com/sparql/?query=PREFIX+foaf%3A+%3Chttp%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2F%3E%0ACONSTRUCT+%7B+%3Fperson+a+foaf%3APerson+%7D+WHERE+%7B+%3Fperson+a+foaf%3APerson+%7D>
  { ?person a foaf:Person }
}

Admittedly, this is an ugly query that is not meant for human consumption. However, it is trivial for an application to generate a graph URI like this, and insert it into a query. Using automatic code generation, it is even possible to nest queries inside queries, addressing a new SPARQL endpoint at every step, though it is unlikely that you will ever want to do deep nesting like this. More likely is the possibility of referring to multiple endpoints in the main query.
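As a rough illustration of how an application might generate such a query, the Python sketch below encodes a CONSTRUCT query into a graph URI and splices it into the outer SELECT. The function names construct_graph_uri and federated_query are invented for this example, and the endpoint is the example.com location used throughout this article.

```python
from urllib.parse import urlencode

REMOTE_ENDPOINT = "http://example.com/sparql/"

REMOTE_CONSTRUCT = (
    "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
    "CONSTRUCT { ?person a foaf:Person } WHERE { ?person a foaf:Person }"
)


def construct_graph_uri(endpoint: str, construct_query: str) -> str:
    # Percent-encoding removes spaces, angle brackets, and braces, so the
    # result can be written between < and > as a graph URI.
    return endpoint + "?" + urlencode({"query": construct_query})


def federated_query(graph_uri: str) -> str:
    # The outer query joins local email addresses with the remote people graph.
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        "PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>\n"
        "SELECT ?person ?email WHERE {\n"
        "  ?person vcard:email ?email .\n"
        f"  GRAPH <{graph_uri}> {{ ?person a foaf:Person }}\n"
        "}"
    )


graph_uri = construct_graph_uri(REMOTE_ENDPOINT, REMOTE_CONSTRUCT)
query = federated_query(graph_uri)
print(query)
```

Nesting follows the same pattern: if the outer query were itself a CONSTRUCT, passing it through construct_graph_uri with a second endpoint would yield a graph URI that embeds the first.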

Building SPARQL queries that refer to other SPARQL endpoints in this way allows the initial endpoint to bring together information from lots of other sources, and merge them to form a single unified result. This is the definition of a federated query. There is also the possibility of each of the subordinate queries describing complex operations in their WHERE clauses, creating the potential for distribution of the processing work. Alternatively, a query can even refer back to the same SPARQL endpoint to create a kind of subquery. This last option is particularly useful when debugging complex queries before sending them out across the internet.

Federation

Not every SPARQL endpoint will support the remote graph retrieval extension. However, every compliant SPARQL endpoint will allow data to be fetched from it this way. So long as there is a single endpoint capable of this feature (and many are), it is possible to federate queries out to any group of endpoints.

Creating federated queries using this extension also has several other difficulties. Subordinate queries need to be manually written and encoded before they can be used in the main query. These queries also need to be CONSTRUCT queries, which may not be as flexible as a simple SELECT. Also, the structure of the result documents is unlikely to be optimal for network transfers. Add to this the need for a client to parse and process the resulting document into a graph representation, and it becomes readily apparent that this approach is not optimal.

Despite these difficulties, manual federation through graph retrieval is a viable option to federate queries using SPARQL 1.0. Once SPARQL 1.1 is available for general deployment, there will be better options available for both federating queries and embedding subqueries. Meanwhile, because manual federation is based on simple premises and standard web operations, any systems taking advantage of this technique can be assured that they will continue to work, even as better alternatives become available in the future.