Introduction
Semantic Universe has begun producing linked data for its Enterprise Data World and Semantic Technology Conferences. There were several motivations behind this effort.
First, there was a certain amount of eating their own dog food involved. As the producer of the leading conference series in the semantic technology space, it was important to highlight and contribute to the growing universe of publicly available linked data. Organizations of all types (commercial, governmental, non-profit) are encouraged by the community to share as much of their information publicly as they are comfortable doing. Not every data set will be of use to everyone, but, like the Web itself, having a wide diversity of information available is a big part of what will make the linked data clouds empowering. Data quality issues persist in many sources, and integration is not as easy as some imagine, but as more organizations provide global identifiers, data and metadata, there will be more examples to learn from and more opportunities to reuse existing URIs from different domains.
The next reason for producing this information is that they wanted to be able to use it themselves. Future conference sites will adopt RDFa to weave machine-processable information into the pages about speakers, talks, tracks, etc. With the information exposed as RDF alongside their sites, it will be much easier to slice and dice the schedules, tracks, speakers, organizations and other data in ways that will help overwhelmed attendees. Imagine tools to help build personalized schedules, exportable to your calendar, based on arbitrary views of the conference.
In order to manage a transition to this reality, they needed to start to have this information named and linkable in a global context. Rather than simply holding the information privately, they wanted to share it in the hopes that others would find the information useful as well. As a snapshot of the thought-leaders of an emerging community discussing itself, there are several interesting discoveries contained within this data.
Finally, this was intended as a proof-of-concept. There are several powerful software platforms and packages for producing and consuming large and dynamic RDF sets. As a relatively small dataset that does not change all that frequently, it seemed like an opportunity to highlight how easy it was to produce a usable, up-to-date, browsable and queryable linked data face to the world. In addition to serving as an example of RDF production, they also wanted to highlight the process of putting a RESTful API around this data. They asked me to help them produce these services. To underscore the learning potential, we will be blogging about the initial features and what we have learned, and as we add new features and uses, we will share those with the community as well.
The Data
To start with, there are three years of information available through existing internal XML feeds for both the Semantic Technology and Enterprise Data World conferences. The feeds expose public-facing information about the conferences that is held in relational databases. One of the first decisions was not to store the RDF separately if we didn't have to. This isn't a ton of information and it doesn't change frequently (at least after the conference sessions are chosen), but we did want it to be current. If records were changed in the internal database, we wanted the changes reflected in the linked data without any kind of export-and-load process.
A system like D2RQ would have been useful to convert the relational schemas into RDF, but we didn’t want to have to install anything on the backend machines given that the XML feeds were already in place. Otherwise, this (or something similar) would have been a good intermediate step.
Each conference instance (e.g. SemTech 2009) had its own feed and was self-contained. There were no global identifiers contained within the entries other than links to talk HTML pages, images, etc. We considered a formal mapping or naming scheme to manage things like speaker identity, but the community was small enough that there were few collisions. Speaker e-mail addresses were not part of the data feeds for privacy reasons, so we needed another mechanism for generating URIs for speakers. Where speakers appeared in multiple feeds, there was a good deal of consistency in how they were named, so a relatively simple name-based algorithm was chosen. Still, there were issues.
For example, Christine Connors of TriviumRLG, LLC was registered as “Christine J Manuel Connors” for SemTech 2008 and 2009 but as “Christine JM Connors” for SemTech 2010. Our algorithm converted those names to “christine-j-manuel-connors” and “christine-jm-connors” respectively. While we do plan to improve the co-reference handling over time, the exceptions are not numerous enough to warrant a big effort; some manual hackery and heuristics will probably be the first approach. Still, these issues will only get worse over time as we have more collisions (so far we have only noticed one), so we plan to allow user-driven feedback for owl:sameAs assertions in the future.
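To make the approach concrete, here is a minimal Python sketch of the kind of name-to-slug conversion described above. The function name and exact normalization rules are illustrative, not the production code:

import re
import unicodedata

def speaker_slug(name):
    # Normalize accented characters to plain ASCII where possible.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Lowercase and drop anything that is not a letter, digit, space or hyphen.
    cleaned = re.sub(r"[^a-z0-9\s-]", "", ascii_name.lower())
    # Collapse runs of spaces and hyphens into single hyphens.
    return re.sub(r"[\s-]+", "-", cleaned).strip("-")

print(speaker_slug("Christine J Manuel Connors"))  # christine-j-manuel-connors
print(speaker_slug("Christine JM Connors"))        # christine-jm-connors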
There were a few other identity issues we ran into, but nothing serious. If we discover more, we will address them as we go.
With that in mind, we designed a DBpedia-inspired naming scheme as follows:
The logical name for a resource would be:
http://data.semanticuniverse.com/resource/christine-jm-connors
the HTML view of her RDF data is available at:
http://data.semanticuniverse.com/page/christine-jm-connors
and the direct RDF feed is:
http://data.semanticuniverse.com/data/christine-jm-connors.rdf
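The routing behind those three URL forms is handled by the site itself, but a trivial Python sketch shows how a slug maps to each face of a resource (the helper names here are purely for illustration):

BASE = "http://data.semanticuniverse.com"

def resource_uri(slug):
    # The logical name for the thing itself.
    return BASE + "/resource/" + slug

def page_url(slug):
    # The human-readable HTML view of the same data.
    return BASE + "/page/" + slug

def data_url(slug):
    # The direct RDF feed.
    return BASE + "/data/" + slug + ".rdf"

print(resource_uri("christine-jm-connors"))
# http://data.semanticuniverse.com/resource/christine-jm-connors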
The conferences themselves are advertised at:
http://data.semanticuniverse.com
from which we see:
http://data.semanticuniverse.com/resource/semtech-2008
http://data.semanticuniverse.com/resource/semtech-2009
http://data.semanticuniverse.com/resource/semtech-2010
http://data.semanticuniverse.com/resource/edw-2008
http://data.semanticuniverse.com/resource/edw-2009
http://data.semanticuniverse.com/resource/edw-2010
Each of these resources is a largish and comprehensive view of the data contained within a given feed. This includes information about speakers, talks, tracks, organizations and rooms. With these as starting points, it is easy to navigate the datasets or query directly against the appropriate RDF endpoint using SPARQL.
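As a quick illustration of local exploration, here is a sketch using the rdflib Python library (one option among many) to pull down one of the conference feeds and see what vocabularies it uses. It assumes the feed is served as RDF/XML:

from rdflib import Graph

# Parse the SemTech 2010 feed straight off the web (RDF/XML assumed).
g = Graph()
g.parse("http://data.semanticuniverse.com/data/semtech-2010.rdf", format="xml")

print(len(g), "triples loaded")

# Print a few of the distinct predicates to get a feel for the vocabularies in use.
for predicate in sorted(set(g.predicates()))[:10]:
    print(predicate)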
Other resources are treated somewhat differently. When we seek information about a particular speaker, organization, track, room, etc. we look across all of the conferences. For example:
http://data.semanticuniverse.com/page/lee-feigenbaum
presents information about Lee and connects him to his organization, Cambridge Semantics, Inc., and to his talks from all three SemTech conferences.
Other sample resources include:
Talks: http://data.semanticuniverse.com/page/edw-2009-1365
Tracks: http://data.semanticuniverse.com/page/edw-2009-project-management
Rooms: http://data.semanticuniverse.com/page/hillsborough
Organizations: http://data.semanticuniverse.com/page/cleveland-clinic
While the SPARQL endpoint is not published yet, you can still run queries against the data:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX vcard: <http://www.w3.org/2001/vcard-rdf/3.0#>
PREFIX swc: <http://data.semanticweb.org/ns/swc/ontology#>
SELECT DISTINCT ?title ?name
FROM <http://data.semanticuniverse.com/data/semtech-2010.rdf>
WHERE {
  ?talk dc:creator ?speaker .
  ?talk dc:title ?title .
  ?speaker foaf:name ?name .
  ?talk swc:isPartOf <http://data.semanticuniverse.com/resource/semtech-2010-foundational-topics>
}
ORDER BY ASC(?title)
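Until the endpoint is published, one way to run a query like this is to load the feed into a local graph. Here is a sketch with the rdflib Python library; note that the FROM clause is dropped and the graph is parsed explicitly instead:

from rdflib import Graph

g = Graph()
g.parse("http://data.semanticuniverse.com/data/semtech-2010.rdf", format="xml")

query = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dc:   <http://purl.org/dc/elements/1.1/>
PREFIX swc:  <http://data.semanticweb.org/ns/swc/ontology#>
SELECT DISTINCT ?title ?name
WHERE {
  ?talk dc:creator ?speaker .
  ?talk dc:title ?title .
  ?speaker foaf:name ?name .
  ?talk swc:isPartOf <http://data.semanticuniverse.com/resource/semtech-2010-foundational-topics>
}
ORDER BY ASC(?title)
"""

# Each result row unpacks into the selected variables.
for title, name in g.query(query):
    print(title, "-", name)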
If you want the RDF back in a different form, you can use content negotiation against either the logical name or the data links. For example, with the curl command line tool:
curl -L -H "Accept: text/turtle" http://data.semanticuniverse.com/resource/lee-feigenbaum
curl -H "Accept: text/turtle" http://data.semanticuniverse.com/data/lee-feigenbaum.rdf
You’ll notice the need to tell curl to follow redirects on the first one.
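The same negotiation can be scripted. For instance, a sketch with the Python requests library, which follows the redirect on the resource URI automatically:

import requests

# requests follows the redirect from the logical resource URI by default,
# so there is no need for an equivalent of curl's -L flag.
resp = requests.get(
    "http://data.semanticuniverse.com/resource/lee-feigenbaum",
    headers={"Accept": "text/turtle"},
)
print(resp.status_code, resp.headers.get("Content-Type"))
print(resp.text[:500])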
In the second blog entry, we will explore the REST API and SPARQL endpoints for this linked data. In the third blog entry, we will explore a simple effort to use this data. In the fourth entry, we will discuss the NetKernel environment used to develop these capabilities.
Comments
additional properties
If possible, could you, in addition to the existing properties, expose the location of a speaker in the conference linked data set?