In the broader library world, Linked Open Data (LOD)1 has gained a lot of attention over the last two years, with projects moving increasingly from theory to practice. The library domain is increasingly turning to the technical and legal issues implied by this paradigm shift. The announcement of ‘A Bibliographic Framework for the Digital Age’ by the Library of Congress (Marcum & Library of Congress, 2011) and the Conference of European National Librarians’ affirmation of open licensing for their data (Conference of European National Librarians, 2011) are only two examples.
Since 2009 the North Rhine-Westphalian Library Service Center (hbz) has been exploring Linked Open Data and Semantic Web technologies, addressing both the legal and the technological aspects of this ongoing paradigm change in information representation and provision. The hbz launched its LOD service lobid.org (standing for “Linking Open Bibliographic Data”) in August 2010 and has been improving it continuously since then.
The overall goal is to develop the underlying software framework so that its read/write services can be run, including a web presentation of the underlying RDF data and online forms to create, update and delete the underlying information represented in RDF (Resource Description Framework). To keep track of changes and to be able to revoke them, the system should also fully version the underlying data structures. This paper explains the motivation for this work (section 2) and describes the LOD service lobid.org (section 3), as these developments were initiated to improve the service. Section 4 introduces the Fresnel Display Vocabulary for RDF, which serves as a generic way to configure the presentation of RDF data. Section 5 explains Phresnel, a free software framework for presenting and editing RDF data based on Fresnel and PHP. Finally, section 6 lists prospects for further developing Phresnel.
2. Motivation and expected benefits
The hbz has been in the business of cooperative cataloguing for some time now, running a union catalogue since 1973. From this perspective, Linked Open Data provides a very interesting approach for distributed cooperative cataloguing in a web environment.
2.1. Web integration
Adherence to international and cross-domain web standards for Linked Data means web integration of library data, from which the following benefits are expected in the long term:
- Increased discoverability. Web-integrated data can easily be harvested by search engines and other discovery services.
- Multiple usability. RDF data stored in one data sink can easily be used as such by different services within the hbz and beyond.
- Interoperability and re-usability. Web standards facilitate reuse by reducing the need for conversion processes and post-processing.
- Flexibility. RDF and Triple Stores are very flexible regarding extensions and changes in the data model used.
2.2. Synergy effects
By following the best practices of the Linked Open Data community and working towards a standardization of the data produced by the different services within the hbz, we expect a return on investment in the form of intra- and inter-organizational synergy effects. Within the organization, we already see some effects in hbz projects reusing each other’s data. With LOD, this is possible in a straightforward way, whereas having to deal with proprietary interfaces and different formats often makes it quite labor-intensive for services to communicate effectively. Thus, standardization of services frees resources which can then be used for additional services or for improving existing ones.
2.3. Fewer vendor dependencies
The provision of many library services depends on technology, and only a few vendors exist in the library business. Organizations often depend on products by one or two vendors, with significant costs for switching to a different vendor or product. In other words, the lock-in effect is very strong in the library domain.
The utilization of Linked Open Data best practices improves this situation in two ways:
- Being able to easily get hold of your data in an open, cross-domain standard enables you to switch to another product with less effort.
- With data being held in RDF rather than open but opaque formats like MARC, additional software of interest is likely to emerge from other vendors. Growing competition leads to improved products and/or decreasing prices.
In August 2010 the hbz launched its experimental Linked (Open) Data service lobid.org2, which comprises two services: a “catalogue” of bibliographic resources and holding information (lobid-resources3) and an index of libraries and related organizations (lobid-organizations4). lobid.org fully employs Linked Data principles as well as, whenever possible, Open Data principles.
Since 2010 the two lobid.org services and their underlying data have been continuously improved:
- Information is being extended by adding more fields from legacy data to the mapping and by revising vocabulary and property choices.
- Context is being added by linking resources to other Linked Data sets.
- Interaction options for end users are being improved, e.g. by a search engine interface and by aligning the user interface of both sub-services (see 5).
Figure 1 gives an overview of the currently employed technology stack, data sources and conversion processes. lobid.org employs Garlik’s 4store5 as a triple store and elasticsearch6 for full-text indexing and searching. The web front end runs on an Apache server and is generated by the Phresnel framework described in more detail below.
As one can see, lobid.org is currently almost entirely based on legacy data dumps from existing systems that are converted to RDF using custom tools. The resulting RDF data are enriched with links to other datasets in the LOD cloud. Some external LOD datasets are also indexed into the triple store: currently these are the ontologies used within lobid.org as well as the German national authority file Gemeinsame Normdatei (GND)7 provided by the German National Library. So far, there is no way to add or edit the RDF data manually, e.g. to add new information (commonly called ‘cataloguing’) or to correct mistakes.
3.1. lobid organisations
When the hbz started to publish Linked Open Data, it became clear that the bibliographic records from the hbz union catalogue would just be the start. If you want to build useful services on top of Linked Open Data, you also need URIs for and descriptions of items, holding institutions and services. For example, a geo-based query which gives you back all items of a specific manifestation in a 5 km radius requires URIs for and RDF descriptions of at least three entities: there is a manifestation M that is exemplified by an item I that is held by organisation O. In the RDF serialization Turtle, this reads as illustrated in Figure 2.
The corresponding graph looks as in Figure 3.
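The geo-based query described above can be sketched in a few lines of Python. This is only an illustration of why URIs for manifestations, items and organisations are all needed: the `ex:` identifiers, the property names `ex:exemplifiedBy` and `ex:heldBy`, and the coordinates are assumptions made for this example and are not taken from the lobid.org data or from Figure 2.

```python
import math

# Hypothetical triples: one manifestation, two items, two holding organisations.
TRIPLES = [
    ("ex:manifestationM", "ex:exemplifiedBy", "ex:itemI1"),
    ("ex:manifestationM", "ex:exemplifiedBy", "ex:itemI2"),
    ("ex:itemI1", "ex:heldBy", "ex:orgA"),
    ("ex:itemI2", "ex:heldBy", "ex:orgB"),
]

# Hypothetical (latitude, longitude) pairs for the organisations.
LOCATIONS = {
    "ex:orgA": (50.9340, 6.9360),
    "ex:orgB": (51.2277, 6.7735),
}


def objects(subject, predicate):
    """Return all objects of triples with the given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def items_nearby(manifestation, here, radius_km=5.0):
    """Follow manifestation -> item -> organisation and filter by distance."""
    hits = []
    for item in objects(manifestation, "ex:exemplifiedBy"):
        for org in objects(item, "ex:heldBy"):
            if haversine_km(here, LOCATIONS[org]) <= radius_km:
                hits.append((item, org))
    return hits


print(items_nearby("ex:manifestationM", (50.9375, 6.9603)))
```

In a real deployment the traversal and the distance filter would more likely be expressed as a query against the triple store; the in-memory list only stands in for it here.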
Two years ago, people and organizations were reluctant to act when asked to provide Linked Open Data, so it became clear that we had to do it ourselves. Thus, lobid.org was launched with lobid-organizations in July 2010 (Ostrowski, 2010).
The underlying data come from the German ISIL registry8 and the MARC organization code database9 maintained by the Library of Congress. By now, lobid.org has minted URIs for more than 40,000 institutions and provides basic RDF descriptions of them.10 Currently there is neither an openly available dump of the data nor an open license for it: since we did not produce the data ourselves, this is not ours to decide.
In addition to the data obtained from these sources, new links to other datasets in the LOD cloud are created. By now, links to DBpedia and Wikipedia (Christoph, 2012d) and to GeoNames (Christoph, 2012a) have been added. Furthermore, organization descriptions are enriched with a QR code which contains contact information (ibid.). Also, based on the geo coordinates available for most of the libraries, we show their location on an OpenStreetMap11 map embedded in the web page.
An example web page for the German National Library, generated from underlying RDF as described in 5.1, is illustrated in Figure 4.
3.2. lobid resources
lobid-resources is basically the LOD interface for Open Data from the hbz union catalogue. It offers URIs for and descriptions of bibliographic resources like monographs and multi-volume works on the FRBR manifestation level, as well as URIs for and descriptions of corresponding items held by hbz member libraries. Journals and serials that cannot be subsumed under the FRBR WEMI (work-expression-manifestation-item) model are also included (Pohl, 2011). Information on the FRBR expression level, and especially on the work level, is planned to be integrated in the future.
Since the first main open data publication in March 2010 (North Rhine-Westphalian Library Service Center, 2010), more and more data from this catalogue has gradually been published as open data in agreement with cooperating libraries. As of August 2012, the dataset comprises approximately 16 million records published under a Creative Commons Zero license13, which represents 85% of the hbz union catalogue (Christoph, 2012e). Using custom conversion tools, the data are generated based on an XML dump from the hbz Aleph system. The resulting RDF data can be queried via a public SPARQL endpoint14, and a full data dump is also available for download15.
Because identifiers from the German-wide authority file for names, subject headings and corporate entities already existed in the legacy data, links to the Linked Data version of the Gemeinsame Normdatei (GND, first published in 2010) were easy to implement.
Existing language encodings were replaced by links to the ISO 639-2 Codes for the Representation of Names of Languages provided by the Library of Congress.16
Also, links to other datasets that include bibliographic data were added step by step. Using simple matching algorithms for ISBN and title string in combination with some post-processing based on simple heuristics, links to DBpedia (Christoph, 2012b), Open Library (Christoph, 2012c) and Project Gutenberg were added to a subset of resources. These links provide a form of work-level bundling of resources, enabling for instance mutual enrichment of bundled resources with subject headings, links etc.
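A matching step based on ISBN and title string could look roughly like the following Python sketch. It is not the hbz implementation: the function names, the record layout (dictionaries with `isbn` and `title` keys) and the normalization rules are assumptions chosen for illustration. The only fixed part is the standard ISBN-13 check digit calculation used when converting an ISBN-10.

```python
import re


def normalize_isbn(raw):
    """Return the ISBN-13 form of an ISBN-10 or ISBN-13 string, or None."""
    chars = re.sub(r"[^0-9Xx]", "", raw).upper()
    if len(chars) == 13 and chars.isdigit():
        return chars
    if len(chars) == 10 and chars[:9].isdigit():
        # Drop the ISBN-10 check digit, prefix with 978, recompute the check digit.
        core = "978" + chars[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    return None


def normalize_title(title):
    """Lower-case the title, replace punctuation by spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s]", " ", title.lower()).split())


def is_match(local, external):
    """Two records match if both the normalized ISBN and the title agree."""
    isbn = normalize_isbn(local["isbn"])
    if isbn is None or isbn != normalize_isbn(external["isbn"]):
        return False
    return normalize_title(local["title"]) == normalize_title(external["title"])


# Hypothetical records; only the ISBN pair is a well-known valid example.
local_record = {"isbn": "0-306-40615-2", "title": "An Example Title"}
external_record = {"isbn": "9780306406157", "title": "an example  title."}
print(is_match(local_record, external_record))
```

Requiring both keys to agree trades recall for precision, which fits the description of simple matching followed by heuristic post-processing.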
In the future the hbz aims at enhancing the data even more by adding subject headings and classification as well as by providing more links to other datasets and to full texts online. A simple API will be developed to enable easy use by libraries that want to re-use these enrichments.
An example resource description is depicted in Figure 5.
4. Presenting RDF data using the Fresnel Display Vocabulary for RDF
In the beginning, the converted legacy data for bibliographic resources was exposed using Pubby, “a Linked Data Frontend for SPARQL Endpoints”18. While Pubby is very easy to set up, the views it generated, especially the human-readable HTML, exposed too much of the underlying technology. The organizations data, on the other hand, was presented using a custom SPARQL query, PHP scripts and some HTML templates in order to, e.g., include a map in the HTML view. This provided more flexibility but was not easily adapted to data other than that about organizations, since each content model needed a manually created query and corresponding template. Besides that, there was the idea to enable libraries to easily create RDFa19 descriptions of their organizations, which called for a simple, intuitive editor. Instead of exposing the underlying RDF model to content creators, the aim was a browser-based HTML form that provides a familiar environment for anybody acquainted with the Web.
With these requirements in mind, the search began for a schema language from which such a front-end could be derived. When dealing with RDF data, RDF Schema (RDFS) or the Web Ontology Language (OWL) are the first candidates that spring to mind. Since ontologies expressed in these languages are usually designed to be application-independent, experiments in this direction were rather fruitless, because the resulting views were too generic to fulfil the requirements. In particular, mixing classes and properties from several vocabularies in a concise and comprehensible way is nearly impossible. Fortunately, we came across the Fresnel Display Vocabulary for RDF. It is designed precisely to specify “what information contained in an RDF graph should be presented and how this information should be presented”20 without interfering with the underlying ontologies. Like the ontology languages mentioned above, it is itself based on RDF, making it possible to stay within one data model throughout the implementation.
Fresnel lenses address the first aspect mentioned above, namely which data should be displayed. A single lens can be related to instances in several ways, the simplest being a reference to their class (i.e. their rdf:type values), as demonstrated in Figure 6. For the selected instances, an ordered list of properties is supplied, which is easy to write and read in Turtle notation. To include data about a related entity in the output, another lens may be referred to. In the example in Figure 6, the author’s first and last name will be displayed in a document description, and not only the author’s URI, as would be the case when listing dc:creator without referring to such a :person sublens as is done on the right hand side.
Applying the above lenses to an (imaginary) triple store should yield the triples in Figure 7, which should be displayed in that order:
All in all, Fresnel lenses allow for a very concise and declarative way to express which data to select and which order to display it in.
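The selection and ordering behaviour of lenses can be mimicked with a small Python sketch. It is not Phresnel code and does not parse Fresnel itself: the lenses are written as plain dictionaries whose keys borrow the Fresnel terms `classLensDomain` and `showProperties`, a sublens is expressed as a `(property, lens name)` pair, and all resources and values are made up for illustration.

```python
RDF_TYPE = "rdf:type"

# Hypothetical data: one document and its author.
TRIPLES = [
    ("ex:doc1", RDF_TYPE, "foaf:Document"),
    ("ex:doc1", "dc:date", "2012"),
    ("ex:doc1", "dc:creator", "ex:alice"),
    ("ex:doc1", "dc:title", "An example document"),
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:alice", "foaf:familyName", "Example"),
    ("ex:alice", "foaf:givenName", "Alice"),
]

# Lenses as dictionaries; a tuple entry means "show this property via a sublens".
LENSES = {
    "document": {
        "classLensDomain": "foaf:Document",
        "showProperties": ["dc:title", ("dc:creator", "person")],
    },
    "person": {
        "classLensDomain": "foaf:Person",
        "showProperties": ["foaf:givenName", "foaf:familyName"],
    },
}


def instances(lens_name):
    """Resources the lens applies to, selected by their rdf:type."""
    cls = LENSES[lens_name]["classLensDomain"]
    return [s for s, p, o in TRIPLES if p == RDF_TYPE and o == cls]


def apply_lens(lens_name, resource):
    """Return the selected triples about `resource` in the order the lens gives."""
    selected = []
    for entry in LENSES[lens_name]["showProperties"]:
        prop, sublens = entry if isinstance(entry, tuple) else (entry, None)
        for s, p, o in TRIPLES:
            if s == resource and p == prop:
                selected.append((s, p, o))
                if sublens is not None:
                    # Descend into the related entity using the sublens.
                    selected.extend(apply_lens(sublens, o))
    return selected


for resource in instances("document"):
    for triple in apply_lens("document", resource):
        print(triple)
```

Properties not listed in a lens (here `dc:date` and `foaf:mbox`) are simply left out, and the output order follows the lens rather than the order in which the triples are stored.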
Fresnel formats deal with the second issue stated above: they express how the selected data should be displayed. Possibilities range from custom labels for properties that differ from the labels defined in an ontology, to styling hooks used to reference CSS classes. An example can be found in Figure 8.
Just as Fresnel lenses select and order data, the format vocabulary makes it possible to configure how that data is displayed in a declarative, application-independent way.
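How a format definition could drive the rendering is sketched below in Python. The dictionary keys `label` and `propertyStyle` are named after the corresponding Fresnel terms, but the table-row markup, the function name and the sample values are assumptions for this example and do not reproduce the output of Phresnel.

```python
from html import escape

# Hypothetical formats: a custom label and a CSS class per property.
FORMATS = {
    "dc:title": {"label": "Title", "propertyStyle": "title"},
    "dc:creator": {"label": "Author", "propertyStyle": "creator"},
}


def render_row(prop, value):
    """Render one property/value pair as a table row, applying the format if any."""
    fmt = FORMATS.get(prop, {})
    label = fmt.get("label", prop)          # fall back to the property name
    css = fmt.get("propertyStyle")          # styling hook, may be absent
    cls = f' class="{escape(css)}"' if css else ""
    return f"<tr{cls}><th>{escape(label)}</th><td>{escape(value)}</td></tr>"


print(render_row("dc:title", "Cats & dogs"))
print(render_row("dc:date", "2012"))
```

A property without a format entry is still displayed, just with its raw name and without a styling hook, so formats remain optional refinements on top of the lenses.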
5. Phresnel – Implementing Fresnel in PHP
Several implementations of Fresnel already existed21 when the development of Phresnel began. Most of them are written in Java and none is implemented in PHP, upon which the pre-Phresnel version of lobid.org was based. Stand-alone applications supporting Fresnel, such as IsaViz22, do not fulfil our requirement of providing a classical browser-based user interface. JFresnel23, a low-level Fresnel API, is an interesting library for implementing Java-based Fresnel-aware applications, but it does not deliver any application logic. Reusing our existing PHP web-application code, such as request dispatching and URL routing, would not have been possible, so we decided against using this library. Longwell24 is geared towards faceted browsing, which is indeed an important aspect of a system such as lobid.org. LENA25 is yet another Linked Data viewer, but neither of these solutions provides the means to alter data. Also, in both cases the demos are offline and development appears to have stalled. There are further Linked Data front-ends for SPARQL endpoints, such as Pubby26 and Elda27, but these do not use Fresnel and once again do not provide editing capabilities.
Thus, a new, PHP-based editing-aware framework dubbed ‘Phresnel’28 was implemented as a proof-of-concept. Currently only a small subset of lens and format features is implemented in Phresnel, limited to those absolutely necessary to get the prototype up and running.
In order to display data according to a Fresnel lens, the web application detects the lens to be used from the URL to which a GET request was issued, e.g. “document”. It then uses the Phresnel framework to generate a generic (X)HTML view (with embedded RDFa) of the requested data as shown in Figure 9. At this point it is assumed that an HTTP-303-redirect following the linked-data design pattern29 has already occurred in a previous step.
Internally, Phresnel uses the lens definitions to generate a series of SPARQL CONSTRUCT queries such as those depicted in Figure 10 and then uses the lens and format definitions to order and style the resulting RDF.
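The idea of deriving CONSTRUCT queries from a lens can be sketched as follows. This Python snippet only illustrates the principle with one query per listed property. The function name and the example URIs are assumptions, and the queries actually generated by Phresnel (Figure 10) may be structured differently.

```python
def construct_queries(resource_uri, properties):
    """Build one SPARQL CONSTRUCT query per property listed in a lens."""
    queries = []
    for prop in properties:
        # The same triple pattern serves as template and as WHERE clause.
        pattern = f"<{resource_uri}> <{prop}> ?o"
        queries.append(f"CONSTRUCT {{ {pattern} }} WHERE {{ {pattern} }}")
    return queries


# Hypothetical resource URI; the property URIs are Dublin Core terms.
for query in construct_queries(
    "http://example.org/doc1",
    ["http://purl.org/dc/terms/title", "http://purl.org/dc/terms/creator"],
):
    print(query)
```

Because each query returns RDF, the results can be merged into one graph and then ordered and styled according to the lens and format definitions.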
Currently, only a hard-coded box model (Bizer, Lee, & Pietriga, 2005) based on nested tables is available. The ordering and transformation to (X)HTML can obviously be skipped when a pure RDF representation is requested via content-negotiation.
When assembling an editing view of a resource, the steps are very similar to those for a simple display representation, as can be seen in Figure 11. There is one major difference, however: when the RDF resulting from the SPARQL queries is transformed to an (X)HTML form, all literals are simply converted to text input elements. Unfortunately, this results in an awkward interface in those cases where links to other entities are expected, since the URIs of those entities would have to be looked up and typed in manually. This is unacceptable from a usability point of view.
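The naive transformation described above, and the usability problem it causes, can be made concrete with a short Python sketch. The markup, the function name and the sample data are assumptions for illustration and are not the form that Phresnel generates.

```python
from html import escape


def edit_form(resource_uri, pairs):
    """Turn every property/value pair into a plain text input element."""
    rows = []
    for prop, value in pairs:
        rows.append(
            f'<label>{escape(prop)} '
            f'<input type="text" name="{escape(prop)}" value="{escape(value)}"></label>'
        )
    return f'<form method="post" action="{escape(resource_uri)}">' + "".join(rows) + "</form>"


# Hypothetical data: the creator is a link to another entity, not a literal.
form = edit_form(
    "http://example.org/doc1",
    [("dc:title", "An example document"), ("dc:creator", "http://example.org/alice")],
)
print(form)
```

The second input shows the issue: an editor who wants to change the creator has to know and type the URI of the person, where a lookup by name would be expected.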
While the Linked Data paradigm is great for navigation, search is vital to actually discover resources. Although SPARQL supports regular expressions30 that can be used for search, this currently does not scale to the data volume of lobid.org. This is why the search for organizations31 as well as the one for resources32 on lobid.org is currently backed by an elasticsearch33 index, accessed via a custom web application that provides a CQL interface. This concrete setup has historical reasons: the resource index existed long before lobid.org and is used by several hbz services, and it was easiest to simply integrate the organization index into the same infrastructure.
Due to this construction, search results, which are received as Atom feeds, have a data structure that does not match the Fresnel lens definitions driving lobid.org. Hence, currently only the identifiers are extracted from the search result. They are used to construct the URI of the discovered resource, and Phresnel then retrieves and formats the data as described above. This works fairly well and is much more efficient than using native SPARQL queries for full-text search, but it suffers from the fact that after the resource has been identified, the data about it has to be retrieved again, this time from the triple store. This is only one of the problems that will have to be tackled next.
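Extracting identifiers from an Atom feed and turning them into resource URIs takes only a few lines with Python's standard library. The feed content, the identifier format and the base URI in this sketch are made up, since the actual structure of the search results is not shown here.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
BASE_URI = "http://example.org/resource/"  # hypothetical URI prefix

# Hypothetical search result with two hits.
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Search results</title>
  <entry><id>id-0001</id><title>First hit</title></entry>
  <entry><id>id-0002</id><title>Second hit</title></entry>
</feed>"""


def resource_uris(feed_xml):
    """Keep only the identifier of each entry and build a resource URI from it."""
    root = ET.fromstring(feed_xml)
    return [
        BASE_URI + entry.findtext(ATOM + "id").strip()
        for entry in root.iter(ATOM + "entry")
    ]


print(resource_uris(FEED))
```

Everything else in the feed (titles and any further fields) is discarded, which is exactly why the data has to be fetched a second time from the triple store.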
The current proof-of-concept implementation for a read/write system for LOD-based library data uncovers interesting prospects for future data management. It is clear though that the efforts are still very much at the beginning. Further Phresnel developments will explore options in the following areas.
6.2. Data production and maintenance
The results of editing data in the web front-end will have to become persistent. There are several non-trivial decisions that have to be made in this respect: in how many places should the data be stored (triple store, flat files, search index), how should the data be organized (named graphs (Dodds, 2009), …), which provenance should be recorded, which authorization system should be used etc. Since an application like the one described in this paper naturally lives in a networked, decentralized environment, thought will also have to be put into a solution to inform other connected services about creation, updates and deletions of data.35 One idea is to use a real-time, message based protocol such as IRC or XMPP.36
Another important feature regarding data management is the identification and authentication of the agents acting upon the data. Instead of implementing such a system from scratch, it should be based upon a standard and ideally also on Linked Data principles. Because of this, the most likely approach to be used is WebID (Sporny, Inkster, Story, Harbulot, & Bachmann-Gmür, 2011), which uses FOAF descriptions of agents in conjunction with SSL certificates. This results in a secure, distributed identification and authentication mechanism that is reasonably easy to use for humans as well as for machines.
Among the most important provenance information is a seamless history of changes made to the data, along with the identification of the agent (be it a system or a person) that is responsible for these changes. While it is possible to express changes to RDF data as change-sets using an RDF37 vocabulary, it is very likely that this is not the most efficient way to store them, since the triple count in a store would explode (at least if it is the same store that holds the actual data). Alternatives under consideration are a versioning system that operates on flat files, such as git38, or the versioning features of elasticsearch, which can be considered not only a search engine but also a document store. In order to expose the different versions of the data in a standardized way, implementing a Memento39 interface for the selected versioning system is being considered.
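The notion of a change-set, i.e. the triples removed and added by one edit together with the responsible agent, can be illustrated with plain Python sets. This is a conceptual sketch only: the dictionary layout and the function names are assumptions, and it does not use an actual RDF change-set vocabulary.

```python
def change_set(before, after, agent, reason):
    """Describe an edit as removed and added triples plus provenance."""
    before, after = set(before), set(after)
    return {
        "agent": agent,
        "reason": reason,
        "removals": sorted(before - after),
        "additions": sorted(after - before),
    }


def apply_change_set(triples, cs):
    """Apply a change-set to a set of triples."""
    return (set(triples) - set(cs["removals"])) | set(cs["additions"])


def revert(triples, cs):
    """Undo a change-set by swapping additions and removals."""
    return (set(triples) - set(cs["additions"])) | set(cs["removals"])


# Hypothetical edit: a typo in a title is corrected.
old = {("ex:doc1", "dc:title", "An exmaple document"), ("ex:doc1", "dc:date", "2012")}
new = {("ex:doc1", "dc:title", "An example document"), ("ex:doc1", "dc:date", "2012")}
cs = change_set(old, new, agent="ex:alice", reason="fix typo")
print(cs["removals"], cs["additions"])
```

Each edit stores only the triples that actually changed, yet keeping every change-set as RDF in the same store would still add several triples per edit, which is the growth problem mentioned above.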
6.4. JSON-LD in ES / Fresnel-based search engine indexing
The way in which search is currently tied into the application is not very generic, and it depends on external organizational and technical processes. Since look-up is a very important part of the system, a solution that ties in more naturally is preferred. Elasticsearch being schema-less should play nicely with RDF data. Since it indexes JSON data, an evaluation of several JSON-RDF-serialisations has begun. The most promising approach seems to be JSON-LD40, on the one hand because it is the one most likely to become a standard, and on the other hand because it structures data in a way that matches the key-value approach that elasticsearch is based upon. The Fresnel lenses driving the front-end could be reused to generate the JSON-structures for the index.
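How RDF data about one resource maps onto a key-value JSON-LD document suitable for indexing can be sketched as follows. The sketch is written under stated assumptions: the function name, the four-element tuple layout (with a flag marking URI objects) and the example data are illustrative, and only a small part of JSON-LD (`@context`, `@id`, term-to-IRI mappings) is used.

```python
import json


def to_jsonld(subject, triples, context):
    """Collect the triples about `subject` into one JSON-LD style document."""
    doc = {"@context": context, "@id": subject}
    for s, p, o, is_uri in triples:
        if s != subject:
            continue
        # Use the short term from the context if one maps to this property IRI.
        term = next((t for t, iri in context.items() if iri == p), p)
        value = {"@id": o} if is_uri else o
        if term in doc:
            existing = doc[term] if isinstance(doc[term], list) else [doc[term]]
            doc[term] = existing + [value]
        else:
            doc[term] = value
    return doc


# Hypothetical context and data.
CONTEXT = {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator",
}
DATA = [
    ("http://example.org/doc1", "http://purl.org/dc/terms/title", "An example document", False),
    ("http://example.org/doc1", "http://purl.org/dc/terms/creator", "http://example.org/alice", True),
    ("http://example.org/doc2", "http://purl.org/dc/terms/title", "Another document", False),
]

doc = to_jsonld("http://example.org/doc1", DATA, CONTEXT)
print(json.dumps(doc, indent=2))
```

A lens could supply both the list of properties to include and the order of the keys, so that the indexed document mirrors what the front-end displays.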