Results of scientific research are nowadays published digitally as a rule. It is almost inconceivable that a scientist would not present the results of his or her work in digital form, even if a physical copy is delivered by a publisher in parallel. This is certainly the case in Science, Technology and Medicine, the research fields that produce over 80% of all scientific publications. Traditionally, academic libraries have contributed to keeping the output of science through time. For deposit libraries - mostly the national libraries - maintaining the Record of Science is even a key task. The maintenance of collections by libraries comes under pressure as more publications become digital. One reason is the change of policy of publishers, who increasingly license rather than sell their publications, offering them through sophisticated search and retrieval services. Another reason is that handling and maintaining digital publications requires new skills and a different infrastructure than printed publications do. Complicating matters further, great research and development efforts are required, as no best practices for these issues are yet available.
IBM developed the deposit system on site at the KB premises, and delivered it to the KB in October 2002. Meanwhile the library had created a workflow for electronic publications and had designed interfaces to the catalogue and other digital library functions. The KB deposit system initially has a storage capacity of 12 terabytes and is scalable to over 500 terabytes. The system was built as far as possible from off-the-shelf components, such as DB2, TSM, WebSphere and Content Manager. As the system is a generic archival system, IBM has branded it the Digital Information Archiving System, or DIAS, and will maintain it as a product. Using this product, the KB runs the service branded as e-Depot. In parallel with the development of the system, the KB and IBM have jointly studied and tested long-term preservation issues. The e-Depot will be extended further, both technically and functionally, including the realisation of specific preservation technologies.
The e-Depot offers both the storage facility and the functionality for digital preservation. The KB has developed the workflow for archiving electronic publications and has realised the other parts of the infrastructure in which the deposit system is embedded. This infrastructure covers a variety of functions: accepting and pre-processing electronic publications, generating and resolving identifier numbers, searching and retrieving publications, and identifying, authenticating and authorising users.
The stage of processing and ingesting the digital content is called loading. Two types of electronic publications are stored in the e-Depot: offline media such as CD-ROMs, also referred to as installables, and online media such as the high volume of electronic articles sent by publishers. Ingest of an installable is a time-consuming process that is performed manually. First, the CD-ROM is installed completely, including all additionally needed software such as viewers or media players. All files from the CD-ROM are subsequently copied to a Reference Workstation (RWS), so that the CD-ROM can be viewed stand-alone. A snapshot of the entire workstation, including its operating system, is then captured as an image. After the bibliographic description has been created manually, the image is loaded into the e-Depot. If a customer wants to view a particular CD-ROM, the entire image is retrieved from the e-Depot and installed on a dedicated RWS. By including the operating system in the stored package, the CD-ROM is guaranteed to work - also in future conditions involving new operating systems.
The second type of publication currently processed is online media. These publications are either sent to the KB on tapes (for processing the backfiles) or by means of FTP. In both cases, publications ready for ingest end up in an electronic post office in which they are validated. At this stage the content of the submission is checked for well-formedness, based on specifications agreed upon earlier. If the checksum does not match, or if other errors occur, the content is passed to a database for error recovery (BER). Inspection of this database is currently the only manual effort involved. If the content proves to be valid, content and metadata are put together as a Publisher Submission Package (PSP), and this PSP is then processed by a part of DIAS called the batch builder. The batch builder itself consists of a series of applications, such as Content Manager and Tivoli Storage Manager.
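The validation step described above can be illustrated with a minimal sketch. The function and queue names below are hypothetical, and a simple MD5 comparison stands in for the checks actually agreed with publishers; the real electronic post office performs far richer well-formedness checks.

```python
import hashlib
from pathlib import Path

def validate_submission(content_path: Path, expected_checksum: str) -> bool:
    """Compare the file's MD5 digest with the checksum supplied by the
    publisher (illustrative; the actual agreed checks are more extensive)."""
    digest = hashlib.md5(content_path.read_bytes()).hexdigest()
    return digest == expected_checksum

def route_submission(content_path: Path, expected_checksum: str,
                     psp_queue: list, error_queue: list) -> None:
    # Valid content goes on to be bundled into a Publisher Submission
    # Package (PSP); failures are passed to the error-recovery database
    # (BER) for manual inspection.
    if validate_submission(content_path, expected_checksum):
        psp_queue.append(content_path)
    else:
        error_queue.append(content_path)
```

The point of the sketch is the routing decision: everything that fails validation is diverted for manual recovery, so the automated pipeline only ever handles material that matches the agreed specification.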
The batch builder ingests both the content and the metadata and converts the bibliographic descriptions from the publisher into the KB's internal format, adding a National Bibliographic Number (NBN). After conversion the content itself is stored in the e-Depot, while the metadata is stored in the KB catalogue. Clients may query the online catalogue and retrieve the full text of the publications - where the publisher imposes restrictions, only after a process of identification, authentication and authorisation (IAA). The e-Depot itself cannot be accessed directly, but passes the relevant documents to the client after this clearance. See Figure 1 for a complete overview of the data flow.
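The conversion step can be sketched as follows. The record fields and the NBN syntax shown here are assumptions for illustration only; the publishers' actual schemas vary per supplier, and the KB's real NBN scheme and internal format are not specified in this article.

```python
from dataclasses import dataclass

@dataclass
class PublisherRecord:
    # Hypothetical publisher-side fields; real schemas differ per supplier.
    title: str
    authors: list
    issn: str

def to_kb_record(rec: PublisherRecord, sequence: int) -> dict:
    """Convert a publisher's bibliographic description into a simplified
    internal format and attach a National Bibliographic Number (NBN).
    The urn:nbn syntax below is illustrative, not the KB's actual scheme."""
    nbn = f"urn:nbn:nl:kb-{sequence:08d}"
    return {
        "nbn": nbn,
        "title": rec.title,
        "creators": "; ".join(rec.authors),
        "issn": rec.issn,
    }
```

The essential idea is the separation of concerns the paragraph describes: the converted metadata (with its NBN) feeds the catalogue, while the content object itself is handed to the archive under the same identifier.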
Conceptually, there is little difference between manual loading and fully automatic loading. The process of loading and storing is performed by DIAS. The DIAS solution provides a flexible and scalable open deposit library solution for storing and retrieving massive amounts of electronic documents and multimedia files. It conforms to the ISO OAIS reference model and supports both physical and logical digital preservation. Once an asset is successfully stored, it is maintained and preserved. Stored assets can be accessed either via a web-based interface (for assets with standard file types) or via a specific work environment on a Reference Workstation.
At the same rate at which our world is becoming digital, our information is threatened. New types of hardware, computer applications and file formats supersede each other, making digital information inaccessible. Even if the hardware or the carrier media do not deteriorate within the time frame considered, the technology to access the information will inevitably become obsolete. Information technology develops constantly and rapidly, offering us new and appealing applications while at the same time making existing hardware and software obsolete. Preservation, or permanent availability, of the record of science is one of the processes dramatically affected by the change to an all-digital world. In recent years, the digital preservation challenge has also been recognised by people outside the traditional memory institutions (libraries, museums, etc.). The problem is broad and society as a whole has to deal with it, which is why digital preservation is so widely discussed today.
The problem of digital preservation has two aspects:
- preserving the (formatted) bit stream, also called the 'digital object';
- ensuring accessibility over time to the information embedded in the digital object.
For the problem of preserving digital objects, the European project NEDLIB has consolidated a wide range of internationally acquired research results in the 'Guidelines for setting up a Deposit System for Electronic Publications'. A strategic choice by NEDLIB was to transfer the electronic publications from the publishing environment to an archiving environment, nowadays often referred to as a 'Safe Place'. Note that the word 'place' in 'Safe Place' should not be taken literally, but rather as a concept, indicating an institution that is committed to digital preservation and possesses the appropriate infrastructure, resources and skills for the task.
The essence of the data preservation concept is to extract the data from the format in which it was published, in order to render it in a different IT environment, now and in the future. To describe the successive components of the current IT environment needed to render an object today, the KB makes use of view paths. View paths are instantiations of an abstract model called the Preservation Layer Model (PLM).
To manage all file types and their corresponding view paths, the KB and IBM have jointly developed the LTP Preservation Manager, which enables the KB to address the management of long-term preservation aspects of the stored digital items. It keeps track of stored file formats, manages the technical metadata and signals endangerment of rendering functionality. If, for example, Windows 95 appears to become obsolete, this can easily be specified in the LTP Preservation Manager, which then automatically determines the view paths that are Windows 95-dependent and marks them invalid. If a stored document, due to obsolescence of any software or hardware, is in danger of no longer being viewable, actions have to be performed to 'save' the document (such as conversion, migration or emulation). Implementations of these actions are currently being developed by the KB and IBM, resulting in a prototype of what is called the Universal Virtual Computer (UVC). This system is expected to render digital items based on a logical data view, independent of any future software or hardware environment. For more information about the development of the UVC the reader is referred to the KB and IBM websites.
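The view-path mechanism described above can be sketched as a small data structure. The layer names and the flagging logic below are a much simplified assumption of how the LTP Preservation Manager might model the PLM; the real system's schema is not documented in this article.

```python
from dataclasses import dataclass

@dataclass
class ViewPath:
    """One instantiation of the Preservation Layer Model: the chain of
    components needed to render a stored file format today (layer names
    are illustrative)."""
    file_format: str       # e.g. "PDF 1.3"
    viewer: str            # application layer
    operating_system: str  # operating-system layer
    hardware: str          # hardware/platform layer
    valid: bool = True

def mark_obsolete(paths: list, component: str) -> list:
    """Invalidate every view path that depends on an obsolete component,
    mimicking (in a simplified way) the LTP Preservation Manager's
    signalling; returns the paths that were flagged."""
    flagged = []
    for p in paths:
        if component in (p.viewer, p.operating_system, p.hardware):
            p.valid = False
            flagged.append(p)
    return flagged
```

Declaring, say, Windows 95 obsolete then flags exactly those view paths that pass through it, which is the signal that conversion, migration or emulation actions are needed for the affected documents.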
Implementing the LTP Preservation Subsystem is the first step towards long-term preservation; guaranteeing the rendering of the stored information is the second. In addition to the safekeeping of the digital objects, access to the information in those objects has to be provided permanently. Tools, techniques and procedures are needed to provide access to the stored objects now as well as in the future. Research on tools and procedures for permanent access has started, but is still in its infancy. IT companies have only recently become aware of the problem of the relatively short-term accessibility of digital objects. The standardised archival system developed by the KB and IBM is designed to preserve and control digital information for the long term. Still to be resolved, however, is how to guarantee permanent access to the stored information: how can we render digital information for users in the future? The problem of permanent access has up to now been addressed by several scattered, small-scale initiatives. To accelerate this development, national libraries, archives, universities, research institutions and IT companies should collaborate to create tools for permanent access. In this context, the KB and IBM are actively seeking to set up new consortia to join forces for further research and development on this important topic.
Deposit of Dutch Electronic Publications, Koninklijke Bibliotheek. http://www.kb.nl/kb/menu/ken-arch-en.html
Digital Information Archiving System (DIAS): http://www.ibm.com/nl/dias/
Dutch Publishers Association = Nederlands Uitgeversverbond. http://www.nuv.nl/
Koninklijke Bibliotheek, National Library of the Netherlands. http://www.kb.nl/
Reference Model for an Open Archival Information System (OAIS)
LIBER Quarterly, Volume 13 (2003), No. 3/4