In 2000 the Deutsche Literaturarchiv, Marbach, acquired the archive of Thomas Strittmatter (1961–1995), which includes an Atari computer and 43 floppy disks (Kramski & Von Bülow, 2011, p. 145; Von Bülow, 2003). For appraisal and cataloguing this seemingly insignificant detail proved problematic: because of the obsolete hardware and software,1 the digital data were inaccessible. Computer and floppy disks had to be made readable again by the archive’s IT experts. This succeeded only partially: all efforts to access the Atari computer failed, and its data were irretrievably lost. The time interval between the active use of the hardware and the accession had apparently contributed to these difficulties, although in this particular case the elapsed time was only about 10 years.2 The relatively short durability of digital media and of the information encoded on them contrasts with the durability of analogue counterparts such as paper. Goethe’s personal archive, for example, was made accessible to the public after more than 50 years without any problem. In the case of Strittmatter’s archive, such a period would have led to the loss of much of the stored data.
This example illustrates a major difference between analogue media – like paper – and digital media. Both can be destroyed or damaged (e.g., by fire, water or improper storage); however, the potential loss associated with digital media is much higher, as not only the state of the media determines the readability of the information stored on it, but also the configuration of various software and hardware components. If even one of these components fails, the stored information can be permanently lost. This is the case, for example, when a data format is obsolete or when hardware is damaged and the defective component cannot be replaced.3 Minimizing these risks is the task of digital preservation, which faces one unknown: the creators of personal archives.
The present paper is a revised version of considerations the author published in German (Weisbrod, 2014). It makes the case that creators may be included in preservation processes by using cloud technology. This will be explained below using the example of special collections that collect literary archives and manuscripts. The argument is in principle applicable to all collecting institutions.
2. Background: Why are Digital Papers a Particular Challenge for Digital Preservation?
Papers differ fundamentally from all other materials in collecting institutions, since they are created neither according to pre-established rules – as in a registry – nor as publications such as books, newspapers and magazines. Their structure and content arise rather from the personality and the personal attitude of their creators – in the case of literary archives: the writers. The writers’ working practices and their discipline regarding the documents they manage include a whole range of options that determine the uniqueness of each individual archive.4 Although other factors matter as well, the reference to personality may be enough to make the “individual” and “unique” character of papers clear.
The characteristics of personal papers are not unknown and had already been considered in pre-digital times (Dachs, 1965, p. 81; Schmidt, 1965, p. 74; Von Harnack, 1947, pp. 262–263). Goethe, for instance, characterised his personal papers as “manifold” and “complicated” (Von Goethe & Beutler, 1950, p. 737). In a digital environment this variety is multiplied. What hitherto existed tangibly on paper now consists of intangible bits and bytes, is stored on volatile media, and is retrievable only by means of suitable hardware and software. In other words: the variety of (paper) documents is joined by the diversity of media, formats and computer systems. In addition, the computer skills of a writer influence the composition and the degree of preservation of his personal digital documents (Williams, Dean, Rowlands, & John, 2008).
For several years the problem has been intensified by a further development: due to the growing use of online media and especially the cloud, personal archives are no longer located only on local devices, but also on various servers somewhere on the Web (BITKOM, 2013).5 This means that at the moment of acquisition by a special collection, texts, emails, pictures and other digital objects of a writer’s archive can be scattered over various social networks and other web services. The question is: how can a special collection identify and acquire the part of a personal archive stored online if it is not documented, if passwords are missing, or if the provider of an online service denies access?
If special collections (e.g., manuscript collections) want to make the phenomena described above manageable for digital preservation, they need to develop a preservation strategy that matches the characteristics of digital papers. It is the assumption of this article that collections need to expand their “custodial” to a “pre-custodial” view, and that it is necessary to pay more attention to the period before the acquisition of papers – and therefore to the writers and their lifelong, evolving personal archives.6 Writers should no longer be reduced to their role as donors, but should contribute their own part to digital preservation. Based on this understanding, and considering the view of the writers, adequate solutions can be developed inductively. The following shows that the cloud provides an adequate architecture for such an approach.
3. Related Work: IT-Supported Self-archiving
A look at the relevant research literature shows that research on personal archives and digital papers has mainly been carried out in English-speaking countries. Examples are the British projects “Paradigm” (Paradigm project, 2007) and “Digital Lives” (John, Rowlands, Williams, & Dean, 2010) or the presentation of Salman Rushdie’s digital papers in the United States (Carroll, Farr, Hornsby, & Ranker, 2011). Meanwhile, a few articles have been published elsewhere. Of particular interest are the Danish project “MyArchive” (The Royal Library – National library of Denmark and Copenhagen University Library, 2014) and a publication by the German librarian Anke Hertling (Hertling, 2012).
To answer the question of what could be an attractive solution both for special collections and for writers, the author of this article would like to point to the British research project “Paradigm”. The project was carried out by Oxford’s Bodleian Library in cooperation with the John Rylands University Library in Manchester between 2005 and 2007. Using the example of politicians’ archives, the project team researched the impact of digital media on the work of archivists and librarians. One result of Paradigm was a new perspective on “Collection Development”. In addition to the traditional approach, in which the creators or their heirs leave archives to special collections, pre-custodial approaches were developed. These approaches include (Paradigm project, 2007, pp. 10–16):7
- Regular snapshot captures of the creators’ digital data by preservation specialists (e.g., manuscript curators); the snapshot captures are transferred directly into a managed digital repository.
- The periodic transfer of data via retired hardware and media to a special collection.
- The post-custodial approach (The creators themselves maintain their digital materials, supervised by preservation specialists.)
- IT-supported self-archiving.
The latter approach – applied to special collections collecting literary archives and manuscripts – describes IT solutions that give writers the opportunity to transfer digital objects (e.g., work manuscripts, unfinished work stages, e-mails) from their personal archives into the digital repository of a special collection, or to manage them directly in such a repository. In this way the writers themselves take care of archiving their digital papers. The special collection (or an authorised institution) runs the necessary IT environment and assumes responsibility for the digital preservation of the writers’ digital objects stored in such a self-archiving system. The approach of self-archiving is open to various stages of development, ranging from the use of existing communication channels to a differentiated cloud solution.
In 2010 the Danish National Library launched a simple variant, which could be described as self-archiving by e-mail. In a test phase, six scientists were given an e-mail account at the National Library in order to forward e-mails or documents (as attachments) they considered important for their life and work. The forwarded objects then remained in an environment managed by the library’s IT specialists. The Copenhagen solution shows how simple self-archiving can be, since the National Library’s existing mail server was used to create a self-archiving environment (Hertling, 2012, pp. 8–9). Under the name “MyArchive”, this solution has meanwhile been opened for regular use (The Royal Library – National library of Denmark and Copenhagen University Library, 2014).
In 2012 the German librarian Anke Hertling proposed an advanced self-archiving solution. Based on a critical analysis of the Copenhagen solution, Hertling developed the model of a “digital Vorlass System” (digital papers archiving system). She rightly stated that the use of e-mails could be simply a workaround. This solution was not appropriate for large amounts of data, because the attachments had to be separated manually from the emails (Hertling, 2012, p. 9). That is why Hertling enhanced the Copenhagen solution: “The ‘digitale Vorlass System’, with its approach of an automated data transfer, represents a concept that will meet today’s IT potentials. It would be imaginable to set up a document management system into which the creators feed their digital work and life documents via data transfer already during their lifetime.” (Hertling, 2012, p. 7). The “substantial customer potential” of such a solution was the “possibility to relieve individuals from their data and to care simultaneously for the digital preservation of this data.” (Hertling, 2012, p. 8). The system architecture Hertling suggested is comparable to the web service “Dropbox”. In this application the user installs a client on his computer which serves as an interface to the “Dropbox” server. All the digital objects that the user puts into the folder he has defined as “Dropbox” are uploaded into the cloud operated by the service. At the same time, the original objects remain on the user’s computer. Using Dropbox, creators can access the service from any device with a web connection, e.g., smartphones. With regard to the creators’ copyright and privacy, Hertling recommended that the “digitale Vorlass-System” be offered not by a private service provider (like Dropbox) but by a library or archive. She suggested the document management system Fedora as a suitable software environment (Hertling, 2012, p. 9).
4. Review of the Described Solutions
However, there are some reasons to enhance self-archiving. The question of what is actually stored in a self-archiving system reveals a fundamental shortcoming of the solutions described above. Both self-archiving via email (Copenhagen) and the “digitale Vorlass System” (Hertling) are based upon the transfer of data copies to the archive’s or library’s digital repository – either via email or via data upload. The original objects, however, remain on the creator’s computer, where he can still work on them. This is in contrast with the characteristic of collecting institutions to collect unique materials. Their collection policy is to preserve biographical and scientific sources. Archiving a copy instead of the original data is contrary to this policy. Thus one should ask what kind of relation exists in this case between original data and copies, and whether inconsistencies and redundancies arise from an interim storage of copies. How can authenticity and integrity be ensured if the original object can change independently of the copy? What effort arises after the final deposition of a personal archive if original data and copies are in the same system?
Another objection concerns the writer’s active participation. Since the transfer of objects into a self-archiving system implies an act of volition on the writer’s part – he selects the digital objects to transfer and feeds them into the respective transfer channel – the risk of an incomplete self-archiving practice persists. In this case one should ask: does the writer use the possibility of self-archiving regularly, or does his interest slacken gradually? Are there ways and means that induce him to regular use?
Thirdly, Hertling (2012) has pointed out that a document management system like Fedora does not offer the possibility to manage e-mails. Thus, an important group of materials is not covered by this system. A substitute has to be found for the management of the digital correspondence, for which Hertling recommended the use of the Danish solution (self-archiving via email) (Hertling, 2012, pp. 9–10). In this case the “digitale Vorlass System” consists of two system environments, which would cause significant additional work in the preservation of the digital documents. A pure DMS solution, however, would lead to a fragmentary data transfer.
5. Self-archiving and Digital Preservation in the Cloud
Kramski and Von Bülow (2011) expressed concerns about the email solution. They argued that e-mail correspondence was shifting more and more into the cloud and that cloud providers, for reasons of capacity, were thus forced “to cancel inactive accounts after a certain time”. This development applied to an even greater extent to social networks, and it was foreseeable that users would in future also store traditional data online; systematic local backups of these data were not to be expected. In both cases these data would be lost by the time a literary archive tried to access them (Kramski & Von Bülow, 2011, p. 160).
This finding is certainly true, but it also points to a possible solution. However, this requires that special collections consider the cloud not as a problem but as an opportunity. The objections to self-archiving raised previously are refuted if one assumes that the writer’s personal archive and the special collection’s self-archiving system are in one and the same IT environment. In such an environment the writer is not only provided with an opportunity to feed digital objects into the collection’s self-archiving system – he can rather work in the system directly. In order to provide such an offer, Hertling’s “digitale Vorlass System” should be expanded with technologies that are known as public cloud services. The term “public cloud” refers to cloud-based applications, e.g., mailboxes, document storage, or social networks, which are provided chiefly by commercial services to a wide public. These services are often free of charge for the users.8
Thus, public cloud services that integrate a variety of applications in a single system serve as a model for the solution sought here. Two aspects differentiate a “collection cloud” from a public cloud:
- The cloud providers are not private companies but special collections.
- The access is not available for everyone, but only for writers.
The concerns described above can be avoided by a cloud-based architecture: using a cloud, special collections no longer receive snapshots in the form of data copies or disk images; instead, they host the writer’s whole personal archive, or at least parts of it. Inconsistencies and redundancies are avoided, because the original object is stored instead of a copy. Since the self-archiving system is his working environment, the writer does not have to be motivated to feed important objects regularly into a separate channel. A cloud-based architecture also avoids the problem of storing emails with Fedora, because the cloud provides the mail function as an integrated component instead of two different channels for uploading documents and storing e-mails.
Why should special collections install such a cloud-based self-archiving system if sufficient public services exist? The main argument against public cloud solutions was already mentioned: with a commercial cloud provider, a third instance enters into the relationship between writer and special collection. This complicates the acquisition process, because the provider’s rights have to be considered once a writer’s data are to be transferred from a cloud service to a special collection. If the respective writer, for example, did not leave any instructions (e.g., a deposition agreement or a documentation of the passwords he used), the public cloud provider may deny access to the writer’s account. A second complication is that providers delete accounts after a certain period of inactivity, such as after the death of a writer.9 The providers meanwhile address the post-mortem problem with some tools, like the Memorialization Request by Facebook (Schloemann, 2009) or the Inactive Account Manager by Google. The latter enables the user to decide what happens with his data during periods of inactivity: he can either specify the deletion of his data after a certain period or nominate a person who gets access to the account (Google, 2013). However, these solutions are unsatisfactory, because it is unclear under current legislation on what terms service providers have to grant third parties access to a deceased person’s account (Ewig online, 2013). Assuming that many documents are stored online, one must expect that they will be lost due to the unpredictable factor of the cloud providers’ terms and conditions. A collection-powered cloud solution eliminates this factor and reduces the rate of data loss.
The collection cloud, thus, pursues the objective to combine the functionality of frequently used cloud services in a service provided for writers by special collections and to prevent the risks associated with the use of public cloud services. Such a solution might include the following services:
- Data storing (like Dropbox)
- Website hosting
5.1 Architecture of a Collection Cloud
The cloud’s architecture and organization can be derived from the needs of the stakeholders involved. First, a comment on the collections:
Special collections have, basically, two options when setting up a cloud: the use and expansion of their own IT structures, or cooperation with other institutions or partners. Not only the cost of the cloud’s development, implementation and operation has to be considered, but also the structural and political conditions on location. Thus, a large special collection that has an IT department, or that is connected to the data centre of its parent institution (e.g., municipality, university), will make a different cost-benefit analysis than a small archive that only occasionally takes in digital papers. It is therefore conceivable that large national or regional institutions will opt for a stand-alone solution, whereas others, due to a lack of resources, will favour cooperation with other institutions.
While reflections on the cloud architecture may lead to different results depending on the size and organization of an institution, there is much to recommend cooperation with regard to a consistent initial capture of digital papers. A cooperation of special collections and related institutions (e.g., national libraries, city archives, state archives, family archives) offers the opportunity to implement consistent standards in one nationwide (or international) system. This is not about exact compliance with description or cataloguing rules [such as the AACR, DACS, ISAD(G)2, EAD or – for German-speaking countries – the RNA], metadata standards (e.g., METS) and controlled vocabularies, but about supplying user-end functions that facilitate the subsequent preservation of, and access to, digital papers.10 The basis of those requirements should be the manuscript curators’ experiences with digital papers – and thus functions that established public cloud services do not include. Roland Kamzelak (2010), for example, described how emails complicate the description of a writer’s correspondence. He mentioned the citation functionality by which “components of a received letter, marked as such, get part of the answer. Collages of repetitions are arising, that are carried, comprising several pages, several email messages along.” (Kamzelak, 2010, p. 469). A description-friendly citation tool might put things right. It is also conceivable to give the writer the opportunity to mark digital objects with default subject headings, or to store them in a predetermined file system. Manuscript curators should discuss those problems trans-institutionally and resolve them in cooperation prior to the cloud’s development in order to build a custom-fit solution. This argument, and the support needed by small or less well prepared institutions, speak for a cooperative solution.
This leads to the question of how such a cooperation should be organised. There are multiple possibilities:
- A large special collection or national library with a powerful data centre masterminds the cloud and all other institutions contribute to the costs.
- A decentralised architecture, consisting of a network of institutional data centres, is built.
- Moreover, the commissioning of commercial data centres is conceivable, if these are cheaper and offer trustworthy and safe solutions.
Unfortunately, there is not enough space to discuss the funding of each option. However, one will have to consider the country-specific conditions, e.g., existing national research programs or the organizational and financial structure of national archive/library networks.
Of course, every single participating collection should have the right to maintain exclusive access to “their” digital papers. To this end, a digital rights management system should be set up. Access could take place, for example, via VPN (Virtual Private Network), so that a safe virtual space is provided for each collection. The assurance of exclusive access is a self-evident precondition for each participant who wishes to safeguard his particular interests.
The wish for a safe and exclusive working environment may also motivate the writers. In particular, the well-known collaboration of commercial cloud providers with secret service agencies has made the need for an alternative obvious, so that the writers, with their distinct sense of political hazards, may welcome special collections as trustworthy partners. For this, special collections or a parent corporation should act as operators of the cloud and communicate that the collection cloud is hosted in the respective nation under the respective national law.
What has been said above in relation to the special collections applies consequently also to writers:
- They should access the cloud via a password-secured VPN.
- They should have the exclusive disposal of their own account in order to decide confidently what happens to their data, e.g., whether digital objects are deleted or transferred to other systems.
- They should also be free to decide whether and when a digital object is finally transferred from their account to a special collection’s account. (This could be done by moving an object to a specific folder, or by marking it.)
The precondition for such a transfer is that a donation agreement has previously been concluded between the respective writer and a participating special collection. The agreement may include separate clauses, e.g., for famous writers who want to sell their digital manuscripts under special conditions. Otherwise a transfer will not be possible, and the whole personal archive remains in the writer’s account – perhaps in anticipation of a later agreement, or until the writer lets the account lapse. The time span between a writer’s inactivity (e.g., in the case of death) and the account’s deletion should be sufficient to reach an agreement with his heirs.
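The transfer rule just described – marked objects move to the collection, but only once an agreement exists, and the original moves rather than a copy – can be expressed compactly. The following Python sketch is a hypothetical model by the author (the class and field names are illustrative assumptions, not part of any existing system):

```python
from typing import Dict, List, Optional

class WriterAccount:
    """Hypothetical model of a writer's space in the collection cloud."""

    def __init__(self, name: str, agreement_with: Optional[str] = None):
        self.name = name
        self.agreement_with = agreement_with  # collection named in a signed agreement
        self.objects: Dict[str, bool] = {}    # object id -> marked for transfer?

def transfer_marked(account: WriterAccount,
                    collection_store: Dict[str, str]) -> List[str]:
    """Move the objects the writer marked into the collection's space,
    but only if a donation agreement exists. Without an agreement the
    whole archive stays untouched in the writer's account."""
    if account.agreement_with is None:
        return []
    moved = []
    for oid, marked in list(account.objects.items()):
        if marked:
            collection_store[oid] = account.name   # the original moves ...
            del account.objects[oid]               # ... no copy remains
            moved.append(oid)
    return moved
```

The point of the guard clause is that, absent an agreement, the system performs no action at all on the writer’s data – the behaviour that later makes it cheap to admit writers without a signed agreement.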
From these few considerations it follows that the collection cloud consists of two sections. The first section contains exclusively the writers’ accounts and is in fact the collection-side provision of a cloud service, with the resulting duties: monitoring, maintenance, support and periodic refreshing of the entire system. These activities ensure that the stored personal archives and data will not be corrupted or destroyed by hardware or system damage, by malware or by hackers.
The final preservation takes place in the second section, namely when an agreement exists and objects have been marked accordingly by the writer. From this point on, the special collection that made an agreement with the respective writer takes over the digital objects. It is left to each special collection to decide what happens next: whether to store the objects permanently in the cloud or to download them into its own data centre. Either way, the manuscript curators should process the received objects in accordance with workflows specially developed for this purpose, with the goal of producing OAIS-compliant packages. To sum up, the cloud proposed here consists of two sections:
- The writers’ space, which is provided by the collections
- The collections’ space, in which curators manage and preserve the received papers on their own responsibility.
5.2 Access and Fees
This architecture also answers another tricky problem – the question of access authorisation: which writers may use the cloud? A conventional answer, derived from the experience with paper materials, would point to the bulk of the received data, which produces – compared with the literary value of those materials – a disproportionate management effort for curators. Thus, collections try to sustain a certain quality by collection guidelines that exclude young or less known authors from acquisition. Due to the volatility of digital media, such an approach would lead to a loss of digital documents from the early days – when the writer is still unknown – or of the whole work of a writer whose importance is recognised only after his death. To avoid this, it is necessary to embrace as many writers as possible (known or unknown) without producing more work and cost for the collections. This is exactly what the bipartite cloud architecture offers: a data transfer from writer to special collection depends on the signing of a donation agreement; otherwise digital objects remain in the writer’s account without any actions by the curators. In this case only the operational cost of the cloud system is incurred.
Should the collection cloud be free of charge for writers? One possible answer results from the potential users’ expectations. Commercial cloud services, such as Google, Facebook, or Twitter, are in fact free of charge. Their business model is based precisely on free access, which allows them to gather and market the user data. Knowing this, only a few writers would be ready to pay money for a service that is available elsewhere for free. Therefore, special collections should adapt and modify the strategy of commercial services. After all, they are just as interested in data collection, but not for commercial reasons. Their interest results from their collection policy and the tasks involved, such as acquiring and preserving relevant papers (Ott, 1999, pp. 33–34). From this point of view, and in order to meet user expectations, the collection cloud should be a free service.
To sum up: the conditions are in place for writers to accept the collection cloud as an alternative to the established public cloud services. In recent years cloud technology has reached such a stage of maturity that it can be regarded as sustainable. By establishing a cloud, special collections gain an instrument that provides writers with a reasonable working environment and, at the same time, enables the preservation of their personal digital archives. The time span between an object’s creation and its preservation – the critical factor of digital preservation – is reduced to a minimum. Mismatches between an original object and a transferred copy are avoided, and fragmentation of personal archives (e.g., through the use of commercial services) decreases. The currently successful cloud technology is therefore an option for special collections, which will have to face the challenges of digital media sooner or later anyway.