1. Background and introduction
In the social sciences (especially economics, political science and sociology) more and more researchers analyse data provided by official statistics or by specialised providers of research data (e.g., from the ALLBUS at GESIS1 or from the SOEP at DIW Berlin2). In addition, relevant data may often also be purchased from companies like Thomson Reuters or Bloomberg.
In economics especially, compared to other branches of empirical research, researchers rarely compile their own datasets. A major exception is the field of experimental economics, where researchers often generate their own datasets in the course of investigations motivated by game theory.
Although a rising number of publications in almost all scientific disciplines are based on the analysis of datasets, there are few effective ways to replicate or re-examine the results of an empirical article, to verify them, or to make them available for re-utilisation and to support scholarly debates.
Even research data that is, in principle, publicly available will typically not be archived (e.g., in a final working file) together with the specific selection and adjustment procedures applied to it. Therefore, while replications are not necessarily prevented, they are extremely difficult in the case of ambitious analyses based on specific data selections and calculations.
The current situation confronts both the scientific community and scientific infrastructure service providers, like libraries and research data centres, with multiple challenges. In addition to questions concerning data availability and incentives for sharing data, there are also infrastructure challenges. In particular, the roles and responsibilities of scientific infrastructure providers, e.g., research data centres (RDCs), in managing and operating a data archive that facilitates the replication of published research are often not clearly defined.
The first part of our paper describes some of the problems that lead to poor replicability of social sciences research. Then our paper describes the outcome of desktop research and an online survey evaluating scientific infrastructure providers with respect to their potential services for the management of publication-related research data in the field of social sciences.
The conclusion of our paper discusses the roles and responsibilities of several stakeholders for operating data archives for scholarly journals. Experiences in other scientific areas are integrated in our suggestions for establishing data archives that are based on the complementary know-how of research data centres (RDCs) and libraries.
1.1. Why is social science research often not replicable?
According to the literature the following reasons for missing replicability may be mentioned:
- First, and most importantly, there is a lack of incentives for researchers to share their data with the community. The academic reward system does not honour the often time-consuming efforts of data sharing — in sharp contrast to publications, although “[a]n applied economics article is only the advertising for the data and code that produced the published results” (Anderson, Greene, McCullough, & Vinod, 2008, p. 101).
- Furthermore, social scientists may worry that data sharing could lead to personal disadvantages. Researchers who work up and share data with the community do not receive appropriate compensation, e.g., reputation, for their efforts and may even suffer in terms of academic career because data sharing takes time that cannot be spent on research. In addition, many researchers suspect others will “misuse” their data, for example with faulty interpretations or by using a dataset without due reference to the creator of the dataset (Fecher, 2014). Finally, the legal status of research data with regard to data sharing is not sufficiently clear, which also leads to reservations about data sharing.3
- Few social science journals have currently implemented guidelines requiring their authors to provide their data and statistical computation codes (McCullough, 2009; Vlaeminck, 2013). So-called “data availability policies” may, in some instances, oblige the authors of empirical research papers to supply the underlying data of their results and the code/syntax of their analysis along with the manuscript of the article. Those policies are often in line with the “replication standard” formulated by Gary King (1995).
- Useful infrastructure components for the management of publication-related research data are rarely applied, which, in turn, prevents any uniform way of citing the underlying data. Available technical solutions like Dataverse,4 a powerful tool for managing and documenting publication-related research data, are adopted by only a few journals. In this context a critical point focusses on how professional research data centres handle research-related data and what kind of services, if any, they offer.
1.2. Do research data centres offer services for archiving publication-related research data?
Research data centres could actually be ideal institutions for managing publication-related research data published as attachments to articles within scholarly journals. These capacities originate from decades of expertise in the handling of social and economic research data, from core competencies in the creation and maintenance of metadata collected and tagged from surveys, as well as extensive experience in managing access to data (Research Information Network, 2011). Cox and Pinfield (2013) argue that librarians, in contrast, already often feel over-taxed with the multiple roles that they have in the various activities of their libraries. In addition, libraries may lack technical knowledge and domain-specific expertise and may also have limited personal experience in the common research processes. As such it may be difficult to position libraries as key players in this area. A way forward could be close collaboration between libraries and research data centres to solve upcoming challenges, as Christensen-Dalsgaard (2012) suggests.
Therefore, the EDaWaX (European Data Watch Extended – www.edawax.de) project, funded by the German Research Foundation (DFG) from 2011 to 2016, conducted a study evaluating whether such services for publication-related research data are currently available from scientific infrastructure service providers like research data centres, libraries and archives. For this purpose a list of 46 scientific infrastructure organisations was prepared. It includes all German research data centres and data service centres accredited by the German Data Forum (RatSWD),5 research data centres organised within the Council of European Social Science Data Archives (CESSDA),6 the library networks in Germany as well as individual libraries and public archives.
Our investigation into the services provided by these data centres for managing publication-related research data in the social sciences is a hands-on approach to evaluate the possibility of cooperation between research libraries and research data centres. Therefore our study followed the suggestion of Lyon (2012) to develop “a proactive approach to collaborating with disciplinary, national and international data centres … for data deposit in such archives” (p. 130).
In a first step, the websites of these organisations were examined with regard to potential services for storing and hosting publication-related research data. The ICPSR (Inter-university Consortium for Political and Social Research, University of Michigan) provides a publication-related archive7 that is used by numerous authors to deposit their publication-related data.8 NARCIS,9 a research information system located in the Netherlands, offers a specific service for publication-related data.10 DANS EASY,11 another service located in the Netherlands, can also be used to deposit such data in principle.12
However, desk research could not uncover other information needed for further analysis, which is why, in order to start a more detailed evaluation of potential services by these organisations, an online survey was conducted.
2. The online-survey
In October and November 2012 an online questionnaire was sent to 46 organisations, among them 35 national and international research data and data service centres, one archive, seven library networks and single libraries, as well as three other organisations (non-European research data centres). The return rate was satisfactory, especially when compared to the average return rates of mail surveys.
Due to the structure of the questionnaire, not all participating organisations responded to all questions, which explains deviations in the number of responses (Figure 1).
Certainly more important than the return rate is the structure of respondents and non-respondents. The large majority of responses came from research data centres in Germany and Europe (86%). Significantly under-represented were respondents from German library networks and archives. The three non-European research data centres did not respond.
We can only presume that the library networks and the archive do not offer relevant services for research data management and, therefore, did not respond to our survey.
2.1. Empirical findings
Initially, the survey asked whether institutions would, in principle, host and store publication-related research data.13 In addition, the survey also asked whether organisations would also host and store (self-compiled) software components and the code of computation/syntax of statistical analyses. These three types of data are often part of empirical submissions to economic journals.14
More than three-fourths of all organisations responding accept external datasets for storage (Figure 2). At the same time the lion’s share of respondents reported that research data would only be accepted if certain criteria were met. Such criteria are subject to the specific competencies of many research data centres, but also to the specific regional/supra-regional or national competencies. Moreover, technical and organisational aspects (e.g., proper documentation, machine-readability, etc.) as well as legal concerns were cited as criteria. Approximately 74% of the respondents indicated that their organisations would host these types of data (Figure 3). If any criteria for hosting were mentioned, the subject-specific orientation of an institution was stated as the main criterion.
With regard to storing and hosting of (self-compiled) software components, which are often used for economic simulations, our survey indicates that just under a fourth of responding organisations accept storing and hosting software components without restrictions (Figure 4). Another 17% pointed out that they have criteria for assessing if software can be stored and hosted (e.g., if essential for the analysis of the data). Therefore, a gap exists in the availability of hosting and storing software components. Only a limited number of organisations offer this service.
2.4. Code of computation/Syntax of statistical analyses
Almost 70% of the organisations responding offer options to store and host computation codes (Figure 5). However, a quarter does not do so at present and is not considering offering such services in the near future. One respondent also stated a criterion — noting that the storing and hosting of these data would only be useful in the case of derived variables.
Within our analyses we also examined the availability of application programming interfaces (APIs), which enable automated data exchanges. Our results show that less than half of all responding organisations have these interfaces at their disposal (Figure 6).
Most frequently APIs were mentioned as a device for data search (47%), followed by APIs used for uploading research data. Slightly more than a third (35%) of all respondents declared the availability of an API to analyse research data.
However, further analysis by EDaWaX shows that the reported interfaces consist only of search and upload functions on the respondents’ websites. We were not able to find an API. Presumably, APIs in the sense of external read and write access are by and large unknown among our respondents and not readily available.
2.6. Metadata schemata and the creation of metadata
2.6.1. Employed metadata schemata
We were also interested in the metadata schemata currently used by the organisations in their daily work. Our survey shows that more than 70% of the respondents use DDI (Figure 7). Other schemata like Dublin Core are rarely used (29%).15 All other metadata schemata are used rather sporadically.
2.6.2. Persistent Identifiers (PI)
In addition, we asked whether organisations assign persistent identifiers (e.g., handle, DOI, URN, etc.) to datasets and other materials. The persistent identification of research data is an important issue, for instance because it enables researchers to cite datasets.
More than 56% of the organisations in our sample assign such identifiers by default, but almost a third do not (Figure 8).
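To illustrate why persistent identifiers matter for citation, the following minimal Python sketch assembles a dataset citation that ends in a resolvable DOI link. It is an illustration only: the function name, the citation layout, and the example record are assumptions made here, and the DOI shown is a placeholder rather than a registered identifier.

```python
def format_data_citation(creators, year, title, publisher, doi):
    """Return a human-readable dataset citation ending in a DOI link.

    The layout (creators, year, title, publisher, DOI) is one plausible
    convention chosen for this sketch, not a prescribed standard.
    """
    authors = "; ".join(creators)
    return f"{authors} ({year}): {title}. {publisher}. https://doi.org/{doi}"


# Hypothetical record; the DOI is a placeholder, not a registered identifier.
citation = format_data_citation(
    creators=["Doe, J.", "Roe, R."],
    year=2013,
    title="Replication data for an example article",
    publisher="Example Data Centre",
    doi="10.1234/example.5678",
)
print(citation)
```

Without a persistent identifier, the last element of such a citation would have to be an ordinary web address, which may change when a data centre reorganises its website.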
2.6.3. Support of Semantic Web Technologies
In our survey we also examined the implementation of RDF (Resource Description Framework). RDF is a general method for conceptual description or modelling of information implemented in web resources. Among the organisations answering this question a minority of 6% claimed to use and disseminate RDF-files. Almost a quarter of all respondents did not specify whether their organisation uses RDF, which presumably indicates that RDF is largely unknown.
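For readers unfamiliar with RDF, the sketch below shows how a dataset description can be expressed as subject-predicate-object statements and written in the N-Triples syntax. It is a simplified illustration using only the Python standard library: the dataset URI and the metadata values are hypothetical, and the choice of Dublin Core terms as predicates is an assumption made here, not a practice reported by the respondents.

```python
DCTERMS = "http://purl.org/dc/terms/"  # Dublin Core terms namespace


def to_ntriples(subject, properties):
    """Serialise {predicate URI: literal value} pairs as N-Triples lines."""
    lines = []
    for predicate, value in properties.items():
        # Escape backslashes first, then quotes and line breaks,
        # as required for N-Triples string literals.
        literal = (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{subject}> <{predicate}> "{literal}" .')
    return "\n".join(lines)


# Hypothetical dataset URI and metadata values, for illustration only.
triples = to_ntriples(
    "https://example.org/dataset/42",
    {
        DCTERMS + "title": "Replication data for an example article",
        DCTERMS + "creator": "Doe, J.",
    },
)
print(triples)
```

Each printed line is one statement about the dataset, which is what allows such descriptions to be merged with metadata from other catalogues.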
2.6.4. Support for creating metadata
Again and again, a critical issue regarding the reuse of research data is the quality of data documentation. Therefore, a matter of particular interest is whether respondents support researchers in generating metadata and, if so, how.
Our survey shows that the majority (almost 65%) of all responding organisations do so (Figure 9).
Furthermore, we were keen to know whether this support is software-based — e.g., if there is a web frontend where researchers may type in the required information that is then converted into a standardised metadata schema.
We find that 36% of the respondents use this type of software-based support with researchers (Figure 10).
There are a striking number of statements in the category “other”. Part of this other support for researchers, for instance, consists of written data deposit forms.
Our question regarding the software program names revealed that at least two institutions use Nesstar.16 Many organisations also use in-house solutions.
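The software-based support described above, in which information typed into a form is converted into a standardised metadata schema, can be sketched as follows. This is a minimal illustration under stated assumptions: it targets simple Dublin Core rather than DDI because Dublin Core is compact enough for a short example, the wrapping `record` element and the function name are hypothetical, and the form values are invented. It does not reproduce Nesstar or any respondent's in-house solution.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set
ET.register_namespace("dc", DC_NS)


def form_to_dublin_core(form):
    """Convert a flat dict of form fields into a Dublin Core XML record.

    Keys are assumed to be Dublin Core element names (title, creator, ...);
    a list value produces one repeated element per entry.
    """
    record = ET.Element("record")  # hypothetical wrapper element
    for field, value in form.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            element = ET.SubElement(record, f"{{{DC_NS}}}{field}")
            element.text = str(item)
    return ET.tostring(record, encoding="unicode")


# Hypothetical form input, as a researcher might enter it in a web frontend.
form_input = {
    "title": "Replication data for an example article",
    "creator": ["Doe, J.", "Roe, R."],
    "date": "2013",
    "identifier": "https://doi.org/10.1234/example.5678",
}
xml_record = form_to_dublin_core(form_input)
print(xml_record)
```

The point of such a frontend is that researchers supply plain field values while the schema-conformant markup is generated for them.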
2.6.5. Digital long-term preservation
In our survey we wanted to identify to what extent the respondents’ institutions have implemented specific measures for long-term research data preservation. Therefore we asked the respondents whether their organisations take specific actions for digital long-term preservation. Because format-migration is one of the dominant strategies for long-term preservation (Harvey, 2012), we suggested format migrations as one such method.
Our survey indicates that more than 80% of all organisations use these types of procedures (Figure 11).
3. Conclusion and Discussion
Our study aims to evaluate if services for publication-related research data are available from data centres, libraries and archives. Based on existing services, our project defines roles and responsibilities for operating a publication-related data archive for journals in the fields of social sciences. This approach is in line with numerous recommendations put forth by European and national organisations as well as projects to interrelate research outputs with their underlying research data (German Council of Science and Humanities, 2012; Kroes, 2012; Reilly, Schallier, Schrimpf, Smit, & Wilkinson, 2011).
A question often arising in the context of linking data and publications concerns the stakeholders and their roles and responsibilities in the process (Costas, Meijer, Zahedi, & Wouters, 2013; Lyon, 2007). At first glance publishers appear to be the optimal stakeholders to perform the task of building up effective and efficient data archives because many publishers already host supplementary material for their journal articles. Developing and implementing data archives, and collecting and disseminating research data and metadata for datasets and other material, could therefore be an easy task for publishers. So wouldn’t it be a good idea to rely on the publishing industry? To answer this question we have to take a differentiated look at the role of academic publishers.
Currently, publishers do not see the need to implement data archives for journals on their own (De Waard, 2012). One reason might be that implementing and operating data archives raises the costs of publication. At the same time, the availability of a data archive does not necessarily increase the number of journal subscriptions. Hence, the incentives to build up and operate data archives are not readily apparent to publishers.
In addition, questions of ownership and access conditions to archived research data could cause uncertainty for researchers, despite a publisher’s announcement “not to require any transfer of or ownership in such data or data sets as a condition of publication of the article in question” (STM & ALPSP, 2006, p. 1).
Despite the fact that publishers do not operate data archives for their journals, they can nevertheless play an important role in the process of interrelating research data and publications. We already observe such collaborations in some scientific disciplines where publishers and data archives actively cooperate. In disciplines like the earth sciences, the e-infrastructure needed for storing and hosting research data in conjunction with appropriate documentation of the data has already been in place for quite some years. From a publisher’s perspective linking research data and publications provides a benefit for their journals if the scientific outputs that are enriched with research data generate more citations (Piwowar, Day, & Fridsma, 2007). In addition, these links enable a more accurate research process and offer protection against scientific misconduct (McCullough, 2009).
Excellent examples of collaborations between publishers and data repositories include PANGAEA and Dryad. PANGAEA is the “data publisher” for earth and environmental sciences. It partners with Reed-Elsevier.17 Dryad is a non-profit repository for data underlying the international scientific and medical literature. It partners with numerous journals.18
Based on these experiences, the best solution is to implement and operate a discipline-specific data archive that gains importance by acquiring more and more data and subsequently partners with publishers. The evolution and success of PANGAEA and Dryad underline this approach impressively.
So, if it is not up to the publishers to run a data repository, other stakeholders come to the fore. In particular, research libraries and research data centres are the best positioned to take on the responsibilities of running such a disciplinary data repository. The results of our empirical investigation lead us to the conclusion that research data centres (RDCs) are likely the most suitable institutions to take on the role of hosting and storing publication-related research data that is submitted to journals. RDCs already meet many prerequisites. In particular, the RDCs we analysed, in the broader field of social sciences, have much data handling experience. They are well trained in the storing, handling and documentation of these types of data as well as in taking appropriate measures for long-term data preservation.
Because RDCs currently do not comply with all requirements with respect to storing and hosting publication-related research data, collaborations between libraries and research data centres appear to be a promising way to establish such data archives: Libraries have the skills for managing publications. These include a dedicated knowledge of using authority files and multiple metadata schemata, in cataloguing information and providing this information to their discovery systems. Or as James L. Mullins, Dean of Purdue University’s Library, describes it, “Our ability to see structure to overlay on a mass of disparate ‘parts,’ as well as the ability to identify taxonomies to create a defined language for accessing and retrieving data is what is needed from us” (Baykoucheva, 2011, p. 46). Unlike RDCs, it seems to be much more common for libraries to provide their stocks to their customers and to implement technical systems and the APIs necessary to do so.
According to Pullinger and Wagner (2010, p. 3), managing research data comprises a mix of information that goes beyond the traditional separate realms of publications (the primary responsibility of national libraries), official records (the responsibility of national archives) and datasets (the responsibility of researchers themselves, statistical offices and RDCs). In addition, Cox and Pinfield (2013) emphasize that scientific libraries often do not possess specialised units experienced in both IT skills and knowledge of domain-specific research data, a factor that hinders libraries’ engagement in research data management. Establishing these departments takes time and costs money, which is often not attractive to scientific libraries during times of budget cuts.
Based on Lyon’s suggestions to assign roles between libraries and RDCs (Lyon, 2007; adapted by Vlaeminck, 2013), we suggest the following tasks for the implementation of our project’s pilot application of a data archive for economics journals, in which we strive to realise a workflow based on this division of complementary know-how.
In this distribution of tasks, ZBW (the Leibniz Information Centre for Economics) adopts the role of hosting and maintaining the metadata catalogue. Libraries then provide the technical implementation of APIs to other (library or research data) catalogues with the purpose of enriching and disseminating metadata. In turn, one of the German RDCs, the research data centre of the Socio-Economic Panel (RDC SOEP), should take over the tasks of hosting, storing and preserving the data that has previously been submitted by editorial offices using the project’s application.
By developing, implementing and operating a publication-related data archive for economics journals, both libraries and RDCs would help to ensure the validity of published economic research and to facilitate replications of these scientific outputs.