1. Introduction

Library catalogues are a prominent information source for anyone studying the history of publishing and the associated social change. The standardized large-scale nature of these data collections, cataloguing up to millions of documents, calls for an automated quantitative framework that could be used to shed light on the development of book production in the early modern world, for example.4 Library catalogues have conventionally been seen as a tool for finding a particular item in the library system.5 We demonstrate how these catalogues can be used not only as research tools, but also as research objects. We have integrated parts of the English Short Title Catalogue (ESTC) from the British Library with statistical algorithms to establish a data-analytical ecosystem via which to analyse changes in book production in early modern Britain and North America during the hand press era (1470–1800).6

We propose using such open data-analytical ecosystems, libraries of a kind, to supplement and explore the full contents of digital data resources with state-of-the-art data-analysis techniques.7 This would notably complement conventional database-query interfaces such as the ESTC, EEBO and ECCO, which allow specific searches but not a full exploration of the database contents. Such arbitrary restrictions on data availability place severe limitations on how the data can be used, and create a significant bottleneck for data-driven research. Moreover, even when data collections are made available, the lack of statistical tools specifically designed for such analysis constitutes another practical research obstacle. The ecosystems we propose combine research data and custom algorithms to enable flexible and deep quantitative analysis of the full data collections. These systems can be implemented in open-source software libraries that are developed as part of the research process and provide customised tools for analysing data collections of interest. Here we demonstrate the benefits of such an approach in a practical case study on knowledge production in the field of history during 1470–1800.

2. Library catalogues as a research resource

According to Peter Stallybrass (2004), researchers should be turning to librarians to understand knowledge production (see also Kraus, 1986). It is well known that union catalogues such as the ESTC can be used as a research resource for statistical analysis, although they were not originally designed for such a purpose (Suarez, 2009).8 We propose a new way of carrying out such research and demonstrate how library catalogues can provide a valuable data resource for historical studies. We show how the analysis can be performed within a comprehensive quantitative framework, following the best open-science practices of transparency, reproducibility and code sharing.

Our analysis covers documents in the ESTC catalogue that include the word ‘history’ in any of the catalogued subject fields covering the years 1473–1800. This includes 50,766 entries among the 466,000 documents catalogued in the ESTC (~11%). We do not aim at an exhaustive or objective classification of history as a genre (or genres) as represented in the catalogue (and we acknowledge the limitation that it does not include all documents published in Britain and North America from 1470 to 1800).9 Nevertheless, anyone interested in studying David Hume’s History of England, for example, will now have convenient, adaptive tools for quantitative analysis based on bibliographic catalogues that will help considerably to situate this work in the context of general knowledge production. The generic principles of openness and automation we promote here could be extended later to the mining of full-text databases to gain further insights into the historical evolution of terms, concepts and research topics.10

Although the quantitative approach to the history of the early modern book has been recommended on several occasions (Weedon, 2007; see also the pioneering work of Bell and Barnard, 1992, 1998), it has not been implemented to the extent that is possible.11 Some scholars constructing quantitative analyses of book history have been rather sceptical about their own approach.12 One reason for this is that, as many scholars indicate, library catalogues do not represent a stable database, but are constantly being changed and updated, leaving the results from large-scale studies vulnerable to change in the database contents (see, in particular, Karian, 2011; Raven, 2014, was also critical). This is precisely why there is a need for automated open workflows such as the one implemented in this research, whereby results can easily and automatically be updated when new versions of the data arrive. Thus, in the case of the ESTC for example, should plans to formulate an improved ESTC21 catalogue be realised one day, our tools could easily be applied to the new version.13 Large-scale analysis of general trends complements the analysis of specific documents, authors or publication periods by setting them in the wider context of overall knowledge production. Moreover, the analysis of large-scale statistical trends in knowledge production can be expected to be robust against specific database updates. We believe that our methods will significantly expand the use of a quantitative framework for qualitative research. At the same time, it is important to acknowledge the limitations of such an approach: the overall information content of the catalogue and supporting information sources that can be linked to the study (May, 1984) sets limits on what the analysis can provide.14

To our knowledge, this is the first organized plan to move towards a comprehensive transparent quantitative framework within which to study the history of the book. An article on the bibliometric analysis of surviving records published as recently as 2009 (Suarez, 2009), for example, does not propose any solutions to the problems of transparency, automation and reproducibility that we have resolved here.

3. Open-data analytical ecosystems as quantitative research tools

The starting point of our analysis is the library catalogue, which is further extracted, transformed, and supplemented with supporting information, and subjected to rigorous quantitative analysis. The analysis is based on custom data analysis algorithms that are implemented as part of the research project. The combination of data and algorithms constitutes an ‘ecosystem’ that can be further refined and extended by updating the research data or source code. The algorithms provide generic research tools that can be potentially used beyond the scope of the original study.

The open-data analytical ecosystems we propose here ideally include: (i) the full database contents in an open, machine-readable format, provided by the institution that holds the data; (ii) supporting data sources, preferably from open source repositories; and (iii) well-documented open source algorithms to extract relevant information from the data and transform it into the final statistical summaries and visualizations in a fully automated, reproducible and transparent manner. The ESTC represents the main data source of interest in our case, further supported by external data sources such as publicly available name-gender mappings, geographical coordinate databases, custom lists of author pseudonyms, and other supplementary sources that support the interpretation, as we demonstrate in more detail below. The algorithms and the complete analysis workflow are being made available via the ESTC R package on the GitHub social coding platform (https://github.com/ropengov/estc), which facilitates further community contributions and feedback.

A central element of our work is that we make the full algorithmic details openly available for anyone to use, verify and improve further.15 The source code provides a detailed description of all the steps, from the data to the final quantitative results, tables and figures. Whereas the source code implementing a specific analysis is typically newly created and customized in each research project, many algorithms for specific analysis tasks can be readily borrowed from existing open source libraries. This leaves the researchers more time to focus on the new research questions, and thus makes the research more efficient. At the same time, our original contributions within this project have been publicly shared from the very beginning. Ideally, other scholars and the general public will be able to use the algorithms to study related research questions, or as a starting point to develop further tools and find new uses in other contexts.

We have implemented the work in the R statistical programming environment (https://www.r-project.org/), which is already widely used in other fields of science. This enables seamless integration of the data sets with state-of-the-art data-analysis techniques, and allows researchers to build their own research tools by combining existing standard algorithms with a custom source code, specifically designed for the given research project. In contrast to commercial software suites such as Matlab, SAS or SPSS, our approach is fully open source and provides dedicated tools for large-scale analyses of library catalogues. Unlike standard query interfaces such as ESTC, EEBO or ECCO, which provide only limited access to the database and do not allow large-scale data mining, our approach takes advantage of the complete data contents of the library catalogue.

4. Who wrote history?

It is interesting to ponder on the question of who wrote history, and on whether a quantitative analysis of publication volumes and numbers of imprints would support the common understanding of the most famous historians who published in the English language.16 A key challenge in this analysis is that the same author may be listed under multiple variants of the name. To overcome this we implemented parsers that remove special characters and recognize first and last names based on large background lists from public databases and manually prepared supplementary lists of synonymous names and pseudonyms, and finally convert the names into a harmonized presentation format (“last, first”). After harmonizing the author names we generated visualizations of author life years, which are also listed in the library catalogue. In some cases this revealed ambiguous names that in fact referred to different authors with the same name but who lived at different times (Figure 2). We therefore used the combined author name and life year as the final unique identifier for each author, and removed names that could not be unambiguously identified from the final data so as to avoid bias. Ultimately, we generated lists of the most commonly accepted author names, and also of the discarded names to monitor the conversion quality and to spot any obvious errors in the data handling: these summary tables are publicly available at https://github.com/rOpenGov/estc/blob/master/inst/examples/summary.md. Every detail of this analysis is fully transparent: any observed errors can be fixed in the source data and algorithms, and this iterative process continues until the majority of the names are handled correctly by the analysis ecosystem. Whereas the original data lists the author names for 22,320 documents, we were able to find a unique, unambiguous author name for 83 per cent (18,493) of these documents after the pre-processing.
Similar conversions take place for the publication places and years, document dimensions and other fields: the full algorithmic details can be browsed at https://github.com/ropengov/estc. After this initial polishing, the database is ready to be subjected to final statistical analysis.
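The name-harmonization steps described above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code of the estc or bibliographica R packages; the function names harmonize_name and author_id are hypothetical, and the sketch omits the background name lists and pseudonym tables on which the real parsers rely.

```python
import re
import unicodedata


def harmonize_name(raw):
    """Convert a raw author string into a lower-cased 'last, first' form."""
    # Strip accents, then replace everything except letters, commas,
    # whitespace and hyphens with a space (this also drops life years).
    text = unicodedata.normalize("NFKD", raw)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^A-Za-z,\s-]", " ", text)
    # Collapse whitespace and trim stray separators left at either end.
    text = re.sub(r"\s+", " ", text).strip(" ,-").lower()
    if "," in text:
        # Already in 'last, first' order.
        last, first = [part.strip() for part in text.split(",", 1)]
    else:
        # Assume 'first ... last' order; the final token is the surname.
        parts = text.split(" ")
        first, last = " ".join(parts[:-1]), parts[-1]
    return f"{last}, {first}" if first else last


def author_id(raw_name, birth, death):
    """Combine the harmonized name with life years into a unique identifier."""
    return f"{harmonize_name(raw_name)} ({birth}-{death})"


# Two spellings of the same author collapse to one identifier ...
same_a = author_id("Hume, David.", 1711, 1776)
same_b = author_id("David Hume", 1711, 1776)
# ... whereas a namesake with different life years stays distinct
# (the years used for the namesake are illustrative only).
namesake = author_id("Hume, David", 1558, 1629)
```

Keeping the life years in the identifier is what separates namesakes: the name alone would merge the two David Humes into a single author.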

A look at the most common authors who wrote history based on the title count reveals a mixture of pamphleteers and writers who are more commonly understood as historians (Figure 1). We highlight three of these authors in Figure 1 for further comparison (William Prynne, Daniel Defoe and David Hume). We also took into consideration a namesake of the more famous eighteenth-century David Hume to underline the relevance of individuating the authors in the catalogue (Figure 2). It is noticeable that the birth dates of the most popular authors are fairly evenly distributed throughout the early modern period, indicating that there were no particular peak moments for publishing history titles by known authors.

It is useful when evaluating the nature of a particular author’s works to set the number of titles on a timeline (Figure 3). What is noticeable in the comparison of Prynne, Defoe and Hume is that William Prynne caused quite intensive peaks in publication numbers during the English civil war, but his works were no longer published after the late seventeenth century. Defoe caused a very steep peak in publication numbers in the Union debates (1705–1706), but his works continued to be published throughout the eighteenth century. Hume’s historical writings showed a more steady development (there is no pamphleteering among them, as there is in Prynne and Defoe). Hume could be considered very successful in producing a steady flow of historical works throughout the second half of the eighteenth century.

Fig. 1: 

Early modern authors17 who published the most titles on history according to the ESTC catalogue data; the highlighted authors are compared further below.

Fig. 2: 

The life spans of the top early modern authors based on the title count: the visualization also reveals ambiguities arising from authors having the same name but living at different times (e.g. David Hume).

Fig. 3: 

The title counts per year for William Prynne, Daniel Defoe and David Hume (highlighted in Figures 1 and 2) provide an overview of their publishing activity up until 1800.

Analysis of the document dimensions reveals further information on specific authors: this is another example in which considerable polishing of the original data fields is needed before proper statistical analysis is possible. We set up automated algorithms to recognize and harmonize the most commonly used forms of the standard document sizes, such as quarto, which is also commonly referred to as 4to and 4°. In some cases the physical dimensions (in cm) are declared instead of the standard sizes. Where possible, our algorithms aim to augment such missing information based on ready-made conversion tables that map the common standard sizes to their corresponding physical dimensions (see e.g. https://github.com/rOpenGov/bibliographica/blob/master/inst/extdata/documentdimensions.csv). Finally, standardized document-size estimates are obtained for most documents, facilitating the comparison of publication activity among different authors.
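As a rough illustration of this kind of harmonization, the following Python sketch maps a few spelling variants to a standard format name. The synonym table FORMAT_SYNONYMS and the function harmonize_format are hypothetical and far smaller than any table a real catalogue would need; the sketch is not taken from the bibliographica package and does not attempt the conversion between standard sizes and physical dimensions.

```python
import re

# Hypothetical synonym table for illustration only.
FORMAT_SYNONYMS = {
    "folio": "folio", "fol": "folio", "2o": "folio", "2°": "folio",
    "quarto": "quarto", "4to": "quarto", "4o": "quarto", "4°": "quarto",
    "octavo": "octavo", "8vo": "octavo", "8o": "octavo", "8°": "octavo",
    "duodecimo": "duodecimo", "12mo": "duodecimo", "12o": "duodecimo",
    "12°": "duodecimo",
}


def harmonize_format(raw):
    """Return the standard format name for a raw size field, or None."""
    # Lower-case and drop spaces, full stops and brackets,
    # so that '4 to.' and '[4to]' both become '4to'.
    key = re.sub(r"[\s.\[\]()]", "", raw.lower())
    return FORMAT_SYNONYMS.get(key)


variants = ["Quarto", "4 to", "4o.", "4°", "[4to]"]
harmonized = [harmonize_format(v) for v in variants]
```

Returning None for unrecognized entries, rather than guessing, makes it easy to list the discarded values and extend the table iteratively.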

Figure 4 compares the top authors based on the title count in relation to the paper consumed in their books. From this perspective David Hume appears to have been successful indeed as a historian in terms of producing a steady flow of books of significant size. At the same time, Defoe’s historical documents seem to be of more of a pamphleteering nature than those of William Prynne. Clarendon, Robertson, Goldsmith and Burnet also start to stand out in this graph, as anyone familiar with the historiography of early modern Britain might expect. Thus, when the evidence from the three previous graphs is combined a certain consistency in David Hume’s historical publications emerges. Once he started publishing on history in the 1750s the volume grew steadily, and there was constant reproduction throughout the rest of the eighteenth century. Unlike Hume’s writing, Daniel Defoe’s works in particular are more random: although a prolific prose writer, he was also a pamphleteer switching from one topic to another. What is noticeable in William Prynne’s publications is that he was very resourceful as a seventeenth-century writer on history, yet, after his death his works stopped being reproduced.

Fig. 4: 

Title count versus paper consumption among the highlighted authors: the visualization reveals the nature of the authors’ publications, distinguishing pamphleteering (many titles, few pages) and the authoring of books (fewer titles, more pages).

Analysis of the author-gender distribution gives an example of supplementing the original library catalogue data. We supplemented the author information by estimating the gender of each one on the basis of publicly available information on first names and genders from the US national census, the R package gender and other sources that are incorporated into our analytical ecosystem (https://github.com/rOpenGov/bibliographica/tree/master/inst/extdata/names). Although a significant proportion of female authors in the early modern period wrote under a masculine pen name, or anonymously, there were still a significant number of female authors writing history catalogued in the ESTC (Figure 5).18 Even when women do not feature among the most prolific early modern authors of history, there were female authors with a significant volume of publications who compare favourably with the most famous male authors. In the future we hope to incorporate further data on pseudonym genders and on variations in name-gender distributions over time. Such improvements are easily incorporated into our ecosystem, and our analysis provides the first quantitative estimate of the publishing activities of female authors throughout the study period; we anticipate that the overall trends will remain largely robust to such updates.
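The first-name lookup behind such an estimate can be sketched as follows. This is a simplified Python illustration, not the authors' implementation: the table NAME_GENDER and the function infer_gender are hypothetical, and the example names are invented. A real table would be built from the census and other sources mentioned above.

```python
# Hypothetical first-name table for illustration only.
NAME_GENDER = {
    "mary": "female",
    "elizabeth": "female",
    "david": "male",
    "william": "male",
}


def infer_gender(harmonized_name):
    """Infer gender from a 'last, first' name; return None when unknown."""
    if "," not in harmonized_name:
        # No first name available (e.g. anonymous or single-word entries).
        return None
    # Use only the first of possibly several given names.
    given_names = harmonized_name.split(",", 1)[1].strip().lower()
    first = given_names.split(" ")[0]
    return NAME_GENDER.get(first)


guesses = {
    name: infer_gender(name)
    for name in ["smith, mary ann", "hume, david", "anonymous"]
}
```

Names that are missing from the table are left unassigned rather than guessed, which matches the caveat above that pen names and anonymous works limit what first names can reveal.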

Fig. 5: 

The most active known female authors based on the title count: the gender is inferred automatically from the first names.19

5. Where was history published?

London dominated the publishing business in Britain and North America during the early modern period until 1800 (Figure 6: see Myers, 1973; Pollard, 1978; Twyman, 1994).20 Most importantly, the Licensing Act restricted publishing in the English provinces until 1695, although booksellers played a crucial role in distributing books in peripheral areas while printing was limited (Barnard & Bell, 2002; Feather, 2004). It is also well known that the top publication locations shown in Figure 7 (Dublin, Edinburgh, Philadelphia, Boston and the University towns of Oxford and Cambridge) were also of importance once the book trade became more accessible.21 What has not been possible without great effort thus far is the quantitative comparison of different, especially smaller, publication locations, which would facilitate investigation into how this might reflect the publication of different genres. Below we analyse the major historical trends in publishing outside of London, in particular in Ireland, Scotland and the USA (the three biggest producers of printed documents after England).

Fig. 6: 

Publication volumes at the six top publication locations in Britain and Ireland, year 1700: the circle diameter corresponds to the logarithm (log10) of the title count.22

Fig. 7: 

The top publication locations in Britain and North America ranked by the title count (number of published titles).

A comparison of paper consumption and title count reveals, for instance, that there were relatively more publications compared to overall paper consumption in the area that we now know as the USA than in the other two countries (Figure 8). Further comparison among the three during the early modern period reveals that publishing activity in the USA was very sizeable in terms of the number of titles, but in terms of paper consumption the volume seems to have been proportionally much lower than in Scotland and, especially, Ireland (Figure 9). The implication is that many of the historical documents published in the USA were pamphlets.23 Given the volume development in US publication especially towards the end of the eighteenth century, this gives an intriguing picture of the historical relevance of pamphleteering: in some sense the young colonies were going through much of what Britain had experienced in the seventeenth century. Meanwhile the tendency, especially in Scotland and Ireland, was to publish octavo-sized books.

Fig. 8: 

Title count and overall paper consumption in the top publication locations: places in the USA are shown in a darker colour.

Fig. 9: 

Title count and paper consumption in Ireland, Scotland and the USA.24

6. How did publishing on the subject of history change over time?

Despite the advancements in historiography, there is still uncertainty about disciplinary boundaries and the adoption of a quantitative approach to the subject.25 We investigate the seemingly naïve question of what history is in relation to publication volumes, taking as our starting point all documents that contain the word ‘history’ in their classification field in the ESTC catalogue. It is evident from a comparison of publication volumes in history and all documents in the ESTC that other subject areas expanded much more rapidly (Figure 10). This might seem counterintuitive to the assumed relevance of the rise in historical awareness and the concept of progress in the late eighteenth century: one might rather assume that, proportionally, there would have been an increase in publishing on history towards the end of the eighteenth century. A more detailed look at the title count of documents including the word ‘history’ reveals substantial peaks in the numbers of titles published in the 1640s, the 1650s and the 1690s, attributable in part to the existence of the Thomason tracts, and also to a rise in the numbers of pamphlets published during those times.

Fig. 10: 

A comparison between the title count for history publications and for all documents in the ESTC catalogue, 1470–1800.

Analytical bibliography is not, of course, about counting pages (Tanselle, 2000).26 However, it may prove useful for estimating paper consumption. We have enough information, for instance, to distinguish three-volume works in folio from a half-sheet broadside based on our analytical ecosystem.27 We can study book production by looking at book sizes and, with regard to the ESTC, even by providing exact quantitative estimates of paper consumption for each year of this 300-year period during which the publishing system was established. The analysis of paper consumption requires information on document dimensions, page counts, and print-run sizes. Although none of these is available in the ESTC catalogue in a directly usable format, we can derive estimates based on the available information that can be extracted from the data fields via dedicated functions that we have implemented for this purpose. The cleaning up of the document dimensions is described in Section 4 above. Estimating page counts requires the summing up of information on cover pages and special pages, actual content, and possible multi-volume information, for example, according to the standard rules for page listing. The functions in the estc and bibliographica R packages can interpret the page-count field and convert it into exact numeric estimates of the total page count in each document. We have also added specific unit tests to check automatically that standard examples are converted correctly whenever the algorithms are updated. For print-run sizes we use the estimate of a rough ‘London average’ of 1,000 copies for every edition regardless of the format.28 It is well known that there is variation in actual print runs, and that the numbers rise considerably in times of crisis, especially regarding the most popular pamphlets (even up to 10,000). As a general rule it seems to apply fairly consistently that the print runs ranged from 750 to 1,250 (or 1,500 copies maximum).
When an edition was sold out a new one followed, and we can make quite good estimates of the number of copies sold, especially of books, by counting the number of editions. In cases of missing page-count information we have used averages calculated over books of a similar size, treating multi-volume sets as a separate category. We have also manually checked that the amount of missing information does not change significantly between historical periods and thus bias the analysis. There will also be cases of lost books, but because our approach does not rely on having a complete corpus, this will not form a bottleneck in terms of reaching general statistical conclusions.29 We could incorporate further perspectives to improve the estimates of overall paper consumption, such as taking supplementary information from the Early English Booktrade Database and other sources into our data-analytical ecosystem. In short, we have used the ESTC catalogue as a basis for an ecosystem where the idea is that this can then be supplemented from other information sources. In an excellent study Gants (2002) provides what he calls a snapshot of five years in the London book trade, examining in accurate detail the question of how much paper was used in sheets in London publishing. Whereas Gants measures the amount of paper used in making the books, we look at the volume of paper in the books recorded in the overall ESTC catalogue. We are not concerned with what might have been trimmed off these books, for example.
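To make the three ingredients of the estimate concrete (page count, format and print run), the following Python sketch computes paper consumption in sheets for a single edition. It is a simplified illustration under stated assumptions, not the estc or bibliographica code: the functions page_count and paper_sheets are hypothetical, the page-field parser handles only comma-separated arabic and roman sequences, and measuring consumption in printed sheets (pages divided by the pages one sheet yields in a given format) is our assumption about the unit. The print run of 1,000 is the ‘London average’ quoted above.

```python
import re

# Pages obtained from one printed sheet in each format
# (folio: 2 leaves, quarto: 4, octavo: 8, duodecimo: 12 leaves).
PAGES_PER_SHEET = {"folio": 4, "quarto": 8, "octavo": 16, "duodecimo": 24}
PRINT_RUN = 1000  # rough 'London average' per edition, as assumed in the text

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}


def roman_to_int(numeral):
    """Convert a lower-case roman numeral such as 'xii' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller value before a larger one is subtracted (e.g. 'ix').
        total += -value if nxt in ROMAN and ROMAN[nxt] > value else value
    return total


def page_count(field):
    """Sum the page sequences in a simplified field like '[8], xii, 200 p.'."""
    total = 0
    for part in field.lower().split(","):
        # Drop a trailing 'p.' marker and the brackets around unnumbered pages.
        token = re.sub(r"\s*p\.?$", "", part.strip()).strip("[] ")
        if token.isdigit():
            total += int(token)
        elif token and all(ch in ROMAN for ch in token):
            total += roman_to_int(token)
        # Anything else (e.g. plates, volume statements) is ignored here.
    return total


def paper_sheets(field, fmt, print_run=PRINT_RUN):
    """Estimate the sheets of paper consumed by one edition."""
    return page_count(field) / PAGES_PER_SHEET[fmt] * print_run


book = paper_sheets("[8], xii, 200 p.", "octavo")   # a 220-page octavo
pamphlet = paper_sheets("[2], 14 p.", "quarto")     # a 16-page quarto
```

Under these assumptions a single book of moderate length consumes several times the paper of a pamphlet, which is why title counts and paper consumption can tell different stories about the same author or place.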

A comparison of the total number of history titles (Figure 11) with our estimates of overall paper consumption in the same documents (Figure 12) shows that the overall volume of history publishing in fact rose rather sharply towards the end of the eighteenth century. The indication is that many more books were published than pamphlets, which had previously dominated publications on history.30 This finding supports the notion that the relevance of historical analysis did indeed take a new turn towards the end of the eighteenth century. Thus, whereas the number of history titles published annually over time remained stable, the publication volume measured in paper consumption rose exponentially during the eighteenth century.31 One might assume that the exponential growth in paper consumption was even more substantial if all the documents in the ESTC are taken into consideration.32 Simon Eliot (2007) states, ‘the explosion of book production and of all kinds of print production actually took place in the nineteenth century, and more precisely after 1850, after what has long been called the “industrial revolution.”’ This might be true, but at the same time, in terms of the actual volume of paper usage, the eighteenth-century development could be regarded as a genuine outburst during the handpress period.33 This also makes sense with regard to the technological innovation of machine presses in printing that eventually ended the handpress period in the 1830s. Our analysis suggests that handpress printing was pushed to its limits during the latter part of the eighteenth century, and that technological innovation was therefore called for.

Fig. 11: 

The title count of history publications, 1470–1800.

Fig. 12: 

Paper consumption in history publications, 1470–1800.

It is commonly assumed that the average size of book formats became smaller towards the end of the seventeenth century for practical reasons.34 This complies with our finding that more books on history were published during the eighteenth century. In this regard, too, fifteenth- and sixteenth-century publications tend to be more sizeable than eighteenth-century documents. A good example of this is Holinshed’s Chronicles, which shows as a clear publication peak in the 1570s according to the data (Figure 13). Later on, even though the folio size was the most common for books until the end of the seventeenth century in the sample of history documents studied for this article, history publishing involved fewer heroic undertakings than the Chronicles.

Fig. 13: 

Average paper consumption per document in history publications, 1470–1800.

In terms of book history, this article moves beyond the mere counting of titles and pages. It is obvious that both pamphlets and books played an important role in early modern publishing (Halasz, 1997; Raymond, 2003), and it is relevant to make a distinction between the two in studying publishing trends. When they are compared on the basis of the page count, as we have done (Figure 14), it is evident that although the average size of an individual book was larger during the earlier stages in the history of publishing, annual paper consumption rose steadily in the late seventeenth century.35 The octavo form was commonly used for books on history in the later part of the eighteenth century (Figure 15). We can also point to the time when folio-sized publications begin to decline. The increased number of pamphlets caused by the collection of variants in the Thomason Tracts included in the ESTC is also clearly visible in the paper consumption of quarto-sized documents during the Civil War era.36 It is thus clear that the octavo book heralded a new form of publishing in the eighteenth century. Intriguingly enough, a title-count-based comparison of the octavo and folio publications of top authors (Figure 16) reveals that Edmund Burke published the most octavo volumes. William Prynne, a seventeenth-century author, had no octavo publications at all, unlike some of his contemporaries (such as Shakespeare) who were also published in the eighteenth century. David Hume, whose overall output consumed the most paper based on our analysis, turns out to have a moderate number of publications in the octavo format.

Fig. 14: 

Paper consumption in books versus pamphlets, 1470–1800.

Fig. 15: 

Paper consumption for different book formats over time.

Fig. 16: 

A comparison by title count between the octavo and the folio format among the top authors (see Figures 1 and 2).

Specific peculiarities, such as the presence of the Thomason Tracts in the mid-seventeenth century, have to be taken into account when interpreting the library catalogues. At the same time we can infer that real social change is also reflected in the publication volumes. We consider this a more direct way of accounting for what has been considered crucial in book history since Elizabeth Eisenstein’s (1980) contributions to the field. For example, the English Civil War, as well as the Restoration, the Glorious Revolution and the Union debates of 1705–1706, are clearly reflected as increased numbers of published titles in the graph depicting publishing activity in Edinburgh (Figure 17). At the same time, American Independence did not seem to cause a publication peak in history titles in Edinburgh. This indicates that in the early modern period history was published locally during times that also turned out to have particular historical relevance in that particular place, suggesting a degree of subjectivity in the catalogue classifications.

Fig. 17: 

The publishing of historical works in Edinburgh on a timeline highlighting the eras of the English Civil War (1642–1651), the Restoration (1660), the Glorious Revolution (1688–1689), the Union Debates (1705–1706) and American Independence (1776).

Descriptive bibliography facilitates the analysis of books as material objects with particular shapes and sizes whose distribution and production had strong direct and indirect effects on the development of the early modern commercial society. [For the vast literature on this subject, see e.g., Barber (1994); Baron, Lindquist, & Shevlin (2007); Bermingham & Brewer (1995); Febvre & Martin (1990), pp. 155–166; Rich & Wilson (1977); Sher (2007); Steinberg (2001), pp. 106–129.] Librarians may make mistakes in inserting data into the catalogues, and not everything is catalogued, hence library and information studies are open to correction and improvement (De Morgan, 1853). The research is lacking in such large-scale analysis, however, as the catalogues have been used merely to locate individual books. It is clear that the value of the metadata in library catalogues and the potential new perspectives it opens up have been underestimated.

7. Conclusion

We have demonstrated how library catalogue data can be used to analyse large-scale developments of knowledge production such as the publishing of historical writings in early modern Britain and North America. This is only the beginning. Library data facilitates analysis of the development of various documents as material objects. In the case of history, for example, the long-term transformation from the folio to the octavo format is indicative of the growing readership as well as the wider distribution of historical knowledge during the later eighteenth century. Moreover, the analysis of paper consumption as well as the number of titles published highlights different aspects of publication volumes and allows differentiation between books and pamphlets, which in turn facilitates the analysis of individual authors and their oeuvre and places of publication. The case of the USA and the volume of pamphleteering in the field of history are indicative of the resonance of knowledge production with regard to social change, which is also evident in the publication activity in Edinburgh analysed above. Our automated and open source statistical tools constitute a promising starting point for similar studies developing quantitative, data-driven analyses covering the whole of early modern Europe, when work will begin on catalogues that really cross national boundaries. Although we believe that our analysis gives a robust picture of publication activity in the early modern era, the open-source approach guarantees that such a picture can only improve over time given that the data and algorithms are easily updated. Any detected errors or shortcomings can be reported via the issue tracker and corrected, and the full analysis can be updated automatically, taking the corrections into account in all the steps. It is evident that the book was the most important vehicle of knowledge transfer in early modern Europe (Johns, 1998).
Eventually, we believe, this kind of undertaking will be necessary to produce a more coherent account of historical constructions such as the Enlightenment and the European Republic of Letters. We anticipate that as libraries and other data holders give researchers and the general public full access to their data resources, there will be a rapidly increasing demand for open source ecosystems of data analysis in this field, the application of which we have demonstrated in this article.