Creating Dickens Search
What is Dickens Search?
Dickens Search is an open access resource that aims to combine book history, and a commitment to bibliographic transparency, with digital tools. In practice, this means that Dickens scholars locate, authenticate and transcribe first printings, first editions and (eventually) lifetime editions of Dickens’s works (including serials, novels, short stories, prefaces, journalism, speeches and poetry), and make them freely available on our website. While uploading texts, we use an adapted version of Dublin Core (an archival metadata standard) to create records for items, and simultaneously create TEI files. The metadata and text for each file are downloadable (under a CC-BY-NC-SA 4.0 licence), and these prepared digital texts will be used as a dataset for digital transformation. For example, users can already view Ngrams that draw on the collections uploaded to the website so far.
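To give a rough sense of how a Dublin Core-style record and a TEI file can be produced from the same item, here is a minimal Python sketch. It is an illustration only, not the site's actual code: the field names follow standard Dublin Core element names (title, creator, date, source, rights), the sample values are invented placeholders, and the function name build_tei is hypothetical. The project's adapted schema may well differ.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the standard TEI P5 namespace

# Placeholder record using Dublin Core element names.
# The values are invented for illustration, not taken from the site.
SAMPLE_RECORD = {
    "title": "Sample Poem",
    "creator": "Charles Dickens",
    "date": "1850",
    "source": "Sample Periodical",
    "rights": "CC-BY-NC-SA 4.0",
}


def build_tei(record, paragraphs):
    """Return a minimal TEI document (as a string) for one transcribed item."""
    ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

    def tei(tag):
        return f"{{{TEI_NS}}}{tag}"

    root = ET.Element(tei("TEI"))

    # teiHeader/fileDesc carries the bibliographic description of the file.
    header = ET.SubElement(root, tei("teiHeader"))
    file_desc = ET.SubElement(header, tei("fileDesc"))

    title_stmt = ET.SubElement(file_desc, tei("titleStmt"))
    ET.SubElement(title_stmt, tei("title")).text = record["title"]
    ET.SubElement(title_stmt, tei("author")).text = record["creator"]

    pub_stmt = ET.SubElement(file_desc, tei("publicationStmt"))
    ET.SubElement(pub_stmt, tei("p")).text = record["rights"]

    source_desc = ET.SubElement(file_desc, tei("sourceDesc"))
    ET.SubElement(source_desc, tei("p")).text = f'{record["source"]}, {record["date"]}'

    # text/body holds the transcription itself, one <p> per paragraph.
    text = ET.SubElement(root, tei("text"))
    body = ET.SubElement(text, tei("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tei("p")).text = paragraph

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_tei(SAMPLE_RECORD, ["First paragraph.", "Second paragraph."]))
```

Keeping the metadata in a single record and generating the TEI header from it means the two cannot drift apart, which is one plausible reason for creating them at the same time.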
There are two driving factors that have led to the creation of this database. The first stems from my own research into Dickens’s less popular forms of output, including his speeches and short stories. A Hyper-Concordance of the novels, Christmas books, reprinted pieces and select criticism has been online since 2003 (created by Mitsuharu Matsuoka). The more recent and more sophisticated CLiC Dickens has supplemented this functionality with corpus linguistics tools, which have been used to open up not only Dickens’s fiction but also other writers’ works. It is much harder, however, to find searchable, authoritative text of Dickens’s other kinds of writing online. The digital landscape has shifted enormously in recent years, making a vast amount of digitised material freely available, but the Open Access Dickens available to users is still really only the Dickens of the novels. Being able to search across different formats, and analyse and visualise a more complete body of the author’s work, would provide a more nuanced understanding of Dickens as a writer, as well as demonstrating what might be possible for other writers and other bodies of work.
Even with the fiction, where projects such as HathiTrust, the Internet Archive and Project Gutenberg have made OCRed and/or volunteer-transcribed text available, a user must contend with an overwhelming number of results across a range of sites: a search for The Pickwick Papers on HathiTrust alone gives over 17,000 results. It is surprisingly difficult to find first printings using these databases, though they are present, and working with the data involves navigating a range of different providers (and associated copyright issues). Dickens Search is intended to simplify this process and to offer the texts under less restrictive terms. As it develops, the functionality will also expand to facilitate comparison of different versions of the same text. Currently, users can view two texts side by side using our compare feature; we hope to enhance this tool to highlight textual differences between two items.
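One way that highlighting textual differences between two versions could work is a word-level comparison. The sketch below uses Python's standard difflib module; it is only an illustration of the general technique, not a description of how the compare feature is or will be built, and the function name word_differences is hypothetical.

```python
import difflib


def word_differences(text_a, text_b):
    """List the word-level changes needed to turn text_a into text_b.

    Each change is a tuple (operation, words_from_a, words_from_b), where
    operation is 'replace', 'delete' or 'insert'. Unchanged runs are skipped.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False stops difflib from ignoring very frequent words in long texts.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)

    changes = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            changes.append(
                (op, " ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return changes


if __name__ == "__main__":
    # Invented sample strings, used only to show the shape of the output.
    for change in word_differences("the old grey coat", "the old gray coat"):
        print(change)
```

Comparing whole words rather than characters keeps the output readable for variants such as spelling changes between printings, at the cost of treating punctuation attached to a word as part of that word.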
The second motivation derives from my time on the Oceanic Exchanges: Tracing Global Information Networks in Historical Newspaper Repositories, 1840-1914 project. In addition to giving me a grounding in database structure and metadata that has facilitated the technical construction of this database, the collaborative aspect of the project also introduced me to a range of tools that supplement the creation of concordances and offer new insights into a large dataset: in particular, the project used ShiCo, an open-source tool, to explore conceptual shift in digitised newspapers across linguistic and national boundaries using word vectors. While exploring the applications of this tool for nineteenth-century newspapers, the possibilities that would be created by turning such a tool to Dickens’s works were obvious; though the volume of text is not comparable to a century of newspapers, there are few other writers with such a prolific output across so many forms. So while this might not be feasible for many writers, Dickens’s fifteen novels contain 3,833,544 total words (as CLiC has calculated) and adding his other outputs to this should create a volume workable for exciting digital transformations, including conceptual analysis, sentiment analysis and topic modelling.
What have we done so far?
Our collaboration started in late 2020, and website development started in January 2021. In terms of content, Lydia and I began with Dickens’s poetry, as the small size of the corpus enabled us to make fundamental decisions about what kind of metadata to record and how. In the course of preparing this collection, we have identified several poems mistakenly attributed to Dickens, and others that have not appeared in any collected volume of Dickens’s poetry to date. We have also begun uploading early sketches and speeches. We have added functionality, as described above, to compare texts, perform Ngram searches and view the most common words in a specific sub-collection. We also have the capability to crowdsource transcriptions (though this is not currently active). In my next blog post I’ll delve into more of the technical side of the website, which has been developed entirely using open source, free-to-use software.
What’s next?
The immediate priority for Dickens Search is to locate, transcribe and upload first printings of all of Dickens’s works, working with existing databases and library partners. We anticipate this will largely be complete by the end of 2022. As we complete a collection (such as poetry, short stories, serials and so on), we will create Ngram corpora to make that collection searchable. Eventually, the full corpus of Dickens’s works will be searchable, and users will be able to view Dickens’s use of chosen words and phrases across the whole of his writing career.
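As a rough illustration of what an Ngram search across a writing career involves, the following Python sketch counts how often a phrase occurs in dated texts. It is a simplified sketch under stated assumptions, not the site's implementation: the tokeniser is deliberately naive, and the function names (tokenize, ngram_counts, phrase_counts_by_year) and sample sentences are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into words (a deliberately naive tokeniser)."""
    text = text.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)


def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    tokens = tokenize(text)
    return Counter(zip(*(tokens[i:] for i in range(n))))


def phrase_counts_by_year(documents, phrase):
    """Count occurrences of a phrase per year.

    documents is a list of (year, text) pairs; the result maps year -> count.
    """
    target = tuple(tokenize(phrase))
    totals = Counter()
    for year, text in documents:
        totals[year] += ngram_counts(text, len(target))[target]
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    # Invented sample sentences, not quotations from Dickens.
    sample = [
        (1840, "The old man and the old dog"),
        (1850, "An old friend"),
    ]
    print(phrase_counts_by_year(sample, "the old"))
```

The same Counter objects also give the most common words in a collection via their most_common method, so one pass over the texts can support both kinds of view.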
Tools such as topic modelling and ShiCo, as mentioned above, need a critical mass to provide meaningful results. Once we have the volume to implement these enhanced tools, we will make them publicly available on the website.
Currently, in addition to the Ngram search, users of the website can search for specific texts, narrowing down by fields including publication and date. The search functionality of the website will be enhanced to improve the relevance and organisation of results (for example, a snippet view for text search will be added).
We are also open to suggestions! Dickens Search is intended to be a collaborative, open access resource supportive of the Dickens studies community, so please contact us if there are particular tools or approaches you might be interested in us adding to the site.
Currently, this website has no official institutional or financial support. With the launch of the website in summer 2021, we will be pursuing funding opportunities and partnerships to ensure its longevity.
How to Cite:
Bell, Emily. ‘Creating Dickens Search.’ Dickens Search. 2 July 2021. Accessed [date]. https://dickenssearch.com/creating-dickens-search.