EEBO TCP Metadata Mashup
Over the past year I've spent some time recreating the metadata of Phase 1 and 2 texts from the Text Creation Partnership's hand-coded SGML files of Early English Books Online for the Early Modern Conversions digital humanities project 'Distant Reading Early Modernity' (DREaM). It's been an interesting process working through the 44,418 texts of the TCP corpus (McGill has access to Phase 2 as a TCP partner). Though it first involved marrying the TEI metadata headers with the text bodies to create a master file for each text, subsequent work has focused on each part in turn. Last year I extracted
While a slight degree of orthographic normalization was necessary for us to use the TCP corpus in Stefan Sinclair's text-analysis suite, Voyant Tools, it was clear that the metadata itself could be richer. The essential objective of DREaM is the creation of user-driven corpora, a kind of data-driven or query-created automatic archive builder. TCP's existing metadata contained a certain amount of information, but wasn't always clean (for instance the
This left a major gap - the notorious publication data. It's important to note at this point that early printers did not establish standardized attribution formats for printing, or even the role of 'publisher', until into the 17th century, if not later. The result is highly variant strings of text that can either be absent from the TCP metadata or contain a wealth of information, including printers, publishers, sponsors, dedications, locations of printing or sale, as well as dates. Despite its obvious importance to the study of early modern English print culture, this information has not been parsed by OCLC or anyone else - until now.
Tearing apart the publication string was a messy process. It involved creating two gazetteers, and bootstrapping the resulting places and individuals so that each successive pass over the 23,644 distinct strings became more precise. In the end this required some 20 passes over the data, along with manual editing and cleaning of the gazetteer lists, especially the one for agents. The resulting metadata headers are very much unlike their predecessors, even the initial version which mashed up the OCLC data with TCP. For this revision I turned exclusively to VIAF and its API. Whereas the OCLC data carries a certain level of confidence, the VIAF data was both troublesome to handle and inexact. I cannot edit all 44,418 headers, nor do I have the expertise to discern which B. Alsop is correct when VIAF's biographical metadata lacks any birth or death dates. Moreover, there are innumerable examples of 'F.S.' or 'E.M.' in the publication strings that make attribution nigh impossible, even with VIAF dates that might allow us an approximation.
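To give a rough idea of what bootstrapping a gazetteer over repeated passes can look like, here is a minimal Python sketch. It is not the code I actually ran: the cue words, the regular expressions, the function names (`one_pass`, `bootstrap`) and the sample strings are all invented for illustration, and it only handles agents, not places. The point is simply that names learned in one pass become anchors for finding further names in the next.

```python
import re

# Hypothetical pattern for a personal name: one or more capitalised tokens,
# allowing initials such as "B." (an assumption, far cruder than real data needs).
NAME = r"((?:[A-Z][\w.]*\s?)+)"

# Hypothetical cue words that often precede an agent in an imprint.
CUE = re.compile(r"\b(?:by|for)\s+" + NAME)


def one_pass(strings, agents):
    """Return agent names found in this pass that are not yet in the gazetteer."""
    found = set()
    for s in strings:
        # Rule 1: a name directly after a cue word.
        for m in CUE.finditer(s):
            found.add(m.group(1).strip())
        # Rule 2: a name joined by "and" to an agent we already know.
        for known in agents:
            m = re.search(re.escape(known) + r",? and\s+" + NAME, s)
            if m:
                found.add(m.group(1).strip())
    return found - agents


def bootstrap(strings, seed=(), max_passes=20):
    """Repeat passes until nothing new turns up or max_passes is reached."""
    agents = set(seed)
    passes = 0
    for passes in range(1, max_passes + 1):
        new = one_pass(strings, agents)
        if not new:
            break
        agents |= new
    return agents, passes


# Made-up sample imprints, for illustration only.
sample = [
    "London, Printed by B. Alsop and T. Fawcet, 1634",
    "Printed by T. Fawcet and R. Young",
]
agents, passes = bootstrap(sample)
print(sorted(agents), passes)
```

In this toy run the second partner in each imprint is only picked up once the first partner is already in the gazetteer, which is why more than one pass is needed. In practice the lists still need the manual cleaning described above between passes.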
As a result I opted to create data that asserted possible matches with varying confidence levels. When a possible VIAF match contained an identical publication string within the data, it was counted as an exact match. If the trawl through VIAF found an exact match for a name, with a stable VIAF ID, where a text's publication date fell between the individual's birth and death dates, I took it as a good indication that we had the right person(s). If the VIAF record lacked one of the dates, but the remaining birth or death date indicated a certain proximity, the confidence wasn't as clear, but nonetheless the match was plausible. The third category was when a name or a permutation matched, but there were no dates to go on. Often this occurred when an individual's first name was merely indicated by an initial. There were also instances where VIAF matching didn't create any possibilities, but an individual was also an author and described in the OCLC data. The mashed up data, as a consequence of this approach, lists all of the possible matches for VIAF data along with confidence indicators, allowing researchers to work through what might be a good match, and what isn't. It's important to note that in many cases the canonical name from VIAF's RDF.xml files
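The confidence tiers described above can be sketched roughly as follows. This is an illustration rather than the script I used: the `Candidate` fields, the label strings and especially the 80-year proximity `window` are assumptions made for the example, and the OCLC author fallback is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """A possible VIAF match for one agent (hypothetical structure)."""
    name_matches: bool                  # name or a permutation of it matched
    birth: Optional[int] = None         # birth year, if VIAF has one
    death: Optional[int] = None         # death year, if VIAF has one
    has_same_pub_string: bool = False   # identical publication string in the VIAF data


def confidence(c: Candidate, pub_year: int, window: int = 80) -> str:
    """Assign a confidence label; `window` is an assumed proximity limit in years."""
    if c.has_same_pub_string:
        return "exact"
    if not c.name_matches:
        return "none"
    if c.birth is not None and c.death is not None:
        # Both dates known: the publication must fall within the lifetime.
        return "good" if c.birth <= pub_year <= c.death else "none"
    if c.birth is not None:
        # Only a birth date: accept publications within `window` years after it.
        return "plausible" if 0 <= pub_year - c.birth <= window else "none"
    if c.death is not None:
        # Only a death date: accept publications within `window` years before it.
        return "plausible" if 0 <= c.death - pub_year <= window else "none"
    # Name matched but there are no dates to go on.
    return "name-only"


print(confidence(Candidate(True, birth=1580, death=1640), 1620))
print(confidence(Candidate(True, birth=1580), 1620))
print(confidence(Candidate(True), 1620))
```

Keeping every candidate with its label, rather than picking a winner, is what lets a researcher decide later whether a given match is good enough.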
There's definitely room for improvement. I'd like to cycle through all of the
And so for now, I'm publishing Phase 1 as a dataset for anyone to use. Caveat Emptor - it's a bit of a hack, and is messy in places. There are some 5000 headers that lack OCLC data; some of these might contain VIAF matches in the
The following dataset is distributed according to the original EEBO TCP license and access arrangements.
EEBO TCP Phase 1 DREaM Metadata Headers - May 6, 2016 (~43MB zipped)