17 Nov 2015 00:00

EEBO TCP Metadata Mashup

Over the past year I've spent some time recreating the metadata of Phase 1 and 2 texts from the Text Creation Partnership's hand-coded SGML files of Early English Books Online for the Early Modern Conversions digital humanities project 'Distant Reading Early Modernity' (DREaM). It's been an interesting process working through the 44,418 texts of the TCP corpus (McGill has access to Phase 2 as a TCP partner). Though it first involved marrying the TEI metadata headers with the text bodies to create a master file for each text, subsequent work has focused on each part in turn. Last year I extracted elements from each file where the "lang" attribute contained "eng" as a value, in order to create English-only files we could use for orthographic normalization. I retained the original sequence of the elements, and ran the results through Alistair Baron's VARD2 to produce versions of the resulting corpus of 40,170 texts at various levels of normalization. We're working on an article describing the process so I'll keep it brief: after a few passes we decided that the appropriate normalization match level was 45% probability rather than 50%. There were a few hiccups, but after training the software, we obtained a level of normalization that seems to balance our normalization needs with retaining ambiguous variants that are often quite context-specific.
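For readers who want a concrete picture of the English-only extraction step, here is a minimal sketch. It is not the DREaM code: it assumes the files have already been converted to well-formed XML with a lowercase `lang` attribute, the sample markup is invented, and `english_elements` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented sample; real TCP files are far larger and use their own element names.
SAMPLE = """<text>
  <body>
    <p lang="eng">The first Englysh paragraph.</p>
    <p lang="lat">Gallia est omnis divisa.</p>
    <div lang="eng lat"><p>Mixed division.</p></div>
    <p lang="eng">The second paragraph.</p>
  </body>
</text>"""


def english_elements(root):
    """Return elements whose lang attribute contains 'eng', in document order.

    Once an element is selected its descendants are not visited again,
    so no text is extracted twice.
    """
    selected = []

    def walk(elem):
        if "eng" in elem.get("lang", "").lower().split():
            selected.append(elem)
            return
        for child in elem:
            walk(child)

    walk(root)
    return selected


root = ET.fromstring(SAMPLE)
english_texts = ["".join(e.itertext()).strip() for e in english_elements(root)]
print(english_texts)
```

Walking the tree depth-first is what preserves the sequence of the elements; the Latin paragraph is simply skipped.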

While a slight degree of orthographic normalization was necessary for us to use the TCP corpus in Stefan Sinclair's text-analysis suite, Voyant Tools, it was clear that the metadata itself could be richer. The essential objective of DREaM is the creation of user-driven corpora, a kind of data-driven or query-created automatic archive builder. TCP's existing metadata contained a certain amount of information, but wasn't always clean (for instance, the date elements render dates in various ways, and there were some artifacts like L's for 1's and O's for 0's). Moreover, it didn't contain any of the Linked Open Data that accompanies the corresponding OCLC records. Mashing up OCLC Linked Open Data with EEBO TCP offered DREaM a way of enriching the TCP metadata in key ways - perhaps most critically by allowing us to describe the gender of authors. At the same time, it wasn't straightforward. OCLC's records, which describe manifestations (to use a FRBR term), are conceptually different from EEBO's, which, being based on the Short Title Catalogue, document instances or actual physical copies. The first step was drilling through OCLC's data to find suitable matches, first by title (cleaned on both sides, and compared using Levenshtein and Metaphone distances), and only after a possible title match, by date and author. In all, this first phase obtained OCLC IDs for c. 39,000 of the 44,418 texts. With this I produced the first revised version of the metadata, pulling in information from the linked VIAF records to flesh out authorial data like dates of birth and death, and gender (EEBO uses TEI, which employs a sex element, rather than gender, to describe this information, despite the problems inherent in the sex / gender distinction). At the same time, I also cleaned up the date and place of publication fields, pulling in data where appropriate.
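To give a sense of what the title comparison looks like, here is a small sketch of the Levenshtein half of it. The cleaning rules, the 0.85 threshold and the function names are all assumptions made for illustration, not the values used for DREaM, and the Metaphone comparison is left out because it would need a third-party library.

```python
import re


def clean_title(title):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    title = re.sub(r"[^a-z0-9\s]", " ", title.lower())
    return " ".join(title.split())


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def title_similarity(a, b):
    """Similarity in [0, 1] after cleaning both titles."""
    a, b = clean_title(a), clean_title(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def candidate_matches(tcp_title, oclc_titles, threshold=0.85):
    """Return (score, title) pairs at or above the threshold, best first."""
    scored = [(title_similarity(tcp_title, t), t) for t in oclc_titles]
    return sorted((s for s in scored if s[0] >= threshold), reverse=True)


print(candidate_matches("The Anatomy of Melancholy",
                        ["The anatomie of melancholy", "Paradise Lost"]))
```

Only the candidates that survive this title pass would then be checked against dates and authors.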

This left a major gap - the notorious publication data. It's important to note at this point that early printers did not establish standardized attribution formats for printing or even the role of 'publisher' until into the 17th century, if not later. The result is a set of highly variant strings of text: they can be absent from the TCP metadata altogether, or they can contain a wealth of information, including printers, publishers, sponsors, dedications, locations of printing or sale, as well as dates. Despite its obvious importance to the study of early modern English print culture, this information has not been parsed by OCLC or anyone else - until now.

Tearing apart the publication string was a messy process. It involved creating two gazetteers, one for places and one for agents, and bootstrapping the resulting places and individuals so that each successive pass over the 23,644 distinct strings became more precise. In the end this required some 20 passes over the data, as well as manual editing and cleaning of the gazetteer lists, especially the one for agents. The resulting metadata headers are very much unlike their predecessors, even the initial version which mashed up the OCLC data with TCP. For this revision I turned exclusively to VIAF and its API. Whereas the OCLC data has a certain level of confidence, the VIAF data was both troublesome to handle and inexact. I cannot edit all 44,418 headers, nor do I have the expertise to discern which B. Alsop is correct when VIAF's biographical metadata lacks any birth or death dates. Moreover, there are innumerable examples of 'F.S.' or 'E.M.' in the publication strings that make attribution nigh impossible, even with VIAF dates that might allow us an approximation.
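The bootstrapping idea can be sketched in a few lines. This is only an illustration of the general pattern, not the procedure actually used: the single discovery rule here (a capitalized name joined by 'and' to an agent already in the gazetteer is itself treated as an agent) is a hypothetical stand-in, and the sample strings are invented.

```python
import re

# Hypothetical pattern for a personal name: two or more capitalized words.
NAME = r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+"


def bootstrap_agents(strings, seed_agents, max_passes=20):
    """Grow an agent gazetteer over repeated passes of the strings.

    Each pass looks for names conjoined with an already-known agent and
    adds them, so later passes can build on what earlier passes found.
    """
    agents = set(seed_agents)
    for _ in range(max_passes):
        discovered = set()
        for known in agents:
            pattern = re.compile(
                r"\b" + re.escape(known) + r"\s+and\s+(" + NAME + r")"
                r"|(" + NAME + r")\s+and\s+" + re.escape(known) + r"\b"
            )
            for s in strings:
                for m in pattern.finditer(s):
                    discovered.add(m.group(1) or m.group(2))
        discovered -= agents
        if not discovered:  # nothing new: the gazetteer has stabilized
            break
        agents |= discovered
    return agents


strings = [
    "Printed by John Day and William Seres, London, 1549",
    "Imprinted at London by William Seres and Richard Jugge",
    "London, printed for Thomas Man",
]
print(sorted(bootstrap_agents(strings, {"John Day"})))
```

In this toy run Richard Jugge is only reachable on the second pass, once William Seres has been added, which is the sense in which each pass becomes more precise. A rule this naive also produces false positives on real data, hence the manual cleaning of the lists.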

As a result I opted to create data that asserted possible matches with varying confidence levels. When a possible VIAF match contained an identical publication string within its data, it was counted as an exact match. If the trawl through VIAF found an exact match for a name, with a stable VIAF ID, where a text's publication date fell between the individual's birth and death dates, I took it as a good indication that we had the right person(s). If the VIAF record lacked one of the dates, but the remaining birth or death date was reasonably close to the publication date, the confidence wasn't as clear, but the match was nonetheless plausible. The third category was when a name or a permutation matched, but there were no dates to go on. Often this occurred when an individual's first name was indicated merely by an initial. There were also instances where VIAF matching didn't produce any possibilities, but an individual was also an author and described in the OCLC data. The mashed up data, as a consequence of this approach, lists all of the possible VIAF matches along with confidence indicators, allowing researchers to work through what might be a good match, and what isn't. It's important to note that in many cases the canonical name from VIAF's RDF.xml files doesn't appear in the publication string - results are collated according to the matching name, as found in the string itself. There are a few surprises: Reginalde Wolfe, for instance, appears as Reyner Wolfe, as that's what VIAF has in its data for the first authority name.
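The date-based tiers described above can be expressed as a small decision function. This is a sketch under stated assumptions: the label strings, the 80-year proximity window and the 'out-of-range' outcome are all invented for illustration, and the sketch presumes the name itself has already matched.

```python
def match_confidence(pub_year, birth=None, death=None, window=80):
    """Classify a name-matched VIAF candidate by how its dates fit.

    pub_year : year of publication of the text
    birth, death : years from the VIAF record, or None when missing
    window : hypothetical number of years counted as 'close' when
             only one of the two dates is known
    """
    if birth is not None and death is not None:
        # Both dates known: publication must fall within the lifetime.
        return "good" if birth <= pub_year <= death else "out-of-range"
    if birth is not None:
        # Only a birth date: accept publication within the window after it.
        return "plausible" if birth <= pub_year <= birth + window else "out-of-range"
    if death is not None:
        # Only a death date: accept publication within the window before it.
        return "plausible" if death - window <= pub_year <= death else "out-of-range"
    # No dates at all: the match rests on the name alone.
    return "name-only"


for dates in [(1560, 1630), (1560, None), (None, 1610), (None, None)]:
    print(dates, match_confidence(1600, *dates))
```

Keeping every candidate with its label, rather than picking a winner, is what lets a researcher decide later which B. Alsop is the right one.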

There's definitely room for improvement. I'd like to cycle through all of the elements in the matched VIAF IDs to build better matches in cases like Reginalde Wolfe, or Edigius van der Erve. The real problem in these last stages has been the simple fact that VIAF's AutoSuggest API is buggy - they've admitted as much. Sometimes it pulls in co-authors' names as the canonical name for a record. That's why Nicholas Bourne appears as Thomas Goodwill, and why I've had to try and trap matches after the fact. The next steps include obtaining the remaining c. 5,000 matches for the OCLC metadata, and working out a clearer method for handling publication by 'assignes of', or cases where a printer's or author's name appears in the publication string after their death. Another real objective is the modeling of historic addresses contained within the strings; we're already well on the way in this regard.

And so for now, I'm publishing Phase 1 as a dataset for anyone to use. Caveat Emptor - it's a bit of a hack, and is messy in places. There are some 5000 headers that lack OCLC data; some of these might contain VIAF matches in the . And a final note - it's not clear to me how to structure possible matches using TEI. This is valid XML, but a note of caution about the TEI encoding for contained within . If you have revisions, please email me, and I can get cracking on cleaning it up. Here's to a hack, and moving forward!

The following dataset is distributed according to the original EEBO TCP license and access arrangements.

EEBO TCP Phase 1 DREaM Metadata Headers - May 6, 2016 (~43MB zipped)

@milner_matt