During the 16th and 17th centuries, Seville, Spain, was the locus of the world’s largest trade flows. From 1503 to 1717, it housed the Casa de Contratación, the institution that centralized trade between Europe and the Spanish colonies in the Americas. As a result, the city ballooned in wealth and population; with the city’s specialization in trade, Sevillian artists started exporting their works abroad.
This GitHub repository is meant as a standalone project, but also as a resource for those conducting similar Digital Humanities initiatives. If you are working on computational ways of extracting information from archival documentation in historical languages through named entity recognition (NER), it provides a model and resources that can be tailored to your needs. In particular, it offers resources for documents in early modern Spanish, created for a database composed primarily of notarial and parish documents from the city of Seville.
This database includes information taken from 20 volumes published throughout the 19th and 20th centuries. These books compiled documents from several Sevillian archives on the activities of various local painters, sculptors, gilders, stonemasons, and architects, among other less common occupations. The books were OCR scanned, corrected for mistakes, and then divided into texts using OpenRefine.
Texts are stored as individual records within the database and usually (though not always) refer to a single archival document, either in transcription or summarized form. These texts are often accompanied by footnotes and comments that are included as an attribute of the text in the database. Where possible, we have included the archival reference to the original source, to the extent provided by the researchers that edited the published volumes.
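As a rough illustration of the record structure described above, the sketch below models a text as a row with its footnotes, comments and optional archival reference, and shows a literal string search over the stored texts. The README does not state the storage engine or schema: SQLite, the table name `texts`, every column name and the sample rows are assumptions made only for this example.

```python
import sqlite3

# Hypothetical schema: one row per text, with notes stored as attributes.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE texts (
        id                 INTEGER PRIMARY KEY,
        body               TEXT NOT NULL,
        form               TEXT CHECK (form IN ('transcription', 'summary')),
        footnotes          TEXT,
        comments           TEXT,
        archival_reference TEXT  -- NULL when the published volume gives none
    )
    """
)

# Invented sample rows, not taken from the database.
conn.executemany(
    "INSERT INTO texts (body, form, footnotes, comments, archival_reference) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Concierto para hacer un retablo.", "summary", None, None, None),
        ("Carta de pago otorgada por un pintor.", "summary", None, None, None),
    ],
)


def search_texts(connection, needle):
    """Return ids of texts whose body contains `needle` (case-sensitive)."""
    rows = connection.execute(
        "SELECT id FROM texts WHERE instr(body, ?) > 0 ORDER BY id",
        (needle,),
    )
    return [row[0] for row in rows]


matches = search_texts(conn, "retablo")
```

Using `instr` rather than `LIKE` keeps the search a plain substring match, so characters such as `%` or `_` in the query are treated literally.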
The database is not only a repository of documents, but is also meant as a repository of the information these documents contain. Separate tables register the actors, locations, objects, dates and monetary amounts present in each document, as well as the attributes of the document itself (archival reference, bibliographic source, footnotes and comments). Information on entities was extracted using the named entity recognizer made available by spaCy (the medium Spanish model, es_core_news_md). This model was retrained on a set of manually tagged training data to improve its performance on our documents. The training data was tagged on DataTurks.com.
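Preparing manually tagged data for retraining generally means converting the annotation tool's export into character-offset spans. The following is a minimal sketch of that step, not the project's actual code. It assumes a DataTurks-style JSON record with `content`, `annotation`, `label` and `points` fields and inclusive end offsets; the label names and the sample sentence are hypothetical. The output uses the `(start, end, label)` tuple form that spaCy's training utilities accept for entity spans.

```python
# Hypothetical label set mirroring the entity types named above.
LABELS = {"PER", "LOC", "ORG", "DATE", "MONEY", "OBJECT"}


def to_spacy_example(record, inclusive_end=True):
    """Convert one annotated record into (text, {"entities": [...]}).

    Each entity becomes a (start, end, label) tuple with an exclusive end
    offset, so text[start:end] yields the tagged string.
    """
    text = record["content"]
    entities = []
    for annotation in record.get("annotation") or []:
        label = annotation["label"][0]
        if label not in LABELS:
            continue  # skip labels outside the assumed scheme
        for point in annotation["points"]:
            start = point["start"]
            # Assumed: the export marks the last character inclusively.
            end = point["end"] + 1 if inclusive_end else point["end"]
            entities.append((start, end, label))
    entities.sort()
    return text, {"entities": entities}


def make_span(text, substring, label):
    """Build an annotation dict in the assumed export shape."""
    start = text.index(substring)
    point = {"start": start, "end": start + len(substring) - 1, "text": substring}
    return {"label": [label], "points": [point]}


# Invented sample sentence, used only to exercise the conversion.
sample_text = "Juan de Mesa, escultor, vecino de Sevilla."
sample = {
    "content": sample_text,
    "annotation": [
        make_span(sample_text, "Juan de Mesa", "PER"),
        make_span(sample_text, "Sevilla", "LOC"),
    ],
}

doc_text, annotations = to_spacy_example(sample)
```

If the export already uses exclusive end offsets, pass `inclusive_end=False`; checking that `doc_text[start:end]` reproduces each tagged string is a quick way to catch off-by-one errors before training.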
This GitHub repository records the development process of the database, including the code used, training sets, output and further resources.
Document Viewer
A tool to search for strings within documents contained in the database.
Research Completeness
An overview of the texts included in the database, by published source and original archival source.
Tags: Summary Visualizations
Summary visualizations to describe the strings identified as entities (people, locations, organizations, dates, monetary amounts and objects).
NER Resources
Resources for replicating, learning from or expanding on the NER employed in this project.
This project was developed by Felipe Álvarez de Toledo as part of a Ph.D. dissertation in the Department of Art, Art History and Visual Studies at Duke University and as part of DALMI, the Duke Art, Law and Markets Initiative.
This project is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. This license applies to the database and its contents and the resources made available in this repository. It does not apply to the texts included in the database themselves, which were taken from published sources.