I study massive cultural datasets in order to address new and existing research questions in the humanities and social sciences. I specialize in the application of statistical computing to large text and image corpora. A particular interest is the study of data containing linked text and images, such as newspapers with embedded figures or television shows with associated closed captions. Research products take several forms: book-length manuscripts, technical reports, new software implementations, and digital projects meant for broad public consumption. My work has received funding from the NEH, DARPA, and the ACLS.
The package cleanNLP provides a set of fast tools for converting a textual corpus into a set of normalized tables. The underlying natural language processing pipeline utilizes Stanford’s CoreNLP library, exposing a number of annotation tasks for text written in English, French, German, and Spanish. Annotators include tokenization, part of speech tagging, named entity recognition, entity linking, sentiment analysis, dependency parsing, coreference resolution, and information extraction.
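The idea of a "normalized table" of annotations can be illustrated with a minimal base R sketch. This is not cleanNLP's actual API — its function names and schema differ across versions — but it shows the shape of the output: one row per token, keyed by document and token id.

```r
# Illustrative sketch only: turn a character vector of documents into a
# one-row-per-token table, in the spirit of cleanNLP's normalized output.
# (Column names here are made up for illustration, not the package schema.)
tokenize_to_table <- function(docs) {
  do.call(rbind, lapply(seq_along(docs), function(i) {
    words <- unlist(strsplit(tolower(docs[[i]]), "[^a-z']+"))
    words <- words[nchar(words) > 0]
    data.frame(doc_id = i, token_id = seq_along(words),
               token = words, stringsAsFactors = FALSE)
  }))
}

docs <- c("The cat sat on the mat.", "It was happy.")
tbl <- tokenize_to_table(docs)
head(tbl)
```

The real package layers part of speech tags, dependencies, and entities onto tables like this one, joined by the document and token keys.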
Keras is a high-level neural networks API, originally written in Python, and capable of running on top of either TensorFlow or Theano. It was developed with a focus on enabling fast experimentation. This package provides an interface to Keras from within R. All objects returned by functions in this package are either native R objects or raw pointers to Python objects, making it possible for users to access the entire Keras API. The main benefits of the package are (1) correct, manual conversion of R inputs to Python types, (2) R-side documentation, and (3) examples written using the R API.
The iotools package provides a set of tools for input- and output-intensive data processing in R. The functions chunk.apply and read.chunk are supplied to allow for iteratively loading contiguous blocks of data into memory as raw vectors. These raw vectors can then be efficiently converted into matrices and data frames with the iotools functions mstrsplit and dstrsplit. These functions minimize copying of data and avoid the use of intermediate strings in order to drastically improve performance. Finally, we also provide read.csv.raw to allow users to read an entire dataset into memory with the same efficient parsing code. In this paper, we present these functions through a set of examples with an emphasis on the flexibility provided by chunk-wise operations.
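The chunk-wise pattern that chunk.apply generalizes can be sketched in a few lines of base R. This stand-in uses readLines on a connection rather than iotools's raw-vector machinery, so it trades away the performance the package provides, but the control flow is the same: read a block, process it, accumulate, repeat.

```r
# Chunk-wise processing sketch: stream a file in fixed-size blocks and
# accumulate a result. This is the pattern iotools::chunk.apply optimizes
# by operating on raw vectors instead of character vectors.
tmp <- tempfile()
writeLines(as.character(1:100), tmp)

con <- file(tmp, "r")
total <- 0
repeat {
  lines <- readLines(con, n = 25)          # one chunk of at most 25 lines
  if (length(lines) == 0) break
  total <- total + sum(as.numeric(lines))  # per-chunk work
}
close(con)
total  # 5050, the same answer as reading the whole file at once
```

Because each chunk is discarded after processing, peak memory is bounded by the chunk size rather than the file size.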
The way materials are archived and organized shapes knowledge production. We argue that recommender systems offer an opportunity to discover new humanistic interpretative possibilities. We can do so by building new metadata from text and images for recommender systems to reorganize and reshape the archive. In the process, we can remix and reframe the archive, allowing users to mine it in multiple ways while making visible the organizing logics that shape interpretation. To show how recommender systems can shape the digital humanities, we look closely at how they are used in digital media and then apply them to the digital humanities by focusing on the Photogrammar project, a Web platform showcasing US government photography from 1935 to 1945.
08. Uncovering Latent Metadata in the FSA-OWI Photographic Archive. Taylor Arnold, Lauren Tilton, Stacey Maples, and Laura Wexler. Digital Humanities Quarterly. 11.2 (2017). HTML, Project Website, GitHub.
We present our use of the FSA-OWI Photographic Archive, a collection of over 170,000 photographs taken by the US Government between 1935 and 1945, within Photogrammar as a case study of how to integrate methodological research into a digital, public project. Our work on the collection uses computational methods to extract new metadata surrounding individual photographs from the perspective of both the photographers and the original government archivists. Techniques for accomplishing this include mapping over an historical atlas, recreating historic cataloging systems, and digitally stitching together rolls of film. While many digital projects have focused on analysis at scale, our work on extracting new metadata actively demonstrates the power of digital techniques to assist in the close reading of even small sets of archival records.
07. Basic Text Processing in R. Taylor Arnold and Lauren Tilton. The Programming Historian (2017). HTML.
Learn how to use R to analyze high-level patterns in texts, apply stylometric methods over time and across authors, and use summary methods to describe items in a corpus. All of these will be demonstrated on a dataset from the text of United States Presidential State of the Union Addresses.
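The kind of high-level summary the lesson builds up to can be previewed in a few lines of base R. The corpus below is a made-up stand-in, not the State of the Union data the lesson actually uses.

```r
# Word-frequency summary of a tiny stand-in corpus, the sort of
# high-level pattern the lesson computes over presidential addresses.
corpus <- c("we the people", "we hold these truths", "we shall overcome")
words  <- unlist(strsplit(corpus, " "))
freq   <- sort(table(words), decreasing = TRUE)
head(freq)  # "we" is the most frequent word, appearing three times
```

Stylometric comparisons across authors or over time start from exactly these per-document frequency tables.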
06. Efficient Implementations of the Generalized Lasso Dual Path Algorithm. Taylor Arnold and Ryan Tibshirani. Journal of Computational and Graphical Statistics, 25.1, 1-27 (2016). HTML, arXiv, GitHub.
We consider efficient implementations of the generalized lasso dual path algorithm. We first describe a generic approach that covers any penalty matrix D and any (full column rank) matrix X of predictor variables. We then describe fast implementations for the special cases of trend filtering problems, fused lasso problems, and sparse fused lasso problems, both with X = I and a general matrix X. These specialized implementations offer a considerable improvement over the generic implementation, both in terms of numerical stability and efficiency of the solution path computation. These algorithms are all available for use in the genlasso R package, which can be found in the CRAN repository.
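For concreteness, the penalty matrix D for the one-dimensional fused lasso is the first-difference operator, which can be constructed in base R. The genlasso package builds such matrices internally; the sketch below only illustrates what the penalty measures.

```r
# First-difference penalty matrix D for the 1-d fused lasso:
# row i of D encodes beta[i+1] - beta[i], so the penalty
# sum(abs(D %*% beta)) totals the absolute successive differences,
# encouraging piecewise-constant solutions.
n <- 5
D <- diff(diag(n))           # (n-1) x n first-difference matrix
beta <- c(1, 1, 3, 3, 3)     # piecewise-constant vector with one jump
drop(D %*% beta)             # c(0, 2, 0, 0): a single jump of size 2
sum(abs(D %*% beta))         # fused lasso penalty value: 2
```

Trend filtering uses higher-order difference matrices in the same role, and a sparse fused lasso appends rows of the identity to D.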
This book teaches readers to use R within four core analytical areas applicable to the humanities: networks, text, geospatial data, and images. It is also designed to be a bridge: between quantitative and qualitative methods, individual and collaborative work, and the humanities and social sciences. Humanities Data with R does not presuppose background programming experience, and it uses an expanded conception of the forms data may take and the information it represents. The methodology will have wide application in classrooms and self-study within the humanities, as well as in linguistics, anthropology, and political science.
04. Twenty-Four-Hour Pattern of Intraocular Pressure in Untreated Patients with Ocular Hypertension. Grippo, Tomas, John Liu, Nazlee Zebardast, Taylor Arnold, Grant Moore, and Robert Weinreb. Investigative Ophthalmology and Visual Science, 54.1, 512-517 (2015). HTML.
This paper characterizes the 24-hour pattern of intraocular pressure (IOP) in untreated ocular hypertensive (OHTN) patients. We show that the baseline 24-hour IOP pattern in OHTN patients is similar to that in glaucomatous patients. Unlike nonconverters, OHTN patients who converted to glaucoma differed significantly from healthy controls.
Methodology extending nonparametric goodness-of-fit tests to discrete null distributions has existed for several decades. However, modern statistical software has generally failed to provide this methodology to users. We offer a revision of R’s ks.test() function and a new cvm.test() function that fill this need in the R language for two of the most popular nonparametric goodness-of-fit tests. This paper describes these contributions and provides examples of their usage. Particular attention is given to various numerical issues that arise in their implementation.
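The discrete-null case can be made concrete in base R: the Kolmogorov-Smirnov statistic is the largest gap between the empirical CDF and the null CDF, and when the data lie on the null's support the gap need only be checked at the support points. The sketch below computes only the statistic, not the corrected p-value that the revised ks.test() supplies.

```r
# Kolmogorov-Smirnov statistic against a discrete null distribution,
# assuming the sample takes values only on the null's support points.
# (Illustrative statistic only; the revised ks.test() also handles
# p-value computation for discrete nulls.)
ks_stat_discrete <- function(x, support, probs) {
  Fn <- ecdf(x)
  F0 <- cumsum(probs)            # null CDF evaluated at the support points
  max(abs(Fn(support) - F0))     # largest ECDF-vs-null gap at any jump
}

# fair six-sided die as the null hypothesis
x <- c(1, 1, 2, 3, 3, 3, 5, 6)
ks_stat_discrete(x, support = 1:6, probs = rep(1/6, 6))  # about 0.25
```

The numerical care the paper describes matters here: naive continuous-case formulas evaluate the gap between jumps and can misstate both the statistic and its null distribution.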
02. Statistical Sleuthing by Leveraging Human Nature: A Study of Olympic Figure Skating. John Emerson and Taylor Arnold. The American Statistician, 65.3 (2011). HTML.
Analysis of figure skating scoring is notoriously difficult under the new Code of Points (CoP) scoring system, created following the judging scandal of the 2002 Olympic Winter Games. The CoP involves the selection of a random subpanel of judges; scores from other judges are reported but not used. An attempt to repeat the methods of previous studies establishing the presence of nationalistic bias in CoP scoring failed to recreate the competition scores from the raw scoring sheets. This raised the concern that different subpanels of judges were being selected for each skater (breaking ISU rules). However, it is also possible that the ISU was attempting to further reduce transparency in the system by permuting, separately for each skater, the order of the presentation of scores from the judging panel. Intuition suggests that it is impossible to tell the difference between accidental randomization and intentional permutation of the judges' scores. Although the recent changes do successfully prevent the study of nationalistic bias, this article provides strong evidence against the hypothesis that a separate random subpanel is chosen for each competitor. It addresses the problem by applying Gleser's extension of the Kolmogorov–Smirnov goodness-of-fit test.
01. Who Talks, and Who's Listening? Networks of International Security Studies. Bruce Russett and Taylor Arnold. Security Dialogue, 41.6, 589-595 (2010). HTML.
This article examines the international networks of communication among journals concerned with international security studies. It uses the Web of Knowledge database to determine which journals cited articles in which other journals over the decade 1999–2008, and the overall impact of each journal in the field as a whole. We discover a complex set of networks, with different central journals exerting influence both overall and within subnetworks, as well as peripheral journals linked weakly to only a few others. Some subnetworks can be distinguished by methodology or theoretical school. Subnetworks frequently cross geographical lines, including both European and US journals. No single journal dominates the field.