Today’s libraries provide online access to millions of bibliographic records. Librarians and researchers require tools that allow them to manage and understand these enormous data sets. Most libraries only offer textual interfaces for searching and browsing their holdings. The resulting lists are difficult to navigate and do not allow users to get a general overview. Furthermore, data providers such as OCLC (Online Computer Library Center) are trying to enrich bibliographic databases with semantically meaningful structures, which are essentially lost when represented within a list-based interface.
Visualizations provide a means to overcome these difficulties. Graphical representations provide quick access to bibliographic data, facilitate the browsing and identification of relevant metadata, and provide a quick overview of the coverage of libraries. The fields of Information Visualization and Visual Analytics within Computer Science develop computer-supported, interactive, visual representations which allow users to extract meaning from large and heterogeneous data sets. While such visual techniques have become common practice in the sciences, they are little employed by libraries, despite similar increases in available data.
eScience Research Engineers collaborate with the project team to develop a cutting-edge visual analytics toolkit that answers both the pressing needs of humanities researchers and the concrete demands of the library industry. The tools will provide visual interfaces for:
- data cleaning, clustering, and enrichment,
- data analysis, and
- intuitive and interactive (geographic) representation of search results.
The project team will accompany the toolkit development with extensive expert user testing, involving (e-)humanities researchers and the expert user group of OCLC.
Online and mobile news consumption leaves digital traces that are used to personalize news supply, possibly creating filter bubbles where people are exposed to a low diversity of issues and perspectives that match their preferences. Filter bubbles can be detrimental for the role of journalism in democracy and are therefore subject to considerable debate among academics and policymakers alike.
The existence and impact of filter bubbles are difficult to study because of the need to gather the digital traces of individual news consumption; to automatically measure issues and perspectives in the consumed news content; and to combine and analyse these heterogeneous data streams.
We will develop a mobile app to trace individual news consumption and gather user data (WP1); create a custom NLP pipeline for automatically identifying a number of indicators of news diversity in the news content (WP2); integrate and analyze the resulting heterogeneous data sets (WP3); and use the resulting rich data set to determine the extent to which news recommender algorithms and selective exposure lead to a lower diversity of issues and perspectives in the filter bubbles formed by news supplied to and consumed by different groups of users (WP4).
This allows us to determine the impact of biased and homogenous news diets on political knowledge and attitudes. The software developed in this project will be open source and re-usable outside the scope of this project by scholars interested in mobile behavior and news production and consumption.
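One common way to quantify the "diversity of issues" in a news diet is normalized Shannon entropy over the topics a user consumed. The sketch below is a minimal illustration of that idea, not the project's actual WP2/WP3 pipeline; the topic labels are invented placeholders.

```python
import math
from collections import Counter

def topic_diversity(topics):
    """Normalized Shannon entropy of the topic distribution in a news diet.

    Returns 0.0 for a single-topic diet and 1.0 for a perfectly even spread
    over the topics that occur."""
    counts = Counter(topics)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))

# A homogeneous diet scores low; a varied diet scores high.
narrow = ["sports"] * 9 + ["politics"]
broad = ["sports", "politics", "economy", "culture", "science"] * 2
```

Comparing `topic_diversity(narrow)` with `topic_diversity(broad)` gives a simple per-user diversity index that could be tracked over time as recommendations personalize.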
Humanities research makes extensive use of digital archives. Most of these archives, including the KB newspaper data, consist of digitized text. One of the major challenges of using these collections for research is the fact that Optical Character Recognition (OCR) on scanned historical documents is far from perfect. Although it is hard to quantify the impact of OCR mistakes on humanities research, it is known that these mistakes have a negative impact on basic text processing techniques such as sentence boundary detection, tokenization, and part-of-speech tagging. As these basic techniques are often used prior to performing more advanced techniques and most advanced techniques use words as features, it is likely that OCR mistakes have a negative impact on more advanced text mining tasks humanities researchers are interested in, such as named entity recognition, topic modeling, and sentiment analysis.
The goal of this research is to bring the digitized text closer to the original newspaper articles by applying post-correction. Post-correction involves improving digitized text quality by manipulating the textual output of the OCR process directly. The idea is that better quality data boosts eHumanities research. Although the quality of the KB newspaper data would definitely benefit from improving the OCR process itself (improved image recognition), post-correction will still be necessary, because the quality of historical newspapers is suboptimal for OCR (for example, due to poor paper and print quality).
Existing approaches for OCR post-correction generally make use of extensive dictionaries to replace words in the OCRed text that do not occur in the dictionary with words that do. Based on the assumption that a number of characters in every word will be identified correctly, words not in the dictionary are replaced with alternatives that are as similar as possible to the text recognized, possibly taking into account word frequencies to solve ties. The main problem with these existing approaches is that they do not take into account the context in which words occur.
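The dictionary-based approach described above can be sketched in a few lines: find the dictionary entry with the smallest edit distance to the unrecognized word, breaking ties by corpus frequency. The word list and frequencies below are illustrative placeholders.

```python
def edit_distance(a, b):
    """Levenshtein distance via classic dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, dictionary, freq):
    """Replace an out-of-dictionary word with the closest dictionary entry,
    using corpus frequency to break distance ties."""
    if word in dictionary:
        return word
    return min(dictionary,
               key=lambda w: (edit_distance(word, w), -freq.get(w, 0)))

dictionary = {"straat", "staat", "plaats"}   # toy Dutch word list
freq = {"straat": 50, "staat": 200}
```

Note how this method, exactly as the paragraph observes, never looks beyond the single word being corrected.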
Deep learning techniques provide an opportunity to take this context into account. This project aims to learn a character based language model of Dutch newspaper articles. This is a model of the character sequences occurring in the text of a corpus. OCR mistakes can be viewed as deviations from this model. Mistakes can be fixed by intervening when text deviates too much from the model.
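Before reaching for deep learning, the core idea of "OCR mistakes as deviations from a character language model" can be illustrated with a simple character n-gram model: text whose average per-character log-probability is unusually low is a candidate for correction. This is a minimal stand-in for the neural model the project proposes, trained here on a toy corpus.

```python
import math
from collections import Counter

def train_char_model(corpus, n=3):
    """Count character n-grams (and their contexts) in a reference corpus."""
    counts, context = Counter(), Counter()
    for text in corpus:
        padded = " " * (n - 1) + text
        for i in range(len(text)):
            counts[padded[i:i + n]] += 1
            context[padded[i:i + n - 1]] += 1
    return counts, context

def avg_log_prob(text, counts, context, n=3, alpha=1.0):
    """Mean per-character log-probability under the n-gram model with
    add-alpha smoothing. Low scores suggest OCR damage."""
    vocab = len(set("".join(counts))) or 1
    padded = " " * (n - 1) + text
    total = 0.0
    for i in range(len(text)):
        c = counts[padded[i:i + n]]
        ctx = context[padded[i:i + n - 1]]
        total += math.log((c + alpha) / (ctx + alpha * vocab))
    return total / max(len(text), 1)

counts, context = train_char_model(["de krant van de dag"] * 5)
```

A clean string like `"de krant"` scores noticeably higher than an OCR-garbled variant such as `"dq krxnt"`, which is exactly the deviation signal used to trigger intervention.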
This project applies machine learning to study a) the discursive practices of politicians and journalists on Twitter, and b) the extent to which institutional differences between agents still matter, or even exist, now that they have similar publishing opportunities on social media. While automated analysis of the content of tweets is intensively studied, the project’s focus on behaviour is innovative. It aims to develop a tool for large-scale automated content analysis of latent categories of behaviour that is scalable in terms of big data sets and various social media platforms.
The project builds upon previous work by the research team in which manual content analysis was applied to study discursive practices of politicians and journalists. A detailed coding scheme was designed to code latent categories of online behaviour (or: discursive practices) such as broadcasting, promoting, criticizing, branding, requesting input etc. These annotated data sets will be used to train the computer.
Our work suggests that although journalists and politicians have different roles and goals, their behaviour on social media is surprisingly similar. This hypothesized redistribution of power in the so-called “triangle of political communication” calls for a revision of classic theoretical insights that are key to both political communication and journalism studies.
Image: European Parliament (flickrCC)
Emotional expression plays a crucial role in everyday functioning. It is a continuous process involving many features of behavioral, facial, vocal, and verbal modalities. Given this complexity, few psychological studies have addressed emotion recognition in an everyday context. Recent technological innovations in affective computing could result in a scientific breakthrough as they open up new possibilities for the ecological assessment of emotions. However, existing technologies still pose major challenges in the field of Big Data Analytics.
Little is known about how these lab-based technologies generalize to real world problems. Rather than a one-size-fits-all-solution, existing tools need to be adapted to specific user groups in more natural settings. They also need to take large individual differences into account. We take up these challenges by studying emotional expression in dementia. In this context, emotional functioning is strongly at risk, yet highly important to maintain quality of life in person-centered care. Our domain-specific goal is to gain a better understanding of how dementia affects emotion expression.
We carry out a pilot study, a comparative study between Alzheimer patients and matched healthy older adults, as well as a longitudinal study on the development of emotion expression in Alzheimer patients across time. We develop a unique corpus, use state of the art machine learning techniques to advance technologies for multimodal emotion recognition, and develop visualization and statistical models to assess multimodal patterns of emotion expression. We test their usability in a workshop for international researchers and make them available through the eScience Technology Platform.
Image: Susan Sermoneta (CC License)
Case synthesis is the method commonly applied by legal researchers and law students when analyzing court decisions. It essentially entails comparing case outcomes with the facts of the cases, with the purpose of explaining the differences in outcomes by the differences in facts. The analysis of court decisions commonly relies on human analysis, without software or other technical aid. Consequently, case law is analyzed based on a relatively small number of cases. In contrast, the law produces numerous cases: in 2013 alone, Dutch courts of first instance handled almost 800,000 private law cases (for example contract termination, damages, divorce). The legal field thus currently studies only a fraction of the cases that are out there, and relationships between cases are likely to remain hidden.
This project aims to develop a technology that assists the legal community in analyzing case law. With tens of thousands of court decisions published yearly in the Netherlands, numerous decisions remain unstudied. In this project, a technology will be built that allows analysis of which court decisions are central in a network of decisions. This will be done by focusing on citations: the number of references to a certain court decision in other court decisions (in-degree centrality).
The in-degree centrality of court decisions is likely to reveal new patterns among decisions on various topics. Moreover, it will shed more light on which decisions are the most important within a certain network. The automated process that will be developed enhances the analysis of relationships among a large number of decisions. Future projects may expand the application by
- processing additional information in the decisions and coding it into researchable variables; and
- expanding the technology to other countries than the Netherlands.
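In-degree centrality itself is computationally simple: once citations have been extracted from the decision texts, it is just a count of incoming references per decision. The sketch below illustrates this with invented ECLI-style case identifiers.

```python
from collections import Counter

def in_degree_centrality(citations):
    """Count incoming citations per decision.

    `citations` is an iterable of (citing_case, cited_case) pairs; the result
    maps each cited decision to its in-degree."""
    return Counter(cited for _, cited in citations)

# Toy citation network; the ECLI identifiers are illustrative placeholders.
citations = [
    ("ECLI:NL:HR:2014:1", "ECLI:NL:HR:2010:5"),
    ("ECLI:NL:HR:2015:2", "ECLI:NL:HR:2010:5"),
    ("ECLI:NL:HR:2015:2", "ECLI:NL:HR:2012:9"),
]
ranking = in_degree_centrality(citations).most_common()
```

The hard part of the project lies upstream, in reliably extracting the citation pairs from unstructured decision texts; the ranking step itself scales easily to hundreds of thousands of decisions.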
This project (and future projects) can fundamentally transform the way the law is studied (researchers, students) and used (practitioners).
Image: Jeroen Bouman – ICJ – JCI hearing (CC License)
There is a growing interest from companies, governments and universities in the daily communication that takes place on online social media such as blogs, Facebook, and Twitter. Linguists and researchers in communication studies can use this data to study language variation and change. Companies may track the reputation of a product after its introduction. Journalists may follow the spread of news messages and spot initial local reports of incidents. Police may monitor Twitter for suspicious behavior. However, the amount of social media data is large, and obtaining the specific parts that are interesting for a certain purpose is not easy.
This project has developed a centralized service for gathering, storing, and analyzing Twitter messages and making available derived information to a consortium of researchers in communication studies and language technology throughout the Netherlands.
Blue Monday on Twitter
The service is based on an existing system set up at the ISLA (UvA) and the RUG, with infrastructure from SURFsara – mapping these tweets is a very compute-intensive activity. The Twitter API, providing free access to approximately 1% of all tweets worldwide, is constantly harvested and the resulting data stored. Interfaces to this data provide users with a number of analysis tools that can be run on all content and metadata.
Image: Michele Ursino (CC License)
Mental illnesses, like depression and anxiety, are among the leading causes of the global burden of disease. E-mental health (EMH) interventions, such as web-based psychotherapy treatments, are increasingly used to improve access to psychotherapy for a wider audience. Whereas different EMH interventions tend to be equally effective, the responsiveness to a specific treatment shows large individual differences. The personalization of treatments is seen as the major road for improvement.
“The personalization of treatments is the major road for improvement.”
eScience methods and tools
Because most EMH interventions use language for communication between counselors and clients, assessing language use provides an important avenue for opening the black box of what happens within therapy. EMH also makes data of the linguistic interactions between client and counselor available on an unprecedented large scale.
“EMH makes data of the linguistic interactions between client and counselor available on an unprecedented large scale.”
The objective of this interdisciplinary project is to use eScience methods and tools, in particular natural language processing, visualization and multivariate analysis methods, to analyze patterns in therapy-related textual features in e-mail correspondence between counselor and client.
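A first step in analyzing therapy-related textual features is counting how often words from indicator categories occur in each e-mail. The sketch below illustrates this with small, invented English word lists; a real pipeline would use validated lexicons (such as translated LIWC categories) for Dutch.

```python
import re
from collections import Counter

# Hypothetical indicator lexicons for illustration only.
INDICATORS = {
    "positive_emotion": {"hope", "proud", "relief", "happy"},
    "negative_emotion": {"afraid", "worthless", "sad", "angry"},
    "first_person": {"i", "me", "my", "myself"},
}

def indicator_profile(email_text):
    """Relative frequency of each indicator category in one e-mail."""
    tokens = re.findall(r"[a-z']+", email_text.lower())
    total = max(len(tokens), 1)
    return {cat: sum(t in words for t in tokens) / total
            for cat, words in INDICATORS.items()}
```

Profiles like these, computed per e-mail over the course of a treatment, are the kind of feature sequence that the multivariate analysis and visualization work could then relate to therapy outcomes.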
Improving the effectiveness of EMH
By connecting patterns of known change indicators to therapy outcome, the question “What Works When for Whom?” can be answered, which will greatly improve the effectiveness of EMH. The core of the project concerns the development of integrated, modular software for the Dutch language, using data from six EMH interventions with a total of 10,000 e-mails. These data are sufficiently large and varied to allow for computer-based modelling and testing of use cases with varying complexity. At the end of the project, the step toward English-language software will be made to increase the impact of the project.
Image: Drew Leavy (CC License)
Historical concepts (such as citizenship, democracy, evolution, health, liberty, security, trust, etc.) are essential to our understanding of the past. The history of concepts is a well-established field of research for historians, philosophers and linguists alike. However, there is little agreement on the nature of conceptual change, continuity and replacement, or on the proper methodology to distinguish between core concepts and the marginal vocabularies that are attached to them in certain historical contexts. Currently, the notion that concepts are stable unit-ideas that constitute the continuous foundation of changing historical debates, just as chemical elements can form different molecules, is being re-evaluated.
A tool that enables humanities researchers to mine the historical development of concepts
The scientific goal of this project is to develop a tool that enables humanities researchers to mine the historical development of concepts and the vocabulary with which they are expressed in big textual data repositories. Recent research suggests that vector representations derived by neural network language models offer new possibilities for obtaining high quality semantic representations from huge data sets.
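With such vector representations in hand, one simple way to trace a concept's vocabulary is to compare its nearest neighbours in embedding spaces trained on different time slices of the corpus. The sketch below shows only the comparison step, with tiny hand-made 2-dimensional vectors standing in for learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbours(concept, embeddings, k=3):
    """Rank the vocabulary of one time slice by similarity to a concept.

    `embeddings` maps word -> vector for a single period; comparing the
    neighbour lists of, say, 'citizenship' across periods shows how the
    vocabulary around the concept shifts."""
    target = embeddings[concept]
    others = [(w, cosine(target, v))
              for w, v in embeddings.items() if w != concept]
    return sorted(others, key=lambda p: -p[1])[:k]

# Toy 2-D "embeddings" for one time slice; real vectors would come from a
# neural language model trained on that period's newspapers.
slice_1900 = {"citizenship": [1.0, 0.0], "city": [0.9, 0.1],
              "nation": [0.5, 0.5], "bread": [0.0, 1.0]}
```

Repeating this for each time slice, and aligning the embedding spaces, yields the kind of diachronic neighbour lists that make conceptual change and continuity inspectable.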
“Mining the historical development of concepts such as citizenship.”
Impressive progress has been made in tracing and mapping historical events and actors as well as past relations between actors and events. This project aims to go beyond these capabilities by establishing the structures of interpretation that emerge around these historical events, and the subsequent formation of collective meanings.
Digitized repositories of historical newspapers and magazines offer crucial empirical data to explore our historical heritage. Produced and controlled by social ‘gatekeepers’ such as journalists, editors and publishers, their information-rich content and audience-oriented nature ensure that they reflect and mediate public opinion in the societies that produce them.
Visualizing changing cultural concepts
This project aims to make historically embedded cultural concepts visible in a way that enables researchers to trace the specific historicity of cultural concepts and their changes and continuities in meaning.
Image: View of Ancient Florence by Fabio Borbottoni. Emerging cities such as Florence gave people new opportunities to be a citizen of their city. Citizenship as a concept is difficult to define because it changes over space and time.
The shift to Digital Humanities has brought humanities scholars an unprecedented amount of historical information on the Web. Events, and the associated role of perspectives in event interpretation, take a pivotal role in humanities research. We often ask ourselves whether events and narratives provide context for interpretation of cultural heritage collections. Can event perspectives gathered through crowdsourcing provide the necessary diversity of perspectives on historical events?
Similar to other users of digital data (consumers, content creators, teachers, publishers), scholars are faced with the challenge of how to find, link, and understand massive and diverse collections of historical information on the Web. Here, hermeneutics, the theory of interpretation in the humanities, needs to account for the interpretation of information in a digital environment.
This project provides a basis for interpretation support in searching and browsing heritage objects, where semantic information from existing collections and linked open data vocabularies links collections of objects to the events, people, locations and concepts that are depicted in or associated with those objects. An innovative interface allows for browsing this network of data in an intuitive fashion, supporting both digital humanities scholars and general audiences in their online explorations.
Image: Frans Francken (II) – Kunst- und Raritätenkammer