Sources of natural human language, such as emails, web pages, tweets, product descriptions, newspaper stories, social media and scientific articles, are a central feature of the so-called Big Data explosion.
Within these various media lies a wealth of information, connections, patterns and hidden knowledge of academic, social and commercial value. Add to this the volume of digitized historical records and texts, covering thousands of languages, formats and varieties, and the potential to unlock new insights becomes almost limitless, but also hugely complicated and challenging.
The science of analyzing human language is natural language processing (NLP), and its applications are already part of our everyday lives. Spelling and grammar correction in word processors, translation tools on the web, email spam detection and automatic question answering are all forms of NLP.
In humanities research (our focus for NLP applications), a number of newer NLP challenges exist. These include detecting people’s opinions (sentiment analysis), producing readable summaries of chosen text, identifying the discourse structure of connected text, identifying the relationships among named entities and selecting the correct meaning of words that have multiple meanings (word sense disambiguation).