Understanding spoken language is an important capability of intelligent systems which interact with people. Example applications which use a speech understanding component include personal assistants, search engines and others. The common way of enabling an application to understand and react to spoken language is to first transcribe speech into text using a speech recognition module, and then to process the text with a separate text understanding module.
We propose an alternative approach inspired by how humans understand speech. Speech will be processed directly by an end-to-end neural network model without first being transcribed into text, avoiding the need for large amounts of transcribed speech needed to train a traditional speech recognition system. The system will instead learn simultaneously from more easily obtained types of data: for example, it will learn to match images to their spoken descriptions, answer questions about images, or match utterances spoken in different languages.
Our proposal promises to be less reliant on strong supervision and expensive resources and thus applicable in a wider range of circumstances than traditional systems, especially when large amounts of transcribed speech re not available, for example when dealing with low-resource languages or specialized domains.