Conference Proceedings


Gareth J.F. Jones
Florian Metze
Ramon Sanabria
Yasufumi Moriya


Computer Science

Keywords: content analysis, human speech processing, automatic speech recognition (ASR), audio-visual signals, reference resolution, multimedia data

Eyes and Ears Together: New Task for Multimodal Spoken Content Analysis (2018)

Abstract: Human speech processing is often a multimodal process combining audio and visual processing. Eyes and Ears Together proposes two benchmark multimodal speech processing tasks: (1) multimodal automatic speech recognition (ASR) and (2) multimodal co-reference resolution on spoken multimedia. These tasks are motivated by our desire to address the difficulties of ASR for multimedia spoken content. We review prior work on the integration of multimodal signals into speech processing for multimedia data, introduce a multimedia dataset for our proposed tasks, and outline these tasks.

