Monday June 25th

Time Session Location
11:00 – 12:00 Registration (continues till 13:00 for those not having lunch) Hamilton Ground Floor
12:00 – 13:00 Lunch Hamilton Ground Floor
13:00 – 13:15 Welcome Message Hamilton Lecture Room
13:15 – 14:15 Keynote 1 Hamilton – Joly Theatre
14:15 – 15:30 Poster Session A Hamilton Ground Floor
15:30 – 16:00 Coffee Break Hamilton Ground Floor
16:00 – 17:00 Oral Session A Hamilton – Joly Theatre
18:00 – 19:30 Drinks reception Trinity Old Library
19:30 till late Barbecue dinner The Pavilion Bar

Tuesday June 26th

Time Session Location
08:40 – 09:00 Registration Hamilton Ground Floor
09:00 – 10:00 Keynote 2 Hamilton – Joly Theatre
10:00 – 11:00 Poster Session B Hamilton Ground Floor
11:00 – 11:30 Coffee Break Hamilton Ground Floor
11:30 – 12:30 Oral Session B Hamilton – Joly Theatre
12:30 – 13:30 Lunch Hamilton Ground Floor
13:30 – 14:15 Keynote 3 Hamilton – Joly Theatre
14:15 – 15:15 Poster Session C Hamilton Ground Floor
15:15 – 15:30 Final remarks and farewell Hamilton – Joly Theatre

Reception at the Long Room Library & the Book of Kells

The Book of Kells Exhibition is a must-see on the itinerary of all visitors to Dublin. Located in the heart of Dublin City, the cobbled squares of Trinity College Dublin bring visitors back to the 18th century, when the magnificent Old Library building, which houses the Book of Kells, was constructed.

The main chamber of the Old Library is the Long Room; at nearly 65 metres in length, it is filled with 200,000 of the Library’s oldest books and is one of the most impressive libraries in the world.

When built (between 1712 and 1732), it had a flat plaster ceiling, shelving for books on the lower level only, and an open gallery. By the 1850s these shelves had become completely full, largely because, since 1801, the Library had held the right to claim a free copy of every book published in Britain and Ireland. In 1860 the roof was raised to allow construction of the present barrel-vaulted ceiling and upper gallery bookcases.

Marble busts line the Long Room, a collection that began in 1743 when 14 busts were commissioned from the sculptor Peter Scheemakers. The busts are of the great philosophers and writers of the western world, as well as of men (and yes, they are all men) connected with Trinity College Dublin – famous and not so famous. The finest bust in the collection is of the writer Jonathan Swift by Louis-François Roubiliac.

Other treasures in the Long Room include one of the few remaining copies of the 1916 Proclamation of the Irish Republic, which was read outside the General Post Office on 24 April 1916 by Patrick Pearse at the start of the Easter Rising. Also on display is a medieval harp, the oldest of its kind in Ireland, probably dating from the 15th century. It is made of oak and willow with 29 brass strings, and is the model for the emblem of Ireland.

BBQ at the Pavilion Bar in TCD

The Pavilion Bar first opened in October 1961 and is the sports bar of Trinity College. Profits from the Pavilion go directly to support sports clubs through DUCAC. Having drinks outside “The Pav”, as it is better known, is a great way to spend a sunny afternoon or evening. Many students (and returning graduates) have spent evenings here watching the cricket on College Park, noting that they must one day learn the rules!

Keynote 1

Monday 13.15-14.15

“Children’s Speech Recognition – from the Lab to the Living Room” by Patricia Scanlon, Soapbox Labs

Voice is predicted to replace typing, clicks, touch and gesture as the dominant way to interface with technology in all aspects of our lives, across homes, cars, offices and schools. However, voice interfaces designed and built for adults using adult speech data do not perform well for children, and performance deteriorates the younger the child. This is because children’s voices differ from adults’ both physically and behaviourally, and these differences increase the younger the child.
Deep learning approaches to speech recognition have increased performance in recent years but require significant volumes of data to achieve such improvements. While large volumes of varied adult speech datasets are publicly available to buy or license, only small, limited children’s speech datasets are available. Children’s speech is notoriously difficult to collect, particularly from children under 8 years old. Publicly available children’s speech datasets are typically recorded with high-quality headset microphones in clean, quiet, highly controlled conditions. This causes significant problems, as speech technology systems built with such data require that a child’s environment mimic those conditions in order for the system to work effectively.
SoapBox Labs has been working on the problem of children’s speech recognition since 2013, and has built a children’s voice technology platform for children aged 4–12, which is licensed to third parties to voice-enable their products for use with children. Our high-accuracy platform uses deep learning techniques and has been built using thousands of hours of proprietary, high-quality, real-world, uncontrolled and varied speech data from young children across the globe. SoapBox Labs is currently scaling our platform to multiple new languages.
Application areas include voice control and conversational engagement for home assistant skills, gaming, AR/VR, toys and robotics, as well as educational assessment for reading and language-learning tutors/assistants.
Globally there is growing concern about data privacy. Scrutiny is likely to continue, with further focus on children’s voice data. SoapBox Labs also helps companies take a proactive approach to privacy for children’s voice data and to ensure full US COPPA and EU GDPR compliance through our patent-pending privacy-by-design approach.

Keynote 2

Tuesday 9.00-10.00

“Spoken Language Processing: Are We Nearly There Yet?” by Roger Moore, Sheffield University

Maybe, maybe not!

Keynote 3

Tuesday 13.30-14.15

“Deep Learning for End-to-End Audio-Visual Speech Recognition” by Stavros Petridis, Imperial College London

Decades of research in acoustic speech recognition have led to systems that we use in our everyday life. However, even the most advanced speech recognition systems fail in the presence of noise, e.g., giving voice commands to your mobile phone in the street does not work as well as in a quiet room. This problem can be (partially) addressed by using visual information, e.g., monitoring the lip movements, which are not affected by noise. Recent advances in deep learning have made it straightforward to extract information from the mouth region and combine it naturally with the acoustic signal in order to enhance the performance of speech recognition. In this talk, we will see how deep learning has made this possible and also present a few relevant applications like end-to-end speech-driven facial animation.

Oral Session A

Monday 16.00-16.20

“Seeing speech: ultrasound imaging for child speech therapy” by Korin Richmond, University of Edinburgh

It is estimated that up to 6.5% of children (or two children in every classroom) in Britain suffer from a Speech Sound Disorder, defined as difficulty in producing one or more native-language speech sounds. This can make it difficult for children to communicate normally, impacting self-esteem and leading to a recognised risk of poor integration and educational attainment. Current speech therapy methods have little technological support, relying upon the “ears” of the therapist and child client. Attractively, a medical ultrasound scanner offers the potential to visualise and monitor what is going on inside the client’s mouth. Here I will give an overview of our “Ultrax” project and its ongoing work to apply machine learning and signal processing techniques to develop ultrasound imaging as a useful technology to support child speech therapy.

Monday 16.20-16.40

“Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs” by Matthew Roddy, Trinity College Dublin

In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities (e.g. linguistic, acoustic, visual). To design spoken dialog systems (SDSs) that can conduct fluid interactions it is desirable to incorporate cues from these separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. To test this hypothesis, we develop a multiscale RNN architecture in which modalities can be flexibly modeled at separate timescales. Our results show that when using acoustic and linguistic features to model turn-taking, modeling linguistic features at a variable temporal rate yields significant improvements over using a uniform temporal rate.

Monday 16.40-17.00

“Expanding Alexa’s Knowledge Base: Relation extraction from unstructured text” by Christos Christodoulopoulos, Amazon 

These days, most general knowledge question-answering systems rely on large-scale knowledge bases comprising billions of facts about millions of entities. Having a structured source of semantic knowledge means that we can answer questions involving single static facts (e.g. “Who was the 8th president of the US?”) or dynamically generated ones (e.g. “How old is Donald Trump?”). More importantly, we can answer questions involving multiple inference steps (“Is the queen older than the president of the US?”).

In this talk, I will discuss some of the unique challenges involved in building and maintaining a consistent knowledge base for Alexa, extending it with new facts and using it to serve answers in multiple languages. I will focus on an investigation into fact extraction from unstructured text. I will present a method for creating distant (weak) supervision labels for training a large-scale relation extraction system. I will also discuss the effectiveness of neural network approaches by decoupling the model architecture from the feature design of a state-of-the-art neural network system. Surprisingly, a much simpler classifier trained on similar features performs on par with the highly complex neural network system (with a 75x reduction in training time), suggesting that the features are a bigger contributor to the final performance.

Oral Session B

Tuesday 11.30-11.50

“‘What can I help you with?’: Infrequent users’ experiences of Intelligent Personal Assistants” by Benjamin R. Cowan, University College Dublin

Intelligent Personal Assistants (IPAs) are widely available on devices such as smartphones. However, most people do not use them regularly. Previous research has studied the experiences of frequent IPA users. Using qualitative methods we explore the experience of infrequent users: people who have tried IPAs, but choose not to use them regularly. Unsurprisingly, infrequent users share some of the experiences of frequent users, e.g. frustration at limitations on fully hands-free interaction. Significant points of contrast and previously unidentified concerns also emerge. Cultural norms and social embarrassment take on added significance for infrequent users. The humanness of IPAs sparked comparisons with human assistants, juxtaposing their limitations. Most importantly, significant concerns emerged around privacy, monetization, data permanency and transparency. Drawing on these findings we discuss key challenges, including: designing for interruptibility; reconsideration of the human metaphor; and issues of trust and data ownership. Addressing these challenges may lead to more widespread IPA use.

Tuesday 11.50-12.10

“The Effect of Real-Time Constraints on Automatic Speech Animation” by Danny Websdale, University of East Anglia

Tuesday 12.10-12.30

“Deep learning for assessing non-native pronunciation of English using phone distances” by Konstantinos Kyriakopoulos, University of Cambridge


Poster Session A

Monday 14:15 – 15:30

  1. Salil Deena, Raymond W. M. Ng, Pranava Madhyastha, Lucia Specia and Thomas Hain, “Exploring the use of Acoustic Embeddings in Neural Machine Translation”
  2. Zack Hodari, Oliver Watts, Srikanth Ronanki, Simon King, “Learning interpretable control dimensions for speech synthesis by using external data”
  3. Christopher G. Buchanan, Matthew P. Aylett, David A. Braude, “Adding Personality to Neutral Speech Synthesis Voices”
  4. Avashna Govender and Simon King, “Using pupillometry to measure the listening effort of synthetic speech”
  5. Carolina De Pasquale, Charlie Cullen, Brian Vaughan, “Towards a protocol for the analysis of interpersonal rapport in clinical interviews through speech prosody”
  6. Yasufumi Moriya, Gareth J. F. Jones, “Investigating the use of a Multimodal Language Model for Re-Ranking ASR N-best Hypotheses”
  7. Jennifer Williams and Simon King, “Low-Level Prosody Control From Lossy F0 Quantization”
  8. Andy Murphy, Irena Yanushevskaya, Christer Gobl, Ailbhe Ní Chasaide, “Effects of voice source manipulation on prominence perception”
  9. Benjamin R. Cowan, Holly P. Branigan, Habiba Begum, Lucy McKenna, Eva Szekely, “They Know as Much as We Do: Knowledge Estimation and Partner Modelling of Artificial Partners”
  11. Xizi Wei, Peter Jančovič, Martin Russell, Khalida Ismail, Tom Marshall, “Automatic Assessment of Motivational Interviews with Diabetes Patients”
  12. K.M. Knill, M.J.F. Gales, K. Kyriakopoulos, A. Malinin, A. Ragni, Y. Wang, A.P. Caines, “Impact of ASR Performance on Free Speaking Language Assessment”
  13. Carol Chermaz, Cássia Valentini-Botinhão, Henning Schepker and Simon King, “Speech pre-enhancement in realistic environments”
  14. Wissam Jassim and Naomi Harte, “Voice Activity Detection Using Neurograms”

Poster Session B

Tuesday 10:00 – 11:00

  1. Matthew P. Aylett, David A. Braude, “Grassroots: Using Speech Synthesis to Curate Audio Content for Low Power Community FM Radio”
  2. Joanna Rownicka, Steve Renals, Peter Bell, “Understanding deep speech representations”
  3. Feifei Xiong, Jon Barker, Heidi Christensen, “Deep Learning of Articulatory-Based Representations for Dysarthric Speech Recognition”
  4. Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, and Steve Renals, “Towards Robust Word Alignment of Child Speech Therapy Sessions”
  5. Oliver Watts, Cássia Valentini-Botinhão, Felipe Espic, and Simon King, “Exemplar-based speech waveform generation”
  6. Leigh Clark, Phillip Doyle, Diego Garaialde, Emer Gilmartin, Stephan Schlögl, Jens Edlund, Matthew Aylett, João Cabral, Cosmin Munteanu, and Benjamin Cowan, “The State of Speech in HCI: Trends, Themes and Challenges”
  7. Danny Websdale and Ben Milner, “Using Visual Speech Information for Noise and Signal-to-Noise Ratio Independent Speech Enhancement”
  8. Eva Fringi and Martin Russell, “Analysis of phone errors attributable to phonological effects associated with language acquisition through bottleneck feature visualisations”
  9. Maria O’Reilly, Amelie Dorn, and Ailbhe Ní Chasaide, “Intonation of declaratives and questions in South Connaught and Ulster Irish”
  10. Ilaria Torre, Emma Carrigan, Killian McCabe, Rachel McDonnell, and Naomi Harte, “Mismatched audio-video smiling in an avatar and its effect on trust”
  11. Jeremy H. M. Wong and Mark J. F. Gales, “Teacher-student learning and ensemble diversity”
  12. Emer Gilmartin, Brendan Spillane, Maria O’Reilly, Ketong Su, Christian Saam, Benjamin R. Cowan, Carl Vogel, Nick Campbell, and Vincent Wade, “Dialog Acts in Greeting and Leavetaking in Social Talk”
  13. Brendan Spillane, Emer Gilmartin, Christian Saam, Leigh Clark, Benjamin R. Cowan, and Vincent Wade, “Introducing ADELE: A Personalized Intelligent Companion”
  14. George Sterpu, Christian Saam, and Naomi Harte, “Progress on Lip-Reading Sentences”

Poster Session C

Tuesday 14:15 – 15:15

  1. Christopher J. Pidcock, Blaise Potard, and Matthew P. Aylett, “Creating a New JFK Speech 55 Years Later”
  2. Mark Huckvale, András Beke, and Iya Whiteley, “Longitudinal study of voice reveals mood changes of cosmonauts on a 500 day simulated mission to Mars”
  3. Gerardo Roa and Jon Barker, “Automatic Speech Recognition in Music using ACOMUS Musical Corpus”
  4. Catherine Lai and Gabriel Murray, “Predicting Group Satisfaction in Meeting Discussions”
  5. Mengjie Qian, Xizi Wei, Peter Jančovič, and Martin Russell, “The University of Birmingham 2018 Spoken CALL Shared Task Systems”
  6. Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Harald Berthelsen, Christoph Wendler, Andrew Murphy, Emily Barnes, Irena Yanushevskaya, and Christer Gobl, “Speech technology and resources for Irish: the ABAIR initiative”
  7. Emer Gilmartin, Carl Vogel, Nick Campbell, and Vincent Wade, “Chats and Chunks: Annotation and Analysis of Multiparty Long Casual Conversations”
  8. Emma O’Neill, Mark Kane, and Julie Carson-Berndsen, “Two Data-Driven Perspectives on Phonetic Similarity”
  9. Brendan Spillane, Emer Gilmartin, Christian Saam, Leigh Clark, Benjamin R. Cowan, and Vincent Wade, “Identifying Topic Shift and Topic Shading in Switchboard”
  10. Andrea Carmantini, Simon Vandieken, Alberto Abad, Julie-Anne Meaney, Peter Bell, and Steve Renals, “Automatic speech recognition for cross-lingual information retrieval in the IARPA MATERIAL programme”
  11. Felipe Espic and Simon King, “The Softmax Postfilter for Statistical Parametric Speech Synthesis”
  12. João P. Cabral, “Estimation of the asymmetry parameter of the glottal flow waveform using the Electroglottographic signal”
  13. Jason Taylor and Korin Richmond, “Combilex G2P with OpenNMT”