Sign Language Production using Neural Machine Translation and Generative Adversarial Networks

Stephanie Stoll, Necati Cihan Camgöz, Simon Hadfield, Richard Bowden: Sign Language Production using Neural Machine Translation and Generative Adversarial Networks. In: 29th British Machine Vision Conference (BMVC 2018), British Machine Vision Association, 2018.

Abstract

We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our
system is capable of producing sign videos from spoken language sentences. Contrary to
current approaches that are dependent on heavily annotated data, our approach requires
minimal gloss and skeletal level annotations for training. We achieve this by breaking
down the task into dedicated sub-processes. We first translate spoken language sentences
into sign gloss sequences using an encoder-decoder network. We then find a data-driven
mapping between glosses and skeletal sequences. We use the resulting pose information
to condition a generative model that produces sign language video sequences. We
evaluate our approach on the recently released PHOENIX14T Sign Language Translation
dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of
16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities
of our approach by sharing qualitative results of generated sign sequences given their
skeletal correspondence.
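
Illustrative sketch (not part of the paper): the first sub-process described in the abstract, translating a spoken language sentence into a sign gloss sequence, can be pictured as a standard sequence-to-sequence model. The minimal PyTorch example below assumes a single-layer GRU encoder-decoder with placeholder vocabulary and hidden sizes; the paper's actual architecture, attention mechanism and training details are not reproduced here.

# Hypothetical sketch of a text-to-gloss encoder-decoder.
# Vocabulary sizes, hidden size and greedy teacher-forced decoding
# are assumptions made for illustration, not taken from the paper.
import torch
import torch.nn as nn

class TextToGloss(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256, emb=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the spoken-language sentence into a context vector.
        _, context = self.encoder(self.src_emb(src_tokens))
        # Decode gloss tokens conditioned on that context (teacher forcing).
        dec_out, _ = self.decoder(self.tgt_emb(tgt_tokens), context)
        return self.out(dec_out)  # (batch, gloss_len, tgt_vocab) logits

# Toy usage with random token ids; real inputs would be indexed words and glosses.
model = TextToGloss(src_vocab=1000, tgt_vocab=300)
src = torch.randint(0, 1000, (2, 12))   # batch of 2 sentences, 12 tokens each
tgt = torch.randint(0, 300, (2, 8))     # gloss sequences, 8 tokens each
logits = model(src, tgt)
print(logits.shape)                     # torch.Size([2, 8, 300])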

BibTeX (Download)

@inproceedings{surrey848809,
title = {Sign Language Production using Neural Machine Translation and Generative Adversarial Networks},
author = {Stephanie Stoll and Necati Cihan Camgöz and Simon Hadfield and Richard Bowden},
url = {http://epubs.surrey.ac.uk/848809/},
year  = {2018},
date = {2018-09-01},
booktitle = {29th British Machine Vision Conference (BMVC 2018)},
publisher = {British Machine Vision Association},
abstract = {We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our
system is capable of producing sign videos from spoken language sentences. Contrary to 
current approaches that are dependent on heavily annotated data, our approach requires 
minimal gloss and skeletal level annotations for training. We achieve this by breaking 
down the task into dedicated sub-processes. We first translate spoken language sentences 
into sign gloss sequences using an encoder-decoder network. We then find a data-driven
mapping between glosses and skeletal sequences. We use the resulting pose information 
to condition a generative model that produces sign language video sequences. We 
evaluate our approach on the recently released PHOENIX14T Sign Language Translation 
dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of 
16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities 
of our approach by sharing qualitative results of generated sign sequences given their 
skeletal correspondence.},
keywords = {University of Surrey},
pubstate = {published},
tppubtype = {inproceedings}
}