Abstract
We present a novel approach to automatic Sign Language Production using state-of-the-art Neural Machine Translation (NMT) and Image Generation techniques. Our
system is capable of producing sign videos from spoken language sentences. Contrary to
current approaches that depend on heavily annotated data, our approach requires
minimal gloss- and skeletal-level annotations for training. We achieve this by breaking
down the task into dedicated sub-processes. We first translate spoken language sentences
into sign gloss sequences using an encoder-decoder network. We then find a data-driven
mapping between glosses and skeletal sequences. We use the resulting pose information
to condition a generative model that produces sign language video sequences. We
evaluate our approach on the recently released PHOENIX14T Sign Language Translation
dataset. We set a baseline for text-to-gloss translation, reporting a BLEU-4 score of
16.34/15.26 on dev/test sets. We further demonstrate the video generation capabilities
of our approach by sharing qualitative results of generated sign sequences given their
skeletal correspondence.
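The three-stage pipeline summarised above (text-to-gloss translation with an encoder-decoder network, a data-driven gloss-to-skeleton mapping, and a pose-conditioned generative model) can be illustrated with a minimal sketch in PyTorch. Everything below is an illustrative assumption rather than the authors' configuration: the vocabulary and hidden sizes, the greedy decoding, the per-gloss pose lookup, and the 64x64 frame generator are placeholders chosen only to show how the stages connect.

# Minimal sketch of the three-stage pipeline described in the abstract (PyTorch).
# All module sizes, vocabularies, and the gloss-to-pose lookup are illustrative
# assumptions, not the actual configuration used in the paper.
import torch
import torch.nn as nn

class TextToGloss(nn.Module):
    """Encoder-decoder (seq2seq) network mapping spoken-language tokens to gloss tokens."""
    def __init__(self, src_vocab, gloss_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.gloss_emb = nn.Embedding(gloss_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, gloss_vocab)

    def forward(self, src_tokens, gloss_tokens):
        _, state = self.encoder(self.src_emb(src_tokens))               # encode the sentence
        dec_out, _ = self.decoder(self.gloss_emb(gloss_tokens), state)  # decode with teacher forcing
        return self.out(dec_out)                                        # gloss logits per step

def glosses_to_pose(gloss_ids, pose_bank):
    """Data-driven mapping: concatenate a stored skeletal sequence per gloss (assumed lookup)."""
    return torch.cat([pose_bank[int(g)] for g in gloss_ids], dim=0)     # (frames, pose_dim)

class PoseConditionedGenerator(nn.Module):
    """Generator producing one video frame conditioned on a skeletal pose vector."""
    def __init__(self, pose_dim=100, noise_dim=16, frame_pixels=64 * 64 * 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + noise_dim, 512), nn.ReLU(),
            nn.Linear(512, frame_pixels), nn.Tanh(),
        )

    def forward(self, pose, noise):
        return self.net(torch.cat([pose, noise], dim=-1)).view(-1, 3, 64, 64)

# Toy end-to-end pass with random data, showing how the stages chain together.
t2g = TextToGloss(src_vocab=1000, gloss_vocab=300)
logits = t2g(torch.randint(0, 1000, (1, 8)), torch.randint(0, 300, (1, 5)))
gloss_ids = logits.argmax(-1)[0]                                        # greedy decode (illustrative)
pose_bank = {i: torch.randn(10, 100) for i in range(300)}               # assumed: 10 pose frames per gloss
poses = glosses_to_pose(gloss_ids, pose_bank)
frames = PoseConditionedGenerator()(poses, torch.randn(poses.size(0), 16))
print(frames.shape)                                                     # (num_frames, 3, 64, 64)

In the paper the generator is adversarially trained (hence "Generative Adversarial Networks" in the title); the sketch omits the discriminator and training loop and only shows the conditioning path from text to gloss to pose to frames.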
BibTeX
@inproceedings{surrey848809,
  title     = {Sign Language Production using Neural Machine Translation and Generative Adversarial Networks},
  author    = {Stephanie Stoll and Necati Cihan Camgöz and Simon Hadfield and Richard Bowden},
  booktitle = {29th British Machine Vision Conference (BMVC 2018)},
  publisher = {British Machine Vision Association},
  year      = {2018},
  date      = {2018-09-01},
  url       = {http://epubs.surrey.ac.uk/848809/},
  keywords  = {University of Surrey},
  pubstate  = {published},
  tppubtype = {inproceedings}
}