<doi_batch xmlns="http://www.crossref.org/schema/4.4.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="4.4.0"><head><doi_batch_id>3db4b26e-57b4-4ed9-8e1f-23e620908be4</doi_batch_id><timestamp>20210524042524439</timestamp><depositor><depositor_name>wseas:wseas</depositor_name><email_address>mdt@crossref.org</email_address></depositor><registrant>MDT Deposit</registrant></head><body><journal><journal_metadata language="en"><full_title>WSEAS TRANSACTIONS ON INFORMATION SCIENCE AND APPLICATIONS</full_title><issn media_type="electronic">2224-3402</issn><issn media_type="print">1790-0832</issn><archive_locations><archive name="Portico"/></archive_locations><doi_data><doi>10.37394/23209</doi><resource>http://wseas.org/wseas/cms.action?id=4046</resource></doi_data></journal_metadata><journal_issue><publication_date media_type="online"><month>3</month><day>2</day><year>2021</year></publication_date><publication_date media_type="print"><month>3</month><day>2</day><year>2021</year></publication_date><journal_volume><volume>18</volume><doi_data><doi>10.37394/23209.2021.18</doi><resource>https://www.wseas.org/cms.action?id=23296</resource></doi_data></journal_volume></journal_issue><journal_article language="en"><titles><title>POS-Tagging based Neural Machine Translation System for European Languages using Transformers</title></titles><contributors><person_name sequence="first" contributor_role="author"><given_name>Preetham</given_name><surname>Ganesh</surname><affiliation>University of Texas at Arlington, Arlington, TX, USA</affiliation></person_name><person_name sequence="additional" contributor_role="author"><given_name>Bharat S.</given_name><surname>Rawal</surname><affiliation>Gannon University, Erie, PA, USA</affiliation></person_name><person_name sequence="additional" contributor_role="author"><given_name>Alexander</given_name><surname>Peter</surname><affiliation>Softsquare, Silver Spring, MD, USA</affiliation></person_name><person_name 
sequence="additional" contributor_role="author"><given_name>Andi</given_name><surname>Giri</surname><affiliation>Softsquare, Toronto, ON, Canada</affiliation></person_name></contributors><jats:abstract xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1"><jats:p>Interaction between human beings has always faced difficulties, one of which is the language barrier. It would be a tedious task for someone to learn the vocabulary of a new language in a short period and converse with a native speaker without grammatical errors. Moreover, having a human translator at hand at all times would be intrusive and expensive. We propose a novel Neural Machine Translation (NMT) approach that uses interlanguage word similarity-based model training and Part-of-Speech (POS) tagging-based model testing. We compare these approaches using two classical architectures: a Luong Attention-based Sequence-to-Sequence model and a Transformer-based model. The sentences for the Luong Attention-based Sequence-to-Sequence model were tokenized using the SentencePiece tokenizer, while the sentences for the Transformer model were tokenized using the Subword Text Encoder. Three European languages were selected for modeling, namely Spanish, French, and German. The datasets were downloaded from multiple sources, such as the Europarl Corpus, ParaCrawl Corpus, and Tatoeba Project Corpus. 
Sparse Categorical Cross-Entropy served as the loss function during the training stage, while the Bilingual Evaluation Understudy (BLEU) score, Precision score, and Metric for Evaluation of Translation with Explicit Ordering (METEOR) score were the evaluation metrics during the testing stage.</jats:p></jats:abstract><publication_date media_type="online"><month>5</month><day>24</day><year>2021</year></publication_date><publication_date media_type="print"><month>5</month><day>24</day><year>2021</year></publication_date><pages><first_page>26</first_page><last_page>33</last_page></pages><ai:program xmlns:ai="http://www.crossref.org/AccessIndicators.xsd" name="AccessIndicators"><ai:free_to_read start_date="2021-05-24"/><ai:license_ref applies_to="am" start_date="2021-05-24">https://www.wseas.org/multimedia/journals/information/2021/a105109-004(2021).pdf</ai:license_ref></ai:program><archive_locations><archive name="Portico"/></archive_locations><doi_data><doi>10.37394/23209.2021.18.5</doi><resource>https://www.wseas.org/multimedia/journals/information/2021/a105109-004(2021).pdf</resource></doi_data><citation_list><citation key="ref0"><unstructured_citation>Ethnologue. How many languages are there in the world?, Feb 2020. </unstructured_citation></citation><citation key="ref1"><unstructured_citation>Worldometer. World population (live), Oct 2020. </unstructured_citation></citation><citation key="ref2"><unstructured_citation>Wikipedia contributors. List of languages by total number of speakers — Wikipedia, the free encyclopedia. [Online; accessed 21-September-2020]. </unstructured_citation></citation><citation key="ref3"><unstructured_citation>Wikipedia contributors. List of languages by the number of countries in which they are recognized as an official language — Wikipedia, the free encyclopedia, 2020. </unstructured_citation></citation><citation key="ref4"><unstructured_citation>Wikipedia contributors. Official language — Wikipedia, the free encyclopedia, 2020. 
</unstructured_citation></citation><citation key="ref5"><doi>10.1038/scientificamericanmind0718-5</doi><unstructured_citation>Dana Smith. At what age does our ability to learn a new language like a native speaker disappear?, May 2018. </unstructured_citation></citation><citation key="ref6"><unstructured_citation>Steffy Zameo. Neural machine translation: Tips &amp; advantages for digital translations — textmaster, May 2019. </unstructured_citation></citation><citation key="ref7"><unstructured_citation>Delip Rao. The real problems with neural machine translation, Jul 2018. </unstructured_citation></citation><citation key="ref8"><unstructured_citation>Wikipedia contributors. Statistical machine translation — Wikipedia, the free encyclopedia. [Online; accessed 21-September-2020]. </unstructured_citation></citation><citation key="ref9"><unstructured_citation>Wikipedia contributors. Neural machine translation — Wikipedia, the free encyclopedia. [Online; accessed 21-September-2020]. </unstructured_citation></citation><citation key="ref10"><unstructured_citation>Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112, 2014. </unstructured_citation></citation><citation key="ref11"><unstructured_citation>Wikipedia contributors. Seq2seq — Wikipedia, the free encyclopedia, 2020. [Online; accessed 21-September-2020]. </unstructured_citation></citation><citation key="ref12"><doi>10.3115/v1/w14-4009</doi><unstructured_citation>Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. </unstructured_citation></citation><citation key="ref13"><doi>10.18653/v1/d15-1166</doi><unstructured_citation>Minh-Thang Luong, Hieu Pham, and Christopher D Manning. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025, 2015. 
</unstructured_citation></citation><citation key="ref14"><doi>10.18653/v1/p16-1162</doi><unstructured_citation>Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015. </unstructured_citation></citation><citation key="ref15"><doi>10.18653/v1/d18-2012</doi><unstructured_citation>Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71. Association for Computational Linguistics, November 2018. </unstructured_citation></citation><citation key="ref16"><doi>10.1162/tacl_a_00065</doi><unstructured_citation>Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. </unstructured_citation></citation><citation key="ref17"><doi>10.18653/v1/p17-1012</doi><unstructured_citation>Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344, 2016. </unstructured_citation></citation><citation key="ref18"><unstructured_citation>Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017. </unstructured_citation></citation><citation key="ref19"><doi>10.1145/3321124</doi><unstructured_citation>Yongjing Yin, Jinsong Su, Huating Wen, Jiali Zeng, Yang Liu, and Yidong Chen. POS tag-enhanced coarse-to-fine attention for neural machine translation. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 18(4), April 2019. 
</unstructured_citation></citation><citation key="ref20"><doi>10.18653/v1/w17-4708</doi><unstructured_citation>Jan Niehues and Eunah Cho. Exploiting linguistic resources for neural machine translation using multi-task learning, 2017. </unstructured_citation></citation><citation key="ref21"><unstructured_citation>Charles Kelly. English-Spanish sentences from the Tatoeba Project, 2020. [Online; accessed 20-September-2020]. </unstructured_citation></citation><citation key="ref22"><unstructured_citation>Philipp Koehn. Europarl: A parallel corpus for statistical machine translation. Citeseer, 2005. </unstructured_citation></citation><citation key="ref23"><unstructured_citation>ParaCrawl, 2018. </unstructured_citation></citation><citation key="ref24"><doi>10.3115/1118108.1118117</doi><unstructured_citation>Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pages 63–70, USA, 2002. Association for Computational Linguistics. </unstructured_citation></citation><citation key="ref25"><unstructured_citation>Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python, 2020. </unstructured_citation></citation><citation key="ref26"><unstructured_citation>J. F. Kolen and S. C. Kremer. Gradient Flow in Recurrent Nets: The Difficulty of Learning Long-Term Dependencies, pages 237–243. 2001. </unstructured_citation></citation><citation key="ref27"><unstructured_citation>Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem. ArXiv, abs/1211.5063, 2012. </unstructured_citation></citation><citation key="ref28"><doi>10.3115/1073083.1073135</doi><unstructured_citation>Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. 
pages 311–318, 2002. </unstructured_citation></citation><citation key="ref29"><doi>10.3115/1626355.1626389</doi><unstructured_citation>Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan, June 2005. Association for Computational Linguistics. </unstructured_citation></citation><citation key="ref30"><doi>10.1145/3190508.3190551</doi><unstructured_citation>Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mane, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viegas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous distributed systems, 2016. </unstructured_citation></citation><citation key="ref31"><unstructured_citation>Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.</unstructured_citation></citation></citation_list></journal_article></journal></body></doi_batch>