Attention Is All You Need

Vaswani et al. · NeurIPS 2017 · arXiv:1706.03762
transformers · attention · nlp · deep-learning
98,234 citations · 15.6K likes · 245 implementations

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
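The attention mechanism the abstract refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula; the function name, array shapes, and toy input are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Compare every query against every key, scaled by sqrt(d_k)
    # so the softmax stays in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention over 4 positions with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because every position attends to every other position in a single matrix multiplication, the whole sequence can be processed in parallel, which is the source of the training-time advantage over recurrent models noted above.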

Key Contributions

  • Introduction of the Transformer architecture based entirely on attention mechanisms (multi-head attention; sketched after this list)
  • Elimination of recurrence and convolutions for improved parallelization
  • State-of-the-art results on machine translation benchmarks
  • Foundation for subsequent models like BERT, GPT, and T5
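The paper composes the attention operation above into multi-head attention: queries, keys, and values are projected into several lower-dimensional subspaces and attended to in parallel, then the per-head outputs are concatenated and mixed by a final projection. The sketch below reuses the scaled_dot_product_attention function defined earlier; the projection setup and dimensions are illustrative assumptions, not the authors' code.

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into num_heads subspaces, attend in each, then recombine.

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model) projections.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Per-head slice of the projected queries, keys, and values.
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = (X @ W_q)[:, sl], (X @ W_k)[:, sl], (X @ W_v)[:, sl]
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the head outputs and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o
```

Each head can learn to attend to different positions or relations, and all heads run as independent matrix multiplications, which preserves the parallelism highlighted in the contributions above.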