Attention Is All You Need

Vaswani et al. · NeurIPS 2017 · arXiv:1706.03762
transformers · attention · nlp · deep-learning
98,234 citations · 15.6K likes · 245 implementations

Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU.
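The attention mechanism the abstract refers to is scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Below is a minimal NumPy sketch of that formula; the function name, array shapes, and toy input are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    # Compare every query against every key, scaled by sqrt(d_k)
    # so the softmax stays in a well-behaved range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output position is a weighted average of the value vectors.
    return weights @ V

# Toy self-attention over 4 positions with 8-dimensional vectors.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because every position attends to every other position in a single matrix multiplication, the whole sequence can be processed in parallel, which is the source of the training-time advantage over recurrent models noted above.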

Key Contributions

  • Introduction of the Transformer architecture based entirely on attention mechanisms (multi-head attention; sketched after this list)
  • Elimination of recurrence and convolutions for improved parallelization
  • State-of-the-art results on machine translation benchmarks
  • Foundation for subsequent models like BERT, GPT, and T5
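The paper composes the attention operation above into multi-head attention: queries, keys, and values are projected into several lower-dimensional subspaces and attended to in parallel, then the per-head outputs are concatenated and mixed by a final projection. The sketch below reuses the scaled_dot_product_attention function defined earlier; the projection setup and dimensions are illustrative assumptions, not the authors' code.

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model into num_heads subspaces, attend in each, then recombine.

    X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model) projections.
    """
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        # Per-head slice of the projected queries, keys, and values.
        sl = slice(h * d_head, (h + 1) * d_head)
        Q, K, V = (X @ W_q)[:, sl], (X @ W_k)[:, sl], (X @ W_v)[:, sl]
        heads.append(scaled_dot_product_attention(Q, K, V))
    # Concatenate the head outputs and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o
```

Each head can learn to attend to different positions or relations, and all heads run as independent matrix multiplications, which preserves the parallelism highlighted in the contributions above.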