.. TransformerNNX documentation master file, created by
   sphinx-quickstart on Mon Dec 16 12:46:55 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

TransformerNNX documentation
============================

**Repository:** https://github.com/mohsenh17/TransformerNNX

===========================
**What are Transformers?**
===========================

Transformers are a class of deep learning models that revolutionized natural
language processing (NLP) and other fields by building the entire model around
a highly effective mechanism called attention. Unlike earlier models such as
recurrent neural networks (RNNs) and long short-term memory networks (LSTMs),
transformers process all positions of the input in parallel, making them faster
and more efficient, especially for long sequences. Since their introduction in
"Attention Is All You Need" :cite:`vaswani2017attention`, transformers have
become the backbone of most state-of-the-art models in NLP, computer vision,
and beyond.

Before transformers, sequence-to-sequence models relied on RNNs, which
struggled with long-range dependencies and were computationally inefficient
because they process tokens one at a time. Transformers addressed these issues
by:

* Replacing recurrence with self-attention mechanisms.
* Allowing global dependencies within a sequence to be captured effectively.
* Utilizing parallel computation for faster training.

Transformers made the attention mechanism their core component, enabling
models to:

* Dynamically focus on the most relevant parts of the input sequence.
* Handle tasks like translation, summarization, and question answering with
  strong performance.
* Scale to larger datasets and models, leading to breakthroughs like BERT,
  GPT, and T5.

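
To make "focusing on the most relevant parts of the input" concrete, the
following is a minimal sketch of single-head scaled dot-product self-attention
in JAX. It is an illustration only, not this project's implementation: the
function name ``scaled_dot_product_attention`` and the toy shapes are
assumptions, and a real transformer layer would first apply learned query,
key, and value projections.

.. code-block:: python

   import jax
   import jax.numpy as jnp


   def scaled_dot_product_attention(q, k, v):
       """Hypothetical helper: return (output, weights) for one attention head.

       q: (seq_q, d_k), k: (seq_k, d_k), v: (seq_k, d_v).
       """
       d_k = q.shape[-1]
       # Similarity of every query with every key, scaled by sqrt(d_k).
       scores = q @ k.T / jnp.sqrt(d_k)
       # Each row becomes a probability distribution over key positions.
       weights = jax.nn.softmax(scores, axis=-1)
       # Each output row is a weighted average of the value vectors.
       return weights @ v, weights


   # Toy self-attention: queries, keys, and values all come from one sequence.
   x = jax.random.normal(jax.random.PRNGKey(0), (4, 8))  # 4 tokens, 8 features
   out, weights = scaled_dot_product_attention(x, x, x)

Each row of ``weights`` sums to one, so every output token is a convex
combination of the value vectors of all tokens in the sequence.
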

-----------------------------------
**Core Ideas Behind Transformers**
-----------------------------------

**Attention Mechanism:**
At the heart of the transformer is the attention mechanism, which computes a
weighted representation of the input sequence, allowing the model to focus on
the most important tokens for a given task. Self-attention enables the model
to understand relationships between words irrespective of their position in
the sequence.

**Parallelism:**
Unlike RNNs, transformers process entire sequences simultaneously, leveraging
GPUs and TPUs more effectively. This design eliminates the sequential
bottleneck and significantly accelerates training and inference.

===========================================
**High-Level Transformer Architecture**
===========================================

The transformer architecture consists of two main components: the **encoder**
and the **decoder**. Each plays a distinct role in sequence-to-sequence tasks
like machine translation.

**Encoder:**

* The encoder takes an input sequence and converts it into a series of
  context-aware representations.
* It is composed of multiple identical layers, each containing two primary
  sublayers:

  * A multi-head self-attention mechanism.
  * A position-wise feedforward network.

**Decoder:**

* The decoder generates the output sequence by attending to the encoder’s
  representations and its own previously generated tokens.
* Like the encoder, it consists of stacked layers with three sublayers:

  * A masked multi-head self-attention mechanism (to ensure causal order).
  * A multi-head cross-attention mechanism (attending to the encoder’s
    outputs).
  * A position-wise feedforward network.

-------------------
**Flow of Data**
-------------------

1. **Input Representation:** The input tokens are embedded into dense vectors,
   and positional encodings are added to introduce information about the order
   of tokens.
2. **Encoder Processing:** The input embeddings pass through the encoder
   layers, resulting in context-aware representations.
3. **Decoder Processing:** The decoder uses the encoder’s outputs and
   previously generated tokens to produce the final sequence.
4. **Output Generation:** The decoder’s outputs are passed through a linear
   layer and softmax function to generate probabilities for the next token.

-------------------------------
**Diagram of the Transformer**
-------------------------------

.. image:: transformer.png
   :width: 50%
   :align: center

--------------------------------
**Role of Attention Mechanism**
--------------------------------

* **Self-Attention**: Helps encode contextual relationships within the same
  sequence (e.g., understanding dependencies in a sentence).
* **Cross-Attention**: Allows the decoder to align and extract relevant
  information from the encoder’s outputs.

----------------------------------
**Strengths of the Architecture**
----------------------------------

* **Scalability**: Handles large datasets and models efficiently.
* **Flexibility**: Extensible to tasks beyond NLP, such as vision and
  reinforcement learning.
* **Effectiveness**: Achieves state-of-the-art results in diverse tasks by
  capturing both local and global dependencies in data.

.. toctree::
   :maxdepth: 2
   :caption: Core Building Blocks

   api/Positional_encoding.ipynb
   api/Encoder

***************
Bibliography
***************

.. bibliography::
   :cited:
   :style: plain