BigBird

Introduction to BigBird

BigBird is a transformer-based model that uses a sparse attention mechanism to reduce the quadratic dependency of self-attention to linear in the number of tokens, making it possible to scale to sequences up to 8 times longer than previously possible while maintaining high performance. The model was introduced by researchers at Google Research in 2020 and has since generated significant excitement in the natural language processing community.
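The BigBird pattern combines three components: a sliding window over neighbouring tokens, a handful of global tokens that attend everywhere, and a few random keys per query. Below is a minimal token-level NumPy sketch of that idea; the window width, number of global tokens, and number of random keys are illustrative choices rather than the paper's defaults, and the block-sparse implementation used in practice is more involved.

import numpy as np

def bigbird_mask(n, window=3, n_global=2, n_random=2, seed=0):
    # mask[i, j] == True means query i may attend to key j
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # local sliding window around position i
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
        # a few random keys per query
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    # global tokens attend to everything and are attended by everything
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

print(bigbird_mask(10).astype(int))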

Dilated Sliding Window Attention

Dilated Sliding Window Attention: An Overview

Attention-based models have become increasingly popular in natural language processing and other fields. However, the self-attention component of the original Transformer does not scale efficiently to long inputs, because every token attends to every other token. This is where Dilated Sliding Window Attention comes in.

What is Dilated Sliding Window Attention?

Dilated Sliding Window Attention is an attention pattern proposed as part of the Longformer architecture. Each token attends to a fixed-size window of surrounding tokens, but the window has gaps of size d (the dilation), so the same number of attended positions covers a wider span of the input. Using different dilation settings on different attention heads lets some heads focus on nearby context while others reach tokens much further away.
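A minimal NumPy sketch of the pattern as a boolean attention mask, where mask[i, j] is True when query i may attend to key j; the window size and dilation used here are arbitrary illustrative values.

import numpy as np

def dilated_window_mask(n, window=2, dilation=2):
    # query i attends to positions i + k * dilation for k in [-window, window]
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for k in range(-window, window + 1):
            j = i + k * dilation
            if 0 <= j < n:
                mask[i, j] = True
    return mask

print(dilated_window_mask(10).astype(int))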

Fixed Factorized Attention

Fixed Factorized Attention: A More Efficient Attention Pattern

In natural language processing, neural networks have to process large amounts of data. One way to make this manageable is to use an attention mechanism that focuses on certain parts of the input. Fixed factorized attention is an attention pattern that does just that.

Self-Attention

A self-attention layer is a foundational part of many neural networks that work with natural language. This layer maps a matrix of input embeddings to an output matrix by letting every position attend to every other position, which costs O(n^2) time and memory in the sequence length. Fixed factorized attention, proposed as part of the Sparse Transformer architecture, splits this computation across two cheaper heads: one head attends only to earlier positions within the same block of length l, while the other attends only to a few designated "summary" positions at the end of each block, so information about distant positions still flows through those summary columns.
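A small NumPy sketch of the two factorized heads as causal boolean masks; the block length l and the number of summary columns c are illustrative choices, and mask[i, j] is True when position i may attend to position j.

import numpy as np

def fixed_factorized_masks(n, l=4, c=1):
    # Head 1: each position attends to earlier positions in its own block of length l.
    # Head 2: each position attends to the last c positions of every block so far
    #         (the "summary" columns), i.e. positions j with j % l >= l - c.
    causal = np.tril(np.ones((n, n), dtype=bool))
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    head1 = causal & (i // l == j // l)
    head2 = causal & (j % l >= l - c)
    return head1, head2

h1, h2 = fixed_factorized_masks(12)
print((h1 | h2).astype(int))  # union of the two factorized heads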

Global and Sliding Window Attention

Overview of Global and Sliding Window Attention

Global and Sliding Window Attention is a pattern used in attention-based models to improve efficiency on long input sequences. It modifies the dense self-attention of the original Transformer, whose O(n^2) time and memory complexity makes it difficult to scale to longer inputs. Global and Sliding Window Attention overcomes this issue by combining a local sliding window, in which each token attends only to a fixed number of neighbours on either side, with global attention on a small set of pre-selected tokens (for example a classification token, or the question tokens in question answering) that attend to, and are attended by, every position in the sequence. The pattern was proposed as part of the Longformer architecture.
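A minimal NumPy sketch of the combined mask; which positions are treated as global and how wide the window is are illustrative choices made for this example.

import numpy as np

def global_sliding_window_mask(n, window=2, global_idx=(0,)):
    # mask[i, j] == True means query i may attend to key j
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
    g = list(global_idx)
    mask[g, :] = True   # global tokens attend to all positions
    mask[:, g] = True   # all positions attend to global tokens
    return mask

print(global_sliding_window_mask(10).astype(int))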

Neighborhood Attention

Understanding Neighborhood Attention

Neighborhood Attention is a concept used in Hierarchical Vision Transformers, where each token has its receptive field restricted to its nearest neighbouring pixels. It is a type of self-attention pattern proposed as an alternative to other local attention mechanisms. The idea behind Neighborhood Attention is that a token attends only to the pixels directly surrounding it, rather than to all of the pixels in the image. The concept is similar to Stand-Alone Self-Attention (SASA), which also restricts each query pixel to a local window; the main practical difference is at image borders, where Neighborhood Attention keeps the neighbourhood size constant by shifting it inward instead of padding.
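A toy NumPy sketch of the idea for a single head, using the raw feature map as queries, keys and values. A real implementation would apply learned projections, and would shift the neighbourhood at borders rather than shrink it as this simplified version does.

import numpy as np

def neighborhood_attention(x, k=3):
    # x: (H, W, d) feature map; each pixel attends only to the k x k
    # patch of pixels centred on it (clamped at the borders here).
    H, W, d = x.shape
    out = np.zeros_like(x)
    r = k // 2
    for y in range(H):
        for xpos in range(W):
            y0, y1 = max(0, y - r), min(H, y + r + 1)
            x0, x1 = max(0, xpos - r), min(W, xpos + r + 1)
            neigh = x[y0:y1, x0:x1].reshape(-1, d)   # keys/values
            q = x[y, xpos]                           # query
            scores = neigh @ q / np.sqrt(d)
            w = np.exp(scores - scores.max())
            w /= w.sum()
            out[y, xpos] = w @ neigh
    return out

feat = np.random.default_rng(0).normal(size=(8, 8, 16))
print(neighborhood_attention(feat).shape)  # (8, 8, 16)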

Routing Attention

Routing Attention: A New Attention Pattern Proposal

If you've ever used a search engine or tried to teach a computer to recognize objects in pictures, you know the power of attention: the ability to focus on certain parts of a dataset, whether text or images, is what allows models to perform complex tasks quickly and accurately. One recent proposal in attention patterns is Routing Attention, which is part of the Routing Transformer architecture. In simple terms, Routing Attention groups queries and keys into clusters with an online k-means procedure and lets each token attend only to the other tokens assigned to the same cluster. Because a token is routed to keys that are close to it in representation space, the model keeps most of the benefit of full attention while reducing its cost from O(n^2 d) to roughly O(n^1.5 d).
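A simplified NumPy sketch of the routing idea, using a plain batch k-means pass and the token representations themselves as queries, keys and values; the real Routing Transformer uses online (mini-batch) k-means over shared query/key projections and balanced cluster sizes, so this only illustrates the clustering-then-attending step.

import numpy as np

def routing_attention(x, n_clusters=4, iters=10, seed=0):
    # x: (n, d) token representations, used here as queries, keys and values.
    # Step 1: group tokens with a plain k-means pass (stand-in for the
    # online k-means routing used by the Routing Transformer).
    rng = np.random.default_rng(seed)
    n, d = x.shape
    centroids = x[rng.choice(n, n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((x[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(0)
    # Step 2: each token attends only to tokens routed to the same cluster.
    out = np.zeros_like(x)
    for c in range(n_clusters):
        idx = np.where(assign == c)[0]
        if len(idx) == 0:
            continue
        block = x[idx]
        scores = block @ block.T / np.sqrt(d)
        weights = np.exp(scores - scores.max(-1, keepdims=True))
        weights /= weights.sum(-1, keepdims=True)
        out[idx] = weights @ block
    return out

tokens = np.random.default_rng(1).normal(size=(32, 16))
print(routing_attention(tokens).shape)  # (32, 16)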

Sliding Window Attention

Sliding Window Attention is a way to improve the efficiency of attention-based models such as the Transformer. It uses a fixed-size window of attention around each token, so that with a window of size w each token attends to roughly w/2 tokens on either side instead of the whole sequence, reducing the complexity of attention from O(n^2) to O(n × w). This pattern is especially useful for long input sequences, where dense attention becomes impractical. Stacking multiple layers of windowed attention still yields a large receptive field, since each layer lets information travel one window further: after l layers the receptive field covers roughly l × w positions.
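A minimal NumPy sketch of the banded mask this pattern produces; the window size is an arbitrary illustrative value, and mask[i, j] is True when query i may attend to key j.

import numpy as np

def sliding_window_mask(n, window=3):
    # key j is visible to query i when it lies within `window` positions of i
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return np.abs(i - j) <= window

print(sliding_window_mask(8).astype(int))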

Strided Attention

Strided Attention: Understanding its Role in Sparse Transformers

Many machine learning models rely on the concept of attention, which allows the model to focus on specific parts of the input when making predictions. One form of attention is self-attention, which is commonly used in natural language processing tasks. Strided attention is a variant of self-attention proposed as part of the Sparse Transformer architecture. In strided attention, each position attends to two sets of locations: the previous l positions (a local component) and every l-th earlier position (a strided component). Choosing l close to the square root of the sequence length means each factorized head attends to about sqrt(n) positions, so the combined cost is O(n sqrt(n)) rather than O(n^2). The pattern works particularly well for data with a periodic structure, such as images or music, where the stride can be aligned with the underlying layout (for example the row length of an image).
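A small NumPy sketch of the two strided heads as causal boolean masks; the stride value is an arbitrary illustrative choice.

import numpy as np

def strided_attention_masks(n, stride=4):
    # Head 1: each position attends to the previous `stride` positions.
    # Head 2: each position attends to every `stride`-th earlier position.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = causal & (i - j < stride)
    strided = causal & ((i - j) % stride == 0)
    return local, strided

loc, st = strided_attention_masks(12)
print((loc | st).astype(int))  # union of the two factorized heads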
