Notes on Presentations

Group 1 Presentation: Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data

Paper Citation

Podina, L., Eastman, B., & Kohandel, M. (2023). Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data. In Proceedings of the 40th International Conference on Machine Learning (Vol. 202). PMLR, Honolulu, Hawaii, USA.

Background

Differential equations

Examples of differential equations in physics include Newton's second law (which is an ordinary differential equation), the Navier-Stokes equations (which are partial differential equations), etc.

Existing methods of solving differential equations:

  • Analytical methods, such as integration or separation of variables.
  • Numerical methods, such as finite difference, finite volume, or finite elements.
  • Data-driven approaches: these involve Universal Differential Equations (UDEs) and Physics-Informed Neural Networks (PINNs), which are the focus of this paper.

Introduction to PINNs

In many machine learning approaches, the goal is to approximate the solution of a DE with a feed-forward neural network trained with an MSE loss. What makes a PINN physics-informed is an extra loss term that penalizes the network for deviating from the governing DE.

Introduction to UDEs

Here, the differential equation is expressed as a sum of two terms: the known physics-based model and an unknown neural network.

Paper Contributions

Universal Physics-Informed Neural Networks (UPINNs)

PINNs and UDEs are combined, addressing the limitations of the original methods, while sharing their benefits.

The model integrates three network components:

  • Surrogate Solution Network U: links to the measurement loss
  • Unknown Differential Operator Network F: enters the PINN loss together with U
  • Boundary Condition Network B: links to the boundary loss

The loss function contains three terms:

  • Measurement (MSE) loss on the observed data
  • Boundary loss, if boundary conditions are provided with the problem
  • PINN loss: ensures the model respects the governing differential equation (a minimal sketch combining these terms is shown below)
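
To make the three-term objective concrete, here is a minimal PyTorch sketch for a one-dimensional problem; the network sizes, the placeholder "known physics" term, and all names are illustrative assumptions rather than the authors' code.

  import torch
  import torch.nn as nn

  # Surrogate solution network U(t) and unknown-operator network F(u); sizes are arbitrary.
  u_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
  f_net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))

  def upinn_loss(t_data, u_data, t_colloc, u0):
      # (1) measurement loss on the sparse/noisy observations
      mse = ((u_net(t_data) - u_data) ** 2).mean()
      # (2) PINN residual loss: dU/dt should match known physics plus the learned operator F(U)
      t = t_colloc.clone().requires_grad_(True)
      u = u_net(t)
      du_dt = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
      known_physics = -0.5 * u              # placeholder for the known part of the model
      residual = ((du_dt - known_physics - f_net(u)) ** 2).mean()
      # (3) boundary / initial-condition loss
      bc = ((u_net(torch.zeros(1, 1)) - u0) ** 2).mean()
      return mse + residual + bc

  loss = upinn_loss(torch.rand(8, 1), torch.rand(8, 1), torch.rand(64, 1), torch.tensor([[1.0]]))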

Experimental Validation

1. Lotka-Volterra Model

They first experimented with the UPINN on the Lotka-Volterra system of differential equations, which are used to model predator-prey dynamics:

[math]\displaystyle{ \frac{dx}{dt} = \alpha x - \beta xy }[/math]

[math]\displaystyle{ \frac{dy}{dt} = -\delta y + \gamma xy }[/math]

The UDE and PINN were individually tested on two scenarios: sparse data (where there are very few input data points) and noisy data. Alone, each model did not do very well, especially when the data was very sparse or very noisy. When the UPINN was used, the solution was quite good, even with high sparsity or noise.


2. Viscous Burgers’ Equation

Their next experiment used the viscous Burgers' equation, a model from fluid dynamics.

[math]\displaystyle{ \frac{\partial u}{\partial t} = -u \frac{\partial u}{\partial x} + \nu \frac{\partial^2 u}{\partial x^2} }[/math]


3. Cell Apoptosis Model


Group 2 Presentation: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty

Presented by:

Kareena Bhalla and Chelsea Huffman

Paper Citation

Li, Y., Wei, F., Zhang, C., & Zhang, H. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv. arxiv.org/abs/2401.15077

Background

This paper concerns Large Language Models (LLMs). Autoregressive decoding in LLMs refers to generating text one token at a time, with the model conditioning each prediction on the tokens that came before it. This is slow and computationally costly.

Speculative Sampling

Speculative sampling is a technique meant to reduce the computational cost and runtime of autoregressive decoding. The process consists of two main parts:

  • Draft stage: A small, fast model suggests some tokens.
  • Verification stage: In parallel, the large main LLM verifies these tokens and selects the best ones.

How do we choose a draft model that behaves like the main LLM but runs faster? One approach is to use a reduced version of the main LLM, but this does not work when no small version is available; moreover, a draft model with far fewer parameters often comes with high overhead and reduced draft accuracy.

Technical Contributions

Extrapolation Algorithm for Greater Language Model Efficiency (EAGLE)

Before predicting the next feature, the draft model is also given the token sequence advanced by one step (the token that was actually sampled), so the drafting is done at the feature level rather than the token level.

One advantage of the EAGLE method is that only a single decoder layer needs to be trained rather than an entire draft model. This makes the training process extremely efficient in terms of runtime/computational cost.

The EAGLE method makes improvements to the process based on the following 2 observations:

1. Autoregression is simpler at the feature level than the token level.

2. The uncertainties from the sampling process negatively affect performance (a minimal sketch of a draft head that addresses both observations follows).
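
The following is a rough, assumption-laden PyTorch sketch of a feature-level draft head: it consumes the backbone's features together with the embedding of the token sampled one step ahead, which is how EAGLE ties the two observations together. Module names, shapes, and the single-layer choice are illustrative.

  import torch
  import torch.nn as nn

  class DraftHead(nn.Module):
      """One lightweight layer that autoregresses in feature space (observation 1), conditioned
      on the token actually sampled one step ahead to resolve sampling uncertainty (observation 2)."""
      def __init__(self, d, vocab):
          super().__init__()
          self.embed = nn.Embedding(vocab, d)                 # embedding of the shifted tokens
          self.fuse = nn.Linear(2 * d, d)                     # combine feature and token embedding
          self.layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)

      def forward(self, feats, next_tokens):                  # feats: (B, T, d), next_tokens: (B, T)
          x = self.fuse(torch.cat([feats, self.embed(next_tokens)], dim=-1))
          return self.layer(x)                                # predicted next-step features

  head = DraftHead(d=512, vocab=32000)
  pred_feats = head(torch.randn(1, 5, 512), torch.randint(0, 32000, (1, 5)))
  # pred_feats would then be mapped to draft tokens by the frozen LM head of the target model.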

Experimental Results

1. EAGLE is much faster than ordinary autoregressive decoding.

2. EAGLE can be applied to various LLMs without reducing the model's generation quality.

3. EAGLE can generate approximately 3.2–4.5 tokens per pass.

Related Works

There is now a newer version, EAGLE-2, which improves on this work by introducing a dynamically adjustable draft tree. The draft tree is adjusted based on context and position, building on speculative sampling. In testing, EAGLE-2 is roughly 20%-40% faster than EAGLE-1.


Group 3 Presentation: Mamba: Linear-Time Sequence Modelling with Selective State Spaces

Presented by:

Liang Wu, Jingcheng Yu, Candace Ng

Paper Citation

Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv. https://arxiv.org/abs/2312.00752.

Background

Technical Contributions

- Introduce a selection mechanism for state space models that enables input-dependent state updates

- Develop a hardware-aware recurrent scan algorithm for efficient computation

- Propose Mamba, an attention-free, linear-time architecture that achieves state-of-the-art performance across multiple modalities

Architecture

Mamba integrates selective SSMs into a single homogeneous block. Each block comprises: a linear projection, a selective SSM layer, and an MLP block. This architecture is attention-free and scales linearly in sequence length.
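
As a rough illustration of the selection mechanism (not the paper's hardware-aware implementation), the sketch below runs a sequential scan in which the step size and the B and C matrices are functions of the current input; all shapes and parameter names are assumptions.

  import torch
  import torch.nn.functional as F

  def selective_ssm_scan(x, W_B, W_C, W_dt, A_log):
      """Toy sequential scan of a selective SSM: the step size and the B, C matrices
      depend on the current input x_t. x: (T, d); state size n; returns y: (T, d)."""
      T, d = x.shape
      n = A_log.shape[1]
      h = torch.zeros(d, n)
      ys = []
      for t in range(T):
          dt = F.softplus(x[t] @ W_dt)                          # (d,) input-dependent step size
          B = x[t] @ W_B                                        # (n,) input-dependent input map
          C = x[t] @ W_C                                        # (n,) input-dependent output map
          A_bar = torch.exp(dt[:, None] * (-torch.exp(A_log)))  # (d, n) discretized decay
          h = A_bar * h + (dt[:, None] * B[None, :]) * x[t][:, None]
          ys.append(h @ C)
      return torch.stack(ys)

  d, n, T = 4, 8, 16
  y = selective_ssm_scan(torch.randn(T, d), torch.randn(d, n), torch.randn(d, n),
                         torch.randn(d, d), A_log=torch.zeros(d, n))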

Related Work

- Builds on structured SSMs (S4, S5) and architectures such as H3 and Hyena.

- Demonstrates theoretical and empirical connections to classical RNN gating.

Future Directions:

- Scale to larger models and refine training recipes

- Extend Mamba to multimodal tasks (e.g. video)

- Explore additional downstream affordances (fine-tuning, in-context learning, quantization)

Group 4 Presentation: Learning spatiotemporal dynamics with a pretrained generative model

Presented by:

- Karolina Suszek

- Negin Amou

- Muhammad Azeem

Paper Citation

Z. Li et al., “Learning spatiotemporal dynamics with a pretrained generative model,” Nature Machine Intelligence, vol. 6, no. 12. Springer Science and Business Media LLC, pp. 1566–1579, Dec. 06, 2024. doi: 10.1038/s42256-024-00938-z.

Background

- Spatiotemporal dynamics: how the state of a physical system varies with space and time

- Real datasets often contain sparse measurements, since only a limited number of sensors are available. There needs to be a way to convert the sparse measurements into a full spatiotemporal field.

- Existing solutions learn a direct input-to-output mapping and ignore missing data, but this reduces the model's ability to generalize.

- The paper proposes the Sparse-Sensor-Assisted Score-Based Generative Model (S3GM), which uses unlabeled data during training and can reconstruct incomplete data after training, making accurate predictions even when little information is available.

- Key idea: learn the probability distribution of spatiotemporal data with a score-based generative model and refine the samples via stochastic sampling.

Technical Contributions

Core Components:

- Pre-training stage: learns the joint probability distribution of the data

- Generating stage: uses a stochastic differential equation to refine and generate full-field predictions

- Refinement mechanism: ensures alignment with observations and enforces sequence consistency

Group 5 Presentation: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Presented by:

Paper Citation

De, S., Smith, S., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., Desjardins, G., Doucet, A., Budden, D., Teh, Y. W., Pascanu, R., De Freitas, N., Gulcehre, C. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv. arxiv.org/pdf/2402.19427

Background

RNNs and transformers

Recurrent neural networks (RNNs) are a basic architecture for handling sequential data; they are good for short sequences but can struggle with long ones.

Transformers generally perform better than RNNs and have become dominant in recent years. However, for long sequences they become computationally expensive for a couple of reasons: their global attention mechanism has quadratic complexity, and the key-value (KV) cache grows linearly with sequence length (multi-query attention shrinks the cache but does not change this linear growth).

Technical Contributions

Model architecture

Their proposed models -- Hawk and Griffin -- use the following structures:

  • Residual block: This is the main structure in the architecture. It starts with an RMSNorm layer being applied to the hidden state, followed by a temporal mixing block. There's a residual connection. Next, there is another RMSNorm layer followed by an MLP (multi-layer perceptron) block.
  • Gated MLP block: This is a gated feedforward block. There are 2 branches: one is linear and one uses a GeLU activation. This part of the structure is used for feature selection.
  • Temporal mixing block: They used 3 different kinds of temporal mixing blocks in their models: global multi-query attention, local sliding window attention, and recurrent blocks (what they proposed). The recurrent block contains 2 parallel branches. One has a GeLU activation, while the other has a temporal 1D convolutional layer followed by a RG-LRU layer (a proposal in this paper).

Real-Gated Linear Recurrent Unit (RG-LRU)

RG-LRUs are inspired by standard Linear Recurrent Units (LRUs), but add a gating mechanism. They are more computationally efficient than attention, since they avoid its quadratic complexity. Mathematically, the layer is defined as follows.

The recurrence gate:

[math]\displaystyle{ r_t = \sigma (W_a x_t + b_a) }[/math]

The input gate:

[math]\displaystyle{ i_t = \sigma (W_x x_t + b_x) }[/math]

[math]\displaystyle{ a_t = a^{c r_t} }[/math]

where [math]\displaystyle{ a = \sigma(\Lambda) }[/math] is a learnable parameter and [math]\displaystyle{ c }[/math] is a fixed constant (the paper uses [math]\displaystyle{ c = 8 }[/math]).

The output:

[math]\displaystyle{ h_t = a_t \odot h_{t-1} + \sqrt{1-a_t^2} \odot (i_t \odot x_t) }[/math]
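
A direct transcription of these equations into a sequential PyTorch scan is shown below; the weight shapes and initialization are assumptions, and the real implementation uses a fused/parallel scan rather than a Python loop.

  import torch

  def rg_lru(x, W_a, b_a, W_x, b_x, Lambda, c=8.0):
      """Sequential scan following the equations above; x: (T, d), gates act elementwise."""
      T, d = x.shape
      a = torch.sigmoid(Lambda)                 # learnable per-channel decay in (0, 1)
      h = torch.zeros(d)
      out = []
      for t in range(T):
          r = torch.sigmoid(x[t] @ W_a + b_a)   # recurrence gate
          i = torch.sigmoid(x[t] @ W_x + b_x)   # input gate
          a_t = a ** (c * r)                    # input-dependent decay
          h = a_t * h + torch.sqrt(1 - a_t ** 2) * (i * x[t])
          out.append(h)
      return torch.stack(out)

  d, T = 8, 16
  h_seq = rg_lru(torch.randn(T, d), torch.randn(d, d), torch.zeros(d),
                 torch.randn(d, d), torch.zeros(d), Lambda=torch.zeros(d))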

Group 6 Presentation: Learning to (Learn at Test Time): RNNs with Expressive Hidden States

Presented by:

- Pingchu Zhang

- Zhiyang Cheng

Paper Citation

Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv. https://doi.org/10.48550/arXiv.2407.04620

Background

For modern RNNs, performance on long contexts is limited by the expressive power of their fixed-size hidden state. Hence, the authors introduce test-time training (TTT).

Technical Contributions

- Introduce TTT layers, where the hidden state is a model and the update rule is self-supervised learning, offering a new research direction.

- TTT-Linear, a simple implementation of TTT layers, outperforms Transformers and Mamba in evaluations.

- Improve the hardware efficiency of TTT layers through mini-batch TTT and the dual form, making TTT-Linear already a practical building block for LLMs.

Methodology

The key idea is to make the hidden state itself a model with weights, and the update rule a gradient step on the self-supervised loss. Then updating the hidden state on a test sequence is equivalent to training the model at test time.

TTT as updating a hidden state

Training a network with TTT layers

- Training the larger network as the outer loop and training weights within each TTT layer as the inner loop is preferred.

- TTT layers can replace RNN or self-attention layers in any network architecture. Training a network with TTT layers also works the same way as training any other language model.

Learning a self-supervised task for TTT

Some outer-loop parameters are added to make this self-supervised task learnable.

The input [math]\displaystyle{ x_t }[/math] is transformed using a learnable matrix [math]\displaystyle{ \theta_K }[/math] to create a training view [math]\displaystyle{ \tilde x_t = \theta_K x_t }[/math].

The reconstruction label is another low-rank projection [math]\displaystyle{ \theta_V x_t }[/math], which can differ from the input. A test view [math]\displaystyle{ \theta_Q x_t }[/math] is created in the same way.

The new self-supervised loss is [math]\displaystyle{ \ell(W; x_t) = \| f(\theta_K x_t; W) - \theta_V x_t \|^2 }[/math], and the output rule is modified to [math]\displaystyle{ z_t = f(\theta_Q x_t; W_t) }[/math].
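
A minimal sketch of one TTT inner-loop step, assuming the simplest case where f is linear (TTT-Linear) so the gradient of the reconstruction loss can be written in closed form; the learning rate, dimensions, and plain-SGD update are assumptions.

  import torch

  def ttt_linear_step(W, x_t, theta_K, theta_V, theta_Q, lr=0.1):
      """One inner-loop step: gradient step on the reconstruction loss, then output
      with the updated hidden state (the weight matrix W)."""
      train_view = theta_K @ x_t                      # training view of the token
      label_view = theta_V @ x_t                      # reconstruction target
      err = W @ train_view - label_view
      grad_W = 2.0 * torch.outer(err, train_view)     # gradient of ||W k - v||^2 w.r.t. W
      W_new = W - lr * grad_W                         # updating the hidden state = learning
      z_t = W_new @ (theta_Q @ x_t)                   # output rule with the test view
      return W_new, z_t

  d = 16
  W = torch.zeros(d, d)
  for t in range(8):                                  # scanning the sequence = training at test time
      W, z = ttt_linear_step(W, torch.randn(d), torch.randn(d, d),
                             torch.randn(d, d), torch.randn(d, d))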

Group 7 Presentation: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains

Presented by:

- Jonathan Gallagher

- Mariya Anashkina

Paper Citation

A. V. Makkuva et al., “Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains,” 2024, arXiv. doi: 10.48550/ARXIV.2402.04161.

Background

- Markov chains: probability models in which the next state of a system depends only on the current state (or, for higher-order chains, on a fixed number of prior states).

- Language can be modelled as a Markov process, and token predictions from transformers closely follow Markov chains.

- The paper introduces a framework to systematically analyze, through the lens of a Markov chain, how transformers learn to model sequential data.

- The objective is to explore how a transformer succeeds or struggles on first-order Markov chains, distributions that need only one step of memory (a small data-generation sketch follows).
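
A small data-generation sketch for the setting studied in the paper, assuming a binary first-order Markov chain parameterized by flipping probabilities p and q; the function name and defaults are illustrative.

  import numpy as np

  def sample_markov_sequence(N, p, q, seed=0):
      """Binary first-order Markov chain with flipping probabilities
      P(x=1 | prev=0) = p and P(x=0 | prev=1) = q; stationary distribution (q, p)/(p+q)."""
      rng = np.random.default_rng(seed)
      x = np.empty(N, dtype=np.int64)
      x[0] = rng.random() < p / (p + q)               # start from the stationary distribution
      for n in range(1, N):
          flip = int(rng.random() < (p if x[n - 1] == 0 else q))
          x[n] = x[n - 1] ^ flip
      return x

  seq = sample_markov_sequence(1024, p=0.2, q=0.3)    # p + q < 1: the benign regime in the theory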

Theoretical Results

Theorem 1 (Global minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] Then for all [math]\displaystyle{ (p,q), }[/math] there exists a [math]\displaystyle{ \theta_{\star}\in\mathbb{R}^{D-d} }[/math] with an explicit construction such that it is a global minimum for the population loss [math]\displaystyle{ L(\cdot) }[/math].

Theorem 2 (Bad local minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] If [math]\displaystyle{ p+q\gt 1, }[/math] there exists an explicit [math]\displaystyle{ \theta_{\pi}\in\mathbb{R}^{D-d} }[/math] such that it is a bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math].

Theorem 3 (Global minimum). Consider the same setting as in Thm. 1. Then for all [math]\displaystyle{ (p, q), }[/math] if [math]\displaystyle{ \theta_{\star} = (e_{\star} = a_{\star}, \dots, b_{\star}) \in \mathbb{R}^{D-d} }[/math] is a global minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario, then its extension [math]\displaystyle{ \bar{\theta}_{\star} = (\bar{e}_{\star}, \bar{a}_{\star}) \in \mathbb{R}^D }[/math] is also a global minimum for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\star} }[/math] satisfies the same properties (ii)--(iv) as in Thm. 1.

Theorem 4 (Saddle point). Consider the same setting as in Thm. 3. For [math]\displaystyle{ p + q \gt 1, }[/math] let [math]\displaystyle{ \theta_{\pi} = (e_{\pi} = a_{\pi}, \dots, b_{\pi}) \in \mathbb{R}^{D-d} }[/math] be the corresponding bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario. Then its extension [math]\displaystyle{ \bar{\theta}_{\pi} = (\bar{e}_{\pi}, \bar{a}_{\pi}) \in \mathbb{R}^D }[/math] is a saddle point for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\pi} }[/math] satisfies the same properties (ii)--(iv) as in Thm. 2.

Main Contributions

1. The authors introduce a theoretical framework that models the data source as a Markov process. This allows them to study how transformers learn sequential structure, contrasting it with other approaches that treat training data simply as i.i.d.

2. Focusing on single-layer transformers trained for next-token prediction on first-order Markov data, the paper gives a detailed analysis of the cross-entropy loss surface. There is always a set of parameters that perfectly recovers the true Markov transition probabilities (thus achieving the global minimum). When the sum of the Markov chain's flipping probabilities exceeds one, the model can converge to parameters that simply predict the stationary distribution rather than the true transition probabilities. This phenomenon does not arise (or becomes only a saddle point) when weights are untied or when the transformer has multiple layers.

3. Through experiments, they show that the theoretical findings match empirical behaviors. In particular: When weights are tied, the model may learn a constant (stationary) probability and fail to leverage sequential context if the transition probabilities are above a certain threshold. Removing weight tying—or increasing the transformer's depth—helps avoid such bad local minima.

4. The authors extend the analysis to higher-order processes. Surprisingly, simply increasing the transformer depth does not guarantee learning of higher-order transitions. They find, however, that restricting the attention window (rather than letting the network attend to all past tokens) dramatically improves learning of higher-order Markov patterns.

Group 8 Presentation: MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

Presented by:

- Nana Ye

- Xingjian Zhou

Paper Citation

T. Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” 2024, arXiv. doi: 10.48550/ARXIV.2401.10774.

https://arxiv.org/abs/2401.10774

Background

- As LLMs grow in size, the speed at which they can generate tokens decreases. The bottleneck is primarily the transfer of data to and from the GPU

- Speculative Sampling is an existing solution that predicts multiple tokens in the future at once using smaller "draft" models

- Medusa instead solves this problem by adding multiple decoding heads and a tree-based attention mechanism to existing LLMs

- The paper discusses the implementations of Medusa-1 and Medusa-2

Main Idea

The idea is to replace the separate draft model with extra heads on the model itself, since the best representation of a model is the model itself. The multiple heads predict several future tokens at once to exploit parallelism, which is more efficient and provides more candidate tokens for the tree-based attention to choose from. Tree-based attention then simulates sequential generation by traversing a tree from top to bottom, where the top is the initial token.

Technical Contributions

Medusa 1:

- Uses a frozen pre-trained LLM and trains extra decoding heads on top

- Each additional decoding head predicts a token K time steps in the future

- Uses a probability loss function that scales based on the number of steps into the future

- Reduces memory usage because the backbone model is only used for hidden state extraction

- In simple terms, Medusa adds additional prediction heads on top of the transformer's last hidden state, which are trained to predict tokens at future positions rather than only the next token, as a conventional transformer does in its usual autoregressive manner (a minimal head sketch follows).
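
A minimal sketch of what such heads could look like, assuming each head is a small residual feed-forward block followed by a vocabulary projection on top of the frozen backbone's last hidden state; dimensions and module names are illustrative, not the released implementation.

  import torch
  import torch.nn as nn
  import torch.nn.functional as F

  class MedusaHead(nn.Module):
      """Head k predicts the token k+1 steps ahead from the backbone's last hidden state."""
      def __init__(self, d, vocab):
          super().__init__()
          self.proj = nn.Linear(d, d)
          self.lm = nn.Linear(d, vocab, bias=False)

      def forward(self, h):                           # h: (batch, d) features from the frozen LLM
          return self.lm(h + F.silu(self.proj(h)))    # small residual block, then logits

  d, vocab, K = 512, 1000, 4                          # toy sizes
  heads = nn.ModuleList(MedusaHead(d, vocab) for _ in range(K))
  h = torch.randn(2, d)                               # last hidden state from the frozen backbone
  logits_per_head = [head(h) for head in heads]       # K sets of logits, one per future position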


Medusa 2:

- Fine tunes the LLM and trains the decoding heads at the same time.

- Encountered problems with high losses, switched to a two-stage training process:

- Stage 1: train only the Medusa heads (similar to Medusa-1)

- Stage 2: train both the backbone model and the Medusa heads together


Tree Attention

- Tree attention is used to let the heads that predict later tokens include the additional context created by the earlier Medusa heads in the pipeline

- This tree structure is not generated autoregressively, however

- The top predictions from each head are fed into the tree structure as candidate tokens

- An attention mask is used to ensure that each future-token prediction in the tree is based only on prior tokens, not on tokens past the one currently being decoded

- Multiple future candidate tokens can be predicted with context-aware attention simultaneously


Self Distillation

- A dataset with prompts relevant to the desired model are created

- The full large language model predicts outputs to these prompts in a typical auto regressive manner. These prompts are used to form a training dataset for the self-distillation step

- Medusa Heads are trained on the generated training dataset


Tree Construction

Branches with low probability of containing the next token are pruned from the tree of candidate tokens in tree attention, this reduces the computational expensiveness of MEDUSA 2


Empirical Evaluation

Experiments on various LLMs show consistent 2-3x speedups in practice without harming output quality (assessed by GPT-4 and other metrics). The authors also include ablation studies on key design choices (number of heads, attention structure, sampling thresholds), confirming the effectiveness and generality of the proposed framework.

Group 9 Presentation: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Presented by:

- Kaiyue Ma

- Wenzhe Wang

Paper Citation

T. Dao and A. Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality,” 2024, arXiv. doi: 10.48550/ARXIV.2405.21060.

Background

- Transformers are effective, but computationally expensive and suffer from quadratic complexity

- Structured state space models (SSMs) are an alternative that scales linearly instead and works for long range modeling

- SSMs have not received the same mainstream improvements as transformers, and lack support for parallelization and hardware acceleration

- Structured state space duality (SSD) bridges the gaps between transformers and SSMs

Additional Background

State space models (SSMs) are traditionally used in control theory to model a dynamic system via state variables. The paper https://compneuro.uwaterloo.ca/files/publications/voelker.2018.pdf later showed that SSMs are also well suited to describing time cells in the brain. A useful diagram of an SSM can be found at https://cdn-uploads.huggingface.co/production/uploads/613b0a62a14099d5afed7830/G7icfkYoxIqHZcJGHM7UD.png, where x denotes the n state variables, u the m inputs, and y the p outputs.

Technical Contributions

- Represents SSMs as semiseparable matrices and exploits semiseparable structure for efficient matrix operations (a small numerical illustration of this equivalence is given below).

- Uses generalized linear attention mechanism with structured masks.

- Refines the original Mamba model to yield Mamba-2, which incorporates the new structured-state-space-duality algorithms. Mamba-2 is easily parallelizable and scales better to large state sizes. Empirical results demonstrate strong performance on language modeling benchmarks, surpassing older SSM models and matching or outperforming Transformers at various model scales.
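
A tiny numerical illustration of the duality (not the paper's algorithm): the same scalar SSM computed once as a recurrence and once as multiplication by a lower-triangular 1-semiseparable matrix; all tensors are toy-sized.

  import torch

  def ssm_as_matrix(a, b, c):
      """Materialize the scalar SSM  h_t = a_t h_{t-1} + b_t x_t,  y_t = c_t h_t
      as a lower-triangular 1-semiseparable matrix M with M[t, s] = c_t * (a_{s+1}...a_t) * b_s."""
      T = a.shape[0]
      M = torch.zeros(T, T)
      for t in range(T):
          for s in range(t + 1):
              M[t, s] = c[t] * torch.prod(a[s + 1:t + 1]) * b[s]
      return M

  T = 6
  a, b, c, x = (torch.rand(T) for _ in range(4))
  y_matrix = ssm_as_matrix(a, b, c) @ x               # "attention-like" matrix form

  h, y_scan = torch.tensor(0.0), []
  for t in range(T):                                  # equivalent recurrent form
      h = a[t] * h + b[t] * x[t]
      y_scan.append(c[t] * h)
  print(torch.allclose(y_matrix, torch.stack(y_scan)))   # True: the two views coincide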


Group 10 Presentation: Accelerating Large Language Model Decoding with Speculative Sampling

Paper Citation

C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, ‘Accelerating Large Language Model Decoding with Speculative Sampling’, Feb. 02, 2023, arXiv: arXiv:2302.01318. doi: 10.48550/arXiv.2302.01318.

https://arxiv.org/abs/2302.01318

Background

- Traditional autoregressive decoding for large language models is computationally expensive, as the entire model has to run for each additional token that is generated

- Transformer neural networks are typically memory-bandwidth limited; quantization and distillation into smaller models are approaches that have been used in the past to improve the performance of LLMs

Technical Contributions

- Speculative sampling was developed to increase the speed of LLM decoding without meaningfully reducing its prediction quality compared to the base model

- Generates a draft of k future tokens using a smaller model

- Score the proposed tokens using the base model

- A modified rejection sampling scheme was developed by the authors in the paper

- The acceptance of a draft token is based on the minimum of 1 and the ratio of the target model's probability to the draft model's probability for that token (a minimal sketch of this rule follows)
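
A minimal sketch of the accept/reject step under this rule, assuming the draft and target distributions for the K drafted positions are already available; the residual-resampling step on rejection follows the standard modified rejection-sampling scheme.

  import torch

  def accept_draft_tokens(draft_tokens, p_draft, p_target, seed=0):
      """Accept draft token x with probability min(1, p_target(x) / p_draft(x)); on rejection,
      resample from the normalized residual max(0, p_target - p_draft) and stop."""
      gen = torch.Generator().manual_seed(seed)
      accepted = []
      for k, x in enumerate(draft_tokens):
          ratio = p_target[k, x] / p_draft[k, x]
          if torch.rand(1, generator=gen) < torch.clamp(ratio, max=1.0):
              accepted.append(int(x))                 # draft token kept
          else:
              residual = torch.clamp(p_target[k] - p_draft[k], min=0.0)
              accepted.append(int(torch.multinomial(residual / residual.sum(), 1)))
              break                                   # remaining draft tokens are discarded
      return accepted

  K, vocab = 4, 10
  p_d = torch.softmax(torch.randn(K, vocab), dim=-1)
  p_t = torch.softmax(torch.randn(K, vocab), dim=-1)
  print(accept_draft_tokens(torch.randint(0, vocab, (K,)), p_d, p_t))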

Explanations to Aid Understanding

- Transformer Decoding: This is the process by which a large language model generates text. Given an input prompt, the model sequentially selects the most probable next token, then uses that token to predict the subsequent one, and so on, and this process is computationally intensive.

- Speculative Sampling: Unlike traditional decoding where one token is generated at a time by the target model, speculative sampling aims to generate multiple tokens in parallel by using a faster, potentially smaller "draft model" to propose a short sequence (the draft). The target model evaluates these drafted tokens, and a rejection sampling mechanism decides which ones to accept, ensuring the output remains consistent with the target model's capabilities.

- Parallel Scoring: Instead of computing the logits for each drafted token sequentially, the method computes the logits for all (K+1) tokens in the draft at the same time. The presentation notes that the computing time for this parallel process is similar to sampling a single token with the target model, which is a key factor in the potential speedup.

Group 11 Presentation: Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff

Presented by:

Yiyuan Yang, Anar Kuatzhan, Chuan Zhang

Paper Citation

Arora, S., Eyuboglu, S., Zhang, M., Timalsina, A., Alberti, S., Zinsley, D., ... & Ré, C. (2024). Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668. https://arxiv.org/pdf/2402.18668

Background

Large language models still struggle with efficiency. In attention-based models, the number of computations grows rapidly as the input gets longer, and attention stores every previous token in memory, which makes it memory-intensive. Newer language models developed in recent years can generate text faster while maintaining low perplexity, but low perplexity does not necessarily mean good recall; gated convolutional models, for example, struggle with recall, whereas attention-based models excel at recall tasks. To balance these trade-offs, the authors introduce the Based model.

Technical Contributions

Memory-Recall Tradeoff: Observed both within and across architecture classes.

Performance with Fixed Recurrent State: Not all architectures have the same recall capacity. Mamba optimally utilizes limited memory budgets while convolutional architectures underperform with memory constraints.

The Based model combines local fine-grained attention with long-range linear attention that uses a Taylor approximation of the softmax exponential; both components have sub-quadratic complexity during training and permit an efficient recurrent view at inference. Based outperforms prior sub-quadratic architectures in recall quality by up to 6.2 accuracy points.

Architecture

Softmax-approximating linear attention (applied globally) + exact softmax attention with sliding windows (applied locally)

This combination achieves 90.8% of full softmax attention's recall accuracy while reducing latency by a factor of 100,000.
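
A small sketch of the Taylor feature-map idea behind the linear-attention component: a second-order expansion of the exponential becomes an inner product of finite feature vectors, so attention can be computed without materializing the full N x N matrix. The scaling and kernel fusion used in the actual Based implementation are omitted.

  import torch

  def taylor_feature_map(x):
      """Second-order Taylor features: phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2,
      approximating exp(q.k) while keeping attention linear in sequence length."""
      outer = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2)      # vec(x x^T)
      return torch.cat([torch.ones(*x.shape[:-1], 1), x, outer / 2 ** 0.5], dim=-1)

  q, k = torch.randn(4), torch.randn(4)
  approx = taylor_feature_map(q) @ taylor_feature_map(k)
  print(approx, 1 + q @ k + (q @ k) ** 2 / 2)                      # the two values agree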

Group 12 Presentation: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

Paper Citation

Y. Li, F. Wei, C. Zhang, and H. Zhang, ‘EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees’, Jun. 30, 2024, arXiv: arXiv:2406.16858. doi: 10.48550/arXiv.2406.16858.

https://arxiv.org/abs/2406.16858

Background

- LLMs to date have shown great performance, but they are slow and computationally expensive

- Speculative sampling - small model generates candidate tokens, whereas large model then evaluates those tokens, reducing the number of times the expensive computations of the large model have to occur

- EAGLE-1 used a draft tree to improve the performance of speculative decoding, and this work builds on the authors' prior work

Paper Contributions

- EAGLE-1 does not directly predict tokens; rather, it maps tokens to features, predicts the next features, and then predicts tokens from those features

- This approach was shown to improve upon previous speculative decoding methods

- EAGLE uses a tree structure to propose alternative tokens when a speculative-sampling draft token is rejected by the full model

- EAGLE-2 notes that token acceptance depends on context and position: the first token has a high acceptance rate, while later tokens have lower acceptance rates

- EAGLE-2 uses a dynamic draft tree with "tree attention" to bring context information into the selection of the next candidate token, increasing its acceptance rate, since acceptance depends on context and not only on position

Summaries of Key Points

- Problem Addressed: Large language model inference is slow and computationally demanding due to the large parameter sizes and the sequential nature of token generation.

- Foundation: EAGLE-2 builds upon speculative sampling, where a smaller "draft model" proposes candidate tokens that are then verified by the full large language model.

- Improvement over EAGLE-1: EAGLE-1 applied a static draft tree based on the assumption that draft-token acceptance rates depend only on their position in the tree. EAGLE-2 recognizes that acceptance rates are also highly context-dependent and introduces a dynamic adjustment mechanism for the draft tree.

- Key Stages

1. Drafting: EAGLE applies the draft model to produce feature representations instead of directly predicting tokens. The feature representations are passed to the head of large language model to generate token predictions.

2. Verification: Tree-structured verification is employed by EAGLE. This is more efficient than chain-structured verification in standard speculative sampling.

- Dynamic Draft Tree Adjustment: EAGLE-2 dynamically adjusts the structure of the draft tree based on confidence scores of the draft model. This addresses the context-dependent nature of token acceptance rates.

- Dynamic Expansion and Re-ranking

1. Expansion Phase: EAGLE-2 introduces an expansion phase that utilizes tree attention to process all tokens in a layer simultaneously which can improve efficiency. It also employs selective expansion, prioritizing only the top-k tokens with the highest estimated global acceptance probabilities based on confidence scores to avoid exponential growth.

2. Re-ranking Phase: EAGLE-2 reranks all draft tokens by selecting the top-m tokens with the highest global acceptance probabilities and prioritizing shallower nodes in case of ties.

- Experimental Results: EAGLE-2 achieved acceleration ratios of 3.05x - 4.26x across various tasks and large language model series such as Vicuna, Llama 2 and Llama 3, making it 20% - 40% faster than EAGLE-1. It is also approximately 2 times faster than Medusa and 2.3 times faster than Lookahead. On token throughput, EAGLE-2 processes 4-5.5 tokens per verification cycle, about twice as many as traditional speculative sampling.

- Major Advantages

1. Out-of-the-box Usability: EAGLE-2 does not require additional model training as it leverages the pre-trained draft model from EAGLE-1.

2. Reliability: EAGLE-2 does not modify original model parameters or relax acceptance conditions. It also maintains consistency.

3. Task and Model Generalization: EAGLE-2 generalizes effectively across multiple tasks and model architectures and demonstrates strong robustness in diverse applications.




Group 13 Presentation: Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation

Presented By

Yuke Liu, Mei Si

Paper Citation

R. Li, J. Su, C. Duan, and S. Zheng, ‘Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation’, Aug. 20, 2020, arXiv: arXiv:2007.14902. doi: 10.48550/arXiv.2007.14902.

https://arxiv.org/abs/2007.14902

Background

- Existing transformer models have [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity, which is problematic as model size grows

- This limits the growth of model sizes due to computational resources constraints

- This paper focused on an alternative method to conventional dot product attention that is more computationally efficient

- Standard attention requires the computation of [math]\displaystyle{ Q K^\top }[/math], which has [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity

Technical Contributions

Rather than doing the full computation for the softmax in the transformer architecture, the authors instead compute

[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j} \mathbf{v}_j}{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j}} = \frac{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j) \mathbf{v}_j}{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j)} }[/math]

and define the transformation function as

[math]\displaystyle{ \text{sim}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i)^{T} \phi(\mathbf{k}_j) }[/math]

The authors apply a first-order Taylor series expansion and, after some rearranging and substitution, arrive at their final formula (full derivation not shown):

[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{ \sum_j \mathbf{v}_{i,j} + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \left( \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)^{T} \mathbf{V} \right) }{ N + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \sum_j \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)_{i,j}^{T} } }[/math]

The authors' form of the attention mechanism can be computed with [math]\displaystyle{ \mathcal{O} (n) }[/math] complexity, removing the quadratic scaling of conventional transformers and enabling larger models built on the same underlying attention mechanism.
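
A short sketch of how this formula yields linear-time attention: the key-value summary and the key sum are computed once and shared across all queries. Variable names and the epsilon for numerical stability are assumptions.

  import torch

  def linear_attention(Q, K, V, eps=1e-6):
      """O(N) attention with sim(q, k) = 1 + (q / ||q||) . (k / ||k||), matching the formula above.
      Q, K: (N, d); V: (N, d_v)."""
      Qn = Q / (Q.norm(dim=-1, keepdim=True) + eps)
      Kn = K / (K.norm(dim=-1, keepdim=True) + eps)
      kv = Kn.T @ V                           # (d, d_v): key-value summary, computed once
      k_sum = Kn.sum(dim=0)                   # (d,)
      num = V.sum(dim=0) + Qn @ kv            # (N, d_v) numerator for every query
      den = K.shape[0] + Qn @ k_sum           # (N,)   denominator for every query
      return num / den.unsqueeze(-1)

  out = linear_attention(torch.randn(128, 64), torch.randn(128, 64), torch.randn(128, 64))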

Model Performance Evaluation

The model performance evaluation primarily focused on assessing how well the proposed linear attention mechanism enhances semantic segmentation performance while maintaining computational efficiency.

- Dataset: The experiments were conducted using the Fine Grained Image Data set, which consists of high-resolution satellite images. This dataset was selected due to its complex landscapes and varied environmental conditions, presenting significant challenges.

- Data Preprocessing: Due to the large size of the images in the dataset, each image was divided into smaller patches of 256x256 pixels, resulting in a total of 7,280 patches.

- Data Split: These patches were methodically partitioned into 60% for training, 20% for validation, and 20% for testing. This split ensured a rigorous evaluation of the model's performance across different scenarios.

- Implementation Details: The experiments were implemented using PyTorch framework and trained on an NVIDIA RTX 2080Ti GPU.

- Evaluation Metrics: OA, AA, K, mloU, F1.

Comparative Analysis

The proposed linear attention mechanism achieves a similar level of accuracy in semantic segmentation tasks compared to traditional dot product attention.

Linear attention maintains comparable performance metrics while drastically reducing both memory footprint and processing time required.

The efficiency gains of linear attention become more apparent in large-scale segmentation tasks, making it suitable for high-resolution images and long text sequences.

Future work includes extending the linear attention mechanism to more complex network designs, integrating with state-of-the-art deep learning models, and optimizing for real-time processing in demanding applications.

Group 14 Presentation: Scalable Watermarking for Identifying Large Language Model Outputs

Presented by:

Ryan Tymkow, Benjamin Schnapp

Paper Citation

Dathathri, S., See, A., Ghaisas, S. et al. Scalable watermarking for identifying large language model outputs. Nature 634, 818–823 (2024). https://doi.org/10.1038/s41586-024-08025-4

Background

With the rise of LLMs and generative AI, there is a risk of AI-generated misinformation spreading as if it were written by a human, so it is important to distinguish AI-generated text from human writing. Solutions exist to address this problem, but they have limitations involving privacy and computational cost; for example, traditional watermarking can introduce unwanted artifacts into the text.

Technical Contributions

This paper introduced SynthID-Text, a watermarking method for large language models (LLMs). The method uses a Tournament sampling approach, which ensures that generated text contains a detectable watermark with minimal computational overhead. SynthID-Text incorporates a random seed generator and scoring functions to embed a watermark into the model's output. This technique enhances the ability to identify whether text originates from a specific LLM, while preserving the text's quality and minimizing distortion. SynthID-Text does not affect LLM training, and it can be configured as distortionary or non-distortionary.

Central to SynthID-Text is the novel Tournament sampling procedure. Rather than sampling each token directly from the LLM's distribution, multiple candidate tokens compete in a multi-layer "tournament" based on pseudorandom watermarking scores, embedding a statistical “signature” that can later be detected.

Results

SynthID-Text achieved a 95% true-positive detection rate with 1% false positives in a 20-million-interaction test on Google's Gemini chatbot. It offers high detection accuracy with minimal computational cost and can be configured for non-distortionary or distortionary watermarking.

The benefits are as follows:

Minimal impact on large language model training:

- SynthID-Text can be applied to any large language model with minimal modifications, as it only alters the sampling step.


High detection accuracy with low computational cost:

- Outperforms retrieval-based tracking, post hoc detection, and traditional watermarking methods.

- Offers the best balance between computational cost, scalability, and accuracy.

- Can be integrated into production environments using speculative sampling, where smaller models suggest tokens and the main model verifies their validity.


Configurable distortion levels:

- Allows for distortionary or non-distortionary configurations, enabling better control over the quality of generated text versus detection accuracy.

- In non-distortionary watermarking, the average token distribution of the generated text matches the original model's token distribution.

Group 15 Presentation: DiGress: Discrete denoising diffusion for graph generation

Presented by:

Sean Tang, Buji Wong

Paper Citation

Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., & Frossard, P. (2022). Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734.

Background

Graph generation

The goal of this project is to generate graphs, which are represented by node matrices and edge matrices. Edges and nodes can also have their own categories. One application of this is molecule generation: atoms would be represented by nodes and the chemical bonds would be represented by edges.

The challenge of graph generation is a complex task due to the unordered nature and sparsity of graphs. While denoising diffusion models have been successful in other domains like images, they struggle with graphs due to their structural properties. Existing approaches that use continuous noise models disrupt the sparsity and connectivity crucial for graphs.

Technical Contributions

DiGress

The authors introduce DiGress, a discrete denoising diffusion model designed specifically for graph generation with categorical node and edge attributes. DiGress improves graph generation by using a discrete noise model that preserves graph sparsity and structural properties. The model involves progressively editing graphs through edge addition/removal and attribute changes. A graph transformer network is used to reverse this noisy process using cross-entropy loss, sampling from the trained model by iteratively updating the noise level and computing structural features.

Key enhancements include a noise model that maintains node and edge type distributions, a guidance procedure for conditioning on graph-level features, and the use of auxiliary graph-theoretic features. DiGress achieves state-of-the-art performance on both molecular and non-molecular graph datasets.

Results showed Digress outperformed continuous diffusion methods on various metrics, including degree distribution, clustering, and novelty, and was more scalable for larger graphs. Moreover, in the creation of novel molecules, discrete diffusion aids scalability for larger graphs and molecules, making it more efficient compared to continuous diffusion. DiGress is the first one-shot graph-based model that feasibly trains on over a million molecules without fragment-based heuristics. Its performance on drug-like molecule benchmarks reaches or exceeds that of specialized autoregressive or SMILES-based baselines.



Group 16 Presentation: Machine Learning and Hamilton-Jacobi-Bellman Equation for Optimal Decumulation: a Comparison Study

Presented by:

Zeyu Zhang

Paper Citation

Chen M, Shirazi M, Forsyth PA, Li Y. Machine Learning and Hamilton-Jacobi-Bellman Equation for Optimal Decumulation: a Comparison Study. Published online 2023. doi:10.48550/arxiv.2306.10582

Background

The paper is based on computational finance, focusing on the optimization problem related to "defined benefit" and "defined contribution plans". The main focus is on the challenge of ensuring retirees have enough funds for their retirement. Two key plans were discussed:

"Defined benefit plans" guarantee fixed monthly payments based on factors like tenure and salary but are cost-prohibitive and risky.

"Contribution plans" shift the investment and withdrawal strategy burden to individual investors, but they struggle to balance maximizing withdrawals and minimizing risk.


This problem, often called the "nastiest, hardest problem in finance" (a remark attributed to William Sharpe), highlights the complexity of balancing risk and reward in financial planning for retirement.

The 4% rule is a traditional method recommending a constant 4% withdrawal each year, adjusted for inflation, and investing in stocks and bonds.

Despite its popularity, the 4% rule is suboptimal and not globally optimal

Peter Forsyth proposed the HJB PDE method to maximize expected withdrawals and minimize the risk of running out of savings.

The HJB PDE method uses scalarization techniques to achieve Pareto optimal points, but it has limitations.

Technical Contributions

- The problem formulation involves complex mathematical equations related to computational finance.

- The paper assumes stock and bond prices follow a jump diffusion model.

- The investors' total wealth at time [math]\displaystyle{ t }[/math] is defined as the sum of stock price and bond price at that time.

- The time horizon [math]\displaystyle{ T }[/math] is set to 30 years, and rebalancing times are defined with discrete withdrawal amounts and allocations between stocks and bonds.



Control and Objective Function:

- The control at time [math]\displaystyle{ T_i }[/math] includes the withdrawal amount [math]\displaystyle{ Q_i }[/math] and allocation for the wealth at time [math]\displaystyle{ T_i^- }[/math].

- The admissible control set is defined, and the expected shortfall is introduced as a measure of risk.

- The expected total withdrawal is used as a measure of reward, aiming to maximize the expected total withdrawal while minimizing the expected shortfall.

- The pre-commitment in the expected shortfall problem is defined, focusing on maximizing the expected total withdrawal and minimizing the expected shortfall.


Group 18 Presentation: HiGen: Hierarchical Graph Generative Networks

Presented by:

- Shiyu Zhu

- Jesse Xue

Paper Citation

M. Karami, “HiGen: Hierarchical Graph Generative Networks,” 2023, arXiv. doi: 10.48550/ARXIV.2305.19337.

Background

- Hierarchical or Multi-Scale Structure: Captures high level interactions or relationships between objects or groups, while also representing the lower level structures. One example is a company org chart.

- Existing graph generating models include: Variational Autoencoders, Generative Adversarial Networks, Autoregressive Models (GNN, Graph RNN, GRAN) and Diffusion Models

- The paper introduces HiGen (Hierarchical Graph Generative Networks) to address problems with existing models.

- Experiments were conducted on 5 datasets of increasing size and scale, comparing against GraphRNN, GRAN, DiGress, GDSS, and SPEC baselines.

Technical Contributions

- Related Hierarchical Methods: The presentation discusses several recent hierarchical methods in specific domains like chemistry, highlighting HiGen's broader applicability, multi-level approach, and parallel generation as advantages over these more specialized techniques. These include a multi-based generation for molecular graphs (2020) relying on domain knowledge, a hierarchical normalizing flow model (2021) based on local neighborhoods, and a tree decomposition framework (2022) limited to a single abstraction level and medium-sized graphs.



Group 20 Presentation: Gated Linear Attention Transformers with Hardware-Efficient Training

Reference

arXiv:2312.06635, "Gated Linear Attention Transformers with Hardware-Efficient Training"

Background

The paper discusses Gated Linear Attention (GLA) Transformers, addressing the computational inefficiency of traditional transformers with softmax attention. Regular transformers have a quadratic computational complexity with sequence length, which becomes extremely expensive for long sequences.


Technical Contributions

It proposes using a linear kernel as an alternative to the softmax function, which allows attention to be formulated as an RNN with a two-dimensional (matrix-valued) hidden state (sketched after the list below). The key innovations include:

1. Introducing a data-dependent gating mechanism to improve model performance

2. Developing a linear attention approach that reduces computational complexity

3. Creating a hardware-efficient training method that can handle long sequences more effectively
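
A sketch of the recurrent view only (the paper's chunkwise, hardware-efficient training form is more involved): a matrix-valued state is decayed by a data-dependent gate and updated with an outer product at every step. Shapes and the way the gate is produced are assumptions.

  import torch

  def gated_linear_attention(q, k, v, alpha):
      """Recurrent view: a matrix state S is decayed by a data-dependent gate alpha_t
      and updated with the outer product k_t v_t^T. q, k, alpha: (T, d_k); v: (T, d_v)."""
      S = torch.zeros(k.shape[1], v.shape[1])
      outputs = []
      for t in range(q.shape[0]):
          S = alpha[t].unsqueeze(1) * S + torch.outer(k[t], v[t])   # gated decay + rank-1 update
          outputs.append(q[t] @ S)
      return torch.stack(outputs)

  T, d_k, d_v = 16, 8, 8
  x = torch.randn(T, d_k)
  out = gated_linear_attention(x, x, torch.randn(T, d_v),
                               torch.sigmoid(torch.randn(T, d_k)))   # gates in (0, 1)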

The main goal was to create a more efficient transformer model that can:

- Reduce computational expenses

- Maintain competitive performance across different tasks

- Handle long sequences more effectively

- Leverage modern GPU architectures for improved training and inference

The approach addresses the fundamental challenge of making transformer models more scalable and computationally efficient, particularly for tasks involving long sequences like processing books, dialogues, or complex scientific texts.


Results

The results and conclusions of the paper showed:

Performance Results:

- For the 340 million parameter model:

 - Achieved competitive performance
 - Close to transformer performance
 - Slightly better than or comparable to RetNet
 - Slightly below Mamba on some tasks

- For the 1.3 billion parameter model:

 - Beat most benchmarks in average accuracy
 - Slightly behind transformer++ in perplexity
 - Showed impressive accuracy across tasks

Key Findings:

1. Gating mechanism is crucial for model performance

  - Removing it significantly increased perplexity
  - Data-dependent scalar decay improved results

2. Recall-intensive tasks:

  - Smaller model: Transformer still led
  - Larger model: GLA closed performance gap considerably
   - Competitive with Mamba and RetNet

3. Computational Efficiency:

  - Higher training throughput for larger batch sizes
  - Slight increase in GPU memory usage
  - More efficient for bigger batches

Conclusions:

- GLA is highly effective for handling long sequences

- The hardware-efficient design reduces computational costs

- The gating mechanism significantly enhances model performance

- A promising approach for making transformers more accessible and efficient

The paper suggests future research should focus on optimizing the balance between performance and efficiency.

Group 23 Presentation: Discrete Diffusion Modelling By Estimating the Ratios of the Data Distribution

Presented By

Chenxin Lyu, Yixuan Zeng

Paper Citation

A. Lou, C. Meng, and S. Ermon, ‘Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution’, Jun. 06, 2024, arXiv: arXiv:2310.16834. doi: 10.48550/arXiv.2310.16834.

https://arxiv.org/abs/2310.16834

Background

- Diffusion models have shown great performance in generative artificial intelligence when applied to domains with continuous data

- Diffusion models are more difficult to implement for data in the discrete domain, such as tokenized texts

- Prior attempts at applying diffusion to text generations have performed worse than autoregressive models


Paper Contributions

- Developed a method called Score Entropy Discrete Diffusion (SEDD)

- Parameterizes the diffusion process for discrete data using data distribution ratios, rather than dealing with the tokenized data directly

Discrete Diffusion Processes

  • Models probability distributions over a finite discrete space [math]\displaystyle{ \mathcal{X} = \{1, \ldots, N\} }[/math], using probability mass vectors [math]\displaystyle{ p_t \in \mathbb{R}^N }[/math].
  • Evolution of [math]\displaystyle{ p_t }[/math] follows a linear ODE:
    [math]\displaystyle{ \frac{dp_t}{dt} = Q_t p_t,\quad p_0 \approx p_{\text{data}} }[/math]
  • [math]\displaystyle{ Q_t }[/math] is a diffusion matrix with non-negative off-diagonal entries and column sums equal to 0 (mass is preserved).
  • Often simplified as [math]\displaystyle{ Q_t = \sigma(t) Q }[/math], driving [math]\displaystyle{ p_t }[/math] toward a base distribution as [math]\displaystyle{ t \to \infty }[/math].
  • Simulated using Euler steps with small [math]\displaystyle{ \Delta t }[/math] (a toy simulation follows this list). Transition probability:
    [math]\displaystyle{ p(x_{t+\Delta t} = y \mid x_t = x) = \delta_{xy} + Q_t(y, x) \Delta t + O(\Delta t^2) }[/math]
  • Time Reversal: Reverse process uses another matrix [math]\displaystyle{ \overline{Q}_t }[/math] with:
    [math]\displaystyle{ \overline{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)} Q_t(x, y) }[/math]
    Reverse ODE: [math]\displaystyle{ \frac{dp_{T-t}}{dt} = \overline{Q}_{T-t} p_{T-t} }[/math]
  • This connects to the concrete score, generalizing the score function [math]\displaystyle{ \nabla_x \log p_t }[/math].
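
A toy simulation of the forward process under these definitions, assuming a uniform base transition matrix over a small state space; function and variable names are illustrative.

  import numpy as np

  def euler_step(x, Q_t, dt, rng):
      """One Euler step of the forward process: move from state x to y with probability
      delta_{xy} + Q_t[y, x] * dt (the columns of Q_t sum to zero, so mass is preserved)."""
      N = Q_t.shape[0]
      probs = np.eye(N)[:, x] + Q_t[:, x] * dt
      probs = np.clip(probs, 0.0, None)
      return rng.choice(N, p=probs / probs.sum())

  N = 5
  Q = np.ones((N, N)) / N - np.eye(N)   # uniform base process: off-diagonals >= 0, columns sum to 0
  rng = np.random.default_rng(0)
  x = 2
  for _ in range(100):                  # noising a single token toward the base distribution
      x = euler_step(x, Q, dt=0.05, rng=rng)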

Group 24 Presentation: Mitigating the Missing Fragmentation Problem in De Novo Peptide Sequencing With A Two-Stage Graph-Based Deep Learning Model

Paper Citation

Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x

https://www.nature.com/articles/s42256-023-00738-x#citeas

Background

- Proteins are crucial for biological functions

- Proteins are formed from peptides which are sequences of amino acids

- Mass spectrometry is used to analyze peptide sequences

- De Novo sequencing is used to piece together peptide sequences when the sequences are missing from existing established protein databases

- Deep learning has become commonly used to solve the problem of de novo peptide sequencing

- When a peptide fails to fragment in the expected manner, it can make protein reconstruction difficult due to missing data

- One error in the peptide can propagate to errors throughout the entire sequence

Paper Contributions

- GraphNovo was developed to handle incomplete segments

- GraphNovo-PathSearcher, instead of directly predicting the sequence, uses a path-search approach to predict the next node along a path through the spectrum graph

- A graph neural network is used to find the best path through the graph generated from the mass spectrometry input (the PathSearcher module)

- GraphNovo-SeqFiller then takes the predicted path and fills in the complete amino acid sequence

- Since some amino acids may have been missed along the path, SeqFiller uses a transformer to add in the amino acids missing from the PathSearcher output

- Input is mass spectrum from mass spectrometry

- Graph construction is done where nodes represent possible fragments, and edges represent possible peptides (PathSearcher module)

- PathSearcher uses machine learning to find the optimal path on the generated graph

- SeqFiller fills in missing amino acids that may have not been included in the PathSearcher module due to lacking data from the mass spectrometry inputs

Peptide Sequencing in AlphaPeptDeep

- Peptide sequencing determines amino acid sequences of peptides, crucial for understanding proteins.

- Mass spectrometry (MS) is used to fragment peptides and analyze the resulting spectra.

Methods referenced in the presentation:

  • Database Search & Spectral Library Search: AlphaPeptDeep improves prediction of MS spectra and retention time, boosting accuracy of both methods.
  • de novo Sequencing: Enhanced spectral prediction from AlphaPeptDeep supports building peptide sequences without prior knowledge.
  • AlphaPeptDeep predicts peptide properties (e.g., fragmentation patterns) to improve spectrum matching and sequence inference.

Group 47 Presentation: Jamba: A Hybrid Transformer - Mamba Language Model

Paper Citation

Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., Abend, O., Alon, R., Asida, T., Bergman, A., Glozman, R., Gokhman, M., Manevich, A., Ratner, N., Rozen, N., Shwartz, E., Zusman, M., Shoham, Y. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv. https://arxiv.org/abs/2403.19887

https://doi.org/10.48550/arXiv.2403.19887

Summaries of Key Points

- Jamba is a novel hybrid language model that combines Transformer and Mamba architectures with a Mixture of Experts (MoE) module. This combination aims to leverage the strengths of both architectures: the expressiveness of Transformers and the efficiency of Mamba for long sequences.

- The main features of Jamba include hybrid Transformer-Mamba layers for memory efficiency and high throughput, a Mixture of Experts to increase capacity without excessive compute cost, and optimization for long context (up to 256,000 tokens on a single GPU).

- Jamba's hybrid architecture integrates three key components

1. Transformer layers: Applied self-attention to capture complex relationships between tokens, crucial for tasks requiring rich contextual understanding and reasoning. However, they suffer from high memory and compute costs with long sequences.

2. Mamba layers: Based on state-space models, they efficiently process long sequences without storing extensive key-value caches, making Jamba more memory-efficient for long context tasks. They maintain a hidden state to summarize prior information.

3. Mixture of Experts (MoE): Introduces sparsity by dynamically selecting a small subset of experts per token, scaling model capacity while keeping computational costs manageable. Jamba applies MoE every other layer, using 16 experts in total but with only two active per token, balancing efficiency and performance.

The architecture of a single Jamba block consists of a sequence of Transformer layers, Mamba layers, and Mixture of Experts layers. The structure follows a 1:7 ratio of Transformer layers to Mamba layers, and MoE layers replace every second multi-layer perceptron (MLP) layer to increase model capacity without significantly increasing the number of active parameters (a schematic layout is sketched below).
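
A schematic of one such block under the stated ratios, written as a small Python snippet; the position of the attention layer within the block and the exact expert counts per layer are assumptions for illustration, not the released configuration.

  # One Jamba block of 8 layers under a 1:7 attention-to-Mamba ratio,
  # with MoE replacing every second MLP; the attention position is an assumption.
  layers = []
  for i in range(8):
      mixer = "attention" if i == 4 else "mamba"
      mlp = "moe(16 experts, top-2)" if i % 2 == 1 else "dense-mlp"
      layers.append((mixer, mlp))

  for idx, (mixer, mlp) in enumerate(layers):
      print(f"layer {idx}: {mixer:9s} + {mlp}")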

Performance and Benefits

You can fill in!