stat940W25-presentation
Notes on Presentations
Group 1 Presentation: Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data
Paper Citation
Podina, L., Eastman, B., & Kohandel, M. (2023). Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data. In Proceedings of the 40th International Conference on Machine Learning (Vol. 202). PMLR, Honolulu, Hawaii, USA.
Background
Differential equations
Examples of differential equations in physics include Newton's second law (which is an ordinary differential equation), the Navier-Stokes equations (which are partial differential equations), etc.
Existing methods of solving differential equations:
- Analytical methods, such as integration or separation of variables.
- Numerical methods, such as finite difference, finite volume, or finite elements.
- Data-driven approaches: these involve Universal Differential Equations (UDEs) and Physics-Informed Neural Networks (PINNs), which are the focus of this paper.
Introduction to PINNs
With (many) machine learning approaches, the goal is to approximate the solution to a DE using a feed-forward neural network, optimized with MSE loss. The key difference that makes it physics-informed is an extra term in the loss, which penalizes the model for deviating from the governing DE.
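To make the extra physics term concrete, here is a minimal PINN sketch in PyTorch for the toy ODE du/dt = -u with u(0) = 1; the network size, data, and collocation points are illustrative assumptions, not the paper's setup.
<pre>
import torch

# Toy PINN: approximate u(t) solving du/dt = -u, u(0) = 1, from a few noise-free samples.
net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

t_data = torch.rand(20, 1)                       # sparse measurement locations
u_data = torch.exp(-t_data)                      # synthetic observations of the true solution
t_col = torch.rand(200, 1, requires_grad=True)   # collocation points for the physics term

opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    data_loss = torch.mean((net(t_data) - u_data) ** 2)            # ordinary MSE on measurements
    u = net(t_col)
    du_dt = torch.autograd.grad(u, t_col, torch.ones_like(u), create_graph=True)[0]
    physics_loss = torch.mean((du_dt + u) ** 2)                     # residual of du/dt = -u
    (data_loss + physics_loss).backward()
    opt.step()
</pre>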
Introduction to UDEs
Here, the differential equation is expressed as a sum of two terms: the known physics-based model and an unknown neural network.
Paper Contributions
Universal Physics-Informed Neural Networks (UPINNs)
PINNs and UDEs are combined, addressing the limitations of the original methods, while sharing their benefits.
The model integrates three network components:
- Surrogate Solution Network U: links to the measurement loss
- Unknown Differential Operator Network F: works together with U within the PINN loss
- Boundary Condition Network B: links to the boundary loss
The loss function contains three terms:
- MSE loss: measures the mismatch between the surrogate solution and the observed measurement data.
- Boundary loss: applied when boundary conditions are provided with the problem.
- PINN loss: ensures the model respects the governing differential equation (a minimal sketch of these terms is given below).
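A minimal sketch of how these three terms could be assembled for a UPINN on a toy ODE du/dt = f_known(u) + F(u) is shown below; the networks U and F, the known term f_known, and the data are illustrative assumptions, not the authors' implementation.
<pre>
import torch

# Minimal sketch of the three UPINN loss terms for a toy ODE du/dt = f_known(u) + F(u).
U = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # surrogate solution
F = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))  # unknown operator part
f_known = lambda u: -0.5 * u                                # assumed known physics term

t_data, u_data = torch.rand(10, 1), torch.rand(10, 1)       # sparse measurements (placeholder values)
t_col = torch.rand(100, 1, requires_grad=True)              # collocation points
t0, u0 = torch.zeros(1, 1), torch.ones(1, 1)                # initial/boundary condition

def upinn_loss():
    mse = torch.mean((U(t_data) - u_data) ** 2)             # measurement (MSE) loss
    boundary = torch.mean((U(t0) - u0) ** 2)                # boundary loss
    u = U(t_col)
    du_dt = torch.autograd.grad(u, t_col, torch.ones_like(u), create_graph=True)[0]
    residual = du_dt - f_known(u) - F(u)                    # known physics plus learned operator
    return mse + boundary + torch.mean(residual ** 2)       # PINN loss
</pre>
Both U and F are then trained jointly by minimizing this combined loss; the learned F can afterwards be passed to a symbolic regression tool, as discussed later in the summary.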
Experimental Validation
1. Lotka-Volterra Model
They first experimented with the UPINN on the Lotka-Volterra system of differential equations, which are used to model predator-prey dynamics:
[math]\displaystyle{ \frac{dx}{dt} = \alpha x - \beta xy }[/math]
[math]\displaystyle{ \frac{dy}{dt} = -\delta y + \gamma xy }[/math]
The UDE and PINN were individually tested on two scenarios: sparse data (where there are very few input data points) and noisy data. Alone, each model did not do very well, especially when the data was very sparse or very noisy. When the UPINN was used, the solution was quite good, even with high sparsity or noise.
2. Viscous Burgers’ Equation
Their next experiment used the viscous Burgers' equation, a model from fluid dynamics.
[math]\displaystyle{ \frac{\partial u}{\partial t} = -u \frac{\partial u}{\partial x} + \nu \frac{\partial^2 u}{\partial x^2} }[/math]
3. Cell Apoptosis Model
Summaries of key points
This paper introduces Universal Physics-Informed Neural Networks (UPINNs) for discovering unknown terms in differential equations (ODEs/PDEs) from sparse and possibly noisy data. It combines the strengths of standard Physics-Informed Neural Networks (PINNs)—which incorporate prior knowledge of the governing equations—while still allowing parts of the underlying model to remain unknown and be learned from the data. Unlike previous methods such as Universal Differential Equations (UDEs), which can falter in noisy and small-data regimes, UPINNs maintain good accuracy by:
1. Leveraging collocation points in the loss function to incorporate the differential equation constraints ("physics").
2. Adding a neural network component to represent the unknown terms of the operator.
3. Applying symbolic regression (e.g., AI Feynman) to convert the neural approximation of the hidden terms into interpretable, closed-form expressions.
Extensive experiments on the Lotka–Volterra system, a viscous Burgers’ PDE, and a cell apoptosis ODE show that UPINNs outperform UDEs in handling higher noise and fewer data points, while still recovering the hidden differential-operator terms accurately.
Furthermore, symbolic regression improves interpretability by converting neural outputs into explicit equations. This interpretability, combined with robustness to sparse and noisy data, makes UPINNs especially promising for scientific discovery. Potential applications include systems biology, fluid dynamics, and environmental modeling. Future research directions could address scalability to higher-dimensional PDEs and uncertainty quantification.
Related work
Nonlocal Physics-Informed Neural Networks (nPINNs): nPINNs introduce a universal nonlocal Laplace operator that encompasses classical and fractional Laplacians. This framework is utilized for parameter identification in nonlocal models, demonstrating consistency and accuracy in capturing operator behaviours.
Group 2 Presentation: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Presented by:
Kareena Bhalla and Chelsea Huffman
Paper Citation
Editing in progress
Background
In this paper, we are looking at Large Language Models (LLMs). Autoregressive decoding in LLMs refers to generating text one token at a time, with the model basing each prediction on the tokens that came before it. This sequential process is inefficient and costly.
Speculative Sampling
Speculative sampling is a technique meant to reduce the computational cost and runtime of autoregressive decoding. The process consists of two main parts:
- Draft stage: A small, fast model suggests some tokens.
- Verification stage: In parallel, the large main LLM verifies these tokens and selects the best ones.
How do we choose a draft model that behaves like the big LLM but runs faster? One approach is to use a reduced version of the main LLM, but this doesn't work when no small version is available; moreover, a draft model with fewer parameters often comes with high overhead and reduced accuracy.
Technical Contributions
Extrapolation Algorithm for Greater Language Model Efficiency (EAGLE)
When drafting, the model also looks one token ahead in the sequence: the token actually sampled at the next position is fed into the draft head together with the features, which resolves much of the sampling uncertainty.
One advantage of the EAGLE method is that only a single decoder layer needs to be trained rather than an entire draft model. This makes the training process extremely efficient in terms of runtime/computational cost.
The EAGLE method makes improvements to the process based on the following 2 observations:
1. Autoregression is simpler at the feature level than the token level.
2. The uncertainties from the sampling process negatively affect the performance.
Feature-Level Autoregression
Unlike traditional speculative sampling methods that operate at the token level, EAGLE performs autoregression at the feature level, specifically utilizing the second-to-top-layer features of the LLM. This approach simplifies the drafting process and enhances efficiency.
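The snippet below is a rough sketch of this feature-level drafting step; the module names, shapes, and the use of a standard PyTorch decoder layer are illustrative assumptions, not the released EAGLE code. The draft head consumes the current second-to-top-layer feature plus the embedding of the token sampled one step ahead, predicts the next feature, and reuses the frozen LM head to propose draft tokens.
<pre>
import torch

# Rough sketch of EAGLE-style feature-level drafting (hypothetical shapes, not the released code).
d_model, vocab = 512, 32000
embed = torch.nn.Embedding(vocab, d_model)            # shared with the target LLM in practice
lm_head = torch.nn.Linear(d_model, vocab, bias=False) # frozen output layer of the target LLM
fuse = torch.nn.Linear(2 * d_model, d_model)
draft_layer = torch.nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)

feat_t = torch.randn(1, 1, d_model)                   # second-to-top-layer feature at step t
next_token = torch.tensor([[42]])                     # token that was sampled one step ahead

x = fuse(torch.cat([feat_t, embed(next_token)], dim=-1))
feat_next = draft_layer(x, x)                         # single decoder layer = the whole draft model
draft_logits = lm_head(feat_next)                     # candidate draft tokens for verification
</pre>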
Experimental Results
1. EAGLE is much faster than ordinary autoregressive decoding.
2. EAGLE can be applied to various LLMs without reducing the model's generation quality.
3. EAGLE can generate approximately 3.2–4.5 tokens per pass.
Summaries of key points
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a framework designed to accelerate the text generation of large language models (LLMs) without sacrificing the original distribution of generated tokens. Traditionally, LLMs rely on autoregressive decoding, generating text one token at a time and incurring heavy computational costs. Speculative sampling, introduced in earlier works, seeks to reduce these costs by splitting text generation into a faster “drafting” step and a single-step “verification” step using the main LLM, thereby validating multiple drafted tokens in parallel. However, existing methods often struggle to balance high draft accuracy with low overhead, limiting their speed improvements.
EAGLE addresses these issues by predicting hidden-layer features rather than tokens. Specifically, it focuses on the second-to-top-layer feature vector of the original LLM. Because these features exhibit more consistent structure than raw language tokens, they can be modeled with fewer errors and simpler modules. Additionally, EAGLE includes a small Autoregression Head that factors in one-step-ahead token information—this is critical for mitigating the inherent uncertainty in feature prediction. When generating text, the main LLM determines which token is sampled at a given step, but from the feature-only perspective, it is impossible to know the token that will appear. By feeding the token as well as the past feature states into a lightweight single-layer decoder, EAGLE’s draft model can produce a highly accurate prediction of the next feature. After obtaining the predicted features, EAGLE applies the LLM’s original output layer (the LM head) to sample candidate tokens in parallel.
Once a tree of drafted tokens is produced, the method performs a single verification pass in the main LLM. In one forward call, EAGLE obtains the LLM’s probabilities for all tokens in this tree and accepts or rejects each token based on a mathematically proven acceptance criterion. Crucially, this ensures the distribution of tokens matches what the LLM would have generated if it performed naive, step-by-step autoregressive decoding.
In evaluations across code (HumanEval), mathematical reasoning (GSM8K), instruction following (Alpaca), and multi-turn dialogue (MT-bench) datasets, EAGLE yields 2.7–3.5× speedups for models like LLaMA2-Chat 70B and 2.8–3.3× for smaller Vicuna variants on single-GPU setups. Even for more modest setups, or MoE-based models like Mixtral 8×7B Instruct, EAGLE still demonstrates strong acceleration. The approach does not require finetuning the main LLM itself; only a small draft network is trained on a fixed dataset (e.g., ShareGPT). Because the overhead of this training is low—typically 1–2 days even for 70B-parameter models—EAGLE proves highly practical.
Overall, it offers a simple but powerful way to generate tokens in parallel while retaining exact quality and distribution guarantees, making it a compelling choice for real-world systems seeking reduced latency in LLM services.
Related Works
There is now a new version of EAGLE, EAGLE-2, which improves on this version by introducing a dynamically adjustable draft tree: the draft tree is adjusted based on context and position, building on speculative sampling. In testing, EAGLE-2 can be roughly 20% to 40% faster than EAGLE-1.
Group 2 Presentation: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Presented by:
Kareena Bhalla and Chelsea Huffman
Paper Citation
Du, Y., Ram, D., Liu, X., Su, Y., Liu, S., Lee, J., Mohamed, A., & Ma, T. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. arXiv preprint arXiv:2402.00842.
Summaries of key points
Speculative sampling speeds up language model inference by having a fast draft model guess multiple tokens, which are then verified in parallel by the slower, accurate target model. The hard part is building a draft model that is both cheap and accurate. The paper observes that drafting directly at the token level is difficult partly because sampling injects randomness: from the hidden features alone, the draft model cannot know which token was actually sampled at the previous step. EAGLE therefore drafts at the feature level (the second-to-top-layer hidden states) and feeds in the token sampled one step ahead, which resolves this uncertainty and yields higher acceptance rates than previous speculative decoding approaches. Problem Motivation: speculative decoding is great for efficiency, but its speedup hinges on the draft model's accuracy and overhead; naive draft models either guess poorly or cost too much to run.
Key Idea: EAGLE (which stands for Extrapolation Algorithm for Greater Language-model Efficiency) performs the drafting autoregression on the LLM's feature representations rather than on tokens, and explicitly handles the uncertainty introduced by sampling by conditioning the draft on the token sequence shifted one step ahead.
How It Works: A lightweight single-layer decoder takes the target LLM's second-to-top-layer features together with the embeddings of the tokens already sampled and predicts the next feature; the LLM's own LM head then turns the predicted features into candidate draft tokens, which the target model verifies in a single forward pass.
Results: EAGLE achieves larger speedups than vanilla speculative decoding while provably preserving the target model's output distribution, with strong results on multi-turn dialogue, code generation, mathematical reasoning, and instruction-following benchmarks.
Explanations to aid understanding
Speculative decoding in simpler terms: Imagine trying to write the next word in a sentence, but instead of just writing one word and waiting, you guess a bunch in parallel and then double-check them. This saves time, but only if the guesses are usually right. EAGLE makes the guesser much better by letting it work on the main model's own hidden features and by telling it which token was actually sampled at each step, so it is not guessing blind; the main model then verifies the whole batch of guesses in one pass and keeps only the ones it would have produced itself.
Group 3 Presentation: Mamba: Linear-Time Sequence Modelling with Selective State Spaces
Presented by:
Liang Wu, Jingcheng Yu, Candace Ng
Paper Citation
Gu, A., & Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv. https://arxiv.org/abs/2312.00752.
Background
Technical Contributions
- Introduce a selection mechanism for state space models that enables input-dependent state updates
- Develop a hardware-aware recurrent scan algorithm for efficient computation
- Propose Mamba, an attention-free, linear-time architecture that achieves state-of-the-art performance across multiple modalities
Architecture
Mamba integrates selective SSMs into a single homogeneous block. Each block comprises: a linear projection, a selective SSM layer, and an MLP block. This architecture is attention-free and scales linearly in sequence length.
Related Work
- Builds on structured SSMs (S4, S5) and architectures such as H3 and Hyena.
- Demonstrates theoretical and empirical connections to classical RNN gating.
Future Directions:
- Scale to larger models and refine training recipes
- Extend Mamba to multimodal tasks (e.g. video)
- Explore additional downstream affordances (fine-tuning, in-context learning, quantization)
Summaries of Key Points
1. Motivation: Faster and More Flexible Sequence Modeling
Large-scale "foundation models" (FMs) largely rely on Transformers (with attention) to handle long-range context. However, attention has quadratic complexity in sequence length and requires storing the entire key–value cache. This leads to high memory usage and slow generation. Meanwhile, Structured State-Space Models (SSMs) offer linear-time processing through either convolution or recurrence, achieving excellent results in domains like audio and vision. Yet they typically assume constant, time-invariant parameters, which hinders their ability to handle discrete or "content-based" tasks such as language.
2. Proposed Approach: Selective State-Space Models (S6)
The paper introduces a selection mechanism to inject content-based (input-dependent) reasoning into SSMs. Specifically:
- Parameterizing SSM transitions by inputs: Conventional SSMs have fixed transition matrices (A, B, C), but here, B and C—and crucially, the step-size parameter Δ—become input-dependent. This small twist allows the model to "select" which inputs matter and how strongly to update its hidden state, like gating in RNNs.
- Efficient "Selective Scan" Implementation: Because parameters vary with the input, one cannot use global convolution for training (which was key to SSM efficiency). Instead, the authors design a hardware-aware parallel scan on GPUs, fusing operators so that large intermediate states reside only in on-chip memory. This approach retains O(L) time scaling in sequence length while dramatically cutting overhead. A toy sequential version of this recurrence is sketched below.
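The following toy, purely sequential snippet illustrates the selective recurrence just described; the diagonal parameterization, Euler-style discretization, and shapes are simplifications for illustration only, while the real implementation uses a fused, hardware-aware parallel scan.
<pre>
import torch

# Toy selective-SSM recurrence: B, C and the step size delta depend on the input,
# which is what makes the model "selective". A is diagonal per channel.
B_, L, D, N = 2, 16, 8, 4                     # batch, length, channels, state size
x = torch.randn(B_, L, D)
A = -torch.rand(D, N)                         # fixed negative diagonal entries
to_delta = torch.nn.Linear(D, D)
to_B = torch.nn.Linear(D, N)
to_C = torch.nn.Linear(D, N)

h = torch.zeros(B_, D, N)
ys = []
for t in range(L):
    xt = x[:, t]                                                        # (B, D)
    delta = torch.nn.functional.softplus(to_delta(xt)).unsqueeze(-1)    # (B, D, 1)
    Bt = to_B(xt).unsqueeze(1)                                          # (B, 1, N)
    Ct = to_C(xt).unsqueeze(1)                                          # (B, 1, N)
    Abar = torch.exp(delta * A)                                         # discretized transition
    h = Abar * h + delta * Bt * xt.unsqueeze(-1)                        # selective state update
    ys.append((h * Ct).sum(-1))                                         # (B, D) output at step t
y = torch.stack(ys, dim=1)                                              # (B, L, D)
</pre>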
3. Mamba Architecture
To showcase selective SSMs, they build Mamba, a simple "attention-free" architecture. Each layer is effectively an MLP with a parallel SSM path included. By training only these blocks end-to-end (without multi-head attention), Mamba delivers:
- Linear-time training because it uses convolution-like or scan-based methods.
- Constant-time inference per token (no caching entire past context), which is especially beneficial for extremely long sequences.
Additionally, Mamba integrates gating mechanisms within its SSM path, adaptively controlling information flow to better capture complex dependencies. Its streamlined layer design allows hardware-aware GPU optimizations, enhancing throughput and scalability compared to attention-based models.
4. Experiments Across Multiple Domains
1. Synthetic Tests (Selective Copying, Induction Heads): Mamba (or more precisely, its selective mechanism) solves tasks where ordinary time-invariant SSMs fail. It can ignore irrelevant tokens and preserve essential ones.
2. Language Modeling: On The Pile (multi-domain text), Mamba matches or exceeds standard Transformer perplexities when model size scales to ~1B parameters. It also outperforms other sub-quadratic methods (e.g., Hyena, RWKV). Notably, Mamba obtains 4–5× higher inference throughput than Transformers because it avoids large key–value caches.
3. Genomics: Mamba attains better perplexities than alternative long-sequence models (HyenaDNA) on a human genome corpus, especially at very long context (up to millions of tokens).
4. Audio Waveforms: The model outperforms prior state-of-the-art (e.g., SaShiMi) in generating raw speech signals, achieving lower FID scores and better sample quality.
5. Significance and Conclusions
Mamba demonstrates that:
1. Combining gating/selection with state-space recurrences, and
2. A carefully engineered GPU scan can outperform Transformers in certain tasks and match them in others, all with linear-time complexity. This opens a path for foundation models in domains needing extremely long context—like genomics, raw audio, and (possibly) next-generation language modeling—where Transformers become unwieldy. The authors highlight that future directions include scaling Mamba to billions of parameters (7B+), exploring interpretability, and combining it with other approaches (e.g., partial attention) for even broader performance gains.
Group 4 Presentation: Learning spatiotemporal dynamics with a pretrained generative model
Presented by:
- Karolina Suszek
- Negin Amou
- Muhammad Azeem
Paper Citation
Z. Li et al., “Learning spatiotemporal dynamics with a pretrained generative model,” Nature Machine Intelligence, vol. 6, no. 12. Springer Science and Business Media LLC, pp. 1566–1579, Dec. 06, 2024. doi: 10.1038/s42256-024-00938-z.
Background
- Spatiotemporal dynamics: how the state of a physical system varies with space and time
- Real datasets often consist of sparse measurements, because only a limited number of sensors are available. There needs to be a way to convert the sparse measurement data into a full spatiotemporal field.
- Existing solutions learn to map inputs to outputs and ignore missing data, but this reduces the model's ability to generalize.
- The paper proposes the Sparse-Sensor-Assisted Score-Based Generative Model (S³GM), which uses unlabeled data during training and can reconstruct incomplete data after training, making accurate predictions even when little information is available.
- Key idea: learn the probability distribution of spatiotemporal data with a score-based generative model and refine the samples via stochastic sampling.
Technical Contributions
The main proposed model is the Sparse-Sensor-Assisted Score-Based Generative Model (S³GM). It learns patterns from a large amount of data beforehand. Training is unsupervised, so no labels are required; the model learns the significant features and natural patterns of the data. After training, the model can take incomplete data, reconstruct the missing parts, and make predictions.
Core Components:
- Pretraining stage: learns the joint probability distribution of the data
- Generating stage: uses a stochastic differential equation to refine and generate full-field predictions
- Refinement mechanism: ensures alignment with observations and enforces sequence consistency (a generic sketch of this guided sampling idea follows this list)
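As a rough illustration of the generating/refinement idea, the snippet below shows a generic score-guided sampling step under our own simplifying assumptions; it is not the paper's exact algorithm, and the score network, mask, and dummy score are placeholders.
<pre>
import numpy as np

# Generic sketch of score-guided sampling with sparse observations (illustrative only).
def guided_reverse_step(x, t, dt, score_model, obs_mask, obs_values, guidance=1.0):
    score = score_model(x, t)                             # learned gradient of the log-density
    data_grad = guidance * obs_mask * (obs_values - x)    # pull toward the sparse measurements
    noise = np.sqrt(dt) * np.random.randn(*x.shape)
    return x + dt * (score + data_grad) + noise           # Euler-Maruyama-style update

dummy_score = lambda x, t: -x                             # placeholder score of a standard Gaussian
x = np.random.randn(64)                                   # current sample of the (flattened) field
obs_mask = (np.random.rand(64) < 0.1).astype(float)       # ~10% of locations observed
obs_values = obs_mask * np.random.randn(64)
x = guided_reverse_step(x, t=0.5, dt=0.01, score_model=dummy_score,
                        obs_mask=obs_mask, obs_values=obs_values)
</pre>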
Some of the common applications of this model are turbulent flow modeling, climate forecasting, and physics-based simulations.
Summaries of key points
- Challenge Addressed: Traditional end-to-end learning models often struggle with generalization in reconstructing spatiotemporal dynamics, particularly when data is sparse—a common scenario in real-world applications.
- S³GM Methodology: Pretraining Phase: An unconditioned generative model is pretrained in a self-supervised manner on a comprehensive dataset, capturing the joint distribution of the system's dynamics. Generation Phase: The pretrained model is conditioned on new, sparse measurements to reconstruct and predict the full-field spatiotemporal dynamics.
- Validation and Performance: S³GM's efficacy was tested across multiple dynamical systems using synthetic, real-world, and laboratory datasets, including applications in turbulent flow modelling and weather forecasting. The results demonstrated that S³GM achieves high accuracy, generalizability, and robustness, even when faced with significant data sparsity and noise.
S³GM offers a promising approach for modeling and predicting complex spatiotemporal dynamics in situations where data is limited, leveraging the strengths of pretrained generative models to enhance performance in small data regimes.
Group 5 Presentation: Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models
Presented by:
Paper Citation
De, S., Smith, S., Fernando, A., Botev, A., Cristian-Muraru, G., Gu, A., Haroun, R., Berrada, L., Chen, Y., Srinivasan, S., Desjardins, G., Doucet, A., Budden, D., Teh, Y. W., Pascanu, R., De Freitas, N., Gulcehre, C. (2024). Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models. arXiv. arxiv.org/pdf/2402.19427
Background
RNNs and transformers
Recurrent neural networks (RNNs) are a basic architecture for handling sequential data; they are good for short sequences but can struggle with long ones.
Transformers generally perform better than RNNs and have become dominant in recent years. However, for long sequences, they become computationally expensive for a couple of reasons. Their global attention mechanism has quadratic complexity. Furthermore, the key-value (KV) cache, even with multi-query attention, grows linearly with sequence length.
Technical Contributions
Model architecture
Their proposed models -- Hawk and Griffin -- use the following structures:
- Residual block: This is the main structure in the architecture. It starts with an RMSNorm layer being applied to the hidden state, followed by a temporal mixing block. There's a residual connection. Next, there is another RMSNorm layer followed by an MLP (multi-layer perceptron) block.
- Gated MLP block: This is a gated feedforward block. There are 2 branches: one is linear and one uses a GeLU activation. This part of the structure is used for feature selection.
- Temporal mixing block: They used 3 different kinds of temporal mixing blocks in their models: global multi-query attention, local sliding window attention, and recurrent blocks (what they proposed). The recurrent block contains 2 parallel branches. One has a GeLU activation, while the other has a temporal 1D convolutional layer followed by a RG-LRU layer (a proposal in this paper).
Performance and Efficiency
- Hawk surpasses the reported performance of Mamba on downstream tasks.
- Griffin matches the performance of Llama-2, despite being trained on over six times fewer tokens.
- Both models demonstrate the ability to extrapolate to sequences significantly longer than those encountered during training.
Real-Gated Linear Recurrent Unit (RG-LRU)
RG-LRUs are inspired by regular Linear Recurrent Units (LRUs), but with a gating mechanism. They are more computationally efficient, as they avoid the quadratic complexity that transformers have. Mathematically, the layer is defined as follows.
The recurrence gate:
[math]\displaystyle{ r_t = \sigma (W_a x_t + b_a) }[/math]
The input gate:
[math]\displaystyle{ i_t = \sigma (W_x x_t + b_x) }[/math]
The recurrence weight:
[math]\displaystyle{ a_t = a^{c r_t} }[/math]
The output:
[math]\displaystyle{ h_t = a_t \odot h_{t-1} + \sqrt{1-a_t^2} \odot (i_t \odot x_t) }[/math]
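A direct, single-step translation of these equations into code might look as follows; shapes and initialization are illustrative, and the constant c is treated here as a fixed scalar.
<pre>
import torch

# Direct translation of the RG-LRU equations above (illustrative shapes, single channel vector per step).
D = 8
W_a, b_a = torch.randn(D, D), torch.zeros(D)
W_x, b_x = torch.randn(D, D), torch.zeros(D)
a = torch.sigmoid(torch.randn(D))                        # learnable recurrence weight in (0, 1)
c = 8.0                                                  # fixed scalar constant

def rg_lru_step(x_t, h_prev):
    r_t = torch.sigmoid(x_t @ W_a.T + b_a)               # recurrence gate
    i_t = torch.sigmoid(x_t @ W_x.T + b_x)               # input gate
    a_t = a ** (c * r_t)                                 # gated, per-channel decay
    return a_t * h_prev + torch.sqrt(1 - a_t ** 2) * (i_t * x_t)

h = torch.zeros(D)
for x_t in torch.randn(16, D):                           # scan over a length-16 sequence
    h = rg_lru_step(x_t, h)
</pre>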
Summaries of Key Points
Context and Motivation
Transformers have dominated large language modeling, but their attention mechanism can be expensive for very long sequences, particularly because of the quadratic complexity of attention and a Key-Value (KV) cache that grows linearly in sequence length. Recurrent neural networks (RNNs) compress entire contexts into a fixed-size hidden state and avoid storing a cache that grows with length, which can reduce memory and speed up inference. Historically, though, RNNs have been difficult to train at scale and often underperform Transformers on complex tasks. This paper proposes new recurrent-based architectures that can match or exceed Transformer performance while retaining the training efficiency of standard deep-learning pipelines and gaining a significant advantage during inference.
Proposed Architectures
Two models are introduced: Hawk, a purely recurrent architecture, and Griffin, which mixes gated linear recurrences with local attention. Both rely on a new Real-Gated Linear Recurrent Unit (RG-LRU) layer, which extends a previously studied linear recurrency (the LRU) by adding two gating mechanisms. One gate modulates how much of the new input to incorporate each step, and another governs whether the layer updates in a standard LRU style or simply retains the previous hidden state.
RG-LRU and Recurrent Blocks
RG-LRU’s diagonal recurrence matrix dramatically lowers the computational cost for large sequence lengths. The authors also include a lightweight convolution for near-token interactions. For Griffin, they interleave RG-LRU-based blocks with a local sliding-window attention, allowing short-range local attention plus a global recurrent state. This combination helps the model handle both local details and long-distance dependencies in a memory-efficient way.
Empirical Results
The authors scale Hawk and Griffin to billions of parameters and compare them with:
1. A baseline Transformer using Multi-Query Attention (MQA),
2. The Mamba recurrent model (from prior work),
3. Llama-2.
Their main findings include:
- Both Hawk and Griffin follow standard power-law scaling in validation loss versus training compute, matching or beating Transformers.
- Hawk matches or exceeds previous recurrent models like Mamba on language modeling benchmarks, despite training on fewer tokens.
- Griffin matches Llama-2’s accuracy on multiple downstream NLP tasks while training on only about one-seventh the token count.
- During training, these models achieve speeds comparable to Transformers (thanks to parallelized or fused kernels), and in some configurations are faster at long sequence lengths (2048 tokens or more).
- At inference time, recurrent and local-attention blocks avoid the large KV caches that hamper Transformers. Hence, Hawk and Griffin show lower latency and can decode tokens at significantly higher throughput, especially for long sequences.
- Hawk and Griffin extrapolate to far longer sequences than they were trained on. They also efficiently learn synthetic copying/retrieval tasks if explicitly trained on them. However, their out-of-the-box retrieval or copying for tasks not seen at training time is still below that of Transformers.
Significance
By mixing local attention with a novel gated recurrence, Griffin retains the best of both recurrent and attention-based approaches: it can model local context with a sliding window while maintaining a small, fixed-size global state for the entire sequence. This achieves strong performance on large-scale language modeling benchmarks while offering major advantages in memory footprint and decoding speed. The results position Hawk and Griffin as compelling alternatives to purely Transformer-based architectures for scalable language models.
Related work and Future direction
Griffin builds on several key areas in efficient language modelling, particularly in recurrent architectures and hybrid approaches combining recurrence and attention: State Space Models (SSMs), Hybrid Recurrence-Attention Models, Efficient Transformer Alternatives.
Scaling Griffin to Larger Models
The paper discusses Griffin models up to 14B parameters but suggests further exploration into scaling beyond this size to compete with GPT-4 or Gemini models. Investigating how Griffin handles long-context tasks in ultra-large models would be an interesting future study.
Memory and Efficiency Optimization
Griffin already improves efficiency compared to transformers, but further research into sparsification, quantization, and hardware-specific optimizations could enhance its real-world applicability. Exploring mixed-precision training and inference optimizations for mobile and edge devices.
Extending Griffin to Multimodal Learning
While Griffin is designed for language modeling, incorporating it into multimodal tasks (e.g., vision-language models, audio processing) could expand its impact. Combining Griffin’s recurrence mechanism with diffusion models or video understanding models might be a promising direction.
Group 6 Presentation: Learning to (Learning at Test Time): RNNs with Expressive Hidden States
Presented by:
Zhiyang Cheng and Pingchu Zhang
Paper Citation
Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv. https://doi.org/10.48550/arXiv.2407.04620.
Summaries of key points
This paper revisits the traditional role of RNN hidden states, proposing that they can do more than just store information—they can enable learning at test time. The authors introduce a method where a hypernetwork dynamically generates the RNN’s weights as a function of its hidden state, allowing the model to adapt its behavior on the fly. This gives rise to what they call expressive hidden states, which encode both memory and the capacity to steer the model’s future updates. The approach effectively blurs the line between training and inference, treating test-time prediction as a form of continual adaptation. This results in stronger performance in settings like few-shot learning and online learning, where flexibility and rapid adaptation are crucial. Rather than relying on explicit optimization at test time (as in typical meta-learning setups), the RNN itself becomes the learner, continuously reshaping its internal dynamics based on the sequence it's processing.
While innovative, the method introduces nontrivial architectural and computational overhead. The use of a hypernetwork to produce weights at every time step means the model must manage a more complex parameter space and could become less scalable for long sequences or larger models. There's also the risk of instability, since small changes in the hidden state can lead to large changes in the generated weights. Regularization and careful design are needed to prevent the model from diverging. Another limitation is that while the paper shows strong performance on synthetic and controlled few-shot learning tasks, it doesn’t extensively benchmark on more complex natural language or real-world sequential data, leaving questions about generalization and practicality open.
Clear explanations to aid understanding
In a standard RNN, the weights are fixed during inference—you feed in tokens or sequence elements, and the hidden state updates based on those fixed rules. What this paper suggests is: what if the hidden state itself could influence the rules? So instead of always using the same weights, the RNN can generate new ones depending on what it's seen so far. This is done using a hypernetwork—a small neural network that outputs the weights for the main RNN. So as the RNN processes a sequence, it effectively reshapes its own behavior to fit the task or data distribution it's encountering. It’s like the RNN is learning while it’s making predictions, adapting in real-time to maximize performance without needing gradient descent at test time.
Group 6 Presentation: Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Presented by:
- Pingchu Zhang
- Zhiyang Cheng
Paper Citation
Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv. https://doi.org/10.48550/arXiv.2407.04620
Background
For modern RNNs, performance in long contexts is limited by the expressive power of their fixed-size hidden state. Hence the authors introduce test-time training (TTT).
Technical Contributions
- Introduce TTT layers, where the hidden state is a model and the update rule is self-supervised learning, offering a new research direction.
- TTT-Linear, a simple implementation of TTT layers, outperforms Transformers and Mamba in evaluations.
- Improve the hardware efficiency of TTT layers through mini-batch TTT and the dual form, making TTT-Linear already a practical building block for LLMs.
Methodology
The key idea is to make the hidden state itself a model with weights, and the update rule a gradient step on the self-supervised loss. Then updating the hidden state on a test sequence is equivalent to training the model at test time.
Training a network with TTT layers
- Training the larger network as the outer loop and training weights within each TTT layer as the inner loop is preferred.
- TTT layers can replace RNN or self-attention layers in any network architecture. Training a network with TTT layers also works the same way as training any other language model.
Learning a self-supervised task for TTT
Add some outer-loop parameters to make this task learnable.
The input [math]\displaystyle{ x_t }[/math] is transformed using a learnable matrix [math]\displaystyle{ \theta_K }[/math] to create a projection [math]\displaystyle{ \tilde x_t = \theta_K x_t }[/math].
The reconstruction label is another low-rank projection [math]\displaystyle{ \theta_V x_t }[/math], which can differ from the input. We can also create a test view [math]\displaystyle{ \theta_Q x_t }[/math].
The new self-supervised loss is [math]\displaystyle{ \ell(W; x_t) = \|f(\theta_K x_t; W)-\theta_V x_t\|^2 }[/math], and the output rule is modified to [math]\displaystyle{ z_t = f(\theta_Q x_t; W_t) }[/math].
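A minimal sketch of one TTT-Linear step implementing this update and output rule is shown below; shapes, initialization, and the inner-loop step size are illustrative, and the actual layer also adds layer normalization and a residual connection.
<pre>
import torch

# Sketch of one TTT-Linear step: the "hidden state" is the weight matrix W of a linear learner f(x; W) = W x.
d = 16
theta_K, theta_V, theta_Q = (torch.randn(d, d) for _ in range(3))  # learnable outer-loop projections
W = torch.zeros(d, d)                                              # inner-loop weights = hidden state
eta = 0.1                                                          # inner-loop step size

def ttt_step(x_t, W):
    k, v = theta_K @ x_t, theta_V @ x_t                  # training view and reconstruction label
    grad = 2 * torch.outer(W @ k - v, k)                 # gradient of ||W k - v||^2 w.r.t. W
    W = W - eta * grad                                   # one gradient step = hidden-state update
    z_t = W @ (theta_Q @ x_t)                            # output rule z_t = f(theta_Q x_t; W_t)
    return z_t, W

for x_t in torch.randn(8, d):                            # process a short sequence token by token
    z_t, W = ttt_step(x_t, W)
</pre>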
Summaries of Key Points
Motivation: Compressing Long Context Efficiently
Traditional Transformers handle large contexts by storing every token in a Key-Value cache, which grows linearly with sequence length and makes inference complexity scale quadratically. Modern recurrent neural networks (RNNs) like Mamba sidestep storing the entire context by having a fixed-size hidden state, which leads to linear time complexity. However, RNNs often struggle to exploit very long contexts because the fixed-size hidden state must compress a large amount of information. The authors propose a new design, Test-Time Training (TTT), that treats the hidden state as a small learnable model trained via a self-supervised loss on each incoming token—even at test time.
TTT Layers: Hidden State as a Learner
The paper reframes any sequence-modeling layer as “a hidden state plus an update rule.” For TTT layers, the hidden state is itself a small parametric or nonparametric model f, and the update rule is a step of gradient descent (or other training procedure) on each new token. Thus, at every token step, the hidden state is updated by “training” f on a self-supervised objective. Concretely, one might define a corruption or partial view of the token and train the parametric model to reconstruct the hidden or relevant aspects of the token.
Two Main Instantiations: TTT-Linear and TTT-MLP
The authors propose TTT-Linear, where the learner f is a simple linear mapping plus optional layer normalization and a residual connection. They also propose TTT-MLP, which uses a two-layer MLP as its learner, offering a more expressive hidden state. Both can be integrated into existing RNN-based or Transformer-based architectures in place of the usual self-attention or simple RNN blocks. Like other RNN layers, TTT layers compress all historical tokens into a fixed-size hidden state—but the learner can be updated more flexibly via gradient steps each time a new token arrives.
Efficiency Enhancements
Naively computing a gradient step per token would be too slow. Two key ideas improve hardware utilization:
1. Mini-batch TTT processes a batch of b tokens at once to parallelize the internal gradient steps. Smaller b yields more gradient steps (and better expressiveness) but can slow performance.
2. A “dual form” for TTT-Linear and TTT-MLP reworks the update and output computations into larger matrix multiplications, ensuring that modern accelerators (GPUs, TPUs) can exploit efficient batched operations.
Empirical Results
On language-modeling benchmarks (the Pile and Books), TTT-Linear and TTT-MLP match or exceed strong baselines (Transformer and the modern RNN Mamba) across model scales (125M to 1.3B parameters). TTT-Linear typically does as well as Mamba in short context (2k tokens) but outperforms Mamba substantially in longer contexts (8k or 32k), demonstrating that the extra expressiveness helps exploit more tokens. TTT-MLP can be even more expressive at very long contexts but can be more memory intensive. The authors also show that TTT-Linear can train and infer efficiently in wall-clock time using a specialized GPU kernel, yielding near-constant inference latency as context grows (unlike the linear growth in Transformers).
Significance and Future Work
TTT recasts the hidden-state update in an RNN-like layer as explicitly training a miniature model at test time—essentially “learning to learn” from each incoming token. With further improvements in tasks (beyond simple reconstruction), hardware kernels, and more expressive hidden states, TTT-based architectures may offer a new path toward efficient, high-performing sequence models for extremely long contexts.
Group 7 Presentation: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Presented by:
Jonathan Gallagher and Mariya Anashkina
Paper Citation
A. V. Makkuva et al., “Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains,” 2024, arXiv. doi: 10.48550/ARXIV.2402.04161.
Summaries of key points
This paper offers a theoretical framework for analyzing transformers using Markov chains, providing a new way to understand how information flows and dependencies are formed across layers. The core idea is to view the self-attention mechanism as inducing a Markov process over the sequence positions, where each attention head defines a transition probability matrix. By interpreting attention through this lens, the authors derive principled explanations for phenomena like context aggregation, token mixing, and the effects of multiple layers and heads. They show that stacking layers corresponds to composing Markov transitions, meaning that deeper transformers can be understood as performing longer-range probabilistic walks over the input. This allows them to make quantitative predictions about attention spread and mixing time, and helps formalize intuitions about how depth and attention head structure influence model behavior. Importantly, the framework applies without modifying the architecture—it’s purely analytical, making it a useful tool for understanding model internals in a mathematically grounded way.
The abstraction into Markov chains may oversimplify certain aspects of how attention operates—particularly when attention scores are data-dependent and influenced by non-linear operations (e.g., softmax and masking). The analysis tends to assume relatively clean or idealized cases, and may not fully capture the nuances of attention in real-world settings like language modeling with positional encoding or context-specific weighting. While the framework is insightful, it’s primarily useful for interpretability and analysis, not for improving model performance directly. Finally, the notion of interpretability here is mathematical, but not necessarily human-friendly—it gives you equations, not explanations in natural language.
Clear explanations to aid understanding
Think of each attention head as deciding where to “move” next in the sequence—like a random walker choosing the next word to look at. If you treat those attention weights as transition probabilities, then attention turns into a kind of Markov process, where information flows step-by-step based on those probabilities. Stacking layers is like walking further in the sequence graph, and having multiple heads is like having multiple walkers with different preferences. This view lets you talk about things like mixing time—how fast information spreads—or how different tokens influence each other over multiple layers. It’s a powerful way to bridge probabilistic modeling and deep learning, especially when trying to reason about what attention is doing beyond just visualizing heatmaps.
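A toy numerical illustration of this view is given below; the matrices are random stand-ins rather than attention maps computed from real queries and keys.
<pre>
import torch

# Toy illustration: a softmax-normalized attention map is row-stochastic, so it can be read
# as a Markov transition matrix over sequence positions; stacking layers composes the walks.
L = 5
P1 = torch.softmax(torch.randn(L, L), dim=-1)   # layer-1 "transition" matrix (rows sum to 1)
P2 = torch.softmax(torch.randn(L, L), dim=-1)   # layer-2 "transition" matrix
two_layer_walk = P2 @ P1                        # how values get mixed after two layers
assert torch.allclose(two_layer_walk.sum(-1), torch.ones(L), atol=1e-5)
</pre>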
Group 7 Presentation: Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains
Presented by:
- Jonathan Gallagher
- Mariya Anashkina
Paper Citation
A. V. Makkuva et al., “Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains,” 2024, arXiv. doi: 10.48550/ARXIV.2402.04161.
Background
- Markov chains: probability models that predict the future state of a system based on its current and prior states.
- Language can be modeled as a Markov process, and token predictions from transformers follow Markov chains closely.
- The paper introduces a framework to systematically analyze (through the lens of a Markov chain) how transformers learn to model sequential data.
- The objective is to explore how a transformer succeeds or struggles on first-order Markov chains, a distribution that only needs one step of memory (a small data-generation sketch follows this list).
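For intuition, the kind of data source studied in the paper can be generated with a few lines of code; the function below is an illustrative sketch, with p and q the two flipping probabilities.
<pre>
import numpy as np

# Sample a binary first-order Markov chain with flipping probabilities p and q,
# the kind of data source the paper trains single-layer transformers on.
def sample_chain(p, q, n, seed=0):
    rng = np.random.default_rng(seed)
    x = [int(rng.random() < p / (p + q))]          # start from the stationary distribution
    for _ in range(n - 1):
        flip = p if x[-1] == 0 else q              # P(0 -> 1) = p, P(1 -> 0) = q
        x.append(1 - x[-1] if rng.random() < flip else x[-1])
    return np.array(x)

seq = sample_chain(p=0.6, q=0.7, n=1000)   # p + q > 1: the regime where the weight-tied model can get stuck
</pre>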
Theoretical Results
Theorem 1 (Global minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] Then for all [math]\displaystyle{ (p,q), }[/math] there exists a [math]\displaystyle{ \theta_{\star}\in\mathbb{R}^{D-d} }[/math] with an explicit construction such that it is a global minimum for the population loss [math]\displaystyle{ L(\cdot) }[/math].
Theorem 2 (Bad local minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] If [math]\displaystyle{ p+q\gt 1, }[/math] there exists an explicit [math]\displaystyle{ \theta_{\pi}\in\mathbb{R}^{D-d} }[/math] such that it is a bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math].
Theorem 3 (Global minimum). Consider the same setting as in Thm. 1. Then for all [math]\displaystyle{ (p, q), }[/math] if [math]\displaystyle{ \theta_{\star} = (e_{\star} = a_{\star}, \dots, b_{\star}) \in \mathbb{R}^{D-d} }[/math] is a global minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario, then its extension [math]\displaystyle{ \bar{\theta}_{\star} = (\bar{e}_{\star}, \bar{a}_{\star}) \in \mathbb{R}^D }[/math] is also a global minimum for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\star} }[/math] satisfies the same properties (ii)-(iv) as in Thm. 1.
Theorem 4 (Saddle point). Consider the same setting as in Thm. 3. For [math]\displaystyle{ p + q \gt 1, }[/math] let [math]\displaystyle{ \theta_{\pi} = (e_{\pi} = a_{\pi}, \dots, b_{\pi}) \in \mathbb{R}^{D-d} }[/math] be the corresponding bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario. Then its extension [math]\displaystyle{ \bar{\theta}_{\pi} = (\bar{e}_{\pi}, \bar{a}_{\pi}) \in \mathbb{R}^D }[/math] is a saddle point for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\pi} }[/math] satisfies the same properties (ii)-(iv) as in Thm. 2.
Main Contributions
1. The authors introduce a theoretical framework that models the data source as a Markov process. This allows them to study how transformers learn sequential structure, contrasting it with other approaches that treat training data simply as i.i.d.
2. Focusing on single-layer transformers trained for next-token prediction on first-order Markov data, the paper gives a detailed analysis of the cross-entropy loss surface: There is always a set of parameters that perfectly recover the true Markov transition probabilities (thus achieving the global minimum). When the sum of the Markov chain's flipping probabilities exceeds one, the model can converge to parameters that simply predict the stationary distribution rather than the true transition probabilities. This phenomenon does not arise (or becomes only a saddle) when weights are untied or when the transformer has multiple layers.
3. Through experiments, they show that the theoretical findings match empirical behaviors. In particular: When weights are tied, the model may learn a constant (stationary) probability and fail to leverage sequential context if the transition probabilities are above a certain threshold. Removing weight tying—or increasing the transformer's depth—helps avoid such bad local minima.
4. The authors extend the analysis to higher-order processes. Surprisingly, simply increasing the transformer depth does not guarantee learning of higher-order transitions. They find, however, that restricting the attention window (rather than letting the network attend to all past tokens) dramatically improves learning of higher-order Markov patterns.
Related work
1. Analyzing Transformer Architectures:
Vaswani et al. (2017) introduced the Transformer architecture, which has since been extensively studied to understand its theoretical foundations and practical applications.
Rogers et al. (2020) provided a comprehensive analysis of BERT, a model based on the Transformer architecture, discussing its linguistic capabilities and limitations.
2. Combining Attention Mechanisms with Probabilistic Models:
Bahdanau et al. (2015) introduced the attention mechanism in neural machine translation, allowing models to focus on relevant parts of the input sequence dynamically.
Graves et al. (2014) proposed the Neural Turing Machine, combining neural networks with external memory, enabling models to learn algorithmic tasks.
Future direction
1. Enhanced Interpretability of Transformer Models:
Applying the Markov chain framework to dissect and visualize the internal workings of Transformers could lead to more interpretable models, aiding in identifying and mitigating biases.
2. Development of Hybrid Models:
Integrating Markovian structures with attention mechanisms may result in hybrid models that leverage the strengths of both approaches, potentially improving performance in tasks requiring sequential dependencies.
3. Theoretical Analysis of Attention Dynamics:
Further exploration into the mathematical properties of attention modelled as Markov processes could provide deeper insights into the stability and convergence of Transformer-based models.
Group 7 Review: An Analysis of Transformers via Markov Chains
Presented by:
Jonathan Gallagher and Mariya Anashkina
Paper Citation
A. V. Makkuva et al., “Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains,” 2024, arXiv. doi: 10.48550/ARXIV.2402.04161.
Review
I thought this presentation made it really easy to see the connection between Markov Chains, and Transformers when tasked to predict future tokens. The lunch menu analogy made it really easy to see how a first order markov chain behaves, and was a great general idea to discuss before diving deeper into how Markov chains behave in the language world. Theoretical results were also well explained before linking them to empirical results and discussing why the empirical results behave the way they do.
The empirical results were also well explained, showing how removing weight tying or increasing model depth can help escape bad local minima and ensure the transformer's parameters reach the global minimum for first-order Markov chains, and how smaller context windows are required to escape bad local minima for higher-order Markov chains, even for deeper models (regardless of depth or weight tying).
I also thought this presentation was really well structured, and found that the slides were really easy to follow with great visual aids.
Group 8 Presentation: MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Presented by:
Nana Ye and Xingjian Zhou
Summaries of key points
This paper introduces MEDUSA, a lightweight yet effective method for accelerating inference in large language models by attaching multiple decoding heads at intermediate layers. The key idea is that, during inference, you can use these heads to predict several future tokens in parallel, reducing the number of sequential steps needed. Unlike speculative decoding (which relies on a separate draft model), MEDUSA uses the same model and architecture, just augmented with extra linear heads that predict future tokens based on intermediate hidden states. These predictions are verified by the base model in a final pass, similar in spirit to Draft & Verify, but with much lower overhead and implementation complexity. Despite its simplicity, MEDUSA achieves competitive speedups—up to 2×—on models like LLaMA, and integrates easily into existing transformer pipelines. It also preserves generation quality well, maintaining high accuracy across benchmarks without requiring retraining.
One potential limitation of MEDUSA is that its performance gains depend on the quality of intermediate predictions—if the early layers aren't predictive enough, the method may yield minimal speedup or introduce verification bottlenecks. Another concern is scalability: adding too many decoding heads could increase memory consumption or introduce architectural clutter. While the paper shows good results on standard benchmarks, it's less clear how MEDUSA performs in more complex decoding scenarios like beam search, sampling with temperature, or instruction-tuned models. Finally, although it's simple to implement, any modification to production LLM inference stacks still carries deployment costs, which MEDUSA somewhat underplays.
Clear explanations to aid understanding
Think of a transformer generating text one word at a time—slow, because each step waits on the previous. MEDUSA says: what if we could guess ahead a few tokens using partial information? It adds small prediction heads at different layers in the model, each trying to guess future tokens before the final layer finishes computing. Once these guesses are made, the base model verifies them. If correct, we skip ahead; if not, we fall back. It’s like speculative decoding, but self-contained—no second model, no complicated setup. You get the parallelism of speculative methods with the simplicity of just tweaking the model's architecture slightly.
Group 8 Presentation: MEDUSA: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Presented by:
- Nana Ye
- Xingjian Zhou
Paper Citation
T. Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” 2024, arXiv. doi: 10.48550/ARXIV.2401.10774.
https://arxiv.org/abs/2401.10774
Background
- As the size of LLMs grows, the speed at which they can generate tokens decreases. The bottleneck is primarily the transfer of data to/from the GPU
- Speculative Sampling is an existing solution that predicts multiple tokens in the future at once using smaller "draft" models
- Medusa instead solves this problem by adding multiple decoding heads and a tree-based attention mechanism to existing LLMs
- Paper discusses the implementations of Medusa1 and Medusa2
Main Idea
The idea is to replace the separate draft model with decoding heads attached to the model itself, since the best representation of a model is the model itself. The multiple heads predict multiple tokens at once to leverage parallelism. This makes drafting more efficient and provides more candidate tokens for the tree-based attention to choose from. The tree-based attention simulates sequential generation by traversing a tree from top to bottom, where the top node is the initial token.
Technical Contributions
Medusa 1:
- Uses a frozen pre-trained LLM and trains extra decoding heads on top
- Each additional decoding head predicts a token K time steps in the future
- Uses a probability loss function that scales based on the number of steps into the future
- Reduces memory usage because the backbone model is only used for hidden state extraction
- In simple terms, Medusa adds additional linear layers on top of the last hidden layer of the transformer, which are trained to predict tokens at future positions rather than just the next token as a conventional autoregressive transformer does (see the sketch below).
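A minimal sketch of such heads is shown below; the dimensions and head architecture are simplified assumptions, not the released MEDUSA implementation.
<pre>
import torch

# Minimal sketch of MEDUSA-1 style heads. Each head maps the backbone's last hidden state
# to a vocabulary distribution for a token several positions ahead; the backbone stays frozen.
d_model, vocab, num_heads = 512, 32000, 4
medusa_heads = torch.nn.ModuleList([
    torch.nn.Sequential(torch.nn.Linear(d_model, d_model), torch.nn.SiLU(),
                        torch.nn.Linear(d_model, vocab))
    for _ in range(num_heads)
])

hidden = torch.randn(1, d_model)                           # last hidden state from the frozen backbone
head_logits = [head(hidden) for head in medusa_heads]      # head k proposes the token at position t + k + 1
</pre>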
Medusa 2:
- Fine-tunes the LLM and trains the decoding heads at the same time.
- The authors encountered problems with high losses, so they switched to a two-stage training process:
- Stage 1: train only the Medusa heads (similar to Medusa 1)
- Stage 2: train both the backbone model and the Medusa heads together
Tree Attention
- Tree attention is used to enable the heads predicting later tokens to include the additional context which may have been created by the earlier medusa heads in the pipeline
- This tree structure does not occur autoregressively, however
- The top predictions from each head are fed into the tree structure as candidate tokens
- An attention mask is used to ensure that the prediction of each token in the tree is based only on prior tokens, not on tokens past the one being decoded
- Multiple future candidate tokens can be predicted with context-aware attention simultaneously
Self Distillation
- A dataset of prompts relevant to the desired model is created
- The full large language model generates outputs for these prompts in a typical autoregressive manner; the resulting prompt-output pairs form a training dataset for the self-distillation step
- Medusa Heads are trained on the generated training dataset
Tree Construction
Branches with a low probability of containing the next token are pruned from the tree of candidate tokens in tree attention, which reduces the computational cost of MEDUSA-2.
Empirical Evaluation
Experiments on various LLMs show consistent 2–3 times speedups in practice without harming output quality (assessed by GPT-4 and other metrics). The authors also include ablation studies on key design choices (number of heads, attention structure, sampling thresholds), confirming the effectiveness and generality of the proposed framework.
Group 9 Presentation: Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Presented by:
- Kaiyue Ma
- Wenzhe Wang
Paper Citation
T. Dao and A. Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality,” 2024, arXiv. doi: 10.48550/ARXIV.2405.21060.
Background
- Transformers are effective, but computationally expensive and suffer from quadratic complexity
- Structured state space models (SSMs) are an alternative that scales linearly instead and works for long range modeling
- SSMs have not received the same mainstream improvements as transformers, and lack support for parallelization and hardware acceleration
- Structured state space duality (SSD) bridges the gaps between transformers and SSMs
Additional Background
SSMs (State Space Models) are traditionally used in control theory to model a dynamic system via state variables. However, the paper https://compneuro.uwaterloo.ca/files/publications/voelker.2018.pdf found that SSMs are also well suited to describing time cells in the brain. A useful diagram of an SSM can be found here: https://cdn-uploads.huggingface.co/production/uploads/613b0a62a14099d5afed7830/G7icfkYoxIqHZcJGHM7UD.png, with n state variables, m inputs u, and p outputs y.
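For reference, the standard continuous-time state-space equations behind such a diagram are:
[math]\displaystyle{ \dot{x}(t) = A x(t) + B u(t), \qquad y(t) = C x(t) + D u(t) }[/math]
where [math]\displaystyle{ x \in \mathbb{R}^n }[/math] is the state, [math]\displaystyle{ u \in \mathbb{R}^m }[/math] the input, and [math]\displaystyle{ y \in \mathbb{R}^p }[/math] the output.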
Technical Contributions
- Represents SSMs as semiseparable matrices and uses semiseparable matrices for efficient matrix operations.
- Uses generalized linear attention mechanism with structured masks.
- Refines the original Mamba model to yield Mamba-2, which incorporates the new structured-state-space-duality algorithms. Mamba-2 is easily parallelizable and scales better to large state sizes. Empirical results demonstrate strong performance on language modelling benchmarks, surpassing older SSM models and matching or outperforming Transformers at various model scales.
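As a rough illustration of the duality (not the paper's blockwise algorithm), the sketch below shows that unrolling a scalar SSM recurrence gives exactly the same map as multiplying the input by a lower-triangular (1-semiseparable) matrix; the function names and toy dimensions are made up for the example.

```python
import numpy as np

def ssm_recurrent(a, b, c, x):
    """Run a scalar SSM recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def ssm_as_matrix(a, b, c):
    """Materialize the same map as a lower-triangular (1-semiseparable) matrix M,
    with M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t."""
    T = len(a)
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]
    return M

rng = np.random.default_rng(0)
a, b, c, x = (rng.uniform(0.5, 1.0, 6), rng.normal(size=6),
              rng.normal(size=6), rng.normal(size=6))
# The recurrent scan and the matrix form agree, which is the core of the duality.
assert np.allclose(ssm_recurrent(a, b, c, x), ssm_as_matrix(a, b, c) @ x)
```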
Summaries of Key Points
The paper explores the theoretical connections between Transformer architectures and Structured State Space Models (SSMs). The authors introduce the State Space Duality (SSD) framework, which bridges these two model families through the concept of structured semiseparable matrices. This framework reveals that certain attention mechanisms in Transformers can be interpreted as SSMs, providing a unified perspective on sequence modelling techniques.
Leveraging the SSD framework, the authors propose a new architecture called Mamba-2. This model refines the selective SSM approach used in the original Mamba model, resulting in a design that is 2-8 times faster while maintaining competitive performance in language modelling tasks. Mamba-2 achieves this efficiency by simplifying the SSM layer, enabling better scalability and computational speed.
The paper also introduces efficient algorithms based on block decompositions of semiseparable matrices, which enhance the computational efficiency of SSMs. These algorithms allow for larger recurrent state sizes and improve the practicality of SSMs in handling long-range dependencies within sequences.
Empirical evaluations demonstrate that Mamba-2 outperforms both its predecessor and Transformer models in terms of training efficiency and performance on language modelling benchmarks. The architecture also shows superior capabilities in associative recall tasks, highlighting its effectiveness in capturing and utilizing long-range dependencies.
In summary, the paper provides a theoretical foundation connecting Transformers and SSMs, introduces the Mamba-2 architecture as a practical application of this theory, and presents algorithms that enhance the efficiency and scalability of sequence modelling techniques.
Group 10 Presentation: Accelerating Large Language Model Decoding with Speculative Sampling
Paper Citation
C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, ‘Accelerating Large Language Model Decoding with Speculative Sampling’, Feb. 02, 2023, arXiv: arXiv:2302.01318. doi: 10.48550/arXiv.2302.01318.
https://arxiv.org/abs/2302.01318
Background
- Traditional autoregressive decoding for large language models is computationally expensive, as the entire model has to run once for each additional token that is generated
- Transformer neural networks are typically memory-bandwidth limited; quantization and distillation into smaller models are approaches that have been used in the past to improve LLM performance
Technical Contributions
- Speculative sampling was developed to increase the speed of LLM decoding without meaningfully reducing its performance on predicting future tokens compared to the base model
- Generates a draft of k future tokens using a smaller model
- Score the proposed tokens using the base model
- A modified rejection sampling scheme was developed by the authors in the paper
- The acceptance of a draft token is based on the minimum of 1 and the ratio of the target model's probability to the draft model's probability for that token.
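The sketch below illustrates this acceptance rule under the usual speculative-sampling formulation: accept the draft token with probability min(1, p_target/p_draft), otherwise resample from the clipped residual distribution max(0, p_target - p_draft). The function name and the toy distributions are illustrative, not the paper's code.

```python
import numpy as np

def accept_or_resample(draft_token, p_draft, p_target, rng):
    """One step of the modified rejection scheme (sketch).

    p_draft, p_target: probability vectors over the vocabulary from the draft
    and target models at this position. The draft token x is accepted with
    probability min(1, p_target[x] / p_draft[x]); otherwise a replacement is
    drawn from the residual distribution max(0, p_target - p_draft), renormalized.
    """
    x = draft_token
    if rng.random() < min(1.0, p_target[x] / p_draft[x]):
        return x, True
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual), False

rng = np.random.default_rng(0)
p_draft = np.array([0.5, 0.3, 0.2])
p_target = np.array([0.25, 0.5, 0.25])
print(accept_or_resample(1, p_draft, p_target, rng))  # token 1 is always accepted here
```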
Explanations to Aid Understanding
- Transformer Decoding: This is the process by which a large language model generates text. Given an input prompt, the model sequentially selects the most probable next token, then uses that token to predict the subsequent one, and so on, and this process is computationally intensive.
- Speculative Sampling: Unlike traditional decoding where one token is generated at a time by the target model, speculative sampling aims to generate multiple tokens in parallel by using a faster, potentially smaller "draft model" to propose a short sequence (the draft). The target model evaluates these drafted tokens, and a rejection sampling mechanism decides which ones to accept, ensuring the output remains consistent with the target model's capabilities.
- Parallel Scoring: Instead of computing the logits for each drafted token sequentially, the method computes the logits for all (K+1) tokens in the draft at the same time. The presentation notes that the computing time for this parallel process is similar to sampling a single token with the target model, which is a key factor in the potential speedup. The key insight is that the model inference pass is dominated by memory bandwidth and inter-device communication rather than purely by the token count. By handling several drafted tokens per pass, overall decoding time is greatly reduced.
Summaries of Key Points
- Decoding Challenges in LLMs: Traditional autoregressive sampling methods generate one token at a time, leading to inefficiencies, especially as model sizes grow.
- Speculative Sampling Overview: SpS utilizes a draft model, which is a smaller and faster version of the target LLM, to propose a sequence of tokens. The target model then evaluates this proposed sequence, accepting or rejecting tokens based on a modified rejection sampling scheme that ensures the final output aligns with the target model's distribution.
- Algorithm Efficiency: The draft model generates a short sequence (e.g., K tokens) in parallel. The target model scores this sequence, and tokens are accepted or resampled as needed, allowing for the potential acceptance of up to K+1 tokens per iteration. This parallel approach contrasts with traditional methods that generate tokens sequentially, thereby reducing decoding latency.
- Empirical Results: Implementing SpS with Chinchilla, a 70-billion-parameter language model, resulted in a 2 to 2.5 times speedup in decoding across benchmark tasks such as XSum and HumanEval. These speed improvements were achieved without degrading sample quality or requiring changes to the target model's parameters or architecture.
- Advantages of Speculative Sampling: Maintains the integrity of the target model's output distribution. Requires no alterations to existing model architectures, facilitating easy integration into current systems. Demonstrates versatility across various tasks and decoding methods, making it broadly applicable in LLM deployments.
Group 11 Presentation: Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff
Presented by:
Yiyuan Yang, Anar Kuatzhan, Chuan Zhang
Paper Citation
Arora, S., Eyuboglu, S., Zhang, M., Timalsina, A., Alberti, S., Zinsley, D., ... & Ré, C. (2024). Simple linear attention language models balance the recall-throughput tradeoff. arXiv preprint arXiv:2402.18668. https://arxiv.org/pdf/2402.18668
Background
Large language models still struggle with efficiency. In attention-based models, attention requires a huge number of calculations as the input gets longer, and it stores every previous token in memory, which makes it memory-intensive. New language models developed in recent years can generate text faster while maintaining low perplexity, but low perplexity does not necessarily mean good recall. Gated convolutional models also struggle with recall, whereas attention-based models excel at recall tasks. To address this tension, the authors introduced the Based model.
Technical Contributions
Memory-Recall Tradeoff: Observed both within and across architecture classes.
Performance with Fixed Recurrent State: Not all architectures have the same recall capacity. Mamba optimally utilizes limited memory budgets while convolutional architectures underperform with memory constraints.
The Based model combines local fine-grained (sliding-window) attention with long-range linear attention based on a Taylor approximation of the softmax exponential. The combination is sub-quadratic during training and admits an efficient recurrent view at inference. Based outperforms prior sub-quadratic architectures in recall quality by up to 6.2 accuracy points.
Architecture
Softmax-approximating linear attention (applied globally) + exact softmax attention with sliding windows (applied locally)
This combination achieves 90.8% of full softmax attention's recall accuracy while reducing latency by a factor of 100,000.
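As a sketch of the linear-attention half only (Based also mixes in exact sliding-window softmax attention and applies the feature map to low-dimensional projections), the code below uses a second-order Taylor feature map so that phi(q) · phi(k) ≈ exp(q · k) and computes causal attention with cumulative sums in linear time; the helper names and tensor shapes are illustrative assumptions.

```python
import torch

def taylor_feature_map(x):
    """Second-order Taylor feature map so that phi(q) @ phi(k) ≈ exp(q @ k).
    exp(s) ≈ 1 + s + s^2 / 2 with s = q @ k, and s^2 = (q⊗q) . (k⊗k)."""
    ones = torch.ones(*x.shape[:-1], 1)
    second = (x.unsqueeze(-1) * x.unsqueeze(-2)).flatten(-2) / (2 ** 0.5)
    return torch.cat([ones, x, second], dim=-1)

def taylor_linear_attention(q, k, v):
    """Causal linear attention with the Taylor feature map (linear in sequence length)."""
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)
    kv = torch.cumsum(fk.unsqueeze(-1) * v.unsqueeze(-2), dim=0)   # running sum of phi(k)^T v
    z = torch.cumsum(fk, dim=0)                                    # running sum of phi(k)
    num = torch.einsum('td,tdv->tv', fq, kv)
    den = torch.einsum('td,td->t', fq, z).clamp(min=1e-6)
    return num / den.unsqueeze(-1)

q = k = v = torch.randn(8, 4)   # toy example: sequence length 8, head dimension 4
print(taylor_linear_attention(q, k, v).shape)   # torch.Size([8, 4])
```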
Group 12 Presentation: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Paper Citation
Y. Li, F. Wei, C. Zhang, and H. Zhang, ‘EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees’, Jun. 30, 2024, arXiv: arXiv:2406.16858. doi: 10.48550/arXiv.2406.16858.
https://arxiv.org/abs/2406.16858
Background
- LLMs to date have shown great performance, but they are slow and computationally expensive
- Speculative sampling: a small model generates candidate tokens, and the large model then verifies those tokens, reducing the number of times the expensive computations of the large model have to occur
- EAGLE 1 used a draft tree to improve the performance of speculative decoding, and this paper builds on the authors' prior work
Paper Contributions
- EAGLE 1 does not directly predict tokens; instead it maps tokens to features, predicts the next features, and then predicts tokens from those features
- This approach was shown to improve upon previous speculative decoding methods
- EAGLE uses a tree structure to propose alternative tokens when a speculative-sampling draft token is rejected by the full model
- EAGLE 2 observed that token acceptance depends on both context and position: the first token has a high acceptance rate, while later tokens have lower acceptance rates
- EAGLE 2 uses a dynamic draft tree that incorporates "tree attention" to bring context information into the selection of the next candidate tokens, increasing their acceptance rate, since acceptance depends on context and not only position
Summaries of Key Points
- Problem Addressed: Large language model inference is slow and computationally demanding due to the large parameter sizes and the sequential nature of token generation.
- Foundation: EAGLE-2 builds upon speculative sampling, where a smaller "draft model" proposes candidate tokens that are then verified by the full large language model.
- Improvement over EAGLE-1: EAGLE-1 applied a static draft tree based on the assumption that draft-token acceptance rates depend only on their position in the tree. EAGLE-2 recognizes that acceptance rates are also highly context-dependent and introduces a dynamic adjustment mechanism for the draft tree.
- Key Stages
1. Drafting: EAGLE applies the draft model to produce feature representations instead of directly predicting tokens. The feature representations are then passed to the LM head of the large language model to generate token predictions.
2. Verification: Tree-structured verification is employed by EAGLE. This is more efficient than chain-structured verification in standard speculative sampling.
- Dynamic Draft Tree Adjustment: EAGLE-2 dynamically adjusts the structure of the draft tree based on confidence scores of the draft model. This addresses the context-dependent nature of token acceptance rates.
- Dynamic Expansion and Re-ranking (a small sketch of this selection step follows this list)
1. Expansion Phase: EAGLE-2 introduces an expansion phase that utilizes tree attention to process all tokens in a layer simultaneously, which improves efficiency. It also employs selective expansion, prioritizing only the top-k tokens with the highest estimated global acceptance probabilities (based on confidence scores) to avoid exponential growth.
2. Re-ranking Phase: EAGLE-2 re-ranks all draft tokens by selecting the top-m tokens with the highest global acceptance probabilities, prioritizing shallower nodes in case of ties.
- Experimental Results: EAGLE-2 achieved acceleration ratios of 3.05x - 4.26x across various tasks and large language model series such as Vicuna, Llama 2, and Llama 3, making it 20% - 40% faster than EAGLE-1. It is also approximately 2 times faster than Medusa and 2.3 times faster than Lookahead. On token throughput, EAGLE-2 processes 4-5.5 tokens per verification cycle, about twice as many as traditional speculative sampling.
- Major Advantages
1. Out-of-the-box Usability: EAGLE-2 does not require additional model training as it leverages the pre-trained draft model from EAGLE-1.
2. Reliability: EAGLE-2 does not modify the original model parameters or relax acceptance conditions, so the output distribution stays consistent with that of the original model.
3. Task and Model Generalization: EAGLE-2 generalizes effectively across multiple tasks and model architectures and demonstrates strong robustness in diverse applications.
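As a minimal sketch of the selection heuristic referenced above (not the released EAGLE-2 code), the snippet below scores each draft-tree node by the product of the draft model's confidences along its path and keeps the top-m nodes, breaking ties in favour of shallower nodes; the data format and function name are assumptions for illustration.

```python
import heapq

def select_draft_tokens(candidates, top_m=8):
    """Re-rank draft-tree nodes by estimated global acceptance probability (sketch).

    candidates: list of (path_confidences, token) pairs, where path_confidences
    holds the draft model's confidence for every token on the path from the root
    to this node. The global estimate is the product of confidences along the
    path, which is why shallow, high-confidence branches tend to win.
    """
    def global_score(path_confidences):
        score = 1.0
        for c in path_confidences:
            score *= c
        return score

    scored = [(global_score(confs), len(confs), tok) for confs, tok in candidates]
    # Highest score first; ties broken in favour of shallower nodes (smaller depth).
    return heapq.nlargest(top_m, scored, key=lambda s: (s[0], -s[1]))

candidates = [([0.9], "the"), ([0.9, 0.8], "cat"), ([0.4], "a"), ([0.4, 0.3], "dog")]
print(select_draft_tokens(candidates, top_m=2))   # keeps "the" (0.9) and "cat" (0.72)
```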
Group 13 Presentation: Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation
Presented By
Yuke Liu, Mei Si
Paper Citation
R. Li, J. Su, C. Duan, and S. Zheng, ‘Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation’, Aug. 20, 2020, arXiv: arXiv:2007.14902. doi: 10.48550/arXiv.2007.14902.
https://arxiv.org/abs/2007.14902
Background
- Existing transformer models have [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity, which is problematic as model size grows
- This limits the growth of model sizes due to computational resources constraints
- This paper focused on an alternative method to conventional dot product attention that is more computationally efficient
- Standard attention requires computing [math]\displaystyle{ Q K^\top }[/math], which has [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity
- The paper proposes a linear attention mechanism that solves the problem while keeping the performance.
Technical Contributions
Rather than doing the full computation for the softmax in the transformer architecture, the authors instead compute
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j} \mathbf{v}_j}{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j}} = \frac{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j) \mathbf{v}_j}{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j)} }[/math]
and define the transformation function as
[math]\displaystyle{ \text{sim}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i)^{T} \phi(\mathbf{k}_j) }[/math]
The authors apply a first-order Taylor series expansion and, after some rearranging and substitution, arrive at their final formula (full derivation not shown)
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{ \sum_j \mathbf{v}_{i,j} + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \left( \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)^{T} \mathbf{V} \right) }{ N + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \sum_j \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)_{i,j}^{T} } }[/math]
The authors' form of the attention mechanism can be computed in [math]\displaystyle{ \mathcal{O} (n) }[/math] complexity, removing the quadratic scaling of conventional transformers and enabling larger models (and larger inputs) built on this attention mechanism
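A minimal PyTorch sketch of the formula above, assuming row-wise L2 normalization of queries and keys and a single attention head; the variable names and the small epsilon are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def first_order_linear_attention(Q, K, V, eps=1e-6):
    """Linear attention with sim(q, k) = 1 + (q / ||q||) . (k / ||k||)  (sketch).

    Q, K, V: (N, d) matrices for one attention head. Because the similarity is a
    dot product of fixed feature maps, the sums over keys can be precomputed once,
    giving O(N d^2) cost instead of the O(N^2 d) of softmax attention.
    """
    Qn = F.normalize(Q, dim=-1)                 # q / ||q||_2
    Kn = F.normalize(K, dim=-1)                 # k / ||k||_2
    v_sum = V.sum(dim=0, keepdim=True)          # sum_j v_j, shape (1, d)
    kv = Kn.transpose(0, 1) @ V                 # sum_j (k_j / ||k_j||) v_j^T, shape (d, d)
    k_sum = Kn.sum(dim=0, keepdim=True)         # sum_j k_j / ||k_j||, shape (1, d)
    numerator = v_sum + Qn @ kv                 # (N, d)
    denominator = Q.shape[0] + Qn @ k_sum.transpose(0, 1)   # (N, 1)
    return numerator / (denominator + eps)

Q, K, V = torch.randn(3, 100, 16).unbind(0)     # toy example: N = 100 positions, d = 16
print(first_order_linear_attention(Q, K, V).shape)   # torch.Size([100, 16])
```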
Model Performance Evaluation
The model performance evaluation primarily focused on assessing how well the proposed linear attention mechanism enhances semantic segmentation performance while maintaining computational efficiency.
- Dataset: The experiments were conducted using the Fine-Grained Image Dataset, which consists of high-resolution satellite images. This dataset was selected due to its complex landscapes and varied environmental conditions, which present significant challenges.
- Data Preprocessing: Due to the large size of the images in the dataset, each image was divided into smaller patches of 256x256 pixels, resulting in a total of 7,280 patches.
- Data Split: These patches were methodically partitioned into 60% for training, 20% for validation, and 20% for testing. This split ensured a rigorous evaluation of the model's performance across different scenarios.
- The standard dot-product attention is replaced by the linear attention mechanism in the baseline segmentation models such as U-Net, Res101, DeepLab, PSPNet, etc
- Implementation Details: The experiments were implemented using PyTorch framework and trained on an NVIDIA RTX 2080Ti GPU.
- Evaluation Metrics: overall accuracy (OA), average accuracy (AA), kappa coefficient (K), mean intersection over union (mIoU), and F1 score.
Comparative Analysis
The proposed linear attention mechanism achieves a similar level of accuracy in semantic segmentation tasks compared to traditional dot product attention.
Linear attention maintains comparable performance metrics while drastically reducing both memory footprint and processing time required.
The efficiency gains of linear attention become more apparent in large-scale segmentation tasks, making it suitable for high-resolution images and long text sequences.
Future work includes extending the linear attention mechanism to more complex network designs, integrating with state-of-the-art deep learning models, and optimizing for real-time processing in demanding applications.
Method | OA | AA | K | mIoU | F1 |
---|---|---|---|---|---|
U-Net | 86.378 | 74.532 | 83.357 | 64.516 | 75.532 |
U-Net LAM | 87.692 | 77.297 | 84.935 | 68.038 | 78.593 |
Res101 | 89.251 | 80.451 | 86.846 | 72.433 | 82.510 |
Res101 LAM | 90.178 | 82.757 | 88.041 | 74.085 | 83.105 |
RefineNet | 89.857 | 81.169 | 87.597 | 73.167 | 83.113 |
RefineNet LAM | 90.214 | 83.544 | 88.083 | 74.973 | 84.311 |
DeepLab | 89.388 | 80.905 | 87.079 | 71.809 | 81.077 |
DeepLab LAM | 89.576 | 81.692 | 87.287 | 72.827 | 82.702 |
DeepLabV3+ | 90.125 | 81.483 | 87.959 | 72.668 | 81.492 |
DeepLabV3+ LAM | 90.315 | 81.695 | 88.182 | 73.727 | 82.736 |
PSPNet | 90.573 | 82.211 | 88.485 | 74.797 | 83.761 |
PSPNet LAM | 90.725 | 83.088 | 88.677 | 75.695 | 84.480 |
FastFCN | 90.336 | 83.625 | 88.221 | 74.364 | 83.704 |
FastFCN LAM | 90.835 | 83.075 | 88.769 | 75.174 | 84.023 |
Group 14 Presentation: Scalable Watermarking for Identifying Large Language Model Outputs
Presented by:
Ryan Tymkow and Benjamin Schnapp
Paper Citation
Dathathri, S., See, A., Ghaisas, S. et al. Scalable watermarking for identifying large language model outputs. Nature 634, 818–823 (2024). https://doi.org/10.1038/s41586-024-08025-4.
Summaries of key points
This paper tackles the problem of watermarking—embedding a detectable signal into the outputs of large language models (LLMs) to distinguish generated text from human-written content. The challenge is doing this in a way that is robust, scalable, and minimally intrusive to the model's output quality. The authors propose a method based on statistical biasing of token selection during generation. Specifically, they partition the vocabulary into “greenlist” and “redlist” tokens at each step based on a seeded hash function, and then bias sampling toward the greenlist using a tunable parameter. The watermark is invisible in individual outputs but detectable over longer texts using hypothesis testing. Importantly, the approach is model-agnostic, doesn’t require retraining, and adds very little computational overhead. It also scales well to large models and can be applied in high-throughput or real-time settings. Overall, it’s a lightweight yet effective strategy for watermarking that balances detectability, scalability, and usability.
A key limitation of this approach is that it may still be vulnerable to paraphrasing or text transformation—simple rewriting could break the statistical signature. Another concern is adversarial robustness: since the watermarking method is relatively transparent (based on vocabulary partitioning), a knowledgeable attacker could design strategies to erase or spoof the signal. Additionally, while the method maintains fluency and quality in most cases, biasing token selection could subtly affect stylistic or semantic nuances, especially in creative writing or long-form tasks. The paper doesn’t deeply explore how this might influence user-facing applications like chatbots or summarizers. Lastly, while the watermark is statistically detectable, it’s not embedded in a cryptographic sense, so it may not offer strong guarantees in high-stakes verification contexts.
Clear explanations to aid understanding
Imagine if every time a language model generated text, it subtly preferred certain words over others—but in a way that's invisible to readers. That’s what this watermark does. At each generation step, the model hashes its current context to choose a “greenlist” of preferred tokens and then slightly boosts their probabilities. Over many words, these choices form a statistically detectable pattern. It's like nudging a roulette wheel so certain numbers come up just a bit more often—not enough to be obvious, but enough to spot if you know where to look. The method is efficient and easy to integrate, since it works at the sampling level and doesn't require modifying the model architecture.
Group 14 Presentation: Scalable Watermarking for Identifying Large Language Model Outputs
Presented by:
Ryan Tymkow, Benjamin Schnapp
Paper Citation
Dathathri, S., See, A., Ghaisas, S. et al. Scalable watermarking for identifying large language model outputs. Nature 634, 818–823 (2024). https://doi.org/10.1038/s41586-024-08025-4
Background
With the rise of LLMs and generative AI, there is a risk of AI-generated misinformation spreading as if it were written by a human, so it is important to distinguish AI-generated text from human writing. Solutions exist to address this problem, but they all have limitations involving privacy and computational cost. For example, traditional watermarking can introduce unwanted artifacts into the text.
Previous Approaches
Some of the existing approaches are Retrieval Based Tracking, Post-Hoc Detection, and Traditional Watermarking.
Retrieval Based Tracking stores generated LLM responses in a reference database to determine whether a newly generated response is from the tracked LLM. The issue with this is scalability and privacy since you are storing generated responses.
Post-Hoc Detection uses statistical features of LLM-generated text. The issues with this are its high computational cost and the fact that, as LLM output becomes more human-like, detection becomes more difficult.
Traditional watermarking embeds hidden features in the text, for example by replacing synonyms or Unicode characters, but tinkering with the output in this way degrades the quality of the original LLM's text.
Technical Contributions
This paper introduced SynthID-Text, a watermarking method for large language models (LLMs). The method uses a Tournament sampling approach, which ensures that generated text contains a detectable watermark with minimal computational overhead. SynthID-Text incorporates a random seed generator and scoring functions to embed a watermark into the model's output. This technique enhances the ability to identify whether text originates from a specific LLM, while preserving the text's quality and minimizing distortion. SynthID-Text does not affect LLM training, and it can be configured as distortionary or non-distortionary.
Central to SynthID-Text is the novel Tournament sampling procedure. Rather than sampling each token directly from the LLM's distribution, multiple candidate tokens compete in a multi-layer "tournament" based on pseudorandom watermarking scores, embedding a statistical “signature” that can later be detected.
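As a rough illustration of this idea (not the actual SynthID-Text algorithm, whose scoring functions and non-distortion guarantees are more involved), the sketch below draws several candidates from the model's ordinary distribution and lets them compete pairwise on hashed pseudorandom g-values; the `g_value` hash, the number of layers, and the toy vocabulary are all assumptions for the example.

```python
import hashlib
import random

def g_value(token, context, layer, key):
    """Pseudorandom watermarking score in [0, 1) for a (token, context, layer) triple (sketch)."""
    digest = hashlib.sha256(f"{key}|{context}|{layer}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def tournament_sample(sample_token, context, key, num_layers=3):
    """Pick the next token via a multi-layer tournament (sketch).

    sample_token() draws a candidate from the LLM's ordinary next-token
    distribution. 2**num_layers candidates are drawn and compete pairwise;
    in each layer the candidate with the higher pseudorandom g-value advances,
    so the final choice is biased toward high-g tokens. Detection later checks
    whether a text's g-values are higher than chance.
    """
    candidates = [sample_token() for _ in range(2 ** num_layers)]
    for layer in range(num_layers):
        next_round = []
        for a, b in zip(candidates[::2], candidates[1::2]):
            winner = a if g_value(a, context, layer, key) >= g_value(b, context, layer, key) else b
            next_round.append(winner)
        candidates = next_round
    return candidates[0]

# Toy usage with a hypothetical 3-token vocabulary and a fixed sampling distribution.
rng = random.Random(0)
sample = lambda: rng.choices(["cat", "dog", "fish"], weights=[0.5, 0.3, 0.2])[0]
print(tournament_sample(sample, context="the quick", key="secret"))
```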
Results
SynthID-Text achieved a 95% true-positive detection rate with 1% false positives in a 20-million-interaction test on Google's Gemini chatbot. It offers high detection accuracy with minimal computational cost and can be configured for non-distortionary or distortionary watermarking.
The benefits are as follows:
Minimal impact on large language model training:
- SynthID-Text can be applied to any large language model with minimal modifications, as it only alters the sampling step.
High detection accuracy with low computational cost:
- Outperforms retrieval-based tracking, post hoc detection, and traditional watermarking methods.
- Offers the best balance between computational cost, scalability, and accuracy.
- Can be integrated into production environments using speculative sampling, where smaller models suggest tokens and the main model verifies their validity.
Configurable distortion levels:
- Allows for distortionary or non-distortionary configurations, enabling better control over the trade-off between generated text quality and detection accuracy.
- In non-distortionary watermarking, the average token distribution of the generated text matches the original model's token distribution.
Group 15 Presentation: DiGress: Discrete denoising diffusion for graph generation
Presented by:
Sean Tang, Buji Wong
Paper Citation
Vignac, C., Krawczuk, I., Siraudin, A., Wang, B., Cevher, V., & Frossard, P. (2022). Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734.
Background
Graph generation
The goal of this project is to generate graphs, which are represented by node matrices and edge matrices. Edges and nodes can also have their own categories. One application of this is molecule generation: atoms would be represented by nodes and the chemical bonds would be represented by edges.
The challenge of graph generation is a complex task due to the unordered nature and sparsity of graphs. While denoising diffusion models have been successful in other domains like images, they struggle with graphs due to their structural properties. Existing approaches that use continuous noise models disrupt the sparsity and connectivity crucial for graphs.
Technical Contributions
DiGress
The authors introduce DiGress, a discrete denoising diffusion model designed specifically for graph generation with categorical node and edge attributes. DiGress improves graph generation by using a discrete noise model that preserves graph sparsity and structural properties. The forward process progressively edits graphs through edge addition/removal and attribute changes. A graph transformer network is trained with a cross-entropy loss to reverse this noising process; sampling from the trained model proceeds by iteratively updating the noise level and computing structural features.
Key enhancements include a noise model that maintains node and edge type distributions, a guidance procedure for conditioning on graph-level features, and the use of auxiliary graph-theoretic features. DiGress achieves state-of-the-art performance on both molecular and non-molecular graph datasets.
Results showed DiGress outperformed continuous diffusion methods on various metrics, including degree distribution, clustering, and novelty, and was more scalable for larger graphs. Moreover, for generating novel molecules, discrete diffusion aids scalability to larger graphs and molecules, making it more efficient than continuous diffusion. DiGress is the first one-shot graph-based model that feasibly trains on over a million molecules without fragment-based heuristics. Its performance on drug-like molecule benchmarks reaches or exceeds that of specialized autoregressive or SMILES-based baselines.
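To make the idea of discrete noise concrete, here is a minimal sketch of one forward noising step on categorical node types using a simple uniform transition matrix; DiGress itself favours transitions toward the marginal node/edge-type distributions and also noises edges, so this is only illustrative, and all names are made up.

```python
import numpy as np

def noise_node_types(node_types, num_classes, beta, rng):
    """One discrete-diffusion noising step on categorical node types (sketch).

    With probability 1 - beta a node keeps its type; with probability beta it is
    resampled uniformly. This corresponds to the transition matrix
    Q = (1 - beta) * I + (beta / K) * 11^T applied independently to each node.
    (DiGress itself favours transitions toward the marginal type distribution.)
    """
    K = num_classes
    resample = rng.random(len(node_types)) < beta
    random_types = rng.integers(0, K, size=len(node_types))
    return np.where(resample, random_types, node_types)

rng = np.random.default_rng(0)
x0 = np.array([0, 0, 1, 2, 2, 2])          # e.g. atom types C, C, N, O, O, O
print(noise_node_types(x0, num_classes=3, beta=0.3, rng=rng))
```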
Group 16 Presentation: Machine Learning and Hamilton-Jacobi-Bellman Equation for Optimal Decumulation: a Comparison Study
Presented by:
Zeyu Zhang
Paper Citation
Chen M, Shirazi M, Forsyth PA, Li Y. Machine Learning and Hamilton-Jacobi-Bellman Equation for Optimal Decumulation: a Comparison Study. Published online 2023. doi:10.48550/arxiv.2306.10582
Background
The paper is based on computational finance, focusing on the optimization problem related to "defined benefit" and "defined contribution plans". The main focus is on the challenge of ensuring retirees have enough funds for their retirement. Two key plans were discussed:
"Defined benefit plans" guarantee fixed monthly payments based on factors like tenure and salary but are cost-prohibitive and risky.
"Contribution plans" shift the investment and withdrawal strategy burden to individual investors, but they struggle to balance maximizing withdrawals and minimizing risk.
This problem, often called the "Nazi's hardest problem in finance," highlights the complexity of balancing risk and reward in financial planning for retirement.
The 4% rule is a traditional method recommending a constant 4% withdrawal each year, adjusted for inflation, and investing in stocks and bonds.
Despite its popularity, the 4% rule is suboptimal and not globally optimal
Peter Fauci proposed the HJB PDE method to maximize expected withdrawal and minimize the risk of running out of savings.
The HJB PDE method uses scalarization techniques to achieve Pareto optimal points, but it has limitations.
Technical Contributions
1. Hamilton-Jacobi-Bellman (HJB):
- The problem formulation involves complex mathematical equations related to computational finance.
- The approach uses dynamic programming to break down the optimal control problem, leading to the HJB equation satisfied by the value function.
- The paper assumes stock and bond prices follow a jump diffusion model.
- The investors' total wealth at time [math]\displaystyle{ t }[/math] is defined as the sum of stock price and bond price at that time.
- The investment horizon [math]\displaystyle{ T }[/math] is set to 30 years, and rebalancing times are defined with discrete withdrawal amounts and allocations for stocks and bonds.
2. Neural Network (NN): Control and Objective Function:
- The control at time [math]\displaystyle{ T_i }[/math] includes the withdrawal amount [math]\displaystyle{ Q_i }[/math] and allocation for the wealth at time [math]\displaystyle{ T_i^- }[/math].
- The admissible control set is defined, and the expected shortfall is introduced as a measure of risk.
- The expected total withdrawal is used as a measure of reward, aiming to maximize the expected total withdrawal while minimizing the expected shortfall.
- The pre-commitment in the expected shortfall problem is defined, focusing on maximizing the expected total withdrawal and minimizing the expected shortfall.
Neural Network (NN) Formulation
As an alternative to the HJB framework, the authors propose a Neural Network (NN) approach to solve the stochastic control problem. This framework has several advantages:
1. The NN approach is data-driven, meaning it avoids explicitly defining parametric models for stochastic processes. This provides flexibility and allows integration of auxiliary variables if needed.
2. It circumvents the computation of high-dimensional conditional expectations by solving a single, unconstrained optimization problem for control decisions. This avoids the curse of dimensionality often associated with dynamic programming.
3. If the optimal control is continuous in time and state, the NN reflects this property. If the control is discontinuous, the NN yields a smooth approximation, which is beneficial in practice for implementing investment policies.
4. The method is scalable, making it suitable for long horizons and high-frequency rebalancing without significantly increasing computational complexity.
NN Approximation Setup
- The control policy [math]\displaystyle{ \mathcal{P} }[/math] is approximated using two feed-forward neural networks, with parameters [math]\displaystyle{ \boldsymbol{\theta}_q }[/math] and [math]\displaystyle{ \boldsymbol{\theta}_p }[/math], representing withdrawal and allocation strategies respectively.
- These networks take as inputs the wealth [math]\displaystyle{ W(t_i) }[/math] and time [math]\displaystyle{ t_i }[/math] to approximate the control decisions:
[math]\displaystyle{ \hat{q}(W_i^-, t_i^-, \boldsymbol{\theta}_q) \approx q_i(W_i^-), \quad \hat{p}(W_i^+, t_i^+, \boldsymbol{\theta}_p) \approx p_i(W_i^+) }[/math]
- The final control policy is given by: [math]\displaystyle{ \hat{\mathcal{P}} = \{ (\hat{q}(\cdot), \hat{p}(\cdot)) \} \approx \mathcal{P} }[/math]
- The functions [math]\displaystyle{ \hat{p} }[/math] and [math]\displaystyle{ \hat{q} }[/math] use time as one of the inputs, allowing a single NN to handle decisions across all rebalancing points, rather than training separate models for each time step.
- The paper also discusses how the architecture includes activation functions that enforce stochastic constraints naturally.
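The sketch below shows one way such constraint-respecting control networks could look in PyTorch: a small feed-forward net mapping (wealth, time) to a bounded output via a rescaled sigmoid, with one network for the withdrawal and one for the allocation. The layer sizes, the numeric bounds, and the `ControlNet` name are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ControlNet(nn.Module):
    """Small feed-forward net mapping (wealth, time) to a value in [lo, hi] (sketch).

    A sigmoid output rescaled to [lo, hi] enforces the constraint on the control,
    e.g. allocation p in [0, 1] or withdrawal q in [q_min, q_max].
    """
    def __init__(self, lo, hi, hidden=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())
        self.lo, self.hi = lo, hi

    def forward(self, wealth, t):
        x = torch.stack([wealth, t], dim=-1)
        return self.lo + (self.hi - self.lo) * self.net(x).squeeze(-1)

# One network for the withdrawal q in [q_min, q_max], one for the allocation p in [0, 1];
# because time is an input, a single pair of networks covers all rebalancing dates.
withdrawal_net = ControlNet(lo=35.0, hi=60.0)   # hypothetical withdrawal bounds
allocation_net = ControlNet(lo=0.0, hi=1.0)
wealth, t = torch.tensor([800.0, 950.0]), torch.tensor([0.0, 5.0])
print(withdrawal_net(wealth, t), allocation_net(wealth, t))
```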
Group 18 Presentation: HIGEN: HIERARCHICAL GRAPH GENERATIVE NETWORKS
Presented by:
- Shiyu Zhu
- Jesse Xue
Paper Citation
M. Karami, “HiGen: Hierarchical Graph Generative Networks,” 2023, arXiv. doi: 10.48550/ARXIV.2305.19337.
Background
- The standard softmax attention has quadratic computational complexity at prediction time, and a plain linear attention mechanism leads to an underperforming model.
- Hierarchical or Multi-Scale Structure: Captures high level interactions or relationships between objects or groups, while also representing the lower level structures. One example is a company org chart.
- Existing graph generating models include: Variational Autoencoders, Generative Adversarial Networks, Autoregressive Models (GNN, Graph RNN, GRAN) and Diffusion Models
- Paper introduces HIGEN, Hierarchical Graph Generative Networks to address problems with existing models.
- Experiments were conducted on 5 datasets of increasing size and scale, with GraphRNN, GRAN, DiGress, GDSS, and SPECTRE as baselines.
Technical Contributions
- Related Hierarchical Methods: The presentation discusses several recent hierarchical methods in specific domains like chemistry, highlighting HiGen's broader applicability, multi-level approach, and parallel generation as advantages over these more specialized techniques. These include a multi-based generation for molecular graphs (2020) relying on domain knowledge, a hierarchical normalizing flow model (2021) based on local neighborhoods, and a tree decomposition framework (2022) limited to a single abstraction level and medium-sized graphs.
Definition: Hierarchical Graph
A Hierarchical Graph is a multi-level representation of a graph [math]\displaystyle{ \mathcal{G} = (\mathcal{V}, \mathcal{E}) }[/math] where:
- [math]\displaystyle{ \mathcal{V} }[/math] is the set of nodes (vertices), and [math]\displaystyle{ \mathcal{E} }[/math] is the set of edges, with sizes [math]\displaystyle{ n = |\mathcal{V}| }[/math] and [math]\displaystyle{ m = |\mathcal{E}| }[/math].
- A node partition function [math]\displaystyle{ \mathcal{F}: \mathcal{V} \rightarrow \{1, ..., c\} }[/math] groups nodes into [math]\displaystyle{ c }[/math] communities or clusters.
- Each cluster forms a subgraph [math]\displaystyle{ \mathcal{C}_i = (\mathcal{V}(\mathcal{C}_i), \mathcal{E}(\mathcal{C}_i)) }[/math] with adjacency matrix [math]\displaystyle{ A_i }[/math].
- Cross-links between communities form bipartite graphs [math]\displaystyle{ \mathcal{B}_{ij} = (\mathcal{V}(\mathcal{C}_i), \mathcal{V}(\mathcal{C}_j), \mathcal{E}(\mathcal{B}_{ij})) }[/math].
- Each cluster is aggregated into a super-node and each bipartite into a super-edge at the next higher level. This forms a coarser graph at the parent level.
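As a small illustration of this aggregation step (not HiGen's generative model), the sketch below coarsens a toy graph given a node partition: clusters become weighted super-nodes and cross-community bipartite graphs become weighted super-edges. The helper name and the toy graph are made up.

```python
from collections import Counter

def coarsen(edges, partition):
    """Aggregate a graph into its parent level given a node partition (sketch).

    edges: iterable of (u, v) pairs; partition: dict mapping node -> cluster id.
    Each cluster becomes a super-node whose weight counts its internal edges;
    each pair of clusters with cross-links becomes a super-edge weighted by the
    number of edges in the corresponding bipartite graph.
    """
    super_nodes, super_edges = Counter(), Counter()
    for u, v in edges:
        cu, cv = partition[u], partition[v]
        if cu == cv:
            super_nodes[cu] += 1                       # edge inside community cu
        else:
            super_edges[tuple(sorted((cu, cv)))] += 1  # cross-link between communities
    return dict(super_nodes), dict(super_edges)

# Toy graph with two communities {0, 1, 2} and {3, 4}.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
partition = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}
print(coarsen(edges, partition))   # ({'A': 3, 'B': 1}, {('A', 'B'): 1})
```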
Group 20 Presentation: Gated Linear Attention Transformers with Hardware-Efficient Training
Reference
arXiv:2312.06635
Background
The paper discusses Gated Linear Attention (GLA) Transformers, addressing the computational inefficiency of traditional transformers with softmax attention. Regular transformers have a quadratic computational complexity with sequence length, which becomes extremely expensive for long sequences, and linear attention mechanism leads to an underperforming model.
Technical Contributions
It proposes using a linear kernel as an alternative to the softmax function, which allows attention to be formulated as a linear RNN with 2D hidden states. The key innovations include:
1. Introducing a data-dependent gating mechanism to improve model performance, which allows the model to forget past information adaptively
2. Developing a linear attention approach that reduces computational complexity
3. Creating a hardware-efficient training method (FLASHLINEARATTENTION Algorithm) that can handle long sequences more effectively
The main goal was to create a more efficient transformer model that can:
- Reduce computational expenses
- Maintain competitive performance across different tasks
- Handle long sequences more effectively
- Leverage modern GPU architectures for improved training and inference
The approach addresses the fundamental challenge of making transformer models more scalable and computationally efficient, particularly for tasks involving long sequences like processing books, dialogues, or complex scientific texts.
Results
The results and conclusions of the paper showed:
Performance Results:
- For the 340 million parameter model:
- Achieved competitive performance, close to Transformer performance
- Slightly better than or comparable to RetNet
- Slightly below Mamba on some tasks
- For the 1.3 billion parameter model:
- Beat most benchmarks in average accuracy
- Slightly behind Transformer++ in perplexity
- Showed impressive accuracy across tasks
Key Findings:
1. The gating mechanism is crucial for model performance
- Removing it significantly increased perplexity; a data-dependent scalar decay improved results
2. Recall-intensive tasks:
- Smaller model: the Transformer still led
- Larger model: GLA closed the performance gap considerably and is competitive with Mamba and RetNet
3. Computational efficiency:
- Higher training throughput for larger batch sizes, with a slight increase in GPU memory usage; more efficient for bigger batches
Conclusions:
- GLA is highly effective for handling long sequences
- The hardware-efficient design reduces computational costs
- The gating mechanism significantly enhances model performance
- A promising approach for making transformers more accessible and efficient, and an efficient replacement for softmax attention in Transformers
The paper suggests future research should focus on optimizing the balance between performance and efficiency.
Linear Attention
Transformers traditionally use softmax attention, which scales poorly with sequence length due to quadratic complexity. Linear attention approximates softmax with kernel-based attention mechanisms, reducing this cost.
Parallel and Recurrent Forms
- Parallel form: computes full attention using
[math]\displaystyle{ \mathbf{O} = \text{softmax}((\mathbf{QK}^\top) \odot \mathbf{M}) \mathbf{V} }[/math]
which enables efficient training with full-sequence inputs.
- Recurrent form: used during inference; processes tokens one at a time with
[math]\displaystyle{ \mathbf{o}_t = \frac{\sum_{i=1}^{t} \phi(\mathbf{q}_t) \phi(\mathbf{k}_i)^\top \mathbf{v}_i}{\sum_{i=1}^{t} \phi(\mathbf{q}_t) \phi(\mathbf{k}_i)^\top} }[/math]
- Using [math]\displaystyle{ \phi(x) = x }[/math] and removing normalization yields the simplified linear attention update:
[math]\displaystyle{ \mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{k}_t^\top \mathbf{v}_t, \quad \mathbf{o}_t = \mathbf{q}_t \mathbf{S}_t }[/math]
Chunkwise Parallel Linear Attention
The chunkwise parallel form balances between full parallelism and full recurrence, enabling faster training on long sequences.
- Splits input [math]\displaystyle{ \mathbf{X} }[/math] into chunks of length [math]\displaystyle{ C }[/math].
- Inter-chunk state update:
[math]\displaystyle{ \mathbf{S}_{[i+1]} = \mathbf{S}_{[i]} + \sum_{j=iC+1}^{(i+1)C} \mathbf{k}_j^\top \mathbf{v}_j }[/math]
- Intra-chunk output:
[math]\displaystyle{ \mathbf{O}_{[i+1]} = \mathbf{Q}_{[i+1]} \mathbf{S}_{[i]} + \left((\mathbf{Q}_{[i+1]} \mathbf{K}_{[i+1]}^\top) \odot \mathbf{M}\right) \mathbf{V}_{[i+1]} }[/math]
Benefits
- Time complexity: [math]\displaystyle{ \mathcal{O}(LCd + Ld^2) }[/math], which is sub-quadratic.
- [math]\displaystyle{ C = 1 }[/math] recovers the recurrent form; [math]\displaystyle{ C = L }[/math] recovers the parallel form.
- Efficient and scalable to long sequences with minimal performance loss.
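A minimal sketch of the (ungated) chunkwise form described above is given below; it also checks the claim that C = 1 recovers the recurrent form and C = L the parallel form. GLA additionally folds data-dependent gates/decay factors into the state update, which is omitted here, and the function name is illustrative.

```python
import torch

def chunkwise_linear_attention(Q, K, V, chunk_size):
    """Chunkwise (ungated) linear attention, following the equations above (sketch).

    Q, K, V: (L, d). The running state S accumulates k^T v across chunks
    (inter-chunk recurrence), while a causally masked QK^T handles dependencies
    inside each chunk (intra-chunk parallelism).
    """
    L, d = Q.shape
    S = torch.zeros(d, d)
    causal = torch.tril(torch.ones(chunk_size, chunk_size))
    out = []
    for start in range(0, L, chunk_size):
        q, k, v = (x[start:start + chunk_size] for x in (Q, K, V))
        m = causal[:len(q), :len(q)]
        out.append(q @ S + ((q @ k.T) * m) @ v)   # inter-chunk + intra-chunk terms
        S = S + k.T @ v                           # state update for the next chunk
    return torch.cat(out)

Q, K, V = torch.randn(3, 10, 4).unbind(0)
full = chunkwise_linear_attention(Q, K, V, chunk_size=10)   # C = L: parallel form
step = chunkwise_linear_attention(Q, K, V, chunk_size=1)    # C = 1: recurrent form
print(torch.allclose(full, step, atol=1e-5))                # both give the same output
```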
Future Work
1. Future hardware-aware optimization: balance between efficiency and performance.
2. Application to other data: the potential of applying GLA to image, video, or scientific data.
3. Test how GLA performs on larger models: due to computational limitations, the experiments were run on moderate-scale models.
Group 23 Presentation: Discrete Diffusion Modelling By Estimating the Ratios of the Data Distribution
Presented By
Chenxin Lyu, Yixuan Zeng
Paper Citation
A. Lou, C. Meng, and S. Ermon, ‘Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution’, Jun. 06, 2024, arXiv: arXiv:2310.16834. doi: 10.48550/arXiv.2310.16834.
https://arxiv.org/abs/2310.16834
Background
- Diffusion models have shown great performance for generative artificial intelligence when applied to domains with continuous data
- Diffusion models are more difficult to implement for data in the discrete domain, such as tokenized texts
- Prior attempts at applying diffusion to text generations have performed worse than autoregressive models
Paper Contributions
- Developed a method called Score Entropy Discrete Diffusion (SEDD)
- Parameterizes the diffusion process for discrete data using data distribution ratios, rather than dealing with the tokenized data directly
Discrete Diffusion Processes
- Models probability distributions over a finite discrete space [math]\displaystyle{ \mathcal{X} = \{1, \ldots, N\} }[/math], using probability mass vectors [math]\displaystyle{ p_t \in \mathbb{R}^N }[/math].
- Evolution of [math]\displaystyle{ p_t }[/math] follows a linear ODE:
[math]\displaystyle{ \frac{dp_t}{dt} = Q_t p_t,\quad p_0 \approx p_{\text{data}} }[/math]
- [math]\displaystyle{ Q_t }[/math] is a diffusion matrix with non-negative off-diagonal entries and column sums equal to 0 (mass is preserved).
- Often simplified as [math]\displaystyle{ Q_t = \sigma(t) Q }[/math], driving [math]\displaystyle{ p_t }[/math] toward a base distribution as [math]\displaystyle{ t \to \infty }[/math].
- Simulated using Euler steps with small [math]\displaystyle{ \Delta t }[/math]; transition probability:
[math]\displaystyle{ p(x_{t+\Delta t} = y \mid x_t = x) = \delta_{xy} + Q_t(y, x) \Delta t + O(\Delta t^2) }[/math]
- Time reversal: the reverse process uses another matrix [math]\displaystyle{ \overline{Q}_t }[/math] with
[math]\displaystyle{ \overline{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)} Q_t(x, y) }[/math]
and reverse ODE [math]\displaystyle{ \frac{dp_{T-t}}{dt} = \overline{Q}_{T-t} p_{T-t} }[/math].
- This connects to the concrete score, generalizing the score function [math]\displaystyle{ \nabla_x \log p_t }[/math].
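As a small illustration of the forward process (not SEDD's learning objective), the sketch below simulates Euler steps of a uniform discrete diffusion over N = 4 states using the transition probabilities above; the choice of Q, the step size, and the function name are toy assumptions.

```python
import numpy as np

def euler_step(x, Q_t, dt, rng):
    """Sample x_{t+dt} given x_t = x using the Euler transition probabilities (sketch).

    Q_t: (N, N) diffusion matrix with Q_t[y, x] the rate of jumping x -> y,
    non-negative off-diagonals and columns summing to zero. The transition
    probabilities are delta_{xy} + Q_t[y, x] * dt (valid for small dt).
    """
    N = Q_t.shape[0]
    probs = np.eye(N)[:, x] + Q_t[:, x] * dt
    return rng.choice(N, p=probs)

# Uniform diffusion over N = 4 states: jump to any other state at equal rate.
N = 4
Q = np.full((N, N), 1.0 / N) - np.eye(N)
rng = np.random.default_rng(0)
x = 2
for _ in range(5):
    x = euler_step(x, Q, dt=0.1, rng=rng)
print(x)
```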
Group 24 Presentation: Mitigating the Missing Fragmentation Problem in De Novo Peptide Sequencing With A Two-Stage Graph-Based Deep Learning Model
Paper Citation
Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x
https://www.nature.com/articles/s42256-023-00738-x#citeas
Background
- Proteins are crucial for biological functions
- Proteins are formed from peptides which are sequences of amino acids
- Mass spectrometry is used to analyze peptide sequences
- De Novo sequencing is used to piece together peptide sequences when the sequences are missing from existing established protein databases
- Deep learning has become commonly used to solve the problem of de novo peptide sequencing
- When a peptide fails to fragment in the expected manner, protein reconstruction becomes difficult due to missing data
- One error in the predicted sequence can propagate to errors throughout the entire sequence
Paper Contributions
- GraphNovo was developed to handle incomplete fragmentation
- GraphNovo-PathSearcher, instead of directly predicting the sequence, uses a path-search approach to predict the next node along the sequence path
- A graph neural network is used to find the best path through the graph generated from the mass spectrometry input
- GraphNovo-SeqFiller then completes the sequence along the path found by PathSearcher
- Since some amino acids may have been missed, SeqFiller uses a transformer to add in amino acids that were skipped by PathSearcher
- The input is a mass spectrum from mass spectrometry
- Graph construction is performed, where nodes represent possible fragments and edges represent possible amino acid connections between fragments (PathSearcher module)
- PathSearcher uses machine learning to find the optimal path on the generated graph
- SeqFiller fills in missing amino acids that PathSearcher could not recover due to missing data in the mass spectrometry input
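As a toy illustration of the graph-construction step (simplified relative to GraphNovo, which also handles fragments spanning several residues, noise, and modifications), the sketch below connects candidate prefix-mass nodes whenever their mass difference matches an amino-acid residue mass within a tolerance; the residue table, tolerance, and function name are assumptions for the example.

```python
# Monoisotopic residue masses for a few amino acids (Da); a real implementation
# would use the full table and handle modifications.
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203}

def build_spectrum_graph(node_masses, tol=0.02):
    """Connect candidate prefix-mass nodes whose difference matches a residue (sketch).

    node_masses: sorted list of candidate prefix masses inferred from the spectrum.
    An edge (i, j, label) means "appending residue `label` to the prefix at node i
    could produce the prefix at node j"; paths through this graph correspond to
    candidate peptide sequences.
    """
    edges = []
    for i, mi in enumerate(node_masses):
        for j in range(i + 1, len(node_masses)):
            diff = node_masses[j] - mi
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, aa))
    return edges

# Toy example: prefixes 0, G, GA (masses 0, 57.02, 128.06).
print(build_spectrum_graph([0.0, 57.02, 128.06]))   # [(0, 1, 'G'), (1, 2, 'A')]
```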
Peptide Sequencing in AlphaPeptDeep
- Peptide sequencing determines amino acid sequences of peptides, crucial for understanding proteins.
- Mass spectrometry (MS) is used to fragment peptides and analyze the resulting spectra.
Methods referenced in the presentation:
- Database Search & Spectral Library Search: AlphaPeptDeep improves prediction of MS spectra and retention time, boosting accuracy of both methods.
- de novo Sequencing: Enhanced spectral prediction from AlphaPeptDeep supports building peptide sequences without prior knowledge.
- AlphaPeptDeep predicts peptide properties (e.g., fragmentation patterns) to improve spectrum matching and sequence inference.
Group 47 Presentation: Jamba: A Hybrid Transformer - Mamba Language Model
Paper Citation
Lieber, O., Lenz, B., Bata, H., Cohen, G., Osin, J., Dalmedigos, I., Safahi, E., Meirom, S., Belinkov, Y., Shalev-Shwartz, S., Abend, O., Alon, R., Asida, T., Bergman, A., Glozman, R., Gokhman, M., Manevich, A., Ratner, N., Rozen, N., Shwartz, E., Zusman, M., Shoham, Y. (2024). Jamba: A Hybrid Transformer-Mamba Language Model. arXiv. https://arxiv.org/abs/2403.19887
https://doi.org/10.48550/arXiv.2403.19887
Summaries of Key Points
- Jamba is a novel hybrid language model that combines Transformer and Mamba architectures with a Mixture of Experts (MoE) module. This combination aims to leverage the strengths of both architectures: the expressiveness of Transformers and the efficiency of Mamba for long sequences.
- The main features of Jamba include hybrid Transformer-Mamba layers for memory efficiency and high throughput, a Mixture of Experts to increase capacity without excessive compute cost, and optimization for long context (up to 256,000 tokens on a single GPU).
- Jamba's hybrid architecture integrates three key components
1. Transformer layers: Apply self-attention to capture complex relationships between tokens, crucial for tasks requiring rich contextual understanding and reasoning. However, they suffer from high memory and compute costs with long sequences.
2. Mamba layers: Based on state-space models, they efficiently process long sequences without storing extensive key-value caches, making Jamba more memory-efficient for long context tasks. They maintain a hidden state to summarize prior information.
3. Mixture of Experts (MoE): Introduces sparsity by dynamically selecting a small subset of experts per token, scaling model capacity while keeping computational costs manageable. Jamba applies MoE every other layer, using 16 experts in total but with only two active per token, balancing efficiency and performance.
The architecture of a single Jamba block consists of a sequence of Transformer layers, Mamba layers, and Mixture-of-Experts layers. The structure follows a 1:7 ratio of Transformer layers to Mamba layers, and MoE layers replace every second multi-layer perceptron (MLP) layer to increase model capacity without significantly increasing the number of active parameters.
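The sketch below writes out one plausible layer layout for a Jamba block under the stated 1:7 attention-to-Mamba ratio and MoE-every-other-layer rule; the exact placement of the attention layer and the MoE layers within the block is an assumption for illustration, not taken from the paper.

```python
def jamba_block_layout(layers_per_block=8, attention_ratio=8, moe_every=2):
    """Sketch of the layer layout inside one Jamba block.

    With a 1:7 attention-to-Mamba ratio, one layer in every eight uses
    self-attention and the rest use Mamba; the MLP following every
    `moe_every`-th layer is replaced by a Mixture-of-Experts (16 experts,
    2 active per token in the paper). Exact placement here is illustrative.
    """
    layout = []
    for i in range(layers_per_block):
        mixer = "attention" if i % attention_ratio == attention_ratio - 1 else "mamba"
        mlp = "moe(16 experts, top-2)" if i % moe_every == 1 else "dense-mlp"
        layout.append((mixer, mlp))
    return layout

for layer in jamba_block_layout():
    print(layer)
```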
Performance and Benefits
You can fill in!