Notes on Presentations
Group 1 Presentation: Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data
Paper Citation
Background
Differential equations
Examples of differential equations in physics include Newton's second law (which is an ordinary differential equation), the Navier-Stokes equations (which are partial differential equations), etc.
Existing methods of solving differential equations:
- Analytical methods, such as integration or separation of variables.
- Numerical methods, such as finite difference, finite volume, or finite elements.
- Data-driven approaches: these involve Universal Differential Equations (UDEs) and Physics-Informed Neural Networks (PINNs), which are the focus of this paper.
Introduction to PINNs
In many machine learning approaches, the goal is to approximate the solution of a DE with a feed-forward neural network optimized with an MSE loss. The key difference that makes it physics-informed is an extra loss term that penalizes the model for deviating from the governing DE.
Introduction to UDEs
Here, the differential equation is expressed as a sum of two terms: the known physics-based model and an unknown neural network.
Paper Contributions
Universal Physics-Informed Neural Networks (UPINNs)
PINNs and UDEs are combined, addressing the limitations of the original methods, while sharing their benefits.
The loss function contains three terms, combined as in the sketch below:
- MSE on the observed data
- Boundary loss, applied when boundary conditions are provided with the problem
- PINN loss, slightly modified to include the unknown neural-network term
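A minimal PyTorch-style sketch of how these three terms could be combined; the function names, arguments, and the way the unknown dynamics enter the residual are assumptions for illustration, not the paper's code:

import torch

def upinn_loss(u_net, f_net, t_data, u_data, t_bc, u_bc, t_col, known_rhs):
    mse = ((u_net(t_data) - u_data) ** 2).mean()                # data (MSE) term
    bc = ((u_net(t_bc) - u_bc) ** 2).mean()                     # boundary term, if BCs are given
    t = t_col.clone().requires_grad_(True)
    u = u_net(t)
    du_dt = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    residual = du_dt - (known_rhs(t, u) + f_net(t, u))          # modified PINN residual: known physics + NN term
    return mse + bc + (residual ** 2).mean()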
Lotka-Volterra System
They first experimented with the UPINN on the Lotka-Volterra system of differential equations, which are used to model predator-prey dynamics:
[math]\displaystyle{ \frac{dx}{dt} = \alpha x - \beta xy }[/math]
[math]\displaystyle{ \frac{dy}{dt} = -\delta y + \gamma xy }[/math]
The UDE and PINN were individually tested on two scenarios: sparse data (where there are very few input data points) and noisy data. Alone, each model did not do very well, especially when the data was very sparse or very noisy. When the UPINN was used, the solution was quite good, even with high sparsity or noise.
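For concreteness, a sketch of the Lotka-Volterra right-hand side in UDE form, where the interaction terms are treated as the unknown part; the interaction_net callable and all parameter values below are assumptions for illustration:

from scipy.integrate import solve_ivp

alpha, delta = 1.1, 0.4                                  # assumed known growth/death rates

def rhs(t, z, interaction_net):
    x, y = z
    ix, iy = interaction_net(x, y)                       # learned surrogate for (-beta*x*y, gamma*x*y)
    return [alpha * x + ix, -delta * y + iy]

# With the true interaction terms plugged in (beta=0.4, gamma=0.1), the system integrates directly:
sol = solve_ivp(rhs, (0, 25), [10.0, 5.0],
                args=(lambda x, y: (-0.4 * x * y, 0.1 * x * y),), rtol=1e-8)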
Viscous Burgers' Equation
Their next experiment used Burgers' equation, a model from fluid dynamics.
[math]\displaystyle{ \frac{\partial u}{\partial t} = -u \frac{\partial u}{\partial x} + \nu \frac{\partial^2 u}{\partial x^2} }[/math]
Group 2 Presentation: EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Presented by:
Editing in progress
Paper Citation
Editing in progress
Background
In this paper, we are looking at Large Language Models (LLMs). Autoregressive decoding in LLMs refers to generating text one token at a time, with the model basing each prediction on the tokens that came before it.
Speculative Sampling
Speculative sampling is a technique meant to reduce the computational cost and runtime of autoregressive decoding. The process consists of two main parts:
- Draft stage: A small, fast model suggests some tokens.
- Verification stage: In parallel, the large main LLM verifies these tokens and selects the best ones.
How do we choose a draft model that functions like the big LLM, but faster? One approach is to use a reduced version of the main LLM, but this doesn't work when there is no small version available.
Technical Contributions
Extrapolation Algorithm for Greater Language Model Efficiency (EAGLE)
Before predicting each token, the draft model extrapolates the base model's features (its second-to-top-layer hidden states) and also conditions on the token sequence advanced by one position, which resolves the sampling uncertainty in those features.
One advantage of the EAGLE method is that only a single decoder layer needs to be trained rather than an entire draft model. This makes the training process extremely efficient in terms of runtime/computational cost.
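A hypothetical sketch of this idea: a single transformer layer drafting at the feature level, taking the base model's hidden features together with the embeddings of the tokens shifted one step ahead. The fusion layer, dimensions, and names are assumptions, not the paper's implementation:

import torch
import torch.nn as nn

class DraftLayer(nn.Module):
    def __init__(self, d_model=4096, n_heads=32):
        super().__init__()
        self.fuse = nn.Linear(2 * d_model, d_model)       # combine features with shifted token embeddings
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, feats, shifted_tok_emb):
        # feats: base model's second-to-top-layer features; shifted_tok_emb: embeddings advanced by one step
        x = self.fuse(torch.cat([feats, shifted_tok_emb], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        return self.layer(x, src_mask=mask)               # drafted features, passed to the frozen LM head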
Group 4 Presentation:
Presented by:
- Karolina Suszek
- Negin Amou
- Muhammad Azeem
Paper Citation
Z. Li et al., “Learning spatiotemporal dynamics with a pretrained generative model,” Nature Machine Intelligence, vol. 6, no. 12. Springer Science and Business Media LLC, pp. 1566–1579, Dec. 06, 2024. doi: 10.1038/s42256-024-00938-z.
Background
- Spatiotemporal dynamics: how the state of a physical system varies with space and time
- Real datasets often contain sparse measurements, since only a limited number of sensors are available; a method is needed to convert sparse measurements into a full spatiotemporal field.
- Existing solutions learn a direct input-to-output mapping and ignore missing data, which reduces the model's ability to generalize.
- The paper proposes the Sparse-Sensor-Assisted Score-Based Generative Model (S3GM), which uses unlabeled data during training and can reconstruct incomplete data after training, making accurate predictions even when little information is available.
- Key idea: learn the probability distribution of spatiotemporal data with a score-based generative model and refine the samples via stochastic sampling.
Technical Contributions
Core Components:
- Pre-training stage: learns the joint probability distribution of the data
- Generating stage: uses a stochastic differential equation to refine and generate full-field predictions
- Refinement mechanism: ensures alignment with observations and enforces sequence consistency (a generation-stage sketch follows)
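A rough, hypothetical sketch of what the generating/refinement stages could look like: a generic reverse-diffusion step followed by a simple data-consistency correction toward the sparse observations. This is an illustrative simplification, not the paper's sampler; score_net, the noise schedule, and the correction weight lam are all assumptions:

import torch

def sample_with_sparse_obs(score_net, obs, mask, shape, n_steps=500, lam=1.0):
    # obs: sparse sensor measurements; mask: 1 where a sensor exists, 0 elsewhere
    x = torch.randn(shape)                               # start from pure noise
    ts = torch.linspace(1.0, 1e-3, n_steps)
    for i in range(n_steps - 1):
        t, dt = ts[i], ts[i + 1] - ts[i]                 # dt < 0: integrating the reverse-time SDE
        score = score_net(x, t)                          # learned approximation of grad_x log p_t(x)
        x = x + (-0.5 * x - score) * dt + (-dt).sqrt() * torch.randn_like(x)
        x = x - lam * mask * (x - obs)                   # nudge the sample toward the observed sensors
    return x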
Group 6 Presentation:
Presented by:
- Pingchu Zhang
- Zhiyang Cheng
Paper Citation
Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Koyejo, S., Hashimoto, T., & Guestrin, C. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. arXiv. https://doi.org/10.48550/arXiv.2407.04620
Background
For modern RNNs, long-context performance is limited by the expressive power of their fixed-size hidden state. Hence, the authors introduce test-time training (TTT) layers.
Technical Contributions
- Introduce TTT layers, where the hidden state is a model and the update rule is self-supervised learning, offering a new research direction.
- TTT-Linear, a simple implementation of TTT layers, outperforms Transformers and Mamba in evaluations.
- Improve the hardware efficiency of TTT layers through mini-batch TTT and the dual form, making TTT-Linear already a practical building block for LLMs.
Methodology
The key idea is to make the hidden state itself a model with weights, and the update rule a gradient step on the self-supervised loss. Then updating the hidden state on a test sequence is equivalent to training the model at test time.
Training a network with TTT layers
- The larger network is trained as the outer loop, while the weights inside each TTT layer are updated as the inner loop.
- TTT layers can replace RNN or self-attention layers in any network architecture. Training a network with TTT layers also works the same way as training any other language model.
Learning a self-supervised task for TTT
Add some outer-loop parameters to make this task learnable.
The input [math]\displaystyle{ x_t }[/math] is transformed with a learnable matrix [math]\displaystyle{ \theta_K }[/math] to create a training view [math]\displaystyle{ \tilde x_t = \theta_K x_t }[/math].
The reconstruction label is another low-rank projection [math]\displaystyle{ \theta_V x_t }[/math], which can differ from the input. We can also create a test view [math]\displaystyle{ \theta_Q x_t }[/math].
The new self-supervised loss is [math]\displaystyle{ \ell(W; x_t) = \|f(\theta_K x_t; W)-\theta_V x_t\|^2 }[/math], and the output rule is modified to [math]\displaystyle{ z_t = f(\theta_Q x_t; W_t) }[/math].
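A minimal sketch of this inner-loop update for a linear model f(x; W) = W x; the learning rate and shapes are assumed for illustration, and the projections are learned in the outer loop:

import torch

def ttt_linear_step(W, x_t, theta_K, theta_V, theta_Q, eta=0.1):
    train_view = theta_K @ x_t                  # training view of the input
    label_view = theta_V @ x_t                  # reconstruction target
    W = W.detach().requires_grad_(True)
    loss = ((W @ train_view - label_view) ** 2).sum()    # l(W; x_t)
    grad_W, = torch.autograd.grad(loss, W)
    W_new = W - eta * grad_W                    # one gradient step = hidden-state update
    z_t = W_new @ (theta_Q @ x_t)               # output rule with the test view
    return W_new.detach(), z_t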
Group 7 Presentation:
Presented by:
- Jonathan Gallagher
- Mariya Anashkina
Paper Citation
A. V. Makkuva et al., “Attention with Markov: A Framework for Principled Analysis of Transformers via Markov Chains,” 2024, arXiv. doi: 10.48550/ARXIV.2402.04161.
Background
- Markov chains: probability models in which the next state depends only on the current state (or, for higher-order chains, on a fixed number of recent states).
- Language can be modeled as a Markov process, and token predictions from transformers closely follow a Markov chain.
- The paper introduces a framework to systematically analyze, through the lens of Markov chains, how transformers learn to model sequential data.
- The objective is to explore how a transformer succeeds or struggles on first-order Markov chains, a distribution that needs only one step of memory (a sampling sketch follows this list).
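For intuition, a small sketch of the data model: a binary first-order Markov source with switching probabilities p and q. The notation follows the presentation; the helper itself is just illustrative:

import numpy as np

def sample_markov(p, q, n, seed=0):
    # P(x_{t+1}=1 | x_t=0) = p,  P(x_{t+1}=0 | x_t=1) = q; stationary distribution (q, p)/(p+q)
    rng = np.random.default_rng(seed)
    x = [int(rng.random() < p / (p + q))]                # draw x_1 from the stationary distribution
    for _ in range(n - 1):
        flip = rng.random() < (p if x[-1] == 0 else q)
        x.append(x[-1] ^ int(flip))
    return np.array(x)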
Theoretical Results
Theorem 1 (Global minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] Then for all [math]\displaystyle{ (p,q), }[/math] there exists a [math]\displaystyle{ \theta_{\star}\in\mathbb{R}^{D-d} }[/math] with an explicit construction such that it is a global minimum for the population loss [math]\displaystyle{ L(\cdot) }[/math].
Theorem 2 (Bad local minimum). Let the input sequence be [math]\displaystyle{ \{x_n\}_{n=1}^N \sim \bigl(\pi(p,q), P(p,q)\bigr) }[/math] for some fixed [math]\displaystyle{ (p,q)\in(0,1)^2. }[/math] If [math]\displaystyle{ p+q\gt 1, }[/math] there exists an explicit [math]\displaystyle{ \theta_{\pi}\in\mathbb{R}^{D-d} }[/math] such that it is a bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math].
Theorem 3 (Global minimum). Consider the same setting as in Thm. 1. Then for all [math]\displaystyle{ (p, q), }[/math] if [math]\displaystyle{ \theta_{\star} = (e_{\star} = a_{\star}, \dots, b_{\star}) \in \mathbb{R}^{D-d} }[/math] is a global minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario, then its extension [math]\displaystyle{ \bar{\theta}_{\star} = (\bar{e}_{\star}, \bar{a}_{\star}) \in \mathbb{R}^D }[/math] is also a global minimum for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\star} }[/math] satisfies the same properties (ii)–(iv) as in Thm. 1.
Theorem 4 (Saddle point). Consider the same setting as in Thm. 3. For [math]\displaystyle{ p + q \gt 1, }[/math] let [math]\displaystyle{ \theta_{\pi} = (e_{\pi} = a_{\pi}, \dots, b_{\pi}) \in \mathbb{R}^{D-d} }[/math] be the corresponding bad local minimum for the loss [math]\displaystyle{ L(\cdot) }[/math] in the weight-tied scenario. Then its extension [math]\displaystyle{ \bar{\theta}_{\pi} = (\bar{e}_{\pi}, \bar{a}_{\pi}) \in \mathbb{R}^D }[/math] is a saddle point for [math]\displaystyle{ L(\cdot) }[/math] in [math]\displaystyle{ \mathbb{R}^D }[/math] in the non-weight-tied case. Further, [math]\displaystyle{ \bar{\theta}_{\pi} }[/math] satisfies the same properties (ii)–(iv) as in Thm. 2.
Group 8 Presentation:
Presented by:
- Nana Ye
- Xingjian Zhou
Paper Citation
T. Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” 2024, arXiv. doi: 10.48550/ARXIV.2401.10774.
https://arxiv.org/abs/2401.10774
Background
- As the size of LLMs grows, the speed at which they can generate tokens decreases; the bottleneck is primarily the transfer of data to/from the GPU
- Speculative sampling is an existing solution that predicts multiple future tokens at once using smaller "draft" models
- Medusa instead solves this problem by adding multiple decoding heads and a tree-based attention mechanism to existing LLMs
- The paper discusses the implementations of Medusa-1 and Medusa-2
Technical Contributions
Medusa 1:
- Uses a frozen pre-trained LLM and trains extra decoding heads on top
- Each additional decoding head predicts a token K time steps in the future
- Uses a loss function whose weight is scaled down according to the number of steps into the future
- Reduces memory usage because the backbone model is only used for hidden-state extraction
- In simple terms, Medusa adds additional linear layers on top of the transformer's last hidden state, which are trained to predict tokens at future positions rather than only the next token as a conventional autoregressive transformer does (see the sketch below).
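A hypothetical sketch of such heads: small residual blocks over the frozen backbone's last hidden state, one head per future position. The block structure, sizes, and names are assumptions rather than the paper's exact code:

import torch
import torch.nn as nn

class MedusaHead(nn.Module):
    # one head per future position k; the backbone's last hidden state is its only input
    def __init__(self, d_model, vocab_size):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, h):
        return self.out(h + torch.relu(self.proj(h)))    # logits for the token k steps ahead

heads = nn.ModuleList([MedusaHead(d_model=4096, vocab_size=32000) for _ in range(4)])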
Medusa 2:
- Fine tunes the LLM and trains the decoding heads at the same time.
- Encountered problems with high losses, so the authors switched to a two-stage training process:
- Stage 1: train only the Medusa heads (similar to Medusa-1)
- Stage 2: train both the backbone model and the Medusa heads together
Tree Attention
- Tree attention enables the heads predicting later tokens to include the additional context that may have been created by earlier Medusa heads in the pipeline
- This tree of candidates is processed in a single pass rather than autoregressively, however
- The top predictions from each head are fed into the tree structure as candidate tokens
- An attention mask ensures that each candidate's prediction attends only to the prior tokens along its branch, not to future tokens past the one being decoded
- Multiple future candidate tokens can be predicted with context-aware attention simultaneously
Self Distillation
- A dataset of prompts relevant to the target model is created
- The full large language model generates outputs for these prompts in the usual autoregressive manner; these prompt-output pairs form the training dataset for the self-distillation step
- Medusa Heads are trained on the generated training dataset
Tree Construction
Branches with a low probability of containing the next token are pruned from the tree of candidate tokens used in tree attention, which reduces the computational cost of Medusa-2.
Group 9 Presentation:
Presented by:
- Kaiyue Ma
- Wenzhe Wang
Paper Citation
T. Dao and A. Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality,” 2024, arXiv. doi: 10.48550/ARXIV.2405.21060.
Background
- Transformers are effective, but computationally expensive and suffer from quadratic complexity
- Structured state space models (SSMs) are an alternative that scales linearly instead and works for long-range modeling
- SSMs have not received the same mainstream improvements as transformers, and lack support for parallelization and hardware acceleration
- Structured state space duality (SSD) bridges the gaps between transformers and SSMs
Technical Contributions
- Represents SSMs as semiseparable matrices, which enables efficient matrix-based computation
- Uses a generalized linear attention mechanism with structured masks (an illustrative sketch of the duality follows)
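The duality can be illustrated with a toy scalar-decay SSM: the same outputs can be computed either with a recurrence or as "attention" with a structured lower-triangular (semiseparable) mask. This is a simplified illustration of the idea, not the paper's SSD algorithm:

import torch

def ssm_recurrent(a, B, C, x):
    # h_t = a * h_{t-1} + B_t x_t ;  y_t = C_t^T h_t   (a in (0,1), B, C of shape (T, d), x of shape (T,))
    h, ys = torch.zeros_like(B[0]), []
    for t in range(len(x)):
        h = a * h + B[t] * x[t]
        ys.append((C[t] * h).sum())
    return torch.stack(ys)

def ssm_as_masked_attention(a, B, C, x):
    T = len(x)
    L = torch.tril(a ** (torch.arange(T)[:, None] - torch.arange(T)[None, :]).float())  # semiseparable mask
    scores = (C @ B.T) * L                       # "attention" matrix with the structured mask
    return scores @ x

# Both paths compute the same outputs:
# torch.allclose(ssm_recurrent(a, B, C, x), ssm_as_masked_attention(a, B, C, x))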
Group 10 Presentation: Accelerating Large Language Model Decoding with Speculative Sampling
Paper Citation
C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, ‘Accelerating Large Language Model Decoding with Speculative Sampling’, Feb. 02, 2023, arXiv: arXiv:2302.01318. doi: 10.48550/arXiv.2302.01318.
https://arxiv.org/abs/2302.01318
Background
- Traditional autoregressive decoding for large language models is computationally expensive, as the entire model has to run for each additional token that is generated
- Transformer neural networks are typically memory-bandwidth limited; quantization and distillation have been used in the past to produce smaller models and improve LLM performance
Technical Contributions
- Speculative sampling was developed to increase the speed of LLM decoding without meaningfully reducing its performance in predicting future tokens compared to the base model
- Generates a draft of k future tokens using a smaller model
- Score the proposed tokens using the base model
- The authors develop a modified rejection sampling scheme that preserves the target model's output distribution (a sketch of the acceptance rule follows)
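A sketch of the acceptance rule for a single drafted token, where p and q are the target- and draft-model probability vectors at that position. Variable names are ours; the logic follows the standard speculative-sampling recipe:

import numpy as np

def accept_or_resample(token, p, q, seed=0):
    # p: target-model probabilities, q: draft-model probabilities (both over the vocabulary)
    rng = np.random.default_rng(seed)
    if rng.random() < min(1.0, p[token] / q[token]):
        return token                                     # accept the drafted token
    residual = np.maximum(p - q, 0.0)                    # otherwise resample from the residual distribution
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)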
Group 12 Presentation: EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Paper Citation
Y. Li, F. Wei, C. Zhang, and H. Zhang, ‘EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees’, Jun. 30, 2024, arXiv: arXiv:2406.16858. doi: 10.48550/arXiv.2406.16858.
https://arxiv.org/abs/2406.16858
Background
- LLMs to date have shown great performance, but they are slow and computationally expensive
- Speculative sampling: a small model generates candidate tokens, and the large model then evaluates those tokens, reducing the number of times the expensive computations of the large model have to occur
- EAGLE 1 used a draft tree to improve the performance of speculative execution, and this work builds on the authors' prior work
Paper Contributions
- EAGLE 1 does not directly predict tokens; instead, it maps tokens to features, predicts the next feature, and then predicts tokens from those features
- This work was shown to improve upon previous speculative execution methods
- EAGLE uses a tree structure to propose alternative tokens when a speculative sampling draft token is rejected by the full model
- EAGLE 2 noted that token acceptance is dependent on context and position: the first token has a high acceptance rate, and later tokens have lower acceptance rates
- EAGLE 2 uses a dynamic draft tree that incorporates "tree attention" to bring context information into the selection of the next candidate tokens, increasing their acceptance rate, since acceptance depends on context and not only on position (a sketch of confidence-driven expansion follows)
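A hypothetical sketch of confidence-driven expansion: candidate branches are scored by the product of draft-model confidences along the path, and only the highest-scoring branches are kept. The data structures and names are assumptions, not the paper's implementation:

import heapq

def expand_draft_tree(paths, draft_probs, top_m=8):
    # paths: list of (score, token_list); draft_probs: maps a token prefix (tuple) to {token: prob}
    candidates = []
    for score, toks in paths:
        for tok, prob in draft_probs[tuple(toks)].items():
            candidates.append((score * prob, toks + [tok]))   # path score = product of confidences
    return heapq.nlargest(top_m, candidates, key=lambda c: c[0])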
Group 13 Presentation:
Paper Citation
R. Li, J. Su, C. Duan, and S. Zheng, ‘Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation’, Aug. 20, 2020, arXiv: arXiv:2007.14902. doi: 10.48550/arXiv.2007.14902.
https://arxiv.org/abs/2007.14902
Background
- Existing transformer models have [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity in the sequence length, which is problematic as models and inputs grow
- This limits the growth of model sizes due to computational resource constraints
- This paper focused on an alternative method to conventional dot product attention that is more computationally efficient
- Standard attention requires the computation of [math]\displaystyle{ Q K^\top }[/math], which has [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity
Technical Contributions
Rather than doing the full computation for the softmax in the transformer architecture, the authors instead compute
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j} \mathbf{v}_j}{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j}} = \frac{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j) \mathbf{v}_j}{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j)} }[/math]
and define the transformation function as
[math]\displaystyle{ \text{sim}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i)^{T} \phi(\mathbf{k}_j) }[/math]
The authors apply a first-order Taylor series expansion, and after some rearranging and substitution arrive at their final formula (full derivation not shown)
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{ \sum_j \mathbf{v}_{i,j} + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \left( \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)^{T} \mathbf{V} \right) }{ N + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \sum_j \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)_{i,j}^{T} } }[/math]
The authors' form of the attention mechanism can be evaluated in [math]\displaystyle{ \mathcal{O} (n) }[/math] complexity, removing the quadratic scaling of conventional transformers and enabling larger models built on this attention mechanism
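A numpy sketch of the resulting O(n) evaluation: the key-value summary and the key sums are computed once and reused for every query, with the row-wise normalization following the formula above (the epsilon is ours, added for numerical safety):

import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    Qn = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + eps)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + eps)
    kv = Kn.T @ V                        # (d, d_v), computed once: O(n * d * d_v)
    k_sum = Kn.sum(axis=0)               # (d,)
    num = V.sum(axis=0) + Qn @ kv        # (n, d_v): numerator of the formula above
    den = K.shape[0] + Qn @ k_sum        # (n,): denominator
    return num / den[:, None]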
Model Performance Evaluation
Fill me in!
Group 18 Presentation
Presenters
Editing in progress
Paper Citation
Editing in progress
Background
Editing in progress
Technical Contributions
Editing in progress
Group 23 Presentation: Discrete Diffusion Modelling By Estimating the Ratios of the Data Distribution
Paper Citation
A. Lou, C. Meng, and S. Ermon, ‘Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution’, Jun. 06, 2024, arXiv: arXiv:2310.16834. doi: 10.48550/arXiv.2310.16834.
https://arxiv.org/abs/2310.16834
Background
- Diffusion models have shown great performance in generative artificial intelligence when applied to domains with continuous data
- Diffusion models are more difficult to apply to data in the discrete domain, such as tokenized text
- Prior attempts at applying diffusion to text generations have performed worse than autoregressive models
Paper Contributions
- Developed a method called Score Entropy Discrete Diffusion (SEDD)
- Parameterizes the diffusion process for discrete data using data distribution ratios, rather than dealing with the tokenized data directly
Group 24 Presentation: Mitigating the Missing Fragmentation Problem in De Novo Peptide Sequencing With A Two-Stage Graph-Based Deep Learning Model
Paper Citation
Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x
https://www.nature.com/articles/s42256-023-00738-x#citeas
Background
- Proteins are crucial for biological functions
- Proteins are formed from peptides which are sequences of amino acids
- Mass spectrometry is used to analyze peptide sequences
- De Novo sequencing is used to piece together peptide sequences when the sequences are missing from existing established protein databases
- Deep learning has become commonly implemented to solve the problem of de novo peptide sequencing
- When a peptide fails to fragment in the expected manner, it can make protein reconstruction difficult due to missing data
- One error in the protein can propagate to errors throughout the entire sequence
Paper Contributions
- Graph Novo was developed to handle incomplete segments
- GraphNovo-PathSearcher, instead of predicting directly, uses a path-search method to predict the next peptide fragment in the sequence
- A graph neural network is used to find the best path from the graph generated from the mass spectrometry input
- GraphNovo-SeqFiller then fills in the parts of the sequence that PathSearcher could not resolve.
- Since some peptides/amino acids may have been missed, SeqFiller uses a transformer to add in amino acids that are missing from the PathSearcher output
- Input is mass spectrum from mass spectrometry
- Graph construction is done where nodes represent possible fragments, and edges represent possible peptides (PathSearcher module)
- PathSearcher uses machine learning to find the optimal path on the generated graph
- SeqFiller fills in missing amino acids that may have not been included in the PathSearcher module due to lacking data from the mass spectrometry inputs