Notes on Presentations
Group 1 Presentation: Universal Physics-Informed Neural Networks: Symbolic Differential Operator Discovery with Sparse Data
Paper Citation
Background
Differential equations
Examples of differential equations in physics include Newton's second law (which is an ordinary differential equation), the Navier-Stokes equations (which are partial differential equations), etc.
Existing methods of solving differential equations:
- Analytical methods, such as integration or separation of variables.
- Numerical methods, such as finite difference, finite volume, or finite elements.
- Data-driven approaches: these involve Universal Differential Equations (UDEs) and Physics-Informed Neural Networks (PINNs), which are the focus of this paper.
Introduction to PINNs
With many machine learning approaches, the goal is to approximate the solution to a DE using a feed-forward neural network trained with an MSE loss. What makes the approach physics-informed is an extra term in the loss that penalizes the model for deviating from the governing DE.
Introduction to UDEs
Here, the differential equation is expressed as a sum of two terms: the known physics-based model and an unknown neural network.
Paper Contributions
Universal Physics-Informed Neural Networks (UPINNs)
PINNs and UDEs are combined, addressing the limitations of the original methods, while sharing their benefits.
The loss function contains three terms (a minimal sketch follows this list):
- MSE loss on the observed data
- Boundary loss, if boundary conditions are provided with the problem
- PINN loss, but slightly modified
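Below is a minimal sketch of how such a combined loss could look for a toy ODE du/dt = f_known(u) + f_unknown(u), with the unknown term represented by a second neural network as in a UDE. This is an illustration only, not the authors' implementation; the networks u_net and f_net, the known term f_known, the toy observations, and the initial condition u(0) = 1 are all assumptions.

import torch

torch.manual_seed(0)

# u_net approximates the solution u(t); f_net is the unknown (UDE-style) term in the dynamics
u_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
f_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))

def f_known(u):
    return -0.5 * u  # assumed known part of the dynamics

# sparse, noisy observations (toy placeholder data)
t_obs = torch.rand(10, 1)
u_obs = torch.exp(-t_obs) + 0.01 * torch.randn(10, 1)

# collocation points where the differential-equation residual is enforced
t_col = torch.rand(100, 1, requires_grad=True)

def upinn_loss():
    # 1) data term: MSE between the network prediction and the observations
    mse = ((u_net(t_obs) - u_obs) ** 2).mean()
    # 2) boundary/initial-condition term, e.g. u(0) = 1
    boundary = ((u_net(torch.zeros(1, 1)) - 1.0) ** 2).mean()
    # 3) physics term: residual of du/dt - (f_known(u) + f_net(u)) at the collocation points
    u = u_net(t_col)
    du_dt = torch.autograd.grad(u, t_col, grad_outputs=torch.ones_like(u), create_graph=True)[0]
    physics = ((du_dt - (f_known(u) + f_net(u))) ** 2).mean()
    return mse + boundary + physics

opt = torch.optim.Adam(list(u_net.parameters()) + list(f_net.parameters()), lr=1e-3)
for step in range(500):
    opt.zero_grad()
    loss = upinn_loss()
    loss.backward()
    opt.step()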
Group 4 Presentation:
Presented by:
Editing in progress
Paper Citation
Editing in progress
Background
Editing in progress
Technical Contributions
Editing in progress
Group 8 Presentation:
Presented by:
- Nana Ye
- Xingjian Zhou
Paper Citation
T. Cai et al., “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads,” 2024, arXiv. doi: 10.48550/ARXIV.2401.10774.
Background
- As the size of LLMs grows, the speed at which they can generate tokens decreases. The bottleneck is primarily the transfer of data to/from the GPU
- Speculative Sampling is an existing solution that predicts multiple tokens in the future at once using smaller "draft" models
- Medusa instead solves this problem by adding multiple decoding heads and a tree-based attention mechanism to existing LLMs
- The paper discusses two variants: Medusa-1 and Medusa-2
Technical Contributions
Medusa-1 (a minimal sketch follows this list):
- Uses a frozen pre-trained LLM and trains extra decoding heads on top
- Each additional decoding head predicts a token K time steps in the future
- Uses a loss function whose weight decays with the number of steps the head looks into the future
- Reduces memory usage because the backbone model is only used for hidden-state extraction
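The sketch below is a minimal, illustrative version of the Medusa-1 idea: small extra heads on top of the frozen backbone's hidden states, each trained with a loss whose weight decays with how far ahead it predicts. It is not the authors' implementation; the residual-MLP head, the 0.8 decay factor, and all sizes are assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)
hidden_dim, vocab_size, num_heads = 512, 1000, 3   # toy sizes

class MedusaHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)
    def forward(self, h):
        # residual block on top of the frozen backbone's hidden state
        return self.out(h + nn.functional.silu(self.proj(h)))

heads = nn.ModuleList([MedusaHead() for _ in range(num_heads)])

def medusa1_loss(hidden_states, targets):
    """hidden_states: (batch, seq, hidden) from the frozen backbone (no gradients needed there).
    targets: (batch, seq) token ids; head k is trained to predict the token (k+1) steps ahead."""
    loss = 0.0
    for k, head in enumerate(heads, start=1):
        logits = head(hidden_states[:, :-(k + 1)])     # positions that still have a valid target
        target_k = targets[:, k + 1:]                  # the token (k+1) steps in the future
        ce = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), target_k.reshape(-1))
        loss = loss + (0.8 ** k) * ce                  # weight decays with lookahead distance (assumed factor)
    return loss

# toy usage with random hidden states standing in for the frozen backbone's output
h = torch.randn(2, 16, hidden_dim)
y = torch.randint(0, vocab_size, (2, 16))
print(medusa1_loss(h, y))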
Medusa-2:
- Fine-tunes the LLM and trains the decoding heads at the same time
- High losses were encountered when training everything jointly from the start, so the authors switched to a two-stage training process:
- Stage 1: train only the Medusa heads (similar to Medusa-1)
- Stage 2: train both the backbone model and the Medusa heads together
Group 9 Presentation:
Presented by:
- Kaiyue Ma
- Wenzhe Wang
Paper Citation
T. Dao and A. Gu, “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality,” 2024, arXiv. doi: 10.48550/ARXIV.2405.21060.
Background
- Transformers are effective, but computationally expensive and suffer from quadratic complexity
- Structured state space models (SSMs) are an alternative that scales linearly and works well for long-range modeling
- SSMs have not received the same mainstream improvements as transformers, and lack comparable support for parallelization and hardware acceleration
- Structured state space duality (SSD) bridges the gap between transformers and SSMs
Technical Contributions
- Represents SSMs as semiseparable matrices and exploits this structure for efficient matrix operations
- Uses a generalized linear attention mechanism with structured masks; a toy sketch of this SSM-attention duality follows this list
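A toy sketch of the duality is shown below (not the paper's SSD algorithm or code): a scalar-decay SSM is computed once in "attention" form as a masked matrix multiplication and once as a sequential recurrence, and the two outputs match. The shapes, the per-step decay a_t, and all names are assumptions.

import torch

torch.manual_seed(0)
T, d_state, d_head = 6, 4, 8               # toy sequence length and dimensions
a = 0.05 + 0.9 * torch.rand(T)             # per-step scalar decay in (0, 1)
B = torch.randn(T, d_state)                # input projection  (plays the role of K)
C = torch.randn(T, d_state)                # output projection (plays the role of Q)
X = torch.randn(T, d_head)                 # input sequence    (plays the role of V)

# 1-semiseparable mask: L[t, s] = a_{s+1} * ... * a_t for s <= t, else 0
log_cum = torch.cumsum(torch.log(a), dim=0)
L = torch.tril(torch.exp(log_cum[:, None] - log_cum[None, :]))

# "attention" form: Y = (L * (C B^T)) X, a masked (linear-attention-style) matrix product
Y_masked = (L * (C @ B.T)) @ X

# equivalent sequential SSM recurrence: h_t = a_t h_{t-1} + B_t x_t^T,  y_t = C_t^T h_t
h = torch.zeros(d_state, d_head)
Y_recurrent = torch.zeros(T, d_head)
for t in range(T):
    h = a[t] * h + B[t][:, None] * X[t][None, :]
    Y_recurrent[t] = C[t] @ h

print(torch.allclose(Y_masked, Y_recurrent, atol=1e-4))   # expected: True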
Group 13 Presentation:
Paper Citation
R. Li, J. Su, C. Duan, and S. Zheng, ‘Linear Attention Mechanism: An Efficient Attention for Semantic Segmentation’, Aug. 20, 2020, arXiv: arXiv:2007.14902. doi: 10.48550/arXiv.2007.14902.
https://arxiv.org/abs/2007.14902
Background
- Existing transformer models have [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity in the input length, which becomes problematic as inputs and models grow
- This limits the growth of model sizes due to computational resource constraints
- This paper focuses on an alternative to conventional dot-product attention that is more computationally efficient
- Standard attention requires the computation of [math]\displaystyle{ Q K^\top }[/math], which has [math]\displaystyle{ \mathcal{O} (n^2) }[/math] complexity
Technical Contributions
Rather than performing the full softmax computation of the transformer architecture, the authors compute
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V})_i = \frac{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j} \mathbf{v}_j}{\sum_{j=1}^{N} e^{\mathbf{q}_i^{T} \mathbf{k}_j}} = \frac{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j) \mathbf{v}_j}{\sum_{j=1}^{N} \text{sim}(\mathbf{q}_i, \mathbf{k}_j)} }[/math]
and define the transformation function as
[math]\displaystyle{ \text{sim}(\mathbf{q}_i, \mathbf{k}_j) = \phi(\mathbf{q}_i)^{T} \phi(\mathbf{k}_j) }[/math]
The authors apply a first-order Taylor series expansion and, after some rearranging and substitution, arrive at their final formula (full derivation not shown)
[math]\displaystyle{ D(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \frac{ \sum_j \mathbf{v}_{i,j} + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \left( \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)^{T} \mathbf{V} \right) }{ N + \left( \frac{\mathbf{Q}}{\lVert \mathbf{Q} \rVert_2} \right) \sum_j \left( \frac{\mathbf{K}}{\lVert \mathbf{K} \rVert_2} \right)_{i,j}^{T} } }[/math]
The authors' form of the attention mechanism can be computed in [math]\displaystyle{ \mathcal{O} (n) }[/math] complexity, reducing the computational scaling of conventional transformers and enabling larger models built on this attention mechanism. A toy comparison of the quadratic and linear forms follows.
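Below is a minimal numerical check (not the authors' code) that reordering the computation reproduces the quadratic form for the kernel sim(q, k) = 1 + (normalized dot product) used above, while only ever forming d-by-d matrices; the toy sizes N and d are assumptions.

import torch

torch.manual_seed(0)
N, d = 256, 32                                  # toy sequence length and feature dimension
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)

Qn = Q / Q.norm(dim=-1, keepdim=True)           # Q / ||Q||_2 (row-wise)
Kn = K / K.norm(dim=-1, keepdim=True)           # K / ||K||_2 (row-wise)

# quadratic form: weights[i, j] = 1 + q_i^T k_j, normalized over j -- builds an N x N matrix
W = 1.0 + Qn @ Kn.T
D_quadratic = (W @ V) / W.sum(dim=-1, keepdim=True)

# linear form: reorder the computation so the N x N matrix is never built
KtV = Kn.T @ V                                  # (d, d), costs O(N d^2)
numerator = V.sum(dim=0) + Qn @ KtV             # (N, d)
denominator = N + Qn @ Kn.sum(dim=0)            # (N,)
D_linear = numerator / denominator[:, None]

print(torch.allclose(D_quadratic, D_linear, atol=1e-4))   # expected: True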
Model Performance Evaluation
Fill me in!
Group 23 Presentation: Discrete Diffusion Modelling By Estimating the Ratios of the Data Distribution
Paper Citation
A. Lou, C. Meng, and S. Ermon, ‘Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution’, Jun. 06, 2024, arXiv: arXiv:2310.16834. doi: 10.48550/arXiv.2310.16834.
https://arxiv.org/abs/2310.16834
Background
- Diffusion models have shown great performance in generative artificial intelligence when applied to domains with continuous data
- Diffusion models are more difficult to apply to data in the discrete domain, such as tokenized text
- Prior attempts at applying diffusion to text generation have performed worse than autoregressive models
Paper Contributions
- Developed a method called Score Entropy Discrete Diffusion (SEDD)
- Parameterizes the diffusion process for discrete data using ratios of the data distribution, rather than modeling the tokenized data directly (a toy sketch of the score-entropy idea follows this list)
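Below is a toy sketch of the score-entropy idea on a static categorical distribution, with no noising process: a table of positive scores is trained to match the true ratios p(y)/p(x). It is only an illustration of the objective's shape, not the SEDD training setup; the vocabulary size, learning rate, and parameterization are assumptions.

import torch

torch.manual_seed(0)
V = 5                                            # toy "vocabulary" of 5 states
p = torch.softmax(torch.randn(V), dim=0)         # ground-truth data distribution

logits = torch.zeros(V, V, requires_grad=True)   # score model: s_theta(x)[y] = exp(logits[x, y])
opt = torch.optim.Adam([logits], lr=0.1)

def K(a):
    # normalizing term a*(log a - 1), which makes the score entropy non-negative
    return a * (torch.log(a) - 1.0)

off_diag = ~torch.eye(V, dtype=torch.bool)
ratios = p[None, :] / p[:, None]                 # true ratios p(y) / p(x)

for step in range(500):
    s = torch.exp(logits)                        # positive ratio estimates
    # score entropy: s - ratio*log(s) + K(ratio), summed over y != x, weighted by p(x)
    per_pair = s - ratios * torch.log(s) + K(ratios)
    loss = (p[:, None] * per_pair)[off_diag].sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# after training, the learned scores approximate the true ratios
print((torch.exp(logits.detach()) - ratios).abs().max())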
Group 24 Presentation: Mitigating the Missing Fragmentation Problem in De Novo Peptide Sequencing With A Two-Stage Graph-Based Deep Learning Model
Paper Citation
Mao, Z., Zhang, R., Xin, L. et al. Mitigating the missing-fragmentation problem in de novo peptide sequencing with a two-stage graph-based deep learning model. Nat Mach Intell 5, 1250–1260 (2023). https://doi.org/10.1038/s42256-023-00738-x
https://www.nature.com/articles/s42256-023-00738-x#citeas
Background
- Proteins are crucial for biological functions
- Proteins are formed from peptides, which are sequences of amino acids
- Mass spectrometry is used to analyze peptide sequences
- De novo sequencing is used to piece together peptide sequences when they are missing from established protein databases
- Deep learning has become commonly applied to the problem of de novo peptide sequencing
- When a peptide fails to fragment in the expected manner, the resulting missing data can make reconstruction of the sequence difficult
- One error in the sequence can propagate to errors throughout the entire sequence
Paper Contributions
- GraphNovo was developed to handle incomplete fragmentation information
- GraphNovo-PathSearcher, instead of directly predicting residues, performs a path search to predict the next fragment node in the sequence
- A graph neural network is used to find the best path through the graph generated from the mass spectrometry input
- GraphNovo-SeqFiller then completes the sequence returned by PathSearcher
- Since some fragments or amino acids may have been missed, SeqFiller uses a transformer to add in amino acids that were not captured by PathSearcher
- The input is a mass spectrum from mass spectrometry
- Graph construction is performed where nodes represent possible fragments and edges represent possible peptide segments between them (PathSearcher module); a toy sketch of this step follows the list
- PathSearcher uses machine learning to find the optimal path on the generated graph
- SeqFiller fills in missing amino acids that may not have been included by the PathSearcher module due to missing peaks in the mass spectrometry input
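Below is a toy sketch of the graph-construction idea: nodes are candidate fragment (prefix) masses from the spectrum, and an edge is added when a mass difference matches an amino-acid residue mass within a tolerance. This is only illustrative; GraphNovo's actual construction is more involved (e.g., handling noise peaks and gaps spanning several residues), and the masses, tolerance, and toy spectrum here are assumptions.

# monoisotopic residue masses (Da) for a few amino acids
AA_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841}

def build_graph(peak_masses, tol=0.02):
    """Return edges (i, j, amino_acid) between sorted candidate fragment masses."""
    nodes = sorted(peak_masses)
    edges = []
    for i, mi in enumerate(nodes):
        for j in range(i + 1, len(nodes)):
            diff = nodes[j] - mi
            for aa, m in AA_MASS.items():
                if abs(diff - m) <= tol:
                    edges.append((i, j, aa))
    return nodes, edges

# toy "spectrum": prefix masses of the peptide G-A-S (0, G, G+A, G+A+S)
peaks = [0.0, 57.021, 128.058, 215.090]
nodes, edges = build_graph(peaks)
print(edges)   # [(0, 1, 'G'), (1, 2, 'A'), (2, 3, 'S')]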