Prerequisites
We require that you have taken lectures on, or are otherwise familiar with, the following topics:
- Machine Learning
- Deep Learning
- Automated Machine Learning
Organization
After the kick-off meeting, each student is assigned a paper (one or two, depending on the content). Each student then works through the assigned paper(s) and prepares two presentations:
- The first presentation focuses on establishing the background and motivation for the work, together with a concise overview of the approach proposed in the paper.
- The second presentation focuses on the details of the approach, the results and takeaways from the paper, and an "add-on" described below.
Students will contribute an "add-on" related to their paper for the final report. This can include, but is not limited to: a thorough literature review, reproducing some of the experiments, profiling the inference latency of the LLMs, implementing part of the paper, or providing a Colab demo that applies the paper's method to a different LLM. Students can (e-)meet with Rhea Sukthanker for feedback and questions (e.g., to discuss a potential "add-on").
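As an illustration of what a small add-on could look like, the sketch below times end-to-end generation latency for a Hugging Face causal LLM. It is only a minimal starting point, not a required setup: it assumes the torch and transformers packages and uses gpt2 as a stand-in checkpoint; swap in the model from your paper and adjust the prompt, number of runs, and token budget as needed.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in checkpoint; replace with the LLM studied in your paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Structured pruning of large language models"
inputs = tokenizer(prompt, return_tensors="pt")

n_runs, max_new_tokens = 10, 32

with torch.no_grad():
    # Warm-up runs so one-time initialization does not distort the timing.
    for _ in range(3):
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)

    # Timed runs: measure wall-clock time for greedy generation.
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    elapsed = time.perf_counter() - start

print(f"mean latency per run: {elapsed / n_runs * 1000:.1f} ms")
print(f"mean latency per generated token: {elapsed / (n_runs * max_new_tokens) * 1000:.1f} ms")
```

For a pruning or quantization paper, running the same harness on the dense baseline and on the compressed model gives a simple latency comparison that can be reported in the add-on.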
Grading
- Presentations: 50% (two presentations of 25 min each, plus 15 min Q&A)
- Report: 30% (4 pages in AutoML Conf format, due one week after the last end-term exam)
- Add-on: 20%
List of Potential Papers
- Are Sixteen Heads Really Better than One? https://arxiv.org/pdf/1905.10650.pdf
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search https://arxiv.org/abs/2308.03290
- Minitron: Compact Language Models via Pruning and Knowledge Distillation https://arxiv.org/abs/2407.14679
- Compressing LLMs: The Truth is Rarely Pure and Never Simple https://arxiv.org/abs/2310.01382
- Wanda: A Simple and Effective Pruning Approach for Large Language Models https://arxiv.org/pdf/2306.11695
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot https://arxiv.org/abs/2301.00774
- On the Effect of Dropping Layers of Pre-trained Transformer Models https://arxiv.org/pdf/2004.03844.pdf
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned https://arxiv.org/pdf/1905.09418.pdf
- A Fast Post-Training Pruning Framework for Transformers https://proceedings.neurips.cc/paper_files/paper/2022/file/987bed997ab668f91c822a09bce3ea12-Paper-Conference.pdf
- LLM-Pruner: On the Structural Pruning of Large Language Models https://arxiv.org/pdf/2305.11627.pdf
- Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models https://arxiv.org/pdf/2310.05015.pdf
- LLM Surgeon https://arxiv.org/pdf/2312.17244.pdf
- Shortened LLaMA: A Simple Depth Pruning for Large Language Models https://arxiv.org/abs/2402.02834
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns https://arxiv.org/abs/2401.15024
- Structural Pruning of Large Language Models via Neural Architecture Search https://arxiv.org/abs/2405.02267
- Not all Layers of LLMs are Necessary during Inference https://arxiv.org/pdf/2403.02181.pdf
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect https://arxiv.org/abs/2403.03853
- FLAP: Fluctuation-Based Adaptive Structured Pruning for Large Language Models https://arxiv.org/abs/2312.11983
- Bonsai: Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes https://arxiv.org/pdf/2402.05406.pdf
- The Unreasonable Ineffectiveness of the Deeper Layers https://arxiv.org/pdf/2403.17887v1.pdf
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning https://arxiv.org/abs/2310.06694
- Netprune https://arxiv.org/pdf/2402.09773.pdf
- MiniLLM: Knowledge Distillation of Large Language Models https://arxiv.org/pdf/2306.08543
- A Survey on Knowledge Distillation of Large Language Models https://arxiv.org/abs/2402.13116
- QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs https://arxiv.org/pdf/2404.00456
- A Survey on Efficient Inference for Large Language Models https://arxiv.org/pdf/2404.14294v2
- Efficient Large Language Models: A Survey https://arxiv.org/pdf/2312.03863