Prerequisites
We require that you have taken lectures on, or are otherwise familiar with, the following topics:
- Machine Learning
- Deep Learning
- Automated Machine Learning
Organization
After the kick-off meeting, each student is assigned a paper (one or two, depending on the content). Each student then works through the assigned paper(s) and prepares two presentations:
- The first presentation focuses on establishing the background and motivation for the work, together with a concise overview of the approach proposed in the paper.
- The second presentation focuses on the details of the approach, the results and takeaways from the paper, and an "add-on" described below.
Students will contribute an "add-on" related to their paper for the final report. This can include, but is not limited to: a thorough literature review, reproducing some of the experiments, profiling the inference latency of the LLMs, implementing part of the paper, or providing a Colab demo that applies the paper's method to a different LLM. Students can (e-)meet with Rhea Sukthanker for feedback and questions (e.g., to discuss a potential "add-on").
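As an illustration of what a small add-on could look like, the sketch below times end-to-end generation latency for a Hugging Face causal LLM. It is only a minimal starting point, not a required setup: it assumes the torch and transformers packages and uses gpt2 as a stand-in checkpoint; swap in the model from your paper and adjust the prompt, number of runs, and token budget as needed.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # stand-in checkpoint; replace with the LLM studied in your paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

prompt = "Structured pruning of large language models"
inputs = tokenizer(prompt, return_tensors="pt")

n_runs, max_new_tokens = 10, 32

with torch.no_grad():
    # Warm-up runs so one-time initialization does not distort the timing.
    for _ in range(3):
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)

    # Timed runs: measure wall-clock time for greedy generation.
    start = time.perf_counter()
    for _ in range(n_runs):
        model.generate(**inputs, max_new_tokens=max_new_tokens,
                       do_sample=False, pad_token_id=tokenizer.eos_token_id)
    elapsed = time.perf_counter() - start

print(f"mean latency per run: {elapsed / n_runs * 1000:.1f} ms")
print(f"mean latency per generated token: {elapsed / (n_runs * max_new_tokens) * 1000:.1f} ms")
```

For a pruning or quantization paper, running the same harness on the dense baseline and on the compressed model gives a simple latency comparison that can be reported in the add-on.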
Grading
- Presentations: 50% (two presentations of 25 min each, plus 15 min Q&A)
- Report: 30% (4 pages in AutoML Conf format, due one week after the last end-term exam)
- Add-on: 20%
List of Potential Papers
- Are Sixteen Heads Really Better than One? https://arxiv.org/pdf/1905.10650.pdf
- FLIQS: One-Shot Mixed-Precision Floating-Point and Integer Quantization Search https://arxiv.org/abs/2308.03290
- Minitron: Compact Language Models via Pruning and Knowledge Distillation https://arxiv.org/abs/2407.14679
- Compressing LLMs: The Truth is Rarely Pure and Never Simple https://arxiv.org/abs/2310.01382
- Wanda: A Simple and Effective Pruning Approach for Large Language Models https://arxiv.org/pdf/2306.11695
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot https://arxiv.org/abs/2301.00774
- On the Effect of Dropping Layers of Pre-trained Transformer Models https://arxiv.org/pdf/2004.03844.pdf
- Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned https://arxiv.org/pdf/1905.09418.pdf
- A Fast Post-Training Pruning Framework for Transformers https://proceedings.neurips.cc/paper_files/paper/2022/file/987bed997ab668f91c822a09bce3ea12-Paper-Conference.pdf
- LLM-Pruner: On the Structural Pruning of Large Language Models https://arxiv.org/pdf/2305.11627.pdf
- Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models https://arxiv.org/pdf/2310.05015.pdf
- LLM Surgeon https://arxiv.org/pdf/2312.17244.pdf
- Shortened LLaMA: A Simple Depth Pruning for Large Language Models https://arxiv.org/abs/2402.02834
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns https://arxiv.org/abs/2401.15024
- Structural Pruning of Large Language Models via Neural Architecture Search https://arxiv.org/abs/2405.02267
- Not all Layers of LLMs are Necessary during Inference https://arxiv.org/pdf/2403.02181.pdf
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect https://arxiv.org/abs/2403.03853
- FLAP: Fluctuation-Based Adaptive Structured Pruning for Large Language Models https://arxiv.org/abs/2312.11983
- Bonsai: Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes https://arxiv.org/pdf/2402.05406.pdf
- The Unreasonable Ineffectiveness of the Deeper Layers https://arxiv.org/pdf/2403.17887v1.pdf
- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning https://arxiv.org/abs/2310.06694
- Netprune https://arxiv.org/pdf/2402.09773.pdf
- MiniLLM: Knowledge Distillation of Large Language Models https://arxiv.org/pdf/2306.08543
- A Survey on Knowledge Distillation of Large Language Models https://arxiv.org/abs/2402.13116
- QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs https://arxiv.org/pdf/2404.00456
- A Survey on Efficient Inference for Large Language Models https://arxiv.org/pdf/2404.14294v2
- Efficient Large Language Models: A Survey https://arxiv.org/pdf/2312.03863