The paper explores structured pruning as a way to build smaller yet strong Large Language Models (LLMs) from larger pre-trained ones. The approach combines two key methods: targeted structured pruning, which trims the model to a specified target architecture by removing layers, attention heads, and hidden and intermediate dimensions, and dynamic batch loading, which adjusts the composition of each training batch according to how the loss varies across data domains. The results are demonstrated by the Sheared-LLaMA series, obtained by pruning LLaMA2-7B down to 1.3B and 2.7B parameters. These models outperform comparably sized alternatives such as Pythia and INCITE while requiring only about 3% of the compute typically needed to train such LLMs from scratch.
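To make the dynamic batch loading idea concrete, here is a minimal Python sketch assuming a simple exponentiated-weights update: domains whose current loss exceeds a reference loss receive a larger share of the next batch. The function names, the `lr` step size, and the toy loss values are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def update_domain_weights(weights, current_losses, reference_losses, lr=1.0):
    """Exponentiated-weights style update of per-domain sampling weights.

    Domains whose current loss exceeds their reference loss are upweighted,
    so the next batch draws more data from domains that are lagging behind.
    (Sketch only; the exact update rule in the paper may differ.)
    """
    weights = np.asarray(weights, dtype=float)
    # Excess loss per domain; clip so domains already ahead are not upweighted.
    delta = np.maximum(np.asarray(current_losses) - np.asarray(reference_losses), 0.0)
    new_weights = weights * np.exp(lr * delta)
    return new_weights / new_weights.sum()  # renormalize to a distribution

def sample_batch_domains(weights, batch_size, rng=None):
    """Draw the domain of each example in the next batch from the current weights."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.choice(len(weights), size=batch_size, p=weights)

# Toy usage: three domains, uniform start, domain 2 lags its reference loss.
w = np.array([1 / 3, 1 / 3, 1 / 3])
w = update_domain_weights(w, current_losses=[2.1, 2.0, 2.6], reference_losses=[2.0, 2.0, 2.2])
print(w)                                 # domain 2 now has the largest sampling weight
print(sample_batch_domains(w, batch_size=8))
```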
Key insights include:
- Targeted structured pruning compresses a pre-trained LLM to a precise target architecture rather than an arbitrary sparsity level, as sketched below.
- Dynamic batch loading shifts the training data mix in each batch toward domains where the loss is high, making continued pre-training of the pruned model more data-efficient.
- The resulting Sheared-LLaMA-1.3B and 2.7B models outperform similar-sized models trained from scratch while using only about 3% of the compute.
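As a rough illustration of what pruning to a target shape means, the sketch below keeps a fixed number of attention heads per layer by ranking heads with a precomputed importance score. This is a simplified heuristic for illustration only; the paper instead learns pruning masks jointly with training under hard shape constraints, and the `head_importance` scores here are hypothetical.

```python
import numpy as np

def prune_heads_to_target(head_importance, target_heads_per_layer):
    """Keep only the highest-scoring attention heads in each layer.

    head_importance: array of shape (num_layers, num_heads) with an importance
    score per head (e.g. estimated from loss sensitivity on calibration data).
    Returns a boolean mask of the same shape marking the heads that survive,
    so every layer ends up with exactly the target number of heads.
    """
    scores = np.asarray(head_importance, dtype=float)
    mask = np.zeros_like(scores, dtype=bool)
    for layer, layer_scores in enumerate(scores):
        keep = np.argsort(layer_scores)[-target_heads_per_layer:]  # top-k heads
        mask[layer, keep] = True
    return mask

# Toy usage: 4 layers x 8 heads, pruned to a target shape of 4 heads per layer.
rng = np.random.default_rng(0)
importance = rng.random((4, 8))
mask = prune_heads_to_target(importance, target_heads_per_layer=4)
print(mask.sum(axis=1))  # every layer keeps exactly 4 heads
```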
The implications of this research are significant: it demonstrates an approach to LLM development that can dramatically reduce both the financial and computational cost of producing capable smaller models. Further research building on these ideas could reshape not only how LLM architectures are designed but also how they are applied in settings with limited compute resources.