slowllama
Fine-tune Llama2 and CodeLlama models, including 70B/34B, on Apple M1/M2 devices (for example, a MacBook Air or Mac Mini) or consumer NVIDIA GPUs.
slowllama does not use any quantization. Instead, it offloads parts of the model to SSD or main memory during both the forward and backward passes. In contrast with training large models from scratch (unattainable) or inference, where we are likely to care about interactivity, we can still get something finetuned if we let it run for a while.
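The core idea: keep at most one transformer block's weights in memory, streaming blocks from SSD as the computation reaches them. Below is a minimal sketch of that pattern for a forward pass, assuming a hypothetical per-block checkpoint layout and a make_block constructor; it illustrates the technique, not slowllama's actual code.

```python
import torch
from torch import nn
from typing import Callable

def offloaded_forward(
    x: torch.Tensor,
    make_block: Callable[[], nn.Module],  # builds an empty transformer block (hypothetical)
    n_blocks: int,
    block_dir: str,                       # directory of per-block checkpoints (hypothetical layout)
) -> torch.Tensor:
    """Forward pass that keeps only one block's weights in memory at a time."""
    for i in range(n_blocks):
        block = make_block()
        # Load this block's weights from SSD; every other block stays on disk.
        state = torch.load(f"{block_dir}/block_{i}.pt", map_location="cpu")
        block.load_state_dict(state)
        with torch.no_grad():
            x = block(x)
        del block, state  # drop the weights before loading the next block
    return x
```

For the backward pass the same trick applies in gradient-checkpointing style: reload blocks in reverse order, recompute each block's forward with gradients enabled, and backpropagate one block at a time, so the full set of weights and activations never has to coexist in memory.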
The current version uses LoRA to limit updates to a smaller set of parameters. The first version supported full finetuning as well, but I decided to remove it for now; more on that below.
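For reference, here is a minimal sketch of a LoRA-wrapped linear layer; the class name, rank r, and scaling defaults are illustrative, not necessarily what slowllama uses.

```python
import torch
from torch import nn

class LoRALinear(nn.Module):
    """Freeze the pretrained weight W; train only a low-rank update B @ A.

    Effective weight: W + (alpha / r) * B @ A. Only A and B receive
    gradients, so optimizer state and weight updates stay tiny relative
    to the size of the frozen base model.
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained weights stay frozen
        self.scale = alpha / r
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * ((x @ self.A.T) @ self.B.T)
```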
Finetuning is the only focus; nothing special is done for inference. Consider llama.cpp for that.
For CUDA-specific experiments, see the report on the A10 GPU.
Example
Tests were done on an Apple M1 with 16GB of memory and an Apple M2 with 24GB of memory.
In order to fine-tune a llama2 model, we need to:
Install dependencies: pip install torch sentencepiece numpy. Optionally, install fewlines for weight/gradient distribution logging: pip install fewlines.
Clone llama2 and follow the instructions to download the models. The script will download the tokenizer as well; tokenizer.model should be put into the same directory as the llama model itself. Use codellama for CodeLlama models. An example folder structure could look like:
/parent/
/slowllama/… #