Wednesday, January 22, 2025

Notes on LLMs

I've been learning to build LLMs the hard way. These are some notes I made on my journey.

Intro

I'm not going to train my LLM from scratch. Instead, I'm going to take somebody else's (the product of "Pre-Training") and tune it ("Supervised Fine-Tuning", or SFT). The differences can be seen here [Microsoft], but suffice it to say that Pre-Training takes huge resources whereas Supervised Fine-Tuning can be achieved on modest hardware.

The Environment

I'll be extensively using these Python libraries:
  • TRL "is a library created and maintained by Hugging Face to train LLMs using SFT and preference alignment." [1]
  • Unsloth "uses custom kernels to speed up training (2-5x) and reduce memory use (up to 80% less memory)... It is based on TRL and provides many utilities, such as automatically converting models into the GGUF quantization [see below] format." [1]
  • "Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code!" [HuggingFace] It offloads large matrices to disk or CPU
Preference alignment is an umbrella term. It "addresses the shortcomings of SFT by incorporating direct human or AI feedback into the training process", whereas SFT simply trains on a corpus of text and learns its structure.

But on Ubuntu 20, my code gives:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

This appears to be referring to the Linux kernel [SO].

OOMEs

Using a newer kernel (6.5.0-1025-oem) gets me further but then barfs with:

  File "/home/henryp/venv/llms/lib/python3.10/site-packages/transformers/activations.py", line 56, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.17 GiB. GPU 0 has a total capacity of 7.75 GiB of which 529.62 MiB is free. Including non-PyTorch memory, this process has 7.22 GiB memory in use. Of the allocated memory 6.99 GiB is allocated by PyTorch, and 87.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Some advice I received was:
  • Reduce the batch size even more.
  • Try different optimizers, as optimizers hold a lot of memory: use 8-bit Adam or Adafactor.
  • Try LoRA (low-rank adaptation).
  • Quantize the model so it loads in lower-bit precisions. [Discord]
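
To make the first two suggestions concrete, this is roughly what the relevant knobs look like in transformers' TrainingArguments. The values here are illustrative only, not the ones I ended up using:

from transformers import TrainingArguments

# Illustrative settings only: a tiny per-device batch with gradient
# accumulation to keep the effective batch size up, gradient checkpointing
# to trade compute for memory, and Adafactor to shrink the optimizer state.
training_args = TrainingArguments(
    output_dir="sft_out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,
    optim="adafactor",   # or an 8-bit Adam variant such as "paged_adamw_8bit"
)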
"LoRA is a parameter-efficient technique for fine-tuning LLMs... The primary purpose of LoRA is to enable the fine-tuning of LLMs with significantly reduced
computational resources."

The idea is that when we fine-tune the weight matrix, W, semantically we'd do this:

W' = W + M

But M has the same dimensions as W, so if W is large that's a lot of extra memory we're talking about.

The trick with LoRA is that instead of M we use two matrices of much smaller rank, A and B, and replace M with AB. As long as A has the same number of rows as M and B has the same number of columns, they can otherwise be much smaller (the shared inner dimension is referred to as r in the library code). "Larger ranks may capture more diverse tasks but could lead to overfitting" [1, p214].
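
A back-of-the-envelope sketch of the saving, with made-up dimensions (a 4096 x 4096 update and r = 16; nothing here is specific to my model):

import torch

d, k, r = 4096, 4096, 16

# Storing the full update M costs d * k numbers...
full_update_params = d * k                 # 16,777,216
# ...whereas the LoRA factors A (d x r) and B (r x k) cost far fewer.
lora_params = d * r + r * k                # 131,072

print(f"full update: {full_update_params:,} parameters")
print(f"LoRA (r={r}): {lora_params:,} parameters")

# Only A and B are trained; their product has the same shape as M.
A = torch.randn(d, r)
B = torch.randn(r, k)
M_approx = A @ B
print(M_approx.shape)                      # torch.Size([4096, 4096])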

Sure, we lose some accuracy but when my code now reads:

model, tokenizer = FastLanguageModel.from_pretrained( # Unsloth
    ...
    load_in_4bit=True, # "You can activate QLoRA by setting load_in_4bit to True" [1, p251]
)

it runs to completion. Note that QLoRA combines LoRA with quantization. "The core of QLoRA’s approach involves quantizing the base model parameters to a custom 4-bit NormalFloat (NF4) data type, which significantly reduces memory usage." [1]
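
For context, the NF4 data type can also be requested directly through the transformers/bitsandbytes route rather than Unsloth's load_in_4bit flag. A rough sketch (this is the quantization_config I refer to below; the model name is just the base model, used for illustration):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mlabonne/TwinLlama-3.1-8B"   # illustrative; any causal LM works

# QLoRA quantizes the frozen base weights to 4-bit NormalFloat (NF4);
# double quantization squeezes out a little more memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)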

Note that I also tried setting a quantization_config. Although the fine-tuning stage worked, the code that later used the model complained about non-zero probabilities, for reasons I don't yet understand.

Preference Alignment

Reinforcement Learning from Human Feedback (RLHF) "typically involves training a separate reward model and then using reinforcement learning algorithms like PPO to fine-tune the language model... This is typically done by presenting humans with different answers and asking them to indicate which one they prefer." [1]

Proximal Policy Optimization (PPO) "is one of the most popular RLHF algorithms. Here, the reward model is used to score the text that is generated by the trained model. This reward is regularized by an additional Kullback–Leibler (KL) divergence factor, ensuring that the distribution of tokens stays similar to the model before training (frozen model)." [1]
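
Schematically (my notation, not the book's), the quantity PPO maximises is something like:

reward(x, y) − β · KL(π_tuned ‖ π_frozen)

where β controls how strongly the tuned model is pulled back towards the frozen reference.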

Direct Preference Optimization (DPO) uses the Kullback–Leibler divergence [previous post] to compare an accepted answer with a rejected one, removing the need for a separate RL algorithm.
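
As a sketch of what that looks like in TRL (not my actual code; the column names are what DPOTrainer expects, but the exact constructor arguments vary between TRL versions):

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "mlabonne/TwinLlama-3.1-8B"   # illustrative
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data: a prompt plus a preferred ("chosen") and a rejected answer.
train_dataset = Dataset.from_dict({
    "prompt":   ["What is LoRA?"],
    "chosen":   ["LoRA is a parameter-efficient fine-tuning technique."],
    "rejected": ["LoRA is a kind of radio network."],
})

# beta scales the implicit KL penalty keeping the model near the frozen reference.
training_args = DPOConfig(output_dir="dpo_out", beta=0.1, per_device_train_batch_size=1)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,   # older TRL versions call this argument "tokenizer"
)
trainer.train()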

The idea with preference alignment is that the adjustment to the underlying model is much smaller. Compare my trained model weights (M in the above equation):

$ du -sh saved_model/model_mlabonne
321M    saved_model/model_mlabonne

with the model weights on which it was based (W in the above equation):

$ du -sh ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
15G     /home/henryp/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/

and it's tiny.

[1] The LLM Engineer's Handbook
