The long answer: if you download an LLM from, say, HuggingFace, you'll see a directory structure a little like this:
$ tree ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
├── refs
│ └── main
└── snapshots
└── 5b8c30c51e0641d36a2f9007c88425c93db0cb07
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
[slightly modified for clarity].
You can see the actual matrices with something like:
from transformers import AutoModel
model = AutoModel.from_pretrained("mlabonne/TwinLlama-3.1-8B")
embedding_layer = model.get_input_embeddings() # This contains token ID → vector mappings
embedding_weights = embedding_layer.weight # Shape: (vocab_size, embedding_dim)
print(embedding_weights.size())
Output:
torch.Size([128256, 4096])
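To see that lookup in action, here's a quick sketch continuing from the snippet above (the input text is arbitrary, just for illustration): tokenise some text and fetch the corresponding rows of the embedding matrix.
from transformers import AutoTokenizer
# Continuing from above: model is already loaded
tokenizer = AutoTokenizer.from_pretrained("mlabonne/TwinLlama-3.1-8B")
# Text -> token IDs -> one 4096-dimensional row of the embedding matrix per ID
token_ids = tokenizer("Hello world", return_tensors="pt")["input_ids"]
vectors = model.get_input_embeddings()(token_ids)
print(token_ids)      # each ID is < 128256
print(vectors.shape)  # torch.Size([1, n_tokens, 4096])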
But if you want to get even more raw, you can just read the files directly:
from safetensors.torch import load_file
state_dict = load_file("model-00001-of-00004.safetensors")
embeddings = state_dict['model.embed_tokens.weight']
print(embeddings.size())
Output:
torch.Size([128256, 4096])
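How do you know which shard holds a given tensor? The model.safetensors.index.json file in the same directory maps tensor names to shard files. A quick check (assuming you're in the snapshot directory):
import json
# model.safetensors.index.json maps every tensor name to the shard that contains it
with open("model.safetensors.index.json") as f:
    index = json.load(f)
print(index["weight_map"]["model.embed_tokens.weight"])  # model-00001-of-00004.safetensors, hence reading the first shard above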
The first dimension, 128256, is the size of the vocabulary, which we can confirm with:
$ grep -r 128256 ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
...
config.json: "vocab_size": 128256
You can see what the actual tokens (note: not words) are in tokenizer.json.
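For example, here is a quick way to see that tokens are sub-word pieces rather than whole words (the example word is arbitrary and the exact split depends on the vocabulary):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mlabonne/TwinLlama-3.1-8B")
tokens = tokenizer.tokenize("unbelievably")
print(tokens)                                   # sub-word pieces, not words
print(tokenizer.convert_tokens_to_ids(tokens))  # their rows in the 128256 x 4096 embedding matrix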
File formats
Safetensors is a Hugging Face format for storing tensors. The "safe" refers to avoiding the security implications of PyTorch's .pt and .bin formats, which are pickles and can therefore execute arbitrary code when loaded.
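A safetensors file is just a JSON header (names, shapes, offsets) followed by raw tensor bytes, so opening one never runs arbitrary code. A small sketch of inspecting a shard (file name as in the listing above):
from safetensors import safe_open
# Only the header is parsed up front; tensors are read on demand and no code is executed
with safe_open("model-00001-of-00004.safetensors", framework="pt", device="cpu") as f:
    print(list(f.keys())[:5])                              # tensor names in this shard
    print(f.get_tensor("model.embed_tokens.weight").shape)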
GGUF, the successor to the GGML file format, is used by llama.cpp and compatible runtimes. It stores everything (both weights and metadata) in a single file, is well suited to CPU inference and supports quantization.
You can convert from Safe Tensors to GGUF by running something like this from the llama.cpp repo:
python convert_hf_to_gguf.py --outtype f16 --outfile /home/henryp/Temp/models--mlabonne--TwinLlama-3.1-8B-DPO.gguf /home/henryp/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B-DPO/snapshots/4a76d14414118c00fbbaed96cf1dd313a10f1409/
The directory argument is the snapshot directory where the model's config.json lives.
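Once converted, you can poke around the result with the gguf Python package that ships with the llama.cpp repo (a rough sketch; the field and tensor names will vary by model):
from gguf import GGUFReader   # pip install gguf, or use the copy in the llama.cpp repo
# One file holds both the metadata and the weights
reader = GGUFReader("/home/henryp/Temp/models--mlabonne--TwinLlama-3.1-8B-DPO.gguf")
print(list(reader.fields.keys())[:10])     # metadata, e.g. architecture and tokenizer settings
for t in reader.tensors[:3]:
    print(t.name, t.shape, t.tensor_type)  # per-tensor shape and quantization type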
Tensor files store their weights in various numeric formats. FP16 and BF16 are both 16-bit encodings; the difference is how they split the bits between mantissa and exponent. BF16 has a larger range (the same as FP32) but lower precision. FP16 is the traditional GPU format, although more modern GPUs also support BF16.
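You can see the trade-off directly in PyTorch: finfo reports the largest representable value and the gap between 1.0 and the next representable number (a rough proxy for precision).
import torch
# BF16 keeps FP32's 8 exponent bits (so the same huge range) but has fewer mantissa bits;
# FP16 has more mantissa bits but a much smaller range (max ~65504)
for dtype in (torch.float16, torch.bfloat16, torch.float32):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)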
You can also quantize down to integer formats like INT8 and Q4_0, the latter being 4-bit integers with the _0 referring to the technique used to quantize them (e.g., simple scaling). These strategies trade off accuracy, speed, complexity and so on.
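As a rough illustration of the scaling idea (a simplified sketch, not llama.cpp's actual Q4_0 code): split the weights into blocks, store one FP16 scale per block and each weight as a 4-bit integer.
import numpy as np

def quantize_4bit(weights, block_size=32):
    # One scale per block of 32 weights; each weight becomes an integer in [-8, 7]
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return q, scales.astype(np.float16)

weights = np.random.randn(64).astype(np.float32)
q, scales = quantize_4bit(weights)
reconstructed = q.astype(np.float32) * scales               # lossy: this is the accuracy trade-off
print(np.abs(reconstructed - weights.reshape(-1, 32)).max())
In practice you'd let llama.cpp's quantization tool do this to the f16 GGUF rather than rolling your own.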
The model
There's a nice Java implementation of Llama.cpp here. It's very nicely written and even allows you to compile the Java code to a native binary (use the latest GraalVM). This class represents the weights that are loaded from disk. In the case of a Llama model, the GGUF file is read from disk, parsed and becomes an object of this class.