Showing posts sorted by relevance for query tiny type. Sort by date Show all posts

Sunday, March 20, 2022

Scala 3 Type System Notes

Importance of types

Pathological states are unreachable; the state space of the application simply does not include them.

Scala 3 has some new keywords that help metaprogramming. Here are some notes I made as I try to grok them.

New Scala3 Keywords

The inline modifier "tells the compiler that it should inline any invocation of this method at compile-time. If it's not possible, compiler will fail the compilation." Note that the compiler will complain that "inline match can only be used in an inline method" if you try to use an inline match inside a normal (non-inline) method.

The inline keyword will partially act like the Scala 2 or C/C++ version of the keyword if the expression is too complex to evaluate at compile time. That is, it will stick the code in at compile time and let it evaluate at runtime. If you don't want this partial evaluation, "you can tell the compiler to avoid complex expression as parameter. This is done by using an inline if" [Scalac blog]. So, by using a def and an if that are both inlined you could, for example, have all debug statements removed at compile time.

Note, the parameters to a function can also be inlined.

"erasedValue comes from compiletime package. It's usually used in tandem with inline match. It allows us to match on the expression type, but we cannot access extracted value as that code is executed at compile-time."

"constValue comes from compiletime too. It returns the value of a singleton type."

Note that "in Scala 3 you must use a lower-case identifier for a type being extracted from a pattern match." [Michał Sitko's excellent blog post]

The keyword using appears to be the Scala 3 equivalent of implicit in Scala 2. For compilation to succeed, every time you see a using there must be a corresponding given that satisfies its demands.

The keyword with is shorthand used when declaring an alias given: it stands in for the equals sign together with new ClassName. Quoting the Scala Book, this SO answer says:

Because it is common to define an anonymous instance of a trait or class to the right of the equals sign when declaring an alias given, Scala offers a shorthand syntax that replaces the equals sign and the "new ClassName" portion of the alias given with just the keyword with.
The keyword given makes zero or more functions available in the ether for a corresponding using.

The keyword transparent can be applied to both traits and inline methods. In each case, it's saying the returned type is more specific at compile time. If it cannot be more specific at compile time (say, it depends on a condition that is only known at runtime) then it will fall back to the most general type.

Opaque types seem to basically be tiny types that are very easy to use (see the official documentation). To paraphrase: outside of the module, it is not possible to discover that the opaque type is actually implemented as its alias.

Examples

If we define the natural numbers as:

  trait Nat
  class _0 extends Nat
  class Succ[A <: Nat] extends Nat
  trait <[A <: Nat, B <: Nat]

then the Peano axioms can be represented by:

  given basic[B <: Nat]: <[_0, Succ[B]] with {}
  given inductive[A <: Nat, B <: Nat](using lt: <[A, B]): <[Succ[A], Succ[B]] with {}

(From RockTheJVM)

So, that inductive function allows compilation if it is satisfied in a (potentially) recursive manner until we reach basic. Having these axioms in the ether, we can compile this:

type _1 = Succ[_0]
type _2 = Succ[_1] // Succ[Succ[_0]]
type _3 = Succ[_2] // Succ[Succ[Succ[_0]]]
type _4 = Succ[_3] // Succ[Succ[Succ[Succ[_0]]]]
...

I hope to build on this to make data engineering pipelines easier to debug. I'll expand on this idea in a future post as this one has become long enough.

Thursday, September 22, 2022

Rage against the Markup

Markup is pervasive in DevOps. But markup:

  • is hard to refactor
  • has limited control flow
  • is not type-safe
  • is hard to test

Refactoring

I see in our codebase something like this:

      - name: Set env to Develop
        if: endsWith(github.ref, '/develop')
        run: |
          echo "ENVIRONMENT=develop" >> $GITHUB_ENV
      - name: Set env to Staging
        if: endsWith(github.ref, '/main')
        run: |
          echo "ENVIRONMENT=staging" >> $GITHUB_ENV
      - name: Set env to Productions
        if: endsWith(github.ref, '/production')
        run: |
          echo "ENVIRONMENT=production" >> $GITHUB_ENV

Not only is this ugly, it's copy-and-pasted everywhere. I can't refactor it and what's more, there is a...

(Lack of) Control

Imagine I want to create an AWS Postgres instance with Terraform and then provision the DB with a schema using the Python library Alembic, all via GitHub Actions. GHA can call the Terraform file and create a DB, but how do I get the URL of that Postgres instance so I can give it to Alembic for it to connect and create the tables? Weaving different technologies together in a Turing-complete language is easy; with markup, less so. I had to hack some calls to the AWS CLI and parse the JSON it returned, all in bash.
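To give a feel for how that weaving might look in a general-purpose language, here is a minimal Python sketch. It is not what we actually ran: the Terraform output names (db_host, db_port), the DATABASE_URL environment variable and the idea that Alembic's env.py reads its URL from that variable are all assumptions made for illustration. Only the terraform output -raw and alembic upgrade head commands themselves are real.

```python
import os
import subprocess
from urllib.parse import quote


def terraform_output(name: str, workdir: str) -> str:
    """Read one output value from the Terraform state in workdir."""
    result = subprocess.run(
        ["terraform", "output", "-raw", name],
        cwd=workdir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def postgres_url(user: str, password: str, host: str, port: int, db: str) -> str:
    """Build a connection URL, percent-encoding the credentials."""
    return (f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
            f"@{host}:{port}/{db}")


def migrate(workdir: str, user: str, password: str, db: str) -> None:
    """Hypothetical glue: look up the new instance, then run Alembic against it."""
    host = terraform_output("db_host", workdir)       # assumed output name
    port = int(terraform_output("db_port", workdir))  # assumed output name
    url = postgres_url(user, password, host, port, db)
    # Assumes env.py takes its connection URL from DATABASE_URL.
    env = dict(os.environ, DATABASE_URL=url)
    subprocess.run(["alembic", "upgrade", "head"], env=env, check=True)
```

The point is less the specific calls than that the hand-off between the two tools is an ordinary function call that can be refactored and unit tested.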

Type-safety issue #1

An example of a lack of type safety can be found in any Terraform script. We had something like this:

resource "aws_ecs_task_definition" "compute_task" {
  family                   = var.task_name
  container_definitions    = <<DEFINITION
  [
    {
      "name": "${var.task_name}",
      "image": "${aws_ecr_repository.docker_container_registry.repository_url}:${var.tag}",
...

Notice the CloudControl JSON being injected into Terraform. Now, a reference to the aws_iam_role added here (as is suggested on numerous websites - see a previous post here) is silently ignored. This wouldn't happen using, say, an SDK in a type-safe language, as you can obviously only access the methods it offers you.

Type-safety issue #2

The names of secrets in GitHub Actions can only be uppercase alpha-numeric and underscore, apparently. AWS identifiers can include alpha-numerics and a dash. Mess these up and you spend time fixing tedious bugs. Tiny types would help with this.
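As a rough sketch of what such tiny types could look like, here is a Python version. The two validation rules are simply the ones described above, not taken from the official GitHub or AWS documentation, and the class and function names are made up for this example. Python can only check this at runtime, so it is a weaker guarantee than a compile-time check would give.

```python
import re
from dataclasses import dataclass

# Assumed rules, taken from the description above rather than from official docs.
_SECRET_NAME = re.compile(r"[A-Z0-9_]+")
_AWS_IDENTIFIER = re.compile(r"[A-Za-z0-9-]+")


@dataclass(frozen=True)
class GitHubSecretName:
    value: str

    def __post_init__(self) -> None:
        if not _SECRET_NAME.fullmatch(self.value):
            raise ValueError(f"invalid GitHub secret name: {self.value!r}")


@dataclass(frozen=True)
class AwsIdentifier:
    value: str

    def __post_init__(self) -> None:
        if not _AWS_IDENTIFIER.fullmatch(self.value):
            raise ValueError(f"invalid AWS identifier: {self.value!r}")


def secret_reference(name: GitHubSecretName) -> str:
    """Render the expression used to read the secret in a workflow file."""
    return "${{ secrets." + name.value + " }}"
```

With this in place, GitHubSecretName("DB_PASSWORD-staging") fails as soon as the value is constructed, and a function that takes a GitHubSecretName cannot be handed an AwsIdentifier by mistake if a type checker is in use.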

Type-safety issue #3

Another example: for creating a DB, we used the password M3dit8at!0n$ - seems OK, right? The DB built fine, the GitHub Actions script then also created the schema fine but we could not log in to the DB. Cue hours of frantic checking that there were no network or permission issues. The problem? That bloody password includes characters that need to be escaped on the Linux CLI and that's how Terraform and Alembic were invoked! They were at least consistent - that is, the infrastructure was Terraformed and Alembic built the schema, but for the rest of us, the password didn't work.

In Java, a String is just a String and its content isn't going to break the program's flow. Not so in markup land.
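The same holds in Python, provided the value never passes through a shell unquoted. The sketch below shows two ways of keeping the password intact; some-tool is a made-up command name and the child process is just a stand-in that echoes its argument back.

```python
import shlex
import subprocess
import sys

password = "M3dit8at!0n$"

# Option 1: avoid the shell entirely by passing an argument list.
# The child process simply prints the argument it received.
echo = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
received = subprocess.run(
    echo + [password], check=True, capture_output=True, text=True
).stdout.rstrip("\n")

# Option 2: if a shell command line is unavoidable, quote the value first.
quoted = shlex.quote(password)
# Parse the command line the way a POSIX shell would split it.
round_tripped = shlex.split(f"some-tool --password {quoted}")[-1]
```

In both cases the string that arrives at the other end is the one that was sent, which is exactly what did not happen in our pipeline.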

Testing times

Which leads to testing. I made my changes in my GitHub Actions file to use the password in the secret ${{ secrets.DB_PASSWORD-staging }} and ran my mock GHA so:

act -j populate -s DB_PASSWORD-staging=...

and the whole thing worked wonderfully. Only when I tried to create the secret in-situ was I told that DB_PASSWORD-staging was an invalid name.

And how do you test this abomination?

Spot the errors
This was the result of some hasty copy-and-paste.

Solutions...?

What you can do in Scala 3 with metaprogramming is truly amazing. For example, proto-quill creates ASTs that are passed around and are checked against the DB at compile time! I think this might be a bit overkill and a more practical approach is that of IP4S, which checks your URLs, ports etc. at compile time. I have a couple of much-neglected projects that at least give the gist of a solution (here's one) and that I'll expand on soon.

Monday, April 11, 2022

More MLOps: bug hunting

We have several ML models for which we calculate the odds (the coefficients in a linear regression model) and the risks (the difference in probabilities between the factual and the counterfactual). 

Previously, we saw that for one of our six models, the sign of the odds and risks differed when they should always be the same.

Fortunately, we caught this by putting a transform into our pipeline that takes the output of both the linear regression model and factual/counterfactual probabilities and simply checks them.
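The check itself can be very small. The following is a minimal sketch of the idea in Python, not our production transform; the function name, its parameters and the message wording are only illustrative.

```python
from typing import Optional


def sign_violation(region: str, risk_difference: float,
                   coefficient: float, p_value: float) -> Optional[str]:
    """Return a diagnostic message if the two signs disagree, else None."""
    # A zero on either side is not reported as a disagreement.
    if risk_difference * coefficient >= 0:
        return None
    return (f"difference in risk ({risk_difference}) does not agree with the sign "
            f"of the coefficient ({coefficient}) for {region} "
            f"when p-value = {p_value}")
```

A transform built on this would collect the non-None messages for every region and write them out as a diagnostic data set alongside the production data.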

Part of our ML pipeline. Some transforms create production data, others diagnostic.

In the graph above, anything with _regression_tests appended is a diagnostic transform while everything else is production data that will ultimately be seen by the customer.

Recently, a diagnostic test was reporting:
difference in risk (0.14285714285714285) does not agree with the sign of the coefficient (-4182.031458784066) for STP QR1 when p-value = 0.988606800326618
Transform containing pipeline diagnostic data.

Aside: there are other violations in this file but they're not unexpected. For instance, there are 42 health regions in England but the data has some random rubbish from the upstream data capture systems. The last row is due to us expecting at least 10 statistically significant outputs from our model when we only found 9. That's fine, we'll turn up the squelch on that one. A red flag would be if the number of insights dropped to zero. This happened when we needed the outcome to be binary but accidentally applied a function of type ℕ => {0,1} twice. Oops.

In a previous post, we found the problem was that the "risks" and the "odds" transforms, although using the same input file, were consuming that file at significantly different times. The problem manifested itself when most regions for that one model had different signs. But this single diagnostic we see in the violations above is different. Only one region (QR1) differs. All the others in the data set were fine.

Looking at the offending region, the first thing to note was that we had very little data for it. In fact, of the 42 health regions, it had the least by a large margin.
Number of rows of data per health region for a particular model

So, the next step was to build a Python test harness and use these 127 rows to run the data against production code locally. Running this on my laptop is so much easier as I can, say, create a loop and run the model 20 times to see if there is some non-determinism happening. Apart from tiny changes in numbers due to numerical instability, I could not recreate the bug.
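The loop was along these lines. This is a simplified sketch rather than the real harness: stable_model and flaky_model are invented stand-ins for the production code, and the rows here are dummy numbers.

```python
import itertools
import math
from typing import Callable, Sequence


def max_spread(model: Callable[[Sequence[float]], float],
               rows: Sequence[float], runs: int = 20) -> float:
    """Run the model repeatedly on identical input; return max - min of its output."""
    outputs = [model(rows) for _ in range(runs)]
    return max(outputs) - min(outputs)


def stable_model(rows: Sequence[float]) -> float:
    """Stand-in for deterministic production code."""
    return math.fsum(rows) / len(rows)


_calls = itertools.count()


def flaky_model(rows: Sequence[float]) -> float:
    """Stand-in whose sign flips on every other call."""
    sign = -1.0 if next(_calls) % 2 else 1.0
    return sign * stable_model(rows)


rows = [float(i) for i in range(127)]  # dummy data, same size as the real sample
stable_spread = max_spread(stable_model, rows)
flaky_spread = max_spread(flaky_model, rows)
```

A spread of zero, or one within floating-point noise, suggests the code is deterministic for that input; anything large enough to flip a sign would have pointed at the model itself.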

So, like in the previous post, could my assumption that the data is the same for both risk and odds calculations be flawed? The pipeline was saying that the two descendant data sets were built seconds after the parent.

Both risk and odds data descend from the same parent

Could there be some bizarre AWS eventual consistency issue? This seems not to be the case, as S3 now offers strong read-after-write consistency (see this AWS post from December 2020).

Conclusion

Why just one region in one model has a different sign is still a mystery. Multiple attempts to see the same bug have failed. However, since there is so little data for this region, the p-values are very high (c. 0.99) and we disregard any insights anyway. We'll just have to keep an eye out for it happening again.

Wednesday, January 22, 2025

Notes on LLMs

I've been learning to build LLMs the hard way. These are some notes I made on my journey.

Intro

I'm not going to train my LLM from scratch. Instead, I'm going to use somebody else's ("Pre Training") and tune it ("Supervised Fine Tuning"). The differences can be seen here [Microsoft] but suffice to say that Pre Training takes huge resources whereas Supervised Fine Tuning can be achieved on modest hardware.

The Environment

I'll be extensively using these Python libraries:
  • TRL "is a library created and maintained by Hugging Face to train LLMs using SFT and preference alignment." [1]
  • Unsloth "uses custom kernels to speed up training (2-5x) and reduce memory use (up to 80% less memory)... It is based on TRL and provides many utilities, such as automatically converting models into the GGUF quantization [see below] format." [1]
  • "Accelerate is a library that enables the same PyTorch code to be run across any distributed configuration by adding just four lines of code!" [HuggingFace] It offloads large matrices to disk or CPU.

Preference alignment is an umbrella term. It "addresses the shortcomings of SFT by incorporating direct human or AI feedback into the training process" whereas SFT just trains on a corpus of text, learning its structure.

But on Ubuntu 20, my code gives:

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.

This appears to be referring to the Linux kernel [SO].

OOMEs

Using a newer kernel (6.5.0-1025-oem) gets me further but then barfs with:

  File "/home/henryp/venv/llms/lib/python3.10/site-packages/transformers/activations.py", line 56, in forward
    return 0.5 * input * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (input + 0.044715 * torch.pow(input, 3.0))))
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.17 GiB. GPU 0 has a total capacity of 7.75 GiB of which 529.62 MiB is free. Including non-PyTorch memory, this process has 7.22 GiB memory in use. Of the allocated memory 6.99 GiB is allocated by PyTorch, and 87.96 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Some advice I received was:

  • You need to reduce the batch size even more.
  • You can even try using different optimizers as optimizers hold a lot of memory - Use 8bitadam or adafactor.
  • You can try LoRA - Low rank adaptation.
  • You can also quantize the model to load in lower bit precisions [Discord]

"LoRA is a parameter-efficient technique for fine-tuning LLMs... The primary purpose of LoRA is to enable the fine-tuning of LLMs with significantly reduced computational resources."

The idea is that when we fine tune the weights matrix (W), we'd semantically do this:

W' = W + M

But if M is large, that's a lot of memory we're talking about.

The trick with LoRA is that instead of M we use two much smaller matrices, A and B, and replace M with AB. As long as A has the same number of rows as M and B the same number of columns, their shared inner dimension (referred to as r in the library code) can be much smaller. "Larger ranks may capture more diverse tasks but could lead to overfitting" [1, p214].
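To make the saving concrete, here is a small plain-Python sketch that counts parameters and checks the shapes on a toy example. The 4096 x 4096 layer size and r = 16 are arbitrary numbers chosen for illustration, not values taken from my model.

```python
def lora_parameter_counts(rows: int, cols: int, r: int) -> tuple[int, int]:
    """Return (parameters in a full update M, parameters in the factors A and B)."""
    full = rows * cols              # M is rows x cols
    low_rank = rows * r + r * cols  # A is rows x r, B is r x cols
    return full, low_rank


def matmul(a, b):
    """Plain-Python matrix product, enough to check shapes on a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Toy example: a 4 x 3 update built from a 4 x 1 and a 1 x 3 factor (r = 1).
A = [[1.0], [2.0], [3.0], [4.0]]
B = [[0.5, 0.0, -1.0]]
M = matmul(A, B)  # same shape as the weight matrix it would be added to

full, low_rank = lora_parameter_counts(4096, 4096, r=16)
```

For these illustrative sizes, the full update has 16,777,216 parameters while A and B together have 131,072, a factor of 128 fewer.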

Sure, we lose some accuracy but when my code now reads:

model, tokenizer = FastLanguageModel.from_pretrained( # Unsloth
...
            load_in_4bit=True, # "You can activate QLoRA by setting load_in_4bit to True"  LLMEngineering, p251
        )

it runs to completion. Note that QLoRA combines LoRA with quantization. "The core of QLoRA’s approach involves quantizing the base model parameters to a custom 4-bit NormalFloat (NF4) data type, which significantly reduces memory usage." [1]
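Some back-of-the-envelope arithmetic shows why this matters on a GPU with the 7.75 GiB reported in the error above. The sketch assumes a round 8 billion parameters for an "8B" model and counts only the weights, ignoring activations, gradients, optimizer state and the constants that quantization itself needs.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8e9  # assumed round figure for an "8B" model

fp16 = weight_memory_gib(N_PARAMS, 16)     # roughly 14.9 GiB
four_bit = weight_memory_gib(N_PARAMS, 4)  # roughly 3.7 GiB
```

At 16 bits the weights alone are about twice the size of the card's memory; at 4 bits they fit with room left over for everything else.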

Note that I also tried to set a quantization_config. Although the fine-tuning stage worked, the code that then used the model complained about non-zero probabilities for reasons I don't yet understand.

Preference Alignments

Reinforcement Learning from Human Feedback "typically involves training a separate reward model and then using reinforcement learning algorithms like PPO to fine-tune the language model... This is typically done by presenting humans with different answers and asking them to indicate which one they prefer." [1]

Proximal Policy Optimization (PPO) "is one of the most popular RLHF algorithms. Here, the reward model is used to score the text that is generated by the trained model. This reward is regularized by an additional Kullback–Leibler (KL) divergence factor, ensuring that the distribution of tokens stays similar to the model before training (frozen model)." [1]

Direct Preference Optimization uses Kullback–Leibler [previous post] to compare an accepted answer with a rejected one, removing the need for an RL algorithm.
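For a single pair of answers, the DPO loss as I understand it can be sketched in a few lines of Python. The inputs are the summed log-probabilities that the model being trained (the policy) and the frozen reference model assign to each answer; the parameter names and the beta value of 0.1 are just illustrative choices here. The beta term controls how far the policy is allowed to drift from the reference model, which is where the KL-style regularisation comes in.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) pair of answers."""
    # How much more (in log space) the policy likes each answer than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))
```

The loss falls as the policy raises the probability of the accepted answer relative to the rejected one, and no separate reward model or RL loop is needed to compute it.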

GRPO is an algorithm where the LLM is given the same prompt multiple times and the answers are ranked for quality. This way, the LLM is rewarded.
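The "group relative" part can be sketched as follows: each answer's reward is compared with the other answers to the same prompt. This is a simplified illustration, assuming the rewards have already been produced by some scoring function; it uses the population standard deviation, and real implementations may normalise slightly differently.

```python
import statistics


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Score each answer relative to the other answers to the same prompt."""
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    if spread == 0.0:
        # Every answer scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mean) / spread for r in rewards]


# Four answers to one prompt, two of which were judged good.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Answers that beat the group average get a positive advantage and are reinforced; those below it get a negative one.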

Note that PPO and GRPO do not inherently require human feedback but often do.

The idea with preference alignment is that the adjustment to the underlying model is much smaller. Compare my trained model weights (M in the above equation):

$ du -sh saved_model/model_mlabonne
321M    saved_model/model_mlabonne

with the model weights on which it was based (W in the above equation):

$ du -sh ~/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/
15G     /home/henryp/.cache/huggingface/hub/models--mlabonne--TwinLlama-3.1-8B/

and it's tiny.

[1] The LLM Engineer's Handbook