LLMs 🤝 Data Science

Fine-tuning LLMs to Write Data Science Code

Fine-tuning Gemma Instruct 2B for Data Science tasks

Benedict Neo
bitgrit Data Science Publication
8 min readMay 1, 2024


A while back, Kaggle released the AI Assistants for Data Tasks with Gemma competition, where the goal is to build tools that assist Kaggle developers.

I decided to participate by fine-tuning Gemma to write data science code.

Let’s get started.

Code is on Deepnote & Kaggle

Google Gemma models

Gemma: Google introduces new state-of-the-art open models

Gemma is a new family of open LLMs by Google built using the same technology that created their Gemini models.

Gemma models are available in two sizes (2B and 7B) so you can use them based on your available computing resources.

They outperform other open-source models on most benchmarks, as you can see in the table below from their technical report.

Benchmark results from the Gemma technical report (gemma-report.pdf)

In this tutorial, I’ll be using the 2B parameter size.

And now the question is how we’re going to load them.

Keras & KerasNLP

Keras is a high-level, multi-framework deep learning API designed for simplicity and ease of use. Using Keras 3, you can run workflows on one of three backends: TensorFlow, JAX, or PyTorch.

KerasNLP is a collection of natural language processing (NLP) models implemented in Keras and runnable on JAX, PyTorch, and TensorFlow.

We’re going to install them and import them below.

Note that XLA_PYTHON_CLIENT_MEM_FRACTION is set to 1.00.

That makes JAX preallocate 100% of the total GPU memory, instead of the default 75%.

Adjusting this variable is key to avoiding GPU out-of-memory errors.

Read more here: GPU memory allocation — JAX documentation

Configuration

We create a configuration class so we can tweak the moving parts of our code in one location.
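
Something along these lines; the exact fields are illustrative, but the values mirror the settings used later in this article:

```python
class Config:
    model_preset = "gemma_instruct_2b_en"  # KerasNLP preset to load
    sequence_length = 512                  # max tokens per training example
    lora_rank = 4                          # LoRA rank used for fine-tuning
    batch_size = 1
    epochs = 1
    num_samples = 1000                     # number of fine-tuning samples

cfg = Config()
```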

Create a model

Here we’re loading the Gemma 2B instruction-tuned model

Let’s look at our model.

Note there are 2.5 billion trainable parameters.

The embedding layer isn't counted toward the parameter count used in the name "2B", which is why the trainable total comes out to roughly 2.5 billion.

Let’s try it out by generating some text.

Generate Text

Using the generate method, we can generate text based on a prompt.

The max_length argument specifies the maximum length of the generated sequence.
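
For example (the prompt here is just an illustration):

```python
# Generate a completion for a single prompt; max_length caps the
# combined length of the prompt plus the generated tokens
output = gemma_lm.generate(
    "Explain the bias-variance tradeoff in machine learning.",
    max_length=256,
)
print(output)
```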

Batched prompts

We can also provide it multiple prompts at a time.
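
Passing a list of prompts returns a list of completions, one per prompt (again, the prompts are illustrative):

```python
outputs = gemma_lm.generate(
    [
        "What is the difference between a list and a tuple in Python?",
        "Write a one-line Python expression to reverse a string.",
    ],
    max_length=256,
)
for o in outputs:
    print(o)
```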

Now, since we want the LLM to write code for us, we need a dataset for it to learn from.

Data Science Code Instruct dataset

I’m using the ed001/ds-coder-instruct-v2 dataset on Hugging Face.

It’s essentially a dataset that has instructions and outputs, and when we feed this to the LLM, we’re telling it, “Here’s the input and the output, I expected you to generate outputs like this in the future when you see an input like this”.

Here’s what the dataset looks like.

instruct dataset

Here’s more info about the dataset.

The DS Coder instruct dataset contains (input, instruction, output) triplets. The instruction provides a task in the data science domain, and the output contains the code to solve the task. Where available, it also contains a text field holding an Alpaca-style input. Metadata such as the programming language (lang) and topics (topics) is provided. topics lists the concepts used in the code (e.g. ML, neural networks, plotting, etc.). This is determined based on which kinds of libraries the code uses. This field can be used to obtain a subset of the data for specific tasks, such as data visualization.

I browsed through the comprehensive dataset and it even has instructions for playing a C-major chord.

Let’s load it using load_dataset function from Hugging Face.

Here’s what the dataset looks like. It’s represented as a DatasetDict.

Let’s convert it to pandas.

For this article, I only need the instruction and the output, so I’ll filter for those columns.

I’ll also be taking the last 3 rows of the dataset to test how well the fine-tune vs no fine-tune model performs

Here’s the LLM vs original responses for those 3 test instructions.

Two key observations:

  • the model outputs are too verbose; for the last example, it defines the data, the code, the output, an explanation, and even additional notes!
  • in some cases (not shown here), instead of writing code, it only provides steps

So let’s fine tune gemma to bias towards producing code.

Finetuning with LoRA

What is Low-Rank Adaptation (LoRA)?

LoRA is a technique that accelerates the fine-tuning of LLMs while consuming less memory.

It doesn't involve fine-tuning the whole base model, which can be huge and cost a lot of time and money.

It instead adds a small number of trainable parameters to the model while keeping the original model parameters frozen.

Why LoRA?


Even though we’re adding more layers to the model with LoRA, it actually helps save memory.

This is because the small matrices (A and B) have far fewer parameters to learn than the full model, and fewer trainable parameters mean fewer optimizer variables to store.

So, even though the overall model might seem bigger, it’s actually more efficient in terms of memory usage.

What is Rank?

Rank determines the dimensions of trainable matrices that are added to the original weights of the LLM. It controls the expressiveness and precision of the fine-tuning adjustment.

  • higher rank = more detailed changes possible, more trainable parameters
  • lower rank = less computational overhead, but potentially less precise adaptation
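
A quick back-of-envelope calculation makes this concrete. For a frozen weight matrix of shape (d, k), LoRA trains two small matrices A (d x r) and B (r x k), so the trainable parameter count is r * (d + k). The dimensions below are illustrative, not Gemma's exact shapes:

```python
d, k = 2048, 2048  # illustrative layer dimensions
for r in (4, 8, 64):
    full = d * k        # parameters in the full weight matrix
    lora = r * (d + k)  # trainable parameters added by LoRA
    print(f"rank {r}: {lora:,} LoRA params vs {full:,} full params")
```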

To understand LoRA better, check out these videos

Now for the most important ingredient in any AI problem: the data.

Prompt preparation

Here we create a template to format our prompt to fine-tune the model.
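
The exact template format below is an assumption on my part, modeled on common Gemma fine-tuning examples; what matters is that training and inference use the same one:

```python
# Wrap each (instruction, output) pair in a consistent prompt format
template = "Instruction:\n{instruction}\n\nResponse:\n{output}"

data = [
    template.format(instruction=row["instruction"], output=row["output"])
    for _, row in train_df.iterrows()
]
print(data[0])  # inspect one formatted training prompt
```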

Here’s what it looks like.

Now it’s time to set the rank.

We use a LoRA rank of 4.

Recall that a higher rank means more detailed changes are possible, but also means more trainable parameters, so more compute and time.
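
Enabling LoRA on the backbone is a single call in KerasNLP:

```python
# Attach rank-4 LoRA adapters; the original weights stay frozen
gemma_lm.backbone.enable_lora(rank=4)
gemma_lm.summary()
```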

Notice how enabling LoRA reduces the number of trainable parameters significantly (from 2.5 billion → 1.3 million).

Fine-tuning
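
Here's the training setup, reconstructed to match the description below (the specific hyperparameter values are illustrative):

```python
# Limit input sequence length to control memory usage
gemma_lm.preprocessor.sequence_length = 512

# AdamW: Adam with decoupled weight decay
optimizer = keras.optimizers.AdamW(
    learning_rate=5e-5,
    weight_decay=0.01,
    beta_1=0.9,
    beta_2=0.999,
)
# Don't penalize bias and LayerNorm scale variables with weight decay
optimizer.exclude_from_weight_decay(var_names=["bias", "scale"])

gemma_lm.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=optimizer,
    weighted_metrics=[keras.metrics.SparseCategoricalAccuracy()],
)
```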

Let’s look at this line by line

  • the sequence length controls memory usage by limiting the length of the input sequences the model processes
  • The AdamW optimizer is an extension of the Adam optimizer with decoupled weight decay
  • The learning_rate parameter controls the step size at each iteration, weight_decay adds a penalty to the loss function to encourage smaller weights, and beta_1 and beta_2 are hyperparameters that control the decay rates of the moving averages.
  • exclude_from_weight_decay(var_names=["bias", "scale"]): This line excludes the "bias" and "scale" variables from weight decay. This means that these variables will not be penalized during training, which can be useful for certain layers like LayerNorm.
  • in compile() we define the loss function: Sparse Categorical Crossentropy, an extension of Categorical Crossentropy where, instead of the target being one-hot encoded [0, 0, 1], it expects an int corresponding to an index, e.g. 2 in (0, 1, 2)
  • The weighted_metrics argument specifies that the model should track the sparse categorical accuracy metric, which measures the accuracy of the model's predictions.

And now we start fitting the model on the data.
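
With the formatted samples from earlier, that's one line (batch size 1 is an assumption to stay within GPU memory):

```python
# One epoch over the formatted instruction/response prompts
gemma_lm.fit(data, epochs=1, batch_size=1)
```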

Here’s a peak at the 100% GPU P100 utilization on Kaggle

Screenshot of my Kaggle notebook.

Once the training is done, we save the model.
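
With Keras 3 this can be as simple as the following (the filename is arbitrary, but Keras expects the .weights.h5 suffix):

```python
# Persist the fine-tuned weights for later inference
gemma_lm.save_weights("gemma_ds_coder.weights.h5")
```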

And now for the fun part, inference.

Inference

I created a utility function to generate an output given an instruction.
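
Here's a sketch of such a helper, reusing the training template so the model sees a familiar format (the prompt-stripping logic is a simple assumption):

```python
def generate_response(instruction, max_length=512):
    # Format the instruction exactly like the training prompts
    prompt = template.format(instruction=instruction, output="")
    output = gemma_lm.generate(prompt, max_length=max_length)
    # Drop the prompt scaffolding, keep only the model's response
    return output.replace(prompt, "").strip()

print(generate_response("Write pandas code to drop rows with missing values."))
```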

Seen example

This particular example is really detailed and showcases the value of LLMs when you give them a big task with multiple steps, as opposed to a vague instruction like "Write PyTorch code to train a CNN model", which wouldn't be as useful in real-life applications.

Our test sample

Now let’s test it on the 3 instructions we had at the beginning.

Notice how much more concise the model is!

Conclusion

The outputs are more concise, especially compared to the model without fine-tuning.

It’s important to remember that we only fine-tuned this model using 1000 samples for just one epoch and with a low LoRA rank value, so there’s a lot of room for improvements.

Here are some tips:

  1. Increase the size of the fine-tuning dataset.
  2. Train for more steps (epochs).
  3. Set a higher LoRA rank.
  4. Modify hyperparameter values such as learning_rate and weight_decay.
  5. Try the larger version of Gemma (7B).
  6. Increase sequence_length.
  7. Experiment with advanced prompt engineering techniques.
  8. Use data augmentation to increase the number of samples.

More Resources

Google AI for developers

KerasNLP

PyTorch

Finetuning

Thanks for reading!

Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our Discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
