Ever since Large Language Models came into widespread use, one common question has been how to customize them to meet our needs.
Although effective prompting techniques such as one-shot and few-shot learning have been devised for certain scenarios, sometimes we need to go a step further and retrain the model so that its outputs match what we require. This retraining is known as fine-tuning, and we will explore it in detail in this article.
What Is Fine-Tuning?
We know that OpenAI offers a very wide catalog of LLMs, which have been trained on a broad variety of topics and can solve many tasks. However, there are times when we need outputs that follow a certain style or are specialized in a single topic, and this is where fine-tuning proves to be the best option. Fine-tuning involves taking a pre-trained base model and giving it additional training through a set of examples of how we would like the generated responses to look.
You might ask yourself, why not use prompting techniques like one-shot or few-shot learning? Well, with fine-tuning the model's internal weights are modified, which allows us to change its tone or output format. With prompting techniques, we have to send samples of how we want the model to respond with every request, which increases token consumption and raises the cost of each interaction.
When Is It Advisable to Use Fine-Tuning?
As mentioned earlier, fine-tuning is advisable when we want to customize a model's outputs, for example, when we need an institutional document to adhere to certain pre-established standards or when we want an article to stick to a specific tone. It is also useful when we need output that follows a specific syntax or format, such as generating SVG files.
On the other hand, it is important to mention that fine-tuning is not the answer to every problem. Some scenarios where we definitely should not use it include adding new information to the model, in which case it is better to use the RAG pattern. Similarly, fine-tuning does not add advanced reasoning; if we want that, it is better to select a model that already offers it. Finally, when only minor style changes are needed, prompting techniques are a better fit.
Fine-Tuning Methods Available in OpenAI
Although there are various ways to perform fine-tuning with open-source and proprietary models, we will focus on the three fine-tuning methods available in OpenAI, as they are among the easiest to use and access.
Supervised Fine-Tuning (SFT) Method
This is the easiest fine-tuning method to implement, as it is based on providing examples of input-output pairs through a JSONL file (where each line is a JSON object). The examples in this file should have inputs and outputs that are as realistic as possible, meaning carefully selected examples taken from conversation history, provided by subject-matter experts, pulled from logged data, etc., with the goal of obtaining the best results after training. To use this method, a minimum of 10 examples is needed, although better results are usually achieved with 50 to 100 examples.
Some use cases for this type of training are:
- Adjustments in the style of responses
- Correct classification based on options
- Cultural or language adaptation
- Extraction of structured information
Each line of the training file should look similar to the following example:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a text classifier. Read the user message and answer ONLY with one label: positive, negative, or neutral."
    },
    {
      "role": "user",
      "content": "Text: \"The product arrived broken and support is not responding.\" What is the sentiment?"
    },
    {
      "role": "assistant",
      "content": "negative"
    }
  ]
}
In the previous object, you can see how the system, user and assistant roles are simulated, with each content field providing training text that guides the model on how we would like it to respond to the user's question.
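Once the JSONL file is ready, the training job can also be created programmatically instead of through the portal. The following is a minimal sketch using the openai Python SDK, assuming the examples above are saved to a file named sentiment_train.jsonl and that the OPENAI_API_KEY environment variable is set; the base model and suffix are illustrative choices:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL file with the training examples
training_file = client.files.create(
    file=open("sentiment_train.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the supervised fine-tuning job on top of a base model
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",  # base model; adjust as needed
    training_file=training_file.id,
    suffix="sentiment-classifier",   # helps identify the resulting model
)

print(job.id, job.status)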
Direct Preference Optimization (DPO) Method
The second fine-tuning method available in OpenAI is Direct Preference Optimization, which guides the model to learn from human preferences by favoring the responses we like more. This is achieved by providing a prompt along with a pair of responses: one preferred and one non-preferred.
This method is ideal for the following cases:
- Removing responses that seem to be generated by AI models
- Reducing hallucinations
- Content moderation
- Product or content recommendations
For this type of training, you need a series of JSONL lines like the following, each containing the prompt alongside a preferred output and a non-preferred output:
{
  "input": {
    "messages": [
      {
        "role": "user",
        "content": "The product arrived broken and support is not responding."
      }
    ],
    "tools": [],
    "parallel_tool_calls": true
  },
  "preferred_output": [
    {
      "role": "assistant",
      "content": "negative"
    }
  ],
  "non_preferred_output": [
    {
      "role": "assistant",
      "content": "The sentiment is negative."
    }
  ]
}
Before fine-tuning with this technique, it is recommended to first fine-tune the base model using SFT so that it already produces quality responses, and then run a second fine-tuning pass on that model using the DPO method.
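As a rough sketch of what this looks like through the openai Python SDK, the job is created with a method block of type dpo on top of the previously SFT-tuned model. The file name, model id and beta value below are placeholders, and the exact payload shape should be verified against OpenAI's current fine-tuning documentation:

from openai import OpenAI

client = OpenAI()

# Upload the JSONL file with the preferred / non-preferred pairs
dpo_file = client.files.create(
    file=open("sentiment_dpo.jsonl", "rb"),
    purpose="fine-tune",
)

# Run DPO on top of a previously SFT-tuned model (placeholder id)
job = client.fine_tuning.jobs.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org:sentiment-classifier:abc123",
    training_file=dpo_file.id,
    method={
        "type": "dpo",
        # beta controls how strongly the preference signal is weighted
        "dpo": {"hyperparameters": {"beta": 0.1}},
    },
)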
Reinforcement Fine-Tuning (RFT) Method
I consider this last type of training to be one of the most powerful, as it allows for expert-level performance within a domain. Unlike the previous methods, this type of fine-tuning is only available for reasoning models.
The training is based on generating a response to a prompt, which is then evaluated by one or more graders that assign it a score. This allows the model's weights to be adjusted so that responses with higher scores are preferred over those with lower scores.
In OpenAI's documentation, we can find several graders, including string comparison, text similarity, using another LLM to score generated responses, and label classification, among others.
Some use case scenarios for this method are:
- Solving complex domain tasks requiring advanced reasoning
- Autonomous agents that need to make complex decisions
- Scenarios where strict rules are defined
- Cases where responses may vary according to context
Following the example in OpenAI's documentation, let's assume we have the following question:
Do you have a dedicated security team?
Furthermore, we want the model to respond with a JSON object that has two keys:
{
  "compliant": "yes",
  "explanation": "A dedicated security team follows strict protocols for handling incidents."
}
The possible values for compliant are yes, no or needs review, while explanation is text that, based on a policy document, explains why the question is or is not covered by the policy.
To evaluate the model's responses in the example, we use two graders: string_check and score_model. The first validates that the compliant property contains one of the valid options, while the second uses an LLM to evaluate the quality of the generated response:
{
  "type": "multi",
  "graders": {
    "explanation": {
      "name": "Explanation text grader",
      "type": "score_model",
      "input": [
        {
          "role": "user",
          "type": "message",
          "content": "# Overview: Evaluate the accuracy of the model-generated answer..."
        }
      ],
      "model": "gpt-4o-2024-08-06"
    },
    "compliant": {
      "name": "compliant",
      "type": "string_check",
      "reference": "{{item.compliant}}",
      "operation": "eq",
      "input": "{{sample.output_json.compliant}}"
    }
  },
  "calculate_output": "0.5 * compliant + 0.5 * explanation"
}
In the previous JSON object, you can see that after defining the graders, a calculate_output expression is also specified, which combines the scores from both graders with equal weight to produce the reward.
The next step is to define a training set, which the graders will use to evaluate the responses generated during training. Each of its lines has the following format:
{
  "messages": [
    {
      "role": "user",
      "content": "Do you have a dedicated security team?"
    }
  ],
  "compliant": "yes",
  "explanation": "A dedicated security team follows strict protocols for handling incidents."
}
You can see that the previous object contains, in addition to the messages array, the compliant and explanation properties, which serve as reference values during fine-tuning. Finally, a test set is also defined so that the outputs can be evaluated once the model has completed its training.
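For reference, creating the reinforcement fine-tuning job programmatically looks roughly like the sketch below, which loads the grader configuration shown earlier and attaches it through the method parameter. This is only an outline under the assumption that the grader is stored in grader.json and the datasets in the files named below; the exact field names should be checked against OpenAI's reinforcement fine-tuning documentation:

import json
from openai import OpenAI

client = OpenAI()

# Grader configuration saved from the JSON shown earlier
with open("grader.json") as f:
    grader = json.load(f)

# Training and test sets in JSONL format
training_file = client.files.create(file=open("compliance_train.jsonl", "rb"), purpose="fine-tune")
validation_file = client.files.create(file=open("compliance_test.jsonl", "rb"), purpose="fine-tune")

# Reinforcement fine-tuning is only available for reasoning models
job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",  # placeholder reasoning model
    training_file=training_file.id,
    validation_file=validation_file.id,
    method={
        "type": "reinforcement",
        "reinforcement": {"grader": grader},
    },
)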
Fine-Tuning Exercise in OpenAI
To ensure that the previous explanations don't remain purely theoretical, let's do a practical exercise. Suppose we are teachers at a school and we need to generate correct and incorrect answers to quickly create exams. To achieve this, we are going to fine-tune a model by showing it how we would like to extract the answers, which in this case will consist of one to two correct answers out of four possible options.
Let's start by defining the training file, which contains lines like the following:
{"messages":[{"role":"system","content":"You are a generator of single-choice or multiple-choice answers. Your task is to produce four possible responses to a question, with no more than two of them being correct. You must include a checkbox before each option indicating whether it is a correct answer."},{"role":"user","content":"Question: Which is the largest planet in the Solar System?"},{"role":"assistant","content":"- [X] Jupiter\n- [ ] Mars\n- [ ] Venus\n- [ ] Mercury"}]}
In the previous code, you can see how we included instructions for generating the response in the system role, as well as a general question and the desired output format, where we indicate that the correct answer should be marked with an [X], leaving the incorrect ones with an empty box [ ].
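Before uploading the file, it is worth checking that every line is valid JSON and follows the expected structure. The short script below is a simple sketch of that check (the file name quiz_train.jsonl is just an example):

import json

# Verify that each line of the JSONL file parses and contains the three expected roles
with open("quiz_train.jsonl", encoding="utf-8") as f:
    for number, line in enumerate(f, start=1):
        record = json.loads(line)  # raises an error if the line is not valid JSON
        roles = [message["role"] for message in record["messages"]]
        assert roles == ["system", "user", "assistant"], f"Unexpected roles on line {number}: {roles}"

print("Training file looks well formed.")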
Next, you can head over to the OpenAI fine-tuning portal, where you need to click on the + Create button, which will show you a window like the following:
In the previous window, the available options are:
- Method: Select among the Supervised, Direct Preference Optimization or Reinforcement fine-tuning methods, which we discussed earlier.
- Base Model: The base model to be used for fine-tuning.
- Suffix: A suffix that will be added to the model's name, which will help you identify the training.
- Seed: Controls the randomness of a training session so that it can be reproducible.
- Training data: This is where you select the JSONL file with the training data.
- Validation data: Allows you to specify a JSONL file to validate how well the model is learning.
- Configure hyper parameters: These are adjustments you can select for the training process, including the number of examples used before updating the weights (Batch size), the speed of weight adjustments (Learning rate multiplier) and the number of times the model goes through the dataset (Number of epochs). These can be left on auto for the platform to decide the best values.
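The same hyperparameters can also be set when creating the job through the API. The snippet below is a sketch of how Batch size, Learning rate multiplier and Number of epochs map to the supervised method block; the file ids, model and values are placeholders and, as in the portal, the hyperparameters can be omitted to let the platform choose them automatically:

from openai import OpenAI

client = OpenAI()

job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file="file-abc123",    # id of the uploaded training JSONL
    validation_file="file-def456",  # id of the uploaded validation JSONL (optional)
    suffix="quiz-generator",
    method={
        "type": "supervised",
        "supervised": {
            "hyperparameters": {
                "n_epochs": 3,                    # Number of epochs
                "batch_size": 4,                  # Batch size
                "learning_rate_multiplier": 1.8,  # Learning rate multiplier
            }
        },
    },
)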
With the previous information ready, just click the Create button to start creating the fine-tuned model. It is important to note that if there is any issue during the process, we will be notified immediately with the cause of the error so that we can fix it, as shown below:
On the other hand, if the training was successfully executed, we will see information about the new model as shown below:
The interface also shows a set of checkpoints, which are copies of the model created at certain points during training and can be used for testing, along with graphs showing Loss and Accuracy:
In the previous image, we see how Loss decreases, which means the model's predictions are improving as training progresses. On the other hand, Accuracy increases, which means the model makes fewer mistakes and provides correct answers more frequently as training advances.
Once we have the new model ready, we can test it in the playground:
In the previous image, we see a response generated by a general-use model on the left, while on the right, we see the response generated by the fine-tuned model, showing the correct format according to the training data.
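The fine-tuned model can also be called from code by using the name generated for it (the ft: identifier shown in the portal). The following sketch assumes a placeholder model id and reuses the same system prompt that was used during training, which generally gives the best results:

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org:quiz-generator:abc123",  # placeholder id
    messages=[
        {
            "role": "system",
            "content": "You are a generator of single-choice or multiple-choice answers. Your task is to produce four possible responses to a question, with no more than two of them being correct. You must include a checkbox before each option indicating whether it is a correct answer.",
        },
        {"role": "user", "content": "Question: Which planet is known as the red planet?"},
    ],
)

print(completion.choices[0].message.content)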
Conclusion
Undoubtedly, fine-tuning is a powerful technique that can help you adapt AI models to produce the outputs you need. In this article, we explored the methods available in OpenAI, although there are many other methods you can use with different types of models, including those that generate images or other types of content.
As a final point, I recommend using fine-tuning only if you are sure you need it and have confirmed that prompting techniques are not suitable for your scenario. Although the costs of fine-tuning have decreased over time with the introduction of nano and mini models, keep in mind that a good training run requires time spent gathering, cleaning, formatting and labeling the training dataset, which can be costly in terms of time and effort.
Héctor Pérez
Héctor Pérez is a Microsoft MVP with more than 10 years of experience in software development. He is an independent consultant, working with business and government clients to achieve their goals. Additionally, he is an author of books and an instructor at El Camino Dev and Devs School.