How to Build a Dataset for LLM Fine-tuning

Building the right dataset makes all the difference in LLM fine-tuning performance. Here's how to build one.

A step-by-step guide to building a dataset for LLM fine-tuning

Fine-tuning a Large Language Model (LLM) can greatly enhance its performance on specialized tasks. A critical step in this process is crafting a high-quality dataset. At MonsterAPI, we provide a comprehensive set of tools to simplify and optimize dataset creation for fine-tuning.

This blog will guide you through leveraging MonsterAPI to build and refine datasets tailored to your LLM needs efficiently.

What is an LLM Dataset?

An LLM dataset is a curated collection of text used to train and fine-tune large language models. It includes various text samples, such as questions, answers, documents, or conversation snippets, tailored to specific tasks or domains.

The quality and relevance of these datasets are crucial, as they directly impact the performance and accuracy of the fine-tuned model.

Different Kinds of Datasets You Can Use to Fine-Tune LLMs

  1. Text Classification Datasets: These datasets are used to train models to categorize text into predefined categories. Examples include sentiment analysis, topic classification, and spam detection.
  2. Text Generation Datasets: These datasets consist of prompts and corresponding completions, ideal for training models to generate coherent and contextually relevant text.
  3. Summarization Datasets: These include long-form documents paired with concise summaries, useful for training models to generate or refine summaries.
  4. Question-Answering Datasets: These datasets consist of questions paired with their correct answers, often sourced from FAQs, customer support dialogues, or knowledge bases.
  5. Masked Language Modeling Datasets: These datasets are used for training models with masked language modeling (MLM) objectives: certain parts of the text are masked (hidden), and the model is trained to predict the masked words or tokens. This technique is fundamental to pre-training models like BERT, where the model learns contextual representations by predicting the masked tokens. MLM datasets often consist of large volumes of general or domain-specific text, allowing the model to understand and generate contextually relevant content.
  6. Instruction Fine-Tuning Datasets: Instruction fine-tuning trains models to follow specific instructions or prompts given by the user. These datasets typically consist of pairs of instructions and corresponding responses, guiding the model to understand and execute the instructions accurately. For example, the dataset might include the prompt "Translate the following sentence into French" followed by the correct translation (a sample record is sketched just after this list). This type of fine-tuning is crucial for enhancing the model's ability to follow user commands and generate more accurate, context-sensitive outputs. MonsterAPI’s tools can help you create and manage these specialized datasets, ensuring your model is fine-tuned for specific instructional tasks.
  7. Conversational Datasets: Used for training dialogue models, these datasets include conversations between users and systems or between multiple users.
  8. Named Entity Recognition (NER) Datasets: These datasets are used to train models to recognize and categorize entities like names, dates, locations, and other specific terms.
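
To make the instruction fine-tuning format concrete, here is a minimal sketch of a single record saved as JSON Lines. The field names (instruction, input, output) are illustrative assumptions; the exact columns depend on the fine-tuning setup you configure.

import json

# Hypothetical record layout for an instruction fine-tuning dataset
records = [
    {
        "instruction": "Translate the following sentence into French.",
        "input": "The weather is nice today.",
        "output": "Il fait beau aujourd'hui."
    },
]

# Each line of a JSONL file holds exactly one record
with open("instruction_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")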

Ways to Prepare a Dataset for LLM Fine-Tuning

1. Data Augmentation

Data augmentation involves expanding your existing dataset by generating additional data points. In the context of fine-tuning LLMs, this could mean creating more diverse instruction pairs or adding variations to your data that improve model performance. Augmenting data is especially useful when:

  1. Your dataset is small or lacks diversity.
  2. You want to enhance model generalization by introducing different phrasing, examples, or instructions.
  3. You’re looking to improve the model’s ability to handle edge cases by increasing variability in the dataset.

When Should You Use Data Augmentation?

  1. Limited Data Availability: If you have a small dataset and need to boost its size without collecting new data manually.
  2. Performance Improvements: If your model struggles with specific tasks due to repetitive or homogeneous examples, augmenting the data with more diverse instructions can help.
  3. Fine-Tuning for Specific Use Cases: When your goal is to train a model for a niche use case where original data is scarce.

Steps to Augment Data using MonsterAPI

1. Select the Dataset to Augment: First, choose a dataset from a source like Hugging Face. Load and visualize it to understand the prompt structure.

import datasets
import pandas as pd

dataset = datasets.load_dataset('<hf_dataset_id>')  # load the source dataset from Hugging Face
df = pd.DataFrame(dataset['test'])
df.head()

2. Configure the API Request: Set up your OpenAI API key and MonsterAPI access token, and then configure the augmentation request. This will specify how the data should be evolved—either by creating new instructions or generating a preference dataset.

import requests

# MONSTERAPI_KEY is your MonsterAPI access token
body = {
  "data_config": {
    "data_path": "<hf_dataset_id>",
    "data_subset": None,
    "prompt_column_name": "prompt",
    "data_source_type": "hub_link",
    "split": "test"
  },
  "task": "evol_instruct",
  "generate_model1_name": "gpt-3.5-turbo",
  "generate_model2_name": "gpt-3.5-turbo",
  "judge_model_name": "gpt-3.5-turbo",
  "num_evolutions": 4,
  "openai_api_key": "YOUR_API_KEY"
}

headers = {'Authorization': f'Bearer {MONSTERAPI_KEY}'}
response = requests.post('https://api.monsterapi.ai/v1/generate/data-augmentation-service', json=body, headers=headers)

3. Monitor the Process: Use the process ID from the response to track the augmentation status until it’s completed. You can retrieve the augmented dataset once the process is finished.

import time

# Poll the augmentation job until it completes; process_id comes from the response of the request above
def check_status(process_id):
    response = requests.get(f"https://api.monsterapi.ai/v1/status/{process_id}", headers=headers)
    return response.json()['status']

status = check_status(process_id)
while status != 'COMPLETED':
    status = check_status(process_id)
    print(f"Process status: {status}")
    time.sleep(10)

# Retrieve the results once the job has completed
response = requests.get(f"https://api.monsterapi.ai/v1/status/{process_id}", headers=headers)
results = response.json()['result']
output_path = results['output'][0]

4. Download and Use the Augmented Dataset: Once completed, you can download the dataset and integrate it into your fine-tuning workflow.

# Load the augmented CSV produced by the service and preview it
output_ds = datasets.load_dataset('csv', data_files=output_path)
pd.DataFrame(output_ds['train']).head()

Why Data Augmentation is Helpful

  • Diversity: By generating varied examples, you introduce more diverse data points, which helps in training a robust model.
  • Efficiency: You save time and effort by automating the generation of new data, rather than manually collecting and labeling it.

  • Cost-Effective: Compared to gathering real-world data, augmentation through services like MonsterAPI is cheaper and quicker.

2. Synthesize the Instruction Dataset

Synthesizing an instruction dataset means generating custom instruction-response pairs from scratch. This is useful when you want your model to learn from specific types of instructions, especially if real-world examples are scarce.

When Should You Use It?

  1. Targeted Fine-Tuning: When you want the model to focus on specific instructions relevant to your domain.
  2. Custom Instructions: When existing datasets do not provide the specific scenarios you need, synthesized datasets allow you to generate exactly what you require.
  3. Scenario Training: For training your model on hypothetical scenarios that might not be available in standard datasets.

Steps to Synthesize Instruction Datasets:

1. Define the Instruction Type: Start by deciding what kind of instructions you want to generate (e.g., questions, commands, summaries).

synthesize_config = {
    "instruction_type": "questions",
    "num_samples": 500,
    "language": "en"
}

2. Set Up the API Request: Use MonsterAPI to generate the synthesized instructions.

body = {
    "task": "synthesize_instructions",
    "config": synthesize_config,
    "openai_api_key": "YOUR_API_KEY"
}

response = requests.post('https://api.monsterapi.ai/v1/generate/synthesize-dataset', json=body, headers=headers)

3. Track and Retrieve the Dataset: Use the process ID to monitor and download the synthesized dataset.

4. Incorporate the Dataset: Integrate the synthesized dataset into your fine-tuning pipeline.
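
As with augmentation, the same status-polling pattern shown earlier can be reused for steps 3 and 4. A minimal sketch, assuming the synthesis response likewise returns a process ID and that the completed job exposes a downloadable CSV output path:

import time

# Assumes process_id comes from the synthesis request's response,
# and that the completed job exposes an output path like the augmentation service does
status = check_status(process_id)
while status != 'COMPLETED':
    status = check_status(process_id)
    print(f"Process status: {status}")
    time.sleep(10)

response = requests.get(f"https://api.monsterapi.ai/v1/status/{process_id}", headers=headers)
output_path = response.json()['result']['output'][0]

# Load the synthesized instructions and plug them into your fine-tuning pipeline
synthesized_ds = datasets.load_dataset('csv', data_files=output_path)
pd.DataFrame(synthesized_ds['train']).head()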

How is it Helpful?

  • Customized Training: It allows you to generate domain-specific instructions tailored to your model.
  • Enhanced Generalization: By training on more varied and unique scenarios, your model becomes more versatile.

If you don’t have a proper dataset, you can refer to the accompanying Colab notebook, which explains how to create a dataset out of a PDF file.
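
As a rough, hypothetical illustration of that idea (not MonsterAPI's own method), you could extract the text of a PDF with the pypdf library, split it into chunks, and save each chunk as a prompt record that an instruction-synthesis step can later expand into question-answer pairs:

import json
from pypdf import PdfReader  # assumes pypdf is installed

# Extract raw text from every page of the PDF
reader = PdfReader("document.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

# Split into fixed-size chunks; a real pipeline would chunk along sentence or section boundaries
chunk_size = 1000
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Save each chunk as a prompt record for later instruction synthesis
with open("pdf_prompts.jsonl", "w") as f:
    for chunk in chunks:
        f.write(json.dumps({"prompt": chunk}) + "\n")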

3. Custom Dataset

A custom dataset is one that you create or curate specifically to meet your fine-tuning requirements. It usually includes information relevant to your specific use case, such as domain-specific text or specialized examples.

When Should You Use It?

  1. Domain-Specific Fine-Tuning: When you have unique data that isn’t available in existing datasets.
  2. Proprietary Data: When you need to train the model on confidential or sensitive information that cannot be shared.
  3. Performance Tuning: When existing datasets don’t fully align with your task, and you need precise control over the data.

Steps to Prepare a Custom Dataset:

  • Upload your dataset in formats like JSON, CSV, or Parquet.
  • Select the task type (e.g., text classification or summarization).
  • Configure the dataset columns (e.g., specifying which column represents the prompt).

MonsterAPI’s FineTuner will take care of the rest, preparing the dataset for fine-tuning.
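
As a small, hedged illustration, a custom dataset file can be assembled with pandas before uploading; the column names below are examples and should match whatever you configure in the FineTuner.

import pandas as pd

# Example domain-specific records; the prompt/completion column names are illustrative
df = pd.DataFrame([
    {"prompt": "Summarize the patient's discharge note.", "completion": "The patient was discharged in stable condition..."},
    {"prompt": "Classify this support ticket: 'My invoice is wrong.'", "completion": "billing"},
])

# Save in one of the supported formats (CSV, JSON, or Parquet) for upload
df.to_csv("custom_dataset.csv", index=False)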

How is it Helpful?

  • Tailored Solutions: Your custom dataset ensures that the model is specifically trained for your needs.
  • Flexibility: You have complete control over the content and structure of the dataset, optimizing it for performance.

4. Hugging Face Datasets

Hugging Face hosts a wide range of pre-existing datasets that can be directly used for training or fine-tuning models. These datasets cover various domains like language translation, question answering, summarization, and more.

When Should You Use Them?

  1. Quick Start: If you need a dataset immediately and don’t want to spend time creating one.
  2. Standard Benchmarks: When you need a well-known dataset to benchmark your model’s performance.
  3. Large-Scale Fine-Tuning: Hugging Face offers diverse and extensive datasets, making it easy to find one that suits your needs.

Steps to Use Hugging Face Datasets:

  • Select the task type and choose “Hugging Face Datasets” as your dataset source.
  • Provide the dataset’s path or select from the pre-listed options.
  • Customize the prompt configuration based on your dataset’s columns.

No additional steps are needed if you’re using a pre-curated dataset from Hugging Face.

Note: The Hugging Face dataset path should be formatted like username/dataset-name (e.g., user123/sample-dataset). Do not use the full web URL.
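
Before pointing the FineTuner at a Hugging Face dataset, it can help to load it locally and check which columns hold the prompts and responses. A quick sketch using the hypothetical dataset ID from the note above:

import datasets

# Use the short dataset path (username/dataset-name), not the full web URL
ds = datasets.load_dataset("user123/sample-dataset")  # hypothetical dataset ID

# Inspect the splits and column names to decide the prompt configuration
print(ds)
print(ds["train"].column_names)  # assumes the dataset has a 'train' split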

How is it Helpful?

  • Ready-to-Use: You save time by using pre-built datasets instead of creating your own.
  • Diverse Options: There’s a wide range of datasets covering different domains and languages, offering flexibility.
  • Community Support: Many Hugging Face datasets are well-maintained and come with active community support, making it easier to resolve issues.

Conclusion

Building the appropriate dataset for LLM fine-tuning does not have to be difficult. Depending on your requirements, MonsterAPI's various tools and methods can help you prepare, augment, or even create datasets from scratch. It streamlines the process of creating high-quality datasets, whether you are using custom data, synthesized instructions, or existing Hugging Face datasets.

FAQs

Q: Can I use a dataset from another platform?

A: Yes, you can import datasets from Hugging Face or upload your own in common formats like JSON or CSV.

Q: How long does the augmentation process take?

A: The processing time varies based on dataset size and model complexity but is typically completed in 5-10 minutes.

Q: Which dataset format is best for fine-tuning?

A: MonsterAPI supports JSON, JSONL, CSV, and Parquet. Choose based on your data’s structure and complexity.

Q: Can I fine-tune a model without a large dataset?

A: Yes, by using augmentation or instruction synthesis, you can enhance smaller datasets and still achieve effective fine-tuning.

Q: How long does the dataset preparation process take?

A: It depends on the dataset size and the chosen method. However, MonsterAPI’s automation and API services aim to streamline the process.

Q: Can I use a private Hugging Face dataset for fine-tuning?

A: Yes, you can use private datasets from Hugging Face by providing a key with read permissions.

Q: How can I generate a custom dataset if I don’t have one?

A: MonsterAPI’s Instruction Dataset Synthesis API allows you to generate datasets using pre-trained models, simplifying the data preparation process.

For more information and to explore MonsterAPI’s features, visit MonsterAPI or MonsterAPI Developer.