How to Generate Synthetic Data and Fine-Tune a Small Language Model (SLM) On MonsterAPI

Synthetic data lets you create task-specific instruction datasets at scale, with full control over quality, diversity, and formatting. In this blog, we'll show how to generate high-quality synthetic data and use it to fine-tune an SLM (Phi-4) on MonsterAPI.


Introduction


Training language models to perform specific tasks often relies on access to high-quality, task-aligned datasets. However, real-world datasets come with significant limitations—limited availability, licensing restrictions, noise, and misalignment with intended use cases. This is where synthetic data has proven to be a game-changer.

Synthetic data allows you to generate task-specific instruction datasets at scale, with complete control over quality, diversity, and formatting. Whether you're building a chatbot, a coding assistant, or a summarization engine, synthetic data can help you tailor model behaviour precisely to your needs.

In this blog, we’ll focus on how to generate high-quality synthetic instruction data in a scalable way, and how to use that data to fine-tune a Small Language Model (SLM). We’ll then walk through a complete fine-tuning process using Phi-4, a lightweight open-weight SLM, on MonsterAPI.

What Is Synthetic Data?

Synthetic data refers to artificial data generated programmatically rather than being collected from real-world interactions. In the context of language models, it typically takes the form of instruction-style datasets made up of input-output pairs that simulate real user prompts and corresponding responses. This method has become foundational for improving model performance in a structured, scalable, and legally safe way.

The power of synthetic data lies in its flexibility and customizability. Developers can define the types of tasks a model should learn—such as reasoning, summarization, coding, or domain-specific queries—and generate diverse, high-quality examples tailored to those tasks. These examples are not scraped or annotated from external sources but are generated by prompting large instruction-following models or using templates. This enables teams to guide the model’s learning based on their specific use case.

Key advantages of synthetic data:

  • Task alignment: You can generate data that directly reflects your product's goals.
  • Scalability: With automation, it's easy to produce thousands of examples quickly.
  • Custom control: You control the domain, difficulty, length, and structure of the data.
  • Privacy and licensing freedom: No concerns about sensitive data or copyright.

Synthetic data has proven to be a powerful alternative to manually curated datasets. It enables a feedback loop of continuous improvement: fine-tune a model, observe its weaknesses, generate new synthetic examples targeting those gaps, and fine-tune again. This approach is especially effective for maximizing performance on your target tasks.

Varieties of Synthetic Data

Synthetic data can take many forms, each tailored to different AI applications. The three most common types are text, tabular, and media data, each serving distinct modelling needs.

Text Data

Synthetic text data is widely used in NLP tasks like summarization, sentiment analysis, question answering, and code generation. It is typically generated using large language models through prompting or instruction tuning. Text data allows teams to simulate real-world user inputs and outputs without needing access to sensitive or proprietary datasets. It’s highly scalable, customizable by domain, and helps in bootstrapping task-specific training sets. However, maintaining natural language fluency and factual correctness remains a challenge, especially in tasks that require complex reasoning or specialized knowledge.

Tabular Data

Tabular synthetic data mimics structured datasets like spreadsheets or database tables and is used for training models in tasks such as classification, regression, and anomaly detection. It is commonly applied in domains like finance, healthcare, and retail—where privacy concerns restrict access to real data. Methods like GANs, probabilistic models, or diffusion-based approaches are used to generate synthetic rows that preserve the statistical properties and correlations of real data. While it's useful for privacy-safe data augmentation and model testing, care must be taken to maintain realistic patterns and avoid introducing bias.

Media Data (Image, Video, Audio)

Synthetic media includes artificially generated images, videos, and sounds used in computer vision, speech, and multimodal applications. In vision tasks, data is often generated using 3D simulation engines or generative models to simulate conditions like lighting, weather, or rare scenarios (e.g., for autonomous vehicles). Synthetic audio is used in speech recognition and voice generation, offering diverse accents, tones, or emotional variations. The main benefit of media data is its ability to generate well-annotated, diverse examples at scale. However, domain mismatch between synthetic and real-world data can impact model performance if not addressed through proper validation or fine-tuning.

Techniques Used in Synthetic Data Generation

Synthetic data generation is not a one-size-fits-all process. Depending on your model size, use case, domain, and available resources, you may need to use different techniques—or combine several.

Below are the most widely adopted and effective techniques used in the field, along with practical insights on when and how to use them.

Meta-Prompting

Meta-prompting is one of the foundational techniques in synthetic data generation. It involves using a high-performing instruction-following model, such as GPT-4, Claude, or a strong open-weight alternative, to generate instructions themselves—rather than input-output examples. The idea is to "prompt the model to write prompts." A carefully constructed meta-prompt is issued to the model asking it to generate a list of diverse, domain-specific instructions for a given task category. For instance, one might ask, “Generate 10 instructions for summarization tasks related to medical records.” The output will be a list of instructions like “Summarize the patient's discharge summary” or “Generate a brief overview of the following clinical note.” This method allows the creation of hundreds or even thousands of unique instructions across categories such as question answering, classification, code generation, and more. The strength of meta-prompting lies in its scalability and flexibility—it enables quick expansion of task diversity without requiring manual authoring. However, the quality of generated instructions depends heavily on the base model’s instruction-following ability and may require filtering to eliminate vague or redundant entries.
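
As a rough sketch (not MonsterAPI-specific), meta-prompting can be implemented with a few lines against the OpenAI Python client; the model name, prompt wording, and line-based parsing below are illustrative assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Meta-prompt: ask the model to write task instructions, not answers.
meta_prompt = (
    "Generate 10 diverse instructions for summarization tasks "
    "related to medical records. Return one instruction per line."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": meta_prompt}],
    temperature=0.9,  # higher temperature encourages instruction diversity
)

# Split the raw completion into individual instruction strings.
instructions = [
    line.strip("-• ").strip()
    for line in response.choices[0].message.content.splitlines()
    if line.strip()
]
print(instructions)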

Two-Pass Prompting

Once a set of instructions has been generated, the next logical step is to complete these into full training triplets: instruction, input, and output. Two-pass prompting is a structured method where the instruction is first created (often via meta-prompting), and then the model is prompted again—with the instruction as context—to generate the input and output pair. For example, if the instruction is “Classify the sentiment of the following review,” the second prompt would ask the model to provide a suitable review text and the corresponding sentiment label. This method ensures better alignment between the instruction and its associated input and output, resulting in coherent and realistic examples that more closely reflect user queries in production environments. Compared to one-shot generation, where instruction and I/O are produced in a single pass, the two-pass approach allows greater control and often leads to higher-quality datasets. While it is slower and computationally more intensive, the trade-off is justified when coherence and task fidelity are critical.
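
A minimal sketch of the second pass, again using the OpenAI client; asking for a JSON reply and the exact field names are assumptions made for easy parsing, not a required format.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def complete_instruction(instruction: str) -> dict:
    """Second pass: ask the model for an input/output pair matching the instruction."""
    prompt = (
        f"Instruction: {instruction}\n"
        "Write a realistic input for this instruction and the correct output.\n"
        'Respond as JSON: {"input": "...", "output": "..."}'
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # keeps the reply machine-parseable
    )
    pair = json.loads(response.choices[0].message.content)
    return {"instruction": instruction, **pair}

# Example: complete_instruction("Classify the sentiment of the following review.")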

Template-Based Generation

For domains that are highly structured or rule-based—such as translation, unit conversion, simple math, or factual lookups—template-based generation is a practical alternative to prompt-based methods. This technique involves creating fixed templates with placeholders that can be programmatically filled using predefined lists or knowledge bases. A simple example might be the template: “Convert [value] [unit1] to [unit2],” which could yield prompts like “Convert 100 Celsius to Fahrenheit.” Since the format is controlled and deterministic, this method is extremely efficient for generating large volumes of consistent data with minimal human involvement. Template-based generation is ideal for tasks that do not require creative variation or complex reasoning, and it significantly reduces the risk of hallucination, as the data is generated based on controlled rules. However, its limitations become apparent in open-ended tasks where linguistic variation and contextual understanding are necessary, as templates can lead to repetitive and unnatural examples if overused.
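
A small illustrative sketch of template-based generation; the template string, unit list, and conversion functions below are invented placeholders.

import random

# Fixed template with placeholders, filled from controlled value lists.
TEMPLATE = "Convert {value} {unit1} to {unit2}."
CONVERSIONS = {
    ("Celsius", "Fahrenheit"): lambda c: c * 9 / 5 + 32,
    ("kilometers", "miles"): lambda km: km * 0.621371,
}

def make_sample() -> dict:
    (unit1, unit2), convert = random.choice(list(CONVERSIONS.items()))
    value = random.randint(1, 500)
    return {
        "instruction": TEMPLATE.format(value=value, unit1=unit1, unit2=unit2),
        "input": "",
        "output": f"{convert(value):.2f} {unit2}",
    }

dataset = [make_sample() for _ in range(1000)]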

Mixture of Agents Pipeline

In more advanced setups, especially where output quality is critical (such as legal reasoning, medical diagnostics, or enterprise applications), a mixture of agents architecture can be employed. This pipeline simulates a collaborative environment by assigning different roles to different language model agents. Typically, one agent—called the proposer—is responsible for generating multiple candidate outputs for a given instruction-input pair. Another agent—the critiquer—evaluates each candidate based on predefined criteria like factual accuracy, helpfulness, clarity, or tone. Finally, an aggregator agent selects the best-performing output or combines elements from several candidates into a final version. This multi-agent approach introduces a layer of automated review and refinement that improves robustness and reduces noise. It mirrors a human feedback loop and is particularly useful when training data must be free from hallucinations, misalignment, or ambiguity. The trade-off, however, is that such a pipeline requires access to multiple capable models and can be computationally expensive, making it less suitable for low-resource environments or early-stage prototyping.
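
The sketch below compresses the proposer, critiquer, and aggregator roles into plain chat-completion calls against a single model; the role prompts, 1-to-10 scoring scheme, and best-of-n selection are simplifying assumptions rather than a fixed architecture.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(prompt: str, temperature: float = 0.7) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content

def mixture_of_agents(instruction: str, input_text: str, n_candidates: int = 3) -> str:
    task = f"Instruction: {instruction}\nInput: {input_text}"

    # Proposer: generate several candidate outputs with sampling diversity.
    candidates = [chat(f"{task}\nWrite the output.", temperature=1.0)
                  for _ in range(n_candidates)]

    # Critiquer: score each candidate for accuracy, helpfulness, and clarity.
    scores = []
    for cand in candidates:
        reply = chat(
            f"{task}\nCandidate output:\n{cand}\n"
            "Rate this output from 1 to 10 for accuracy, helpfulness, and clarity. "
            "Reply with a single number.",
            temperature=0.0,
        )
        try:
            scores.append(float(reply.strip()))
        except ValueError:
            scores.append(0.0)  # unparseable rating counts as a failure

    # Aggregator: here we simply keep the highest-scoring candidate.
    return candidates[scores.index(max(scores))]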

Programmatic Filtering

Even with well-designed prompting strategies, synthetic datasets often contain noisy or low-quality entries. Programmatic filtering is used to automatically identify and remove such samples before they are fed into the fine-tuning pipeline. This method can be as simple as applying structural heuristics (e.g., checking if all fields are present) or as sophisticated as running a small classification model to flag incoherent or off-topic outputs. For example, generated samples that have very short outputs, irrelevant completions, or missing instructions can be filtered out using Python scripts or rule-based validation functions. More advanced setups might include embedding-based similarity checks to detect and remove near-duplicate entries. Programmatic filtering dramatically improves dataset quality and consistency, especially when dealing with large-scale generation (e.g., 100k+ samples), and it saves significant human effort. However, it’s important to design filtering criteria carefully to avoid removing useful edge cases or introducing bias by over-penalizing certain types of variation.
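
As a hedged example, the snippet below combines simple structural heuristics with an embedding-based near-duplicate check; the thresholds and the sentence-transformers model name are assumptions you would tune for your own data.

from sentence_transformers import SentenceTransformer, util

def passes_heuristics(sample: dict) -> bool:
    """Drop structurally broken or trivially short samples."""
    required = ("instruction", "output")
    if any(not sample.get(key, "").strip() for key in required):
        return False
    if len(sample["output"].split()) < 3:  # suspiciously short completion
        return False
    return True

def drop_near_duplicates(samples: list[dict], threshold: float = 0.92) -> list[dict]:
    """Remove samples whose instructions are nearly identical (cosine similarity of embeddings)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    kept, kept_embeddings = [], []
    for sample in samples:
        emb = model.encode(sample["instruction"], convert_to_tensor=True)
        if any(util.cos_sim(emb, prev).item() > threshold for prev in kept_embeddings):
            continue
        kept.append(sample)
        kept_embeddings.append(emb)
    return kept

# clean = drop_near_duplicates([s for s in dataset if passes_heuristics(s)])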

Human-in-the-Loop Review

Although synthetic data generation is primarily automated, there are cases where incorporating human review becomes valuable—especially for high-importance tasks or benchmark-quality evaluation sets. Human-in-the-loop review involves manually inspecting, editing, or validating synthetic examples to ensure they meet the required standards of quality, clarity, and domain specificity. This can include validating labels in classification tasks, correcting language in outputs, or rejecting samples that don't meet ethical or stylistic guidelines. Human review is resource-intensive and doesn’t scale as well as automated techniques, but it can dramatically improve the trustworthiness and accuracy of critical training datasets. It's often used selectively—either on a small representative sample or on the final dataset after filtering—particularly in domains like healthcare, education, or finance, where precision and nuance matter.

End-to-End Process of Synthetic Data Generation

The synthetic data generation process can be viewed as a structured pipeline that mirrors the logic of real-world model development. It consists of the following stages:

Step 1: Task Definition and Categorization

Before generating any data, define the types of tasks the model needs to perform. This provides structure and relevance to the dataset and ensures coverage across various skill areas.

Common task categories:

  • Summarization – “Summarize the following customer complaint.”
  • Sentiment classification – “Label the sentiment of this product review.”
  • Question answering – “Answer the following based on the paragraph.”
  • Reasoning/Inference – “Why does the moon affect tides?”
  • Code generation – “Write a Python function to sort a list.”
  • Table-based tasks – “Extract the total revenue from this table.”

Task categorization not only promotes diversity but also allows focused data generation tailored to specific domains such as finance, medicine, education, or customer service.

Step 2: Instruction Generation via Meta-Prompting

Once task categories are defined, the next step is generating natural language instructions for each task. This is achieved using a technique called meta-prompting, where a strong instruction-following model is asked to generate other task instructions.

Example:

“Generate 10 diverse instructions for multi-hop reasoning tasks in the educational domain.”

This approach ensures:

  • Structural consistency
  • Creativity and variation
  • Domain relevance

You can generate hundreds (or thousands) of instructions across categories by looping through different task types, domains, or difficulty levels.
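
For instance, a small loop over task types and domains can assemble these meta-prompts programmatically (the category lists here are placeholders):

task_types = ["summarization", "sentiment classification", "question answering"]
domains = ["finance", "healthcare", "education"]

meta_prompts = [
    f"Generate 10 diverse instructions for {task} tasks in the {domain} domain."
    for task in task_types
    for domain in domains
]
# Each meta-prompt is then sent to the instruction-following model,
# yielding 10 instructions per (task, domain) combination (90 total here).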


Step 3: Input-Output Completion (Two-Pass Prompting)

After instructions are generated, we need to add realistic input and output examples to complete the data triplets. This is often done using a two-pass prompting strategy:

  1. First pass – Generate the instruction.
  2. Second pass – Feed the instruction back into the model and ask it to generate an input and a matching output.

Example:

  • Instruction: “Classify the sentiment of the following tweet.”
  • Input: “Just got promoted today! Feeling awesome.”
  • Output: “Positive”

This ensures:

  • Coherence between instruction and response
  • Natural input formats
  • Realistic outputs based on the task

This two-stage design mimics how humans understand and solve tasks, and it prevents mismatched or poorly aligned examples.

Step 4: Filtering and Quality Control

Synthetic data, while scalable, can be noisy or inconsistent, so a filtering step is critical to ensure quality and reliability.

Filtering methods:

  • Heuristic-based filters: Remove samples with missing fields, invalid structures, or poor language.
  • Classifier-based filters: Use another model to rate outputs based on fluency, helpfulness, or correctness.
  • Manual review (for small datasets): Human reviewers inspect and remove low-quality samples.

What to filter:

  • Redundant examples
  • Unaligned input-output pairs
  • Irrelevant or hallucinated responses
  • Overly simplistic or overly complex examples
  • Grammar or syntax issues

The aim is to maximize data quality per token, ensuring that the fine-tuned model doesn't learn from faulty signals.

Step 5: Formatting for Fine-Tuning

After the data is filtered and validated, it must be formatted to suit your fine-tuning framework. The most popular formats include:

1. Alpaca JSON format

Simple JSON structure used in many open-source LoRA/QLoRA projects.

{
  "instruction": "Explain the concept of overfitting.",
  "input": "",
  "output": "Overfitting occurs when a model memorizes training data rather than generalizing to unseen examples."
}

2. ChatML format

Role-based formatting used for chat-style models like ChatGPT or Claude.

[
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is reinforcement learning?"},
  {"role": "assistant", "content": "Reinforcement learning is..."}
]

3. ShareGPT-style multi-turn dialogues

Used for chat models that support conversational context over multiple exchanges.
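
For reference, a single ShareGPT-style record typically wraps the turns in a conversations list; the field names below follow the common community convention and may vary between toolkits.

{
  "conversations": [
    {"from": "human", "value": "What is reinforcement learning?"},
    {"from": "gpt", "value": "Reinforcement learning is a training approach where an agent learns by trial and error from rewards."},
    {"from": "human", "value": "Can you give an example?"},
    {"from": "gpt", "value": "A game-playing agent that improves by receiving points for winning moves is a classic example."}
  ]
}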

Challenges and Limitations of Synthetic Data

While synthetic data brings scalability, flexibility, and privacy benefits to AI development, it comes with several limitations that must be addressed for reliable usage.

Data Reliability and Bias: Synthetic data quality is only as good as the source it’s derived from. If the original data or generation model contains bias or inconsistencies, these flaws will likely be reflected in the synthetic output. Without careful validation, synthetic datasets can misrepresent the target domain, leading to misleading model behaviour.

Missing Outliers and Rare Cases: Synthetic data often models the average patterns in real data but may fail to replicate outliers or rare edge cases accurately. In applications where anomalies are critical—like fraud detection or medical diagnostics—this can limit the model’s effectiveness.

Technical Effort and Expertise: Generating high-quality synthetic data isn’t plug-and-play. It requires expertise in prompt design, filtering, data validation, and alignment with downstream tasks. For complex datasets or workflows, the time and effort involved can be substantial.

Trust and Adoption Barriers: As a relatively new approach, synthetic data may face skepticism, especially in regulated or risk-sensitive industries. Without strong validation and transparency, users may hesitate to rely on models trained with synthetic datasets.

Quality Control and Verification: Automated generation pipelines can introduce errors, inconsistencies, or unrealistic examples. Quality assurance—whether through manual checks, rule-based filters, or sample audits—is essential to ensure that synthetic data is usable and trustworthy.

How To Generate Synthetic Data on MonsterAPI


Launching a Data Augmentation Job

We send a request to MonsterAPI's data augmentation service to generate synthetic instruction data, specifying the dataset we want to augment (hosted on Hugging Face). MonsterAPI then uses GPT-4o models to evolve and improve this data over several iterations: one model generates new instructions, another reviews them, and the best outputs are selected. Once the job completes, MonsterAPI returns a higher-quality dataset that can be used for fine-tuning.

With curl:

curl --request POST \
     --url https://api.monsterapi.ai/v1/generate/data-augmentation-service \
     --header 'accept: application/json' \
     --header 'content-type: application/json' \
     --data '
{
  "data_config": {
    "data_source_type": "hub_link",
    "data_subset": null,
    "prompt_column_name": "prompt",
    "split": "train",
    "data_path": "distilabel-internal-testing/instruction-dataset-mini"
  },
  "generate_model1_name": "gpt-4o",
  "generate_model2_name": "gpt-4o",
  "judge_model_name": "gpt-4o",
  "num_evolutions": 5,
  "openai_api_key": "YOUR_OPENAPI_KEY",
  "task": "evol_instruct"
}
'

Python Code:

import requests

# MonsterAPI synthetic data augmentation endpoint
url = "https://api.monsterapi.ai/v1/generate/data-augmentation-service"
headers = {
    "accept": "application/json",
    "content-type": "application/json"
}
payload = {
    "data_config": {
        "data_source_type": "hub_link",
        "data_subset": None,
        "prompt_column_name": "prompt",
        "split": "train",
        "data_path": "distilabel-internal-testing/instruction-dataset-mini"
    },
    "generate_model1_name": "gpt-4o",
    "generate_model2_name": "gpt-4o",
    "judge_model_name": "gpt-4o",
    "num_evolutions": 5,
    "openai_api_key": "YOUR_OPENAPI_KEY",
    "task": "evol_instruct"
}
response = requests.post(url, headers=headers, json=payload)
print(response.json())

Now you can use the generated dataset to fine-tune your model.

Step-by-step Guide To Fine-tune Phi-4 on MonsterAPI

To fine-tune the Phi-4 model using the synthetic dataset, we send a request to MonsterAPI's fine-tuning service. In the request, we specify the model to fine-tune, the location of the dataset on Hugging Face Hub, and the training settings such as learning rate, number of epochs, and LoRA configuration. Once submitted, MonsterAPI handles the entire training process on its servers.

import requests

# MonsterAPI LLM fine-tuning endpoint
url = "https://api.monsterapi.ai/v1/finetune/llm"
payload = {
    "deployment_name": "phi4-finetuned",
    "pretrainedmodel_config": {
        "model_path": "microsoft/phi-4",  
        "use_lora": True,
        "lora_r": 8,
        "lora_alpha": 16,
        "lora_dropout": 0.1,
        "lora_bias": "none",
        "use_quantization": False,
        "use_unsloth": False,
        "use_gradient_checkpointing": False,
        "parallelization": "nmp"
    },
    "data_config": {
        "data_path": "your_username/your_dataset_name",
        "data_subset": "train",
        "data_source_type": "hub_link",
        "prompt_template": "{text}",
        "cutoff_len": 1024,
        "prevalidated": False
    },
    "training_config": {
        "early_stopping_patience": 3,
        "num_train_epochs": 3,
        "gradient_accumulation_steps": 4,
        "warmup_steps": 100,
        "learning_rate": 3e-4,
        "lr_scheduler_type": "linear",
        "group_by_length": False,
        "preference_optimization": "DONT",
        "optimizer": "adamw_hf"
    },
    "logging_config": {
        "use_wandb": False
    }
}
headers = {
    "accept": "application/json",
    "content-type": "application/json"
}
response = requests.post(url, json=payload, headers=headers)
print(response.json())

Feel free to customize the settings and start fine-tuning. If making API requests feels cumbersome, just log into your account and trigger the job from our simple UI.

Conclusion

Synthetic data offers a flexible, scalable way to train models when real-world datasets are limited or unavailable. With MonsterAPI, you can generate high-quality instruction data and fine-tune language models with ease. By customizing data generation to match your target tasks, you gain greater control over model behavior, improve alignment, and accelerate development cycles. As AI continues to evolve, synthetic data is proving to be a practical, reliable foundation for building smarter, domain-aware systems—without the overhead of manual data collection.