How to fine-tune a Large Language Model (LLM) and deploy it on MonsterAPI

Learn how to deploy an LLM finetuned with MonsterAPI’s no-code LLM finetuner using our new Deploy service. With this service, anyone can deploy an LLM, with or without a LoRA adapter, in just a few clicks or commands. See this blog to learn how we delivered 10M tokens/hr on Zephyr-7B at a cost of $1.25.

Large Language Models (LLMs), like ChatGPT, have transformed Natural Language Processing (NLP) with human-like text generation and swift interaction. They learn from vast text data, creating coherent responses and understanding context. They stand out due to their size and learning mechanism, yet for tasks like medical diagnosis, they need fine-tuning.

Fine-tuning comes after general learning, tailoring the LLM to specific datasets such as medical records. It improves accuracy, contextual relevance, and safety, and reduces bias in critical fields like healthcare. But the challenges associated with fine-tuning large language models make the process complex and largely restrict it to developers with an ML/Ops skillset.

At Monster API, we designed our no-code LLM fine-tuner to simplify the process of finetuning by:

  • Automatically configuring GPU computing environments, 
  • Optimizing memory usage by finding the optimal batch size, 
  • Integrating experiment tracking with WandB, and 
  • Auto-configuring the pipeline to complete without any errors on our cost-optimised GPU cloud

The above resulted in an affordable and remarkably simple approach to finetuning LLMs without writing a single line of code. In a recent experiment, we finetuned Mistral-7B on the no_robots dataset, and the final model outperformed the base model on various benchmarks. The entire model finetuning cost us less than a cup of coffee. To learn how to set up a finetuning job, read this blog on how to launch an LLM finetuning job on MonsterAPI.

In this blog, we'll discuss how we deployed the above finetuned Large Language Model (LLM) on MonsterAPI. Once deployed, the model is hosted as an API endpoint, allowing users to query and fetch results from the finetuned LLM deployment. 

📙
Check out this Google Colab Notebook for the complete code demo

Once you have finetuned an LLM on MonsterAPI, you will receive adapter weights as the final output. This adapter contains your fine-tuned model’s weights that we will host as an API endpoint using Monster Deploy.

Monster Deploy optimizes its backend operations using the vLLM framework. vLLM is a rapid and user-friendly library for large language model inference and serving, notable for its state-of-the-art serving throughput. It efficiently manages memory with PagedAttention and enhances performance through continuous batching of requests. The library leverages CUDA/HIP graph technologies for fast model execution and employs advanced quantization techniques such as GPTQ, AWQ, and SqueezeLLM. Additionally, optimized CUDA kernels further augment its speed, making vLLM an effective solution for large language model operations.
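You don’t need to interact with vLLM directly when using Monster Deploy, but for context, here is a minimal sketch of what serving the same base model looks like with vLLM’s own offline API (this assumes vLLM is installed locally and a GPU with enough VRAM is available):

from vllm import LLM, SamplingParams

# Load the base model and generate with simple sampling settings
llm = LLM(model="mistralai/Mistral-7B-v0.1")
params = SamplingParams(temperature=0.6, max_tokens=512)

outputs = llm.generate(["What is a solar system?"], params)
print(outputs[0].outputs[0].text)

Monster Deploy runs this serving stack for you on its GPU cloud and exposes it as a managed API endpoint.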

Here’s how you can deploy an LLM using Monster Deploy: 

Step 1: Initialize MonsterAPI Client

First, install the client library, import it, and initialize the MonsterAPI client with your MonsterAPI key. If you don’t have a key, sign up on MonsterAPI to get one.

# Install the MonsterAPI Python client
!python3 -m pip install monsterapi==1.0.2b3

from monsterapi import client as mclient

api_key = "YOUR_MONSTER_API_KEY"  # paste your MonsterAPI key here
deploy_client = mclient(api_key=api_key)

Step 2: Launch the Deployment 

  • The prompt_template defines the structured text used to build prompts for the Large Language Model (LLM). Its placeholders ({system}, {prompt}, {response}) are filled in at inference time.
  • "basemodel_path": Specifies the HuggingFace path to the base model ("mistralai/Mistral-7B-v0.1") that serves as the foundational Large Language Model.
  • "loramodel_path": Specifies the HuggingFace path (or a URL to a zip of adapter weights) for the fine-tuned LoRA adapter ("qblocks/mistral_7b_norobots"), which was finetuned for better instruction following.
prompt_template = """
<|system|>
{system} </s>
<|user|>
{prompt} </s>
<|assistant|>
{response}
"""
launch_payload = {
    "basemodel_path":  "mistralai/Mistral-7B-v0.1",
    "loramodel_path": "qblocks/mistral_7b_norobots" ,
    "prompt_template": prompt_template,
    "per_gpu_vram": 24,
    "gpu_count": 1
}

# Launch a deployment
ret = deploy_client.deploy("llm", launch_payload)
deployment_id = ret.get("deployment_id")
print(deployment_id)
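
For clarity, the input_variables you will send at query time (Step 4) are substituted into the placeholders of prompt_template, while the {response} slot is left for the model to complete. Here is a quick local illustration of that substitution (the real formatting is handled by the deployment backend):

# Illustrative only: shows how the template placeholders line up with input_variables
filled_prompt = prompt_template.format(
    system="You are a friendly chatbot",
    prompt="what is a solar system",
    response=""  # generated by the model at inference time
)
print(filled_prompt)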

Step 3: Track Deployment Progress

Keep in mind that it takes a few minutes to spin up the instance. The 'status' will transition from 'building' to 'live' as the build progresses. Once the deployment is in the 'building' state, you can fetch its logs to track progress:

status_ret = deploy_client.get_deployment_status(deployment_id)
print(status_ret)
logs_ret = deploy_client.get_deployment_logs(deployment_id, n_lines = 25)
if 'logs' not in logs_ret:
  raise Exception("Please wait until status changes to building!")
for i in logs_ret['logs']:
  print(i)
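
If you’d rather not re-run the status check manually, a simple polling loop (a sketch built only on the get_deployment_status call above) can wait until the deployment goes live:

import time

# Poll the deployment status every 30 seconds until it reports 'live'
status_ret = deploy_client.get_deployment_status(deployment_id)
while status_ret.get("status") != "live":
    print("Current status:", status_ret.get("status"))
    time.sleep(30)
    status_ret = deploy_client.get_deployment_status(deployment_id)
print("Deployment is live!")

(You may also want to break out of the loop on a failed deployment; check your MonsterAPI dashboard for the exact status values.)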

Step 4: Use the Deployed LLM Endpoint

Once the deployment is ‘live’, you can use the Monster Deploy LLM endpoint to generate text from your model.

import json

assert status_ret.get("status") == "live", "Please wait until status is live!"

service_client  = mclient(api_key = status_ret.get("api_auth_token"), base_url = status_ret.get("URL"))

payload = {
    "input_variables": {"system": "You are a friendly chatbot",
        "prompt": "what is a solar system"},
    "stream": False,
    "temperature": 0.6,
    "max_tokens": 512
}

output = service_client.generate(model = "deploy-llm", data = payload)

if payload.get("stream"):
    for i in output:
        print(i[0])
else:
    print(json.loads(output)['text'][0])
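
The same endpoint can also stream tokens as they are generated. Here is a minimal streaming variant of the request above (a sketch, assuming the client yields chunks exactly as the streaming branch in the code above consumes them):

# Re-send the same request with streaming enabled
stream_payload = dict(payload, stream=True)

for chunk in service_client.generate(model="deploy-llm", data=stream_payload):
    print(chunk[0], end="", flush=True)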

Step 5: Terminate the Deployment

Once you're done using the deployed model, don't forget to terminate your deployment to avoid further billing charges.

Please note: All Deployments are billed per minute of runtime.

terminate_return = deploy_client.terminate_deployment(deployment_id)
print(terminate_return)

For a detailed code implementation, refer to this Colab Notebook.

By following these steps, you can easily fine-tune and deploy your preferred Large Language Model on Monster API. This powerful tool enables you to tune LLMs for specific tasks efficiently, making them ready for deployment in various applications.

Benefits of using Monster Deploy:

Monster Deploy offers seamless and fast deployment of pre-trained and finetuned large language models, with: 

  1. Option to choose between different GPU configurations, 
  2. Optimization for higher throughput with auto-batching of requests, 
  3. Low-cost deployments on MonsterAPI’s cost-optimized GPU cloud infrastructure, and
  4. Simplified launching and management of LLM deployments via a single command.

Additionally, it supports a range of open-source LLMs and LoRA-compatible adapters for LLMs like Zephyr, LLaMA-2, Mistral, and more.

Benefits of using MonsterAPI’s No Code LLM Finetuner:

The no-code fine-tuning approach is a game-changer, simplifying the complex process of fine-tuning language models. It reduces setup complexity, optimizes resource usage, and minimizes costs.

This makes it easier for developers to harness the power of large language models, ultimately driving advancements in natural language understanding and AI applications.

Resource Section:

Sign up on MonsterAPI to get FREE credits and try out our no-code LLM Finetuning solution today!

Check out our documentation on: