How to Host a Fine-Tuned LLM?

Hosting a fine-tuned LLM can be a major challenge because of the range of GPU infrastructure hosting options and the technical hurdles involved. In this blog, we'll cover how to deploy your fine-tuned LLM with a single click.

Guide to deploying a fine-tuned LLM

Large Language Models (LLMs) have changed the face of AI, enabling applications ranging from chatbots to complex data analysis. Fine-tuning these models enables businesses and developers to tailor them to specific tasks, increasing their efficiency and effectiveness. However, deploying a fine-tuned LLM can be difficult, requiring you to navigate a variety of GPU infrastructure hosting options and technical considerations.

This blog will walk you through the entire process, from understanding how to host an LLM inference endpoint to deploying your fine-tuned model with one-click deployment on MonsterAPI.

What is LLM Hosting?

LLM hosting is the process of deploying a trained or fine-tuned Large Language Model to a server or cloud infrastructure so that it can be used for inference tasks. This process entails creating the necessary environment to serve the model, handle requests, and ensure that it scales in accordance with demand.

In traditional setups, hosting an LLM may involve self-managed GPU servers, with developers in charge of all infrastructure operations, including hardware maintenance, security, and resource scaling.

Alternatively, cloud-based solutions such as AWS, Google Cloud, and Azure offer managed services that make deployment easier by providing pre-configured environments and auto-scaling.

These platforms make it easier to integrate LLMs into production, but they still require some level of management and configuration. Additionally, cloud platforms get quite costly at scale and require expertise in managing auto-scaling infrastructure.

For developers, this includes establishing the technical infrastructure to handle requests, adding a load balancer, managing and monitoring the hosted API endpoint, and ensuring that the model integrates seamlessly with existing systems.

For businesses, it is important to ensure that the AI tool is dependable, scalable, and accessible without delving into technical complexities. Even non-technical users can understand how LLM hosting transforms a powerful AI model into a practical, usable tool for a variety of applications.

Different Types of Hosting

  1. Private Hosting: The LLM is deployed on a server that is entirely managed by the user. This configuration provides complete control over the environment, making it ideal for sensitive applications in which data privacy is critical. However, managing it requires significant technical expertise and might not be scalable due to geographical limitations and the limited availability of servers in a private facility.
  2. Cloud Hosting: Cloud providers like AWS, Google Cloud, and Azure offer managed scalable services for deploying LLMs. These platforms handle the majority of the heavy lifting, including scaling, load balancing, and security, allowing developers to focus on application logic rather than infrastructure. Cloud hosting is highly flexible, scalable, and often the preferred choice for businesses needing robust and reliable performance.
  3. Hybrid Hosting: Some organizations use a hybrid approach, with sensitive data and models hosted on private servers and less critical components leveraging the scalability of cloud infrastructure. This approach strikes a balance between control and flexibility.
💡 Here’s a quick overview of how developers used to host a fine-tuned model (the traditional way):

Hosting a fine-tuned LLM traditionally can be a long and tedious process, involving several intricate steps:

  1. Model Fine-Tuning: After fine-tuning your model, which can take hours or even days depending on the dataset size and hardware, you’re left with a model that’s ready to deploy.
  2. Environment Setup: Now, you need to set up a server environment. This involves choosing a hosting provider, configuring a virtual machine, installing dependencies, setting up the network, and ensuring security protocols are in place. This can take several hours or even days, depending on the complexity of the setup.
  3. Model Deployment: You’ll need to manually load your model onto the server, write custom scripts to create an API for inference (a minimal sketch of such a script follows this list), and configure the server to handle requests. This step often involves debugging issues that arise during deployment, which can be time-consuming.
  4. Scaling and Maintenance: Finally, you must ensure your setup can scale with demand. This might involve setting up load balancers, configuring auto-scaling groups, and monitoring server performance continuously. Maintenance includes regular updates, security patches, and possibly troubleshooting outages.
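
To make step 3 concrete, here is a minimal sketch of the kind of custom inference script teams end up writing in the traditional approach. It uses FastAPI and Hugging Face Transformers; the model path, route, and request schema are illustrative placeholders rather than a prescribed setup:

# Minimal, illustrative inference server for the traditional approach.
# The model path, route, and request schema below are placeholders.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the fine-tuned model once at startup (placeholder path).
generator = pipeline("text-generation", model="./my-finetuned-model")

class PromptRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: PromptRequest):
    # Run inference and return only the generated text.
    output = generator(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"completion": output[0]["generated_text"]}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000

Even this toy server leaves out authentication, batching, load balancing, and monitoring, which is where most of the traditional effort actually goes.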

Imagine a company that needs to deploy a fine-tuned LLM to power a customer support chatbot.  Using the traditional method, their development team might spend weeks setting up the environment, writing custom deployment scripts, and ensuring everything runs smoothly. This method, while offering control, demands substantial time, resources, and expertise.

How to Deploy Fine-Tuned LLM in 1 Click on MonsterAPI

MonsterAPI simplifies the hosting process with its 1-click LLM deployment feature, specifically designed for those who want to avoid the complexity of traditional methods. Here's how it works:

Start by fine-tuning your LLM directly on MonsterAPI’s platform, or bring your own fine-tuned transformer-based LLM. Once the fine-tuning is complete, you’re just one click away from deployment.

Option 1: Deploying Directly from the Fine-Tuning Page 

After your model is fine-tuned, you may simply click the "Deploy" button next to the logs on your fine-tuned LLM card. You’ll be prompted to configure some basic settings, like your model name, number of GPUs and GPU vRAM.

With that done, MonsterAPI will automatically configure the right environment and GPU infrastructure to host your specified LLM, and within a few minutes, you’ll receive an API endpoint accessible only through a secured API key associated with that deployment. This API endpoint will be the inference API for your fine-tuned LLM.
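
Once the endpoint is live, you can call it like any other HTTP API using the API key associated with the deployment. The exact base URL and request schema for your deployment are shown in your MonsterAPI dashboard; the snippet below is only a sketch with placeholder values, assuming a simple prompt-in, completion-out route of the kind vLLM-served models commonly expose:

# Hypothetical call to a deployed inference endpoint; the URL, route, and
# payload fields are placeholders. Copy the real values from your dashboard.
import requests

ENDPOINT_URL = "https://<your-deployment-url>/generate"   # placeholder URL
AUTH_TOKEN = "YOUR_DEPLOYMENT_AUTH_TOKEN"                  # key tied to this deployment

payload = {
    "prompt": "Translate to Hindi: How are you today?",
    "max_tokens": 128,
}
headers = {"authorization": f"Bearer {AUTH_TOKEN}"}

response = requests.post(ENDPOINT_URL, json=payload, headers=headers)
print(response.json())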

To learn how to fine-tune an LLM, see our blog here.

Option 2: Deploying from the Deploy section 

Alternatively, if you already have your fine-tuned model ready, you can simply navigate to the Monster Deploy section.

Go to your MonsterAPI dashboard and deploy your fine-tuned LLM with quick configurations.

MonsterAPI takes care of all the backend tasks for you, including GPU orchestration, setting up the environment with built-in auto-scaling, and optimizing the endpoints with a vLLM-powered LLM inference engine that delivers up to 10x faster throughput, making the deployment process smooth and hassle-free.

Option 3: Hosting an LLM via our LLM Deployment API

For programmatic deployment of your fine-tuned LLM, use the following code snippet:


url = "https://api.monsterapi.ai/v1/deploy/llm"

payload = {
    "deployment_name": "brave_mendel",
    "prompt_template": "{prompt}{completion}",
    "per_gpu_vram": 8,
    "gpu_count": 1,
    "api_auth_token": "9f0a4e33-f529-4104-b4c4-8e592a46f657",
    "use_nightly": False,
    "multi_lora": False,
    "basemodel_path": "google/gemma-2-2b-it",
    "loramodel_path": "monsterapi/gemma-2-2b-hindi-translator" 
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer YOUR_API_KEY"
}

response = requests.post(url, json=payload, headers=headers)

print(response.text)

Replace "YOUR_API_KEY" in the authorization part with your MonsterAPI account key. This will be required for authenticating your LLM deployment request. You will be able to copy that key in your dashboard. 

You may also change the “api_auth_token”. This will be the token used for authentication when you send requests to your deployed API endpoint.  

This code handles deployment by sending a POST request with your model configuration. Feel free to customize the basemodel_path, VRAM allocation, and GPU count according to your deployment needs. 

For example, if you fine-tuned the Gemma 2 2B-it model and built a LoRA adapter from it, then basemodel_path is the Hugging Face path of the base model (“google/gemma-2-2b-it” in this case) and loramodel_path is the Hugging Face path of the adapter (“monsterapi/gemma-2-2b-hindi-translator” in this case). The adapter path can also be a link to a zip file containing the adapter files.

This flexibility lets you optimize performance based on your specific requirements.

Deployment and processing logs can be tracked using the /logs/{deployment_id} API, while status updates can be retrieved via the /status/{deployment_id} API. This API approach is ideal for automating deployments and integrating them into CI/CD workflows.
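
As an illustration, here is how such a polling loop might look in the same script, continuing from the deployment request above. The full endpoint URLs and the response field names ("deployment_id", "status", the "live" state) are assumptions for this sketch; match them to the actual responses returned by the API:

# Hypothetical polling of the status and logs endpoints described above.
# The full URLs and response fields ("deployment_id", "status", "live")
# are assumptions; adjust them to the actual API responses.
import time
import requests

headers = {
    "accept": "application/json",
    "authorization": "Bearer YOUR_API_KEY",
}

# Deployment ID returned by the deployment request above (field name assumed).
deployment_id = response.json().get("deployment_id")

status_url = f"https://api.monsterapi.ai/v1/deploy/status/{deployment_id}"
logs_url = f"https://api.monsterapi.ai/v1/deploy/logs/{deployment_id}"

# Poll the status until the deployment is live (or we give up).
for _ in range(60):
    status = requests.get(status_url, headers=headers).json()
    print(status)
    if status.get("status") == "live":
        break
    time.sleep(30)

# Fetch the deployment logs for debugging.
print(requests.get(logs_url, headers=headers).text)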

This process eliminates the need for deep technical expertise, a significant time investment, and the frustration that can accompany traditional deployment. Using MonsterAPI, anyone can deploy a fine-tuned LLM, regardless of their technical background.

Conclusion

Hosting a fine-tuned LLM can be a challenging process, but it doesn't have to be. With the right tools and platforms, such as MonsterAPI, deploying your model can be as simple as a single click.

Whether you prefer the control of traditional hosting methods or the ease of a managed service, understanding the options available will help you make the best choice for your needs.

By leveraging MonsterAPI’s platform, you can streamline the deployment process, allowing you to focus on what truly matters—building innovative applications that leverage the power of fine-tuned LLMs.