
How to Create AI Training Data for Fine-Tuning / Training LLMs

  • Writer: Rahul Gupta
  • Feb 21
  • 2 min read
Online AI Dataset Builder, an AI training data generator and JSON or JSONL Alpaca format creator for custom LLM fine-tuning.
AI Dataset Builder and Training Data Generator

What is AI training data?


When you fine-tune an AI model, you are essentially teaching a generalized AI to become a specialist in a specific task, tone, or industry. To do this, you have to feed it "training data"—a collection of examples showing the AI exactly how to respond to certain prompts.

The problem is that fine-tuning tools can't just read a Word document or a spreadsheet. They require your data in specific, machine-readable formats, most commonly JSON or JSONL. Furthermore, many popular fine-tuning workflows use a specific structure, like the "Alpaca format," which requires every single training example to be broken down into the same named fields.


What is the instruction-output format?


A common way to structure training data is the instruction-output format. This structure breaks down every example into three components:


1. Instruction: The task or command you are giving the AI.


2. Input (Optional): The context or specific information the AI needs to complete the task.

3. Output: The exact, perfect response you want the AI to generate.
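
To make the three components concrete, here is a minimal Python sketch of one training example as a plain dictionary. The helper name `make_record` and the sample text are hypothetical, chosen only for illustration, and treating a missing input as an empty string is an assumption rather than a rule every tool enforces.

```python
def make_record(instruction: str, output: str, input_text: str = "") -> dict:
    """Build one instruction-output training example as a plain dict."""
    if not instruction.strip():
        raise ValueError("instruction must not be empty")
    if not output.strip():
        raise ValueError("output must not be empty")
    # Input is optional: an empty string means "no extra context".
    return {"instruction": instruction, "input": input_text, "output": output}


# Hypothetical example data, not taken from a real dataset.
record = make_record(
    instruction="Summarize the following note in one sentence.",
    input_text="The render farm will be offline on Friday for maintenance.",
    output="The render farm is down for maintenance on Friday.",
)
```

Keeping the field order fixed (instruction, input, output) is not required by JSON, but it makes a long dataset easier to read and review by eye.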



Example 1: The Studio Pipeline (Teaching an AI strict formatting)

Fine-tuning is perfect for teaching an AI your exact studio naming conventions or pipeline rules.


• Instruction: Convert the following render details into the studio's standard file naming convention.


• Input: Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.


• Output: prj_nikeDN8_sh012_v004_fx.exr

---------------------------------------------------------------------------------------------------------


Example 2: The Client Feedback Translator (Teaching an AI to structure unstructured data)

This shows how an AI can take messy, human text and turn it into organized, actionable data for a project management tool.


• Instruction: Convert the following unstructured client feedback into an actionable task list with specific department tags.


• Input: "Can we make the eternal fire look a bit more orange? Also, the smoke in the stealth shot is too thick, and the cloud giants need to move slower."


• Output:

◦ [Lighting/FX] Eternal Fire: Color correct fire simulation to increase orange hues.

◦ [FX] Stealth: Reduce density of smoke volume.

◦ [Animation] Cloud Giants: Retime animation to slow down movement.
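
A multi-line output like the task list above still has to fit into a single text field. The sketch below shows one way to do that in Python: join the lines with newline characters and let the standard `json` module handle the escaping. Dropping the bullet glyphs and joining with `\n` is an assumption about how you might want the output stored, not a requirement of the format.

```python
import json

feedback_record = {
    "instruction": (
        "Convert the following unstructured client feedback into an "
        "actionable task list with specific department tags."
    ),
    "input": (
        "Can we make the eternal fire look a bit more orange? Also, the smoke "
        "in the stealth shot is too thick, and the cloud giants need to move slower."
    ),
    # One task per line, stored as a single string.
    "output": "\n".join([
        "[Lighting/FX] Eternal Fire: Color correct fire simulation to increase orange hues.",
        "[FX] Stealth: Reduce density of smoke volume.",
        "[Animation] Cloud Giants: Retime animation to slow down movement.",
    ]),
}

# json.dumps escapes each real newline as the two characters "\" and "n",
# so the whole record stays on one physical line.
encoded = json.dumps(feedback_record)

# json.loads reverses the escaping and restores the original line breaks.
decoded = json.loads(encoded)
```

Because the encoded record contains no raw line breaks, the same string can be used as one line of a JSONL file.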


The Formatting Headache: Understanding the JSON Structure


To actually use data to fine-tune an AI model, you have to convert those plain-English examples into code.

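Most popular fine-tuning tools for open-source models (like LLaMA) expect your dataset to be formatted as a JSON (JavaScript Object Notation) or JSONL (JSON Lines) file. A JSON file holds all examples in one array, while a JSONL file holds one JSON object per line. Specifically, these tools often use what is known as the "Alpaca format."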


If we take the first example above, here is how it looks in Alpaca-style JSON:


[
  {
    "instruction": "Convert the following render details into the studio's standard file naming convention.",
    "input": "Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.",
    "output": "prj_nikeDN8_sh012_v004_fx.exr"
  }
]
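
If you would rather generate this file than type it by hand, the sketch below builds the same record in Python and serializes it both ways, as a JSON array and as JSONL. The function names `to_json` and `to_jsonl` are hypothetical helpers written for this post; only the standard `json` module is used.

```python
import json

records = [
    {
        "instruction": "Convert the following render details into the studio's standard file naming convention.",
        "input": "Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.",
        "output": "prj_nikeDN8_sh012_v004_fx.exr",
    },
]


def to_json(records: list) -> str:
    """Serialize all examples as one indented JSON array."""
    return json.dumps(records, indent=2)


def to_jsonl(records: list) -> str:
    """Serialize each example as its own single-line JSON object."""
    return "\n".join(json.dumps(r) for r in records) + "\n"


json_text = to_json(records)
jsonl_text = to_jsonl(records)
```

Either string can then be written to disk with a normal `open(..., "w", encoding="utf-8")` call. Which of the two a given training tool expects depends on the tool, so check its documentation before choosing.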


A No-Code JSON Export Tool


To solve this, I built the Online AI Dataset Builder.


Instead of writing out JSON brackets and checking for missing commas, you simply paste your text into a clean user interface. The Online AI Dataset Builder automatically structures your inputs into valid JSON ready for fine-tuning, whether you train from the terminal or with software like LlamaFactory.
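
For anyone who still edits datasets by hand, here is a rough Python sketch of the kind of check involved. It is not how the Online AI Dataset Builder works internally; `validate_dataset` is a hypothetical helper, and requiring all three keys as strings is an assumption (some tools accept examples without an "input" key).

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")


def validate_dataset(json_text: str) -> list:
    """Return a list of problems found; an empty list means none were found."""
    try:
        data = json.loads(json_text)
    except json.JSONDecodeError as err:
        # Syntax errors such as a missing comma end up here.
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]

    if not isinstance(data, list):
        return ["top level must be a JSON array of examples"]

    problems = []
    for i, example in enumerate(data):
        if not isinstance(example, dict):
            problems.append(f"example {i}: not a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in example:
                problems.append(f"example {i}: missing key '{key}'")
            elif not isinstance(example[key], str):
                problems.append(f"example {i}: '{key}' must be a string")
    return problems
```

Running this on a file's contents before training surfaces syntax slips and missing fields early, instead of partway through a fine-tuning run.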

