How to Create AI Training Data for Fine-Tuning and Training LLMs

Rahul Gupta
Feb 21
2 min read

Online AI Dataset Builder, an AI training data generator and JSON or JSONL Alpaca format creator for custom LLM fine-tuning. — AI Dataset Builder and Training Data Generator

What is AI training data?

When you fine-tune an AI model, you are essentially teaching a generalized AI to become a specialist in a specific task, tone, or industry. To do this, you have to feed it "training data"—a collection of examples showing the AI exactly how to respond to certain prompts.

The problem is that fine-tuning tools can't just read a Word document or a spreadsheet. They expect your data in a specific machine-readable format, most commonly JSON or JSONL. On top of that, many popular tools use a fixed structure such as the "Alpaca format," which requires every training example to be broken down into the same named fields.

What is the instruction-output format?

To train a model on your own examples, a common choice is the instruction-output format. This structure breaks every interaction into three components, one of which is optional:

1. Instruction: The task or command you are giving the AI.

2. Input (Optional): The context or specific information the AI needs to complete the task.

3. Output: The exact, perfect response you want the AI to generate.
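As a rough sketch, the three components above can be modeled as a small Python structure. The class name `AlpacaExample`, the method `to_record`, and the sample strings are made up for illustration; they are not part of any library or of the Alpaca format itself.

```python
from dataclasses import dataclass, asdict


@dataclass
class AlpacaExample:
    """One training example in instruction-output form (hypothetical helper)."""
    instruction: str
    output: str
    input: str = ""  # optional context; left empty when the task needs none

    def to_record(self) -> dict:
        # Reject empty required fields before they reach a training run.
        if not self.instruction.strip():
            raise ValueError("instruction must not be empty")
        if not self.output.strip():
            raise ValueError("output must not be empty")
        fields = asdict(self)
        # Emit keys in the conventional instruction / input / output order.
        return {key: fields[key] for key in ("instruction", "input", "output")}


# Placeholder strings, only here to show the shape of a record.
example = AlpacaExample(
    instruction="Placeholder instruction",
    input="Placeholder context",
    output="Placeholder response",
)
record = example.to_record()
```

Keeping `input` as an empty string rather than dropping the key is one possible choice; it keeps every record the same shape.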

Example 1: The Studio Pipeline (Teaching an AI strict formatting)

Fine-tuning is perfect for teaching an AI your exact studio naming conventions or pipeline rules.

• Instruction: Convert the following render details into the studio's standard file naming convention.

• Input: Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.

• Output: prj_nikeDN8_sh012_v004_fx.exr
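When a rule like this is mechanical, one way to get many consistent examples is to generate them from structured fields instead of typing each one by hand. The sketch below is only an illustration: the functions `build_filename` and `make_example` are hypothetical, and the naming rules inside them are guessed from the single example above, not taken from any real studio specification.

```python
def build_filename(project: str, shot: int, version: int, pass_type: str) -> str:
    # Assumed rule: lowercase the first word of the project name,
    # keep the remaining words as written, and drop the spaces.
    words = project.split()
    project_tag = words[0].lower() + "".join(words[1:])
    # Assumed rule: the pass tag is the first word of the type, lowercased.
    pass_tag = pass_type.split()[0].lower()
    # Shot and version are zero-padded to three digits, as in the example.
    return f"prj_{project_tag}_sh{shot:03d}_v{version:03d}_{pass_tag}.exr"


def make_example(project: str, shot: int, version: int, pass_type: str) -> dict:
    # Build one instruction / input / output record from structured fields.
    return {
        "instruction": (
            "Convert the following render details into the studio's "
            "standard file naming convention."
        ),
        "input": f"Project: {project}. Shot: {shot}. Version: {version}. Type: {pass_type}.",
        "output": build_filename(project, shot, version, pass_type),
    }


example = make_example("Nike DN8", 12, 4, "FX Pass")
```

Calling `make_example` in a loop over real shot data would produce a dataset whose outputs always follow the same rule, which matters when the goal is to teach strict formatting.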

---------------------------------------------------------------------------------------------------------

Example 2: The Client Feedback Translator (Teaching an AI to structure unstructured data)

This shows how an AI can take messy, human text and turn it into organized, actionable data for a project management tool.

• Instruction: Convert the following unstructured client feedback into an actionable task list with specific department tags.

• Input: "Can we make the eternal fire look a bit more orange? Also, the smoke in the stealth shot is too thick, and the cloud giants need to move slower."

• Output:

◦ [Lighting/FX] Eternal Fire: Color correct fire simulation to increase orange hues.

◦ [FX] Stealth: Reduce density of smoke volume.

◦ [Animation] Cloud Giants: Retime animation to slow down movement.
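An output with several lines, like the task list above, has to fit into a single JSON string. A minimal sketch of one way to do that in Python follows; joining the tasks with plain newlines and dropping the surrounding quotation marks from the feedback are assumptions made here, not requirements of the format.

```python
import json

feedback = (
    "Can we make the eternal fire look a bit more orange? Also, the smoke in "
    "the stealth shot is too thick, and the cloud giants need to move slower."
)
tasks = [
    "[Lighting/FX] Eternal Fire: Color correct fire simulation to increase orange hues.",
    "[FX] Stealth: Reduce density of smoke volume.",
    "[Animation] Cloud Giants: Retime animation to slow down movement.",
]

record = {
    "instruction": (
        "Convert the following unstructured client feedback into an "
        "actionable task list with specific department tags."
    ),
    "input": feedback,
    # Assumption: one task per line inside a single output string.
    "output": "\n".join(tasks),
}

# json.dumps escapes the line breaks as \n, so the record stays on one line.
line = json.dumps(record, ensure_ascii=False)

# Reading the line back restores the original multi-line output.
restored = json.loads(line)
```

Letting a JSON library handle the escaping avoids the hand-typed mistakes (unescaped quotes, raw line breaks) that make a dataset file unreadable.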

The Formatting Headache: Understanding the JSON Structure

To actually use this data to fine-tune an AI model, you have to convert those plain-English examples into a structured data file.

Most popular fine-tuning tools for open-source models (like LLaMA) expect your dataset to be formatted as a JSON (JavaScript Object Notation) or JSONL (JSON Lines) file. Specifically, they often use what is known as the "Alpaca format."
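The difference between the two file types is small: a JSON file holds one array containing every example, while a JSONL file holds one example per line with no surrounding brackets. The sketch below shows both using Python's standard `json` module; the second record is a placeholder, and the variable names are arbitrary.

```python
import json

records = [
    {
        "instruction": (
            "Convert the following render details into the studio's "
            "standard file naming convention."
        ),
        "input": "Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.",
        "output": "prj_nikeDN8_sh012_v004_fx.exr",
    },
    # Placeholder record, only here to show more than one entry.
    {"instruction": "Placeholder instruction", "input": "", "output": "Placeholder response"},
]

# JSON: a single array that contains every example.
json_text = json.dumps(records, indent=2)

# JSONL: one complete JSON object per line, no enclosing brackets or commas.
jsonl_text = "".join(json.dumps(item) + "\n" for item in records)

# Both forms carry the same data once parsed.
from_json = json.loads(json_text)
from_jsonl = [json.loads(row) for row in jsonl_text.splitlines()]
```

Which of the two a given training tool wants depends on the tool, so it is worth checking its documentation before exporting.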

If we take one of the examples above, here is exactly how it needs to look in JSON format for the AI to read it:

[
  {
    "instruction": "Convert the following render details into the studio's standard file naming convention.",
    "input": "Project: Nike DN8. Shot: 12. Version: 4. Type: FX Pass.",
    "output": "prj_nikeDN8_sh012_v004_fx.exr"
  }
]
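If you do write this by hand, a quick check before training can catch the usual problems such as a missing comma or a forgotten field. Here is a minimal sketch of such a check; the function name `validate_dataset` and the exact rules it enforces (required `instruction` and `output`, optional `input`) are assumptions for illustration, not the behavior of any particular training tool.

```python
import json

REQUIRED_KEYS = ("instruction", "output")


def validate_dataset(text: str) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        # Syntax errors (missing commas, stray brackets) end up here.
        return [f"invalid JSON at line {err.lineno}, column {err.colno}: {err.msg}"]
    if not isinstance(data, list):
        return ["top level must be a JSON array of examples"]

    problems = []
    for index, item in enumerate(data):
        if not isinstance(item, dict):
            problems.append(f"example {index}: not a JSON object")
            continue
        for key in REQUIRED_KEYS:
            value = item.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {index}: missing or empty '{key}'")
        # 'input' is optional, but when present it should be a string.
        if not isinstance(item.get("input", ""), str):
            problems.append(f"example {index}: 'input' must be a string")
    return problems


good = '[{"instruction": "a", "input": "", "output": "b"}]'
missing_comma = '[{"instruction": "a" "output": "b"}]'
missing_output = '[{"instruction": "a", "input": ""}]'

good_problems = validate_dataset(good)
comma_problems = validate_dataset(missing_comma)
output_problems = validate_dataset(missing_output)
```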

A No-Code JSON Export Tool

To solve this, I built the Online AI Dataset Builder.

Instead of typing out JSON brackets and hunting for missing commas, you simply paste your text into a clean user interface. The Online AI Dataset Builder structures your inputs into valid JSON that is ready for fine-tuning, whether you run training from the terminal or in software like LlamaFactory.

Rahul Gupta
FX ARTIST

rahul18gpt@gmail.com
