
Engineering a Personal Prompt System: Library, Versioning, and Automated Tests
A technical guide to treating prompts like code. Build a repository, standardized frameworks, and an automated testing harness.
If you are pasting raw, unstructured text into ChatGPT and hoping for the best, you aren't doing prompt engineering. You’re gambling.
As developers, we don't write code without version control. We don't deploy without testing. Yet, most people treat AI interaction as a stream-of-consciousness chat log. This approach is unscalable and produces inconsistent results.
To 2× your output quality, you need to shift your mental model: Prompts are code. They require structure, storage, and evaluation.
In this build log, I’m going to show you how to set up a personal "Prompt Ops" system. We will build a structured library, apply a strict framework, and write a Python harness to test prompt performance.
1. The Theory: Structure Beats Creativity
Before we write the harness, we need to standardize the input. The biggest drop in LLM performance comes from context leakage and vague constraints.
I use a modified version of the R-C-C-O Framework for every system prompt I build. If a prompt doesn't have these four components, it doesn't enter my library.
- Role (R): Who is the model? (e.g., "Senior Python Backend Engineer").
- Context (C): What is the background? (e.g., "Refactoring a legacy Flask app to FastAPI").
- Constraints (C): What is forbidden? (e.g., "No external libraries other than Pydantic. Max 50 lines of code.").
- Output (O): What is the exact format? (e.g., "Return only JSON. No markdown blocks. No conversational filler.").
The Baseline vs. The Engineered
Baseline (Weak):
"Write me a python script to scrape a website."
Engineered (R-C-C-O):
ROLE: Expert Web Scraper & Data Engineer.
CONTEXT: Building a robust scraper for a news site that changes DOM structure frequently.
CONSTRAINTS: Use Selenium with headless Chrome. Must handle pagination (Next button). Add 2-second random sleep delays. Error handling for timeouts is mandatory.
OUTPUT: Provide only the Python class structure. No usage examples.
The second prompt yields production-ready code. The first yields a tutorial snippet.
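If you want to enforce the framework rather than just remember it, you can encode R-C-C-O as a small helper so a prompt literally cannot be assembled with a missing component. This is a minimal sketch; the RCCOPrompt dataclass and its build() method are illustrative names of my own, not part of any library.
from dataclasses import dataclass

@dataclass
class RCCOPrompt:
    """A prompt only renders if all four R-C-C-O components are present."""
    role: str
    context: str
    constraints: str
    output: str

    def build(self) -> str:
        # Refuse to render if any component is empty.
        for name, value in vars(self).items():
            if not value.strip():
                raise ValueError(f"Missing R-C-C-O component: {name}")
        return (
            f"ROLE: {self.role}\n"
            f"CONTEXT: {self.context}\n"
            f"CONSTRAINTS: {self.constraints}\n"
            f"OUTPUT: {self.output}"
        )

prompt = RCCOPrompt(
    role="Expert Web Scraper & Data Engineer.",
    context="Building a robust scraper for a news site that changes DOM structure frequently.",
    constraints="Use Selenium with headless Chrome. Handle pagination. 2-second random sleeps. Timeout handling is mandatory.",
    output="Provide only the Python class structure. No usage examples.",
)
print(prompt.build())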
2. The Build: Your Prompt Library
Don’t store prompts in Apple Notes or a text file. We need structured data that we can pull programmatically.
For this build, we will use a JSON-based repository. This allows us to inject prompts into our applications or test scripts easily.
Step 1: The prompts.json Structure
Create a file named prompts.json. This serves as your single source of truth.
[
  {
    "id": "code-refactor-v1",
    "description": "Strict refactoring for clean code compliance",
    "template": "You are a Clean Code expert. Refactor the following function to adhere to PEP8 and DRY principles.\n\nCODE:\n{{user_code}}\n\nCONSTRAINTS:\n1. Add docstrings.\n2. Type hint inputs/outputs.\n3. Return ONLY code."
  },
  {
    "id": "email-summarizer-v2",
    "description": "Summarize tech newsletters",
    "template": "You are an executive assistant. Summarize this text into 3 bullet points. Focus on actionable insights only.\n\nTEXT:\n{{newsletter_text}}"
  }
]
Note the use of {{variable}} placeholders. This separates the system instruction from the dynamic data.
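For quick checks you don't need a templating engine; a few lines of standard-library Python will substitute the placeholders. The render() helper below is an illustrative name of my own, and it assumes every placeholder follows the {{name}} convention used in prompts.json.
import json
import re

def render(template: str, **variables: str) -> str:
    """Replace {{name}} placeholders with the supplied keyword arguments."""
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in variables:
            raise KeyError(f"No value provided for placeholder: {key}")
        return variables[key]
    return re.sub(r"\{\{(\w+)\}\}", substitute, template)

with open("prompts.json", "r") as f:
    library = json.load(f)

summarizer = next(p for p in library if p["id"] == "email-summarizer-v2")
print(render(summarizer["template"], newsletter_text="Paste a newsletter here..."))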
3. The Test Harness
Now, let's automate the testing. We want to see how the model handles specific inputs using our prompt library.
I’ll use Python and the OpenAI SDK (or Anthropic/local LLM via LangChain) to iterate through our library.
Prerequisites
pip install openai python-dotenv rich
The Script: test_harness.py
This script loads your library, injects test data, runs the prompt, and saves the output for review.
import json
import os
from openai import OpenAI
from dotenv import load_dotenv
from rich.console import Console
from rich.markdown import Markdown

load_dotenv()
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
console = Console()

# 1. Load the Library
def load_prompts(filepath="prompts.json"):
    with open(filepath, "r") as f:
        return json.load(f)

# 2. The Generator
def run_prompt(template, variable_data):
    # Inject data into the {{placeholder}}
    # In a real app, use Jinja2. Here, simple replace works.
    prompt_content = template.replace("{{user_code}}", variable_data)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a helpful engine."},
            {"role": "user", "content": prompt_content}
        ],
        temperature=0.2  # Low temp for deterministic coding tasks
    )
    return response.choices[0].message.content

# 3. The Test Case
test_code_snippet = """
def calc(a,b):
    print(a+b)
    return a+b
"""

def main():
    library = load_prompts()
    # Filter for a specific prompt ID
    target_prompt = next(p for p in library if p["id"] == "code-refactor-v1")
    console.print(f"[bold green]Testing Prompt:[/bold green] {target_prompt['id']}")
    result = run_prompt(target_prompt['template'], test_code_snippet)
    # 4. Log Results
    console.print("[bold blue]Output:[/bold blue]")
    console.print(Markdown(result))
    # Save for diffing later (make sure the logs/ directory exists)
    os.makedirs("logs", exist_ok=True)
    with open(f"logs/{target_prompt['id']}_log.md", "w") as f:
        f.write(result)

if __name__ == "__main__":
    main()

4. Evaluating the Output (The Loop)
Running the script is easy. Knowing if the output is good is the hard part. In AI engineering, we call this "Evals."
For a personal system, you don't need a complex vector database evaluation immediately. You need a Diff View.
The "Before/After" Workflow
- Run V1: Execute the script with your current prompt. Save the output to log_v1.md.
- Iterate: Modify prompts.json. Add a constraint (e.g., "Use Google style docstrings"). Change the ID to v2.
- Run V2: Execute against the exact same test_code_snippet.
- Compare: Open both files in VS Code and use the "Compare Selected" feature, or diff them from a script as shown below.
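If you'd rather script the comparison than click through an editor, Python's standard difflib can produce a unified diff of two saved runs. The file names below assume you logged the two versions under logs/ as in the harness above.
import difflib

# Compare two saved prompt runs as a unified diff.
with open("logs/code-refactor-v1_log.md") as f:
    v1 = f.readlines()
with open("logs/code-refactor-v2_log.md") as f:
    v2 = f.readlines()

diff = difflib.unified_diff(v1, v2, fromfile="v1", tofile="v2")
print("".join(diff))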
If V2 hallucinations are lower or the formatting is tighter, commit the change to your JSON file.
Advanced: LLM-as-a-Judge
If you want to automate the grading, you can write a second prompt that acts as a judge.
def grade_output(original_prompt, generated_output):
    grading_prompt = f"""
    You are a QA Engineer. Grade the following code snippet on a scale of 1-10 based on PEP8 compliance.
    CODE:
    {generated_output}
    Return JSON: {{ "score": int, "reason": string }}
    """
    # ... call LLM with this grading_prompt ...
This allows you to run a suite of 50 test inputs and get a spreadsheet of scores back, telling you objectively if your prompt changes improved the system or broke it.
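Here is one way to finish that function and wire it into a small suite runner. This is a sketch, not a fixed recipe: it assumes the client and run_prompt from test_harness.py are in scope, and the run_suite helper, the CSV columns, and the use of response_format to force JSON output are my own choices.
import csv
import json

def grade_output(original_prompt, generated_output):
    """LLM-as-a-Judge: a second model call scores the generated code."""
    grading_prompt = f"""
    You are a QA Engineer. Grade the following code snippet on a scale of 1-10 based on PEP8 compliance.
    The snippet was produced by this prompt:
    {original_prompt}
    CODE:
    {generated_output}
    Return JSON: {{ "score": int, "reason": string }}
    """
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": grading_prompt}],
        response_format={"type": "json_object"},  # ask for parseable JSON back
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)

def run_suite(template, test_inputs, outfile="logs/scores.csv"):
    """Grade every test input and write the scores to a CSV (your 'spreadsheet')."""
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["test_case", "score", "reason"])
        for i, test_input in enumerate(test_inputs):
            output = run_prompt(template, test_input)
            grade = grade_output(template, output)
            writer.writerow([i, grade["score"], grade["reason"]])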
5. Summary
Building a personal prompt system separates the casual user from the AI Engineer.
- Centralize: Move prompts out of chat history and into a JSON or code file.
- Structure: Use the Role-Context-Constraints-Output framework.
- Automate: Use a script to run prompts against fixed test cases.
- Evaluate: Compare versions to ensure quality is actually increasing.
Start small. Create your prompts.json today and add your three most used workflows. Treat it like a software project, and the quality of your output will compound.