Tuesday, December 2, 2025

Open Source AI Models Building a Development Team

The Question We're Finally Asking

For years we've debated: Can AI replace software engineers? The question was always a bit theatrical. The real question—the one that actually matters—is a different one entirely: Can AI augment the engineering process in ways that make better code happen faster?

I think we're closer to a practical answer than we realize.

There's a concept that's been brewing in the open source and commercial AI spaces, one that mirrors something we've known in software engineering for decades: diverse perspectives catch what homogeneous ones miss. Single engineers make mistakes. Teams catch them. The question becomes: can we build a team out of AI models, each with distinct expertise, and orchestrate them to produce better outcomes?

I've been working on a proof of concept with this team-based approach. I started back in August 2025 and picked it up again recently. It's a component of my broader ApocryiaAI framework (apocryia.com will be the public-facing frontend).

For this POC, what I've built is a set of Python scripts that integrate with our private backend infrastructure and Percona database, orchestrating local open source models to work as a unified team. They collaborate to solve whatever task is requested, each bringing a specialized perspective and expertise. All of the team communication is visible in real time via a private IRC server, allowing me to observe the interactions, understand their reasoning process, and even interject during the workflow when needed (a rough sketch of that relay appears below). What I'm describing here is specifically that IRC-based autonomous development team.

It's running, it's working, and the results are worth thinking about. Yes, it's still a proof of concept, and I'm the first to admit that. But this concept supports what I try to do for my team and myself: work smarter, not harder. This isn't AI replacing developers. It's AI working alongside developers, offering diverse opinions from different models and points of view.
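
To give a sense of what that relay looks like, here is a minimal sketch of pushing an agent's output into an IRC channel over a raw socket. The server address, channel name, and helper names are hypothetical, and a production client would also need to answer server PINGs and wait for registration to finish; this is an illustration, not my actual scripts.

import socket

# Hypothetical values -- substitute your own private IRC server and channel.
IRC_HOST = "irc.internal.example"
IRC_PORT = 6667
CHANNEL = "#ai-team"


def irc_connect(nick: str) -> socket.socket:
    """Open a plain-text IRC connection and join the team channel.

    Note: a real client should also wait for the server welcome (001) and
    reply to PING messages with PONG to stay connected.
    """
    sock = socket.create_connection((IRC_HOST, IRC_PORT))
    sock.sendall(f"NICK {nick}\r\n".encode())
    sock.sendall(f"USER {nick} 0 * :{nick}\r\n".encode())
    sock.sendall(f"JOIN {CHANNEL}\r\n".encode())
    return sock


def relay(sock: socket.socket, agent: str, message: str) -> None:
    """Post one agent's output to the channel, one PRIVMSG per line."""
    for line in message.splitlines():
        sock.sendall(f"PRIVMSG {CHANNEL} :[{agent}] {line}\r\n".encode())


if __name__ == "__main__":
    sock = irc_connect("apocryia-bot")
    relay(sock, "Reviewer", "Rating: 9/10 - Professional argument handling.")
    sock.close()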

The Architecture: A Team, Not a Model

The architecture is deceptively simple but conceptually important. Instead of throwing a single large language model at a development task and hoping it produces good code, we've created four specialized roles:

  1. ProjectManager (qwen2.5-coder) - Creates detailed implementation plans
  2. Developer (deepseek-coder-v2) - Generates complete, runnable code
  3. Reviewer (apocryiaai-unified - A private model I trained with MIT business and computer science grad-level data) - Evaluates code for correctness and business logic
  4. Lucifer (mistral) - Challenges everything, finds edge cases and flaws

Each model is chosen specifically for that role. The ProjectManager isn't trying to write code—it's thinking about requirements, dependencies, and success criteria. The Developer is optimized for code generation. The Reviewer has business-level understanding. And Lucifer's job is to be difficult.

This is not novel in human teams. It's basic team dynamics. What's novel is making it work with open source models that fit on a local machine.
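
To make the wiring concrete, here is an illustrative sketch of how the roles map to their Ollama model tags. The dict name and structure are mine, not the actual framework config; the tags match the models described later in this post.

# Illustrative role-to-model mapping; a sketch, not the real ApocryiaAI config.
AGENTS = {
    "ProjectManager": {
        "model": "qwen2.5-coder:7b",
        "job": "Turn the task into a plan: requirements, steps, risks, success criteria.",
    },
    "Developer": {
        "model": "deepseek-coder-v2:latest",
        "job": "Produce complete, runnable code for the planned task.",
    },
    "Reviewer": {
        "model": "apocryiaai-unified:latest",
        "job": "Rate the code 1-10 for correctness and business logic.",
    },
    "Lucifer": {
        "model": "mistral:latest",
        "job": "Attack the code: edge cases, abuse cases, fundamental flaws.",
    },
}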


The Workflow: Iteration With Purpose

Here's where it gets interesting. The workflow doesn't end with code generation. It's iterative (a rough code sketch of the full loop follows the cycles below):

Cycle 1:

  • Developer writes initial code
  • Reviewer evaluates it (rating it 1-10)
  • Lucifer challenges the code with critical analysis

Decision Logic:

  • If Reviewer gives it 8+/10 AND Lucifer says it's acceptable → Approve
  • If either agent rejects it → Trigger refinement

Cycle 2 (if needed):

  • Developer refines code based on specific feedback from both agents
  • Reviewer re-evaluates the refined code
  • Lucifer reviews the improvements
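
Put roughly into code, the loop looks like the sketch below: a minimal illustration assuming a local Ollama instance on its default port with the four models pulled. The helper names, prompts, and the simple approval check are stand-ins, not the actual ApocryiaAI orchestration code.

import re
import requests  # assumes a local Ollama instance on the default port

OLLAMA_URL = "http://localhost:11434/api/generate"
MODELS = {  # same role-to-model mapping sketched earlier
    "ProjectManager": "qwen2.5-coder:7b",
    "Developer": "deepseek-coder-v2:latest",
    "Reviewer": "apocryiaai-unified:latest",
    "Lucifer": "mistral:latest",
}


def run_agent(role: str, prompt: str) -> str:
    """Send one prompt to the role's model and return its full response text."""
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODELS[role], "prompt": prompt, "stream": False},
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["response"]


def extract_rating(review: str) -> int:
    """Pull the score out of text like 'Rating: 7/10'; 0 if no rating is found."""
    match = re.search(r"(\d+)\s*/\s*10", review)
    return int(match.group(1)) if match else 0


def looks_approved(text: str) -> bool:
    """Placeholder approval check; the approval section later in this post
    explains why this has to be stricter than naive keyword matching."""
    tail = text.strip().lower()[-200:]
    return any(p in tail for p in ("no issues", "approved", "acceptable"))


def develop(task: str, max_cycles: int = 2, threshold: int = 8) -> str:
    """Run the plan -> code -> review/critique -> refine loop."""
    plan = run_agent("ProjectManager", f"Create an implementation plan for: {task}")
    code = run_agent("Developer", f"Task: {task}\nPlan:\n{plan}\nWrite complete, runnable code.")

    for _ in range(max_cycles):
        review = run_agent("Reviewer", f"Rate this code 1-10 and give feedback:\n{code}")
        critique = run_agent("Lucifer", f"Challenge this code: edge cases, flaws, abuse:\n{code}")

        reviewer_ok = extract_rating(review) >= threshold or looks_approved(review)
        if reviewer_ok and looks_approved(critique):
            return code  # both gates passed -> approve

        # Either agent rejected: feed both sets of feedback back to the Developer.
        code = run_agent(
            "Developer",
            f"Refine this code.\nReviewer feedback: {review}\nLucifer feedback: {critique}\n\nCode:\n{code}",
        )

    return code  # out of cycles; left for a human to review rather than auto-approved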

Real Example: Hello World Task

I tested this with a specific task: "Create a Python script that prints 'Hello World' a configurable number of times based on passed input (like: hello_world.py -X 3 would print it 3 times)."


Cycle 1 - Initial Development:

The Developer generated:

import sys


def hello(times):
    for i in range(times):
        print("Hello World")


if __name__ == "__main__":
    if len(sys.argv) > 1:
        try:
            times = int(sys.argv[1])
            hello(times)
        except ValueError:
            print("Please enter a valid integer.")
    else:
        print("Usage: python hello_world.py <number of times>")

The Reviewer evaluated it: Rating: 7/10 with feedback:

"The code works but is missing proper argument handling. It uses positional arguments instead of flags. Should support -X or --number flag format. Error handling could be more robust."


The Lucifer agent challenged it:

"This implementation is naive. What happens if someone passes -5 or 1000000? No bounds checking. The positional argument approach is error-prone. Consider using argparse for professional flag handling."


Decision: Not approved. 7/10 < 8/10 threshold. Trigger refinement.


Cycle 2 - Code Refinement:


Based on feedback from both agents, the Developer refined the code:

import sys
import argparse


def hello(times):
    if times <= 0:
        print("Error: Number must be positive")
        return False
    if times > 1000:
        print("Error: Maximum 1000 iterations allowed")
        return False

    for i in range(times):
        print("Hello World")
    return True


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Print Hello World N times')
    parser.add_argument('-X', '--number', type=int, required=True,
                        help='Number of times to print (1-1000)')

    try:
        args = parser.parse_args()
        success = hello(args.number)
        sys.exit(0 if success else 1)
    except SystemExit:
        raise
    except Exception as e:
        print(f"Error: {e}")
        sys.exit(1)

The Reviewer re-evaluated: Rating: 9/10

"Significant improvement. Now uses proper argparse, validates inputs, has bounds checking, explicit error messages. Professional argument handling. Only minor suggestion: consider logging instead of print for errors."


The Lucifer agent reviewed: Approved

"Much better. Edge cases are handled. Bounds checking prevents abuse. Error messages are clear. No issues with this implementation."

Decision: Approved. Both agents agree. Deploy.


Real iteration. Real improvement. From 7→9 rating. Visible in the logs.


Another Example: "Why is the Sky Blue?"

This example shows the system handling non-coding tasks—explanatory content rather than utility code.

Task: "Write a 50 word sentence about why the sky is blue"


Cycle 1 - Initial Generation:

The Developer generated:

def explain_sky_blue():
    """
    Returns an explanation of why the sky appears blue.
    """
    return "The sky appears blue because of a phenomenon called Rayleigh scattering..."


if __name__ == "__main__":
    print(explain_sky_blue())

The Reviewer evaluated: Rating: 10/10

"Well-structured code with clear documentation and correct scientific explanation. Excellent."


The Lucifer agent challenged:

"As Lucifer, here are my findings: While functionally correct, the explanation lacks depth about wavelength specifics. The code structure is good but could benefit from type hints for better Python practices."

Decision: Despite Reviewer's perfect score, Lucifer insisted on refinement. System triggered improvement.


Cycle 2 - Code Refinement:

Based on Lucifer's feedback, Developer improved:

def why_is_the_sky_blue() -> str:
    """
    Explain the reason behind the sky appearing blue due to Rayleigh scattering and
    other related phenomena.

    Returns:
        str: A detailed explanation of why the sky appears blue, including information
             about Rayleigh scattering and its role in color dispersion.
    """
    return (
        "The sky appears blue because of a phenomenon called Rayleigh scattering. "
        "When sunlight enters Earth's atmosphere, shorter wavelength (blue) light is "
        "scattered more by air molecules than longer wavelength (red/yellow) light. "
        "This causes the blue color we see in the sky."
    )


if __name__ == "__main__":
    print(why_is_the_sky_blue())

The Reviewer re-evaluated: Rating: 8/10

"Improved significantly. Type hints added, docstring is comprehensive, explanation is clear and scientifically accurate."


The Lucifer agent approved:

"Much better. The technical details about wavelength are now clear. Code follows Python best practices. This is a solid implementation."


Decision: Both agents approve. Task completed.

Verified Output:

The sky appears blue because of a phenomenon called Rayleigh scattering. When sunlight enters Earth's atmosphere, shorter wavelength (blue) light is scattered more by air molecules than longer wavelength (red/yellow) light. This causes the blue color we see in the sky.

What This Example Shows:

  • The system handles diverse task types (not just utilities)
  • Even a "perfect" 10/10 from Reviewer doesn't bypass the approval gate
  • Lucifer's critical eye catches improvements that pure quality metrics miss
  • Type hints, docstrings, and clarity matter to the team
  • Code goes through refinement even when it works, pushing toward excellence

Why This Matters: The Approval Problem

Here's something most AI code generation tools gloss over: How do you know when code is actually ready?

Most systems have a single decision gate: "Is this acceptable yes/no?" That's the wrong question. The better question is: "Have multiple perspectives—operating from different priorities and expertise—agreed this is good?"

The approval logic in ApocryiaAI requires both the Reviewer and Lucifer to explicitly approve. Not a loose "looks fine" but explicit agreement:

  • Reviewer must give it a rating of 8/10 or higher, OR explicitly say "approved/looks good"
  • Lucifer must explicitly say "no issues/acceptable/approved"

This creates a natural tension. The Reviewer wants the code to work correctly and follow best practices. Lucifer wants to find what's wrong. Code that satisfies both perspectives has genuinely passed multiple tests.


Why Explicit Approval Matters: A Cautionary Tale

This is harder than you'd think. We initially had a system that used loose keyword matching for approval. Words like "looks good" would trigger approval even when the model was just introducing its analysis. Here's an example of what went wrong:


Initial (Broken) System:

Lucifer: "In order to provide a comprehensive review, I'll delve deeper into
the edge cases. The input validation looks good in principle..."

System detected: "looks good" → APPROVED ✅ (WRONG!)

Lucifer was about to identify critical issues, but the system approved the code prematurely because it detected the phrase "looks good" mid-sentence as the model was introducing its analysis.


Fixed System: Now we require explicit approval phrases only when they appear as standalone conclusions:

Lucifer: "After thorough analysis, no issues found. This implementation
is acceptable and ready for deployment."

System detected: "no issues found" + "acceptable" → APPROVED ✅ (CORRECT!)

The difference? We distinguish between:

  • Positive mentions in analysis: "This approach looks good, but..." (not approval)
  • Explicit approval conclusions: "No issues. This is approved." (approval)

This seemingly small change prevents false positives where models talk about good code while actually criticizing it.
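
Here's a rough sketch of that distinction. The phrase lists and the "last two sentences" heuristic are illustrative simplifications, not the actual matcher:

import re

# Phrases that count as approval only when they appear in the concluding
# sentences of a response, not buried mid-analysis. Illustrative list.
APPROVAL_PHRASES = ("no issues", "approved", "acceptable", "looks good")


def is_explicit_approval(response: str, tail_sentences: int = 2) -> bool:
    """Approve only when an approval phrase appears in the final sentences
    and the conclusion isn't hedged with a 'but'/'however' style caveat."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", response) if s.strip()]
    conclusion = " ".join(sentences[-tail_sentences:]).lower()

    has_approval = any(phrase in conclusion for phrase in APPROVAL_PHRASES)
    has_caveat = re.search(r"\b(but|however|although)\b", conclusion) is not None
    return has_approval and not has_caveat


# A mid-analysis mention of "looks good" followed by criticism is not approval...
print(is_explicit_approval(
    "The input validation looks good in principle. "
    "However, negative values and huge counts are never rejected."
))  # False

# ...while a standalone concluding approval is.
print(is_explicit_approval(
    "After thorough analysis, no issues found. "
    "This implementation is acceptable and ready for deployment."
))  # True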


The Practical Side: GPU Memory and Open Source Realities

Here's something I haven't seen discussed enough: Open source models sitting in GPU memory between tasks is wasteful.

We added model unloading via Ollama API calls. After each agent completes its task, we explicitly unload its model from GPU memory. This keeps the system usable on real hardware, not just theoretical deployments.
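
For reference, here is a minimal sketch of that unload step, assuming a local Ollama instance on its default port. Ollama's generate endpoint accepts a keep_alive parameter, and an empty request with keep_alive set to 0 asks it to evict the model from memory right away. The helper name and call site are illustrative.

import requests  # assumes a local Ollama instance on the default port

OLLAMA_URL = "http://localhost:11434/api/generate"


def unload_model(model: str) -> None:
    """Ask Ollama to evict `model` from GPU memory immediately.

    An empty generate request with keep_alive=0 tells Ollama not to keep the
    model resident, freeing VRAM before the next agent's model is loaded.
    """
    resp = requests.post(OLLAMA_URL, json={"model": model, "keep_alive": 0}, timeout=60)
    resp.raise_for_status()


# e.g. after the Reviewer finishes and before Lucifer's model loads:
unload_model("apocryiaai-unified:latest")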

This is a small detail but reveals something important: we're not building a research project. We're trying to make something that actually runs on machines people have.


Model Selection: Why Each Role Gets Its Specific Model

The models we're using:

ProjectManager: qwen2.5-coder:7b

  • Lightweight (7B parameters) so planning doesn't bottleneck the workflow
  • Excels at breaking tasks into structured plans with dependencies
  • When asked to plan the "Hello World" task, it produced:
      • Clear understanding of requirements (handle variable counts, validate input)
      • Step-by-step plan (arg parsing → validation → output loop)
      • Potential issues (negative numbers, bounds checking)
      • Success criteria (clean exit codes, proper error messages)
  • Not wasted on generating code—just strategic thinking.

Developer: deepseek-coder-v2:latest

  • Largest and most specialized for code generation in our lineup
  • Produces complete, runnable code blocks on first pass
  • Handles complex scaffolding (argparse setup, error handling, proper exit codes)
  • When asked to refine based on feedback, actually understands what "add bounds checking" means and implements it correctly

Reviewer: apocryiaai-unified:latest

  • Rare combination: technical correctness evaluation + business logic understanding
  • Doesn't just say "this code works" but thinks about use cases and edge cases
  • Example feedback on our script: "Professional argument handling. Only minor suggestion: consider logging instead of print for errors."
  • That's not just technical critique—that's production thinking

Lucifer: mistral:latest

  • Sharp critical analysis without being a code expert
  • Asks hard questions: "What happens if someone passes -5 or 1000000?"
  • Thinks about failure modes and abuse cases
  • Doesn't get lost in syntax—focuses on fundamental flaws

All open source. All fit on consumer hardware. None require cloud APIs.


Why This Mix Works Better Than a Single Model

A single large model trying all four roles would either:

  1. Excel at one role, mediocre at others
  2. Produce bloated, slow responses trying to cover everything
  3. Approve its own code (alignment problem—it defends its earlier decisions)

With specialized models:

  • Planning is fast and focused
  • Code generation leverages the best tool available
  • Review is genuinely independent critique
  • Lucifer isn't trying to write code—just finding problems

What Works. What Doesn't. Honest Assessment.

What Actually Works:

  • The iterative refinement genuinely improves code. 7→9 isn't a coincidence.
  • Diverse perspectives catch real issues. When Lucifer finds edge cases, they're usually valid.
  • The approval mechanism creates a quality gate that's harder to game than single-model evaluation.
  • Locally-run models mean no API costs, no privacy concerns, no rate limiting.

What's Still Hard:

  • Computational cost: 4+ LLM calls per task. For trivial tasks, this is overkill.
  • Model reliability: The system depends on models actually being critical and honest. If a model learns to approve things to move forward, the whole thing breaks.
  • Specification problems remain. If the initial requirement is fundamentally wrong, refinement helps but doesn't fix it.
  • Scaling: One successful task doesn't prove it scales across diverse problem types.

What Needs More Data:

  • Do two cycles converge on actually better code, or is that specific to this task?
  • What's the failure rate on production deployments?
  • At what complexity level does the overhead justify the quality improvement?
  • How do these systems perform on different categories of problems (utility scripts, system programming, web backends)?

The Bigger Question: What's This For?

If you're thinking "this seems like a lot of machinery for hello_world.py," you're right.

The value emerges at scale and complexity. Consider:

  1. Team Augmentation - Your actual team has a senior engineer, a junior, and a critical reviewer. Adding an automated adversarial agent (Lucifer) that catches what you'd miss? That scales.
  2. Knowledge Preservation - When the critical feedback is logged, you can learn why code was rejected. Over time, you understand the approval patterns. That's institutional knowledge.
  3. Specification Evolution - The PM learning mechanism captures when critical issues would have been caught by better specifications. Feed that back to requirements.
  4. Local Autonomy - No cloud dependency. No API costs. You control your development pipeline.

The right comparison isn't "can this replace engineers" but "can this augment the engineering process in ways that produce better outcomes per unit of human effort?"

On that question, the early data looks promising.


The Open Source Angle

Here's why open source models matter for this:

You're not dependent on a commercial company's moods about pricing, availability, or model changes. You're not sending your code to external APIs. You're not at risk of waking up to a terms-of-service change that affects your workflow.

The community around Ollama, the models themselves (qwen, deepseek, mistral), and the frameworks we're using are all genuinely open. You can inspect them. You can run them on your hardware. You can contribute back.

That's different from cloud-based AI. It's also different from the single-model approach most people take. It's team-based thinking applied to open source infrastructure.


Where This Goes

The next phase is validation. More diverse tasks. Different problem types. Real production code, not just examples.

We need to understand:

  • Does the approval mechanism hold up when models encounter truly novel situations?
  • How does cost-per-task scale as complexity increases?
  • Can the PM learning feedback actually improve specification quality over time?
  • What happens when the team disagrees and can't converge?

I have hypotheses on these. But hypotheses aren't evidence. Evidence comes from running it.

What I like most about this:

YOU can do it also. You can apply the same concepts to whatever architecture and infrastructure you want. Don't want an IRC server? No problem. I wanted insight into what the team was doing, but you don't have to. Want more team members? Go for it. The concept comes down to using AI to help you work smarter, not harder.

The Philosophy

What we're experimenting with here is building development automation with tools you control, from models you understand, running on hardware you own.

That matters more than people realize.

My AI team concept isn't trying to replace developers, or even me. It's trying to be the kind of colleague who works alongside me, catches bugs, asks hard questions, and pushes back on mediocre code. That colleague exists in every good team. Automation is making it possible to have that colleague always present.

Whether this specific approach is the right one, I'm not sure yet. But the direction—toward distributed expertise, adversarial review, and local autonomy—that direction feels right.

The code is working. The team is functional. The quality improvements are measurable.

Now we find out if it scales. The real work begins now.