Saturday, February 21, 2026

MySQL + Neo4j for AI Workloads: Why Relational Databases Still Matter

So I figured it was about time I documented how to build persistent memory for AI agents using the databases you already know. Not vector databases - MySQL and Neo4j.

This isn't theoretical. I use this architecture daily, handling AI agent memory across multiple projects. Here's the schema and query patterns that actually work.

The Architecture

AI agents need two types of memory:

  • Structured memory - What happened, when, why (MySQL)
  • Pattern memory - What connects to what (Neo4j)

Vector databases are for similarity search. They're not for tracking workflow state or decision history. For that, you need ACID transactions and proper relationships.

The MySQL Schema

Here's the actual schema for AI agent persistent memory:

-- Architecture decisions the AI made
CREATE TABLE architecture_decisions (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_id INT NOT NULL,
    title VARCHAR(255) NOT NULL,
    decision TEXT NOT NULL,
    rationale TEXT,
    alternatives_considered TEXT,
    status ENUM('accepted', 'rejected', 'pending') DEFAULT 'accepted',
    decided_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    tags JSON,
    INDEX idx_project_date (project_id, decided_at),
    INDEX idx_status (status)
) ENGINE=InnoDB;

-- Code patterns the AI learned
CREATE TABLE code_patterns (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_id INT NOT NULL,
    category VARCHAR(50) NOT NULL,
    name VARCHAR(255) NOT NULL,
    description TEXT,
    code_example TEXT,
    language VARCHAR(50),
    confidence_score FLOAT DEFAULT 0.5,
    usage_count INT DEFAULT 0,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    updated_at DATETIME ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_project_category (project_id, category),
    INDEX idx_confidence (confidence_score)
) ENGINE=InnoDB;

-- Work session tracking
CREATE TABLE work_sessions (
    id INT AUTO_INCREMENT PRIMARY KEY,
    session_id VARCHAR(255) UNIQUE NOT NULL,
    project_id INT NOT NULL,
    started_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    ended_at DATETIME,
    summary TEXT,
    context JSON,
    INDEX idx_project_session (project_id, started_at)
) ENGINE=InnoDB;

-- Pitfalls to avoid (learned from mistakes)
CREATE TABLE pitfalls (
    id INT AUTO_INCREMENT PRIMARY KEY,
    project_id INT NOT NULL,
    category VARCHAR(50),
    title VARCHAR(255) NOT NULL,
    description TEXT,
    how_to_avoid TEXT,
    severity ENUM('critical', 'high', 'medium', 'low'),
    encountered_count INT DEFAULT 1,
    created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_project_severity (project_id, severity)
) ENGINE=InnoDB;
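
Writes are plain INSERTs. For example, recording a decision (the values here are illustrative):

-- Record a decision the agent just made (example values)
INSERT INTO architecture_decisions
  (project_id, title, decision, rationale, alternatives_considered, tags)
VALUES
  (1,
   'Use FastAPI',
   'Adopt FastAPI for the agent''s HTTP layer',
   'Async support and automatic OpenAPI docs',
   'Flask, Django REST Framework',
   JSON_ARRAY('api', 'framework'));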

Typed columns. ENUM constraints. Composite indexes. This is what relational databases are good at. The schema above skips FOREIGN KEY and CHECK constraints for brevity; in production you want those too.
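
Assuming a projects table with an id primary key (not shown in this post), the missing constraints look like this - repeat per table as needed:

-- Hypothetical: assumes projects(id) exists
ALTER TABLE code_patterns
  ADD CONSTRAINT fk_patterns_project
    FOREIGN KEY (project_id) REFERENCES projects(id),
  ADD CONSTRAINT chk_confidence
    CHECK (confidence_score BETWEEN 0 AND 1);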

Query Patterns

Here's how you actually query this for AI agent memory:

-- Get recent decisions for context
SELECT title, decision, rationale, decided_at
FROM architecture_decisions
WHERE project_id = ?
  AND decided_at > DATE_SUB(NOW(), INTERVAL 30 DAY)
ORDER BY decided_at DESC
LIMIT 10;

-- Find high-confidence patterns
SELECT category, name, description, code_example
FROM code_patterns
WHERE project_id = ?
  AND confidence_score >= 0.80
ORDER BY usage_count DESC, confidence_score DESC
LIMIT 20;

-- Check for known pitfalls before implementing
SELECT title, description, how_to_avoid
FROM pitfalls
WHERE project_id = ?
  AND category = ?
  AND severity IN ('critical', 'high')
ORDER BY encountered_count DESC;

-- Load session context (session_id is UNIQUE, so this returns at most one row)
SELECT context
FROM work_sessions
WHERE session_id = ?;

These are straightforward SQL queries. EXPLAIN shows index usage exactly where expected. No surprises.
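
Updates are just as plain. When the agent reuses a pattern or hits a known pitfall again, it reinforces the record. A sketch of one way to maintain the counters (the increment values are arbitrary):

-- Pattern reused successfully: bump usage, nudge confidence up
UPDATE code_patterns
SET usage_count = usage_count + 1,
    confidence_score = LEAST(1.0, confidence_score + 0.05)
WHERE id = ?;

-- Known pitfall encountered again
UPDATE pitfalls
SET encountered_count = encountered_count + 1
WHERE id = ?;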

The Neo4j Layer

MySQL handles the structured data. Neo4j handles the relationships:

// Create nodes for decisions
CREATE (d:Decision {
  id: 'dec_123',
  title: 'Use FastAPI',
  project_id: 1,
  embedding: [0.23, -0.45, ...]  // Vector for similarity
})

// Create relationships (MERGE keeps re-runs from duplicating nodes)
MERGE (d1:Decision {id: 'dec_123'}) ON CREATE SET d1.title = 'Use FastAPI'
MERGE (d2:Decision {id: 'dec_45'}) ON CREATE SET d2.title = 'Used Flask before'
MERGE (d3:Decision {title: 'Avoid frameworks'})
MERGE (d1)-[:SIMILAR_TO {score: 0.85}]->(d2)
MERGE (d1)-[:CONTRADICTS]->(d3)

// Query: Find similar past decisions
MATCH (current:Decision {id: $decision_id})
MATCH (current)-[r:SIMILAR_TO]-(similar:Decision)
WHERE r.score > 0.80
RETURN similar.title, r.score
ORDER BY r.score DESC

// Query: What outcomes followed this pattern?
MATCH (d:Decision)-[:LEADS_TO]->(o:Outcome)
WHERE d.title CONTAINS 'Redis'
RETURN d.title, o.type, o.success_rate

How They Work Together

The flow looks like this:

  1. AI agent generates content or makes a decision
  2. Store structured data in MySQL (what, when, why, full context)
  3. Generate embedding, store in Neo4j with relationships to similar items
  4. Next session: Neo4j finds relevant similar decisions
  5. MySQL provides the full details of those decisions

MySQL is the source of truth. Neo4j is the pattern finder.
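
Step 5 of that flow in SQL: once Neo4j returns the IDs of similar decisions, MySQL supplies the full records. A sketch, assuming the graph node IDs are just the MySQL primary keys with a 'dec_' prefix:

-- Hydrate the decision IDs the SIMILAR_TO query returned
SELECT id, title, decision, rationale, decided_at
FROM architecture_decisions
WHERE id IN (123, 45);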

Why Not Just Vector Databases?

I've seen teams try to build AI agent memory with just Pinecone or Weaviate. It doesn't work well because:

Vector DBs are good for:

  • Finding documents similar to a query
  • Semantic search (RAG)
  • "Things like this"

Vector DBs are bad for:

  • "What did we decide on March 15th?"
  • "Show me decisions that led to outages"
  • "What's the current status of this workflow?"
  • "Which patterns have confidence > 0.8 AND usage_count > 10?"

Those queries need structured filtering, joins, and transactions. That's relational database territory.

MCP and the Future

The Model Context Protocol (MCP) is standardizing how AI systems handle context. Early MCP implementations are discovering what we already knew: you need both structured storage and graph relationships.

MySQL handles the MCP "resources" and "tools" catalog. Neo4j handles the "relationships" between context items. Vector embeddings are just one piece of the puzzle.

Production Notes

Current system running this architecture:

  • MySQL 8.0, 48 tables, ~2GB data
  • Neo4j Community, ~50k nodes, ~200k relationships
  • Query latency: MySQL <10ms, Neo4j <50ms
  • Backup: Standard mysqldump + neo4j-admin dump
  • Monitoring: Same Percona tools I've used for years

The operational complexity is low because these are mature databases with well-understood operational patterns.

Too Much Work? Let AI Build It For You

Look, I get it. This is a lot of schema to set up, a lot of queries to write, a lot of moving parts.

Here's the thing: you don't have to type it all yourself. Copy the schema above, paste it into Claude Code or Kimi CLI, and tell it what you want to build. The AI will generate the Python code, the connection handling, the query patterns - all of it.

If you want to understand what's happening under the hood, start here:

Building a Simple MCP Server in Python

Then let your AI tool do the heavy lifting. That's literally what I did. The schema is mine, the architecture decisions are mine, but the implementation? Claude wrote most of it while I watched and corrected.

Use the tools. That's what they're for.

When to Use What

Use Case                                       Database
Workflow state, decisions, audit trail         MySQL/PostgreSQL
Pattern detection, similarity, relationships   Neo4j
Semantic document search (RAG)                 Vector DB (optional)

Start with MySQL for state. Add Neo4j when you need pattern recognition. Only add vector DBs if you're actually doing semantic document retrieval.

Summary

AI agents need persistent memory. Not just embeddings in a vector database - structured, relational, temporal memory with pattern recognition.

MySQL handles the structured state. Neo4j handles the graph relationships. Together they provide what vector databases alone cannot.

Don't abandon relational databases for AI workloads. Use the right tool for each job; here, that means both together.

For more on the AI agent perspective on this architecture, see the companion post on 3k1o.

MySQL 8.0 JSON Functions: Practical Examples and Indexing

This post is a hands-on walkthrough of MySQL 8.0's JSON functions. JSON support has been in MySQL since 5.7, but 8.0 added a meaningful set of improvements - better indexing strategies, new functions, and multi-valued indexes - that make working with JSON data considerably more practical. What follows documents several of the most commonly needed patterns, including EXPLAIN output and performance observations worth knowing about.

This isn't a "JSON vs. relational" debate post. If you're storing JSON in MySQL, you probably already have your reasons. The goal here is to make sure you're using the available tooling effectively.

Environment

mysql> SELECT @@version, @@version_comment\G
*************************** 1. row ***************************
        @@version: 8.0.36
@@version_comment: MySQL Community Server - GPL

Testing was done on a VM with 8GB RAM and innodb_buffer_pool_size set to 4G. One housekeeping note worth mentioning: query_cache_type is irrelevant in 8.0 since the query cache was removed entirely. If you migrated a 5.7 instance and still have that variable in your my.cnf, remove it — MySQL 8.0 will throw a startup error.

Setting Up a Test Table

The test table simulates a fairly common pattern — an application storing user profile data and event metadata as JSON blobs:

CREATE TABLE user_events (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  user_id     INT UNSIGNED NOT NULL,
  event_data  JSON NOT NULL,
  created_at  DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (id),
  INDEX idx_user (user_id)
) ENGINE=InnoDB;

INSERT INTO user_events (user_id, event_data) VALUES
(1, '{"action":"login","ip":"192.168.1.10","tags":["mobile","vpn"],"score":88}'),
(1, '{"action":"purchase","ip":"192.168.1.10","tags":["desktop"],"score":72,"amount":49.99}'),
(2, '{"action":"login","ip":"10.0.0.5","tags":["mobile"],"score":91}'),
(3, '{"action":"logout","ip":"10.0.0.9","tags":["desktop","vpn"],"score":65}'),
(2, '{"action":"purchase","ip":"10.0.0.5","tags":["mobile"],"score":84,"amount":129.00}');

Basic Extraction: JSON_VALUE vs. JSON_EXTRACT

JSON_VALUE() was introduced in MySQL 8.0.21 and is the cleaner way to extract scalar values with built-in type casting. Before that, you were using JSON_EXTRACT() (or the -> shorthand), unquoting with JSON_UNQUOTE() (or ->>), and casting manually, which works but adds noise to your queries.

-- Pre-8.0.21 approach (JSON_EXTRACT returns quoted JSON strings, hence the unquote)
SELECT user_id,
       JSON_UNQUOTE(JSON_EXTRACT(event_data, '$.action')) AS action,
       CAST(JSON_EXTRACT(event_data, '$.score') AS UNSIGNED) AS score
FROM user_events;

-- Cleaner 8.0.21+ approach
SELECT user_id,
       JSON_VALUE(event_data, '$.action') AS action,
       JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED) AS score
FROM user_events;

Output from the second query:

+---------+----------+-------+
| user_id | action   | score |
+---------+----------+-------+
|       1 | login    |    88 |
|       1 | purchase |    72 |
|       2 | login    |    91 |
|       3 | logout   |    65 |
|       2 | purchase |    84 |
+---------+----------+-------+
5 rows in set (0.00 sec)

The RETURNING clause is genuinely useful. It eliminates the awkward double-cast pattern and makes intent clearer when reading query code later.
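
Related shorthand worth knowing: ->> is JSON_UNQUOTE(JSON_EXTRACT(...)) in a single operator, which keeps simple string extractions terse on releases before 8.0.21:

-- Equivalent to JSON_UNQUOTE(JSON_EXTRACT(event_data, '$.action'))
SELECT user_id, event_data->>'$.action' AS action
FROM user_events;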

Multi-Valued Indexes: The Real Game Changer

This is where 8.0 actually moved the needle for JSON workloads. Multi-valued indexes, available since MySQL 8.0.17, let you index array elements inside a JSON column directly. Here's what that looks like in practice:

ALTER TABLE user_events
  ADD INDEX idx_tags ((CAST(event_data->'$.tags' AS CHAR(64) ARRAY)));

Here is what EXPLAIN shows before and after on a query filtering by tag value:

-- Without the multi-valued index:
EXPLAIN SELECT * FROM user_events
WHERE JSON_CONTAINS(event_data->'$.tags', '"vpn"')\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: user_events
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 5
     filtered: 100.00
        Extra: Using where

-- After adding the multi-valued index:
EXPLAIN SELECT * FROM user_events
WHERE JSON_CONTAINS(event_data->'$.tags', '"vpn"')\G

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: user_events
   partitions: NULL
         type: range
possible_keys: idx_tags
          key: idx_tags
      key_len: 67
          ref: NULL
         rows: 2
     filtered: 100.00
        Extra: Using where

Full table scan down to a range scan. On 5 rows this is trivial, but on a table with millions of rows and frequent tag-based filtering, that difference is significant. The improvement scales directly with table size and query frequency.

One important gotcha: MEMBER OF() and JSON_OVERLAPS() also benefit from multi-valued indexes, but JSON_SEARCH() does not. This matters when choosing your query pattern at design time:

-- This WILL use the multi-valued index:
SELECT * FROM user_events
WHERE 'vpn' MEMBER OF (event_data->'$.tags');

-- This will NOT use it:
SELECT * FROM user_events
WHERE JSON_SEARCH(event_data->'$.tags', 'one', 'vpn') IS NOT NULL;
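
JSON_OVERLAPS(), mentioned above, is also index-eligible and covers the "any of these tags" case ('tablet' here is a made-up tag for illustration):

-- Also uses the multi-valued index: rows whose tags intersect the list
SELECT * FROM user_events
WHERE JSON_OVERLAPS(event_data->'$.tags', '["vpn", "tablet"]');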

Aggregating and Transforming JSON

A few aggregation functions worth knowing well:

-- Build a JSON array of actions per user
SELECT user_id,
       JSON_ARRAYAGG(JSON_VALUE(event_data, '$.action')) AS actions
FROM user_events
GROUP BY user_id;

+---------+----------------------+
| user_id | actions              |
+---------+----------------------+
|       1 | ["login","purchase"] |
|       2 | ["login","purchase"] |
|       3 | ["logout"]           |
+---------+----------------------+
3 rows in set (0.01 sec)

-- Summarize into a JSON object keyed by action
SELECT user_id,
       JSON_OBJECTAGG(
         JSON_VALUE(event_data, '$.action'),
         JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED)
       ) AS score_by_action
FROM user_events
GROUP BY user_id;

+---------+--------------------------------+
| user_id | score_by_action                |
+---------+--------------------------------+
|       1 | {"login": 88, "purchase": 72}  |
|       2 | {"login": 91, "purchase": 84}  |
|       3 | {"logout": 65}                 |
+---------+--------------------------------+
3 rows in set (0.00 sec)

JSON_OBJECTAGG() does not error on duplicate keys within a group; since 8.0.3 it follows "last duplicate key wins" and silently discards the earlier values (a NULL key, by contrast, does raise an error). Silent data loss is worth knowing about before you encounter it in a production ETL pipeline. If a group can contain duplicate keys, deduplicate upstream or handle it in application logic before the data reaches this aggregation step.
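
One way to make the collapse explicit in SQL itself is to aggregate down to one row per key first:

-- Deduplicate before JSON_OBJECTAGG: keep the highest score per action
SELECT user_id,
       JSON_OBJECTAGG(action, max_score) AS score_by_action
FROM (
  SELECT user_id,
         JSON_VALUE(event_data, '$.action') AS action,
         MAX(JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED)) AS max_score
  FROM user_events
  GROUP BY user_id, action
) per_action
GROUP BY user_id;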

Checking SHOW STATUS After JSON-Heavy Queries

When evaluating query patterns, checking handler metrics is a useful habit:

FLUSH STATUS;

SELECT * FROM user_events
WHERE JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED) > 80;

SHOW STATUS LIKE 'Handler_read%';

+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_read_first         | 1     |
| Handler_read_key           | 0     |
| Handler_read_last          | 0     |
| Handler_read_next          | 4     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 6     |
+----------------------------+-------+
7 rows in set (0.00 sec)

The Handler_read_rnd_next value confirms a full scan — no surprise since there's no functional index on the score value. For score-based filtering at scale, a generated column with an index is the right answer:

ALTER TABLE user_events
  ADD COLUMN score_val TINYINT UNSIGNED
    GENERATED ALWAYS AS (JSON_VALUE(event_data, '$.score' RETURNING UNSIGNED)) VIRTUAL,
  ADD INDEX idx_score (score_val);

After adding that, the same query drops to a proper index range scan. Generated columns on JSON fields are available in both MySQL 8.0 and Percona Server 8.0, and they remain the most reliable path for scalar JSON field filtering at any meaningful scale.
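
Verification is cheap, and filtering on the generated column directly makes index eligibility unambiguous:

-- Expect type: range, key: idx_score in the EXPLAIN output
EXPLAIN SELECT id, user_id, score_val
FROM user_events
WHERE score_val > 80\G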

If you're running Percona Server, pt-query-digest from the Percona Toolkit is still the most practical way to identify which JSON-heavy queries are actually causing pain in production before you start adding indexes speculatively.

Practical Observations

  • Multi-valued indexes (8.0.17+) are a long overdue improvement and work well when your query patterns align with JSON_CONTAINS() or MEMBER OF()
  • JSON_VALUE() with RETURNING (8.0.21+) is cleaner than the old cast-after-extract pattern and worth adopting consistently
  • Generated columns plus indexes remain the most reliable path for scalar JSON field filtering at scale
  • Watch for JSON_OBJECTAGG() duplicate keys in grouped data - since 8.0.3 the last value silently wins, which is easy to miss in testing if your sample data happens to be clean
  • Always verify index usage with EXPLAIN — the optimizer doesn't always pick up multi-valued indexes in complex WHERE clauses, and it's worth confirming rather than assuming

Summary

MySQL 8.0's JSON improvements are genuinely useful, particularly multi-valued indexes and JSON_VALUE() with type casting. They don't replace good schema design, but for cases where JSON storage is appropriate or inherited, you now have real tools to work with rather than just hoping the optimizer figures it out. The generated column pattern in particular is worth evaluating early if you know certain JSON fields will be used in WHERE clauses regularly.
