# Research: Structured JSON Output from Local LLMs (March 2026)

## Key Finding: Switch Ollama from JSON Mode to Schema-Enforced Format

Your current `ollama_backend.py` uses `response_format: json_object` (basic JSON mode). Ollama v0.5+ supports passing the actual JSON Schema to the `format` parameter for grammar-enforced output. This is the single biggest quick win.

---

## Constrained Decoding Libraries (Ranked)

| Library | Approach | Speed | Best For |
|---------|----------|-------|----------|
| **XGrammar** | PDA-based batch decoding | <40μs/token, up to 100x lower grammar overhead | Production (default in vLLM, SGLang) |
| **llguidance** | Rust, derivative-based | ~50μs/token | Complex grammars (credited by OpenAI) |
| **Outlines** | FSM token masking | Good | Pydantic models, JSON schemas |
| **Instructor** | Validation + retry layer | Backend-dependent | Easy reliability layer on any backend |
| **llama.cpp GBNF** | Grammar-constrained | Good single-user | What Ollama uses under the hood |

**Key insight:** Constrained decoding no longer carries an inherent performance penalty. XGrammar can even speed up end-to-end generation by ~50%, e.g. by jump-forwarding through tokens the grammar forces.

---

## Ollama JSON Capabilities

### Two Levels Available

1. **Basic** (what you use now): `response_format={"type": "json_object"}`
   - Forces valid JSON but NO schema enforcement

2. **Schema-enforced** (what you should use):
   ```python
   from ollama import chat

   # MantaraSchema: the existing Pydantic model for the target output
   response = chat(
       model='qwen2.5-coder:7b',
       messages=[...],
       format=MantaraSchema.model_json_schema(),
   )
   result = MantaraSchema.model_validate_json(response.message.content)
   ```
   - Uses llama.cpp GBNF grammars under the hood
   - Real constrained decoding at token level
   - Caveat: `pattern` (regex) constraints can cause GBNF errors
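Since `pattern` constraints can trip the GBNF conversion, one possible workaround (a sketch, not Ollama-documented behavior) is to strip `pattern` keywords from the schema before passing it to `format`, then re-validate with Pydantic afterward so the regex constraints are still checked:

```python
from copy import deepcopy

def strip_patterns(schema: dict) -> dict:
    """Recursively remove `pattern` keywords from a JSON Schema dict.

    Hypothetical helper: Ollama enforces structure via GBNF, which can
    choke on regex patterns; drop them here and let Pydantic re-check
    regexes after generation via model_validate_json().
    """
    schema = deepcopy(schema)

    def _walk(node):
        if isinstance(node, dict):
            node.pop("pattern", None)
            for value in node.values():
                _walk(value)
        elif isinstance(node, list):
            for item in node:
                _walk(item)

    _walk(schema)
    return schema
```

Pass `strip_patterns(MantaraSchema.model_json_schema())` as `format`, then call `MantaraSchema.model_validate_json(...)` on the response so nothing is silently under-constrained.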

### Limitations
- No batching (single-threaded)
- Less sophisticated than XGrammar for complex schemas
- Can't inject custom logit processors (unlike vLLM)

---

## vLLM Structured Output (Production Path)

vLLM now supports structured output natively via XGrammar (default):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")
completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-32B-Instruct",
    messages=[...],
    extra_body={"guided_json": MantaraSchema.model_json_schema()},
)
```

- 16x+ throughput vs Ollama
- Grammar-enforced, near-zero structural errors
- Supports concurrent users
- Same OpenAI-compatible API
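Launching the server is roughly one command; flag names below follow the vLLM docs but may vary by version (XGrammar is the default backend in recent releases, so the flag is usually optional):

```shell
# Hypothetical launch command; check `vllm serve --help` for your version
vllm serve Qwen/Qwen2.5-32B-Instruct \
    --guided-decoding-backend xgrammar \
    --port 8000
```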

---

## Nesting Depth Problem

The Mantara schema nests 4-5 levels deep: root → menus → submenus → tables → columns/FKs

Research shows LLMs degrade at 5-7 level nesting (17-37% accuracy drop). Strategies:

1. **Multi-step decomposition** (your V2 pipeline — correct approach)
2. **Generate-then-format** — free reasoning first, format second
3. **Section-by-section** — generate skeleton, then fill tables per submenu
4. **SLOT framework** — lightweight model as post-processor (99.5% schema accuracy)
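Strategy 3 can be sketched as a pure function (all names here are illustrative, not the actual Mantara pipeline): a first pass produces a shallow skeleton, then a schema-enforced LLM call fills each submenu's tables one at a time, so no single generation ever sees the full nesting depth.

```python
from typing import Callable

def generate_section_by_section(
    skeleton: dict,
    fill_tables: Callable[[str], list],
) -> dict:
    """Sketch of section-by-section generation.

    `skeleton` names menus/submenus but leaves each submenu's `tables`
    empty; `fill_tables` wraps a schema-enforced LLM call that generates
    tables for one submenu at a time, keeping every call shallow.
    """
    for menu in skeleton.get("menus", []):
        for submenu in menu.get("submenus", []):
            submenu["tables"] = fill_tables(submenu["name"])
    return skeleton
```

Each `fill_tables` call can then use the schema-enforced `format` parameter with a small per-submenu schema instead of the full five-level document.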

---

## Recommended Phased Approach for Mantara

### Phase 1: This Week — Upgrade Ollama Integration
Switch from `json_object` to schema-enforced `format` parameter. Single code change, grammar-level enforcement.

### Phase 2: 1-2 Weeks — Add Instructor Layer
```shell
pip install instructor
```
Automatic Pydantic validation + retry. Replaces some of your manual repair loop logic.
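The core loop Instructor automates looks roughly like this (a Pydantic-only sketch of the validate-and-retry pattern, with the backend call stubbed as a callable; the real library instead wraps an OpenAI-compatible client and takes a `response_model` argument):

```python
from typing import Callable
from pydantic import BaseModel, ValidationError

def generate_validated(
    model_cls: type[BaseModel],
    generate: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
) -> BaseModel:
    """Sketch of the validate-and-retry loop Instructor provides.

    `generate` wraps any backend call returning raw JSON text; on a
    Pydantic validation failure, the error message is fed back into
    the prompt so the model can repair its own output.
    """
    last_error = None
    for _ in range(max_retries):
        raw = generate(prompt)
        try:
            return model_cls.model_validate_json(raw)
        except ValidationError as exc:
            last_error = exc
            prompt = (
                f"{prompt}\n\nYour previous JSON failed validation:\n"
                f"{exc}\nReturn corrected JSON only."
            )
    raise last_error
```

The repair loop in the current codebase that re-prompts on malformed output is exactly what this replaces.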

### Phase 3: When Server Access — Switch to vLLM
Deploy vLLM + Qwen2.5-32B + XGrammar. Same API surface, 16x throughput, grammar-enforced output.

### Phase 4: Long-Term — SLOT-Style Post-Processing
Decouple "thinking about database design" from "producing valid JSON". Use lightweight model (3B) as format enforcer.

---

## Model-Specific JSON Capabilities

| Model | JSON Reliability | Notes |
|-------|-----------------|-------|
| **Hermes 2 Pro 7B** | 91% function calling, 84% JSON | Best small model for structured output |
| **Qwen3 (cloud)** | Native JSON Schema mode | Must disable thinking mode |
| **Qwen2.5-Coder** | Good baseline | Your current model |
| **Functionary** | High | Interprets JSON Schema natively |
| **Ministral-3B** | Designed for function calling | Tiny but capable |

---

## Sources
- dottxt-ai/outlines (GitHub)
- mlc-ai/xgrammar (GitHub)
- guidance-ai/llguidance (GitHub)
- python.useinstructor.com
- docs.vllm.ai/en/latest/features/structured_outputs
- docs.ollama.com/capabilities/structured-outputs
- arXiv 2501.10868 (JSONSchemaBench)
- arXiv 2509.25922 (DeepJSONEval)
- arXiv 2505.04016 (SLOT framework)
