# Codegen Reliability — Implementation Log

Tracks what was built for each phase of `CODEGEN_RELIABILITY_PLAN.md`.
Update this file as each phase ships.

---

## Phase 0 — Telemetry foundation

**Status:** complete  
**Date:** 2026-05-13  
**File changed:** `shared/run_log.py`

### What was added

#### `ValidationCategory`
A `Literal` type alias for the six coarse error buckets:
`syntax`, `missing_import`, `undefined_name`, `type_mismatch`, `contract_violation`, `other`.

#### `classify_error(tool, code) → ValidationCategory`
Maps a tool name + error code string to a category via a priority-ordered prefix
table. First match wins. Coverage:

| Tool | Codes → Category |
|------|-----------------|
| `ruff` | `E9`, `W6` → syntax; `F401`, `F811` → missing_import; `F821` → undefined_name |
| `pyright` | `reportMissingImports`, `reportMissingModuleSource` → missing_import; `reportUndefinedVariable` → undefined_name; `reportGeneralTypeIssues`, `reportArgumentType`, `reportReturnType` → type_mismatch |
| `tsc` | `TS1xxx` → syntax; `TS2304`, `TS2339`, `TS2551` → undefined_name; `TS2307` → missing_import; `TS2322`, `TS2345` → type_mismatch |
| `ast` | `SyntaxError` → syntax |
| any | unmatched → other |
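
The table above can be sketched as a priority-ordered prefix list; the shipped table in `shared/run_log.py` covers more codes, and the names here are illustrative:

```python
# Subset of the coverage table; tuples are (tool, code prefix, category).
# Order matters: first match wins, so narrower prefixes go before wider ones.
_PREFIX_TABLE: list[tuple[str, str, str]] = [
    ("ruff", "E9", "syntax"),
    ("ruff", "W6", "syntax"),
    ("ruff", "F401", "missing_import"),
    ("ruff", "F811", "missing_import"),
    ("ruff", "F821", "undefined_name"),
    ("pyright", "reportMissingImports", "missing_import"),
    ("pyright", "reportUndefinedVariable", "undefined_name"),
    ("tsc", "TS2307", "missing_import"),
    ("tsc", "TS2304", "undefined_name"),
    ("tsc", "TS2322", "type_mismatch"),
    ("tsc", "TS1", "syntax"),          # TS1xxx = parser errors
    ("ast", "SyntaxError", "syntax"),
]

def classify_error(tool: str, code: str) -> str:
    for table_tool, prefix, category in _PREFIX_TABLE:
        if tool == table_tool and code.startswith(prefix):
            return category
    return "other"                      # anything unmatched
```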

#### `ValidationReport` dataclass
Fields: `step`, `file`, `line`, `column`, `code`, `category`, `message`, `tool`.  
Serialises via `.to_dict()`.

#### `RepairAttempt` dataclass
Fields: `step`, `file`, `attempt` (1-based), `errors_in`, `errors_out`,
`success`, `input_tokens`, `output_tokens`.  
Serialises via `.to_dict()`.

#### `StepEvent` extensions
Two new fields appended (default empty list — fully backwards-compatible):
- `validation_errors: list[ValidationReport]`
- `repair_attempts: list[RepairAttempt]`

Two new computed properties:
- `repair_success_rate → float | None` — fraction of attempts that succeeded (`None` when no attempts were recorded)
- `tokens_spent_on_repair → int`

`to_dict()` now emits the plan's per-step success metrics
(`validation_errors_before_repair`, `repair_attempts`, `repair_success_rate`,
`tokens_spent_on_repair`), plus the full `validation_errors` and `repair_detail` arrays.

#### `RunLog.to_markdown()` extensions
- Step table gains **Val errors** and **Repairs** columns.
- A **Validation errors** section appears below the step table whenever any step
  has recorded validation errors (file, line, tool, code, category, message).

### How to use from a step orchestrator

```python
from shared.run_log import ValidationReport, RepairAttempt, classify_error, get_active

log = get_active()
step_event = ...   # the StepEvent yielded by log.step(...)

# Record a validation error
report = ValidationReport(
    step=step_event.name,
    file="orders/handler.py",
    line=10, column=1,
    code="F821",
    category=classify_error("ruff", "F821"),
    message="Undefined name 'Session'",
    tool="ruff",
)
step_event.validation_errors.append(report)

# Record a repair attempt
attempt = RepairAttempt(
    step=step_event.name,
    file="orders/handler.py",
    attempt=1,
    errors_in=[report],
    errors_out=[],      # empty = all errors fixed
    success=True,
    input_tokens=500,
    output_tokens=200,
)
step_event.repair_attempts.append(attempt)
```

---

## Phase 1.1 — Backend validation (step 03)

**Status:** complete  
**Date:** 2026-05-13  
**File changed:** `pipeline/step-03-backend-generation/pipeline/backend_gen/validation_runner.py`

### What was added

#### `ValidationRunner.__init__` — new `step_name` parameter
Optional constructor arg (default `"step-03-backend-generation"`) passed through
to all `ValidationReport`s so RunLog can identify which step generated each error.

#### `validate_collect() → list[ValidationReport]`
Collects every error without raising. Runs:
1. All existing structural checks (`_check_root_files`, `_check_main_py`, etc.)
   in individual try/except blocks — converts caught `ValidationError` and
   `SyntaxError` to `ValidationReport` with appropriate `tool`/`code`/`category`.
2. `ruff check --select F,E9 --output-format json <output_dir>` — parses JSON
   output into `ValidationReport`s. Only runs when `ruff` is on PATH.
3. `pyright --outputjson <output_dir>` — parses JSON output, 0-based lines
   corrected to 1-based. Filters to `severity == "error"` only. Only runs when
   `pyright` is on PATH.
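
The collect-don't-raise shape is the point here; a minimal sketch using plain dicts and `ValueError` as a stand-in for the real `ValidationError`/`ValidationReport` types:

```python
def validate_collect_sketch(structural_checks, tool_runners):
    """Every check runs to completion; failures become report dicts
    instead of exceptions. Stand-in types for illustration only."""
    reports = []
    for check in structural_checks:
        try:
            check()                      # each check individually guarded
        except (ValueError, SyntaxError) as exc:
            reports.append({"tool": "structural",
                            "code": type(exc).__name__,
                            "message": str(exc)})
    for runner in tool_runners:          # e.g. ruff / pyright wrappers;
        reports.extend(runner())         # each is a no-op if its tool is absent
    return reports
```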

#### `validate()` — updated thin wrapper
Calls `validate_collect()`, attaches all reports to the active RunLog step event
via `get_active()._stack[-1].validation_errors`, then raises `ValidationError`
with the first error's message, so existing callers see exactly the same behavior.

#### `pyproject.toml` changes
Added `ruff>=0.4.0` and `pyright>=1.1.0` to `[dependency-groups] dev`.

---

## Phase 1.2 — Frontend validation (step 05)

**Status:** complete  
**Date:** 2026-05-13  
**New file:** `pipeline/step-05-react-generation/pipeline/react_pipeline/services/validation_service.py`  
**File changed:** `pipeline/master-pipeline/pipeline/orchestrator.py`

### What was added

#### `validate_frontend(frontend_dir, step_name) → list[ValidationReport]`
New function in `validation_service.py`. Never raises — all problems are
returned as `ValidationReport` objects.

Steps:
1. Pre-flight: checks `package.json` and `tsconfig.json` exist.
2. `_ensure_node_modules(frontend_dir)`:
   - If `node_modules/` already exists: no-op.
   - If `runs/.cache/frontend_deps/<pkg_hash>/node_modules/` exists: symlink it in.
   - Otherwise: run `npm ci --prefer-offline` (fallback: `npm install`),
     then copy result to cache for future runs.
3. Runs `npx --no-install tsc --noEmit --project tsconfig.json` with `NO_COLOR=1`.
4. Parses tsc line-based output (`file(line,col): error TSxxxx: message`) via
   `_TSC_LINE_RE` regex into `ValidationReport`s. Errors only (warnings skipped).
   File paths are made relative to `frontend_dir`.
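
Steps 3–4 are mechanical line parsing. A hypothetical reconstruction of `_TSC_LINE_RE` and the surrounding loop (the real regex and report type may differ):

```python
import re

# Matches tsc diagnostics like:
#   src/pages/Orders.tsx(12,5): error TS2304: Cannot find name 'useNav'.
_TSC_LINE_RE = re.compile(
    r"^(?P<file>[^(]+)\((?P<line>\d+),(?P<col>\d+)\): "
    r"(?P<severity>error|warning) (?P<code>TS\d+): (?P<message>.*)$"
)

def parse_tsc_output(output: str) -> list[dict]:
    reports = []
    for line in output.splitlines():
        m = _TSC_LINE_RE.match(line)
        if m is None or m.group("severity") != "error":
            continue  # skip non-diagnostic lines and warnings
        reports.append({
            "file": m.group("file"),
            "line": int(m.group("line")),
            "column": int(m.group("col")),
            "code": m.group("code"),
            "message": m.group("message"),
        })
    return reports
```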

#### Master orchestrator wiring
After `write_generated_files()` in `orchestrator.py`, the step-04-05 block now:
- Calls `validate_frontend(frontend_dir, step_name=ev.name)`
- Attaches all reports to `ev.validation_errors`
- Prints a summary of up to 5 errors
- Raises `RuntimeError` if any errors found (fails the step cleanly)

---

## Phase 1 baseline

**Status:** not started

Run one full pipeline after 1.1 and 1.2 ship. Capture
`validation_errors_before_repair` per step. This is the baseline all later
phases are measured against.

---

## Phase 2 — Repair loop

**Status:** complete  
**Date:** 2026-05-13

### 2.1 — Shared repair primitive

**New file:** `shared/repair.py`

#### `RepairResult` dataclass
Fields: `success: bool`, `final_content: str`, `attempts: list[RepairAttempt]`.
`final_content` is the repaired content on success, or the best partially-repaired
content on failure (fewest errors among all attempts).

#### `repair_file(...) → RepairResult`
Synchronous (matches the codebase — no async). Key parameters:
- `file_path`, `original_content`, `errors: list[ValidationReport]`
- `llm_client` — any object with `.generate(prompt: str) -> str`
- `system_prompt` — prepended to the user repair message
- `max_attempts=2`
- `validator: Callable[[str], list[ValidationReport]]`
- `lang` — `"python"` or `"tsx"` (used in code fence)
- `api_manifest_summary` — optional contract context for TSX repairs

Per attempt:
1. Build full prompt: `system_prompt + task + file content in code block + errors`
2. Call `llm_client.generate(prompt)` → raw response
3. Extract repaired content via `extract_code_block()`
4. **Diff size guardrail**: if changed fraction vs original > 50%, reject attempt,
   add `_PRESERVATION_WARNING` to next attempt's prompt, keep previous content
5. Run `validator(repaired_content)` → new error list
6. Record `RepairAttempt` (errors_in, errors_out, success, tokens) → auto-attached
   to active RunLog step via `get_active()._stack[-1].repair_attempts`
7. If clean: return success. Else: feed new errors + repaired content into next attempt.
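
The attempt loop above has a compact shape. A minimal sketch with injected callables standing in for the LLM call and validator; prompt construction, the diff guardrail, token accounting, and RunLog wiring are all omitted:

```python
from typing import Callable

def repair_file_sketch(
    original: str,
    errors: list[str],
    llm: Callable[[str], str],              # prompt -> repaired content
    validator: Callable[[str], list[str]],  # content -> remaining errors
    max_attempts: int = 2,
) -> tuple[bool, str]:
    content = original
    best, best_count = original, len(errors)
    for _ in range(max_attempts):
        candidate = llm(f"Fix these errors: {errors}\n\n{content}")
        remaining = validator(candidate)
        if len(remaining) < best_count:      # track best partial repair
            best, best_count = candidate, len(remaining)
        if not remaining:                    # clean: success
            return True, candidate
        content, errors = candidate, remaining  # feed forward into next attempt
    return False, best                       # best partially-repaired content
```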

#### `make_py_file_validator(file_path, output_dir, step_name)`
Returns a `Callable[[str], list[ValidationReport]]` that validates a Python source
string in-memory:
1. `ast.parse(content)` → syntax check (returns immediately if broken)
2. `ruff check --select F,E9 --output-format json --stdin-filename <rel> -`
   via stdin (no temp file needed). No-op if ruff not on PATH.
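
A sketch of the closure this returns, with plain dicts standing in for `ValidationReport` (the real code builds proper report objects and fills in `step`/`file` context):

```python
import ast
import json
import shutil
import subprocess

def make_py_file_validator_sketch(rel_path: str):
    def validate(content: str) -> list[dict]:
        # 1. Syntax gate: ruff output is useless on a broken parse, so bail early.
        try:
            ast.parse(content)
        except SyntaxError as exc:
            return [{"tool": "ast", "code": "SyntaxError",
                     "line": exc.lineno or 1, "message": str(exc.msg)}]
        # 2. ruff over stdin — no temp file; no-op when ruff is absent.
        if shutil.which("ruff") is None:
            return []
        proc = subprocess.run(
            ["ruff", "check", "--select", "F,E9", "--output-format", "json",
             "--stdin-filename", rel_path, "-"],
            input=content, capture_output=True, text=True,
        )
        return [{"tool": "ruff", "code": d["code"],
                 "line": d["location"]["row"], "message": d["message"]}
                for d in json.loads(proc.stdout or "[]")]
    return validate
```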

#### `PYTHON_REPAIR_SYSTEM_PROMPT` / `TSX_REPAIR_SYSTEM_PROMPT`
Module-level constants with project-specific context:
- Python: FastAPI, SQLAlchemy 2, Pydantic v2, jose, module structure, forbidden patterns
- TSX: React 19, Ant Design 5, react-router-dom 6, project layout, service layer rules

### 2.2 — Wire into step 03

**File changed:** `pipeline/step-03-backend-generation/pipeline/backend_gen/orchestrator.py`

Step 9 in `Orchestrator.run()` now calls `_repair_and_validate()`:
1. `ValidationRunner.validate_collect()` → get all initial errors, attach to RunLog
2. Group tool-reported errors (`ruff`, `pyright`, `ast`) by file
3. For each broken `.py` file: call `repair_file()` with `make_py_file_validator`
4. On success: write repaired content to disk
5. `ValidationRunner.validate()` — final full structural + tool check (raises if still broken)

### 2.3 — Wire into step 05

**File changed:** `pipeline/master-pipeline/pipeline/orchestrator.py`  
**File changed:** `pipeline/step-05-react-generation/pipeline/react_pipeline/services/validation_service.py`

Added `make_tsx_file_validator(file_path, frontend_dir, step_name)` to `validation_service.py`.
The returned validator writes candidate content to the file, runs `validate_frontend()`,
filters to errors for that file, then always restores the original (so `repair_file` never
leaves a partially-repaired file on disk).

Master orchestrator step-04-05 block:
1. `validate_frontend(frontend_dir)` → initial errors → attach to `ev.validation_errors`
2. Group `tsc` errors by file
3. For each broken TSX file: `repair_file()` with `make_tsx_file_validator` + manifest summary
4. On success: write repaired content to disk
5. Re-run `validate_frontend()` — if errors remain: raise RuntimeError (fails the step)

`_manifest_summary_for_repair(api_manifest)` helper added to master orchestrator:
compact text summary of module prefixes and endpoints passed as `api_manifest_summary`
to `repair_file()` so the model knows the actual API contract.
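
The summary helper might look like the following; the manifest schema shown here (`modules` with `prefix` and `endpoints`) is an assumption for illustration, not the actual `api_manifest.json` format:

```python
def manifest_summary_sketch(api_manifest: dict) -> str:
    """One line per endpoint, grouped under each module prefix.
    Manifest field names are hypothetical."""
    lines = []
    for module in api_manifest.get("modules", []):
        lines.append(f"## {module['prefix']}")
        for ep in module.get("endpoints", []):
            lines.append(f"  {ep['method']} {ep['path']} -> {ep['response_model']}")
    return "\n".join(lines)
```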

### 2.4 — SDK tool-use rewrite

**Replaced whole-file regeneration with structured tool calls.**

Previous approach: sent the file + errors to `llm_client.generate()`, got back the
entire file in a code block, parsed with `extract_code_block()`.

New approach (matches `step-06-edit` pattern):
- Uses `anthropic.AnthropicBedrock` directly (no `llm_client` parameter)
- `REPAIR_MODEL_ID` added to `shared/config.py` (defaults to `DEFAULT_SONNET_MODEL`,
  overridable via `REPAIR_BEDROCK_MODEL` env var)
- `_REPAIR_TOOLS` — two tools with no `path` field (repair_file already knows the file):
  - `patch_file(old_string, new_string)` — exact unique-match replacement, preferred
  - `write_file(content)` — full rewrite fallback for 5+ scattered changes
- `_call_with_retry()` — exponential backoff on 429/503 (matching executor.py)
- `_apply_tool_calls(content, tool_uses)` — applies patches in order; `write_file`
  overrides all prior patches; returns errors for non-unique `old_string` matches
- `tool_choice={"type": "any"}` — forces model to always call a tool
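
The patch-application semantics (`write_file` wins, `patch_file` requires a unique match) can be sketched as follows; tool-use payload shape is simplified for illustration:

```python
def apply_tool_calls_sketch(content: str, tool_uses: list[dict]) -> tuple[str, list[str]]:
    """Applies patches in order; a write_file replaces the content wholesale,
    discarding prior patches; ambiguous patch targets become errors."""
    errors: list[str] = []
    for use in tool_uses:
        if use["name"] == "write_file":
            content = use["input"]["content"]    # full rewrite wins
        elif use["name"] == "patch_file":
            old = use["input"]["old_string"]
            if content.count(old) != 1:          # reject non-unique / missing matches
                errors.append(f"old_string not unique: {old!r}")
                continue
            content = content.replace(old, use["input"]["new_string"])
    return content, errors
```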

Callers updated:
- Step-03 orchestrator: removed `llm_client=self._llm` from `repair_file()` call
- Master orchestrator: removed `_create_repair_client()` import and call; removed
  `llm_client=repair_client` from `repair_file()` call

### 2.5 — Guardrails (all in `repair_file()`)

| Guardrail | Implementation |
|-----------|---------------|
| Max attempts | Loop bound at `max_attempts` (default 2) |
| Diff size cap | `_changed_fraction(original, repaired) > 0.50` → reject attempt, add `_PRESERVATION_WARNING` to next prompt |
| Atomic writes | `repair_file()` never writes to disk; callers write only on `result.success` |
| Single-file scope | Prompt and tool schema have no `path` field; model can only edit the one file it was given |
| No recursive escalation | `repair_file()` returns after `max_attempts`; no self-invocation |
| Patch uniqueness | `patch_file` rejects non-unique `old_string` matches with an error logged |
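
The diff size cap needs a `_changed_fraction` metric. One plausible implementation (the shipped one may differ) uses `difflib.SequenceMatcher` over lines, where 0.0 means identical and 1.0 means completely rewritten:

```python
import difflib

def changed_fraction_sketch(original: str, repaired: str) -> float:
    matcher = difflib.SequenceMatcher(
        a=original.splitlines(), b=repaired.splitlines()
    )
    return 1.0 - matcher.ratio()

DIFF_CAP = 0.50  # attempts above this threshold are rejected

def attempt_rejected(original: str, repaired: str) -> bool:
    return changed_fraction_sketch(original, repaired) > DIFF_CAP
```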

---

## Phase 3 — Typed API contract

**Status:** not started

### 3.1 — TS type generator
**Target file:** `pipeline/step-03-backend-generation/pipeline/backend_gen/ts_types_generator.py` (new)

After step 03 produces `api_manifest.json`, emit `api_types.ts` with:
- TypeScript interfaces for every Pydantic schema in the manifest
- Typed API client surface (`apiClient.orders.list()`, etc.)
- Enums for every backend enum

Output to `runs/outputs/<slug>/frontend/src/api_types.ts`.
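
Since this phase is not started, the following is only a shape sketch: emitting one TypeScript interface from a schema entry, with an assumed Python-type-to-TS mapping and an assumed `{field: py_type}` manifest format.

```python
_PY_TO_TS = {"str": "string", "int": "number", "float": "number",
             "bool": "boolean", "datetime": "string"}  # assumed mapping

def emit_ts_interface(name: str, fields: dict[str, str]) -> str:
    """Render one manifest schema as a TypeScript interface.
    The {field_name: py_type} input format is hypothetical."""
    lines = [f"export interface {name} {{"]
    for field_name, py_type in fields.items():
        ts_type = _PY_TO_TS.get(py_type, "unknown")
        lines.append(f"  {field_name}: {ts_type};")
    lines.append("}")
    return "\n".join(lines)
```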

### 3.2 — Wire into step 05
Update the step 05 prompt to import and use types from `api_types.ts`.
The `tsc --noEmit` check from phase 1.2 then catches contract drift automatically.

---

## Phase 4 — Agentic fixer (measurement-driven)

**Status:** not started — gate: build only if cross-file/cascading/investigation
errors dominate `errors_surviving_repair` after 2 weeks of phase 1–3 data.

If green-lit:
- `pipeline/step-03-backend-generation/pipeline/backend_gen/fixer_agent.py`
- `pipeline/step-05-react-generation/pipeline/react_pipeline/services/fixer_agent.py`
- Re-enable `step-06-edit` in `main.py`

Guardrails: 15-turn limit, snapshot-and-restore, edit-scope sandbox,
30% diff cap, one invocation per step per run.

---

## Phase 5 — Opportunistic hardening

**Status:** ongoing / as telemetry justifies

- Move static files to templates (package.json, tsconfig.json, FastAPI wiring)
- Descriptor-based TSX generation (only if TSX errors survive everything above)
- Verify IR cache is wired (`shared/ir_cache.py`)
- `--resume <run_id>` flag to skip steps with existing output
