LLM Categorization Contract¶
Rules and specifications for LLM usage in the Dev Health platform.
Overview¶
LLMs are used in two contexts:
| Context | When | Purpose | Constraints |
|---|---|---|---|
| Compute-time | During data processing | Categorize work into investment themes | Strict schema, persisted |
| UX-time | On user request | Explain persisted categorizations | Read-only, no recomputation |
Compute-Time Categorization¶
Purpose¶
Map messy human text to canonical investment categories with subcategory distributions.
Schema Compliance¶
Output MUST be strict JSON matching:
work_graph/investment/llm_schema.py
Output Requirements¶
| Requirement | Details |
|---|---|
| Keys | From canonical subcategory registry only |
| Probabilities | Valid (0–1), normalized |
| Evidence | Extractive substrings from input text |
| Theme roll-up | Computed deterministically from subcategories |
Example Output¶
{
"subcategories": {
"operational.external": 0.45,
"operational.incident": 0.25,
"feature_delivery.customer": 0.20,
"maintenance.refactor": 0.10
},
"evidence": {
"operational.external": ["customer-facing issue", "support ticket"],
"operational.incident": ["incident response", "outage window"],
"feature_delivery.customer": ["requested by customer"],
"maintenance.refactor": ["cleanup", "technical debt"]
},
"uncertainty": "moderate"
}
Two-Stage Process¶
- LLM Stage: Map text → subcategory distribution
- Deterministic Stage: Roll subcategories → themes (no LLM)
This separation prevents category drift.
Retry Policy¶
When to Retry¶
- Whitespace/empty response
- Truncated response (
finish_reason=length) - Invalid JSON structure
- Missing required keys
Retry Strategy¶
- First attempt: Standard prompt, standard tokens
- Retry attempt:
- Double
max_completion_tokens(minimum 512) - Simplify and harden JSON prompt
- Add explicit JSON instruction in system AND user message
Failure Handling¶
After one retry failure:
1. Mark categorization as invalid
2. Apply deterministic fallback
3. Persist with fallback_applied=true
Audit Fields¶
Every categorization run must persist:
| Field | Description |
|---|---|
categorized_at |
Timestamp |
model_version |
LLM model identifier |
prompt_hash |
Hash of prompt template |
raw_response |
Original LLM output |
fallback_applied |
Boolean |
retry_count |
Number of retries |
finish_reason |
OpenAI finish reason |
token_usage |
Tokens consumed |
OpenAI-Specific Handling¶
JSON Mode¶
Include explicit JSON instruction in both: - System message - User message
Token Configuration¶
- Use
max_completion_tokens(notmax_tokens) - Minimum: 512 tokens
- Double on retry
Observability¶
Log on every call:
- finish_reason
- content_length
- Token parameters
- Response time
UX-Time Explanation¶
Purpose¶
Generate human-readable explanations of persisted categorizations.
Constraints¶
| Allowed | Forbidden |
|---|---|
| Read persisted distributions | Recompute categories |
| Read stored evidence | Change edges/weights |
| Generate narrative text | Introduce new conclusions |
| Cite specific evidence | Modify persisted decisions |
Required Labeling¶
All explanation output MUST be labeled as AI-generated.
Explanation Prompt¶
Canonical prompt (use verbatim):
You are explaining a precomputed investment view.
You are not allowed to:
- Recalculate scores
- Change categories
- Introduce new conclusions
- Be conversational (no "Hello", "As an AI", or interactive follow-ups)
Explain the investment view in three distinct sections:
1. **SUMMARY**: Provide a high-level narrative (max 3 sentences) using
probabilistic language (appears, leans, suggests) explaining why
the work leans toward the primary categories.
2. **REASONS**: List the specific evidence (structural, contextual,
textual) that contributed most to this interpretation.
3. **UNCERTAINTY**: Disclose where uncertainty exists based on the
evidence quality and evidence mix.
Always include evidence quality level and limits.
Language Rules¶
Allowed Language¶
Use probabilistic, uncertain phrasing:
- appears
- leans
- suggests
- indicates
- may be
Forbidden Language¶
Avoid definitive, deterministic phrasing:
- is
- was
- detected
- determined
- definitely
- clearly
Rationale¶
The distinction maintains appropriate uncertainty. LLM categorization is inference, not detection.
Evidence Handling¶
Extractive Quotes¶
Evidence quotes MUST be: - Direct substrings from input text - Not paraphrased - Not summarized - Traceable to source
Evidence Types¶
| Type | Source | Example |
|---|---|---|
| Textual | Issue/PR title, description, commits | "hotfix for production bug" |
| Structural | Relationships, links | "Linked to incident #123" |
| Contextual | Timing, patterns | "Merged during outage window" |
Forbidden Patterns¶
Do Not¶
- Invent categories not in canonical list
- Use free-form reasoning in output
- Override canonical vocabulary
- Return "unknown" or "uncategorized"
- Hallucinate evidence not in input
- Apply categories based on author identity
Immediate Failure Conditions¶
- Output contains non-canonical keys
- Probabilities don't sum correctly
- Evidence quotes not found in input
- Missing required output sections
Testing¶
Unit Tests Must Cover¶
- Valid JSON output parsing
- Probability normalization
- Evidence extraction validation
- Retry logic
- Fallback application
- Audit field persistence
Mock Requirements¶
- Mock LLM API responses
- Test various failure modes
- Verify retry behavior
- Test fallback categorization