# Story 5.1: Python/PHP Boundary & Categorization Infrastructure

## Story

**As a** developer,
**I want** a clearly defined Python directory structure and database schema for predictions and evidence,
**So that** the AI categorization work has a consistent home and all future categorization stories build on the same foundation.

## Status

done

## Acceptance Criteria

All ACs met.

## Tasks / Subtasks

- [x] **Task 1: Migrations**
  - [x] `20260525000001_create_categorization_batches_table.php` — `id`, `status`, `product_ids` (JSONB), `total_products`, `processed_products`, `success_products`, `error_products`, `retry_count`, `created_by` (FK users.id SET_NULL), `created_at`, `started_at`, `completed_at`, `error_message`. Index on `status`.
  - [x] `20260525000002_create_predictions_table.php` — `id`, `product_id` (FK CASCADE), `batch_id` (FK CASCADE), `suggested_category_code`, `suggested_category_label`, `confidence_score` (decimal 5,2), `confidence_level`, `status`, `created_at`, `updated_at`. Indexes on `product_id`, `batch_id`, `status`.
  - [x] `20260525000003_create_evidence_table.php` — `id`, `prediction_id` (FK CASCADE), `source_type`, `source_label`, `evidence_value`, `weight` (decimal 4,3), `created_at`. Index on `prediction_id`.
  - [x] `20260525000004_create_audit_log_table.php` — `id`, `entity_type`, `entity_id`, `action`, `actor_type`, `actor_id`, `metadata` (JSONB), `created_at`. Indexes on `(entity_type, entity_id)` and `created_at`.
  - [x] `20260525000005_alter_akeneo_workflow_add_prediction_fk.php` — adds FK `akeneo_workflow.prediction_id → predictions.id` ON DELETE SET NULL.

- [x] **Task 2: Python directory structure**
  - [x] `python/requirements.txt` — `psycopg2-binary`, `python-dotenv`
  - [x] `python/.gitignore` — excludes `.venv/`, `__pycache__/`, `*.pyc`
  - [x] `python/config.py` — reads `DB_HOST/PORT/NAME/USER/PASSWORD` from env via dotenv; `DEFAULT_CONFIDENCE_THRESHOLD = 90.0`
  - [x] `python/models/__init__.py`
  - [x] `python/models/database.py` — `get_connection()` via psycopg2 with `RealDictCursor`
  - [x] `python/services/__init__.py`
  - [x] `python/services/confidence_scorer.py` — `compute_score()` (weighted average → [10, 100]), `derive_level(score, threshold)`
  - [x] `python/services/evidence_builder.py` — `build_evidence()` (part-number patterns, manufacturer category map, enrichment attributes), `pick_category()`, `fallback_signal()`
  - [x] `python/services/categorizer.py` — `categorize_product()`: skip-if-already-predicted guard, inserts `predictions` + `evidence` rows, updates `products.status = 'predicted'`
  - [x] `python/categorize.py` — entry point; SAPI guard; loads batch; enforces `retry_count < 3` cap; marks processing; paginates product_ids; calls `categorizer.categorize_product()`; updates counters per product; finalises status; writes audit log entries for retries
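
The scoring helpers listed for `python/services/confidence_scorer.py` could look roughly like the sketch below. It is illustrative only: the `(score, weight)` input shape, the clamping behaviour, the floor returned for empty evidence, and the `"high"`/`"low"` level labels are assumptions, not taken from the actual module. Only the function names, the [10, 100] range, and the 90.0 default threshold come from this story.

```python
from typing import Iterable, Tuple

# Value stated for python/config.py in this story.
DEFAULT_CONFIDENCE_THRESHOLD = 90.0

# Range stated for compute_score(); clamping to it is an assumption.
MIN_SCORE = 10.0
MAX_SCORE = 100.0


def compute_score(signals: Iterable[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs, clamped to [10, 100].

    The (score, weight) tuple shape is hypothetical.
    """
    pairs = list(signals)
    total_weight = sum(weight for _, weight in pairs)
    if total_weight <= 0:
        # Assumption: no usable evidence yields the floor of the range.
        return MIN_SCORE
    average = sum(score * weight for score, weight in pairs) / total_weight
    # Two decimals matches the decimal(5,2) confidence_score column.
    return round(min(MAX_SCORE, max(MIN_SCORE, average)), 2)


def derive_level(score: float, threshold: float = DEFAULT_CONFIDENCE_THRESHOLD) -> str:
    """Map a score to a level label; the label names are hypothetical."""
    return "high" if score >= threshold else "low"
```

Keeping the threshold as a parameter, rather than reading it inside the function, makes the level derivation easy to test with different cut-offs.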

- [x] **Task 3: `docs/python-setup.md`** — venv setup, env var table, manual run command, PHP/Python communication contract
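
As a companion to the env var table, here is a minimal sketch of how `python/config.py` might read the five `DB_*` variables. The function name `load_db_config`, the default host and port, and the optional-import fallback are assumptions for illustration; the real module may be organised differently.

```python
import os
from typing import Mapping, Optional

try:
    # python-dotenv is listed in python/requirements.txt; load_dotenv()
    # reads a .env file into the process environment if one exists.
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    # Assumption: fall back to the plain process environment.
    pass

DEFAULT_CONFIDENCE_THRESHOLD = 90.0


def load_db_config(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect connection settings from DB_* variables (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("DB_HOST", "localhost"),  # default is an assumption
        "port": int(env.get("DB_PORT", "5432")),  # default is an assumption
        "dbname": env["DB_NAME"],                 # required: KeyError if unset
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

The keys are chosen to match the keyword arguments that `psycopg2.connect()` accepts, so a `get_connection()` helper could plausibly pass the dict through together with `cursor_factory=RealDictCursor`.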

## Dev Notes

### PHP/Python contract

PHP inserts a row into `categorization_batches` and spawns `python3 python/categorize.py {batch_id}`. Python updates `processed_products` after each product, so PHP polling at `/api/categorization-batches/{id}/status` shows live progress. PHP never reads Python stdout/stderr.
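
The Python side of this contract can be sketched as a loop that persists the counters after every product. The version below uses an in-memory dict and injected callables instead of real database access, so it only illustrates the control flow. The `run_batch`, `categorize`, and `save` names and the status strings (`processing`, `completed`, `completed_with_errors`, `failed`) are hypothetical; only the column names and the `retry_count < 3` cap come from this story.

```python
from typing import Callable, Dict

MAX_RETRIES = 3  # from the "retry_count < 3" cap in categorize.py


def run_batch(
    batch: Dict,
    categorize: Callable[[int], bool],
    save: Callable[[Dict], None],
) -> Dict:
    """Process one batch, saving progress after each product.

    `batch` mirrors a categorization_batches row; `save` stands in for
    the UPDATE that makes progress visible to PHP polling.
    """
    if batch["retry_count"] >= MAX_RETRIES:
        batch["status"] = "failed"  # hypothetical status value
        batch["error_message"] = "retry cap reached"
        save(batch)
        return batch

    batch["status"] = "processing"
    save(batch)

    for product_id in batch["product_ids"]:
        try:
            ok = categorize(product_id)
        except Exception:
            # One failing product should not abort the whole batch.
            ok = False
        batch["processed_products"] += 1
        batch["success_products" if ok else "error_products"] += 1
        save(batch)  # persist per product so the status endpoint stays live

    # Hypothetical final status values.
    batch["status"] = (
        "completed" if batch["error_products"] == 0 else "completed_with_errors"
    )
    save(batch)
    return batch
```

Saving inside the loop costs one write per product, but that write is what lets the polling endpoint report progress without PHP ever reading the Python process output.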

### File list

**New files:** `db/migrations/20260525000001_*.php` through `db/migrations/20260525000005_*.php`, `python/` (full tree), `docs/python-setup.md`
