# Story 1.3: Staged Transaction Contracts & Duplicate Detector (TDD)

Status: review

## Story

As a developer,
I want `StagedTransaction` and `ParseError` defined as plain dataclasses, and `duplicate_detector` built test-first before any migration runs,
so that the core deduplication contract is locked in code with tests before the database schema is created.

## Acceptance Criteria

1. `app/services/pdf_parsers/base.py` defines two plain `@dataclass` classes:
   - `StagedTransaction` with fields: `date`, `merchant_raw`, `merchant_normalized`, `amount: Decimal`, `is_credit: bool`, `issuer`, `dedup_hash`, `confidence_score: float`, `raw_text`
   - `ParseError` with fields: `page_number: int`, `raw_text`, `reason`, `parser_version`
2. `app/services/duplicate_detector.py` implements `flag_duplicates(new_fingerprints: list[str], existing_fingerprints: list[str]) -> list[bool]` accepting only plain Python types — no ORM objects, no `db.session`.
3. `dedup_hash` is computed as `SHA256(normalized_merchant + str(amount) + date.isoformat())` — a helper function `compute_dedup_hash()` is defined in `base.py` for use by parsers and the transactions blueprint.
4. `tests/test_services/test_duplicate_detector.py` passes with tests covering: exact match detected, no match, empty `new_fingerprints` → empty list, empty `existing_fingerprints` → all False.
5. A `@pytest.mark.skip("wired in Epic 9")` stub test for the staged-vs-main dedup path is present in the test file.
6. **`pytest tests/test_services/test_duplicate_detector.py` passes BEFORE any `flask db init` or migration is run.**

## Tasks / Subtasks

- [x] **Task 1: Write the failing test file FIRST (TDD red phase)** (AC: 4, 5, 6)
  - [x] Create `tests/test_services/test_duplicate_detector.py`
  - [x] Write `test_exact_match_detected` — new hash that appears in existing → returns `[True]`
  - [x] Write `test_no_match` — new hash not in existing → returns `[False]`
  - [x] Write `test_empty_new_fingerprints_returns_empty_list` — `flag_duplicates([], [...])` → `[]`
  - [x] Write `test_empty_existing_fingerprints_returns_all_false` — `flag_duplicates([h1, h2], [])` → `[False, False]`
  - [x] Write the skip-annotated stub: `@pytest.mark.skip("wired in Epic 9")` `test_staged_vs_main_dedup_path`
  - [x] Run `pytest tests/test_services/test_duplicate_detector.py` and confirm tests **FAIL** (ImportError expected) — this validates the TDD setup
  - [x] **Do NOT implement anything yet — confirm failure first**

- [x] **Task 2: Implement `app/services/pdf_parsers/base.py`** (AC: 1, 3)
  - [x] Define `StagedTransaction` as a `@dataclass` with all 9 fields (see Dev Notes for exact types)
  - [x] Define `ParseError` as a `@dataclass` with all 4 fields (see Dev Notes for exact types)
  - [x] Define `compute_dedup_hash(normalized_merchant: str, amount: Decimal, txn_date: date) -> str` as a module-level function using `hashlib.sha256`
  - [x] Zero ORM imports — stdlib only (`dataclasses`, `decimal`, `datetime`, `hashlib`)

- [x] **Task 3: Implement `app/services/duplicate_detector.py`** (AC: 2)
  - [x] Implement `flag_duplicates(new_fingerprints: list[str], existing_fingerprints: list[str]) -> list[bool]`
  - [x] Logic: return `True` for each new fingerprint that appears in `set(existing_fingerprints)` and `False` otherwise, preserving input order
  - [x] Zero ORM imports — no SQLAlchemy, no `db`, no Flask
  - [x] Zero imports from `pdf_parsers` — plain Python only

- [x] **Task 4: Run tests to confirm green** (AC: 4, 5, 6)
  - [x] Run `pytest tests/test_services/test_duplicate_detector.py -v`
  - [x] Confirm: 4 tests pass, 1 skipped — **zero failures, zero errors**
  - [x] Confirm: **no `flask db init` has been run** (check that `migrations/versions/` is absent or empty)
  - [x] Run full suite `pytest tests/ -v` to confirm no regressions
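
The migration check in Task 4 can also be scripted. The following is a minimal sketch, assuming the project root is passed in explicitly; the helper name `no_migrations_present` is hypothetical and not part of the story's deliverables.

```python
from pathlib import Path


def no_migrations_present(project_root: Path) -> bool:
    """Return True if migrations/versions is absent or holds no .py revision files."""
    versions = project_root / "migrations" / "versions"
    if not versions.is_dir():
        # Directory was never created, so no revision can exist.
        return True
    # Alembic revisions are .py files; any match means a migration was generated.
    return not any(versions.glob("*.py"))
```

Calling `no_migrations_present(Path.cwd())` from the repository root would cover both the "directory does not exist" and the "directory exists but is empty" cases.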

## Dev Notes

### ⚠️ TDD Order is Mandatory (AR-2)

The architectural constraint AR-2 states: `test_duplicate_detector.py` is written and passing **BEFORE** the first Alembic migration. The implementation sequence is:

1. Write test file → confirm it **FAILS** (ImportError)
2. Implement `base.py` dataclasses
3. Implement `duplicate_detector.py`
4. Confirm tests **PASS**
5. Story 1.4 then runs `flask db init && flask db migrate && flask db upgrade`

Do NOT run `flask db init` as part of this story.

### Exact Implementation: `app/services/pdf_parsers/base.py`

```python
"""
Canonical data contracts for the PDF import pipeline.

StagedTransaction and ParseError are the only types parsers ever return.
Parsers NEVER import from ORM models, Flask, or SQLAlchemy.

compute_dedup_hash() is the canonical hash function used by:
  - All PDF parsers (when building StagedTransaction instances)
  - The transactions blueprint (Story 3.3, manual entry dedup)
  - staging_pipeline.py (Story 9.1, pre-commit dedup check)
"""
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class StagedTransaction:
    """Canonical output of any PDF/CSV parser. Never touches the database."""
    date: date
    merchant_raw: str
    merchant_normalized: str
    amount: Decimal
    is_credit: bool
    issuer: str
    dedup_hash: str
    confidence_score: float
    raw_text: str


@dataclass
class ParseError:
    """Structured parse failure — surfaced per-row, not raised as exception."""
    page_number: int
    raw_text: str
    reason: str
    parser_version: str


def compute_dedup_hash(
    normalized_merchant: str,
    amount: Decimal,
    txn_date: date,
) -> str:
    """
    Canonical dedup hash: SHA256(normalized_merchant + str(amount) + txn_date.isoformat()).

    Args:
        normalized_merchant: Cleaned merchant name (post-normalization).
        amount: Transaction amount as Decimal.
        txn_date: Transaction date.

    Returns:
        64-character lowercase hex digest.
    """
    payload = f"{normalized_merchant}{str(amount)}{txn_date.isoformat()}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```
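
To illustrate how a parser might use these contracts together, here is a self-contained sketch. It repeats the dataclass and hash function locally so it runs on its own; in the project they would be imported from `app.services.pdf_parsers.base`. The helper `build_staged_row`, the whitespace/uppercase normalization rule, the issuer name, and the sample row are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class StagedTransaction:
    date: date
    merchant_raw: str
    merchant_normalized: str
    amount: Decimal
    is_credit: bool
    issuer: str
    dedup_hash: str
    confidence_score: float
    raw_text: str


def compute_dedup_hash(normalized_merchant: str, amount: Decimal, txn_date: date) -> str:
    payload = f"{normalized_merchant}{amount}{txn_date.isoformat()}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_staged_row(raw_text: str, merchant: str, amount: Decimal, txn_date: date) -> StagedTransaction:
    """Hypothetical parser helper: normalize the merchant, hash, and wrap one row."""
    # Assumed normalization: collapse whitespace and uppercase.
    normalized = " ".join(merchant.upper().split())
    return StagedTransaction(
        date=txn_date,
        merchant_raw=merchant,
        merchant_normalized=normalized,
        amount=amount,
        is_credit=False,
        issuer="ExampleBank",  # placeholder issuer name
        dedup_hash=compute_dedup_hash(normalized, amount, txn_date),
        confidence_score=1.0,
        raw_text=raw_text,
    )


row = build_staged_row("05/01 Coffee  Shop 4.50", "Coffee  Shop", Decimal("4.50"), date(2026, 5, 1))
```

The hash is derived from the normalized merchant, so two raw strings that normalize to the same value yield the same `dedup_hash` for the same amount and date.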

### Exact Implementation: `app/services/duplicate_detector.py`

```python
"""
Duplicate detection for transaction import and manual entry.

flag_duplicates() is shared infrastructure — called from:
  - staging_pipeline.py (Epic 9): detects duplicates across staged + committed transactions
  - transactions blueprint (Story 3.3): warns on manual duplicate entry

Architectural boundary: accepts ONLY plain Python types. No ORM session, no db imports.
The caller is responsible for querying the DB and passing fingerprint strings.
"""
from __future__ import annotations


def flag_duplicates(
    new_fingerprints: list[str],
    existing_fingerprints: list[str],
) -> list[bool]:
    """
    Flag which new fingerprints already exist in the committed/staged set.

    Args:
        new_fingerprints: Dedup hashes for transactions being evaluated.
        existing_fingerprints: Dedup hashes from existing transactions
            (caller queries the DB within a ±3-day window before calling this).

    Returns:
        list[bool] of the same length as new_fingerprints.
        True  → this hash exists in existing_fingerprints (probable duplicate).
        False → not found (safe to commit).
    """
    existing_set = set(existing_fingerprints)
    return [fp in existing_set for fp in new_fingerprints]
```
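
One behaviour worth noting is that `flag_duplicates` compares new fingerprints only against the existing set, never against each other. The sketch below repeats the function so it runs standalone and uses made-up hash strings. The incremental loop is one possible way a caller could catch repeats inside a batch; it is an assumption for illustration, not the Epic 9 design.

```python
def flag_duplicates(new_fingerprints: list[str], existing_fingerprints: list[str]) -> list[bool]:
    # Same logic as the implementation above, repeated so the sketch is self-contained.
    existing_set = set(existing_fingerprints)
    return [fp in existing_set for fp in new_fingerprints]


committed = ["h1", "h2"]
batch = ["h2", "h9", "h9"]  # "h9" appears twice within the same batch

# A single call flags only matches against the committed hashes.
flags = flag_duplicates(batch, committed)

# Assumed caller-side approach: feed each accepted hash back in as "existing".
seen: list[str] = list(committed)
flags_incremental: list[bool] = []
for fp in batch:
    flags_incremental.append(flag_duplicates([fp], seen)[0])
    seen.append(fp)
```

Here `flags` marks only the first element, while `flags_incremental` also marks the second `"h9"`, which is why the caller has to include the current batch's staged hashes if intra-batch duplicates matter.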

### Exact Implementation: `tests/test_services/test_duplicate_detector.py`

```python
"""
TDD spec for duplicate_detector.flag_duplicates().

Written BEFORE the first Alembic migration (AR-2 compliance).
Tests use only plain Python — no DB, no Flask app context required.
"""
import pytest
from app.services.duplicate_detector import flag_duplicates


HASH_A = "aaa111"
HASH_B = "bbb222"
HASH_C = "ccc333"


def test_exact_match_detected():
    """A new fingerprint that exists in existing set is flagged as duplicate."""
    result = flag_duplicates([HASH_A], [HASH_A, HASH_B])
    assert result == [True]


def test_no_match():
    """A new fingerprint not in the existing set is not flagged."""
    result = flag_duplicates([HASH_C], [HASH_A, HASH_B])
    assert result == [False]


def test_empty_new_fingerprints_returns_empty_list():
    """Empty input produces empty output."""
    result = flag_duplicates([], [HASH_A, HASH_B])
    assert result == []


def test_empty_existing_fingerprints_returns_all_false():
    """No existing transactions means nothing can be a duplicate."""
    result = flag_duplicates([HASH_A, HASH_B], [])
    assert result == [False, False]


@pytest.mark.skip("wired in Epic 9")
def test_staged_vs_main_dedup_path():
    """
    Staged transactions must be checked against BOTH the staging DB and the main DB.

    Implementation note (Epic 9):
      - staging_pipeline.begin_import() collects existing hashes from:
          1. Current import batch's staged rows (catches intra-batch dupes)
          2. Main DB transactions within ±3-day window of each new transaction's date
      - Both sets are merged before calling flag_duplicates()
      - This ensures statement-period overlaps across multiple imports are caught
    """
    pass
```

### Key Design Decisions

**Why `compute_dedup_hash` lives in `base.py` (not `duplicate_detector.py`)?**

`base.py` defines the *data contract*: what a `StagedTransaction` is and how its identity hash is computed. The hash is a property of the transaction, not of the detector. By placing it in `base.py`, parsers (and later, the transactions blueprint in Story 3.3) can compute it without importing from `duplicate_detector`. This keeps `duplicate_detector.py` focused on its single responsibility: comparing fingerprints.

**Why `flag_duplicates` takes `list[str]` not `list[StagedTransaction]`?**

Architectural boundary (AR-7): `duplicate_detector` is shared infrastructure called from both the import pipeline AND the manual transactions blueprint. If it accepted `StagedTransaction` objects, the transactions blueprint would depend on the import pipeline contracts. Plain strings decouple the two callers.

**The ±3-day window is the CALLER's responsibility.**

`flag_duplicates` has no concept of dates or windows. The caller (staging_pipeline.py in Epic 9; transactions blueprint in Story 3.3) queries the DB for hashes within the relevant date window and passes them in. This keeps `duplicate_detector.py` free of all DB and ORM dependencies.
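
As a rough sketch of that division of labour, the caller below narrows candidate rows to a ±3-day window and only then hands plain strings to `flag_duplicates`. `StoredRow` and `hashes_in_window` are hypothetical stand-ins: a real caller would run a database query rather than filter an in-memory list.

```python
from datetime import date, timedelta
from typing import NamedTuple


class StoredRow(NamedTuple):
    """Hypothetical stand-in for a committed transaction row."""
    txn_date: date
    dedup_hash: str


def hashes_in_window(rows: list[StoredRow], center: date, days: int = 3) -> list[str]:
    """Collect hashes for rows dated within +/- `days` of `center` (inclusive)."""
    lo = center - timedelta(days=days)
    hi = center + timedelta(days=days)
    return [r.dedup_hash for r in rows if lo <= r.txn_date <= hi]


def flag_duplicates(new_fingerprints: list[str], existing_fingerprints: list[str]) -> list[bool]:
    # Repeated from duplicate_detector.py so the sketch runs on its own.
    existing_set = set(existing_fingerprints)
    return [fp in existing_set for fp in new_fingerprints]


rows = [
    StoredRow(date(2026, 5, 1), "h1"),
    StoredRow(date(2026, 5, 10), "h2"),  # outside the window around May 3
]
existing = hashes_in_window(rows, center=date(2026, 5, 3))
flags = flag_duplicates(["h1", "h2"], existing)
```

The detector never sees a date; all windowing happens before the call.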

**`amount` is `Decimal` in the hash computation.**

`str(Decimal('29.99'))` → `'29.99'` (exact representation). A `float` can look harmless in simple cases (`str(29.99)` → `'29.99'`), but arithmetic results expose binary rounding error: `str(0.1 + 0.2)` → `'0.30000000000000004'`, which would change the hash. All monetary amounts in this project are `Decimal`, never `float`.
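
The difference is easy to check directly. The sketch below also shows a related caveat: `str()` on a `Decimal` preserves trailing zeros, so numerically equal amounts written with different precision produce different hash inputs. The `canonical_amount` helper and the two-decimal `CENTS` constant are assumptions for illustration; the story does not state whether parsers quantize amounts.

```python
from decimal import Decimal

# Decimal keeps the digits it was given; float arithmetic can leak binary error.
decimal_sum = str(Decimal("0.1") + Decimal("0.2"))
float_sum = str(0.1 + 0.2)

# Equal values, different text: trailing zeros survive str().
same_value = Decimal("29.9") == Decimal("29.90")
same_text = str(Decimal("29.9")) == str(Decimal("29.90"))

CENTS = Decimal("0.01")  # assumed two-decimal currency


def canonical_amount(amount: Decimal) -> Decimal:
    """Hypothetical helper: pin an amount to two decimal places before hashing."""
    return amount.quantize(CENTS)
```

If parsers can emit amounts at differing precision, quantizing before calling `compute_dedup_hash` is one way to keep fingerprints stable.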

### Files Modified in This Story

| File | Status | Notes |
|------|--------|-------|
| `tests/test_services/test_duplicate_detector.py` | **NEW** | Written first (TDD red) |
| `app/services/pdf_parsers/base.py` | **MODIFY** | Replace stub with dataclasses + compute_dedup_hash |
| `app/services/duplicate_detector.py` | **MODIFY** | Replace stub with flag_duplicates |

### Files That Must NOT Be Touched

| File | Why |
|------|-----|
| `migrations/` | Must remain absent or empty: no `flask db init` in this story |
| `tests/conftest.py` | Complete and working — do not modify |
| `app/__init__.py` | Minimal factory from Story 1.2 — do not modify |
| Any model file | Models are Story 1.4 — stubs must stay as stubs |

### Previous Story Learnings

- **Story 1.2:** `tests/conftest.py` is live with `app`, `db`, `client` fixtures. The test file for Story 1.3 does NOT need these fixtures — `flag_duplicates` takes no DB or Flask context.
- **Story 1.1:** Python 3.12.3, pytest 9.0.3 in `.venv/`. Run with `source .venv/bin/activate`.
- pytest exit code 5 ("no tests collected") is normal for an empty suite; after this story the suite reports 4 passed and 1 skipped.

### Architecture References

- **AR-2** — [Source: docs/epics.md#Story-1.3] test file written before migration
- **AR-7** — [Source: docs/architecture.md#Architectural-Constraints] duplicate_detector is shared infrastructure
- **Boundary 3 (Parser/DB)** — [Source: docs/architecture.md] parsers never import from ORM; base.py has zero SQLAlchemy imports
- **Boundary 6 (Amortization/Everything)** — same isolation principle applies to `duplicate_detector.py`
- **Dedup key spec** — [Source: docs/architecture.md#Data-Architecture] `SHA256(normalized_merchant + str(amount) + date.isoformat())`

## Dev Agent Record

### Agent Model Used

claude-sonnet-4-6

### Debug Log References

None — clean implementation. TDD red phase confirmed ImportError; green phase passed on first attempt.

### Completion Notes List

- **Task 1 (TDD red):** Created `tests/test_services/test_duplicate_detector.py` with 4 test functions + 1 `@pytest.mark.skip` stub. Confirmed ImportError on run — TDD red phase validated.
- **Task 2 (base.py):** Replaced stub in `app/services/pdf_parsers/base.py` with `StagedTransaction` (9 fields), `ParseError` (4 fields), and `compute_dedup_hash()` using `hashlib.sha256`. Zero ORM imports — stdlib only.
- **Task 3 (duplicate_detector.py):** Replaced stub with `flag_duplicates(new_fingerprints, existing_fingerprints) -> list[bool]` using a set for O(1) lookups. Zero ORM/Flask imports. Zero imports from `pdf_parsers`.
- **Task 4 (green):** `pytest tests/test_services/test_duplicate_detector.py -v` → 4 passed, 1 skipped. Full suite `pytest tests/ -v` → same result, zero regressions. Confirmed `migrations/versions/` does not exist — AR-2 compliance maintained.

### File List

- `tests/test_services/test_duplicate_detector.py` — **NEW**: TDD spec (4 active tests + 1 skip stub for Epic 9)
- `app/services/pdf_parsers/base.py` — **MODIFIED**: replaced stub with StagedTransaction dataclass, ParseError dataclass, compute_dedup_hash()
- `app/services/duplicate_detector.py` — **MODIFIED**: replaced stub with flag_duplicates() implementation

### Change Log

- 2026-05-27: Story 1.3 implemented — TDD cycle complete: test file written first (red), base.py + duplicate_detector.py implemented (green). 4 tests passing, 1 skipped, no migration run. (claude-sonnet-4-6)