10 Commits

Author SHA1 Message Date
chahinebrini
0533fcad71 fix(mail): classifier v1.2 — FS-token + extreme-percent + casino-name + threshold 40
- SUBJECT_PERCENT_PATTERN weight 10 → 15
- SCORE_BLOCK_MIDRANGE 50 → 40
- new FS-token pattern (+20): \b\d+\s*FS\b (free-spins abbreviation)
- new extreme-percent pattern (+20): \b[1-9]\d{2,}\s*% (>=100% signals gambling)
- casino in sender display-name (+30): targeted v1.2 reactivation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:17:25 +02:00
chahinebrini
38811820e6 feat(backend): Public-Domain-Guard + Mail-Detection (spins/%-Pattern)
Public-Domain-Guard (icloud.com/gmail.com etc. nie blockbar/veröffentlichbar):
- neue utils/public-email-domains.ts (shared Freemail-Liste)
- custom-domains/index.post + custom-domains/suggest + curated-domains/suggest
  lehnen Public-Domains mit 400 PUBLIC_DOMAIN ab (defense-in-depth)

Mail-Detection (mo): "spins" zu GAMBLING_KEYWORDS + Subject-%-Pattern (Score 10)
→ fängt "Spins + 400% Bonus"-Spam von Freemail-Absendern. 61/61 Tests grün.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 01:06:06 +02:00
chahinebrini
c3de7055a5 feat(mail): Sucht-Compound-Regel + Phase-1-Training-Foundation
Task B — linguistische FP-Fix:
- mail-classifier.ts: Subject-Keyword-Loop überspringt Keyword-Score wenn
  Subject das Keyword als Sucht-Compound enthält (z.B. "glücksspiel" in
  "Glücksspielsucht" → kein +50 Score). Globale linguistische Invariante
  Deutsch — Gambling-Marketer schreiben nie "Glücksspielsucht-Bonus".
- gambling-keywords.mjs: GAMBLING_WHITELIST erweitert um Stamm-Varianten
  (wettsucht, spielsucht, suchtberatung, suchthilfe) als Fallback für
  Compounds wo keyword ≠ exakter Stamm.
- 4 neue Tests: Forum Glücksspielsucht → PASS, Hilfe bei Spielsucht → PASS,
  Wettsucht-Selbsthilfe → PASS, Glücksspiel-Bonus 100€ → BLOCK.

Task C — Phase-1-Data-Foundation:
- mail-training-utils.ts: sanitizeSubjectForTraining() (PII-Stripping via
  Regex: EMAIL/URL/NUM/Greeting/ALL-CAPS) + detectSubjectLanguage() via
  franc (iso639-3). 26 Unit-Tests.
- franc@6.2.0 installiert (~50KB ESM).
- mail.ts insertMailClassificationSample(): ruft sanitizeSubjectForTraining()
  auf, schreibt detectedLang + subjectSanitized in features-JSON
  (Interim bis Schema-Migration).
- mail-retention-cron.ts: Subject-Nullification nach 30 Tagen (täglich) +
  Sample-Purge nach 12 Monaten (monatlich). DSGVO Art. 5 Abs. 1e.

105 Tests grün (58 classifier + 26 training-utils + 11 display-name + 10 gmail).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-29 08:14:57 +02:00
chahinebrini
807847381f fix(mail): Junk-Folder IDLE-Gap + Layer 2.6 global Display-Name-Patterns
Task 1 — Junk-Folder Fix:
- noopTimer (alle 2min) ruft jetzt triggerScan(conn) fire-and-forget auf
- Outlook/Hotmail-Mails die direkt in "Junk Email" landen werden damit
  innerhalb von max. 2min erfasst (IDLE hört nur INBOX, kein exists-Event)
- Consent-Guard analog exists-Event: nur wenn conn.consentAt gesetzt

Task 2 — Layer 2.6 global Display-Name-Patterns:
- getMailDisplayNamePatterns(userId) neu in db/domains.ts:
  lädt aus global_mail_display_names (admin-curated, pending Migration)
  + user_custom_domains type=mail_display_name (backward-compat)
  mit try/catch-Fallback bis Schema-Migration deployed ist
- getCustomMailDisplayNames() als @deprecated markiert (bleibt für Übergangszeitraum)
- scan-internal.post.ts: Import auf getMailDisplayNamePatterns umgestellt
- mail-classifier.ts: Layer 2.6 Kommentar von "dead code" auf "live v1.1" aktualisiert

Schema-Migration (global_mail_display_names) ist Aufgabe von rebreak-backend.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-28 16:07:05 +02:00
chahinebrini
4573d16e1a refactor(mail-classifier): display-name aus Score-Pfad entfernen (v1.0)
SENDER_NAME_GAMBLING_KEYWORD (+30) und SENDER_NAME_BRAND_MATCH (+20) aus
SCORE_WEIGHTS entfernt. Layer-2.5-Brand-Match prüft nur noch Domain-Root
und Relay-Domain, nicht mehr displayNameNorm. Sender-Name-Keywords-Block
in computeScore() entfernt. keywordHitsName bleibt im Interface für v1.1.

Tests: Brand+Random-Tests die Display-Name als einzige Brand-Source hatten
auf neues v1.0-Verhalten (PASS) umgeschrieben. Zwei neue Tests: Display-Name-
only Casino-Signal → Score=0 → PASS verifiziert.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 05:18:00 +02:00
chahinebrini
00ec716694 fix(mail): skip Gmail system folders in scan + raise subject-keyword score to 50
Fix 1 (scan-internal): filter out \All, \Drafts, \Sent, \Trash, \Flagged via
specialUse — stops [Gmail]/All Mail from consuming the SCAN_LIMIT=200 and
blocking new INBOX mails from reaching fetch range. \Junk/\Spam stay in scope.
Folders without specialUse (iCloud, GMX) pass through untouched — no false
exclusions without confirmed metadata.

Fix 2 (mail-classifier): raise SUBJECT_GAMBLING_KEYWORD from 35 to 50 so a
single unambiguous casino/jackpot/freispiel subject hit alone reaches the
SCORE_BLOCK_MIDRANGE threshold and triggers a block. Previously 35 pts fell
short when sender domain was generic and display name empty.

Tests: 9 new cases added (2 Fix-2 classifier + 4 Fix-1 folder-filter unit +
1 computeScore score=50 exact assertion). All 265 tests green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-16 05:12:14 +02:00
chahinebrini
c1250836a3 fix(backend): remove display-name pattern support for v1.0
User explicitly chose to drop display-name matching from v1.0 after
the UX trap surfaced — a user typing "EXTRASPIN" without a domain got
a 400 INVALID_DOMAIN back, which is a confusing dead-end. v1.1 will
ship a dedicated display-name UI; until then mail input is domain-only.

- resolveTypeAndValue returns a discriminated union — kind='mail' with
  no dot or @ now resolves to { ok: false, error: 'INVALID_MAIL_DOMAIN' }
  instead of silently turning into a mail_display_name row.
- Full-address mail input (local@domain.tld) still gets its local-part
  stripped server-side so the stored value is always a clean domain.
- Variant-B body { type: 'mail_display_name' } returns 400
  DISPLAY_NAME_NOT_SUPPORTED for direct API consumers.
- The DISPLAY_NAME_PATTERN regex is gone — the path that used it can
  no longer be reached.
- classifyMail's Layer 2.6 (the display-name substring match) is
  intentionally left in place as dead code with a v1.1 marker, so
  re-enabling later is just wiring the input field back up and feeding
  the customDisplayNames array.
- Tests rewritten: the two pre-existing display-name tests now assert
  the 400 INVALID_MAIL_DOMAIN path, plus a new positive case for the
  full-address local-part strip. 217 vitest passes, 4 pre-existing skips.

Staging DB clean — the type column hasn't been deployed yet so no
mail_display_name rows exist to backfill.
2026-05-16 02:17:50 +02:00
chahinebrini
7dbcac6700 feat(backend): custom mail patterns — display-name match + type-aware api
Completes the custom-mail-patterns feature (schema + migration shipped
in ba170af alongside the chat-tab-badge commit — apologies for the
mishap, agent staging collided with mine). This is the actual logic
that makes the new type column do work:

- mail-classifier.ts: new layer 2.6 between brand+random-token detect
  and the score-based heuristic. Case-insensitive substring match of
  the From-display-name against the user's customDisplayNames list.
  Hard-block when matched, skip score entirely.
- db/domains.ts: getCustomMailDisplayNames(userId) reads the new
  type=mail_display_name rows. countActiveCustomDomains stays a shared
  total — matches the user's pick of a single 5/5/10 pool spanning
  web + mail patterns rather than separate counts per type.
- scan-internal.post.ts and scan.post.ts both preload the display-name
  list per user before the message loop and thread it into classifyMail.
- POST /api/custom-domains accepts { pattern, kind: 'web' | 'mail' }
  with the server inferring the concrete type — 'mail' splits into
  mail_domain when the input contains a TLD-like shape, otherwise
  mail_display_name. Existing { domain } body shape stays accepted
  for backwards compatibility with older clients.
- POST /api/custom-domains/:id/submit treats both mail types as
  community-submittable. The user explicitly chose this; the admin
  review pipeline is the backstop against display-name false positives.
- vitest cases cover: substring match, case insensitivity, no-match
  fallthrough to score, mail_domain still flowing through the existing
  domain-set path, and shared-pool slot counts (3 web + 2 mail_domain
  + 1 mail_display_name = 6 against the 10-slot legend cap).
2026-05-16 01:53:59 +02:00
chahinebrini
f2e3c00943 refactor(mail): remove groq llm layer — deterministic pipeline only
User-Direktive: Mail-Filter bleibt auf dem deterministischen Score+Layer-2.5-Stack.
Groq-LLM Borderline-Call (Layer 4) entfernt. Layer 2.5 Brand+Random fängt den
Apple Hide-My-Email Fall (icloud.com-Adressen mit kryptischen Local-Parts +
Brand-DisplayName) weiterhin sauber via Hard-Block.

Score-Mid-Range 25-79 entscheidet jetzt deterministisch: ≥50 → BLOCK, sonst PASS.

Damit auch DSGVO-P0-Items aus dem Hans-Müller-Review obsolet
(AVV-Annex Groq, Drittland-USA-Consent-Toggle, Datenschutzerklärung-Absatz).

- mail-classifier.ts: callGroqClassifier + redactLocalPartForLLM + groq-Feld raus
- scan.post.ts + scan-internal.post.ts: groqApiKey-Param raus, groq*-Sample-Felder raus
- mail-classifier.test.ts: Groq-Tests + redactLocalPart-Tests entfernt, 46 Tests grün

DB-Spalten in mail_classification_samples (groq_*) bleiben als legacy nullable —
Cleanup-Migration optional in späterem Sprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 22:15:32 +02:00
chahinebrini
bdd93668ae feat(mail): multi-layer classifier — Brand+Random, Relay-Decoder, Score, Groq + ML-Sampling
Layer 0–4 Klassifikations-Pipeline in mail-classifier.ts:
- Layer 2: Domain-Hard-Block + Relay-Decoder (=domain.tld aus SendGrid/Mailchimp-Bounces)
- Layer 2.5: Brand+Random-Token-Hard-Block (Gambling-Brand-Normalisierung + Random-Token-Detection)
  verhindert LLM-Call für bekannte Gambling-Relayer (Gamblezen, BetandPlay etc.)
- Layer 3: Score 0–100 (TS-Gewichte: Domain-Keywords, Subject-Keywords, Name-Match,
  Geld-Pattern, Urgency, All-Caps, Short-Random-Domain, Brand/Random-Ergänzungen)
- Layer 4: Groq Llama 3.3 70B Borderline-Klassifikation (Score 25–75)
  mit Local-Part-Redaction (DSGVO: nur behalten wenn local-part selbst Keyword enthält)
- Layer 5: MailClassificationSample-Insert nach jeder Klassifikation (ML-Phase 3)

Migrations:
- 20260514_add_mail_blocked_trigger_source: ADD COLUMN trigger_source auf mail_blocked
- 20260514_add_mail_classification_sample: CREATE TABLE mail_classification_samples

50 neue Tests (mail-classifier.test.ts): alle Layer, beide Screenshot-Beispiele (Gamblezen +
BetandPlay) bestätigt als Layer-2.5-Hard-Block ohne LLM-Call, Whitelist, Score, Redaction.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-14 22:05:35 +02:00