Training Data

4 datasets · 4000 records · 12 features · 10 training rounds
How each dataset shapes what /comprendre returns

The pipeline — data to product

Customer message
    |
    v
Extraction (regex or Haiku) --> 12 features
    |
    +---> routing_model trained on routing dataset --> routage.service
    |
    +---> urgency_model trained on urgency dataset --> routage.urgence
    |
    +---> sub_routing_model trained on sub_routing dataset --> routage.sous_categorie
    |
    +---> churn_model trained on churn dataset --> risque_client.churn
    |
    v
Combined --> priorite + actions_suggerees + trust_score + xai

Each model reads the same 12 features but listens to different signals. Routing cares about intention and keywords. Urgency cares about frustration and escalade. Churn cares about menace and emotion. The datasets teach each model what to focus on.

The 12 features

Feature                 Type         Routing    Urgency    SubCat     Churn
intention_primaire      categorical  primary    moderate   primary    moderate
mots_cles               text         primary    moderate   primary    strong
frustration             float        weak       primary    weak       primary
menace_churn            boolean      weak       primary    weak       primary
escalade                string       weak       strong     weak       strong
retard_jours            integer      moderate   strong     moderate   moderate
urgence_percue          categorical  weak       strong     weak       moderate
intention_secondaire    categorical  moderate   weak       moderate   weak
ref_commande            string       weak       weak       weak       weak
produit_mentionne       string       weak       weak       moderate   weak
canal                   categorical  weak       weak       weak       weak
date_livraison_prevue   string       moderate   weak       weak       weak

Snake doesn't use feature importance — it builds SAT clauses. But the training data encodes which features carry signal via the label-feature correlations. The table above reflects the synthetic data design: routing labels correlate with intention/keywords, urgency labels correlate with frustration/menace.

The 4 datasets

routing_model · strong · AUROC 0.990

Routes to the right department. 1000 records, 5 balanced classes (~20% each).

Class         Count   Distribution   Per-class AUROC
SAV           206     20.6%          0.998
Technique     205     20.5%          0.985
Comptabilite  202     20.2%          0.988
Commercial    194     19.4%          0.981
Logistique    193     19.3%          0.999

Product impact: Drives routage.service — which department handles this request. A wrong routing = client waits, gets bounced between departments, frustration rises. At 0.990 AUROC, misrouting is rare.

Real data potential: Every resolved ticket has a department tag. 500 labeled tickets would build a production-grade model. Current synthetic model is already near-ceiling — real data would refine edge cases (multi-department requests).

urgency_model · strong · AUROC 0.979

Classifies how fast to act. 1000 records, 4 imbalanced classes. Critique is the rare-but-critical class (11.4%).

Class     Count   Distribution   Per-class AUROC
Normale   357     35.7%          0.972
Haute     299     29.9%          0.962
Basse     230     23.0%          0.993
Critique  114     11.4%          0.988

Product impact: Drives routage.urgence and routage.priorite. A missed Critique = chantier bloque, client escalates, churn risk materializes. The model feeds the trust score via model_agreement (urgence + churn alignment).

Real data potential: Use time-to-first-response or SLA tier as proxy label. 800 tickets minimum. The Haute/Normale boundary is where real data helps most — agents themselves disagree on this.

sub_routing_model · needs real data · AUROC 0.853

Classifies the specific problem type. 1000 records, 6 imbalanced classes. "Autre" catch-all dominates at 41%.

Class             Count   Distribution   Per-class AUROC
Autre             410     41.0%          0.700
Erreur produit    191     19.1%          0.839
Facturation       132     13.2%          0.965
Retard livraison  117     11.7%          0.964
Casse transport    91      9.1%          0.814
Info stock         59      5.9%          0.837

Product impact: Drives routage.sous_categorie. At 54% accuracy, this is advisory, not actionable. The model correctly ranks sub-categories (AUROC 0.853) but can't commit to a hard prediction reliably. Facturation and Retard livraison are sharp (0.96+); Autre and Casse transport are blurry.

Real data potential: Highest priority for real data. Real messages distinguish "verre arrive casse" from "commande toujours pas recue" trivially — the synthetic generator can't. Expected jump: 0.853 → 0.92+ AUROC, 54% → 80%+ accuracy. 1500 labeled tickets minimum.

churn_model · strong · AUROC 0.901

Detects customer flight risk. 1000 records, 3 classes. The middle class (Risque modere) is the hard one.

Class           Count   Distribution   Per-class AUROC
Pas de risque   449     44.9%          0.934
Risque modere   297     29.7%          0.859
Risque eleve    254     25.4%          0.911

Product impact: Drives risque_client.churn, risque_client.facteurs, and escalation actions. A "Risque eleve" triggers: immediate callback, commercial gesture, management alert. This is where revenue is protected. Also feeds the trust score via model_agreement (aligned urgence + churn = higher trust).

Real data potential: Ground truth = customers who actually left (6-12 month lookback). 600 labeled records minimum. The payoff: Risque modere (0.859 AUROC) would sharpen with subtle signals — order frequency decline, tone shift across multiple tickets, payment cycle changes.

How datasets interact in the product

The 4 models don't run in isolation — their outputs combine to produce the final response:

Priority = f(urgence, churn)

priorite is computed from urgence + churn predictions. Critique OR Risque eleve → P1. Haute OR Risque modere → P2. This means the urgency and churn datasets jointly determine priority. If both agree (trust_score.model_agreement = 100), the priority is confident.
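The combination rule above can be sketched as a small function. The P1 and P2 rules come from the text; the P3 fallback for messages that are neither urgent nor at risk is an assumption, as is the function name:

```python
def compute_priorite(urgence: str, churn: str) -> str:
    """Combine the urgency and churn predictions into a priority tier.

    Critique OR Risque eleve -> P1; Haute OR Risque modere -> P2
    (rules from the text). P3 fallback is an assumption.
    """
    if urgence == "Critique" or churn == "Risque eleve":
        return "P1"
    if urgence == "Haute" or churn == "Risque modere":
        return "P2"
    return "P3"
```

Note that either model alone can force P1, which is why a missed Critique or a missed Risque eleve is costly on its own.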

Actions = f(service, urgence, churn)

Suggested actions depend on all three: "Rappel immediat par responsable logistique" uses the routing result. "< 2h" uses urgency. "Geste commercial" triggers on churn risk. Wrong routing → wrong department in the action. Wrong urgency → wrong timing.
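A minimal sketch of how the three predictions could combine into suggested actions. The action strings and thresholds are illustrative, not the product's exact copy:

```python
def actions_suggerees(service: str, urgence: str, churn: str) -> list[str]:
    """Sketch: each action pulls from a different model's output.

    The routing result names the responsible department, urgency sets
    the timing, and churn risk triggers the commercial gesture.
    """
    actions = []
    if churn in ("Risque eleve", "Risque modere"):
        # department comes from the routing model
        actions.append(f"Rappel immediat par responsable {service.lower()}")
    if urgence == "Critique":
        # timing comes from the urgency model
        actions.append("Repondre en < 2h")
    if churn == "Risque eleve":
        actions.append("Geste commercial")
    return actions
```

With this structure, a routing error corrupts the first action's department and an urgency error corrupts the timing, exactly the failure modes described above.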

Trust = f(all 4 models)

The trust score's classification_confidence component averages the top probability across all 4 models. If routing is 99% confident but sub_routing is 50/50, confidence drops. model_agreement checks urgence vs churn alignment — datasets that produce contradictory signals lower trust.
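The two components described above can be sketched as follows. The averaging of top probabilities and the aligned-case score of 100 come from the text; the exact alignment rule and the disagreement value of 50 are assumptions:

```python
def trust_components(top_probs: list[float], urgence: str, churn: str) -> dict:
    """Sketch of the two trust-score components.

    classification_confidence: mean of each model's top-class probability.
    model_agreement: 100 when urgency and churn point the same way
    (the alignment rule and the 50-point disagreement value are assumptions).
    """
    classification_confidence = 100 * sum(top_probs) / len(top_probs)
    urgent = urgence in ("Critique", "Haute")
    at_risk = churn in ("Risque eleve", "Risque modere")
    model_agreement = 100 if urgent == at_risk else 50
    return {
        "classification_confidence": classification_confidence,
        "model_agreement": model_agreement,
    }
```

This makes the failure mode concrete: a confident routing model (0.99) averaged with a 50/50 sub_routing model drags classification_confidence down even when the final answer is right.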

XAI = f(extraction + all models)

The audit trail cites each model's decision with the features that drove it. routing_audit references intention + keywords (from extraction). churn_audit references menace + frustration. Better datasets → more precise audits → more trust from the human operator.
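A sketch of what the audit trail might look like. The field names routing_audit and churn_audit come from the text; the dict layout and the helper name are assumptions:

```python
def build_audit(features: dict, predictions: dict) -> dict:
    """Sketch: cite each model's decision with its driving features.

    Drivers per the text: routing_audit cites intention + keywords,
    churn_audit cites menace + frustration. Layout is an assumption.
    """
    return {
        "routing_audit": {
            "decision": predictions["service"],
            "drivers": {k: features[k]
                        for k in ("intention_primaire", "mots_cles")},
        },
        "churn_audit": {
            "decision": predictions["churn"],
            "drivers": {k: features[k]
                        for k in ("menace_churn", "frustration")},
        },
    }
```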

What to label next

1. sub_routing (1500 tickets, AUROC 0.85 → 0.92+): 54% accuracy is advisory, not actionable. Real lexical distinctions ("casse" vs "retard") would unlock 80%+.
2. churn (600 tickets, AUROC 0.90 → 0.94+): Risque modere (0.859 AUROC) is the weak link. Real churn labels (customers who actually left) would sharpen the boundary.
3. urgency (800 tickets, AUROC 0.98 → 0.98, calibration only): the Haute/Normale boundary is noisy. Real SLA data would clarify it.
4. routing (500 tickets, AUROC 0.99 → 0.99, marginal): already at 0.990. Real data helps only for edge cases (multi-department requests).

One NDJSON record

All 4 datasets share the same 12 features. Only the label column changes.

{
  "intention_primaire": "reclamation_livraison",
  "intention_secondaire": "menace_depart",
  "ref_commande": "CMD-2026-1847",
  "produit_mentionne": "feuillete 44.2",
  "date_livraison_prevue": "2026-04-03",
  "retard_jours": 7,
  "frustration": 0.92,
  "urgence_percue": "critique",
  "menace_churn": 1,
  "escalade": "troisieme contact",
  "mots_cles": "attends bloque poseurs concurrent troisieme",
  "canal": "portail",
  "label": "Logistique"
}

One record per dataset; only the label differs:

  routing      → "Logistique"
  urgency      → "Critique"
  sub_routing  → "Retard livraison"
  churn        → "Risque eleve"

To add real data: take a resolved ticket, run it through extraction (regex or Haiku), add the label from your ticket system, append to the NDJSON. Retrain with python3 -m requestclassifier.train_models.
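The append step can be sketched as follows, assuming features is the 12-feature dict produced by extraction (the function name is hypothetical):

```python
import json

def append_labeled_record(path: str, features: dict, label: str) -> None:
    """Append one extracted ticket plus its label to an NDJSON dataset.

    `features` is assumed to be the 12-feature dict from extraction;
    the label comes from your ticket system.
    """
    record = dict(features, label=label)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

After appending enough records this way, rerun python3 -m requestclassifier.train_models to retrain on the extended dataset.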