Engine Precision Analysis
Micro-F1 · 76.1% · Measured on 14 datasets · 185 GT events
This page documents the measured classification accuracy of the 8-pass structural diff engine. All numbers come from running the engine on transcript data and comparing its output against ground-truth labels derived from what human annotators actually did — not a synthetic benchmark.
How ground truth was derived
There is no pre-labeled test set for this task. Instead, ground truth is derived by structurally analyzing what the human annotator actually did when transforming original → reworked — a behavioral labeling approach. The idea of inferring labels from before/after pairs is related to weak supervision (programmatic labeling) in NLP literature; what is specific here is the ordered 4-phase reconstruction of split/merge/modify/add/delete events from timed transcript data.
Phase 1 — MERGES first (N orig → 1 rewk)
For each reworked row, try combining a small window of contiguous original rows. If the combined text similarity exceeds the best single-row match by a required margin, the annotator merged those rows. MERGES are resolved first to avoid misclassifying merge sources as false DELETED.
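A minimal sketch of this pass, assuming a hypothetical window size, margin, and a simple token-Jaccard stand-in for similarity (the engine's actual similarity is the bigram Jaccard described under Similarity metric benchmarking; names and values here are illustrative):

```typescript
// Hypothetical sketch of Phase 1. WINDOW_MAX, MERGE_MARGIN and tokenSim are
// illustrative assumptions, not the engine's real names or thresholds.
interface Row { text: string; }

const WINDOW_MAX = 3;      // assumed: max contiguous original rows to combine
const MERGE_MARGIN = 0.15; // assumed: combined match must beat best single by this

// Stand-in similarity (token Jaccard); the engine uses bigram Jaccard.
function tokenSim(a: string, b: string): number {
  const A = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (A.size === 0 && B.size === 0) return 1;
  const inter = [...A].filter(t => B.has(t)).length;
  return inter / (A.size + B.size - inter);
}

// Returns the original rows merged into `reworked`, or null if no merge found.
function detectMerge(reworked: Row, originals: Row[]): Row[] | null {
  const bestSingle = Math.max(0, ...originals.map(o => tokenSim(o.text, reworked.text)));
  let best: { rows: Row[]; score: number } | null = null;
  for (let start = 0; start < originals.length; start++) {
    for (let len = 2; len <= WINDOW_MAX && start + len <= originals.length; len++) {
      const window = originals.slice(start, start + len);
      const score = tokenSim(window.map(r => r.text).join(" "), reworked.text);
      // The combined window must beat the best single-row match by the margin.
      if (score > bestSingle + MERGE_MARGIN && (!best || score > best.score)) {
        best = { rows: window, score };
      }
    }
  }
  return best?.rows ?? null;
}
```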
Phase 2 — SPLITS (1 orig → N rewk)
Same logic in reverse: for each unused original row, try combining a small window of contiguous reworked rows. If the combined text similarity exceeds the best single match by a required margin, the annotator split that original row.
Phase 3 — 1:1 matching (greedy best-first by similarity)
Remaining rows are matched greedily by text similarity within a time window. UNCHANGED: transcript identical AND timestamps within tolerance AND same speaker. Otherwise: MODIFIED.
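A sketch of the greedy matcher and the UNCHANGED gate, with an assumed candidate window and timestamp tolerance (the document does not state the engine's actual values):

```typescript
// Hypothetical sketch of Phase 3. TIME_WINDOW and TIME_TOLERANCE are assumed.
interface Seg { text: string; start: number; end: number; speaker: string; }

const TIME_WINDOW = 30;     // seconds; assumed candidate window
const TIME_TOLERANCE = 0.2; // seconds; assumed UNCHANGED timestamp tolerance

function classify(orig: Seg, rewk: Seg): "UNCHANGED" | "MODIFIED" {
  const identical =
    orig.text === rewk.text &&
    orig.speaker === rewk.speaker &&
    Math.abs(orig.start - rewk.start) <= TIME_TOLERANCE &&
    Math.abs(orig.end - rewk.end) <= TIME_TOLERANCE;
  return identical ? "UNCHANGED" : "MODIFIED";
}

// Greedy best-first: score all in-window pairs, accept in descending score
// order, skipping rows already consumed by a higher-scoring pair.
function greedyMatch(origs: Seg[], rewks: Seg[], sim: (a: Seg, b: Seg) => number): [number, number][] {
  const pairs: { i: number; j: number; s: number }[] = [];
  origs.forEach((o, i) =>
    rewks.forEach((r, j) => {
      if (Math.abs(o.start - r.start) <= TIME_WINDOW) pairs.push({ i, j, s: sim(o, r) });
    })
  );
  pairs.sort((a, b) => b.s - a.s);
  const usedO = new Set<number>(), usedR = new Set<number>();
  const matches: [number, number][] = [];
  for (const { i, j } of pairs) {
    if (usedO.has(i) || usedR.has(j)) continue;
    usedO.add(i);
    usedR.add(j);
    matches.push([i, j]);
  }
  return matches;
}
```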
Phase 4 — Leftovers
Unmatched original rows → DELETED. Unmatched reworked rows → ADDED.
Dataset scope
14 datasets from transcription annotation jobs. All datasets include NSE (non-speech event) markers, overlap tags ([overlap]), timestamps, speaker labels, and metadata tags.
| Dataset | Orig | Rewk | GT events | Correct | Accuracy |
|---|---|---|---|---|---|
| DS6 | 14 | 19 | 17 | 16 | 94.1% |
| DS7 | 22 | 13 | 14 | 14 | 100.0% |
| DS8 | 11 | 9 | 9 | 5 | 55.6% |
| DS9 | 11 | 9 | 9 | 9 | 100.0% |
| DS10 | 24 | 22 | 23 | 15 | 65.2% |
| DS11 | 28 | 17 | 21 | 17 | 81.0% |
| DS12 | 1 | 15 | 15 | 8 | 53.3% |
| DS13 | 9 | 14 | 11 | 8 | 72.7% |
| DS14 | 10 | 10 | 10 | 9 | 90.0% |
| DS15 | 6 | 8 | 8 | 5 | 62.5% |
| DS16 | 3 | 5 | 3 | 2 | 66.7% |
| DS17 | 12 | 13 | 14 | 9 | 64.3% |
| DS18 | 21 | 17 | 18 | 12 | 66.7% |
| DS19 | 9 | 11 | 13 | 5 | 38.5% |
| Total | 181 | 182 | 185 | 134 | 72.4% (micro recall) |
Class distribution
The dataset is heavily skewed toward MODIFIED, which makes up over half of all GT events. This is a natural property of transcript reworking — annotators correct rather than restructure the majority of segments.
The micro-F1 of 76.1% is influenced by this skew: MODIFIED (the dominant class, F1=82.7%) contributes disproportionately to the micro-average. The macro-F1 (70.4%) treats all 6 categories equally and gives a more balanced view of overall performance. UNCHANGED and SPLIT each have only 6 GT events — their per-category P/R/F1 numbers carry wide uncertainty.
| Category | GT events | Share of total |
|---|---|---|
| MODIFIED | 94 | 50.8% |
| ADDED | 46 | 24.9% |
| MERGED | 22 | 11.9% |
| DELETED | 11 | 5.9% |
| UNCHANGED | 6 | 3.2% |
| SPLIT | 6 | 3.2% |
The engine also shows a mild bias toward predicting MODIFIED: 3 GT-UNCHANGED rows are classified as MODIFIED (over-detecting changes), and several GT-ADDED rows are pulled into MODIFIED or SPLIT. This is partly by design — MODIFIED is the safest fallback when similarity is moderate — but it reduces recall for UNCHANGED and ADDED.
Per-category precision / recall / F1
Computed across 185 GT events from 14 datasets. TP = engine and GT agree. FP = engine predicted this category but GT says otherwise. FN = GT says this category but engine predicted something else.
| Category | Support | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| MODIFIED | 94 | 74 | 11 | 20 | 87.1% | 78.7% | 82.7% |
| ADDED | 46 | 28 | 8 | 18 | 77.8% | 60.9% | 68.3% |
| MERGED | 22 | 18 | 8 | 4 | 69.2% | 81.8% | 75.0% |
| DELETED | 11 | 6 | 2 | 5 | 75.0% | 54.5% | 63.2% |
| UNCHANGED | 6 | 3 | 0 | 3 | 100.0% | 50.0% | 66.7% |
| SPLIT | 6 | 5 | 4 | 1 | 55.6% | 83.3% | 66.7% |
| Micro avg | 185 | 134 | 33 | 51 | 80.2% | 72.4% | 76.1% |
| Macro avg | | | | | 77.4% | 68.2% | 70.4% |
Global engine micro-F1: 76.1% — 134 of 185 GT events correctly classified across 14 datasets.
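The aggregation can be re-derived from the TP/FP/FN counts in the table. A quick sanity check, with counts copied from above:

```typescript
// Re-derive the micro / macro averages from the per-category counts.
const perClass = [
  { tp: 74, fp: 11, fn: 20 }, // MODIFIED
  { tp: 28, fp: 8,  fn: 18 }, // ADDED
  { tp: 18, fp: 8,  fn: 4  }, // MERGED
  { tp: 6,  fp: 2,  fn: 5  }, // DELETED
  { tp: 3,  fp: 0,  fn: 3  }, // UNCHANGED
  { tp: 5,  fp: 4,  fn: 1  }, // SPLIT
];
const sum = (k: "tp" | "fp" | "fn") => perClass.reduce((s, c) => s + c[k], 0);
const [TP, FP, FN] = [sum("tp"), sum("fp"), sum("fn")]; // 134, 33, 51
const microP = TP / (TP + FP);                             // 0.802
const microR = TP / (TP + FN);                             // 0.724
const microF1 = (2 * microP * microR) / (microP + microR); // 0.761
// Per-class F1 = 2·TP / (2·TP + FP + FN); macro-F1 is their unweighted mean.
const macroF1 =
  perClass.reduce((s, c) => s + (2 * c.tp) / (2 * c.tp + c.fp + c.fn), 0) / perClass.length;
console.log(microF1.toFixed(3), macroF1.toFixed(3)); // "0.761" "0.704"
```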
Confusion matrix
Rows = GT labels. Columns = engine predictions. · denotes zero. UNMATCHED = GT row had no engine output at that anchor (absorbed as merge source / split child, or fell below all similarity gates).
| GT \ Engine | UNCHANGED | MODIFIED | SPLIT | MERGED | DELETED | ADDED | UNMATCHED |
|---|---|---|---|---|---|---|---|
| UNCHANGED | 3 | 3 | · | · | · | · | · |
| MODIFIED | · | 74 | · | 6 | 1 | 5 | 8 |
| SPLIT | · | 1 | 5 | · | · | · | · |
| MERGED | · | 1 | · | 18 | 1 | 2 | · |
| DELETED | · | 3 | 1 | · | 6 | 1 | · |
| ADDED | · | 3 | 3 | 2 | · | 28 | 10 |
The UNMATCHED column accounts for 18 events (9.7%): 8 GT-MODIFIED, 10 GT-ADDED. These are cases where the engine produces fewer output rows than GT expects — the most common cause is DS12, which has 1 original row expanding to 15 reworked rows, and the engine simply cannot produce enough events.
Root cause analysis
Breakdown of the 51 misclassifications by failure pattern and the engine parameter that controls each risk.
| Risk | Frequency | Root cause | Tunable via |
|---|---|---|---|
| MODIFIED → UNMATCHED | 8/94 = 9% | No engine match found at all | ↑ residual match window |
| MODIFIED → MERGED (over-merging) | 6/94 = 6% | Merge pass absorbs 1:1 matches | ↑ merge similarity threshold |
| ADDED → UNMATCHED | 10/46 = 22% | Engine produces fewer rows than GT (DS12) | Not tunable (structural event-count mismatch) |
| ADDED → SPLIT | 3/46 = 7% | New rows misclassified as splits | ↑ split similarity threshold |
| UNCHANGED → MODIFIED | 3/6 = 50% | Minor timestamp / whitespace drift | ↑ unchanged time tolerance |
| DELETED → MODIFIED | 3/11 = 27% | Engine finds weak match for deleted row | ↑ text similarity floor |
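One plausible shape for these knobs, written as a hypothetical config. The names and default values below are illustrative; only the text floor (0.10, from TEXT_SIM_FLOOR in the next section) appears in this document:

```typescript
// Hypothetical tunables corresponding to the table above; names and defaults
// are assumptions, not the engine's actual configuration.
interface EngineTunables {
  residualMatchWindowSec: number; // widen to rescue MODIFIED → UNMATCHED
  mergeSimThreshold: number;      // raise to curb over-merging (MODIFIED → MERGED)
  splitSimThreshold: number;      // raise to stop ADDED rows becoming SPLIT children
  unchangedTimeTolSec: number;    // loosen so timestamp drift stays UNCHANGED
  textSimFloor: number;           // raise to stop weak matches rescuing DELETED rows
}

const defaults: EngineTunables = {
  residualMatchWindowSec: 30,
  mergeSimThreshold: 0.55,
  splitSimThreshold: 0.55,
  unchangedTimeTolSec: 0.2,
  textSimFloor: 0.10, // the only value this document actually states
};
```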
Similarity metric benchmarking
Computed against 6 datasets (DS6–DS11), 116 original rows, 1,754 total (original, reworked) pairs.
matchScore = w_time × timeSim + w_text × txtSim
timeSim = time-proximity score (saturates beyond a max time delta)
txtSim = Jaccard bigram similarity after NFKC + diacritics normalization
[blocked] if txtSim falls below a minimum text-floor threshold
Pairwise score coverage
| Metric | Value |
|---|---|
| Total (orig, reworked) pairs evaluated | 1,754 |
| Pairs blocked by TEXT_SIM_FLOOR (txtSim < 0.10) | 1,459 (83.2%) |
| Pairs with non-zero scores | 295 (16.8%) |
| Mean score — all pairs | 0.065 |
| Mean score — non-zero pairs only | 0.385 |
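A sketch of the scoring pipeline defined above. The weights and the time-saturation constant are assumptions; only the 0.10 floor, the bigram Jaccard, and the NFKC + diacritics normalization are stated in this document:

```typescript
// Sketch of matchScore = w_time × timeSim + w_text × txtSim, with the text floor gate.
const W_TIME = 0.3, W_TEXT = 0.7; // assumed weights
const MAX_TIME_DELTA = 10;        // seconds; assumed saturation point
const TEXT_SIM_FLOOR = 0.10;      // stated in the coverage table above

function normalize(s: string): string {
  // NFKC, then decompose and strip combining diacritics, then casefold.
  return s.normalize("NFKC").normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

function bigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) out.add(s.slice(i, i + 2));
  return out;
}

function txtSim(a: string, b: string): number {
  const A = bigrams(normalize(a)), B = bigrams(normalize(b));
  if (A.size === 0 && B.size === 0) return 1;
  const inter = [...A].filter(g => B.has(g)).length;
  return inter / (A.size + B.size - inter); // Jaccard = |A∩B| / |A∪B|
}

function timeSim(deltaSec: number): number {
  // Linear decay that saturates to 0 beyond MAX_TIME_DELTA.
  return Math.max(0, 1 - Math.abs(deltaSec) / MAX_TIME_DELTA);
}

function matchScore(origText: string, rewkText: string, deltaSec: number): number {
  const t = txtSim(origText, rewkText);
  if (t < TEXT_SIM_FLOOR) return 0; // [blocked]: the floor gates the pair entirely
  return W_TIME * timeSim(deltaSec) + W_TEXT * t;
}
```

Note that in this reading the floor acts as a hard gate rather than a weighted penalty: a pair below it scores exactly zero, which is what produces the 83.2% blocked rate in the coverage table.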
Accepted match confidence distribution (64 matches)
| Confidence band | Count | Share |
|---|---|---|
| HIGH (score ≥ 0.70) | 38 | 59.4% |
| MED (0.40 ≤ score < 0.70) | 22 | 34.4% |
| LOW (0.20 ≤ score < 0.40) | 4 | 6.3% |
Decision margin above acceptance threshold
| Metric | Value |
|---|---|
| Mean margin above acceptance threshold | 0.522 |
| Borderline matches (margin < 0.05) | 1 (1.6%) |
| Flip risk: small score perturbation would reject | 1 (1.6%) |
Limitations
- 185 GT events is a small sample. Per-category numbers for UNCHANGED (6 events) and SPLIT (6 events) carry very wide confidence intervals; a single additional dataset could shift their F1 by 10+ points.
- Ground truth is derived, not hand-labeled. The derivation algorithm itself has tunable parameters that influence which events appear in GT. Different parameter choices would produce different GT labels and different precision numbers.