Engine Precision Analysis

Micro-F1 · 76.1%

Measured on 14 datasets · 185 GT events

This page documents the measured classification accuracy of the 8-pass structural diff engine. All numbers come from running the engine on transcript data and comparing its output against ground-truth labels derived from what human annotators actually did — not a synthetic benchmark.

How ground truth was derived

There is no pre-labeled test set for this task. Instead, ground truth is derived by structurally analyzing what the human annotator actually did when transforming original → reworked — a behavioral labeling approach. The idea of inferring labels from before/after pairs is related to weak supervision (programmatic labeling) in NLP literature; what is specific here is the ordered 4-phase reconstruction of split/merge/modify/add/delete events from timed transcript data.

Phase 1 — MERGES first (N orig → 1 rewk)

For each reworked row, try combining a small window of contiguous original rows. If the combined text similarity exceeds the best single-row match by a required margin, the annotator merged those rows. MERGES are resolved first so that merge-source original rows are not later misclassified as DELETED.
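The window-and-margin test can be sketched as follows. This is an illustrative sketch, not the engine's implementation: `max_window` and `margin` are hypothetical parameter names, and `difflib` similarity stands in for the engine's own metric (described under "Similarity metric benchmarking").

```python
from difflib import SequenceMatcher

def text_sim(a: str, b: str) -> float:
    """Stand-in text similarity (the engine uses Jaccard bigrams)."""
    return SequenceMatcher(None, a, b).ratio()

def detect_merge(rewk_text, orig_rows, max_window=3, margin=0.15):
    """Return (start, end) of the original-row window merged into one
    reworked row, or None if no combination beats the best 1:1 match."""
    best_single = max((text_sim(rewk_text, r) for r in orig_rows), default=0.0)
    best = None
    for start in range(len(orig_rows)):
        # windows of at least 2 rows, up to max_window rows
        for end in range(start + 2, min(start + max_window, len(orig_rows)) + 1):
            combined = " ".join(orig_rows[start:end])
            score = text_sim(rewk_text, combined)
            # accept only if combining beats the best single-row match by a margin
            if score > best_single + margin and (best is None or score > best[0]):
                best = (score, start, end)
    return (best[1], best[2]) if best else None
```

Phase 2 applies the same logic with the roles of original and reworked rows swapped.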

Phase 2 — SPLITS (1 orig → N rewk)

Same logic in reverse: for each unused original row, try combining a small window of contiguous reworked rows. If the combined text similarity exceeds the best single match by a required margin, the annotator split that original row.

Phase 3 — 1:1 matching (greedy best-first by similarity)

Remaining rows are matched greedily by text similarity within a time window. UNCHANGED: transcript identical AND timestamps within tolerance AND same speaker. Otherwise: MODIFIED.

Phase 4 — Leftovers

Unmatched original rows → DELETED. Unmatched reworked rows → ADDED.
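Phases 3 and 4 together can be sketched as a greedy best-first matcher over the rows left after merge/split resolution. The row schema (`text`, `start`, `speaker`), the parameter names `time_tol` and `min_sim`, and the `difflib` similarity are all illustrative assumptions, not the engine's actual code.

```python
from difflib import SequenceMatcher

def classify_pairs(orig, rewk, time_tol=0.2, min_sim=0.10):
    """Greedy best-first 1:1 matching sketch (Phases 3-4).
    Rows are dicts with 'text', 'start' (seconds), 'speaker'."""
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    # score every candidate pair, then consume highest-scoring pairs first
    cands = sorted(
        ((sim(o["text"], r["text"]), i, j)
         for i, o in enumerate(orig) for j, r in enumerate(rewk)),
        reverse=True)
    used_o, used_r, events = set(), set(), []
    for score, i, j in cands:
        if score < min_sim or i in used_o or j in used_r:
            continue
        used_o.add(i); used_r.add(j)
        o, r = orig[i], rewk[j]
        # UNCHANGED requires identical text AND close timestamps AND same speaker
        unchanged = (o["text"] == r["text"]
                     and abs(o["start"] - r["start"]) <= time_tol
                     and o["speaker"] == r["speaker"])
        events.append((i, j, "UNCHANGED" if unchanged else "MODIFIED"))
    # Phase 4: leftovers
    events += [(i, None, "DELETED") for i in range(len(orig)) if i not in used_o]
    events += [(None, j, "ADDED") for j in range(len(rewk)) if j not in used_r]
    return events
```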

Dataset scope

14 datasets from transcription annotation jobs. All datasets include NSE (non-speech event) markers, overlap tags ([overlap]), timestamps, speaker labels, and metadata tags.

| Dataset | Orig | Rewk | GT events | Correct | Accuracy |
|---|---|---|---|---|---|
| DS6 | 14 | 19 | 17 | 16 | 94.1% |
| DS7 | 22 | 13 | 14 | 14 | 100.0% |
| DS8 | 11 | 9 | 9 | 5 | 55.6% |
| DS9 | 11 | 9 | 9 | 9 | 100.0% |
| DS10 | 24 | 22 | 23 | 15 | 65.2% |
| DS11 | 28 | 17 | 21 | 17 | 81.0% |
| DS12 | 1 | 15 | 15 | 8 | 53.3% |
| DS13 | 9 | 14 | 11 | 8 | 72.7% |
| DS14 | 10 | 10 | 10 | 9 | 90.0% |
| DS15 | 6 | 8 | 8 | 5 | 62.5% |
| DS16 | 3 | 5 | 3 | 2 | 66.7% |
| DS17 | 12 | 13 | 14 | 9 | 64.3% |
| DS18 | 21 | 17 | 18 | 12 | 66.7% |
| DS19 | 9 | 11 | 13 | 5 | 38.5% |
| **Total** | 181 | 182 | 185 | 134 | 72.4% (micro recall) |

Class distribution

The dataset is heavily skewed toward MODIFIED, which makes up over half of all GT events. This is a natural property of transcript reworking — annotators correct rather than restructure the majority of segments.

The micro-F1 of 76.1% is influenced by this skew: MODIFIED (the dominant class, F1=82.7%) contributes disproportionately to the micro-average. The macro-F1 (70.4%) treats all 6 categories equally and gives a more balanced view of overall performance. UNCHANGED and SPLIT each have only 6 GT events — their per-category P/R/F1 numbers carry wide uncertainty.

| Category | GT events | Share of total |
|---|---|---|
| MODIFIED | 94 | 50.8% |
| ADDED | 46 | 24.9% |
| MERGED | 22 | 11.9% |
| DELETED | 11 | 5.9% |
| UNCHANGED | 6 | 3.2% |
| SPLIT | 6 | 3.2% |

The engine also shows a mild bias toward predicting MODIFIED: 3 GT-UNCHANGED rows are classified as MODIFIED (over-detecting changes), and several GT-ADDED rows are pulled into MODIFIED or SPLIT. This is partly by design — MODIFIED is the safest fallback when similarity is moderate — but it reduces recall for UNCHANGED and ADDED.

Per-category precision / recall / F1

Computed across 185 GT events from 14 datasets. TP = engine and GT agree. FP = engine predicted this category but GT says otherwise. FN = GT says this category but engine predicted something else.

| Category | Support | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| MODIFIED | 94 | 74 | 11 | 20 | 87.1% | 78.7% | 82.7% |
| ADDED | 46 | 28 | 8 | 18 | 77.8% | 60.9% | 68.3% |
| MERGED | 22 | 18 | 8 | 4 | 69.2% | 81.8% | 75.0% |
| DELETED | 11 | 6 | 2 | 5 | 75.0% | 54.5% | 63.2% |
| UNCHANGED | 6 | 3 | 0 | 3 | 100.0% | 50.0% | 66.7% |
| SPLIT | 6 | 5 | 4 | 1 | 55.6% | 83.3% | 66.7% |
| **Micro avg** | 185 | 134 | 33 | 51 | 80.2% | 72.4% | 76.1% |
| **Macro avg** | — | — | — | — | 77.4% | 68.2% | 70.4% |
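The micro and macro averages can be reproduced directly from the per-category TP/FP/FN counts in the table. Micro-averaging pools the counts before computing a single P/R/F1; macro-averaging computes F1 per category, then takes the unweighted mean:

```python
def f1(tp, fp, fn):
    """Precision, recall, F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r, (2 * p * r / (p + r) if p + r else 0.0)

# (TP, FP, FN) per category, taken from the table above
counts = {
    "MODIFIED": (74, 11, 20), "ADDED": (28, 8, 18), "MERGED": (18, 8, 4),
    "DELETED": (6, 2, 5), "UNCHANGED": (3, 0, 3), "SPLIT": (5, 4, 1),
}

# micro: pool all counts, then compute once
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_p, micro_r, micro_f1 = f1(tp, fp, fn)   # ≈ 0.802, 0.724, 0.761

# macro: compute F1 per category, then average
macro_f1 = sum(f1(*c)[2] for c in counts.values()) / len(counts)  # ≈ 0.704
```

Because MODIFIED contributes 94 of the 185 pooled events, its high F1 pulls the micro average above the macro average, as noted in the class-distribution discussion.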

Global engine micro-F1: 76.1% — 134 of 185 GT events correctly classified across 14 datasets.

Confusion matrix

Rows = GT labels. Columns = engine predictions. · denotes zero. UNMATCHED = GT row had no engine output at that anchor (absorbed as merge source / split child, or fell below all similarity gates).

| GT \ Engine | UNCHANGED | MODIFIED | SPLIT | MERGED | DELETED | ADDED | UNMATCHED |
|---|---|---|---|---|---|---|---|
| UNCHANGED | 3 | 3 | · | · | · | · | · |
| MODIFIED | · | 74 | · | 6 | 1 | 5 | 8 |
| SPLIT | · | 1 | 5 | · | · | · | · |
| MERGED | · | 1 | · | 18 | 1 | 2 | · |
| DELETED | · | 3 | 1 | · | 6 | 1 | · |
| ADDED | · | 3 | 3 | 2 | · | 28 | 10 |

The UNMATCHED column accounts for 18 events (9.7%): 8 GT-MODIFIED, 10 GT-ADDED. These are cases where the engine produces fewer output rows than GT expects — the most common cause is DS12, which has 1 original row expanding to 15 reworked rows, and the engine simply cannot produce enough events.

Root cause analysis

Breakdown of the 51 misclassifications by failure pattern and the engine parameter that controls each risk. The six patterns below account for 33 of the 51 errors; the remainder are scattered single-event confusions visible in the matrix above.

| Risk | Frequency | Root cause | Tunable via |
|---|---|---|---|
| MODIFIED → UNMATCHED | 8/94 = 9% | No engine match found at all | ↑ residual match window |
| MODIFIED → MERGED (over-merging) | 6/94 = 6% | Merge pass absorbs 1:1 matches | ↑ merge similarity threshold |
| ADDED → UNMATCHED | 10/46 = 22% | Engine produces fewer rows than GT (DS12) | Event count mismatch |
| ADDED → SPLIT | 3/46 = 7% | New rows misclassified as splits | ↑ split similarity threshold |
| UNCHANGED → MODIFIED | 3/6 = 50% | Minor timestamp / whitespace drift | ↑ unchanged time tolerance |
| DELETED → MODIFIED | 3/11 = 27% | Engine finds weak match for deleted row | ↑ text similarity floor |

Similarity metric benchmarking

Computed against 6 datasets (DS6–DS11), 116 original rows, 1,754 total (original, reworked) pairs.

```text
matchScore = w_time × timeSim + w_text × txtSim

timeSim  = time-proximity score (saturates beyond a max time delta)
txtSim   = Jaccard bigram similarity after NFKC + diacritics normalization
[blocked] if txtSim falls below a minimum text-floor threshold
```
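A minimal sketch of this scoring function is below. Only `TEXT_SIM_FLOOR = 0.10` is stated in this document; the weights `W_TIME`/`W_TEXT` and `MAX_TIME_DELTA` are hypothetical placeholder values chosen for illustration.

```python
import unicodedata

W_TIME, W_TEXT = 0.3, 0.7   # hypothetical weights; not specified in this document
MAX_TIME_DELTA = 2.0        # seconds; hypothetical saturation point for timeSim
TEXT_SIM_FLOOR = 0.10       # stated floor: pairs below this are blocked

def normalize(s: str) -> str:
    """NFKC-normalize, strip diacritics (via NFKD decomposition), lowercase."""
    nfkd = unicodedata.normalize("NFKD", unicodedata.normalize("NFKC", s))
    return "".join(c for c in nfkd if not unicodedata.combining(c)).lower()

def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard_bigram(a: str, b: str) -> float:
    A, B = bigrams(normalize(a)), bigrams(normalize(b))
    return len(A & B) / len(A | B) if A | B else 1.0

def time_sim(t1: float, t2: float) -> float:
    """Linear time-proximity score, saturating to 0 beyond MAX_TIME_DELTA."""
    return max(0.0, 1.0 - abs(t1 - t2) / MAX_TIME_DELTA)

def match_score(t1, t2, text1, text2):
    """Weighted match score, or None if the pair is blocked by the text floor."""
    txt = jaccard_bigram(text1, text2)
    if txt < TEXT_SIM_FLOOR:
        return None  # [blocked]
    return W_TIME * time_sim(t1, t2) + W_TEXT * txt
```

The floor explains the coverage numbers below: most candidate pairs compare unrelated segments, fall under the floor, and never receive a score.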

Pairwise score coverage

| Metric | Value |
|---|---|
| Total (orig, reworked) pairs evaluated | 1,754 |
| Pairs blocked by TEXT_SIM_FLOOR (txtSim < 0.10) | 1,459 (83.2%) |
| Pairs with non-zero scores | 295 (16.8%) |
| Mean score (all pairs) | 0.065 |
| Mean score (non-zero pairs only) | 0.385 |

Accepted match confidence distribution (64 matches)

| Confidence band | Count | Share |
|---|---|---|
| HIGH (score ≥ 0.70) | 38 | 59.4% |
| MED (0.40 ≤ score < 0.70) | 22 | 34.4% |
| LOW (0.20 ≤ score < 0.40) | 4 | 6.3% |

Decision margin above acceptance threshold

| Metric | Value |
|---|---|
| Mean margin above acceptance threshold | 0.522 |
| Borderline matches (margin < 0.05) | 1 (1.6%) |
| Flip risk: small score perturbation would reject | 1 (1.6%) |

Limitations

  • 185 GT events is a small sample. Per-category numbers for UNCHANGED (6 events) and SPLIT (6 events) carry very wide confidence intervals — a single additional dataset could shift their F1 by 10+ points.

  • Ground truth is derived, not hand-labeled. The derivation algorithm itself has tunable parameters that influence which events appear in GT. Different parameter choices would produce different GT labels and different precision numbers.