Engine Precision Analysis
Micro-F1 · 76.1% · Measured on 14 datasets · 185 GT events
This page documents the measured classification accuracy of the 8-pass structural diff engine. All numbers come from running the engine on transcript data and comparing its output against ground-truth labels derived from what human annotators actually did — not a synthetic benchmark.
How ground truth was derived
There is no pre-labeled test set for this task. Instead, ground truth is derived by structurally analyzing what the human annotator actually did when transforming original → reworked — a behavioral labeling approach. The idea of inferring labels from before/after pairs is related to weak supervision (programmatic labeling) in NLP literature; what is specific here is the ordered 4-phase reconstruction of split/merge/modify/add/delete events from timed transcript data.
Phase 1 — MERGES first (N orig → 1 rewk)
For each reworked row, try combining a small window of contiguous original rows. If the combined text similarity exceeds the best single-row match by a required margin, the annotator merged those rows. MERGES are resolved first to avoid misclassifying merge sources as false DELETED.
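A minimal sketch of this pass, assuming a hypothetical window size, margin, and a simple token-Jaccard stand-in for similarity (the engine's actual similarity is the bigram Jaccard described under Similarity metric benchmarking; names and values here are illustrative):

```typescript
// Hypothetical sketch of Phase 1. WINDOW_MAX, MERGE_MARGIN and tokenSim are
// illustrative assumptions, not the engine's real names or thresholds.
interface Row { text: string; }

const WINDOW_MAX = 3;      // assumed: max contiguous original rows to combine
const MERGE_MARGIN = 0.15; // assumed: combined match must beat best single by this

// Stand-in similarity (token Jaccard); the engine uses bigram Jaccard.
function tokenSim(a: string, b: string): number {
  const A = new Set(a.toLowerCase().split(/\s+/).filter(Boolean));
  const B = new Set(b.toLowerCase().split(/\s+/).filter(Boolean));
  if (A.size === 0 && B.size === 0) return 1;
  const inter = [...A].filter(t => B.has(t)).length;
  return inter / (A.size + B.size - inter);
}

// Returns the original rows merged into `reworked`, or null if no merge found.
function detectMerge(reworked: Row, originals: Row[]): Row[] | null {
  const bestSingle = Math.max(0, ...originals.map(o => tokenSim(o.text, reworked.text)));
  let best: { rows: Row[]; score: number } | null = null;
  for (let start = 0; start < originals.length; start++) {
    for (let len = 2; len <= WINDOW_MAX && start + len <= originals.length; len++) {
      const window = originals.slice(start, start + len);
      const score = tokenSim(window.map(r => r.text).join(" "), reworked.text);
      // The combined window must beat the best single-row match by the margin.
      if (score > bestSingle + MERGE_MARGIN && (!best || score > best.score)) {
        best = { rows: window, score };
      }
    }
  }
  return best?.rows ?? null;
}
```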
Phase 2 — SPLITS (1 orig → N rewk)
Same logic in reverse: for each unused original row, try combining a small window of contiguous reworked rows. If the combined text similarity exceeds the best single match by a required margin, the annotator split that original row.
Phase 3 — 1:1 matching (greedy best-first by similarity)
Remaining rows are matched greedily by text similarity within a time window. UNCHANGED: transcript identical AND timestamps within tolerance AND same speaker. Otherwise: MODIFIED.
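A sketch of the greedy matcher and the UNCHANGED gate, with an assumed candidate window and timestamp tolerance (the document does not state the engine's actual values):

```typescript
// Hypothetical sketch of Phase 3. TIME_WINDOW and TIME_TOLERANCE are assumed.
interface Seg { text: string; start: number; end: number; speaker: string; }

const TIME_WINDOW = 30;     // seconds; assumed candidate window
const TIME_TOLERANCE = 0.2; // seconds; assumed UNCHANGED timestamp tolerance

function classify(orig: Seg, rewk: Seg): "UNCHANGED" | "MODIFIED" {
  const identical =
    orig.text === rewk.text &&
    orig.speaker === rewk.speaker &&
    Math.abs(orig.start - rewk.start) <= TIME_TOLERANCE &&
    Math.abs(orig.end - rewk.end) <= TIME_TOLERANCE;
  return identical ? "UNCHANGED" : "MODIFIED";
}

// Greedy best-first: score all in-window pairs, accept in descending score
// order, skipping rows already consumed by a higher-scoring pair.
function greedyMatch(origs: Seg[], rewks: Seg[], sim: (a: Seg, b: Seg) => number): [number, number][] {
  const pairs: { i: number; j: number; s: number }[] = [];
  origs.forEach((o, i) =>
    rewks.forEach((r, j) => {
      if (Math.abs(o.start - r.start) <= TIME_WINDOW) pairs.push({ i, j, s: sim(o, r) });
    })
  );
  pairs.sort((a, b) => b.s - a.s);
  const usedO = new Set<number>(), usedR = new Set<number>();
  const matches: [number, number][] = [];
  for (const { i, j } of pairs) {
    if (usedO.has(i) || usedR.has(j)) continue;
    usedO.add(i);
    usedR.add(j);
    matches.push([i, j]);
  }
  return matches;
}
```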
Phase 4 — Leftovers
Unmatched original rows → DELETED. Unmatched reworked rows → ADDED.
Dataset scope
14 datasets from transcription annotation jobs. All datasets include NSE (non-speech event) markers, overlap tags ([overlap]), timestamps, speaker labels, and metadata tags.
| Dataset | Orig | Rewk | GT events | Correct | Accuracy |
|---|---|---|---|---|---|
| DS6 | 14 | 19 | 17 | 16 | 94.1% |
| DS7 | 22 | 13 | 14 | 14 | 100.0% |
| DS8 | 11 | 9 | 9 | 5 | 55.6% |
| DS9 | 11 | 9 | 9 | 9 | 100.0% |
| DS10 | 24 | 22 | 23 | 15 | 65.2% |
| DS11 | 28 | 17 | 21 | 17 | 81.0% |
| DS12 | 1 | 15 | 15 | 8 | 53.3% |
| DS13 | 9 | 14 | 11 | 8 | 72.7% |
| DS14 | 10 | 10 | 10 | 9 | 90.0% |
| DS15 | 6 | 8 | 8 | 5 | 62.5% |
| DS16 | 3 | 5 | 3 | 2 | 66.7% |
| DS17 | 12 | 13 | 14 | 9 | 64.3% |
| DS18 | 21 | 17 | 18 | 12 | 66.7% |
| DS19 | 9 | 11 | 13 | 5 | 38.5% |
| Total | 181 | 182 | 185 | 134 | 72.4% (micro recall) |
Class distribution
The dataset is heavily skewed toward MODIFIED, which makes up over half of all GT events. This is a natural property of transcript reworking — annotators correct rather than restructure the majority of segments.
The micro-F1 of 76.1% is influenced by this skew: MODIFIED (the dominant class, F1=82.7%) contributes disproportionately to the micro-average. The macro-F1 (70.4%) treats all 6 categories equally and gives a more balanced view of overall performance. UNCHANGED and SPLIT each have only 6 GT events — their per-category P/R/F1 numbers carry wide uncertainty.
| Category | GT events | Share of total |
|---|---|---|
| MODIFIED | 94 | 50.8% |
| ADDED | 46 | 24.9% |
| MERGED | 22 | 11.9% |
| DELETED | 11 | 5.9% |
| UNCHANGED | 6 | 3.2% |
| SPLIT | 6 | 3.2% |
The engine also shows a mild bias toward predicting MODIFIED: 3 GT-UNCHANGED rows are classified as MODIFIED (over-detecting changes), and several GT-ADDED rows are pulled into MODIFIED or SPLIT. This is partly by design — MODIFIED is the safest fallback when similarity is moderate — but it reduces recall for UNCHANGED and ADDED.
Per-category precision / recall / F1
Computed across 185 GT events from 14 datasets. TP = engine and GT agree. FP = engine predicted this category but GT says otherwise. FN = GT says this category but engine predicted something else.
| Category | Support | TP | FP | FN | Precision | Recall | F1 |
|---|---|---|---|---|---|---|---|
| MODIFIED | 94 | 74 | 11 | 20 | 87.1% | 78.7% | 82.7% |
| ADDED | 46 | 28 | 8 | 18 | 77.8% | 60.9% | 68.3% |
| MERGED | 22 | 18 | 8 | 4 | 69.2% | 81.8% | 75.0% |
| DELETED | 11 | 6 | 2 | 5 | 75.0% | 54.5% | 63.2% |
| UNCHANGED | 6 | 3 | 0 | 3 | 100.0% | 50.0% | 66.7% |
| SPLIT | 6 | 5 | 4 | 1 | 55.6% | 83.3% | 66.7% |
| Micro avg | 185 | 134 | 33 | 51 | 80.2% | 72.4% | 76.1% |
| Macro avg | | | | | 77.4% | 68.2% | 70.4% |
Global engine micro-F1: 76.1% — 134 of 185 GT events correctly classified across 14 datasets.
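The aggregation can be re-derived from the TP/FP/FN counts in the table. A quick sanity check, with counts copied from above:

```typescript
// Re-derive the micro / macro averages from the per-category counts.
const perClass = [
  { tp: 74, fp: 11, fn: 20 }, // MODIFIED
  { tp: 28, fp: 8,  fn: 18 }, // ADDED
  { tp: 18, fp: 8,  fn: 4  }, // MERGED
  { tp: 6,  fp: 2,  fn: 5  }, // DELETED
  { tp: 3,  fp: 0,  fn: 3  }, // UNCHANGED
  { tp: 5,  fp: 4,  fn: 1  }, // SPLIT
];
const sum = (k: "tp" | "fp" | "fn") => perClass.reduce((s, c) => s + c[k], 0);
const [TP, FP, FN] = [sum("tp"), sum("fp"), sum("fn")]; // 134, 33, 51
const microP = TP / (TP + FP);                             // 0.802
const microR = TP / (TP + FN);                             // 0.724
const microF1 = (2 * microP * microR) / (microP + microR); // 0.761
// Per-class F1 = 2·TP / (2·TP + FP + FN); macro-F1 is their unweighted mean.
const macroF1 =
  perClass.reduce((s, c) => s + (2 * c.tp) / (2 * c.tp + c.fp + c.fn), 0) / perClass.length;
console.log(microF1.toFixed(3), macroF1.toFixed(3)); // "0.761" "0.704"
```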
Confusion matrix
Rows = GT labels. Columns = engine predictions. · denotes zero. UNMATCHED = GT row had no engine output at that anchor (absorbed as merge source / split child, or fell below all similarity gates).
| GT \ Engine | UNCHANGED | MODIFIED | SPLIT | MERGED | DELETED | ADDED | UNMATCHED |
|---|---|---|---|---|---|---|---|
| UNCHANGED | 3 | 3 | · | · | · | · | · |
| MODIFIED | · | 74 | · | 6 | 1 | 5 | 8 |
| SPLIT | · | 1 | 5 | · | · | · | · |
| MERGED | · | 1 | · | 18 | 1 | 2 | · |
| DELETED | · | 3 | 1 | · | 6 | 1 | · |
| ADDED | · | 3 | 3 | 2 | · | 28 | 10 |
The UNMATCHED column accounts for 18 events (9.7%): 8 GT-MODIFIED, 10 GT-ADDED. These are cases where the engine produces fewer output rows than GT expects — the most common cause is DS12, which has 1 original row expanding to 15 reworked rows, and the engine simply cannot produce enough events.
Root cause analysis
Breakdown of the 51 misclassifications by failure pattern and the engine parameter that controls each risk.
| Risk | Frequency | Root cause | Tunable via |
|---|---|---|---|
| MODIFIED → UNMATCHED | 8/94 = 9% | No engine match found at all | ↑ residual match window |
| MODIFIED → MERGED (over-merging) | 6/94 = 6% | Merge pass absorbs 1:1 matches | ↑ merge similarity threshold |
| ADDED → UNMATCHED | 10/46 = 22% | Engine produces fewer rows than GT (DS12) | Not tunable (structural event-count mismatch) |
| ADDED → SPLIT | 3/46 = 7% | New rows misclassified as splits | ↑ split similarity threshold |
| UNCHANGED → MODIFIED | 3/6 = 50% | Minor timestamp / whitespace drift | ↑ unchanged time tolerance |
| DELETED → MODIFIED | 3/11 = 27% | Engine finds weak match for deleted row | ↑ text similarity floor |
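One plausible shape for these knobs, written as a hypothetical config. The names and default values below are illustrative; only the text floor (0.10, from TEXT_SIM_FLOOR in the next section) appears in this document:

```typescript
// Hypothetical tunables corresponding to the table above; names and defaults
// are assumptions, not the engine's actual configuration.
interface EngineTunables {
  residualMatchWindowSec: number; // widen to rescue MODIFIED → UNMATCHED
  mergeSimThreshold: number;      // raise to curb over-merging (MODIFIED → MERGED)
  splitSimThreshold: number;      // raise to stop ADDED rows becoming SPLIT children
  unchangedTimeTolSec: number;    // loosen so timestamp drift stays UNCHANGED
  textSimFloor: number;           // raise to stop weak matches rescuing DELETED rows
}

const defaults: EngineTunables = {
  residualMatchWindowSec: 30,
  mergeSimThreshold: 0.55,
  splitSimThreshold: 0.55,
  unchangedTimeTolSec: 0.2,
  textSimFloor: 0.10, // the only value this document actually states
};
```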
Similarity metric benchmarking
Computed against 6 datasets (DS6–DS11), 116 original rows, 1,754 total (original, reworked) pairs.
matchScore = w_time × timeSim + w_text × txtSim
timeSim = time-proximity score (saturates beyond a max time delta)
txtSim = Jaccard bigram similarity after NFKC + diacritics normalization
[blocked] if txtSim falls below a minimum text-floor threshold
Pairwise score coverage
| Metric | Value |
|---|---|
| Total (orig, reworked) pairs evaluated | 1,754 |
| Pairs blocked by TEXT_SIM_FLOOR (txtSim < 0.10) | 1,459 (83.2%) |
| Pairs with non-zero scores | 295 (16.8%) |
| Mean score — all pairs | 0.065 |
| Mean score — non-zero pairs only | 0.385 |
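A sketch of the scoring pipeline defined above. The weights and the time-saturation constant are assumptions; only the 0.10 floor, the bigram Jaccard, and the NFKC + diacritics normalization are stated in this document:

```typescript
// Sketch of matchScore = w_time × timeSim + w_text × txtSim, with the text floor gate.
const W_TIME = 0.3, W_TEXT = 0.7; // assumed weights
const MAX_TIME_DELTA = 10;        // seconds; assumed saturation point
const TEXT_SIM_FLOOR = 0.10;      // stated in the coverage table above

function normalize(s: string): string {
  // NFKC, then decompose and strip combining diacritics, then casefold.
  return s.normalize("NFKC").normalize("NFD").replace(/\p{M}/gu, "").toLowerCase();
}

function bigrams(s: string): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) out.add(s.slice(i, i + 2));
  return out;
}

function txtSim(a: string, b: string): number {
  const A = bigrams(normalize(a)), B = bigrams(normalize(b));
  if (A.size === 0 && B.size === 0) return 1;
  const inter = [...A].filter(g => B.has(g)).length;
  return inter / (A.size + B.size - inter); // Jaccard = |A∩B| / |A∪B|
}

function timeSim(deltaSec: number): number {
  // Linear decay that saturates to 0 beyond MAX_TIME_DELTA.
  return Math.max(0, 1 - Math.abs(deltaSec) / MAX_TIME_DELTA);
}

function matchScore(origText: string, rewkText: string, deltaSec: number): number {
  const t = txtSim(origText, rewkText);
  if (t < TEXT_SIM_FLOOR) return 0; // [blocked]: the floor gates the pair entirely
  return W_TIME * timeSim(deltaSec) + W_TEXT * t;
}
```

Note that in this reading the floor acts as a hard gate rather than a weighted penalty: a pair below it scores exactly zero, which is what produces the 83.2% blocked rate in the coverage table.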
Accepted match confidence distribution (64 matches)
| Confidence band | Count | Share |
|---|---|---|
| HIGH (score ≥ 0.70) | 38 | 59.4% |
| MED (0.40 ≤ score < 0.70) | 22 | 34.4% |
| LOW (0.20 ≤ score < 0.40) | 4 | 6.3% |
Decision margin above acceptance threshold
| Metric | Value |
|---|---|
| Mean margin above acceptance threshold | 0.522 |
| Borderline matches (margin < 0.05) | 1 (1.6%) |
| Flip risk: small score perturbation would reject | 1 (1.6%) |
Limitations
- 185 GT events is a small sample. Per-category numbers for UNCHANGED (6 events) and SPLIT (6 events) carry very wide confidence intervals; a single additional dataset could shift their F1 by 10+ points.
- Ground truth is derived, not hand-labeled. The derivation algorithm itself has tunable parameters that influence which events appear in GT. Different parameter choices would produce different GT labels and different precision numbers.