Config Parameters

Know exactly which flag to flip and why.

The default config works well for most transcripts. These parameters exist to handle specific annotation workflows: Arabic QA, positional-only comparison, metadata column exclusion, and structural detection control. Each section below shows the exact input/output difference a flag produces.

When to customize the config

Start with no config. Run a diff and inspect the results. Only reach for a config flag when you see a specific problem:

stripDiacritics
  Arabic transcripts where diacritic additions inflate the MODIFIED count.

simpleMode
  Pure content QA, where you know the annotator made no structural changes.

ignoreColNames
  Metadata columns (confidence score, category) differ between QA layers but aren't the comparison target.

positionalMode
  Debugging unexpected alignments, or processing very large uniform datasets.

enableSplits: false
  Project guidelines prohibit splits at this annotation layer.

enableInlineDiff: false
  Large batches where only statuses and scores are needed; suppress transcript diff computation for speed.

structuralTransforms
  Rows have ID prefixes, URLs, or phone formats that vary between layers but aren't part of the transcript content.

simpleMode

By default the engine runs an 8-pass alignment algorithm that matches rows by similarity across the full transcript, even if they moved positions. simpleMode disables this: row 0 is compared to row 0, row 1 to row 1, strictly by position.

Default (simpleMode: false): the engine detects that one long segment was split into two and labels it SPLIT.

Input

```json
{
  "original": [
    { "speaker": "Candidate", "words": "For new users we relied on content-based filtering. For new items we used metadata clustering to find similar items." }
  ],
  "reworked": [
    { "speaker": "Candidate", "words": "For new users, we relied on content-based filtering." },
    { "speaker": "Candidate", "words": "For new items, we used metadata clustering to find similar items." }
  ]
}
```

Config

/* config: {} (default) */

API Result

```json
{
  "results": [
    {
      "status": "SPLIT",
      "notes": "split into 2 rows",
      "originalRow": { "words": "For new users we relied on content-based filtering..." },
      "reworkedRows": [
        { "words": "For new users, we relied on content-based filtering." },
        { "words": "For new items, we used metadata clustering..." }
      ]
    }
  ]
}
```

With simpleMode: true: the engine compares row 0 to row 0 (finds a mismatch → MODIFIED) and sees an extra row in reworked (→ ADDED). The structural intent is lost, but every character change is visible.

```json
{
  "results": [
    {
      "status": "MODIFIED",
      "notes": "words changed",
      "snapData": ["Candidate", "For new users we relied on content-based filtering..."],
      "currData": ["Candidate", "For new users, we relied on content-based filtering."],
      "transcriptDiff": [
        { "type": "EQUAL",  "text": "For new users" },
        { "type": "INSERT", "text": "," },
        { "type": "EQUAL",  "text": " we relied on content-based filtering." },
        { "type": "DELETE", "text": " For new items we used metadata clustering..." }
      ]
    },
    {
      "status": "ADDED",
      "notes": "new row in reworked",
      "currData": ["Candidate", "For new items, we used metadata clustering..."]
    }
  ]
}
```

Use when you're confident the annotator made zero structural changes — only text corrections and punctuation. Also useful when you want raw character diffs without any structural interpretation.

simpleMode is faster on very large datasets because it skips alignment. The trade-off is false MODIFIED/ADDED/DELETED counts where SPLIT/MERGED would be more accurate.
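The positional pairing simpleMode performs can be sketched in a few lines. This is a hypothetical helper for intuition, not the engine's implementation; the `words` column name matches the example above.

```python
from itertools import zip_longest


def simple_mode_diff(original, reworked):
    """Compare row N to row N strictly by position -- no alignment passes."""
    results = []
    for orig, curr in zip_longest(original, reworked):
        if orig is None:
            # Reworked is longer: the extra row is ADDED.
            results.append({"status": "ADDED", "currData": curr})
        elif curr is None:
            # Original is longer: the extra row is DELETED.
            results.append({"status": "DELETED", "snapData": orig})
        elif orig["words"] != curr["words"]:
            results.append({"status": "MODIFIED", "snapData": orig, "currData": curr})
        else:
            results.append({"status": "UNCHANGED"})
    return results
```

Running this on the split example above yields MODIFIED for row 0 and ADDED for row 1, mirroring the API result.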

enableSplits / enableMerges

Finer-grained alternatives to simpleMode. Instead of disabling all structural detection, you can disable only one type.

SPLIT

enableSplits: false — SPLIT candidates are instead emitted as MODIFIED (truncated match) + ADDED (leftover rows). Use when your annotation guidelines at this layer prohibit splits, so surfacing them as individual changes is more actionable.

MERGED

enableMerges: false — MERGE candidates become MODIFIED (first original row) + DELETED (absorbed originals). Use when merges are not permitted at this layer and you want each deleted row flagged explicitly.

```json
{
  "config": {
    "enableSplits": false,
    "enableMerges": true
  }
}
```

These flags are most useful in multi-layer QA pipelines where each layer has its own permitted operations. Disabling an operation you don't expect to see makes unexpected structural changes surface as distinct ADDED/DELETED flags instead of being silently grouped.

stripDiacritics

Before comparison, the engine normalises Arabic and accented characters by stripping diacritical marks. For Arabic this includes harakat (short vowels: fathah, dammah, kasrah), tanwin, shadda, sukun, and hamza variants (U+064B–U+065F, U+0670). For Latin text it strips combining accent characters (U+0300–U+036F). This flag is ON by default.
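The stripping step can be approximated with the Unicode ranges listed above. A minimal sketch, assuming NFD decomposition is applied first; the engine's exact normalization (including hamza handling) may differ:

```python
import re
import unicodedata

# Ranges named in the docs: Arabic harakat, tanwin, shadda, sukun and related
# marks (U+064B-U+065F), superscript alef (U+0670), and Latin combining
# accents (U+0300-U+036F).
_MARKS = re.compile(r"[\u064B-\u065F\u0670\u0300-\u036F]")


def strip_diacritics(text: str) -> str:
    # NFD decomposes composed characters so the combining mark becomes a
    # separate code point inside the stripped ranges (e.g. é -> e + U+0301,
    # and أ -> ا + U+0654 hamza above).
    return _MARKS.sub("", unicodedata.normalize("NFD", text))
```

Because NFD decomposes hamza-seated letters into a bare alef plus a combining mark in the stripped range, hamza-variant differences also disappear, which is consistent with the UNCHANGED example below.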

Common Arabic QA scenario: an annotator normalises the text per written Arabic style guides (adding harakat, normalising hamza). With the default (stripDiacritics: true), only lexical and segmentation differences are counted. Override to false when diacritical accuracy is itself a QA criterion.

Default behavior (stripDiacritics: true — no config needed): مرحبا → مرحباً is UNCHANGED because diacritical marks are stripped before comparison, making the stripped forms identical.

Input

```json
{
  "original": [{ "speaker": "المذيع", "transcript": "مرحبا بكم في نشرة الاخبار" }],
  "reworked": [{ "speaker": "المذيع", "transcript": "مرحباً بكم في نشرة الأخبار" }]
}
```

Config

/* config: {} (default — stripDiacritics: true) */

API Result

```json
{ "status": "UNCHANGED", "notes": "high similarity match (diacritics stripped)" }
```

With stripDiacritics: false (override): مرحبا → مرحباً is MODIFIED because the ً mark is no longer stripped — raw character differences are flagged.

```json
{ "config": { "stripDiacritics": false } }
```

```json
{ "status": "MODIFIED", "notes": "transcript changed",
  "transcriptDiff": [
    { "type": "EQUAL",  "text": "مرحب" },
    { "type": "DELETE", "text": "ا" },
    { "type": "INSERT", "text": "اً" },
    { "type": "EQUAL",  "text": " بكم في نشرة ال" },
    { "type": "DELETE", "text": "ا" },
    { "type": "INSERT", "text": "أ" },
    { "type": "EQUAL",  "text": "خبار" }
  ]
}
```

The default (true) works for most Arabic transcript QA. Override with stripDiacritics: false only when you are explicitly verifying that an annotator correctly added or removed diacritical marks — i.e., when diacritical precision is a tracked quality criterion.

positionalMode

Skips the similarity-based alignment algorithm entirely. Each original row at index N is compared to the reworked row at index N. If the arrays are different lengths, extra rows are ADDED or DELETED.

Default: if an annotator corrected a sentence and it moved from position 4 to position 6, the engine will still match them (MODIFIED). With positionalMode, row 4 in original is compared to row 4 in reworked — which may be a completely different sentence — producing a confusing MODIFIED with a large diff.

positionalMode produces misleading results when rows have been reordered. Only use it when you can guarantee the annotator did not add, remove, or reorder any rows.

Use for debugging: run positionalMode and compare against the default results to see which rows the alignment matched. Also useful for very uniform datasets (e.g., word-by-word alignment ground truth) where positional matching is correct by construction.

```json
{ "config": { "positionalMode": true } }
```

ignoreColNames

An array of column names to exclude from MODIFIED detection. A row is only MODIFIED if a non-ignored column changed. The ignored columns are still included in the response (snapData / currData) but do not trigger MODIFIED status.

Scenario: your data has a confidence column set by the annotation tool. QA Layer 1 might record confidence: 0.88 while QA Layer 2 records confidence: 0.91 for the same utterance. Without ignoreColNames, every such row is MODIFIED even if the transcript is identical. With ignoreColNames: ["confidence"], those rows are UNCHANGED as expected.

Without ignoreColNames

Input

```json
{
  "original": [
    { "transcript": "The patient reports mild chest pain.", "speaker": "Doctor", "confidence": 0.88, "category": "symptom" }
  ],
  "reworked": [
    { "transcript": "The patient reports mild chest pain.", "speaker": "Doctor", "confidence": 0.94, "category": "complaint" }
  ]
}
```

Config

/* config: {} */

API Result

```json
{ "status": "MODIFIED", "notes": "confidence, category changed" }
```

With ignoreColNames

```json
// request: { "config": { "ignoreColNames": ["confidence", "category"] } }
{ "status": "UNCHANGED", "notes": "exact match (after ignoring confidence, category)" }
```

Use whenever your schema includes metadata columns that change independently of transcript content: confidence scores, reviewer IDs, batch numbers, internal category tags, auto-generated timestamps.
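The column-filtering rule can be sketched as follows. This is a hypothetical helper, not the API's implementation; the notes string merely imitates the response format shown above.

```python
def row_status(orig_row, curr_row, ignore_col_names=()):
    """A row is MODIFIED only if some non-ignored column differs."""
    ignored = set(ignore_col_names)
    changed = sorted(
        col
        for col in set(orig_row) | set(curr_row)  # consider columns from both rows
        if col not in ignored and orig_row.get(col) != curr_row.get(col)
    )
    if changed:
        return {"status": "MODIFIED", "notes": ", ".join(changed) + " changed"}
    return {"status": "UNCHANGED", "notes": "exact match"}
```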

enableInlineDiff

Controls whether the engine computes a character-level inline diff for MODIFIED rows. When enabled (default), each MODIFIED row in the response includes a transcriptDiff array that you can use to render highlighted changes in your review UI. Disabling it skips the diff computation entirely.

With enableInlineDiff: false, MODIFIED rows still appear in results (status and notes are unchanged), but the transcriptDiff field is absent. Use this when you only need status counts and scores and want to reduce response payload size.

```json
{ "config": { "enableInlineDiff": false } }
```

Each transcriptDiff segment has the shape { type: "EQUAL" | "INSERT" | "DELETE", text: string }. Reconstruct the original by joining all non-INSERT spans; reconstruct the reworked by joining all non-DELETE spans. Note: type values are UPPERCASE.

```json
// transcriptDiff format — type is UPPERCASE, field is "text"
[
  { "type": "EQUAL",  "text": "Hello " },
  { "type": "DELETE", "text": "world" },
  { "type": "INSERT", "text": "there" }
]
```
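The reconstruction rule translates directly to code. A minimal sketch:

```python
def reconstruct(diff_segments):
    """Rebuild both sides of a transcriptDiff array.

    Original = every span that was not inserted;
    reworked = every span that was not deleted.
    """
    original = "".join(s["text"] for s in diff_segments if s["type"] != "INSERT")
    reworked = "".join(s["text"] for s in diff_segments if s["type"] != "DELETE")
    return original, reworked
```

For the example above this returns ("Hello world", "Hello there").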

Disable (enableInlineDiff: false) when processing large batches where you only need CER/WER/SER scores and status counts, not the per-character diff. This reduces both server CPU and network payload. Re-enable for interactive review UIs where editors need to see exactly what changed.

The diff uses LCS (Longest Common Subsequence). For very long segments (combined original + reworked length > CHAR_DIFF_LIMIT), it automatically falls back from character-level to word-level tokens — still returned as the same array format.
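The character-to-word fallback can be sketched with Python's difflib, a Ratcliff/Obershelp-style matcher standing in for the engine's LCS diff. The CHAR_DIFF_LIMIT default and the segment shape come from this page; everything else is illustrative.

```python
import re
from difflib import SequenceMatcher

CHAR_DIFF_LIMIT = 1500  # default from the threshold table


def inline_diff(a: str, b: str):
    """Emit transcriptDiff-shaped segments, falling back to word tokens
    when the combined input exceeds CHAR_DIFF_LIMIT."""
    if len(a) + len(b) > CHAR_DIFF_LIMIT:
        # Word-level tokens; whitespace stays attached to its word so that
        # concatenating segment texts reconstructs the input exactly.
        a_tok = re.findall(r"\S+\s*|\s+", a)
        b_tok = re.findall(r"\S+\s*|\s+", b)
    else:
        a_tok, b_tok = list(a), list(b)
    segments = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a_tok, b_tok, autojunk=False).get_opcodes():
        if op == "equal":
            segments.append({"type": "EQUAL", "text": "".join(a_tok[i1:i2])})
        else:
            if i2 > i1:
                segments.append({"type": "DELETE", "text": "".join(a_tok[i1:i2])})
            if j2 > j1:
                segments.append({"type": "INSERT", "text": "".join(b_tok[j1:j2])})
    return segments
```

Either way, joining the non-INSERT spans reconstructs the original and joining the non-DELETE spans reconstructs the reworked text, so consumers can treat both granularities identically.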

structuralTransforms

An array of find/replace rules applied to the transcript text BEFORE the similarity scoring algorithm runs. This lets the engine align rows that differ only in predictable, non-content prefixes or formats (e.g., ID tags, URL prefixes, phone number formats).

Each rule: { find: string, replace: string, isRegex: boolean }. Plain string rules do a literal find-replace. Regex rules (isRegex: true) support standard JavaScript regex syntax (case-insensitive). Up to 20 rules per request.

```json
{
  "config": {
    "structuralTransforms": [
      { "find": "^ID-\\d+:\\s*", "replace": "", "isRegex": true },
      { "find": "https?://[^\\s]+", "replace": "[URL]", "isRegex": true }
    ]
  }
}
```

Transforms apply to SIMILARITY SCORING only — not to the cell data returned in snapData / currData. A row where only the ID prefix changed ("ID-001: Hello" vs "ID-002: Hello") will still show as MODIFIED because the raw transcript content differs. The transforms ensure the rows are correctly ALIGNED (not misidentified as ADDED+DELETED), but the column diff can still flag the prefix change.

Use when your original and reworked data share a common schema but rows include auto-generated IDs, batch prefixes, or formatting that the annotator changed as part of their work. Without transforms, the alignment algorithm treats rows with different prefixes as entirely different — potentially producing false ADDED/DELETED pairs instead of MODIFIED.
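Applying the rules before scoring can be sketched like this. A hypothetical helper: literal rules use plain replacement, regex rules are case-insensitive, per the description above.

```python
import re


def apply_transforms(text, transforms):
    """Apply find/replace rules to text before similarity scoring."""
    for rule in transforms[:20]:  # up to 20 rules per request
        if rule.get("isRegex"):
            # Regex rules: standard syntax, case-insensitive per the docs.
            text = re.sub(rule["find"], rule["replace"], text, flags=re.IGNORECASE)
        else:
            # Plain string rules: literal find-replace.
            text = text.replace(rule["find"], rule["replace"])
    return text
```

With the example rules above, "ID-001: Hello" and "ID-002: Hello" both normalize to "Hello" and therefore align as the same row.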

Expert similarity & timing thresholds

These eight numbers control the matching algorithm's sensitivity. The defaults are tuned for standard-length transcription segments (5–30 seconds, 10–60 words). Adjust them only when you've looked at the raw similarity scores and know the default thresholds produce wrong matches.

SIM_CONFIDENT · number (0–1) · default 0.70
  Two rows this similar or closer are a definite match, committed in the high-similarity pass.
  Raise to require very close text matches before committing. Lower if you have very short utterances that can't achieve high similarity.

SIM_MODERATE · number (0–1) · default 0.40
  Plausible match, accepted when timing also confirms.
  Lower if annotators rewrite sentences significantly while keeping the same meaning.

SIM_WEAK · number (0–1) · default 0.20
  Tentative match, only accepted with very strong timing evidence.
  Lower to 0.10–0.15 for very short segments (single words, disfluencies) that can't achieve 0.20 similarity.

TIME_EXACT_TOL · number (seconds) · default 0.05
  Timestamps ≤ this far apart count as an exact match.
  Increase to 0.5–1.0 if your annotation tool rounds timestamps to whole seconds.

TIME_FUZZY_TOL · number (seconds) · default 2.5
  Timestamps ≤ this far apart count as a fuzzy match.
  Increase when annotators shift segment boundaries significantly.

SPLIT_COMBINED_MIN · number (0–1) · default 0.35
  Minimum combined text score to accept a SPLIT detection.
  Raise to reduce false splits. Lower if your content has very short target segments.

MERGE_COMBINED_MIN · number (0–1) · default 0.65
  Minimum combined text score to accept a MERGE detection.
  Raise to reduce false merges. Lower for datasets with many legitimate merges.

CHAR_DIFF_LIMIT · integer (100–50000) · default 1500
  Maximum combined character length before falling back to word-level diff.
  Increase for batches with very long segments (300-word utterances). Decrease to force word-level diffs for all segments and save CPU on massive batches.
```json
{
  "config": {
    "SIM_WEAK": 0.15,
    "TIME_EXACT_TOL": 1.0,
    "SPLIT_COMBINED_MIN": 0.70
  }
}
```
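How the three similarity tiers gate a match can be sketched as a simple classifier. This is a hypothetical helper; the real engine also weighs timing evidence before accepting moderate and weak matches.

```python
def classify_similarity(score, config=None):
    """Map a similarity score (0-1) to its matching tier using the defaults,
    with per-request overrides merged in."""
    cfg = {"SIM_CONFIDENT": 0.70, "SIM_MODERATE": 0.40, "SIM_WEAK": 0.20}
    cfg.update(config or {})
    if score >= cfg["SIM_CONFIDENT"]:
        return "confident"  # committed in the high-similarity pass
    if score >= cfg["SIM_MODERATE"]:
        return "moderate"   # accepted when timing also confirms
    if score >= cfg["SIM_WEAK"]:
        return "weak"       # needs very strong timing evidence
    return "no-match"
```

Lowering SIM_WEAK to 0.15, as in the request above, lets a 0.16-similarity pair remain a weak candidate instead of being discarded outright.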
Config Parameters Guide · Structural Diff API · Built by Mohamed Yaakoubi