
Diff Statuses

One input/output example per status. No ambiguity.

The engine assigns exactly one status to each original row after the 8-pass alignment. SPLIT and MERGED are the only cases where one input row maps to multiple output rows (or vice versa). This page shows the minimal payload that reliably triggers each status, the expected response structure, and the workflow context where each status appears in real annotation pipelines.

Engine pipeline note

The 8-pass algorithm processes every original row against every reworked row and assigns the best status based on text similarity, timestamp proximity, and structural checks. The passes run in order: exact match → high-similarity → split detection → merge detection → weak matches → unmatched leftovers. This means a row can only get one status — the engine commits in pass order and skips already-matched rows.
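The commit-in-pass-order behavior can be sketched roughly like this (a minimal illustration; the pass objects and matcher interface here are hypothetical, not the engine's real internals):

```javascript
// Simplified sketch of pass-ordered matching: once an earlier pass claims a
// row, later passes never reassign it. Pass names/matchers are illustrative.
function assignStatuses(originalRows, passes) {
  const statusById = new Map();
  for (const pass of passes) {
    for (const row of originalRows) {
      if (statusById.has(row.id)) continue; // committed in an earlier pass
      const status = pass.match(row);       // returns a status string or null
      if (status) statusById.set(row.id, status);
    }
  }
  return statusById;
}
```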

When iterating results to build a summary count, skip rows where notes contains "Source row". These are MERGED artifacts (one per absorbed original) and should not be double-counted.

UNCHANGED

Both the original and reworked arrays contain a row with identical content across all mapped columns (after whitespace normalization). The engine matched them in the first pass (exact match) or high-similarity pass.

Original

  "original": [{ "speaker": "Alice", "transcript": "Good morning everyone." }],

Reworked

  "reworked": [{ "speaker": "Alice", "transcript": "Good morning everyone." }]

API Result

json
{
  "status": "UNCHANGED",
  "notes": "exact match",
  "snapData": ["Alice", "Good morning everyone."],
  "currData": ["Alice", "Good morning everyone."]
}

When you see this

Every row that the annotator accepted without any change. In a typical annotation QA batch, UNCHANGED represents 20–60% of rows depending on how heavily the annotator edited.

Request note

The transcript text and all other mapped columns must be identical. Case and punctuation are significant.

Response note

The result row has status: "UNCHANGED". Both snapData (original cell values) and currData (reworked cell values) are present and identical. The notes field is "exact match" or "high similarity match".

Workflow context

A very high UNCHANGED rate (>90%) may indicate the annotator did not fully review the transcript. A very low rate (<10%) may indicate the AI baseline was poor quality or the annotator over-edited.
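A batch-level sanity check built on these thresholds might look like the following (a hypothetical helper, not part of the API; the 90% and 10% cutoffs are the ones quoted above):

```javascript
// Flag batches whose UNCHANGED rate suggests under-review (>90%) or a poor
// baseline / over-editing (<10%). MERGED "Source row" trace entries are
// skipped so they don't inflate the denominator.
function reviewCoverage(results) {
  const primary = results.filter(
    r => !(r.status === "MERGED" && r.notes?.includes("Source row"))
  );
  const rate =
    primary.filter(r => r.status === "UNCHANGED").length / primary.length;
  if (rate > 0.9) return { rate, flag: "possibly under-reviewed" };
  if (rate < 0.1) return { rate, flag: "poor baseline or over-edited" };
  return { rate, flag: null };
}
```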

MODIFIED

The engine matched a row from original to a row in reworked (by similarity and/or position), but at least one column value changed. This is the most common status.

Transcript diff example

Original

[{ "speaker": "Doctor", "transcript": "I've been having headaches for the past two weeks" }]

Reworked

[{ "speaker": "Doctor", "transcript": "I've been having headaches for the past 2 weeks" }]

API Result

json
{
  "status": "MODIFIED",
  "notes": "transcript changed",
  "transcriptDiff": [
    { "type": "equal",  "value": "I've been having headaches for the past " },
    { "type": "delete", "value": "two" },
    { "type": "insert", "value": "2" },
    { "type": "equal",  "value": " weeks" }
  ],
  "snapData": ["Doctor", "I've been having headaches for the past two weeks"],
  "currData": ["Doctor", "I've been having headaches for the past 2 weeks"]
}

Live transcriptDiff rendering

Example rendering of transcriptDiff tokens. In the live view, deleted text is struck through and inserted text highlighted; in this plain-text reconstruction, deletions are marked ~~like this~~ and insertions [like this]:

Thanks, Sarah. ~~g~~[G]lad to be here ~~I ha~~[I've] been looking forward to this conversation for weeks.

When you see this

Text corrections, punctuation edits, number formatting, speaker name corrections, timestamp edits, or emotion label changes. Anything short of adding/removing segments.

Request note

The row must exist in both versions. The engine needs sufficient similarity to make a confident match before it checks what changed.

Response note

When the transcript column changed, the response includes a transcriptDiff array of character-level diff tokens. Each token has a type ("equal", "delete", or "insert", lowercase) and a value field holding the characters. The field is absent when enableInlineDiff: false is set in the request config.

Workflow context

In AI annotation QA, MODIFIED rows are the primary review target. Each one represents a correction the annotator made. CER and WER in the scores object are computed from these changes.
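As a rough illustration of how character-level error can be derived from the diff tokens (a sketch only; the API's actual CER computation in the scores object may use a stricter edit-distance formula):

```javascript
// Approximate character error from transcriptDiff tokens: changed characters
// over original length. Illustrative only; the real scores object may differ.
function charErrorFromDiff(tokens) {
  let changed = 0;
  let originalLen = 0;
  for (const t of tokens) {
    if (t.type === "equal") {
      originalLen += t.value.length;
    } else if (t.type === "delete") {
      changed += t.value.length;
      originalLen += t.value.length;
    } else if (t.type === "insert") {
      changed += t.value.length;
    }
  }
  return originalLen > 0 ? changed / originalLen : 0;
}
```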

ADDED

A row in reworked has no match in original. The engine exhausted all matching passes and could not find a source row.

json
{
  "original": [{ "speaker": "Agent", "transcript": "Let me pull up your account." }],
  "reworked": [
    { "speaker": "Agent",    "transcript": "Let me pull up your account." },
    { "speaker": "Customer", "transcript": "Thank you." }
  ]
}
json
{
  "results": [
    { "status": "UNCHANGED", "notes": "exact match", ... },
    {
      "status": "ADDED",
      "notes": "new row in reworked",
      "currData": ["Customer", "Thank you."]
    }
  ]
}

When you see this

The annotator added a segment that the AI missed. Common causes: AI failed to detect a quiet segment, cropped audio, code-switch not detected, or the annotator split an AI row (one of the split parts may surface as ADDED if the engine doesn't detect the split).

Request note

Set enableSplits: false if you want all splits to surface as MODIFIED + ADDED rather than the structural SPLIT label.

Response note

Only currData is present (no snapData because there is no original row). The notes field is "new row in reworked".

Workflow context

The ADDED count in your results tells you how many segments the AI missed. Combined with the DELETED count, you get the baseline's segmentation quality.
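A minimal helper to pull those two counts (hypothetical; it only assumes the status names documented on this page):

```javascript
// ADDED = segments the AI baseline missed; DELETED = segments it produced
// that the annotator removed. Together they summarize the baseline's
// segmentation quality.
function segmentationGaps(results) {
  return {
    missed: results.filter(r => r.status === "ADDED").length,
    spurious: results.filter(r => r.status === "DELETED").length,
  };
}
```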

DELETED

A row in original has no match in reworked. The annotator removed it entirely.

json
{
  "original": [
    { "speaker": "Host", "transcript": "Welcome to the show." },
    { "speaker": "[noise]", "transcript": "[background music fades]" }
  ],
  "reworked": [
    { "speaker": "Host", "transcript": "Welcome to the show." }
  ]
}
json
{
  "results": [
    { "status": "UNCHANGED", "notes": "exact match", ... },
    {
      "status": "DELETED",
      "notes": "row removed from reworked",
      "snapData": ["[noise]", "[background music fades]"]
    }
  ]
}

When you see this

The AI transcribed noise or silence as speech, produced a false utterance, created a duplicate segment at a boundary, or the segment was genuinely empty.

Request note

A DELETED row means the annotator made an explicit removal decision — this differs from a MODIFIED row where only the content changed.

Response note

Only snapData is present (no currData). The notes field is "row removed from reworked".

Workflow context

Unexpected DELETEDs in a review pass indicate the reviewer is more aggressively cleaning than expected, or the AI baseline has quality issues. Track the DELETED/original ratio across batches.

SPLIT

One original row maps to two or more consecutive reworked rows whose combined transcript reconstructs the original. The engine validates that the combined text similarity exceeds SPLIT_COMBINED_MIN.

One row → two rows

Original

[{
  "speaker": "Candidate",
  "transcript": "For new users we relied on content-based filtering. For new items we used metadata clustering to find similar items with existing ratings."
}]

Reworked

[
  { "speaker": "Candidate", "transcript": "For new users, we relied on content-based filtering." },
  { "speaker": "Candidate", "transcript": "For new items, we used metadata clustering to find similar items with existing ratings." }
]

API Result

json
{
  "status": "SPLIT",
  "notes": "split into 2 rows",
  "originalRow": {
    "transcript": "For new users we relied on content-based filtering. For new items..."
  },
  "reworkedRows": [
    { "transcript": "For new users, we relied on content-based filtering." },
    { "transcript": "For new items, we used metadata clustering..." }
  ]
}

When you see this

The annotator determined the AI segment was too long and contained two distinct utterances or speaker turns, and split it at a natural boundary. Per annotation guidelines, segments should be split whenever a long utterance contains a meaningful pause or a thought boundary.

Request note

The original row must be sufficiently similar to the combined text of the reworked rows. Timestamp plausibility is also checked if timestamps are present.

Response note

The original row entry has status: "SPLIT". The reworked rows it maps to are enumerated in the result. SER (Segmentation Error Rate) is incremented by this row.

Workflow context

Splits are the most significant structural change between annotation layers. A high SPLIT count in Layer 1→Layer 2 indicates Layer 1 was under-segmenting. If this is expected, it's informational. If unexpected, it warrants review.

MERGED

Two or more original rows map to one reworked row whose transcript is close to the combined text of the originals.

Two rows → one merged row (+ two source rows in response)

Original

[
  { "speaker": "Elena", "transcript": "That resonates with our work at the localization lab." },
  { "speaker": "Elena", "transcript": "Standard Arabic models fail on Tunisian input." }
]

Reworked

[{
  "speaker": "Elena",
  "transcript": "That resonates with our work at the localization lab — standard Arabic models fail on Tunisian input."
}]

API Result

json
{
  "results": [
    {
      "status": "MERGED",
      "notes": "merged from 2 rows",
      "reworkedRow": {
        "transcript": "That resonates with our work at the localization lab — standard Arabic models fail on Tunisian input."
      }
    },
    {
      "status": "MERGED",
      "notes": "Source row 1/2 · merged into reworked row 0",
      "snapData": ["Elena", "That resonates with our work at the localization lab."]
    },
    {
      "status": "MERGED",
      "notes": "Source row 2/2 · merged into reworked row 0",
      "snapData": ["Elena", "Standard Arabic models fail on Tunisian input."]
    }
  ]
}

When you see this

The annotator determined consecutive AI segments should be joined into one. Common when the AI over-segmented at breath pauses or punctuation boundaries. Also occurs when two speakers' short utterances are merged into one attributed segment.

Request note

The absorbed original rows must have sufficient combined text similarity to the merged reworked row. The absorbed rows are also present in the result as "Source row" entries with status MERGED.

Response note

The primary merged result row (the reworked row) has status: "MERGED". Each absorbed original row also appears in the results with notes: "Source row 1/N · merged into reworked row X". Filter these out when counting statuses.

Workflow context

A high MERGED count indicates the AI over-segmented. Combined with SPLIT, the ratio tells you whether the AI tends toward over- or under-segmentation relative to your annotation standard.

transcriptDiff: inline character-level diff

For MODIFIED rows where the transcript column changed, the response includes a transcriptDiff array. Each token in the array has a type ("equal", "delete", or "insert") and a value field (the characters). Tokens of the same type may be merged: a single-word swap appears as one "delete" + one "insert" token, not character by character. Note: types are lowercase, as in the examples on this page. This field is absent when enableInlineDiff: false is set.

"equal" — characters present in both versions (render as plain text)
"delete" — characters present only in original (render with strikethrough or red highlight)
"insert" — characters present only in reworked (render with green highlight)
json
"transcriptDiff": [
  { "type": "equal",  "value": "I've been having headaches for the past " },
  { "type": "delete", "value": "two" },
  { "type": "insert", "value": "2" },
  { "type": "equal",  "value": " weeks" }
]
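The rendering hints above can be turned into a small HTML formatter (the tag choices here are one option, not mandated by the API):

```javascript
// Render transcriptDiff tokens as HTML: <del> for deletions, <ins> for
// insertions, escaped plain text for unchanged runs.
function renderDiff(tokens) {
  const esc = s =>
    s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
  return tokens
    .map(t => {
      if (t.type === "delete") return `<del>${esc(t.value)}</del>`;
      if (t.type === "insert") return `<ins>${esc(t.value)}</ins>`;
      return esc(t.value);
    })
    .join("");
}
```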

Source rows (MERGED artifact)

When a MERGE is detected, the results array includes both the merged result row and one "Source row" entry per absorbed original. Source rows have status: "MERGED" and notes starting with "Source row 1/N · merged into reworked row X". They exist so you can trace exactly which original rows contributed to the merge.

js
// Skip source rows when building a summary count
const primaryResults = results.filter(
  r => !(r.status === "MERGED" && r.notes?.includes("Source row"))
)
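To trace merges end to end, the note text can be parsed back into a source-to-target map (the regex below assumes exactly the note shape shown on this page):

```javascript
// Group MERGED source rows by the reworked row that absorbed them, parsing
// notes of the form "Source row i/N · merged into reworked row X".
function groupMergeSources(results) {
  const groups = new Map();
  for (const r of results) {
    if (r.status !== "MERGED" || !r.notes) continue;
    const m = r.notes.match(
      /Source row (\d+)\/(\d+) · merged into reworked row (\d+)/
    );
    if (!m) continue; // the primary merged row carries no "Source row" note
    const target = Number(m[3]);
    if (!groups.has(target)) groups.set(target, []);
    groups.get(target).push(r.snapData);
  }
  return groups;
}
```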

Counting statuses correctly

The results array length equals the number of original rows, plus one entry per ADDED reworked row, plus one primary entry per merge (the absorbed originals surface as MERGED "Source row" entries, one per original row). Use this table to build correct summary counts:

| Status    | Count strategy | Note |
| --------- | -------------- | ---- |
| UNCHANGED | Count all | |
| MODIFIED  | Count all | |
| ADDED     | Count all | |
| DELETED   | Count all | |
| SPLIT     | Count all | One entry per original row that was split, regardless of how many reworked rows it produced |
| MERGED    | Count only rows where notes does NOT contain "Source row" | Source rows are trace entries for the absorbed originals; skip them in counts |
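Expressed as code, the table reduces to a single filter (a hypothetical helper consistent with the rules above):

```javascript
// Count statuses per the table: every result counts once, except MERGED
// "Source row" trace entries, which are skipped.
function summarizeStatuses(results) {
  const counts = {};
  for (const r of results) {
    if (r.status === "MERGED" && r.notes?.includes("Source row")) continue;
    counts[r.status] = (counts[r.status] ?? 0) + 1;
  }
  return counts;
}
```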
Diff Statuses Guide · Structural Diff API · Built by Mohamed Yaakoubi