Demo Walkthrough
A real transcript. A real post-edit. A real diff.
This walkthrough uses an actual podcast transcript with non-standard column names to demonstrate the full integration loop: adapting your data format, calling the API, and interpreting every field in the response. The use case is a QA coordinator auditing an annotator's post-edit work before accepting a batch.
The scenario
AI annotation projects for transcription and translation typically have multiple QA layers. Each layer is a handover point where a human reviews or edits the previous layer's output. The coordinator needs an objective record of what changed at each handover.
Layer 0 — AI output
The platform exports an AI-generated transcript. This is your original array.
Layer 1 — Annotator edit
An annotator post-edits the transcript per the project's style guide (punctuation, numeral formatting, verbatim corrections, speaker IDs, timestamps). This is your reworked array.
Layer 2 — QA coordinator
The coordinator calls POST /v1/diff with the two arrays. The diff report is attached to the batch before handoff to the client.
In this demo, the original is a 9-row podcast transcript about machine translation (EP101). The reworked version reflects typical annotator corrections: punctuation standardisation, numeral formatting, and one structural split (a too-long segment), plus one structural merge (two short consecutive statements by the same speaker).
The data
The demo transcript uses the column names exported by a typical annotation platform. These do not match the API's standard field names (transcript, speaker, start_time, end_time, etc.):
Platform column API field
─────────────────────────────
Talker → speaker
BeginTime → start_time
FinishTime → end_time
Utterance → transcript
NoiseTag → non_speech_events
Mood → emotion
Show (passed through unchanged)
Subject (passed through unchanged)
Adapting column names
One-time key remapping before the API call. You only need to map fields the engine uses; the rest pass through automatically.
JavaScript
const KEY_MAP = {
  Talker: 'speaker',
  BeginTime: 'start_time',
  FinishTime: 'end_time',
  Utterance: 'transcript',
  NoiseTag: 'non_speech_events',
  Mood: 'emotion',
}

const adapt = row =>
  Object.fromEntries(
    Object.entries(row).map(([k, v]) => [KEY_MAP[k] ?? k, v])
  )

const originalAdapted = originalData.map(adapt)
const reworkedAdapted = reworkedData.map(adapt)
Python
KEY_MAP = {
    "Talker": "speaker",
    "BeginTime": "start_time",
    "FinishTime": "end_time",
    "Utterance": "transcript",
    "NoiseTag": "non_speech_events",
    "Mood": "emotion",
}

def adapt(row):
    return {KEY_MAP.get(k, k): v for k, v in row.items()}

original_adapted = [adapt(r) for r in original_data]
reworked_adapted = [adapt(r) for r in reworked_data]
The request
After adapting column names, the full POST request body looks like this (truncated for readability):
Original array (9 rows — AI output)
[
{ "speaker": "Sarah Mitchell", "start_time": 0.00, "end_time": 4.20, "transcript": "Welcome to The Language Lab, the podcast where we break down how AI is changing the way we communicate.", "non_speech_events": "[intro jingle]", "emotion": "warm" },
{ "speaker": "Sarah Mitchell", "start_time": 4.20, "end_time": 8.00, "transcript": "Today we have two fantastic guests joining us to talk about machine translation and quality assurance.", "non_speech_events": "", "emotion": "enthusiastic" },
{ "speaker": "James Park", "start_time": 8.00, "end_time": 11.50, "transcript": "Thanks Sarah glad to be here I have been looking forward to this conversation for weeks.", "non_speech_events": "", "emotion": "friendly" },
{ "speaker": "Elena Rossi", "start_time": 11.50, "end_time": 15.00, "transcript": "Same here this is such an important topic right now especially with how fast the field is evolving.", "non_speech_events": "", "emotion": "engaged" },
{ "speaker": "Sarah Mitchell", "start_time": 15.00, "end_time": 19.80, "transcript": "James lets start with you. Your team recently published a paper on neural machine translation for low resource languages.", "non_speech_events": "", "emotion": "curious" },
{ "speaker": "James Park", "start_time": 19.80, "end_time": 26.50, "transcript": "Yes so our main finding was that back translation combined with careful data augmentation can boost BLEU scores by up to twelve points for languages with under fifty thousand parallel sentences.", "non_speech_events": "", "emotion": "analytical" },
{ "speaker": "James Park", "start_time": 26.50, "end_time": 30.00, "transcript": "The trick is selecting the right seed data and not just throwing everything at the model.", "non_speech_events": "", "emotion": "technical" },
{ "speaker": "Elena Rossi", "start_time": 30.00, "end_time": 34.50, "transcript": "That resonates with our work at the localization lab where we focus on Arabic dialect adaptation.", "non_speech_events": "", "emotion": "thoughtful" },
{ "speaker": "Elena Rossi", "start_time": 34.50, "end_time": 39.00, "transcript": "Standard Arabic models completely fail when you feed them Tunisian or Moroccan dialect input.", "non_speech_events": "", "emotion": "concerned" }
]
Reworked array (9 rows — annotator post-edit)
[
{ "speaker": "Sarah Mitchell", "start_time": 0.00, "end_time": 4.20, "transcript": "Welcome to The Language Lab, the podcast where we break down how AI is changing the way we communicate.", "non_speech_events": "[intro jingle]", "emotion": "warm" },
{ "speaker": "Sarah Mitchell", "start_time": 4.20, "end_time": 8.00, "transcript": "Today, we have two fantastic guests joining us to talk about machine translation and quality assurance.", "non_speech_events": "", "emotion": "enthusiastic" },
{ "speaker": "James Park", "start_time": 8.00, "end_time": 11.50, "transcript": "Thanks, Sarah. Glad to be here — I've been looking forward to this conversation for weeks.", "non_speech_events": "", "emotion": "friendly" },
{ "speaker": "Elena Rossi", "start_time": 11.50, "end_time": 15.00, "transcript": "Same here. This is such an important topic right now, especially with how fast the field is evolving.", "non_speech_events": "", "emotion": "engaged" },
{ "speaker": "Sarah Mitchell", "start_time": 15.00, "end_time": 19.80, "transcript": "James, let's start with you. Your team recently published a paper on neural machine translation for low-resource languages.", "non_speech_events": "", "emotion": "curious" },
{ "speaker": "James Park", "start_time": 19.80, "end_time": 23.20, "transcript": "Yes, so our main finding was that back-translation combined with careful data augmentation can boost BLEU scores by up to 12 points.", "non_speech_events": "", "emotion": "analytical" },
{ "speaker": "James Park", "start_time": 23.20, "end_time": 26.50, "transcript": "This holds for languages with under 50,000 parallel sentences.", "non_speech_events": "", "emotion": "analytical" },
{ "speaker": "James Park", "start_time": 26.50, "end_time": 30.00, "transcript": "The trick is selecting the right seed data and not just throwing everything at the model.", "non_speech_events": "", "emotion": "technical" },
{ "speaker": "Elena Rossi", "start_time": 30.00, "end_time": 39.00, "transcript": "That resonates with our work at the localization lab where we focus on Arabic dialect adaptation — standard Arabic models completely fail when you feed them Tunisian or Moroccan dialect input.", "non_speech_events": "", "emotion": "thoughtful" }
]
curl -X POST https://structural-diff-engine.onrender.com/v1/diff \
-H "Content-Type: application/json" \
-H "x-api-key: YOUR_API_KEY" \
-H "x-request-id: batch-ep101-layer1-layer2-qa" \
-d '{
"original": <originalAdapted array>,
"reworked": <reworkedAdapted array>
}'
For the full demo dataset (9 original rows, 9 reworked rows after the split and merge), the body is about 8 KB — well within the 5 MB limit. Include x-request-id to correlate this call with the batch ID in your system.
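The curl call above can also be scripted. A minimal sketch using only Python's standard library, assuming the adapted arrays from the previous step; the endpoint URL and header names are taken from the curl example, and the key and request ID are placeholders you supply:

```python
import json

# Endpoint as shown in the curl example above.
API_URL = "https://structural-diff-engine.onrender.com/v1/diff"

def build_diff_request(original_adapted, reworked_adapted, api_key, request_id):
    """Serialise the two adapted arrays into a /v1/diff request body + headers."""
    body = json.dumps({
        "original": original_adapted,
        "reworked": reworked_adapted,
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "x-api-key": api_key,
        "x-request-id": request_id,
    }
    return body, headers
```

Send body and headers with any HTTP client (urllib.request, requests, etc.). Keeping the build step separate also makes it easy to check len(body) against the 5 MB limit before sending.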
Walking through the response
The results array has one entry per aligned segment, plus a trace entry for each original row absorbed into a merge (in this response: 8 main entries and 2 trace entries for the 9 original rows). Here is what each row returns and why:
{
"status": "success",
"requestId": "batch-ep101-layer1-layer2-qa",
"data": {
"results": [
{ "status": "UNCHANGED", "notes": "exact match", ... }, // row 0 — Sarah intro
{ "status": "MODIFIED", "notes": "transcript changed", ... }, // row 1 — comma added
{ "status": "MODIFIED", "notes": "transcript changed", ... }, // row 2 — James punctuation
{ "status": "MODIFIED", "notes": "transcript changed", ... }, // row 3 — Elena punctuation
{ "status": "MODIFIED", "notes": "transcript changed", ... }, // row 4 — Sarah question
{ "status": "SPLIT", "notes": "split into 2 rows", ... }, // row 5 — James finding (split)
{ "status": "UNCHANGED", "notes": "exact match", ... }, // row 6 — James trick
{ "status": "MERGED", "notes": "merged from 2 rows", ... }, // rows 7+8 — Elena merged
{ "status": "MERGED", "notes": "Source row 1/2 ...", ... }, // ← trace entry, skip in counts
{ "status": "MERGED", "notes": "Source row 2/2 ...", ... } // ← trace entry, skip in counts
],
"scores": {
"CER": 0.09,
"WER": 0.14,
"SER": 0.22,
"cerT": 0.09,
"werT": 0.14
},
"composite": {
"score": 3.9,
"grade": "B",
"label": "Good"
},
"meta": {
"originalRows": 9,
"reworkedRows": 9
}
}
}
Row 0 — Sarah's intro line
Exact match. The annotator did not touch this line.
Row 1 — "Today we have..."
Annotator added a comma after "Today". transcriptDiff highlights the insertion: [equal "Today"] [insert ","] [equal " we have two..."].
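If you want to show transcriptDiff to a reviewer, the bracket notation above suggests a simple op list. A hypothetical renderer, assuming each op arrives as a {"op": ..., "text": ...} object (the exact wire shape is not shown in this walkthrough):

```python
def render_diff(ops):
    """Render a transcriptDiff op list as inline markup.

    Assumes ops like {"op": "equal"|"insert"|"delete", "text": ...};
    this shape is an assumption, not documented API output.
    """
    marks = {"equal": "{t}", "insert": "[+{t}]", "delete": "[-{t}]"}
    return "".join(marks[o["op"]].format(t=o["text"]) for o in ops)
```

For row 1 this would render the insertion inline, e.g. the comma after "Today" as [+,].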
Row 2 — James's thanks
Multiple corrections in one row: a comma after "Thanks", a sentence break after "Sarah", and "I have been" contracted to "I've". transcriptDiff shows several delete/insert pairs.
Row 3 — Elena's "Same here"
Annotator added a period after "Same here" and a comma before "especially". Classic verbatim punctuation norm.
Row 4 — Sarah's question
Comma added after "James", apostrophe correction in "let's", hyphen added in "low-resource". Three distinct corrections in one row.
Row 5 — James's long finding
Original: one 31-word run-on sentence. Reworked: split at a natural clause boundary into two segments (22 + 9 words). "twelve" → "12" also corrected in part 1. SER incremented.
Row 6 — James's "The trick..."
Already well-formed. Annotator left it untouched.
Rows 7+8 — Elena's two statements
Annotator merged two consecutive Elena lines into one segment with an em-dash. Both original rows are absorbed; a source row entry also appears in results for each.
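Because the trace entries share the MERGED status with the main merge entry, a naive status tally double-counts merges. A small helper, assuming (as in the response above) that trace entries are identifiable by notes beginning "Source row":

```python
from collections import Counter

def tally_statuses(results):
    """Count row statuses, skipping merge trace entries.

    Assumes trace entries carry notes starting with "Source row",
    matching the response shown above.
    """
    main = [r for r in results if not r.get("notes", "").startswith("Source row")]
    return Counter(r["status"] for r in main)
```

On the demo response this yields 2 UNCHANGED, 4 MODIFIED, 1 SPLIT, and 1 MERGED across the 8 main entries.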
Reading the scores
The scores object gives you a quantitative summary of the entire batch. Here is how to interpret the numbers in this demo context:
CER ≈ 0.09 — About 9% of characters across all columns changed. This is low and expected: the annotator made punctuation and numeral corrections, which are short character-level changes in long segments.
WER ≈ 0.14 — About 14% of words changed. Higher than CER because one word swap ("twelve" → "12") counts as a full-word change, and the split/merge structural changes each touch a segment boundary.
SER ≈ 0.22 — About 22% of original rows were structurally changed (1 split + 1 merge out of 9 rows = 2/9). This is typical for a first annotation pass on AI output where the AI over- or under-segmented in a few places.
Composite grade B / "Good" (score ≈ 3.9) — The weighted formula rewards a low edit rate and penalises structural changes. A B grade tells the coordinator: the annotator made real corrections (not a rubber-stamp), but the overall volume of change is controlled.
Grade interpretation in annotation QA context: A = near-perfect AI output, minimal corrections. B = solid QA pass, expected corrections. C = significant post-editing required, AI quality needs investigation. D/F = the batch may need a full re-annotation or a reject/send-back decision.
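The interpretation guide above can be wired into an automated batch gate. A sketch with hypothetical action strings; the grade-to-action mapping is project policy, not part of the API response:

```python
# Hypothetical QA gate based on the grade interpretation guide above.
ACTIONS = {
    "A": "accept (near-perfect AI output)",
    "B": "accept (expected corrections)",
    "C": "accept, flag AI quality for investigation",
    "D": "send back for re-annotation",
    "F": "send back for re-annotation",
}

def batch_decision(composite):
    """Map the response's composite grade onto a coordinator action."""
    return ACTIONS.get(composite["grade"], "manual review")
```

For this demo batch, the B grade maps to accepting the batch with no further action.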