process #

The core algorithm(s) for processing one or more reference and hypothesis sentences so that measures can be computed and an alignment can be visualized.

AlignmentChunk dataclass #

Define an alignment between two subsequences of the reference and hypothesis.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `type` | `str` | one of `equal`, `substitute`, `insert`, or `delete` |
| `ref_start_idx` | `int` | the start index of the reference subsequence |
| `ref_end_idx` | `int` | the end index of the reference subsequence |
| `hyp_start_idx` | `int` | the start index of the hypothesis subsequence |
| `hyp_end_idx` | `int` | the end index of the hypothesis subsequence |

Source code in jiwer/process.py
@dataclass
class AlignmentChunk:
    """
    Define an alignment between two subsequences of the reference and hypothesis.

    Attributes:
        type: one of `equal`, `substitute`, `insert`, or `delete`
        ref_start_idx: the start index of the reference subsequence
        ref_end_idx: the end index of the reference subsequence
        hyp_start_idx: the start index of the hypothesis subsequence
        hyp_end_idx: the end index of the hypothesis subsequence
    """

    type: str

    ref_start_idx: int
    ref_end_idx: int

    hyp_start_idx: int
    hyp_end_idx: int

    def __post_init__(self):
        if self.type not in ["replace", "insert", "delete", "equal", "substitute"]:
            raise ValueError(
                f"expected type to be one of 'replace', 'insert', 'delete', "
                f"'equal' or 'substitute', but got '{self.type}'"
            )

        # rapidfuzz uses replace instead of substitute... For consistency, we change it
        if self.type == "replace":
            self.type = "substitute"

        if self.ref_start_idx > self.ref_end_idx:
            raise ValueError(
                f"ref_start_idx={self.ref_start_idx} "
                f"is larger "
                f"than ref_end_idx={self.ref_end_idx}"
            )
        if self.hyp_start_idx > self.hyp_end_idx:
            raise ValueError(
                f"hyp_start_idx={self.hyp_start_idx} "
                f"is larger "
                f"than hyp_end_idx={self.hyp_end_idx}"
            )
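
A minimal usage sketch (assuming `AlignmentChunk` can be imported from `jiwer.process`, where the source above lives): the rapidfuzz tag `replace` is normalized to `substitute`, and an inverted index range raises a `ValueError`.

```python
from jiwer.process import AlignmentChunk

# rapidfuzz emits "replace"; __post_init__ normalizes it to "substitute"
chunk = AlignmentChunk(
    type="replace", ref_start_idx=0, ref_end_idx=1, hyp_start_idx=0, hyp_end_idx=1
)
print(chunk.type)  # "substitute"

# a start index larger than its end index is rejected
try:
    AlignmentChunk(
        type="equal", ref_start_idx=2, ref_end_idx=1, hyp_start_idx=0, hyp_end_idx=0
    )
except ValueError as err:
    print(err)  # ref_start_idx=2 is larger than ref_end_idx=1
```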

CharacterOutput dataclass #

The output of calculating the character-level Levenshtein distance between one or more reference and hypothesis sentence(s).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `references` | `List[List[str]]` | The reference sentences |
| `hypotheses` | `List[List[str]]` | The hypothesis sentences |
| `alignments` | `List[List[AlignmentChunk]]` | The alignment between reference and hypothesis sentences |
| `cer` | `float` | The character error rate |
| `hits` | `int` | The number of correct characters between reference and hypothesis sentences |
| `substitutions` | `int` | The number of substitutions required to transform hypothesis sentences to reference sentences |
| `insertions` | `int` | The number of insertions required to transform hypothesis sentences to reference sentences |
| `deletions` | `int` | The number of deletions required to transform hypothesis sentences to reference sentences |

Source code in jiwer/process.py
@dataclass
class CharacterOutput:
    """
    The output of calculating the character-level Levenshtein distance between one or
    more reference and hypothesis sentence(s).

    Attributes:
        references: The reference sentences
        hypotheses: The hypothesis sentences
        alignments: The alignment between reference and hypothesis sentences
        cer: The character error rate
        hits: The number of correct characters between reference and hypothesis
              sentences
        substitutions: The number of substitutions required to transform hypothesis
                       sentences to reference sentences
        insertions: The number of insertions required to transform hypothesis
                       sentences to reference sentences
        deletions: The number of deletions required to transform hypothesis
                       sentences to reference sentences
    """

    # processed input data
    references: List[List[str]]
    hypotheses: List[List[str]]

    # alignment
    alignments: List[List[AlignmentChunk]]

    # measures
    cer: float

    # stats
    hits: int
    substitutions: int
    insertions: int
    deletions: int
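
As an illustration of how these fields fit together (a sketch, not part of the class itself): `cer` is derived from the edit statistics as `(S + D + I) / (H + S + D)`, so it can be recomputed from the other attributes.

```python
from jiwer import process_characters

out = process_characters("short one", "shore one")

S, D, I, H = out.substitutions, out.deletions, out.insertions, out.hits
# the character error rate follows from the edit statistics
assert abs(out.cer - (S + D + I) / (H + S + D)) < 1e-9
print(out.cer)  # ~0.111: 1 substitution over 9 characters
```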

WordOutput dataclass #

The output of calculating the word-level Levenshtein distance between one or more reference and hypothesis sentence(s).

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| `references` | `List[List[str]]` | The reference sentences |
| `hypotheses` | `List[List[str]]` | The hypothesis sentences |
| `alignments` | `List[List[AlignmentChunk]]` | The alignment between reference and hypothesis sentences |
| `wer` | `float` | The word error rate |
| `mer` | `float` | The match error rate |
| `wil` | `float` | The word information lost measure |
| `wip` | `float` | The word information preserved measure |
| `hits` | `int` | The number of correct words between reference and hypothesis sentences |
| `substitutions` | `int` | The number of substitutions required to transform hypothesis sentences to reference sentences |
| `insertions` | `int` | The number of insertions required to transform hypothesis sentences to reference sentences |
| `deletions` | `int` | The number of deletions required to transform hypothesis sentences to reference sentences |

Source code in jiwer/process.py
@dataclass
class WordOutput:
    """
    The output of calculating the word-level Levenshtein distance between one or more
    reference and hypothesis sentence(s).

    Attributes:
        references: The reference sentences
        hypotheses: The hypothesis sentences
        alignments: The alignment between reference and hypothesis sentences
        wer: The word error rate
        mer: The match error rate
        wil: The word information lost measure
        wip: The word information preserved measure
        hits: The number of correct words between reference and hypothesis sentences
        substitutions: The number of substitutions required to transform hypothesis
                       sentences to reference sentences
        insertions: The number of insertions required to transform hypothesis
                       sentences to reference sentences
        deletions: The number of deletions required to transform hypothesis
                       sentences to reference sentences

    """

    # processed input data
    references: List[List[str]]
    hypotheses: List[List[str]]

    # alignment
    alignments: List[List[AlignmentChunk]]

    # measures
    wer: float
    mer: float
    wil: float
    wip: float

    # stats
    hits: int
    substitutions: int
    insertions: int
    deletions: int
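
As a quick sanity check on how the measures relate (a sketch using the formulas from `process_words`, documented further below):

```python
from jiwer import process_words

out = process_words("the quick brown fox", "the kwik brown fox jumps")

H, S, D, I = out.hits, out.substitutions, out.deletions, out.insertions
assert abs(out.wer - (S + D + I) / (H + S + D)) < 1e-9      # word error rate
assert abs(out.mer - (S + D + I) / (H + S + D + I)) < 1e-9  # match error rate
assert abs(out.wil - (1 - out.wip)) < 1e-9                  # information lost
```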

process_characters #

process_characters(
    reference,
    hypothesis,
    reference_transform=cer_default,
    hypothesis_transform=cer_default,
)

Compute the character-level Levenshtein distance and alignment between one or more reference and hypothesis sentences. Based on the result, the character error rate can be computed.

Note that, by default, this method includes the space character (` `) among the characters over which the error rate is computed. If this is not desired, the reference and hypothesis transforms need to be modified, as in the sketch below.
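
A minimal sketch of one such modification, assuming the `jiwer.transforms` classes used here (`Compose`, `RemoveWhiteSpace`, and `ReduceToListOfListOfChars`) behave as in recent jiwer releases:

```python
import jiwer
import jiwer.transforms as tr

# strip all whitespace before reducing to characters, so spaces
# are not counted when computing the character error rate
no_space_cer = tr.Compose(
    [
        tr.RemoveWhiteSpace(replace_by_space=False),
        tr.ReduceToListOfListOfChars(),
    ]
)

out = jiwer.process_characters(
    "hello world",
    "helloworld",
    reference_transform=no_space_cer,
    hypothesis_transform=no_space_cer,
)
print(out.cer)  # 0.0, since only the space differed
```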

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `reference` | `Union[str, List[str]]` | The reference sentence(s) | *required* |
| `hypothesis` | `Union[str, List[str]]` | The hypothesis sentence(s) | *required* |
| `reference_transform` | `Union[Compose, AbstractTransform]` | The transformation(s) to apply to the reference string(s) | `cer_default` |
| `hypothesis_transform` | `Union[Compose, AbstractTransform]` | The transformation(s) to apply to the hypothesis string(s) | `cer_default` |

Returns:

| Type | Description |
| --- | --- |
| `CharacterOutput` | The processed reference and hypothesis sentences. |

Source code in jiwer/process.py
def process_characters(
    reference: Union[str, List[str]],
    hypothesis: Union[str, List[str]],
    reference_transform: Union[tr.Compose, tr.AbstractTransform] = cer_default,
    hypothesis_transform: Union[tr.Compose, tr.AbstractTransform] = cer_default,
) -> CharacterOutput:
    """
    Compute the character-level Levenshtein distance and alignment between one or more
    reference and hypothesis sentences. Based on the result, the character error rate
    can be computed.

    Note that, by default, this method includes the space character (` `) as a
    character over which the error rate is computed. If this is not desired, the
    reference and hypothesis transforms need to be modified.

    Args:
        reference: The reference sentence(s)
        hypothesis: The hypothesis sentence(s)
        reference_transform: The transformation(s) to apply to the reference string(s)
        hypothesis_transform: The transformation(s) to apply to the hypothesis string(s)

    Returns:
        (CharacterOutput): The processed reference and hypothesis sentences.

    """
    # make sure the transforms end with tr.ReduceToListOfListOfChars();
    # character processing is then the same as word processing, where
    # every word has length 1
    result = process_words(
        reference, hypothesis, reference_transform, hypothesis_transform
    )

    return CharacterOutput(
        references=result.references,
        hypotheses=result.hypotheses,
        alignments=result.alignments,
        cer=result.wer,
        hits=result.hits,
        substitutions=result.substitutions,
        insertions=result.insertions,
        deletions=result.deletions,
    )
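
A minimal usage sketch; the chunk types and indices come from the `AlignmentChunk` instances described above:

```python
from jiwer import process_characters

out = process_characters("hello world", "hxllo world")
print(out.cer)  # ~0.091: 1 substitution over 11 characters (spaces included)

# one list of alignment chunks per sentence pair
for chunk in out.alignments[0]:
    print(chunk.type, chunk.ref_start_idx, chunk.ref_end_idx)
```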

process_words #

process_words(
    reference,
    hypothesis,
    reference_transform=wer_default,
    hypothesis_transform=wer_default,
)

Compute the word-level Levenshtein distance and alignment between one or more reference and hypothesis sentences. Based on the result, multiple measures can be computed, such as the word error rate.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `reference` | `Union[str, List[str]]` | The reference sentence(s) | *required* |
| `hypothesis` | `Union[str, List[str]]` | The hypothesis sentence(s) | *required* |
| `reference_transform` | `Union[Compose, AbstractTransform]` | The transformation(s) to apply to the reference string(s) | `wer_default` |
| `hypothesis_transform` | `Union[Compose, AbstractTransform]` | The transformation(s) to apply to the hypothesis string(s) | `wer_default` |

Returns:

| Type | Description |
| --- | --- |
| `WordOutput` | The processed reference and hypothesis sentences. |

Source code in jiwer/process.py
def process_words(
    reference: Union[str, List[str]],
    hypothesis: Union[str, List[str]],
    reference_transform: Union[tr.Compose, tr.AbstractTransform] = wer_default,
    hypothesis_transform: Union[tr.Compose, tr.AbstractTransform] = wer_default,
) -> WordOutput:
    """
    Compute the word-level Levenshtein distance and alignment between one or more
    reference and hypothesis sentences. Based on the result, multiple measures
    can be computed, such as the word error rate.

    Args:
        reference: The reference sentence(s)
        hypothesis: The hypothesis sentence(s)
        reference_transform: The transformation(s) to apply to the reference string(s)
        hypothesis_transform: The transformation(s) to apply to the hypothesis string(s)

    Returns:
        (WordOutput): The processed reference and hypothesis sentences
    """
    # validate input type
    if isinstance(reference, str):
        reference = [reference]
    if isinstance(hypothesis, str):
        hypothesis = [hypothesis]
    if any(len(t) == 0 for t in reference):
        raise ValueError("one or more references are empty strings")

    # pre-process reference and hypothesis by applying transforms
    ref_transformed = _apply_transform(
        reference, reference_transform, is_reference=True
    )
    hyp_transformed = _apply_transform(
        hypothesis, hypothesis_transform, is_reference=False
    )

    if len(ref_transformed) != len(hyp_transformed):
        raise ValueError(
            "After applying the transforms on the reference and hypothesis sentences, "
            f"their lengths must match. "
            f"Instead got {len(ref_transformed)} reference and "
            f"{len(hyp_transformed)} hypothesis sentences."
        )

    # Change each word into a unique character in order to compute
    # word-level levenshtein distance
    ref_as_chars, hyp_as_chars = _word2char(ref_transformed, hyp_transformed)

    # keep track of total hits, substitutions, deletions and insertions
    # across all input sentences
    num_hits, num_substitutions, num_deletions, num_insertions = 0, 0, 0, 0

    # also keep track of the total number of words in the reference and hypothesis
    num_rf_words, num_hp_words = 0, 0

    # and finally, keep track of the alignment between each reference and hypothesis
    alignments = []

    for reference_sentence, hypothesis_sentence in zip(ref_as_chars, hyp_as_chars):
        # Get the required edit operations to transform reference into hypothesis
        edit_ops = rapidfuzz.distance.Levenshtein.editops(
            reference_sentence, hypothesis_sentence
        )

        # count the number of edits of each type
        substitutions = sum(1 if op.tag == "replace" else 0 for op in edit_ops)
        deletions = sum(1 if op.tag == "delete" else 0 for op in edit_ops)
        insertions = sum(1 if op.tag == "insert" else 0 for op in edit_ops)
        hits = len(reference_sentence) - (substitutions + deletions)

        # update state
        num_hits += hits
        num_substitutions += substitutions
        num_deletions += deletions
        num_insertions += insertions
        num_rf_words += len(reference_sentence)
        num_hp_words += len(hypothesis_sentence)
        alignments.append(
            [
                AlignmentChunk(
                    type=op.tag,
                    ref_start_idx=op.src_start,
                    ref_end_idx=op.src_end,
                    hyp_start_idx=op.dest_start,
                    hyp_end_idx=op.dest_end,
                )
                for op in Opcodes.from_editops(edit_ops)
            ]
        )

    # Compute all measures
    S, D, I, H = num_substitutions, num_deletions, num_insertions, num_hits

    wer = float(S + D + I) / float(H + S + D)
    mer = float(S + D + I) / float(H + S + D + I)
    wip = (
        (float(H) / num_rf_words) * (float(H) / num_hp_words)
        if num_hp_words >= 1
        else 0
    )
    wil = 1 - wip

    # return all output
    return WordOutput(
        references=ref_transformed,
        hypotheses=hyp_transformed,
        alignments=alignments,
        wer=wer,
        mer=mer,
        wil=wil,
        wip=wip,
        hits=num_hits,
        substitutions=num_substitutions,
        insertions=num_insertions,
        deletions=num_deletions,
    )
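
A usage sketch for batch input; the final call assumes the `jiwer.visualize_alignment` helper exported by recent jiwer releases:

```python
import jiwer

out = jiwer.process_words(
    ["the quick brown fox", "jumps over the dog"],
    ["the kwik brown fox", "jump over the dog"],
)
print(out.wer, out.mer, out.wil)  # wer = 0.25: 2 substitutions over 8 words

# render the per-sentence alignments as text
print(jiwer.visualize_alignment(out))
```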