transformations

transformations #

This file provides the default transformations that need to be applied to input text in order to compute the WER (or similar measures).

It also implements some alternative transformations which might be useful in specific use cases.

cer_contiguous `module-attribute` #

cer_contiguous = tr.Compose(
    [
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfChars(),
    ]
)

This can be used instead of cer_default when the number of reference and hypothesis sentences differ.

cer_default `module-attribute` #

cer_default = tr.Compose(
    [tr.Strip(), tr.ReduceToListOfListOfChars()]
)

This is the default transformation when using process_characters. Each input string has its leading and trailing whitespace removed. Each string is then split into its characters, so the result is a list of lists of strings, where each string is a single character.
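The effect of this pipeline can be approximated in plain Python. The sketch below is not the library's implementation; `cer_default_sketch` is a hypothetical helper name used only to illustrate the shape of the output.

```python
from typing import List


def cer_default_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate Strip() followed by ReduceToListOfListOfChars()."""
    # Strip leading/trailing whitespace from each sentence, then split each
    # sentence into single characters. Spaces inside a sentence are kept.
    return [list(sentence.strip()) for sentence in sentences]


print(cer_default_sketch(["  hi there ", "ok"]))
# [['h', 'i', ' ', 't', 'h', 'e', 'r', 'e'], ['o', 'k']]
```

Note that one inner list is produced per input sentence, which is why the reference and hypothesis need the same number of sentences with this transformation.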

wer_contiguous `module-attribute` #

wer_contiguous = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This can be used instead of wer_default when the number of reference and hypothesis sentences differ.
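To see why joining everything into a single sentence helps when the sentence counts differ, here is a minimal plain-Python sketch of the same steps. It is an approximation for illustration only; `wer_contiguous_sketch` is a hypothetical name, and edge cases such as empty input may behave differently from the real transforms.

```python
import re
from typing import List


def wer_contiguous_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate the wer_contiguous pipeline with plain string operations."""
    # Collapse runs of two or more whitespace characters, then strip the ends.
    cleaned = [re.sub(r"\s\s+", " ", s).strip() for s in sentences]
    # Join all non-empty sentences into one sentence.
    joined = " ".join(s for s in cleaned if s)
    # Split the single sentence into words, wrapped in an outer list.
    return [[word for word in joined.split(" ") if word]]


reference = ["the quick brown fox", "jumps over the dog"]  # two sentences
hypothesis = ["the quick brown fox jumps over the dog"]    # one sentence

print(wer_contiguous_sketch(reference))
# [['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'dog']]
print(wer_contiguous_sketch(reference) == wer_contiguous_sketch(hypothesis))
# True
```

Both inputs reduce to one word sequence, so they can be aligned even though the original sentence boundaries do not match.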

wer_default `module-attribute` #

wer_default = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This is the default transformation when using process_words. Multiple spaces between words are first collapsed into a single space, and each input string then has its leading and trailing whitespace removed. Finally, each string is split into words, so the result is a list of lists of strings, where each string is a single word.
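The following plain-Python sketch mirrors the three steps in the order they appear in the composition. It is only an approximation of the behaviour described here; `wer_default_sketch` is a hypothetical helper, not part of the library.

```python
import re
from typing import List


def wer_default_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate the wer_default pipeline, one word list per sentence."""
    result = []
    for sentence in sentences:
        # Step 1: collapse runs of two or more whitespace characters.
        sentence = re.sub(r"\s\s+", " ", sentence)
        # Step 2: remove leading and trailing whitespace.
        sentence = sentence.strip()
        # Step 3: split on single spaces, dropping empty strings.
        result.append([word for word in sentence.split(" ") if word])
    return result


print(wer_default_sketch(["  hello   world ", "good  morning"]))
# [['hello', 'world'], ['good', 'morning']]
```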

wer_standardize `module-attribute` #

wer_standardize = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This transform attempts to standardize the strings by setting all characters to lower case, expanding common contractions, and removing non-words. Then the default operations are applied.
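As a rough illustration of what this standardization does to a string, the sketch below re-creates the steps with plain Python. It rests on several assumptions: the contraction table is a small illustrative subset rather than the library's actual rule set, "non-words" are taken to be tokens wrapped in `[...]` or `<...>`, and `standardize_sketch` is a hypothetical name.

```python
import re
from typing import List

# Assumption: a deliberately small subset of contraction rules, for illustration.
CONTRACTIONS = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'re", " are"),
    (r"'ll", " will"),
    (r"'ve", " have"),
    (r"'m", " am"),
]


def standardize_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate the wer_standardize pipeline on a list of sentences."""
    result = []
    for s in sentences:
        s = s.lower()                                # lower-case everything
        for pattern, replacement in CONTRACTIONS:    # expand contractions
            s = re.sub(pattern, replacement, s)
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)      # drop [..] and <..> tokens
        s = re.sub(r"[\t\n\r\x0b\x0c]", " ", s)      # other whitespace to spaces
        s = re.sub(r"\s\s+", " ", s).strip()         # the default clean-up
        result.append([word for word in s.split(" ") if word])
    return result


print(standardize_sketch(["I'm here [laughter]\tand they won't go"]))
# [['i', 'am', 'here', 'and', 'they', 'will', 'not', 'go']]
```

The specific rules go before the generic `n't` rule so that "won't" becomes "will not" rather than "wo not".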

wer_standardize_contiguous `module-attribute` #

wer_standardize_contiguous = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This is the same as wer_standardize, but this version can be used when the number of reference and hypothesis sentences differ.
