transformations #

This file provides the default transformations that need to be applied to input text in order to compute the WER (or similar measures).

It also implements some alternative transformations which might be useful in specific use cases.

cer_contiguous module-attribute #

cer_contiguous = tr.Compose(
    [
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfChars(),
    ]
)

This can be used instead of cer_default when the number of reference and hypothesis sentences differ, because all sentences are first reduced to a single sentence.

cer_default module-attribute #

cer_default = tr.Compose(
    [tr.Strip(), tr.ReduceToListOfListOfChars()]
)

This is the default transformation when using process_characters. Each input string has its leading and trailing white space removed. Each string is then transformed into a list of single-character strings, so the result is a list of lists, with one inner list per input string.
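To make the resulting shape concrete, here is a minimal plain-Python sketch that mimics the two steps without calling the library. The helper name `cer_default_sketch` is hypothetical, and the code only approximates the behaviour described above; it is not the library's implementation.

```python
from typing import List


def cer_default_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate the described steps: strip, then split into characters."""
    result = []
    for sentence in sentences:
        stripped = sentence.strip()    # remove leading and trailing white space
        result.append(list(stripped))  # one single-character string per element
    return result


# Inner spaces are kept as characters; only the outer white space is removed.
print(cer_default_sketch(["  hi there ", "ok"]))
```

Each input string yields its own inner list, so the number of inner lists equals the number of input sentences.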

wer_contiguous module-attribute #

wer_contiguous = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This can be used instead of wer_default when the number of reference and hypothesis sentences differ, because all sentences are first reduced to a single sentence.
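The effect of reducing to a single sentence can be sketched in plain Python. The helper `wer_contiguous_sketch` below is hypothetical and only approximates the listed steps (the exact white-space rules are an assumption). It shows why differently segmented inputs end up comparable.

```python
import re
from typing import List


def wer_contiguous_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate: collapse spaces, strip, join all sentences, split into words."""
    # Assumed rule: a run of two or more white-space characters becomes one space.
    cleaned = [re.sub(r"\s\s+", " ", s).strip() for s in sentences]
    # Join the non-empty sentences so only the overall word sequence matters.
    joined = " ".join(s for s in cleaned if s)
    # A single inner list, regardless of how many sentences were passed in.
    return [[word for word in joined.split(" ") if word]]


reference = ["hello  world ", "how are you"]      # two sentences
hypothesis = ["hello world how", "are", "you"]    # three sentences

# Both sides collapse to one word list, so the sentence counts no longer matter.
print(wer_contiguous_sketch(reference) == wer_contiguous_sketch(hypothesis))
```

The trade-off is that sentence boundaries are lost, so any per-sentence alignment is no longer available.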

wer_default module-attribute #

wer_default = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This is the default transformation when using process_words. Runs of multiple spaces between words are first reduced to a single space, and then each input string has its leading and trailing white space removed. Each string is then transformed into a list of words, so the result is a list of lists, with one inner list per input string.
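The following plain-Python sketch walks through the same steps in the order of the Compose listing above. The helper `wer_default_sketch` is hypothetical, and the regular expression is an assumption about how runs of spaces are collapsed, so treat it as an approximation rather than the library's code.

```python
import re
from typing import List


def wer_default_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate: collapse repeated spaces, strip, split each sentence into words."""
    result = []
    for sentence in sentences:
        # Assumed rule: a run of two or more white-space characters becomes one space.
        sentence = re.sub(r"\s\s+", " ", sentence)
        sentence = sentence.strip()  # remove leading and trailing white space
        # Split on single spaces and drop any empty strings.
        result.append([word for word in sentence.split(" ") if word])
    return result


print(wer_default_sketch(["  the  quick   brown fox ", "jumps"]))
# → [['the', 'quick', 'brown', 'fox'], ['jumps']]
```

Because each sentence keeps its own inner list, the reference and hypothesis must contain the same number of sentences for this form to be compared pairwise.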

wer_standardize module-attribute #

wer_standardize = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This transform attempts to standardize the strings by setting all characters to lower case, expanding common English contractions, and removing Kaldi non-words. White-space characters are then replaced by spaces, and the default operations are applied.
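A rough plain-Python sketch of this kind of standardization is given below. Everything in it is an assumption made for illustration: the helper name `wer_standardize_sketch` is hypothetical, the contraction table is a tiny hand-picked subset, and Kaldi non-words are assumed to be tokens enclosed in `<>` or `[]`. It does not reproduce the library's exact rules.

```python
import re
from typing import List

# Illustrative subset only (an assumption); the real transformation covers more cases.
CONTRACTIONS = {"won't": "will not", "i'm": "i am", "it's": "it is"}


def wer_standardize_sketch(sentences: List[str]) -> List[List[str]]:
    """Approximate: lower-case, expand contractions, drop non-words, tidy spaces, split."""
    result = []
    for sentence in sentences:
        sentence = sentence.lower()
        for short, expanded in CONTRACTIONS.items():
            sentence = sentence.replace(short, expanded)
        # Assumed non-word pattern: anything like <unk> or [laughter].
        sentence = re.sub(r"[<\[][^>\]]*[>\]]", "", sentence)
        # Replace tabs, newlines and similar white-space characters by a space.
        sentence = re.sub(r"[\t\n\r\x0b\x0c]", " ", sentence)
        # Collapse repeated white space, then strip the ends.
        sentence = re.sub(r"\s\s+", " ", sentence).strip()
        result.append([word for word in sentence.split(" ") if word])
    return result


print(wer_standardize_sketch(["I'm here <unk>\tnow", "It's  FINE [laughter]"]))
# → [['i', 'am', 'here', 'now'], ['it', 'is', 'fine']]
```

Normalizing both reference and hypothesis this way keeps differences in casing, contractions, or annotation tokens from being counted as word errors.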

wer_standardize_contiguous module-attribute #

wer_standardize_contiguous = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)

This is the same as wer_standardize, but this version can be used when the number of reference and hypothesis sentences differ.