transformations
This module provides the default transformations that need to be applied to input text in order to compute the WER (or similar measures).
It also implements some alternative transformations that might be useful in specific use cases.
cer_contiguous
module-attribute
cer_contiguous = tr.Compose(
    [
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfChars(),
    ]
)
This can be used instead of cer_default when the number of reference and hypothesis sentences differ.
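To make the effect of reducing to a single sentence concrete, here is a minimal standard-library sketch that mimics the three steps above. It does not call the library itself; the function name `cer_contiguous_sketch` is hypothetical, and joining sentences with a single space is an assumption of this sketch.

```python
def cer_contiguous_sketch(sentences):
    """Mimic Strip -> ReduceToSingleSentence -> ReduceToListOfListOfChars."""
    # Remove leading and trailing whitespace from every sentence.
    stripped = [s.strip() for s in sentences]
    # Merge all non-empty sentences into one (single-space join is an assumption).
    joined = " ".join(s for s in stripped if s)
    # One outer list entry, holding the individual characters.
    return [list(joined)]


# Two reference sentences versus one hypothesis sentence.
reference = cer_contiguous_sketch(["ab ", " cd"])
hypothesis = cer_contiguous_sketch(["ab cd"])

print(reference)
print(len(reference) == len(hypothesis))
```

Because both inputs collapse to a single sentence, the differing sentence counts no longer matter when the character sequences are compared.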
cer_default
module-attribute
cer_default = tr.Compose(
    [tr.Strip(), tr.ReduceToListOfListOfChars()]
)
This is the default transformation when using process_characters. Each input string
will have its leading and trailing whitespace removed. Then each string is
transformed into a list of lists of strings, where each string is a single character.
wer_contiguous
module-attribute
wer_contiguous = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)
This can be used instead of wer_default when the number of reference and hypothesis sentences differ.
wer_default
module-attribute
wer_default = tr.Compose(
    [
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)
This is the default transformation when using process_words. Multiple spaces between
words are first collapsed into a single space, and each input string then has its
leading and trailing whitespace removed. Finally, each string is transformed into a
list of lists of strings, where each string is a single word.
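The pipeline above can be approximated with the standard library alone. The sketch below is illustrative only and does not use the library's classes; `wer_default_sketch` is a hypothetical name, and returning an empty list for an empty sentence is an assumption of this sketch.

```python
import re


def wer_default_sketch(sentences):
    """Mimic RemoveMultipleSpaces -> Strip -> ReduceToListOfListOfWords."""
    # Collapse runs of two or more spaces into a single space.
    collapsed = [re.sub(r" {2,}", " ", s) for s in sentences]
    # Remove leading and trailing whitespace.
    stripped = [s.strip() for s in collapsed]
    # Split every sentence into words, keeping one inner list per sentence.
    return [s.split(" ") if s else [] for s in stripped]


words = wer_default_sketch(["  hello   world ", "good morning"])
print(words)
```

Note that the outer list keeps one entry per input sentence, which is why the reference and hypothesis need the same number of sentences with this transformation.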
wer_standardize
module-attribute
wer_standardize = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToListOfListOfWords(),
    ]
)
This transform attempts to standardize the strings by setting all characters to lower case, expanding common contractions, and removing non-words. Then the default operations are applied.
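As a rough illustration of the standardization steps, the following standard-library sketch walks through them in the same order. It is not the library's implementation: the function name, the tiny `CONTRACTIONS` table, and the regular expression that treats tokens wrapped in `[]` or `<>` as non-words are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny contraction table; not the library's actual list.
CONTRACTIONS = {"i'm": "i am", "it's": "it is", "don't": "do not"}


def wer_standardize_sketch(sentences):
    """Lower-case, expand contractions, drop bracketed tokens, then split into words."""
    result = []
    for s in sentences:
        s = s.lower()
        for short, full in CONTRACTIONS.items():
            s = s.replace(short, full)
        # Assumed non-word pattern: anything wrapped in [] or <>.
        s = re.sub(r"[<\[][^>\]]*[>\]]", "", s)
        # Turn every whitespace character (tabs, newlines) into a plain space.
        s = re.sub(r"\s", " ", s)
        # Collapse repeated spaces and trim both ends.
        s = re.sub(r" {2,}", " ", s).strip()
        result.append(s.split(" ") if s else [])
    return result


standardized = wer_standardize_sketch(["I'm  here\t[laughter] now", "It's <unk> OK"])
print(standardized)
```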
wer_standardize_contiguous
module-attribute
wer_standardize_contiguous = tr.Compose(
    [
        tr.ToLowerCase(),
        tr.ExpandCommonEnglishContractions(),
        tr.RemoveKaldiNonWords(),
        tr.RemoveWhiteSpace(replace_by_space=True),
        tr.RemoveMultipleSpaces(),
        tr.Strip(),
        tr.ReduceToSingleSentence(),
        tr.ReduceToListOfListOfWords(),
    ]
)
This is the same as wer_standardize, but this version can be used when the number of
reference and hypothesis sentences differ.