evalmate.evaluator

This module implements the top-level functionality for performing the evaluation for the different tasks. For every task there is an evaluator (extending Evaluator) and an evaluation result (extending Evaluation). The evaluator is the class responsible for performing the evaluation, and the evaluation is its output, which contains the aligned labels/segments and, depending on the task, further data such as word confusions.
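
A minimal usage sketch of this pattern for a keyword-spotting task (the label-list classes are assumed to come from audiomate; keyword values and times are illustrative):

    from audiomate.annotations import Label, LabelList  # assumed dependency for labels
    from evalmate import evaluator

    # Reference (ground truth) and hypothesis (system output) for one utterance.
    ref = LabelList(labels=[Label('balloon', start=1.3, end=2.0)])
    hyp = LabelList(labels=[Label('balloon', start=1.4, end=2.1)])

    # The task-specific evaluator performs the evaluation ...
    result = evaluator.KWSEvaluator().evaluate(ref, hyp)

    # ... and the resulting Evaluation holds the aligned labels and metrics.
    print(result.get_report())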

Base

class evalmate.evaluator.Evaluation(ref_outcome, hyp_outcome)[source]

Base class for evaluation results.

Variables:
  • ref_outcome (Outcome) – The outcome of the ground-truth/reference.
  • hyp_outcome (Outcome) – The outcome of the system-output/hypothesis.
get_report(template=None)[source]

Generate and return a report.

Parameters:template (str) – Name of the Jinja2 template to use. If None, the default_template() is used. All available templates are in the report_templates folder.
Returns:The rendered report.
Return type:str
template_data

Return a dictionary that contains objects/values to use in the rendering template.

write_report(path, template=None)[source]

Write the report to the given path.

Parameters:
  • path (str) – Path to write the report to.
  • template (str) – Name of the Jinja2 template to use. If None, the default_template() is used. All available templates are in the report_templates folder.
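
For example (a sketch, assuming result is an Evaluation instance as in the sketch above; the output path is illustrative):

    # Render the report with the default template and print it.
    print(result.get_report())

    # Write the report to a file, optionally with a named template
    # from the report_templates folder.
    result.write_report('/tmp/evaluation_report.txt')
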
class evalmate.evaluator.Evaluator[source]

Base class for an evaluator.

Provides methods for reading outcomes in different ways. The evaluator for a specific task then has to implement do_evaluate, which performs the evaluation on the ref and hyp outcomes.

classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

evaluate(ref, hyp, label_list_idx=None)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth). There are different possibilities for the input (see the sketch below):

  • ref = Outcome / hyp = Outcome: Both ref and hyp are Outcome instances. See do_evaluate
  • ref = Corpus / hyp = dict: The dict contains label-lists which are compared against the corpus. See evaluate_label_lists_against_corpus
  • ref = LabelList / hyp = LabelList: Ref label-list is compared against the other. See evaluate_label_lists
Parameters:
  • ref (LabelList, Corpus, Outcome) – The reference: a label-list, a corpus or an outcome.
  • hyp (LabelList, dict, Outcome) – The hypothesis: a label-list, a dict of label-lists or an outcome.
  • label_list_idx (str) – The label-list to use when reading from a corpus.
Returns:The evaluation results.
Return type:Evaluation
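
A sketch of these input variants (the corpus loading, path and label-list index are illustrative and assume an audiomate corpus):

    from audiomate import Corpus
    from audiomate.annotations import Label, LabelList
    from evalmate import evaluator

    asr_evaluator = evaluator.ASREvaluator()

    # LabelList vs. LabelList
    ref_ll = LabelList(labels=[Label('the sun is shining')])
    hyp_ll = LabelList(labels=[Label('the son is shining')])
    result = asr_evaluator.evaluate(ref_ll, hyp_ll)

    # Corpus vs. dict of hypothesis label-lists, keyed by utterance-idx
    corpus = Corpus.load('/path/to/corpus')                      # illustrative path
    hypotheses = {'utt-1': LabelList(labels=[Label('the son is shining')])}
    result = asr_evaluator.evaluate(corpus, hypotheses, label_list_idx='word-transcript')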

evaluate_label_lists(ll_ref, ll_hyp, duration=None)[source]

Create an Evaluation for a reference and hypothesis label-list. If the duration is not provided, some metrics cannot be used.

Parameters:
  • ref (LabelList) – A label-list.
  • hyp (LabelList) – A label-list.
  • duration (float) – The duration of the utterance that the label-lists belong to.
Returns:The evaluation results.
Return type:Evaluation
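
For example (a sketch, assuming the label-lists from the sketch above; the duration is illustrative):

    # Providing the duration makes duration-dependent metrics available.
    result = evaluator.SegmentEvaluator().evaluate_label_lists(ref_ll, hyp_ll, duration=12.5)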

evaluate_label_lists_against_corpus(corpus, label_lists, label_list_idx=None)[source]

Create an Evaluation for the given corpus.

Parameters:
  • corpus (Corpus) – A corpus containing the reference label-lists.
  • label_lists (Dict) – A dictionary containing label-lists with the utterance-idx as key. The utterance-idx is used to find the corresponding reference label-list in the corpus.
  • label_list_idx (str) – The idx of the label-lists to use as reference from the corpus. If None, cls.default_label_list_idx is used.
Returns:The evaluation results.
Return type:Evaluation
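
For example (a sketch; the corpus path, utterance-ids and label values are illustrative):

    corpus = Corpus.load('/path/to/corpus')
    hypotheses = {
        'utt-1': LabelList(labels=[Label('music', start=0.0, end=4.2)]),
        'utt-2': LabelList(labels=[Label('speech', start=0.0, end=7.8)]),
    }
    result = evaluator.SegmentEvaluator().evaluate_label_lists_against_corpus(corpus, hypotheses)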

Outcome

class evalmate.evaluator.Outcome(label_lists=None, utterance_durations=None)[source]

An outcome represents the annotation/labels/transcriptions of a dataset/corpus for a given task. This can be either the ground truth/reference or the system output/hypothesis.

If no durations are provided, or durations for some utterances are missing, some methods may not work or may throw exceptions.

Variables:
  • label_lists (dict) – Dictionary containing all label-lists with the utterance-idx/sample-idx as key.
  • utterance_durations (dict) – Dictionary (utterance-idx/duration) containing the durations of all utterances.
all_values

Return a set of all values occurring in the outcome.

label_set()[source]

Return a label-set containing all labels.

label_set_for_value(value)[source]

Return a label-set containing all labels whose value equals the given value.

Parameters:value (str) – The value to filter.
Returns:Label-set containing all labels with the given value.
Return type:LabelSet
total_duration

Return the duration of all utterances together.

Notes

Only works if durations are provided for all utterances.
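
A sketch of building an Outcome manually (utterance-ids, labels and durations are illustrative):

    ref_outcome = evaluator.Outcome(
        label_lists={'utt-1': LabelList(labels=[Label('music', start=0.0, end=4.2)])},
        utterance_durations={'utt-1': 12.5},
    )

    print(ref_outcome.all_values)        # {'music'}
    print(ref_outcome.total_duration)    # works because all durations are provided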

class evalmate.evaluator.LabelSet(labels=None)[source]

Class to collect a group of labels. This is used to compute statistics over a defined set of labels.

For example, to compute the average length of all labels with the value ‘music’, we can collect these labels in a label-set and perform the computation.

count

Return the number of labels.

label_lengths

Return a list containing all label lengths.

length_max

Return the length of the longest label.

length_mean

Return the mean length of all labels.

length_median

Return the median of all label lengths.

length_min

Return the length of the shortest label.

length_variance

Return the variance of all label lengths.
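
For example (a sketch, assuming ref_outcome is an Outcome as constructed in the sketch above):

    music_labels = ref_outcome.label_set_for_value('music')

    print(music_labels.count)         # number of 'music' labels
    print(music_labels.length_mean)   # average label length
    print(music_labels.length_max)    # length of the longest 'music' label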

Segment

class evalmate.evaluator.SegmentEvaluation(ref_outcome, hyp_outcome, utt_to_segments)[source]

Result of an evaluation of a segment-based alignment.

Parameters:utt_to_segments (dict) – Dict of lists with evalmate.alignment.Segment. Key is the utterance-idx.

segments

Return a list of all segments (from all utterances together).

template_data

Return a dictionary that contains objects/values to use in the rendering template.

class evalmate.evaluator.SegmentEvaluator(aligner=None)[source]

Evaluation of an alignment based on segments.

Parameters:aligner (SegmentAligner) – An instance of a segment-aligner to use. If not given, the alignment.InvariantSegmentAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

static flatten_overlapping_labels(aligned_segments)[source]

Check all segments for overlapping labels. Overlapping means there are multiple reference or multiple hypothesis labels in a segment.

Parameters:aligned_segments (List) – List of segments.
Returns:List of segments where ref and hyp each contain a single label.
Return type:list
Raises:ValueError – A segment contains overlapping labels.
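
A sketch of a segment-based evaluation (assuming the imports from the earlier sketches; label values and times are illustrative):

    ref_ll = LabelList(labels=[Label('music', start=0.0, end=5.0),
                               Label('speech', start=5.0, end=10.0)])
    hyp_ll = LabelList(labels=[Label('music', start=0.0, end=5.5),
                               Label('speech', start=5.5, end=10.0)])

    seg_result = evaluator.SegmentEvaluator().evaluate(ref_ll, hyp_ll)
    print(len(seg_result.segments))   # aligned segments from all utterances
    print(seg_result.get_report())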

Event

class evalmate.evaluator.EventEvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of any event-based alignment.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

label_pairs

Return a list of all label-pairs (from all utterances together).

template_data

Return a dictionary that contains objects/values to use in the rendering template.

class evalmate.evaluator.EventEvaluator(aligner)[source]

Class to compute evaluation results for any event-based alignment.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation
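
A sketch of using the event-based evaluator with an explicit aligner (assuming ref_ll and hyp_ll from the earlier sketches, and assuming the BipartiteMatchingAligner can be constructed with default arguments; see evalmate.alignment for the available aligners):

    from evalmate import alignment, evaluator

    event_evaluator = evaluator.EventEvaluator(alignment.BipartiteMatchingAligner())
    event_result = event_evaluator.evaluate(ref_ll, hyp_ll)
    print(len(event_result.label_pairs))   # aligned label-pairs from all utterances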

KWS

class evalmate.evaluator.KWSEvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of a keyword spotting task.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

false_alarm_rate(keyword=None)[source]

The False Alarm Rate (FAR) is the percentage of detections where, according to the ground truth, no keyword is present. If no keyword is given, the mean FAR is calculated over all keywords. This rate is relative to the duration of all utterances.

To calculate this, we need to know the number of times a keyword could have been wrongly inserted. To approximate this value, we assume that every keyword takes one second.

Parameters:keyword (str) – If not None, only the FAR for this keyword is returned.
Returns:A rate between 0 and 1
Return type:float
false_rejection_rate(keyword=None)[source]

The False Rejection Rate (FRR) is the percentage of missed occurrences out of all occurrences in the ground truth. If no keyword is given, the mean FRR is calculated over all keywords.

Parameters:keyword (str) – If not None, only the FRR for this keyword is returned.
Returns:A rate between 0 and 1
Return type:float
keywords()[source]

Return a list of all keywords occurring in the reference outcome.

term_weighted_value(keyword=None)[source]

Computes the Term-Weighted Value (TWV).

Note

The TWV is implemented according to the OpenKWS 2016 Evaluation Plan.

Parameters:keyword (str) – If None, computes the TWV over all keywords, otherwise only for the given keyword.
Returns:The TWV, in the range -inf to 1
Return type:float
class evalmate.evaluator.KWSEvaluator(aligner=None)[source]

Class to retrieve evaluation results for a keyword spotting task.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use. If not given, the evalmate.alignment.BipartiteMatchingAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation
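
A sketch of computing the keyword-spotting metrics (assuming the imports from the earlier sketches; keyword values, times and the duration are illustrative):

    ref_ll = LabelList(labels=[Label('balloon', start=12.3, end=12.9)])
    hyp_ll = LabelList(labels=[Label('balloon', start=12.4, end=13.0)])

    kws_result = evaluator.KWSEvaluator().evaluate_label_lists(ref_ll, hyp_ll, duration=120.0)

    print(kws_result.keywords())                           # all keywords in the reference
    print(kws_result.false_rejection_rate())               # mean FRR over all keywords
    print(kws_result.false_alarm_rate(keyword='balloon'))  # FAR for a single keyword
    print(kws_result.term_weighted_value())                # TWV over all keywords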

ASR

class evalmate.evaluator.ASREvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of an automatic speech recognition task.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

class evalmate.evaluator.ASREvaluator(aligner=None)[source]

Class to retrieve evaluation results for an automatic speech recognition task.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use. If not given, the alignment.LevenshteinAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

static tokenize(ll, overlap_threshold=0.1)[source]

Tokenize a label-list and return a new label-list with a separate label for every token.
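
For example (a sketch, assuming the imports from the earlier sketches; the transcript is illustrative):

    transcript = LabelList(labels=[Label('the sun is shining', start=0.0, end=2.4)])
    tokens = evaluator.ASREvaluator.tokenize(transcript)
    print([label.value for label in tokens])   # one label per token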