evalmate.evaluator

This module implements the top-level functionality for performing the evaluation for the different tasks. For every task there is an evaluator (extending Evaluator) and an evaluation result (extending Evaluation). The evaluator is the class responsible for performing the evaluation, and the evaluation is its output, which contains the aligned labels/segments and, depending on the task, further data such as word confusions.
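
A minimal usage sketch of this pattern for a keyword-spotting task (the label-list classes are assumed to come from audiomate; keyword values and times are illustrative):

    from audiomate.annotations import Label, LabelList  # assumed dependency for labels
    from evalmate import evaluator

    # Reference (ground truth) and hypothesis (system output) for one utterance.
    ref = LabelList(labels=[Label('balloon', start=1.3, end=2.0)])
    hyp = LabelList(labels=[Label('balloon', start=1.4, end=2.1)])

    # The task-specific evaluator performs the evaluation ...
    result = evaluator.KWSEvaluator().evaluate(ref, hyp)

    # ... and the resulting Evaluation holds the aligned labels and metrics.
    print(result.get_report())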

Base

class evalmate.evaluator.Evaluation(ref_outcome, hyp_outcome)[source]

Base class for evaluation results.

Variables:
  • ref_outcome (Outcome) – The outcome of the ground-truth/reference.
  • hyp_outcome (Outcome) – The outcome of the system-output/hypothesis.
get_report(template=None)[source]

Generate and return a report.

Parameters:template (str) – Name of the Jinja2 template to use. If None, the default_template() is used. All available templates are in the report_templates folder.
Returns:The rendered report.
Return type:str
template_data

Return a dictionary that contains objects/values to use in the rendering template.

write_report(path, template=None)[source]

Write the report to the given path.

Parameters:
  • path (str) – Path to write the report to.
  • template (str) – Name of the Jinja2 template to use. If None, the default_template() is used. All available templates are in the report_templates folder.
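
For example (a sketch, assuming result is an Evaluation instance as in the sketch above; the output path is illustrative):

    # Render the report with the default template and print it.
    print(result.get_report())

    # Write the report to a file, optionally with a named template
    # from the report_templates folder.
    result.write_report('/tmp/evaluation_report.txt')
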
class evalmate.evaluator.Evaluator[source]

Base class for an evaluator.

Provides methods for reading outcomes in different ways. The evaluator for a specific task then has to implement do_evaluate, which performs the evaluation on the ref and hyp outcomes.

classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

evaluate(ref, hyp, label_list_idx=None)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth). There are different possibilities for the input (see the sketch below):

  • ref = Outcome / hyp = Outcome: Both ref and hyp are Outcome instances. See do_evaluate
  • ref = Corpus / hyp = dict: The dict contains label-lists which are compared against the corpus. See evaluate_label_lists_against_corpus
  • ref = LabelList / hyp = LabelList: Ref label-list is compared against the other. See evaluate_label_lists
Parameters:
  • ref (LabelList, Corpus, Outcome) – The reference: a label-list, a corpus or an outcome.
  • hyp (LabelList, dict, Outcome) – The hypothesis: a label-list, a dict of label-lists or an outcome.
  • label_list_idx (str) – The label-list to use when reading from a corpus.
Returns:The evaluation results.
Return type:Evaluation
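
A sketch of these input variants (the corpus loading, path and label-list index are illustrative and assume an audiomate corpus):

    from audiomate import Corpus
    from audiomate.annotations import Label, LabelList
    from evalmate import evaluator

    asr_evaluator = evaluator.ASREvaluator()

    # LabelList vs. LabelList
    ref_ll = LabelList(labels=[Label('the sun is shining')])
    hyp_ll = LabelList(labels=[Label('the son is shining')])
    result = asr_evaluator.evaluate(ref_ll, hyp_ll)

    # Corpus vs. dict of hypothesis label-lists, keyed by utterance-idx
    corpus = Corpus.load('/path/to/corpus')                      # illustrative path
    hypotheses = {'utt-1': LabelList(labels=[Label('the son is shining')])}
    result = asr_evaluator.evaluate(corpus, hypotheses, label_list_idx='word-transcript')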

evaluate_label_lists(ll_ref, ll_hyp, duration=None)[source]

Create an Evaluation for a reference and hypothesis label-list. If the duration is not provided, some metrics cannot be used.

Parameters:
  • ref (LabelList) – A label-list.
  • hyp (LabelList) – A label-list.
  • duration (float) – The duration of the utterance that the label-lists belong to.
Returns:The evaluation results.
Return type:Evaluation
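
For example (a sketch, assuming the label-lists from the sketch above; the duration is illustrative):

    # Providing the duration makes duration-dependent metrics available.
    result = evaluator.SegmentEvaluator().evaluate_label_lists(ref_ll, hyp_ll, duration=12.5)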

evaluate_label_lists_against_corpus(corpus, label_lists, label_list_idx=None)[source]

Create an Evaluation for the given corpus.

Parameters:
  • corpus (Corpus) – A corpus containing the reference label-lists.
  • label_lists (Dict) – A dictionary containing label-lists with the utterance-idx as key. The utterance-idx is used to find the corresponding reference label-list in the corpus.
  • label_list_idx (str) – The idx of the label-lists to use as reference from the corpus. If None, cls.default_label_list_idx is used.
Returns:The evaluation results.
Return type:Evaluation
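
For example (a sketch; the corpus path, utterance-ids and label values are illustrative):

    corpus = Corpus.load('/path/to/corpus')
    hypotheses = {
        'utt-1': LabelList(labels=[Label('music', start=0.0, end=4.2)]),
        'utt-2': LabelList(labels=[Label('speech', start=0.0, end=7.8)]),
    }
    result = evaluator.SegmentEvaluator().evaluate_label_lists_against_corpus(corpus, hypotheses)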

Outcome

class evalmate.evaluator.Outcome(label_lists=None, utterance_durations=None)[source]

An outcome represents the annotation/labels/transcriptions of a dataset/corpus for a given task. This can be either the ground truth/reference or the system output/hypothesis.

If no durations are provided, or durations for some utterances are missing, some methods may not work or may throw exceptions.

Variables:
  • label_lists (dict) – Dictionary containing all label-lists with the utterance-idx/sample-idx as key.
  • utterance_durations (dict) – Dictionary (utterance-idx/duration) containing the durations of all utterances.
all_values

Return a set of all values occurring in the outcome.

label_set()[source]

Return a label-set containing all labels.

label_set_for_value(value)[source]

Return a label-set containing all labels whose value equals the given value.

Parameters:value (str) – The value to filter.
Returns:Label-set containing all labels with the given value.
Return type:LabelSet
total_duration

Return the duration of all utterances together.

Notes

Only works if durations are provided for all utterances.
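
A sketch of building an Outcome manually (utterance-ids, labels and durations are illustrative):

    ref_outcome = evaluator.Outcome(
        label_lists={'utt-1': LabelList(labels=[Label('music', start=0.0, end=4.2)])},
        utterance_durations={'utt-1': 12.5},
    )

    print(ref_outcome.all_values)        # {'music'}
    print(ref_outcome.total_duration)    # works because all durations are provided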

class evalmate.evaluator.LabelSet(labels=None)[source]

Class to collect a group of labels. This is used to compute statistics over a defined set of labels.

For example, to compute the average length of all labels with the value ‘music’, we can collect these labels in a label-set and perform the computation.

count

Return the number of labels.

label_lengths

Return a list containing all label lengths.

length_max

Return the length of the longest label.

length_mean

Return the mean length of all labels.

length_median

Return the median of all label lengths.

length_min

Return the length of the shortest label.

length_variance

Return the variance of all label lengths.
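
For example (a sketch, assuming ref_outcome is an Outcome as constructed in the sketch above):

    music_labels = ref_outcome.label_set_for_value('music')

    print(music_labels.count)         # number of 'music' labels
    print(music_labels.length_mean)   # average label length
    print(music_labels.length_max)    # length of the longest 'music' label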

Segment

class evalmate.evaluator.SegmentEvaluation(ref_outcome, hyp_outcome, utt_to_segments)[source]

Result of an evaluation of a segment-based alignment.

Parameters:utt_to_segments (dict) – Dict of lists with evalmate.alignment.Segment. Key is the utterance-idx.

segments

Return a list of all segments (from all utterances together).

template_data

Return a dictionary that contains objects/values to use in the rendering template.

class evalmate.evaluator.SegmentEvaluator(aligner=None)[source]

Evaluation of an alignment based on segments.

Parameters:aligner (SegmentAligner) – An instance of a segment-aligner to use. If not given, the alignment.InvariantSegmentAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

static flatten_overlapping_labels(aligned_segments)[source]

Check all segments for overlapping labels. Overlapping means there are multiple reference or multiple hypothesis labels in a segment.

Parameters:aligned_segments (List) – List of segments.
Returns:List of segments where ref and hyp each contain a single label.
Return type:list
Raises:ValueError – A segment contains overlapping labels.
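
A sketch of a segment-based evaluation (assuming the imports from the earlier sketches; label values and times are illustrative):

    ref_ll = LabelList(labels=[Label('music', start=0.0, end=5.0),
                               Label('speech', start=5.0, end=10.0)])
    hyp_ll = LabelList(labels=[Label('music', start=0.0, end=5.5),
                               Label('speech', start=5.5, end=10.0)])

    seg_result = evaluator.SegmentEvaluator().evaluate(ref_ll, hyp_ll)
    print(len(seg_result.segments))   # aligned segments from all utterances
    print(seg_result.get_report())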

Event

class evalmate.evaluator.EventEvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of any event-based alignment.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

label_pairs

Return a list of all label-pairs (from all utterances together).

template_data

Return a dictionary that contains objects/values to use in the rendering template.

class evalmate.evaluator.EventEvaluator(aligner)[source]

Class to compute evaluation results for any event-based alignment.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation
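
A sketch of using the event-based evaluator with an explicit aligner (assuming ref_ll and hyp_ll from the earlier sketches, and assuming the BipartiteMatchingAligner can be constructed with default arguments; see evalmate.alignment for the available aligners):

    from evalmate import alignment, evaluator

    event_evaluator = evaluator.EventEvaluator(alignment.BipartiteMatchingAligner())
    event_result = event_evaluator.evaluate(ref_ll, hyp_ll)
    print(len(event_result.label_pairs))   # aligned label-pairs from all utterances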

KWS

class evalmate.evaluator.KWSEvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of a keyword spotting task.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

false_alarm_rate(keyword=None)[source]

The False Alarm Rate (FAR) is the percentage of detections where, according to the ground truth, no keyword is present. If no keyword is given, the mean FAR is calculated over all keywords. This rate is relative to the duration of all utterances.

To calculate this, we need to know the number of times a keyword could have been wrongly inserted. To approximate this value, we assume that every keyword takes one second.

Parameters:keyword (str) – If not None, only the FAR for this keyword is returned.
Returns:A rate between 0 and 1
Return type:float
false_rejection_rate(keyword=None)[source]

The False Rejection Rate (FRR) is the percentage of missed occurrences out of all occurrences in the ground truth. If no keyword is given, the mean FRR is calculated over all keywords.

Parameters:keyword (str) – If not None, only the FRR for this keyword is returned.
Returns:A rate between 0 and 1
Return type:float
keywords()[source]

Return a list of all keywords occurring in the reference outcome.

term_weighted_value(keyword=None)[source]

Computes the Term-Weighted Value (TWV).

Note

The TWV is implemented according to the OpenKWS 2016 Evaluation Plan.

Parameters:keyword (str) – If None, computes the TWV over all keywords, otherwise only for the given keyword.
Returns:The TWV, in the range -inf to 1
Return type:float
class evalmate.evaluator.KWSEvaluator(aligner=None)[source]

Class to retrieve evaluation results for a keyword spotting task.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use. If not given, the evalmate.alignment.BipartiteMatchingAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation
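
A sketch of computing the keyword-spotting metrics (assuming the imports from the earlier sketches; keyword values, times and the duration are illustrative):

    ref_ll = LabelList(labels=[Label('balloon', start=12.3, end=12.9)])
    hyp_ll = LabelList(labels=[Label('balloon', start=12.4, end=13.0)])

    kws_result = evaluator.KWSEvaluator().evaluate_label_lists(ref_ll, hyp_ll, duration=120.0)

    print(kws_result.keywords())                           # all keywords in the reference
    print(kws_result.false_rejection_rate())               # mean FRR over all keywords
    print(kws_result.false_alarm_rate(keyword='balloon'))  # FAR for a single keyword
    print(kws_result.term_weighted_value())                # TWV over all keywords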

ASR

class evalmate.evaluator.ASREvaluation(ref_outcome, hyp_outcome, utt_to_label_pairs)[source]

Result of an evaluation of an automatic speech recognition task.

Parameters:utt_to_label_pairs (dict) – Key is the utterance-id, value is a list of evalmate.alignment.LabelPair.

class evalmate.evaluator.ASREvaluator(aligner=None)[source]

Class to retrieve evaluation results for an automatic speech recognition task.

Parameters:aligner (EventAligner) – An instance of an event-aligner to use. If not given, the alignment.LevenshteinAligner is used.
classmethod default_label_list_idx()[source]

Define the default label-list index which is used when reading a corpus.

do_evaluate(ref, hyp)[source]

Create the evaluation result of the given hypothesis compared to the given reference (ground truth).

Parameters:
  • ref (Outcome) – The ground-truth/reference outcome.
  • hyp (Outcome) – The system-output/hypothesis outcome.
Returns:The evaluation results.
Return type:Evaluation

static tokenize(ll, overlap_threshold=0.1)[source]

Tokenize a label-list and return a new label-list with a separate label for every token.
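
For example (a sketch, assuming the imports from the earlier sketches; the transcript is illustrative):

    transcript = LabelList(labels=[Label('the sun is shining', start=0.0, end=2.4)])
    tokens = evaluator.ASREvaluator.tokenize(transcript)
    print([label.value for label in tokens])   # one label per token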