ALBAYZIN 2014 – SEARCH ON SPEECH EVALUATION

Evaluation organizers:

Javier Tejedor Noguerales, GEINTRA, Universidad de Alcalá, Spain


Doroteo Torre Toledano, ATVS, Universidad Autónoma de Madrid, Spain


Luis Javier Rodríguez Fuentes, Mikel Peñagarikano, Amparo Varona, Germán Bordel, Mireia Diez, GTTS (http://gtts.ehu.es), Universidad del País Vasco UPV/EHU, Spain


 

Evaluation description:

The ALBAYZIN 2014 Search on Speech evaluation involves searching audio content for a list of terms/queries. The evaluation focuses on retrieving the audio files that contain any of those terms/queries, and consists of four different tasks:

1)     Keyword Spotting, where the input to the system is a list of terms that is known when the audio is processed, so word-based recognizers can be used effectively to hypothesize detections.

2)     Spoken Term Detection (STD), where the input to the system is a list of terms (as in the Keyword Spotting task), but the list is unknown when the audio is processed. This is the same task as in the NIST STD 2006 evaluation [1] and the Open Keyword Search 2013 evaluation [2].

3)     Query-by-Example Spoken Term Detection (QbE STD), where the input to the system is an acoustic example per query, so no prior knowledge of the correct word/phone transcription of each query can be assumed. As in the STD task, systems must output the set of occurrences detected for each query in the audio files, along with their timestamps. This is the same task as that proposed in MediaEval 2011, 2012 and 2013 [3].

4)     Query-by-Example Spoken Document Retrieval (QbE SDR), where the input to the system is composed of several acoustic examples per query, so no prior knowledge of the correct word/phone transcription of each query can be assumed. Systems must output, for each of the provided queries, a score reflecting the probability that the query appears in each audio file; no timestamp information is required. Formally, given a spoken example of a query q and a spoken document x (whose transcriptions are unknown), a QbE SDR system must carry out some kind of detection procedure and output a score s ∈ R: the higher (the more positive) the score, the higher the likelihood that q appears in x. Note that systems are neither required to make a hard decision about whether or not q appears in x, nor to provide the time marks of the place (or places) where q appears. Systems are just required to produce a score, computed by automatic means with no human supervision (see the sketch after this list). This is the same task as that proposed in MediaEval 2014 Query-by-Example Search on Speech (QUESST) [4].
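
To make the task definition more concrete, the sketch below illustrates one possible (deliberately simple) detection procedure for a (query, document) pair: subsequence dynamic time warping over frame-level feature matrices (for instance, phone posteriorgrams), returning a single real-valued score. The feature representation, the cosine local distance and the length normalization are our own illustrative assumptions, not part of the evaluation specification; the evaluation only looks at the submitted scores, not at how they were produced.

    import numpy as np

    def frame_distances(query, doc):
        # Cosine distance between every query frame and every document frame.
        # query: (Tq, D) array, doc: (Td, D) array of frame-level features.
        q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
        d = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + 1e-8)
        return 1.0 - q @ d.T

    def qbe_sdr_score(query, doc):
        # Subsequence DTW: the match may start and end anywhere in the document.
        # Returns a score s in R; higher means the query is more likely to occur in the document.
        dist = frame_distances(query, doc)
        Tq, Td = dist.shape
        acc = np.full((Tq, Td), np.inf)
        acc[0, :] = dist[0, :]                      # free starting point in the document
        for i in range(1, Tq):
            for j in range(Td):
                best_prev = acc[i - 1, j]
                if j > 0:
                    best_prev = min(best_prev, acc[i, j - 1], acc[i - 1, j - 1])
                acc[i, j] = dist[i, j] + best_prev
        # Free end point; rough length normalization so scores are comparable across queries.
        return -np.min(acc[-1, :]) / Tq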

 

Database description:

For the first three tasks, the data provided by the organizers for system training, development and evaluation consist of recordings from the MAVIR workshops held in 2006, 2007 and 2008 (Corpus MAVIR 2006, 2007 and 2008), all of them containing speech in Spanish. However, any kind of data can be used for system training/development, provided that these data are fully described in the system description. For the Keyword Spotting and Spoken Term Detection tasks, the orthographic transcriptions of the selected list of terms, along with the occurrences and timestamps of each term in the training/development data, will be provided. For the Query-by-Example Spoken Term Detection task, the occurrences and timestamps of the queries in the training/development data will also be provided. For the test data, only the list of terms used for evaluation will be provided for the Keyword Spotting and Spoken Term Detection tasks, whereas for Query-by-Example Spoken Term Detection only the queries used for the evaluation will be provided.

For the Query-by-Example Spoken Document Retrieval task, a set of 1841 spoken documents, all of them extracted from TV broadcast news in Basque under diverse background conditions, with a total duration of around 3 hours (individual durations ranging from 3 to 30 seconds), will be used for detecting two different sets (development and test) of spoken queries (also in Basque), recorded in an office environment by a small set of male and female speakers. Each spoken query will typically consist of a single word but may also contain several words. The first set of spoken queries, along with the whole set of spoken documents, the ground-truth file and the scoring script, will be made available to participants at the time of releasing the training/development data. The set of test queries will be distributed at the time of releasing the evaluation data. Both sets will consist of about 100 queries, each featuring a basic example and two additional examples (from different speakers). Regardless of how the speech signals were originally recorded and stored, all signals will be converted to and supplied in WAV format: 16 kHz, single channel, 16 bits per sample.
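
Participants who want to verify that the supplied signals follow this format can do so with a few lines of standard Python (this check is our own illustration, not part of the official tools):

    import wave

    def check_wav_format(path):
        # Returns True if the file is 16 kHz, single channel, 16 bits per sample.
        with wave.open(path, "rb") as w:
            ok = (w.getframerate() == 16000
                  and w.getnchannels() == 1
                  and w.getsampwidth() == 2)        # 2 bytes = 16 bits per sample
            print(path, w.getframerate(), "Hz,", w.getnchannels(), "channel(s),",
                  8 * w.getsampwidth(), "bits:", "OK" if ok else "unexpected format")
        return ok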

 

System evaluation:

The Figure-of-Merit (FOM) for word spotting, as defined in the HTK Book [5], will be the primary metric for the Keyword Spotting task, whereas the Actual Term Weighted Value (ATWV) [1] will be the primary metric for the STD and QbE STD tasks. DET curves will also be computed for the STD and QbE STD tasks.
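
For orientation, ATWV averages the term-weighted value 1 - (P_miss(t) + beta * P_FA(t)) over all terms with at least one reference occurrence, where beta is derived from the cost/value ratio and the prior probability of a term [1]. The sketch below is only illustrative: it assumes the usual NIST parameter values (cost/value ratio 0.1, prior 1e-4) and approximates the number of non-target trials per term as one per second of searched speech. The official scoring tools will be used for the evaluation itself.

    def atwv(detections, ref_counts, speech_seconds,
             cost_value_ratio=0.1, prior=1e-4):
        # detections: dict term -> (number of correct detections, number of false alarms)
        # ref_counts: dict term -> number of true occurrences in the reference
        # speech_seconds: total duration of the searched speech, in seconds
        beta = cost_value_ratio * (1.0 / prior - 1.0)   # ~999.9 with the default values
        values = []
        for term, n_true in ref_counts.items():
            if n_true == 0:
                continue                                 # terms with no occurrences are ignored
            hits, false_alarms = detections.get(term, (0, 0))
            p_miss = 1.0 - hits / n_true
            p_fa = false_alarms / (speech_seconds - n_true)
            values.append(1.0 - (p_miss + beta * p_fa))
        return sum(values) / max(len(values), 1)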

In the Query-by-Example Spoken Document Retrieval task, system performance will be measured in a different way. The score s will be interpreted as a log-likelihood ratio and used to compute a normalized cross-entropy cost function, Cnxe, which will be the primary metric. Cnxe was originally defined for the MediaEval 2013 Spoken Web Search (SWS) evaluation [6], where it was used as a secondary metric. Given a set of spoken documents and a set of spoken queries, systems will have to provide scores for ALL the trials (q,x) or, alternatively, a default score for the missing trials. In the MediaEval 2013 SWS evaluation, using a default score led to worse results than using the scores provided by a QbE SDR system; therefore, though more costly, we recommend processing all the trials (if the applied technology allows for it). To allow for the analysis of system performance also when scores are collapsed into detection decisions, the Actual and Maximum Term Weighted Value (ATWV and MTWV) metrics [1][6], with reasonable prior and cost parameters, along with the corresponding DET curves, will also be computed as secondary metrics.
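
In other words, each submitted score is mapped through a prior-weighted logistic function to a posterior probability that the query occurs in the document, and Cnxe is the empirical cross-entropy of those posteriors against the ground truth, divided by the entropy of the prior (so a well-calibrated, informative system approaches 0 and an uninformative one approaches 1). The following minimal sketch follows the definition in [6]; the target prior used here is only an illustrative assumption, and the official scoring script will be the reference.

    import math

    def cnxe(target_llrs, nontarget_llrs, p_target=0.5):
        # target_llrs: scores of trials (q, x) where q actually occurs in x
        # nontarget_llrs: scores of trials where it does not
        # p_target: assumed prior probability of a target trial
        logit_prior = math.log(p_target / (1.0 - p_target))

        def nll2_target(llr):       # -log2 P(target | score)
            return math.log2(1.0 + math.exp(-(llr + logit_prior)))

        def nll2_nontarget(llr):    # -log2 P(non-target | score)
            return math.log2(1.0 + math.exp(llr + logit_prior))

        cxe = (p_target * sum(nll2_target(s) for s in target_llrs) / len(target_llrs)
               + (1.0 - p_target) * sum(nll2_nontarget(s) for s in nontarget_llrs) / len(nontarget_llrs))
        prior_entropy = -(p_target * math.log2(p_target)
                          + (1.0 - p_target) * math.log2(1.0 - p_target))
        return cxe / prior_entropy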

Participants may submit systems to the Keyword Spotting task, the Spoken Term Detection task, the Query-by-Example Spoken Term Detection task, the Query-by-Example Spoken Document Retrieval task, or to all of them. For Keyword Spotting, Spoken Term Detection and Query-by-Example Spoken Term Detection, participants are required to submit one primary system and may submit up to 4 contrastive systems for each task they take part in.

For Query-by-Example Spoken Document Retrieval, there will be two evaluation conditions, depending on the use of a single example per query (which MUST be the basic example, to allow for meaningful comparisons among systems) or the three available examples (which can be used in any way). These conditions, which will be identified as Single and Multiple, will allow us to check to what extent the availability of several query examples helps to improve system performance. Each registered participant can submit up to 5 result files in each condition. One of them (presumably, the best one) must be labelled as primary, and the remaining ones as contrastive (con1, con2, etc.). Participants will be ranked in this task according to the performance attained by their primary systems in the Single condition.

Due to the high similarity of the QbE STD and QbE SDR tasks (they mainly differ in the output format), participants involved in either task are strongly encouraged to submit their systems to both. This will enhance system comparison in each task.

When submitting system results, participants commit themselves to sending a description of the developed systems, including information about all the procedures applied, the databases and subsystems used, and the results available at the time of submission. The format templates and the deadlines for submitting the description papers will be made public by the Albayzin 2014 Evaluations Organizing Committee. Participants also commit themselves to presenting their systems at Iberspeech 2014, to be held in Las Palmas de Gran Canaria on November 19-21, 2014.

 

Registration process:

Interested groups must register for the evaluation before July 15th, 2014, by contacting the organizing team by email (with CC to the Chairs of Iberspeech 2014) and providing the following information:

    Research group (name and acronym)

    Institution (university, research center, etc.)

    Contact person (name)

    Email

 

Schedule:

•      June 30, 2014: Release of the training and development data.

•      July 15, 2014: Registration deadline.

•      September 3, 2014: Release of the evaluation data.

•      September 30, 2014: Deadline for submission of results and system descriptions.

•      October 15, 2014: Results distributed to the participants.

•      Iberspeech 2014 workshop: Official publication of the results.

The Search on Speech evaluation plan can be found here

References:

[1] http://www.itl.nist.gov/iad/mig/tests/std/2006/index.html

[2] http://www.nist.gov/itl/iad/mig/openkws13.cfm

[3] Florian Metze, Xavier Anguera, Etienne Barnard, Marelie Davel and Guillaume Gravier. "Language Independent Search in Mediaeval's Spoken Web Search Task". Computer Speech and Language, Special Issue on Information Extraction & Retrieval, 2014.

[4] http://multimediaeval.pbworks.com/w/page/79432139/QUESST2014

[5] http://htk.eng.cam.ac.uk/docs/docs.shtml

[6] Luis J. Rodriguez-Fuentes, Mikel Penagarikano, "MediaEval 2013 Spoken Web Search Task: System Performance Measures". Technical Report, Department of Electricity and Electronics, University of the Basque Country UPV/EHU, May 30, 2013. [Online: http://gtts.ehu.es/gtts/NT/fulltext/rodriguezmediaeval13.pdf]

 

Additional information