
S3 - Evaluation and Validation

Chair: Jan Odijk

Rapporteur: Joseph Mariani

 

Introduction

The topic of this session encompasses two major issues:

    • Evaluation and validation of the quality and quantity of Language Resources (LRs) produced for a given objective (conducting research, developing a product, etc.);
    • Evaluation of Language Technology (LT) and the production/distribution of the LRs necessary for developing and testing the corresponding LT.

 

LRs

Validation of an LR entails checking whether it has been created in accordance with its specification or documentation; it is an essential ingredient in assessing the quality of LRs. Validation is applied systematically in programmes and projects in which it is known in advance that LRs will have to be distributed to others, but it is still often neglected outside such a context. The focus so far has been on formal validation, for which a systematic approach has been developed and applied to a range of resources. Content validation has been applied in some cases, but only tentatively and in a somewhat ad hoc fashion, since, unlike for formal validation, there is no established methodology for it.
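To make the notion of formal validation concrete, the Python fragment below is a minimal sketch of a formal check against a hypothetical documented format: one token and one POS tag per line, tab-separated, UTF-8 encoded, with tags drawn from a fixed set. The file name, column layout and tag set are illustrative assumptions, not part of any actual validation protocol discussed in this session.

# Minimal sketch of a formal validation check, assuming a hypothetical
# tab-separated corpus format documented as: token<TAB>POS per line,
# UTF-8 encoded, with POS drawn from a fixed tag set. All names here
# (corpus.tsv, ALLOWED_TAGS) are illustrative only.

ALLOWED_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "PUNCT"}

def validate_corpus(path):
    """Return a list of (line_number, message) for every formal violation."""
    errors = []
    with open(path, encoding="utf-8") as handle:
        for lineno, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:          # blank lines separate sentences: allowed
                continue
            fields = line.split("\t")
            if len(fields) != 2:
                errors.append((lineno, f"expected 2 columns, got {len(fields)}"))
                continue
            token, tag = fields
            if not token:
                errors.append((lineno, "empty token field"))
            if tag not in ALLOWED_TAGS:
                errors.append((lineno, f"unknown POS tag {tag!r}"))
    return errors

if __name__ == "__main__":
    for lineno, message in validate_corpus("corpus.tsv"):
        print(f"line {lineno}: {message}")

Content validation, by contrast, would ask whether the tags are linguistically correct, which cannot be reduced to mechanical checks of this kind.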

In this session we intend to take stock of the situation around formal validation and to inventory new needs and trends in this respect, but we especially hope to dig a little deeper into the problems that content validation poses and how they can be overcome: Are there fundamental differences between formal and content validation, or can they be approached in the same manner? What elements are lacking to make a systematic approach to content validation possible? How can we ensure that such validation becomes a systematic ingredient in the production of LRs?

Evaluation of LRs relative to a certain objective (conducting research, developing a product, etc.) is an assessment of whether the LR is suited to that objective. Have the LRs produced in the last decades indeed been used for the objectives they were intended for? And were they successful? What can we learn from this for future resources? Can useful LRs be created without a specific objective in mind? Are resources created with such an objective in mind not too limited in scope, given the amount of effort and money invested in them? Are LRs with rather broad objectives, such as the BNC and the Spoken Dutch Corpus, useful, and for which objectives are they actually being used?

 

LT

In the area of LT evaluation, we have witnessed several developments over the past twenty years, starting with the DARPA initiative in the mid-1980s, which relied on NIST for conducting the evaluation campaigns and created the LDC for making the necessary LRs available. Starting from the evaluation of Automatic Speech Recognition systems, this approach was generalized to many areas of spoken and written language processing and to multimedia/multimodal data. Based on the same approach, several evaluation campaigns have been organized, e.g. CLEF (Europe), TREC (US), ACE (US), NTCIR (Japan), Senseval/Semeval, EVALDA (France), EVALITA (Italy), N-BEST (Netherlands and Flanders), etc.

In the area of Machine Translation, automatic evaluation methods have recently been proposed, together with the specific evaluation data they require and their associated metrics, in several variants (BLEU, NIST, TER, ...), but the question of how to measure translation quality is still open and a matter of discussion (see the recent NIST campaign on the evaluation of MT evaluation metrics). The same issue arises in other fields (e.g. Question Answering, with metrics such as RR, Q-measure, etc.).
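As a purely illustrative sketch of how such a metric is computed, the Python fragment below implements the reciprocal-rank (RR) idea mentioned above for Question Answering: each question scores 1/r, where r is the rank of the first correct answer returned by the system, and 0 if no returned answer is correct; averaging over questions gives the mean reciprocal rank. The example questions and answers are invented for illustration and do not come from any of the campaigns cited here.

# Minimal sketch of reciprocal rank (RR) and its mean over a test set.
# The example data below is invented purely for illustration.

def reciprocal_rank(ranked_answers, correct_answers):
    """RR for a single question, given the system's ranked answer list."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer in correct_answers:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average RR over (ranked_answers, correct_answers) pairs."""
    scores = [reciprocal_rank(ranked, gold) for ranked, gold in runs]
    return sum(scores) / len(scores) if scores else 0.0

if __name__ == "__main__":
    runs = [
        (["Paris", "Lyon"], {"Paris"}),        # correct at rank 1 -> 1.0
        (["1996", "1998", "1997"], {"1997"}),  # correct at rank 3 -> 1/3
        (["blue"], {"red"}),                   # no correct answer  -> 0.0
    ]
    print(f"MRR = {mean_reciprocal_rank(runs):.3f}")  # (1 + 1/3 + 0) / 3

Metrics such as BLEU, NIST or TER are of course more elaborate, but they share the same character: cheap, repeatable proxies whose agreement with human judgements of quality is itself a subject of evaluation.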

It is also important to consider the distribution of evaluation packages that make it possible for researchers and industry to evaluate the quality of their results after the evaluation campaign has ended.

 

Discussion, Objectives, FLaReNet Claims

In this workshop we want to reflect on these developments, share (good and bad) experiences and look to the future. Are there new needs, new trends?

 

Questions

LRs
3.1.  Which LR validation/evaluation methods are there already? Are such methods lacking for specific types of resources?
3.2.  Can formal validation and content validation be approached in the same manner, or are there fundamental differences between them?
3.3.  How can we measure the quality of LRs? What do we mean by quality?
3.4.  Have the LRs produced in the last decades indeed been used for the objectives they were intended for? And were they successful? What can we learn from this for future resources?
3.5.  Can useful LRs be created without a specific objective in mind?
3.6.  Are resources created with such an objective in mind not too limited in scope, given the amount of effort and money invested in them?
3.7.  Are LRs with rather broad objectives, such as the BNC and the Spoken Dutch Corpus, useful, and for which objectives are they actually being used?

LT
3.8.  Are the models used in the various evaluation campaigns the right ones? Are adaptations needed? Are there good experiences we should promote, or bad experiences that should be shared so that others can avoid them?
3.9.  Are there urgent needs for evaluation data or tools?
3.10.  Are there new trends or desires in methodologies for carrying out evaluation, and if so do they require new types of evaluation resources (data, tools, metrics)?
3.11.  Should the quality of the process of creating technology play a role in an overall evaluation, and if so, what recommendations can be made in this domain?
3.12.  What should be the business model attached to LT evaluation? Should it be fully or partly supported by public funds? Should it be handled by public organizations? Should it be a prerequisite for participation in public programmes? How can evaluation campaigns be ported to other languages?

 

For the detailed program of the session see the Session 3 section.