
S4 - Interoperability and Standards

Chair: James Pustejovsky

Rapporteur: Nancy Ide

 

Introduction

Interoperability is probably one of the most important ingredients in the glue that allows integration, sharing, interchange and reuse of Language Resources and Technologies. Interoperability of resources, tools, and frameworks has recently been recognized as the most pressing need for language processing research. It is especially critical at this time because of the widely recognized need to create and merge information and annotations at different linguistic levels. Unfortunately, standards are often disregarded, ignored, or considered too time-consuming and effort-demanding to apply. The need for interoperability of language resources and tools is becoming ever more urgent, because the multilingual scenario in Europe and the newly emerging data and tools for strategic as well as minority languages need to be addressed. We will start by addressing the wider issue of interoperability and standards, as a first step towards full sharing, accessibility and availability of LRs (which is a core question of the US “sister” project INTEROP-SILT).

A specific area of standardization, namely resource documentation, will also be touched upon. Proper resource documentation is a prerequisite for the reuse of LRs. The contents, degree of detail, and structure of LR documentation currently vary widely, and there are no explicit guidelines for documenting LRs. Some projects in which similar resources were created for multiple languages (e.g. the projects in the SpeechDat family) have used a kind of template for the resource documentation. This has many beneficial effects: the information contained in the documentation of the different resources is the same; the degree of detail is similar; relevant information is easy to find, since the structure of the documentation is identical across resources, etc. But the documentation, even if uniformly structured as for the SpeechDat resources, is still just text (sometimes with semi-formalized parts such as lists and tables), suitable for humans but not for software programs. Another aspect of the problem is the issue of metadata. Metadata for a resource are a set of formalized data that describe properties of the resource, and can therefore be seen as a formalized part of the resource documentation. Indeed, much of the information that is usually put in the documentation should also occur in the metadata. Having the information in the metadata is crucial for searching and browsing facilities, and especially for enabling tools and services to be applied to the resource.
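
To illustrate why formalized metadata matters for searching and browsing, the following minimal sketch shows a metadata record and a simple search over a catalogue. The field names, the `ResourceMetadata` class, the `find_resources` function and the example resources are all hypothetical and are not taken from any existing metadata scheme; a real scheme would be considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceMetadata:
    """Hypothetical minimal metadata record for a language resource."""
    name: str
    resource_type: str            # e.g. "speech corpus", "lexicon"
    languages: tuple              # language codes, here ISO 639-3
    annotation_levels: tuple = ()


def find_resources(catalogue, language=None, resource_type=None):
    """Return the records that match every criterion that was given."""
    hits = []
    for record in catalogue:
        if language is not None and language not in record.languages:
            continue
        if resource_type is not None and record.resource_type != resource_type:
            continue
        hits.append(record)
    return hits


# Invented example entries, for illustration only.
catalogue = [
    ResourceMetadata("ExampleSpeechDB", "speech corpus", ("nld",), ("orthographic",)),
    ResourceMetadata("ExampleLexicon", "lexicon", ("nld", "eng"), ("pos",)),
    ResourceMetadata("ExampleTreebank", "text corpus", ("eng",), ("pos", "syntax")),
]

dutch = find_resources(catalogue, language="nld")
print([record.name for record in dutch])  # prints ['ExampleSpeechDB', 'ExampleLexicon']
```

The same query cannot be answered reliably from free-text documentation, which is the point made above: only the formalized part of the documentation is directly usable by software.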

 

Discussion, Objectives, FLaReNet Claims

Interoperability and Standards

In the current landscape, many different standards already exist, most of them de facto, others ratified by official standardization organizations, e.g. industry standards used for localization (XLIFF), translation memories (TMX) and terminology (TMF, TBX). The recent flurry of activities within the community aimed at defining standards for LRs and LTs is a response to the recognized urgency of interoperability of language resources and tools: current initiatives in the ISO TC37 SC4 working groups, such as the Linguistic Annotation Framework, the Syntactic Annotation Framework, the Semantic Annotation Framework and the Lexical Markup Framework, go in this direction.

Questions

4.1.  What’s needed now that the issue of interoperability and standardisation is particularly acute in the multilingual scenario?
4.2.  How can standard adoption be made more appealing and easier?
4.3.  How can players be shown the real advantages of standard adoption?
4.4.  Are the global efforts to create linked monolingual resources (wordnets and framenets) the right way to proceed?
4.5.  How can these success stories be extended to other types of resources?
4.6.  How can this interlinking be made operational in view of multilingual applications?

Standardized resource documentation

Using a documentation template and maximizing the information in the metadata will lead to many advantages:

    • Less work for the developer / documentation creator;
    • Less work for the user of the resource / documentation reader;
    • Easier re-use of the resource by humans;
    • Easier application of tools and services to the resource, in a technical language resource infrastructure or in a stand-alone configuration;
    • Better, more complete and more consistent documentation;
    • Higher quality resources;
    • Easier and faster validation of resources.

The above claims may be true, but the advantages are unlikely to materialize, because resource producers are often not keen to use the documentation template:

    • They forget about the documentation template and write the documentation in their own way;
    • They remember the documentation template only at the last moment, when they no longer have the time or the willingness to restructure their documentation;
    • They think they can produce better documentation without the template, or they think (rightly or wrongly) that their important and innovative work cannot be properly described by documentation that follows a fixed, pre-defined template;
    • They have already documented the resource on their web site, and do not want to convert this into a document to be included in the resource itself.

Questions

4.7.  Is it feasible to create a commonly agreed upon documentation template?
4.8.  What elements should such a documentation template contain?
4.9.  Are different templates needed for different resource types? If so, for which ones and why?
4.10.  How can we devise such templates, test them in practice and stimulate the use among resource creators?
4.11.  All parts of the documentation that should also be in the metadata should be kept out of the documentation and only be put in the metadata scheme.
4.12.  All other formalizable documentation that does not fit in the metadata scheme should be represented in a formal manner and kept out of the documentation (though the documentation will contain references to it). Possible examples are lists of possible values for attributes in the resource, lists of possible tags and their interpretation, etc. These should be stored in processable files (e.g. plain text files, but not PDF) in precisely specified locations.
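
As an illustration of claim 4.12, the sketch below assumes that a resource ships its tag list as a tab-separated plain text file with the columns `tag` and `interpretation`. The file layout, the location `doc/tagset.tsv`, the tags and the function names are all hypothetical. Once the list is stored in a processable form, a tool can check an annotated sample against it without any human reading the documentation.

```python
import csv
import io

# Hypothetical content of a processable tag list, e.g. stored at "doc/tagset.tsv".
# An in-memory string is used here so that the sketch is self-contained.
TAGSET_TSV = "tag\tinterpretation\nNOUN\tnoun\nVERB\tverb\nADJ\tadjective\n"


def load_tagset(stream):
    """Read a tab-separated tag list into a {tag: interpretation} mapping."""
    reader = csv.DictReader(stream, delimiter="\t")
    return {row["tag"]: row["interpretation"] for row in reader}


def unknown_tags(tokens, tagset):
    """Return, in sorted order, the tags used in `tokens` that are not in the tag list."""
    return sorted({tag for _, tag in tokens if tag not in tagset})


tagset = load_tagset(io.StringIO(TAGSET_TSV))

# Invented annotated sample: (word form, tag) pairs.
annotated = [("resources", "NOUN"), ("vary", "VERB"), ("widely", "ADVB")]
print(unknown_tags(annotated, tagset))  # prints ['ADVB']
```

The same check is not possible when the tag list exists only as a table in a PDF, which is why the claim insists on processable files in precisely specified locations.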

 

For the detailed program of the session see the Session 4 section.