
S2 - Automatic and Innovative Means of Acquisition, Annotation, Indexing

Chair: Stelios Piperidis

Rapporteur: Núria Bel



Contemporary methods for language technology (LT) R&D rely more than ever on the deployment of appropriate language resources (LRs). The most promising LTs, although language-independent in themselves, are nonetheless inherently tied to language-dependent knowledge in the form of LRs. This paradigm shift, effected in the late eighties, applies today to almost all areas of LT: from speech recognition and synthesis, to technologies that convert unstructured information (textual or multimedia) into structured information by means of a range of information extraction techniques, to contemporary methods for developing machine translation technologies. Components and tools enabling the development of these applications rely heavily on LRs, such as lexical resources, annotated or unannotated corpora, and ontologies, depending on the learning technique adopted. On the multilinguality and machine translation front, the success of statistical machine translation makes multilingual resources an absolutely indispensable requirement.

In turn, LTs can be seen as a source of competitive advantage, especially if they are considered general-purpose technologies that can add value to most ICT products dealing with language in whatever manifestation. But multilingual technologies are located on the production side of the economic equation. They are intermediate products used to produce final goods and services, and therefore they are valued for what they actually do. And what LT-based applications currently do is hampered by the fact that they eventually fail when they need to cover a new word, a new domain, or a new language. An additional challenge to the robustness, coverage and performance of the tools and applications mentioned above comes from current language use in web communities, social networks, blogs and the like. Moreover, language on the web and on other information and communication platforms (radio, TV, etc.), which are converging today through advances in telecommunications engineering, is tightly interlinked with other media, notably images, video and sound. The unavailability of appropriate resources hinders system and application development and full deployment. It is therefore of the utmost importance to develop and deploy methods for the automatic construction, linking and repurposing of new and existing LRs that can satisfy this demand.


Discussion, Objectives, FLaReNet Claims

Automatic techniques and methods for speeding up the resource-building process include building purpose-specific tools with the latest technologies in automation and machine learning, ranging from specifically trained web crawlers for parallel data identification and automatic multilevel aligners to ontology learners, lexical mining tools, etc. Issues to discuss in this respect are the relationship between the quantity and quality of language data and its impact on technologies, and the possible improvement achieved by the current results of automated production of LRs. Current methods vary in performance, but are they mere experiments, or should they be considered close to production techniques?

Besides, the rapid evolution of collaborative methods and networks is gradually shaping a new paradigm for the creation and development of knowledge repositories. As particular types of knowledge repositories, language resources too could take advantage of methods for the collective construction of knowledge by non-specialist volunteers. This calls for creative thinking about radically different modalities of resource creation and deployment, demands new technological solutions, and opens up unforeseen scenarios of resource validation. What are the experiences with such approaches so far? Are they indeed successful in terms of production quantity?

Further challenges to the LR and LT field emerge today as requirements from developers of cognitive and robotic systems. Not infrequently, innovation in cognitive and robotics applications requires new types of LRs, or sometimes a different view of the content of existing LRs (for instance, the lexicon of natural language), such that a kind of language resource re-engineering is necessary.



Drawing on the above, the following issues are to be elaborated:
2.1.  What are the current methods for automatically building/linking/repurposing LRs? Are such techniques usable already on a large scale, or are they still research efforts?
2.2.  Regarding the automatic creation of resources: does quantity equal quality eventually?
2.3.  Are there success stories we can learn from and build on for the future?
2.4.  What is the future target of LR acquisition/annotation? Can we set priorities?
2.5.  Are the existing resources suitable for the development of current applications, systems etc.?
2.6.  What is missing in the current picture? What new types of resources are necessary?
2.7.  Is the cost of ensuring interoperability / compatibility worth paying? How can it be quantified?
2.8.  Is the interlinking between individual monolingual resources for the development of multilingual resources a viable solution? What does it entail?
2.9.  While LT has been traditionally developed for processing well-formed language (text), language use on the web is largely “unregulated”: how effectively can we process today the language used on the web, in blogs, in chat rooms and other web-based forums?
2.10.  Can LRs be used by other disciplines and new areas, for instance cognitive systems and robots? Can we repurpose them? What types of LRs and LT does the contemporary intelligent humanoid robot need to acquire or improve its linguistic capacity?
2.11.  What does the linking between different media (language, images, video, sounds) entail?
2.12.  Will LRs be optimized in the future, taking new shapes as a result of automation processes?
2.13.  On the other hand, how can the web help in delivering quality language data and annotations of such data?
2.14.  How can Web 2.0 and collective intelligence be used in the production of LRs? What role can the now popular social networks play in the acquisition of language use data?
2.15.  Can automatic LR production participate in and contribute to Web 3.0?
2.16.  What are the experiences with social/collaborative approaches so far? Are they indeed successful in terms of production quantity and quality?
2.17.  How are people encouraged to participate? Why is it in their interest?
2.18.  Can collaborative techniques be used to do annotation in the area of speech and language resources? For example, can we devise an entertaining web-based game that will yield large quantities of high-quality phonetic transcriptions for names, or that will yield annotations to put a word in the right place in a WordNet-like database?
2.19.  Are there any properties of the successful approaches that should be taken into account? (For example, in picture tagging there is no absolute standard and many different tags are appropriate, so the fact that multiple independent individuals come up with the same tag for a picture is by itself evidence of its usefulness as a tag for that picture.)
2.20.  Can collaborative web data be exploited in order to derive new types of language resources and if so, how?
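
The agreement property mentioned in 2.19 can be illustrated with a minimal sketch: a tag is kept only if several independent contributors proposed it. The function name `agreed_tags`, the input format (contributor id mapped to a list of tags) and the `min_votes` threshold are assumptions made for illustration; they do not describe any particular tagging system.

```python
from collections import Counter


def agreed_tags(annotations, min_votes=2):
    """Return the tags proposed independently by at least `min_votes` contributors.

    `annotations` maps a contributor id to the tags that contributor proposed
    for a single item (hypothetical input format, assumed for this sketch).
    """
    votes = Counter()
    for tags in annotations.values():
        # Normalise case and whitespace; the set ensures each contributor
        # counts at most once per tag.
        votes.update({tag.strip().lower() for tag in tags})
    return {tag: count for tag, count in votes.items() if count >= min_votes}


# Invented example data: three contributors tagging the same picture.
example = {
    "contributor_a": ["dog", "beach", "sunset"],
    "contributor_b": ["Dog", "sea"],
    "contributor_c": ["dog", "beach "],
}
print(agreed_tags(example))
```

Raising `min_votes` trades coverage for confidence: fewer tags survive, but each surviving tag has more independent support. The same idea could in principle be applied to collaboratively produced transcriptions or lexical annotations, although what counts as "the same" answer would need a task-specific definition.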


For the detailed program of the session see the Session 2 section.