S1 - Broadening the Coverage, Addressing the Gaps

Chair: Joseph Mariani

Rapporteur: Khalid Choukri



HLT should ensure broad coverage of languages and of the major economic, social and cultural sectors by supplying numerous applications and technologies, which in turn must be fed with the necessary language resources (LRs: multimedia, multilingual, multimodal).

A thorough analysis of the existing resources, technology components, etc. should be carried out along various dimensions:

  1. Languages;
  2. Sectors of human activities: e.g. e-services (e-learning, e-government, e-tourism, etc.), mass and specialised services, audiovisual and Internet communications, information production and access;
  3. Technologies & Applications: Machine Translation, Human-Machine Interactions & Dialogue, Human-Human Communications, ML Information retrieval, access, summarisation, Subtitling, Audio-visual transcriptions and indexing, etc.;
  4. Modalities: text processing, speech & acoustics, sign languages, visual/video input/output, biometrics, combination of various modalities.


Discussion, Objectives, FLaReNet Claims

Many of the above-mentioned sectors may benefit from language technologies (LT) if awareness-raising is conducted more actively. However, a number of these sectors still lack suitable applications, even though current state-of-the-art technology could supply them (e.g. broadcast news transcription).

A number of gaps can be identified on the various dimensions:

    • Consider, for example, trends such as Statistical Machine Translation (SMT), in particular for the EU official languages (23 languages and 506 language pairs) and a few “interesting/lucrative” others (Chinese, Arabic): SMT is mostly based on European Union jargon (JRC-Acquis) and some technical manuals (Microsoft & Linux OS, etc.);
    • The same comments may apply to Speech-to-Text transcription (main focus on broadcast news in a few languages): what about other languages and other domains (e.g. conversational speech), and what should the size and content of the resources be?;
    • More generally, this applies to all LTs whose development can benefit from statistical approaches.
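
The figure of 506 language pairs quoted above for 23 languages follows if each translation direction is counted separately (23 × 22). The short sketch below only illustrates that arithmetic; the function name and the placeholder language labels are assumptions made for this example, not part of the session material.

```python
from itertools import permutations


def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n distinct languages."""
    return n_languages * (n_languages - 1)


# Placeholder labels standing in for the 23 official languages (hypothetical names).
languages = [f"lang{i:02d}" for i in range(23)]

# Enumerate every ordered (source, target) pair explicitly as a cross-check.
pairs = list(permutations(languages, 2))

print(len(pairs))                 # ordered pairs, by enumeration
print(directed_pairs(23))         # ordered pairs, by formula
print(directed_pairs(23) // 2)    # unordered pairs, if direction is ignored
```

Counting directions separately matters for resource planning, since a parallel corpus can serve both directions but a trained system or evaluation set is usually direction-specific.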

The objectives of this thread of discussion are to:

    • Identify gaps along the language, application, domain, sector and modality dimensions;
    • Devise means and strategies to support the development of missing LRs, especially for less developed countries and for regions, taking into account the general ecosystem (EC, Member-States, Regional governments, etc.);
    • Assess/reassess BLARK/ELARK, and tailor them to current application needs and domains and to the multilingual and multicultural landscape;
    • Suggest short- and medium-term actions: well-structured objectives, well-coordinated tasks assigned to identified parties of excellent reputation, evaluation and monitoring of progress.



Monitoring the landscape, identifying the gaps

1.1.  What are the gaps in our “scientific knowledge” relevant to the production of missing blocks?
1.2.  Where are the major gaps: gaps for application, lack of technology components, lack of LRs (language & domain)?
1.3.  What criteria should be considered for defining and prioritizing actions to address such gaps?
1.4.  How can we identify new sectors for deploying LRs & LTs?

Extending and updating BLARK/ELARK

1.5.  Starting from the definition of BLARK/ELARK, can we redefine and update these notions according to the current landscape?
1.6.  Do we need to establish current “baselines” per language, per technology, etc. with a clear picture of important barriers and threats?
1.7.  What are the needs/requirements/issues to tackle in order to ensure an accurate and efficient deployment of technologies for a given set of languages and domains?

Missing LRs

1.8.  What are the needs/requirements/issues to tackle in order to ensure fast prototyping for a given language and domain?
1.9.  How can we promote and accelerate the extension of work conducted on one language, or on a small set of languages, to a larger set?
1.10.  How can the development of missing LRs be supported?

Suggesting directions of action

1.11.  How can we identify/promote applications/technologies of “greatest exposure”? (Multilingualism?)
1.12.  How can we identify the sectors that can be “early/today” adopters, and how can we use them as window-dressing for HLT (high exposure)?
1.13.  How will we know we are making progress on addressing these gaps? (Program monitoring? Program evaluation?)
1.14.  How can we improve the management (enforcement?) of LR sharing & distribution (including from old projects)?
1.15.  What is an appropriate legal framework to foster the deployment of technologies and the successful sharing of LRs?
1.16.  How can we enhance the coordination of LR collection among all involved agencies and ensure efficiency (e.g. interoperability)?


