Strand 6

Language Resources (coordinator: Cédric Gendrot, LPP-Paris3)

This strand involves over 60 participants from all LabEx teams and interacts with all other strands. Its objectives are threefold:

Defining and controlling a language resource policy within the LabEx (setting up an inventory of existing and required resources, encouraging resource developers to use standards and develop mappings with other resources, defining a distribution policy centered around free availability at least for research purposes);

Designing and implementing advanced techniques for language resource development that help increase the accuracy, coverage and development speed of resources (techniques for resource-scarce languages, techniques for linguistic knowledge transfer between closely related languages, collaborative techniques for resource development);

Applying the two preceding objectives to the development of language resources for French, which is naturally a language of choice for the LabEx, and for other languages from various typological families that are of particular interest within the LabEx, e.g., for linguistic, social or application-related reasons.

It should be emphasized that the importance of language resources is not limited to academic studies. Developing language resources has a strong social impact, the nature of which often depends on the language involved: (i) developing cooperation and knowledge transfer towards developing countries (Mauritian Creole, Afroasiatic languages…); (ii) preserving linguistic diversity and teaching the languages of France (e.g., Western Armenian, languages of French Guiana…); (iii) providing resources for NLP systems used in business and military intelligence (e.g., Iranian languages, Mandarin Chinese, Arabic…) and for other widespread NLP applications, such as machine translation (French, Mandarin Chinese, etc.).

Common tools for language resources: This sub-strand will provide a repository for listing and distributing both existing and future resources in the LabEx, in coordination with all other strands and in particular with Strand 3 (an online inventory of existing and future resources will be created and maintained). A particular effort will be devoted to standardization, distribution and dissemination issues, in particular by promoting free availability of all resources, at least for research purposes. Another major goal of this sub-strand, in collaboration with resource developers and users, is to develop adequate tools for validating, editing and annotating all kinds of language resources, as well as tools for extracting linguistic information from them. Finally, this sub-strand will play a key role in training researchers to develop and exploit language resources.

Designing semi-automatic techniques for resource development: For most languages, no usable language resources are available, although such resources are the basis for experimental linguistics studies and NLP tools. Developing them, however, is very costly, which makes semi-automatic techniques for resource development a crucial area of research. This sub-strand will therefore be in charge of advancing the state of the art in algorithmic, formal and practical models for reducing the cost of language resource development as much as possible. A particular effort will be made towards the development of basic language resources, i.e., morphological lexicons and part-of-speech taggers as well as speech corpora. Specific techniques will also be developed to take advantage of situations where the language concerned is closely related to another one for which language resources already exist. Moreover, although human intervention will be reduced as much as possible, an important objective of this sub-strand will be to understand how to optimize it, in particular by means of collaborative techniques (wikis, online games, crowdsourcing platforms such as Mechanical Turk).

Developing new language resources: Semantic resources are costly to develop, and very few of them exist for French, although research in linguistics, psycholinguistics and NLP would strongly benefit from large-scale semantic resources, as shown by recent work on English. One of our goals is to annotate particular corpora, such as the French TreeBank, with a large variety of semantic annotation layers on top of the existing morphosyntactic and syntactic layers, namely anaphora, (co)reference, named entities, FrameNet and discourse information. Another goal is to produce sizable new oral resources (e.g., a MapTask for French in Strand 2, acquisition corpora in Strand 4, learner corpora and a phonological speech database in Strand 1), in addition to existing ones. Third, LabEx members will pursue their work on studying French and equipping it with resources from a historical perspective (resources for Medieval French, resources created from historical grammars and other linguistic texts). Sizable resources will also be developed for more than 20 languages that are of particular interest to LabEx WPs, in collaboration with other strands, especially Strand 3. As new needs emerge, new projects will be set up.

The research operations of Strand 6 are listed below:

 

·LR-1 A joint approach to language resources development (resp. C. Plancq) LLF, Alpage, LPP-P3, LLACAN, Lacito, SeDyL

  • LR-2 Designing semi-automatic techniques for resource development

·LR-2.1 Techniques for resource-scarce languages (resp. B. Sagot) Alpage, LLF, LPP-P3

·LR-2.2 Techniques for transferring lexical resources from one language to a closely related one (resp. B. Sagot) Alpage, LLF, MII, LPP-P3, SeDyL

·LR-2.3 Techniques for speech corpora (resp. M. Adda-Decker) LPP-P3, LACITO, SeDyL

 

  • LR-3 Developing new resources for French

·LR-3.1 Instantiating a French FrameNet and annotating the French TreeBank with FrameNet (resp. M. Candito) Alpage, LLF, Lattice

·LR-3.2 Adding annotation layers in the French Treebank for anaphora, (co)reference, named entities and discourse structures (resp. L. Danlos) Alpage, LLF, Lattice

·LR-3.3 A multilayer meta-lexicon for French: developing mappings between existing resources (resp. B. Sagot) Alpage, LPNCog, LPP-P5

·LR-3.5 Acoustic and physiological data for multi-sensor investigation of normal speech (resp. J. Vaissière) LPP-P3, LLF

·LR-3.6 Development of non-standard speech corpora: learner speech in French and English (resp. E. Delais) LLF, LPP-P3

·LR-3.7 Pathological speech: acoustic, perceptual and physiological data (resp. C. Fougeron) LPP-P3, LLF

·LR-3.8 Longitudinal corpus of spoken French: acoustic, perceptual and physiological data (resp. C. Gendrot) LPP-P3, LLF

 

  • LR-4 Developing resources for various languages

·LR-4.1 Developing morphological and syntactic resources for Western Iranian languages (resp. P. Samvelian)

·LR-4.3 Linguistic resources for Mandarin Chinese (resp. C. Saillard) LLF, Alpage

·LR-4.6 Towards a Treebank for Mauritian Creole (resp. F. Henri) LLF, Alpage

·LR-4.8 Text corpora for Manding languages (Bambara, Maninka) (resp. V. Vydrin)

·LR-4.9 A historical perspective on language resources and linguistic traditions (resp. S. Archaimbault)

·LR-4.10 BdD PluriL – Base de Données Plurilingues (resp. I. Léglise) SeDyL, LACITO, LLF, LLACAN

·LR-4.11 Automatic paradigm generation and language description (resp. G. Jacques) CRLAO, Alpage, LLF, HTL

·LR-4.12 Resource acquisition for Hausa: ResHau (resp. B. Crysmann) LLF, Alpage, LLACAN