Postdoctoral research fellow: Text data mining applied to heterogeneous and multilingual corpora

Isabelle Léglise, SEDYL
postdoctoral researcher
12 months, starting Dec. 2011/Jan. 2012.
24,000 € net/year
computational linguistics, data mining, manifold learning
Applications are invited for work on "Multifactorial Analysis of Language Contact & Language Change", which is part of a 10-year project, "Empirical foundations of linguistics".
It is a full-time position.


The candidate should have a PhD in computer science and be an expert in data mining, preferably applied to a linguistic domain (text mining, natural language processing) involving high-dimensional data/texts. The candidate should have experience with the XML format; knowledge of the TEI standards is a plus. She must know how to program in C, C++ or Java. She will use the relational database model and the SQL language; knowledge of MySQL is an advantage. An interest in linguistic diversity is also welcome.


The task consists of developing search and data mining functions for language contact corpora, that is, transcriptions of non-homogeneous and mixed verbal productions collected in multilingual areas (38 languages from all continents are involved). This scenario has traditionally received little attention from the algorithms of computational linguistics (grammatical inference or lexical labeling). We expect to find correlations between certain categories, or certain syntactic positions, and language contact or language change phenomena.
Given the large number of variables to be analyzed, and considering the size of the corpus (a large number of samples), we will need to explore dimensionality reduction approaches such as "manifold learning".

If you are interested, please send a CV (including a publication list), a letter of application and the names of two referees to:

Isabelle Léglise ( & Pascal Vaillant (