Analysing Discourse Automatically, with Multiple Objectives
Documents are not random sequences of sentences: spans of text are ordered and linked together to form coherent and meaningful documents, and this organization is called discourse structure. Developing systems that can interpret documents, make inferences over their content and extract structured information is an open challenge in natural language processing (NLP) that requires high-performing discourse parsers.
While great effort is currently devoted to building rich semantic representations of sentences, the document level is becoming the next frontier. Crossing sentence boundaries is crucial to provide the rich analysis needed for applications such as question answering, opinion mining, summarization or machine translation.
For example, identifying the semantic relations between the sentences in (1) – i.e. a Result between the first two sentences and an Explanation between this pair and the last sentence – could allow a system to answer complex questions such as: “Why did the train derail?” or “What were the consequences of the derailment?”. This task is hard because these relations are not marked explicitly: they are implicit with respect to the meaning of each separate sentence and thus have to be inferred from varied clues.
(1) [[ A train derailed. (Result) There are four wounded people. ] (Explanation) [ The driver was going too fast. ]]
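To make the structure in (1) concrete, here is a minimal Python sketch of how it could be stored as a small tree and queried. The class names (Segment, Relation), the helper functions and the convention that the second argument of a relation carries the result or the explanation are assumptions made for illustration only; they do not correspond to any particular discourse framework or parser output format.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Segment:
    """A leaf of the discourse structure: one span of text."""
    text: str


@dataclass
class Relation:
    """An internal node linking two arguments with a relation label."""
    label: str
    left: "Node"
    right: "Node"


Node = Union[Segment, Relation]


def leaves(node: Node) -> List[str]:
    """Return the text spans covered by a node, in document order."""
    if isinstance(node, Segment):
        return [node.text]
    return leaves(node.left) + leaves(node.right)


def find(node: Node, label: str) -> Optional[Relation]:
    """Depth-first search for the first relation carrying the given label."""
    if isinstance(node, Segment):
        return None
    if node.label == label:
        return node
    return find(node.left, label) or find(node.right, label)


# Example (1): the last sentence explains the pair formed by the first two.
derailment = Relation(
    "Result",
    Segment("A train derailed."),
    Segment("There are four wounded people."),
)
example = Relation(
    "Explanation",
    derailment,
    Segment("The driver was going too fast."),
)

# "Why did the train derail?": read the second argument of the Explanation.
why = find(example, "Explanation")
# "What were the consequences?": read the second argument of the Result.
consequence = find(example, "Result")
```

Under these assumptions, leaves(why.right) gives the explaining sentence and leaves(consequence.right) gives the consequence. The hard part, as noted above, is predicting the labels and attachments in the first place, not reading them off once they are available.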
However, the performance of discourse parsers is still low, and current work does not make clear where the problem lies: data representation, model architecture or problem modelling? Moreover, most studies focus on English monologues from news, which limits their use and, crucially, prevents evaluating their robustness and identifying their flaws. Finally, while there are several competing theoretical frameworks dedicated to discourse, empirical studies rarely try to inform theoretical models or to propose changes to them.
In this project, we propose a general framework, here called multi-objective learning, to tackle these issues with the aim of building robust and high-performing discourse parsers. Broadly speaking, we will build systems that pursue several goals at once, as a way to provide robust systems for multiple languages, domains and modalities, to investigate data representation and problem modelling, to improve evaluation, and to shed some light on the theoretical divergences.
Open positions
1-3 Master 2 internships
Positions are open for Master 2 internships on the following topics:
- multilingual discourse relations identification
- cross-formalisms analysis of discourse relations
- multilingual discourse parsing
- discourse parsing for dialogue via transfer learning
- analysis of discourse relations via unsupervised learning
Please contact me if you’re interested in an internship on one of these topics, or a related one.
Archives: The following positions are already filled
Postdoctoral (or engineer) position in NLP at IRIT, Toulouse (France) – ANR AnDiAMO
Developing systems towards robust discourse parsing and its application
- Contract duration: 18 months
- Starting date: March 2023
- Location: IRIT, Université Paul Sabatier (Toulouse III)
- Remuneration: starting at 2,745 euros, gross salary, depending on experience
- Application deadline: the position will remain open until filled
- Send application by email to: chloe.braud@irit.fr
Application procedure:
Please send a CV and a short letter motivating your application by detailing the following elements (incomplete applications will not be considered):
- indicate your skills in machine learning, e.g. the types of tasks you have already worked on, the types of algorithms and the libraries used. Please specify your experience with neural architectures and pre-trained language models.
- describe your interest and/or experience in natural language processing, i.e. the types of tasks you have already tried to solve, if any, or similar problems you have worked on, or why you now want to work in NLP and why you think your experience in another domain could be relevant
- if you are interested and hold a Master's or engineering degree rather than a PhD, and your CV fits the requirements, please send me an email with the same information as above
The AnDiAMO project:
Natural Language Processing (NLP) is a field at the frontier of AI, computer science and linguistics that aims to develop systems able to automatically analyze textual documents. Within NLP, discourse parsing is a crucial but challenging task: its goal is to produce structures describing the relationships (e.g. explanation, contrast…) between spans of text in full documents, making it possible to draw inferences about their content. Developing high-performing and robust discourse parsers could help improve downstream applications such as automatic summarization, translation, question answering and chatbots (e.g. [1,2,3]). However, current performance is still low, mainly due to the lack of annotated data (see e.g. [4] on monologues, [5] on dialogues, [6,7] for the multilingual setting).
In order to develop robust discourse parsers within the AnDiAMO project, we want to explore multi-objective settings, where the ultimate goal is to perform discourse analysis while relying on another related objective, such as performing well on another task (e.g. morphological, syntactic or temporal analysis) or on an application (e.g. sentiment analysis or argument mining). We will also explore cross-language and cross-framework learning.
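One common way to combine a main objective with related ones is a weighted sum of per-task losses. The sketch below illustrates that idea in plain Python, with no claim that it is the formulation used in the project. The function names, the auxiliary task name ("pos_tagging"), the probability values and the weight of 0.5 are all hypothetical.

```python
import math
from typing import Dict, List


def cross_entropy(probs: List[float], gold: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])


def multi_objective_loss(
    main_loss: float,
    aux_losses: Dict[str, float],
    aux_weights: Dict[str, float],
) -> float:
    """Main-task loss plus a weighted sum of auxiliary-task losses."""
    weighted_aux = sum(aux_weights[name] * loss for name, loss in aux_losses.items())
    return main_loss + weighted_aux


# Hypothetical predicted distributions for one training instance.
# Main task: discourse relation classification over three labels.
discourse_loss = cross_entropy([0.5, 0.25, 0.25], gold=0)
# Auxiliary task: a two-way tagging decision (illustrative only).
aux_losses = {"pos_tagging": cross_entropy([0.25, 0.75], gold=1)}
# Illustrative weight; in practice a hyperparameter to tune.
aux_weights = {"pos_tagging": 0.5}

total = multi_objective_loss(discourse_loss, aux_losses, aux_weights)
```

In a neural system the same sum would typically be computed over tensors, so that gradients from every task update the shared parameters. The weights control how strongly each auxiliary objective influences the representation learned for discourse analysis.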
Research plan:
The recruited candidate will work on one or several of the following topics, depending on their interests:
– Data representation: Discourse processing requires information from various levels of linguistic analysis. So far, existing studies have not made clear what kind of information is important and needed, and some potentially relevant sources of information are ignored. We plan to explore this issue within a multi-task learning setting, where a system has to jointly learn different tasks. We will experiment on classification tasks (discourse relation, segmentation) and on full discourse parsing.
– Transferring to new languages, domains and modalities: Developing systems that perform well on domains or languages different from those seen at training time is crucial, especially if the adaptation can be done in an unsupervised way. This is especially important for discourse, since annotation is very hard and time-consuming. We plan to experiment with cross-lingual embeddings and to explore multi-task learning, while trying to understand how to integrate additional linguistic information when only little annotated data is available for the auxiliary tasks. We also want to investigate dialogues, for which only a few discourse parsers exist, and to better understand how they differ from monologues.
– Extrinsic evaluation: We will investigate a few downstream applications that could benefit from discourse information, as a way to provide an extrinsic evaluation. We will explore pipeline systems, varying the way we encode the discourse information given as input to the end system. We will also explore transfer learning strategies, either via multi-task learning or representation learning. We plan to start with cognitive impairment detection (e.g. schizophrenia, Alzheimer's disease) and argument mining. More applications will be considered, depending on the interests of the recruited postdoc.
It will be possible to investigate other paths of research, such as few-shot or unsupervised learning, depending on the interest of the recruited candidate.
Profile
* PhD degree in computer science or computational linguistics
* Good knowledge of machine learning is required
* Interest in language technology / NLP
* Good programming skills: preferably with Python, knowledge of PyTorch is a plus
[1] Feng, X., Feng, X., Qin, B., and Geng, X. Dialogue Discourse-Aware Graph Model and Data Augmentation for Meeting Summarization. In Proceedings of IJCAI. 2019.
[2] Bawden, R., Sennrich, R., Birch, A., and Haddow, B. Evaluating Discourse Phenomena in Neural Machine Translation. In Proceedings of NAACL. 2018.
[3] Xu, J., Gan, Z., Cheng, Y., and Liu, J. Discourse-Aware Neural Extractive Text Summarization. In Proceedings of ACL. 2020.
[4] Koto, F., Lau, J. H., and Baldwin, T. Top-down Discourse Parsing via Sequence Labelling. In Proceedings of EACL. 2021.
[5] Liu, Z., and Chen, N. Improving Multi-Party Dialogue Discourse Parsing via Domain Integration. In Proceedings of the 2nd Workshop on Computational Approaches to Discourse. 2021.
[6] Braud, C., Coavoux, M., and Søgaard, A. Cross-lingual RST Discourse Parsing. In Proceedings of EACL. 2017.
[7] Liu, Z., Shi, K., and Chen, N. DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing. In Proceedings of the 2nd Workshop on Computational Approaches to Discourse. 2021.
Research engineer in NLP at IRIT, Toulouse (France) – ANR AnDiAMO
Data and software support for robust discourse parsing and its application
- Contract duration: 24 months
- Starting date: September 2022
- Location: IRIT, Université Paul Sabatier (Toulouse III)
- Remuneration: 2035-2630 euros per month, gross salary, depending on experience
- Application deadline: the position will remain open until filled
- Send application by email to: chloe.braud@irit.fr
Natural Language Processing (NLP) is a field at the frontier of AI, computer science and linguistics that aims to develop systems able to automatically analyze textual documents. Within NLP, discourse parsing is a crucial but challenging task: its goal is to produce structures describing the relationships (e.g. explanation, contrast…) between spans of text in full documents, making it possible to draw inferences about their content. Developing high-performing and robust discourse parsers could help improve downstream applications such as automatic summarization, translation, question answering and chatbots. However, current performance is still low, mainly due to the lack of annotated data.
In order to develop robust discourse parsers within the AnDiAMO project, we want to explore multi-objective settings, where the ultimate goal is to perform discourse analysis while relying on another related objective, such as performing well on another task (e.g. morphological, syntactic or temporal analysis) or on an application (e.g. sentiment analysis or argument mining). We will also explore cross-language and cross-framework learning.
The hired engineer will be in charge of:
– Setting up evaluation: setting up pipeline systems to evaluate downstream applications (e.g. sentiment analysis, question answering, argument mining…), and investigating different ways of using the discourse parsers' outputs to test the impact of discourse information;
– Corpus curation: collecting datasets for several tasks (e.g. POS tagging, syntactic parsing, temporality, modality…) and pre-processing them;
– Corpus harmonization: collecting existing discourse corpora and harmonizing them, following the format used for the DisRPT shared task (https://sites.google.com/georgetown.edu/disrpt2021/home?authuser=0).
The position is funded by the ANR AnDiAMO project, for which postdocs and master interns will also be recruited. Collaborations are planned with researchers in Toulouse, Grenoble, Nancy and Munich. The hired person will be part of the MELODI team at IRIT, participating in team and project meetings and co-authoring articles.
Profile
- Master or PhD degree in computer science or computational linguistics
- Interest in language technology / NLP
The recruited engineer should have good software development skills. Knowledge of machine learning would be a plus. In addition to the tasks above, it will be possible to investigate other directions, such as building multi-task learning architectures or testing few-shot learning strategies, according to the interests of the candidate.
Application
Please send a CV and a few lines explaining your interest in the position to chloe.braud@irit.fr
The position is funded by the ANR AnDiAMO project, for which an engineer and master interns will also be recruited. Collaborations are planned with researchers in Toulouse, Grenoble, Nancy and Munich. The hired person will be part of the MELODI team at IRIT, participating in team and project meetings and co-authoring articles.