Analysing Discourse Automatically, with Multiple Objectives
Documents are not random sequences of sentences: spans of text are ordered and linked together to form coherent and meaningful documents. This organization is called discourse structure. Developing systems able to interpret documents, to make inferences over their content and to extract structured information is an open challenge in natural language processing (NLP) that requires high-performing discourse parsers.
While great effort is currently devoted to building rich semantic representations of sentences, the document level is becoming the next frontier. Crossing sentence boundaries is crucial to provide the rich analysis needed for applications such as question-answering, opinion mining, summarization or automatic translation.
For example, identifying the semantic relations between the sentences in (1) (i.e. a Result between the first two sentences and an Explanation between this pair and the last sentence) could allow a system to answer complex questions such as: “Why did the train derail?” or “What were the consequences of the derailment?”. This task is hard: these relations are not marked explicitly and cannot be read off the meaning of each sentence taken separately, so they have to be inferred from varied clues.
(1) [ [ A train derailed. ] (Result) [ There are four wounded people. ] ] (Explanation) [ The driver was going too fast. ]
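To make the analysis of example (1) concrete, here is a minimal Python sketch of how such a structure could be stored and queried. The class and function names (`Segment`, `Relation`, `text_of`, `find`) are illustrative assumptions, not part of any existing parser, and the tree is written by hand rather than predicted.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Segment:
    """A minimal span of text (here, one sentence)."""
    text: str


@dataclass
class Relation:
    """A labelled link between two units; `left` precedes `right` in the text."""
    label: str
    left: "Unit"
    right: "Unit"


# A discourse unit is either a single segment or a relation over smaller units.
Unit = Union[Segment, Relation]


def text_of(unit: Unit) -> str:
    """Concatenate the text covered by a unit, in document order."""
    if isinstance(unit, Segment):
        return unit.text
    return f"{text_of(unit.left)} {text_of(unit.right)}"


def find(unit: Unit, label: str) -> List[Tuple[str, str]]:
    """Return (left text, right text) for every relation carrying `label`."""
    if isinstance(unit, Segment):
        return []
    here = [(text_of(unit.left), text_of(unit.right))] if unit.label == label else []
    return here + find(unit.left, label) + find(unit.right, label)


# Hand-written analysis of example (1): a Result inside an Explanation.
derailment = Relation(
    "Explanation",
    Relation(
        "Result",
        Segment("A train derailed."),
        Segment("There are four wounded people."),
    ),
    Segment("The driver was going too fast."),
)

# "Why did the train derail?" -> second argument of the Explanation.
why = find(derailment, "Explanation")[0][1]
# "What were the consequences?" -> second argument of the Result.
consequence = find(derailment, "Result")[0][1]
```

The hard part of discourse parsing is predicting the tree itself; once it is available, answering the two questions reduces to a lookup of this kind.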
However, discourse parsing performance is still low, and current work does not make clear where the problem lies: data representation, model architecture or problem modelling? Moreover, most studies focus on English monologues from news, which limits their use and, crucially, prevents evaluating their robustness and identifying their flaws. Finally, while there are several competing theoretical frameworks dedicated to discourse, empirical studies rarely try to inform theoretical models or to propose changes to them.
In this project, we propose a general framework, here called multi-objective learning, to tackle these issues with the aim of building robust and high-performing discourse parsers. Broadly speaking, we will build systems that pursue several goals at once, as a way to obtain robust systems for multiple languages, domains and modalities, to investigate data representation and problem modelling, to improve evaluation, and to shed some light on the theoretical divergences.
Archives: The following positions are already filled
Open positions
1-3 Master 2 internships
Positions are open for Master 2 internships on the following topics:
- multilingual discourse relations identification
- cross-formalisms analysis of discourse relations
- multilingual discourse parsing
- discourse parsing for dialogue via transfer learning
- analysis of discourse relations via unsupervised learning
Please contact me if you’re interested in an internship on one of these topics, or a related one.
Internship 2024: Weak Supervision for Natural Language Processing (Discourse parsing), IRIT, Toulouse (France) – ANR AnDiAMO
This internship will be co-supervised by Chloé Braud and Philippe Muller, and the intern will work within the MELODI team at IRIT. The intern will take part in group meetings and reading groups, and will collaborate with other members of the project.
– Contract duration: 5-6 months
– Starting date: March 2024 (flexible)
– Location: IRIT, Université P. Sabatier (Toulouse III)
– Application deadline: 21 January 2024 or until position filled
– Send application by email to chloe.braud@irit.fr
Description of the project:
Natural Language Processing (NLP) is a subfield of Artificial Intelligence, at the interface of Computer Science, Machine Learning and Linguistics. Its ultimate goal is to build computational models of human languages. NLP is a science of data: current approaches based on machine learning algorithms rely on the availability of annotated corpora for their training and evaluation, all the more so for the currently dominant neural architectures, described as data-hungry. However, annotations are not available, or only in small quantities, for most languages or domains, and for specific high-level, semantic and pragmatic tasks. This leads to low performance and, more generally, to issues with robustness, when systems are unable to generalize to new situations. In this internship, we propose to explore Weak Supervision approaches to develop hybrid systems in order to tackle low-resource NLP.
Weak Supervision aims at automatically annotating large labeled sets without the need for seed gold instances. Many weak strategies have been applied to NLP, such as distant supervision, crowdsourcing or ensemble methods. All these approaches make it possible to leverage synthetic, noisy datasets and to improve performance in low-resource settings, but a key challenge is to understand how to combine them to enhance performance and coverage. To this end, we will explore the paradigm of Programmatic Weak Supervision (PWS) [Ratner et al. 2016, Zhang et al. 2021], which subsumes all weak supervision strategies while also dealing with conflicting and dependent rules and with noisy labels. We will apply this paradigm to discourse parsing, e.g. [Wang et al. 2017, Nishida and Matsumoto 2022], a high-level task that crosses sentence boundaries and a complex learning problem, typically requiring large amounts of annotations. Discourse parsing consists of building structures in which spans of text are linked by semantic-pragmatic relations such as Explanation or Contrast. It is a crucial task for many applications such as machine translation or question answering, but performance is, for now, low. In this internship, we will focus on discourse relation classification, while also evaluating the impact of the proposed approach on full parsing.
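As an illustration of the idea, the sketch below applies a few labeling functions to pairs of discourse arguments and combines their votes. It is a simplified stand-in, not the method of the cited papers: the three heuristics and their names are invented for the example, and plain majority voting replaces the learned label model that data programming uses to weigh conflicting and dependent rules.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None  # a labeling function returns ABSTAIN when it has no opinion
Label = Optional[str]
LabelingFunction = Callable[[str, str], Label]


def lf_because(arg1: str, arg2: str) -> Label:
    """Vote Explanation when the second argument opens with 'because'."""
    return "Explanation" if arg2.lower().startswith("because") else ABSTAIN


def lf_contrast_connective(arg1: str, arg2: str) -> Label:
    """Vote Contrast when the second argument opens with 'but' or 'however'."""
    first = arg2.lower().split()[0].strip(",") if arg2.split() else ""
    return "Contrast" if first in {"but", "however"} else ABSTAIN


def lf_negation_mismatch(arg1: str, arg2: str) -> Label:
    """Noisy heuristic: vote Contrast when exactly one argument contains 'not'."""
    has_not = ["not" in arg.lower().split() for arg in (arg1, arg2)]
    return "Contrast" if has_not[0] != has_not[1] else ABSTAIN


def majority_vote(votes: List[Label]) -> Label:
    """Ignore abstentions; abstain when nobody votes or when the top labels tie."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


def weak_label(arg1: str, arg2: str, lfs: List[LabelingFunction]) -> Label:
    """Apply every labeling function to one argument pair and aggregate."""
    return majority_vote([lf(arg1, arg2) for lf in lfs])


LFS = [lf_because, lf_contrast_connective, lf_negation_mismatch]

explicit = weak_label("The train derailed", "because the driver was going too fast", LFS)
conflict = weak_label("The train did not stop", "because the signal failed", LFS)
implicit = weak_label("A train derailed.", "There are four wounded people.", LFS)
```

In this toy setting, the pair with a clear connective receives a label, the pair on which two heuristics disagree is left unlabeled, and the implicit pair is not covered by any rule. Conflicts and limited coverage are precisely the issues a proper label model is meant to handle.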
Requirements:
– Master's degree in Computer Science / Natural Language Processing or equivalent
– Good knowledge in Machine Learning
– Good programming skills: preferably with Python, knowledge of PyTorch is a plus
Application procedure: please send a CV, your grades for the last 2 years and a short letter motivating your application by detailing the following elements:
– indicate your **skills / experience in machine learning**
– describe your **interest and/or experience in natural language processing**
More about AnDiAMO: https://www.irit.fr/~Chloe.Braud/andiamo/
Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., & Ré, C. (2016). Data programming: Creating large training sets, quickly. Advances in neural information processing systems, 29.
Wang, Y., Li, S., & Wang, H. (2017, July). A two-stage parsing method for text-level discourse analysis. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 184-188).
Nishida, N., & Matsumoto, Y. (2022). Out-of-Domain Discourse Dependency Parsing via Bootstrapping: An Empirical Analysis on Its Effectiveness and Limitation. Transactions of the Association for Computational Linguistics, 10, 127-144.
Zhang, J., Yu, Y., Li, Y., Wang, Y., Yang, Y., Yang, M., & Ratner, A. (2021, August). WRENCH: A Comprehensive Benchmark for Weak Supervision. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Postdoctoral (or engineer) position in NLP at IRIT, Toulouse (France) – ANR AnDiAMO
Developing systems towards robust discourse parsing and its application
- Contract duration: 18 months
- Starting date: March 2023
- Location: IRIT, Université Paul Sabatier (Toulouse III)
- Remuneration: starting at 2,745 euros, gross salary, depending on experience
- Application deadline: the position will remain open until filled
- Send application by email to: chloe.braud@irit.fr
Application procedure:
Please send a CV and a short letter motivating your application by detailing the following elements (incomplete applications will not be considered):
- indicate your skills in machine learning, e.g. the type of tasks you have already worked on, the type of algorithms, the libraries used. Please specify your experience with neural architectures and pre-trained language models.
- describe your interest and/or experience in natural language processing, i.e. the type of tasks you have already tried to solve, if any, or similar problems you have worked on, or why you now want to work in NLP and why you think your experience in another domain could be relevant
- If you are interested but hold a master's or engineering degree rather than a PhD, and your CV fits the requirements, please send me an email with the same information as above
The AnDiAMO project:
Natural Language Processing (NLP) is a domain at the frontier of AI, computer science and linguistics, aiming at developing systems able to automatically analyze textual documents. Within NLP, discourse parsing is a crucial but challenging task: its goal is to produce structures describing the relationships (e.g. explanation, contrast…) between spans of text in full documents, making it possible to draw inferences about their content. Developing high-performing and robust discourse parsers could help to improve downstream applications such as automatic summarization or translation, question-answering, chat bots, e.g. [1,2,3]. However, current performance is still low, mainly due to the lack of annotated data (see e.g. [4] on monologues, [5] on dialogues, [6,7] for the multilingual setting).
In order to develop robust discourse parsers within the AnDiAMO project, we want to explore multi-objective settings, where the goal is ultimately to perform a discourse analysis, but relying on another related objective such as performing well on another task (e.g. morphological, syntactic or temporal analysis), or an application (e.g. sentiment analysis or argument mining). We will also explore the issues of cross-language and cross-framework learning.
Research plan:
The recruited candidate will work on one or several of the following topics, depending on their interests:
– Data representation: Discourse processing requires information from various levels of linguistic analysis. For now, existing studies do not make it clear what kind of information is important and needed, and some potentially relevant sources of information are ignored. We plan to explore this issue within a multi-task learning setting, where a system has to jointly learn different tasks. We will experiment on classification tasks (discourse relation, segmentation) and on full discourse parsing.
– Transferring to new languages, domains and modalities: Developing systems that perform well on domains or languages different from those used at training time is crucial, especially if the adaptation can be done in an unsupervised way. It is especially important for discourse, since annotation is very hard and time-consuming. We plan to experiment with cross-lingual embeddings and to explore multi-task learning, while trying to understand how to integrate additional linguistic information with only little annotated data for auxiliary tasks. We also want to investigate dialogues, for which only a few discourse parsers exist, and better understand how they differ from monologues.
– Extrinsic evaluation: We will investigate a few downstream applications that could benefit from discourse information, as a way to give an extrinsic evaluation. We will explore pipeline systems, varying the way we encode the discourse information as input to our end system. We will also explore transfer learning strategies, either via multi-task learning or representation learning. We plan to start with cognitive impairment detection (e.g. schizophrenia, Alzheimer's disease) and argument mining. More applications will be considered, depending on the interest of the recruited postdoc.
It will be possible to investigate other paths of research, such as few-shot or unsupervised learning, depending on the interest of the recruited candidate.
Profile
* PhD degree in computer science or computational linguistics
* Good knowledge in Machine Learning is required
* Interest in language technology / NLP
* Good programming skills: preferably with Python, knowledge of PyTorch is a plus
[1] Feng, X., Feng, X., Qin, B., and Geng, X. Dialogue Discourse-Aware Graph Model and Data Augmentation for Meeting Summarization. In Proceedings of IJCAI. 2021.
[2] Bawden, R., Sennrich, R., Birch, A., and Haddow, B. Evaluating Discourse Phenomena in Neural Machine Translation. In Proceedings of NAACL. 2018.
[3] Xu, J., Gan, Z., Cheng, Y., and Liu, J. Discourse-Aware Neural Extractive Text Summarization. In Proceedings of ACL. 2020.
[4] Koto, F., Lau, J. H., and Baldwin, T. Top-down Discourse Parsing via Sequence Labelling. In Proceedings of EACL. 2021.
[5] Liu, Z., and Chen, N. Improving Multi-Party Dialogue Discourse Parsing via Domain Integration. In Proceedings of the 2nd Workshop on Computational Approaches to Discourse. 2021.
[6] Braud, C., Coavoux, M., and Søgaard, A. Cross-lingual RST Discourse Parsing. In Proceedings of EACL. 2017.
[7] Liu, Z., Shi, K., and Chen, N. DMRST: A Joint Framework for Document-Level Multilingual RST Discourse Segmentation and Parsing. In Proceedings of the 2nd Workshop on Computational Approaches to Discourse. 2021.
Research engineer in NLP at IRIT, Toulouse (France) – ANR AnDiAMO
Data and software support for robust discourse parsing and its application
- Contract duration: 24 months
- Starting date: September 2022
- Location: IRIT, Université Paul Sabatier (Toulouse III)
- Remuneration: 2035-2630 euros per month, gross salary, depending on experience
- Application deadline: the position will remain open until filled
- Send application by email to: chloe.braud@irit.fr
Natural Language Processing (NLP) is a domain at the frontier of AI, computer science and linguistics, aiming at developing systems able to automatically analyze textual documents. Within NLP, discourse parsing is a crucial but challenging task: its goal is to produce structures describing the relationships (e.g. explanation, contrast…) between spans of text in full documents, making it possible to draw inferences about their content. Developing high-performing and robust discourse parsers could help to improve downstream applications such as automatic summarization or translation, question-answering, chat bots. However, current performance is still low, mainly due to the lack of annotated data.
In order to develop robust discourse parsers within the AnDiAMO project, we want to explore multi-objective settings, where the goal is ultimately to perform a discourse analysis, but relying on another related objective such as performing well on another task (e.g. morphological, syntactic or temporal analysis), or an application (e.g. sentiment analysis or argument mining). We will also explore the issues of cross-language and cross-framework learning.
The hired engineer will be in charge of:
– Evaluation setup: set up pipeline systems for the evaluation of downstream applications (e.g. sentiment analysis, question-answering, argument mining…), and investigate different ways of using the discourse parsers' outputs to test the impact of discourse information;
– Corpus curation: collect datasets for several tasks (e.g. POS tagging, syntactic parsing, temporality, modality…) and pre-process them;
– Corpus harmonization: collect existing discourse corpora and harmonize them, following the format used for the DisRPT shared task (https://sites.google.com/georgetown.edu/disrpt2021/home?authuser=0)
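As a rough illustration of the harmonization step, the sketch below maps corpus-specific relation labels to a shared inventory and writes the result as tab-separated text. Everything specific here is an assumption made for the example: the corpus names, the label mapping and the five-column layout are hypothetical and do not reproduce the actual DisRPT format, which should be taken from the shared task documentation.

```python
import csv
import io
from typing import Dict, Iterable, List

# Hypothetical mapping from corpus-specific labels to a shared inventory.
LABEL_MAP: Dict[str, Dict[str, str]] = {
    "corpus_a": {"cause-result": "causal", "antithesis": "contrast"},
    "corpus_b": {"Contingency.Cause": "causal", "Comparison.Contrast": "contrast"},
}

# Hypothetical output columns (not the official shared task layout).
FIELDS = ["doc", "unit1_txt", "unit2_txt", "orig_label", "label"]


def harmonize(corpus: str, rows: Iterable[dict]) -> List[dict]:
    """Rewrite each relation instance with a shared label, keeping the original one."""
    harmonized = []
    for row in rows:
        mapped = LABEL_MAP[corpus].get(row["label"])
        if mapped is None:
            # Fail loudly rather than silently dropping or guessing a label.
            raise KeyError(f"unmapped label {row['label']!r} in {corpus}")
        harmonized.append({
            "doc": row["doc"],
            "unit1_txt": row["unit1"],
            "unit2_txt": row["unit2"],
            "orig_label": row["label"],
            "label": mapped,
        })
    return harmonized


def to_tsv(rows: Iterable[dict]) -> str:
    """Serialize harmonized rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


sample = [{
    "doc": "d1",
    "unit1": "A train derailed.",
    "unit2": "The driver was going too fast.",
    "label": "Contingency.Cause",
}]
harmonized_rows = harmonize("corpus_b", sample)
tsv_text = to_tsv(harmonized_rows)
```

Keeping the original label next to the harmonized one makes the mapping reversible and easier to audit when corpora annotated under different frameworks are merged.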
The position is funded by the ANR AnDiAMO project, for which postdocs and master interns will also be recruited. Collaborations are planned with researchers in Toulouse, Grenoble, Nancy and Munich. The hired person will be part of the MELODI team at IRIT, participating in team and project meetings, and co-authoring articles.
Profile
- Master's or PhD degree in computer science or computational linguistics
- Interest in language technology / NLP
The recruited engineer should have good software development skills. Knowledge of machine learning would be a plus. In addition to these tasks, it will be possible to investigate other paths, such as building multi-task learning architectures or testing few-shot learning strategies, according to the interests of the candidate.
Application
Please send a CV and a few lines explaining your interest in the position to chloe.braud@irit.fr
The position is funded by the ANR AnDiAMO project, for which an engineer and master interns will also be recruited. Collaborations are planned with researchers in Toulouse, Grenoble, Nancy and Munich. The hired person will be part of the MELODI team at IRIT, participating in team and project meetings, and co-authoring articles.
Profile
- PhD degree in computer science or computational linguistics
- Good knowledge in Machine Learning
- Interest in language technology / NLP
- Good programming skills: preferably with Python, knowledge of PyTorch is a plus
Application
Please send a CV and a few lines explaining your interest in the position to chloe.braud@irit.fr