# Workshop 3: List of Abstracts

**Hachem Kadri,** *Kernel Methods for Structured Outputs*

Kernel methods are a popular family of statistical learning methods that exploit training data through the implicit definition of a similarity between data points. This talk will start by explaining the principles of this class of learning algorithms and discussing some of their advantages and disadvantages. It will then explain how these methods can be extended to handle structured outputs. The talk will concentrate on conceptual rather than technical issues.
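As a concrete illustration of the "implicit similarity" principle, the sketch below fits kernel ridge regression with a Gaussian (RBF) kernel: the model never looks at raw feature coordinates at prediction time, only at pairwise similarities. The data, kernel, and hyperparameters are my own illustrative choices, not material from the talk.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gram matrix of similarities K(x, z) = exp(-gamma * ||x - z||^2)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_ridge_fit(X, y, lam=0.1, gamma=1.0):
    """Solve (K + lam*I) alpha = y; the model is f(x) = sum_i alpha_i K(x_i, x)."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=1.0):
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# Fit a noisy sine curve using only pairwise similarities between points.
rng = np.random.default_rng(0)
X = rng.uniform(0, 2 * np.pi, size=(40, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=40)
alpha = kernel_ridge_fit(X, y)
pred = kernel_ridge_predict(X, alpha, X)
```

Note that the same training code works for any data type for which a kernel is available (strings, trees, graphs), which is the property the structured-output extensions build on.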

**François Coste,** *Learning the language of protein sequences*

The linguistic metaphor has long been used in molecular biology for the study of biological sequences. In this talk, I will focus on modelling protein sequences with formal grammars, presenting classical methods from the bioinformatics field as well as our recent work on automatically learning the topology of these grammars.

**Ariadna Quattoni,** *Hankel Based Methods for Learning Non-Deterministic Automata*

There is an increasing interest in spectral methods for learning latent-variable language models in the form of weighted automata and context-free grammars. Spectral methods provide an algebraic formulation of the learning problem that directly exploits the recurrence relations satisfied by these functions. I will review the spectral method from an algebraic perspective that exploits Hankel matrices to represent functions. The key idea is that if a function is computed by a finite state process then its corresponding Hankel matrix will be of low rank. Using this fact we can reduce the problem of searching over the space of finite state functions to searching over the space of low-rank Hankel matrices.
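The low-rank property is easy to verify numerically on toy examples. The sketch below (my own illustration, not code from the talk) builds finite Hankel blocks H[u, v] = f(uv) for two functions over the alphabet {a, b}: one computable by a one-state weighted automaton and one by a two-state automaton. The rank of each Hankel block matches the number of states.

```python
import itertools
import numpy as np

def hankel(f, alphabet, max_len):
    """Finite Hankel block H[u, v] = f(uv) over all strings up to max_len."""
    words = [''.join(p) for n in range(max_len + 1)
             for p in itertools.product(alphabet, repeat=n)]
    return np.array([[f(u + v) for v in words] for u in words])

# f is computed by a 1-state weighted automaton: f(w) = 0.5 ** len(w),
# so H[u, v] = 0.5**len(u) * 0.5**len(v) is an outer product (rank 1).
H1 = hankel(lambda w: 0.5 ** len(w), 'ab', 3)
# g counts occurrences of 'a', computable with 2 states:
# H[u, v] = count(u) + count(v) is a sum of two rank-1 terms (rank 2).
H2 = hankel(lambda w: w.count('a'), 'ab', 3)

print(np.linalg.matrix_rank(H1))  # 1
print(np.linalg.matrix_rank(H2))  # 2
```

In the spectral method, a low-rank factorization of such a (sub-)block is what yields the automaton's operators.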

**Christophe Gonzales,** *Learning Non-stationary Dynamic Bayesian Networks*

Dynamic Bayesian networks (DBNs) are a popular framework for managing uncertainty in time-evolving systems, and their efficient learning has received many contributions in the literature. Most often, however, these contributions assume that the data to be modeled by a DBN are generated by a stationary process, i.e., that neither the structure nor the parameters of the DBN evolve over time. Unfortunately, there exist real-world problems where such a hypothesis is highly unrealistic, e.g., in video event recognition, social networks or road traffic analysis. In this talk, we will review the state-of-the-art algorithms for learning "non-stationary DBNs" and propose a new principled approach to learn both the structure and the parameters of such models.

**André Martins,** *Parsing as Reduction*

In this talk, I will show how we can reduce phrase-based parsing to dependency parsing. Our reduction is grounded in a new intermediate representation, "head-ordered dependency trees," shown to be isomorphic to constituent trees. By encoding order information in the dependency labels, we show that any off-the-shelf, trainable dependency parser can be used to produce constituents. When this parser is non-projective, we can perform discontinuous parsing in a very natural manner. Despite the simplicity of this approach, experiments show that the resulting parsers are on par with strong baselines, such as the Berkeley parser for English and the best non-reranking system in the SPMRL-2014 shared task. Results are particularly striking for discontinuous parsing of German, where we surpass the current state of the art by a wide margin. If time permits, I will describe TurboParser, a multilingual dependency parser based on linear programming.

**Andreas Vlachos,** *Imitation learning for structured prediction in NLP*

Imitation learning is a learning paradigm originally developed to learn robotic controllers from demonstrations by humans, e.g. autonomous helicopter flight from a pilot's demonstrations. Recently, algorithms for structured prediction were proposed under this paradigm and have been applied successfully to a number of tasks such as information extraction, coreference resolution, semantic parsing and dynamic feature selection. Key advantages are the ability to handle large output search spaces and to learn with non-decomposable loss functions. In this talk I will discuss the main ideas behind imitation learning, and describe in detail its application to semantic parsing.
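A minimal sketch of the dataset-aggregation idea behind imitation-learning algorithms such as DAgger, on an invented toy tagging task: the learner rolls in with its own policy, the expert labels the states the learner actually visits, and the classifier is retrained on the aggregated data. The task, oracle, and majority-vote "classifier" are all hypothetical illustrations, not material from the talk.

```python
from collections import Counter, defaultdict

# Toy structured prediction: tag tokens as 'D' (determiner), 'N' (noun) or
# 'V' (verb). The expert tags from the current word and the *previous
# predicted* tag, so early mistakes change later states; imitation learning
# collects expert labels on exactly those learner-visited states.
def expert_action(prev_tag, word):
    if word == 'the':
        return 'D'
    return 'N' if prev_tag == 'D' else 'V'

def train(dataset):
    """Majority-vote classifier over aggregated (state, expert action) pairs."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return lambda s: votes[s].most_common(1)[0][0] if s in votes else 'V'

sentences = [['the', 'dog', 'runs'], ['the', 'cat', 'sleeps'], ['dogs', 'run']]
dataset = []
policy = lambda s: 'V'  # uninformed initial policy
for _ in range(3):  # dataset-aggregation iterations
    for sent in sentences:
        prev = '<s>'
        for word in sent:
            state = (prev, word)
            dataset.append((state, expert_action(prev, word)))  # expert label
            prev = policy(state)  # roll in with the learner's own policy
    policy = train(dataset)
```

After a few iterations the policy has seen expert labels for the states its own roll-ins reach, e.g. it tags "the dog runs" as D, N, V.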

**Xavier Carreras,** *Low-rank Matrix Learning for Compositional Objects, Strings and Trees*

Compositional structures abound in NLP, ranging from bilexical relations and entities in context to sequences and trees. The focus of this talk is on latent-variable compositional models for structured objects that: (1) induce a latent n-dimensional representation for each element of the structure; and (2) learn operators for composing such elements into structures. I will present a framework that formulates the learning problem as low-rank matrix learning, where we employ the well-known nuclear norm regularizer to favor parameter matrices that are low-rank. I will then illustrate applications of this framework in NLP tasks. First I will show a method to learn word embeddings tailored for specific linguistic relations, which results in word vectors that are both very compact and predictive. Then I will show the importance of low-rank regularization in conjunctive feature spaces, and specifically how our method can propagate weights to conjunctions not observed during training, which results in large improvements in a named entity extraction task. Finally I move to structure prediction and show that low-rank matrix learning can be used to induce the states of weighted automata for sequence tagging tasks.
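Nuclear-norm regularization is typically optimized with proximal-gradient methods, whose key step is singular-value soft-thresholding: shrink every singular value toward zero and drop those below the threshold, which produces exactly low-rank matrices. A minimal numpy sketch of that step (my illustration, not the talk's implementation):

```python
import numpy as np

def svt(W, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
# A rank-2 parameter matrix buried under small dense noise (full rank).
W = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 30)) \
    + 0.01 * rng.normal(size=(30, 30))
W_low = svt(W, tau=1.0)
# The noise singular values fall below tau and are zeroed out, so the
# thresholded matrix is exactly low rank.
print(np.linalg.matrix_rank(W), np.linalg.matrix_rank(W_low))
```

This is why the nuclear norm acts as a convex surrogate for a rank constraint: the proximal step never has to choose a rank explicitly.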

**James Cussens,** *Integer programming for Bayesian network structure learning*

With complete data and appropriately chosen parameter priors, the problem of finding a Bayesian network with maximal log marginal likelihood (LML) becomes a purely discrete one: search for a directed acyclic graph (DAG) with maximal LML. We solve this discrete optimisation problem using integer linear programming (ILP). In many cases this allows us to solve the problem exactly: we find a DAG which we know to have maximal LML. Using ILP also allows prior knowledge, such as known conditional independence relations, to be expressed as constraints on DAG structure. This approach has proven particularly successful when learning pedigrees from genetic marker data, where it is possible to learn guaranteed maximal-LML DAGs with over 1000 nodes. However, many challenges remain, for example when there is missing data and when there is linkage between markers.
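For intuition, the discrete problem can be stated as: pick one parent set per variable so as to maximize the sum of local scores, subject to the chosen graph being acyclic. The sketch below brute-forces this over three variables with made-up local scores (in practice the scores are log marginal likelihoods computed from data, and the ILP approach encodes the parent-set choices as binary variables with acyclicity constraints so the problem scales far beyond brute force).

```python
import itertools

# Hypothetical local scores score(v, parents); higher is better.
scores = {
    ('A', ()): -10.0, ('A', ('B',)): -9.0, ('A', ('C',)): -11.0, ('A', ('B', 'C')): -8.5,
    ('B', ()): -12.0, ('B', ('A',)): -7.0, ('B', ('C',)): -11.5, ('B', ('A', 'C')): -7.5,
    ('C', ()): -9.0,  ('C', ('A',)): -8.0, ('C', ('B',)): -8.8,  ('C', ('A', 'B')): -8.2,
}
variables = ['A', 'B', 'C']

def is_acyclic(parent_map):
    """Depth-first search for cycles in the graph 'parent -> child'."""
    seen, done = set(), set()
    def visit(v):
        if v in done:
            return True
        if v in seen:
            return False  # back edge: cycle
        seen.add(v)
        ok = all(visit(p) for p in parent_map[v])
        done.add(v)
        return ok
    return all(visit(v) for v in variables)

choices = {v: [ps for (u, ps) in scores if u == v] for v in variables}
candidates = (dict(zip(variables, sel))
              for sel in itertools.product(*(choices[v] for v in variables)))
best = max((pm for pm in candidates if is_acyclic(pm)),
           key=lambda pm: sum(scores[(v, pm[v])] for v in variables))
```

Note how the unconstrained optimum (B wants parent A, A wants parents B and C) is cyclic; the acyclicity constraint is exactly what makes the problem hard and what the ILP constraints enforce.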

**Pascal Denis,** *Learning Latent Trees for Joint Coreference Resolution and Anaphoricity Detection*

Noun phrase coreference resolution is one of the most challenging tasks in Natural Language Processing, with numerous applications within and outside NLP. After briefly reviewing the limitations of previous approaches, this talk introduces a new structured model for learning coreference resolution and anaphoricity detection jointly. Specifically, we use a latent tree to represent the full coreference and anaphoric structure of a document at a global level, and we jointly learn the parameters of the two models using a version of the structured perceptron algorithm. Our joint structured model is further refined by the use of pairwise constraints, which help the model capture certain patterns of coreference accurately. Our experiments on the CoNLL-2012 dataset, the main benchmark for this task, show large improvements over various competing architectures.

**Eustasio del Barrio,** *A contamination model for approximate stochastic order*

Stochastic ordering among distributions has been considered in a variety of scenarios. Economic studies often involve research about the ordering of investment strategies or social welfare. However, as noted in the literature, stochastic orderings are often too strong an assumption, one not supported by the data even in cases in which the researcher tends to believe that a certain variable is somehow smaller than another. Instead of this rigid model of stochastic order, we propose a more flexible version in which two distributions are said to satisfy an approximate stochastic order relation if they are slightly contaminated versions of distributions which do satisfy the stochastic ordering. The minimal level of contamination that makes this approximate model hold can be used as a measure of the deviation of the original distributions from the exact stochastic order model. Our approach is based on the use of trimmings of probability measures. We discuss their connection to the approximate stochastic order model and provide theoretical support for its use in data analysis, including asymptotic distributional theory as well as non-asymptotic bounds for the error probabilities of our tests. We also provide simulation results and a case study for illustration.
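For reference, exact stochastic order X ≤st Y means F_X(t) ≥ F_Y(t) for all t. The minimal empirical check below (my illustration, on deterministic toy samples) shows how a single contaminating point can break the exact relation, even though an approximate relation in the abstract's sense would hold at a tiny contamination level (here one point out of 101).

```python
import numpy as np

def ecdf(sample, x):
    """Empirical CDF of `sample` evaluated at the points `x`."""
    return np.searchsorted(np.sort(sample), x, side='right') / len(sample)

def stochastically_smaller(x_sample, y_sample):
    """Empirical exact stochastic order: F_X(t) >= F_Y(t) at every data point."""
    grid = np.union1d(x_sample, y_sample)
    return bool(np.all(ecdf(x_sample, grid) >= ecdf(y_sample, grid)))

x = np.linspace(0.0, 1.0, 100)
y = x + 0.5                      # every quantile shifted up: order holds
y_bad = np.append(y, -10.0)      # one contaminating point breaks it
print(stochastically_smaller(x, y), stochastically_smaller(x, y_bad))  # True False
```

The contamination model's trimming operation would discard that single point and recover the ordering, which is the intuition behind using the minimal trimming level as a deviation measure.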

**Christine Sinoquet,** *Modeling of high-dimensional and spatially correlated data with forests of latent tree models: application to the modeling of genetic data for genome-scale multilocus association studies*

At the intersection of graph theory and probability theory, we have proposed a new class of probabilistic graphical models. This model, named FLTM (forest of latent tree models), is dedicated to the modeling of complex high-dimensional and spatially correlated data. The modeling is made tractable by the design of a scalable learning algorithm. This breakthrough has made it possible to model genetic data describing a population at the genomic level. We will first describe our learning algorithm, and then focus on estimating the impact of the choice of the clustering method used within this algorithm.

**Alessandro Moschitti,** *Efficient Structural Kernels for Natural Language Processing*

Almost all Natural Language Processing (NLP) applications deal with syntactic and semantic structures, and therefore require models for processing structured input and generating structured output. Given the complexity of natural language, heuristic methods often prove inadequate for encoding such structures in NLP systems, and they require considerable design effort. Statistical machine learning approaches for structured output prediction are an effective alternative. However, to our knowledge, there are no efficient global inference algorithms that both exploit structures in the input, in the form of structural kernels, and produce structured output. An efficient approach consists of two steps: (i) the generation of a list of hypotheses for the structured output, and (ii) the application of kernel-based learning-to-rank approaches to select the best hypothesis from the list.

This talk will introduce the most advanced structural kernels for NLP and then show methods for efficient linguistic structure prediction in several applications, ranging from concept segmentation and labeling to predicate-argument and discourse structures.
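As a toy illustration of both ingredients, a structural kernel and kernel-based hypothesis selection, the sketch below counts common subtrees (a simple, valid tree kernel, far cruder than the efficient subset-tree kernels the talk covers) and uses it to pick the best candidate from a hypothesis list. The trees and hypotheses are invented for illustration.

```python
from collections import Counter

def subtrees(tree):
    """Enumerate all subtrees of a tree encoded as nested tuples (label, children...)."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, tuple):
            yield from subtrees(child)

def tree_kernel(t1, t2):
    """Count subtrees shared by t1 and t2 (feature map: subtree counts)."""
    c1, c2 = Counter(subtrees(t1)), Counter(subtrees(t2))
    return sum((c1 & c2).values())

# Step (ii) of the two-step approach: rerank a hypothesis list by
# kernel similarity to a reference structure.
reference = ('S', ('NP', ('D', 'the'), ('N', 'dog')), ('VP', ('V', 'runs')))
hypotheses = [
    ('S', ('NP', ('D', 'the'), ('N', 'dog')), ('VP', ('V', 'runs'))),  # correct bracketing
    ('S', ('NP', ('N', 'the')), ('VP', ('V', 'dog'), ('N', 'runs'))),  # wrong bracketing
]
best = max(hypotheses, key=lambda h: tree_kernel(h, reference))
```

In a real system the reranker would be trained (e.g. with kernelized perceptron or SVM ranking) rather than compared against a single reference, and the kernel would count tree fragments via dynamic programming for efficiency.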

**Jian-Yun Nie,** *How can NLP help IR?*

It is often believed that Natural Language Processing (NLP) plays an important role in Information Retrieval (IR). However, the current state of the art in IR does not make extensive use of NLP techniques; IR has developed its own methods to deal with language. In this talk, I will review some of the successful and unsuccessful attempts at utilizing NLP in IR. It will be shown that successful uses generally require adapting NLP techniques to IR tasks. Two such adaptations will be presented in more detail: query parsing and deep learning in IR. In both cases, the adapted methods improved IR.