The E3C Project: Collection and Annotation of a Multilingual Corpus of Clinical Cases

English. We present the European Clinical Case Corpus (E3C) project, aimed at collecting and annotating a large corpus of clinical cases in five European languages (Italian, English, French, Spanish, and Basque). Project results include: (i) a freely available collection of multilingual clinical cases; and (ii) a two-level annotation scheme based on temporal relations (derived from THYME), whose purpose is to allow the construction of clinical timelines, and taxonomy relations based on medical taxonomies, to be used for semantic reasoning over clinical cases.


Introduction
Identifying clinically relevant events and anchoring them to a chronology is very important in clinical information processing, as access to an ordered sequence of events helps to understand how patients' clinical conditions evolve. However, although interest in information extraction from clinical narratives has increased in recent decades, attention has focused on clinical entity extraction and classification (Schulz et al., 2020; Grabar et al., 2019; Dreisbach et al., 2019; Luo et al., 2017) rather than on temporal information. If temporal information is extracted from clinical free text, it can be added to structured data collections, e.g. MIMIC III (Johnson et al., 2016), to train clinical prediction systems. Despite some effort to organize challenges on clinical narrative processing, e.g. CLEF eHealth (Kelly et al., 2019), few shared training and test data sets have been created, and thus developing tools for this task is still difficult.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In fact, the amount of freely available annotated corpora for any of the clinical information extraction tasks has not grown at the same rate as interest in the field, mainly due to patient privacy and data protection issues. In addition, most datasets consist of English texts, which makes research focus on that language.
In an attempt to overcome these problems, we present the European Clinical Case Corpus (E3C), a project aimed at offering a freely available multilingual corpus of semantically annotated clinical narratives. The project will build a five-language (Italian, English, French, Spanish, and Basque) clinical narrative corpus to allow for the linguistic analysis, benchmarking, and training of information extraction systems. We build upon available resources and collect new data when necessary, with the goal of harmonizing current annotations, introducing new annotation layers, and providing baselines for information extraction tasks.
The E3C corpus is organized into three layers, each with a different purpose:
Layer 1: about 25K tokens per language of clinical narratives with full manual or manually checked annotation of clinical entities, temporal information and factuality, for benchmarking and linguistic analysis.
Layer 2: 50-100K tokens per language of clinical narratives with automatic annotation of clinical entities and manual check of a small sample (about 10%) of this annotation.
Layer 3: about 1M tokens per language of non-annotated medical documents (not necessarily clinical narratives) to be exploited by semisupervised approaches.
In this paper we present our data collection effort, focused on clinical cases (Section 3), and we describe our annotation scheme (Section 4).

Clinical Cases
A clinical case is a statement of a clinical practice, presenting the reason for a clinical visit, the description of physical exams, and the assessment of the patient's situation. We focus on clinical cases because they are often de-identified, overcoming privacy issues, and are rich in clinical entities as well as temporal information, which is almost absent in other clinical documents (e.g., radiological reports).
A 25-year-old man with a history of Klippel-Trenaunay syndrome presented to the hospital with mucopurulent bloody stool and epigastric persistent colic pain for 2 wk. Continuous superficial ulcers and spontaneous bleeding were observed under colonoscopy. Subsequent gastroscopy revealed mucosa with diffuse edema, ulcers, errhysis, and granular and friable changes in the stomach and duodenal bulb, which were similar to the appearance of the rectum. After ruling out other possibilities according to a series of examinations, a diagnosis of GDUC was considered. The patient hesitated about intravenous corticosteroids, so he received a standardized treatment with pentasa of 3.2 g/d. After 0.5 mo of treatment, the patient's symptoms achieved complete remission. Followup endoscopy and imaging findings showed no evidence of recurrence for 26 mo.
Here we present a sample case extracted from our collection. It is about a patient presenting gastric symptoms (mucopurulent bloody stool and epigastric persistent colic pain), who is finally diagnosed with gastroduodenitis associated with ulcerative colitis (GDUC). To reach the diagnosis, two consecutive medical tests (colonoscopy and gastroscopy) were performed. Treatment (treatment with pentasa of 3.2 g/d), outcome (complete remission) and follow-up (no evidence of recurrence) are also present in the text. Symptoms, tests, observations, treatments and diseases are relevant events in the history of a patient, and placing them in chronological order is necessary to understand how the patient's health evolves. For example, we know that the symptoms started 2 weeks prior to the hospital visit, that the colonoscopy was performed before the gastroscopy, that the treatment lasted for half a month and that the patient had no recurrence in the following 26 months.
Since precision in symptom description and diagnosis is crucial in the clinical field, clinical findings, body structures, medicines, etc. have to be uniquely identified. This can be done through international coding standards, which allow a unique code to be assigned to every clinically relevant element in the text.

Data Collection
When building the E3C corpus, a major concern has been ensuring its reusability and shareability, which led us to use anonymised and freely redistributable clinical cases. We deal with three types of clinical narratives: discharge summaries, clinical cases published in journals, and clinical cases from medical training resources. The clinical cases in the E3C corpus contain narratives such as the following excerpt:
2020-09-01. The patient enters the ER due to abdominal pains. He reports chest pain 5 days ago.
The state of the data collection effort varies across the five languages addressed by the project, depending on their online presence and the number of publications available. For Spanish, a large dataset of clinical narratives and other clinical text collections already exist; for English and French, a significant amount of published material is publicly available. Corpus collection for Italian and Basque, on the other hand, has been more demanding, as we have had to manually extract clinical cases from a number of different sources. This is shown by the data in Table 1, where we report statistics about the clinical cases collected.
Taking into account those numbers and the types of documents we have collected for each language, we can say that we have been able to collect enough data to complete Layer 1 in all the languages. For Layer 2, instead, we have only been able to collect enough clinical cases for English, French and Spanish. Reaching one million tokens in Layer 3 is not as complicated as it may seem, as the documents in it do not necessarily need to be clinical cases, although less data is available for Basque. The total amount of collected tokens and the layer coverage for each language can be seen in Table 2.
Corpus collection is in a very advanced stage, but new data will be added in the near future. The whole E3C corpus, including core metadata (i.e. language, source, date, length, etc.), will be made available.

Data Protection in the E3C Corpus
As mentioned, there are two main types of documents in the E3C corpus: clinical narratives and descriptive clinical documents. The latter and even some of the clinical cases (the ones that describe model situations) do not contain any personal data and are out of the scope of data protection regulations. Personal data protection issues concern, instead, the reports that have been written after an actual clinical case. These often contain sensitive patient information, and it is the researchers' duty to disseminate them in compliance with data protection rules (e.g. the European Union General Data Protection Regulation) and to address other ethical issues such as obtaining informed consent from the patients prior to publication.
All the clinical cases in the E3C corpus have been previously published in other sources, and furthermore, they have been published under licenses that allow redistribution. As a consequence, we consider that all data protection and ethical issues were addressed at the time of first publication and that the documents already comply with the patient data protection policies.
While preparing the E3C dataset, we have also contributed to the protection of personal data by retaining only the information relevant to our corpus, in line with the principle of data minimization. For example, many clinical case reports provide illustrative images, which we have not included, as image processing is out of the scope of our project.
In addition, we have also contributed to the reduction of patient traceability, as the article publication date (or an approximate one) has been established as the day the clinical case was written.

Annotation Scheme
E3C annotation consists of two levels that provide complementary information. On the one hand, annotation of temporal information and factuality follows a mostly language-independent annotation scheme consisting of the THYME guidelines and their extensions, described in more detail in Speranza and Altuna (2020). Annotation and classification of clinical entities, on the other hand, is based on two comprehensive medical taxonomies, SNOMED-CT and ICD-10.
The THYME-driven annotation focuses mainly on clinically relevant events and on the temporal relations between them, with the end goal of coding the information needed to build complete timelines, while the taxonomy-driven annotation provides semantic information and domain-specific knowledge. Looking at the sample clinical case in Section 3, the taxonomy-driven annotation might allow one to infer, for instance, that abdominal pains in the first sentence and chest pain in the last sentence are closely related, as they are siblings in the hierarchy (in fact, they are both children of [pain of truncal structure] in SNOMED-CT). From the THYME-driven annotation, instead, one might infer the chronological order in which the two events happened.
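The sibling inference described above can be sketched with a toy parent map. This is only an illustrative sketch: the `PARENT` dictionary is a hand-written, hypothetical fragment that reuses the concept labels mentioned in the text; it is not actual SNOMED-CT content, and the link from [pain of truncal structure] to [pain] is an assumption made for the example.

```python
# Toy fragment of a taxonomy: child concept -> direct parent concept.
# Labels come from the text; the structure is hypothetical, not real SNOMED-CT data.
PARENT = {
    "abdominal pain": "pain of truncal structure",
    "chest pain": "pain of truncal structure",
    "pain of truncal structure": "pain",  # assumed for illustration
}


def are_siblings(a: str, b: str) -> bool:
    """Return True if two distinct concepts share the same direct parent."""
    return a != b and a in PARENT and PARENT.get(a) == PARENT.get(b)


print(are_siblings("abdominal pain", "chest pain"))
```

A real implementation would query a full taxonomy release rather than a hard-coded dictionary.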

THYME-driven Annotation
THYME offers guidelines for the annotation of clinically relevant events, time expressions and the relations between them.
Events are all actions, states, and circumstances that are relevant to the clinical history of a patient (for example, we have pathologies and symptoms such as pain, but also more general events such as enters, reports, and continue). The annotation of events also includes a number of attributes, some of which focus on factuality-related information (the contextual modality attribute, for instance, is used to mark non-factual, either generic or hypothetical, events).
Time expressions are all references to time, such as dates (both absolute like 2020-09-01 and relative like 5 days ago), intervals (last three days), etc. THYME also provides guidelines for the annotation of relations between events and/or time expressions. By expressing precedence, overlap, containment, initiation or ending between two events and/or time expressions, TLINKs allow for chronologically ordering them. ALINKs are relations that link aspectual events, i.e. events indicating a specific phase (beginning, end, continuation, etc.) of an event, to the event itself.
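As a rough illustration of how precedence links can be turned into a chronological order, the sketch below topologically sorts a handful of hypothetical BEFORE links for the excerpt in Section 3. The links, the event labels and the `chronological_order` helper are assumptions made for this example and are not taken from the E3C annotation; only precedence is handled, while overlap, containment, initiation and ending are ignored. It relies on `graphlib`, available in the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical BEFORE links as (earlier, later) pairs; not actual E3C annotations.
before_links = [
    ("chest pain", "enters"),  # the chest pain (5 days ago) precedes the ER visit
    ("enters", "reports"),     # the patient enters the ER, then reports
]


def chronological_order(links):
    """Order events so that every 'earlier' event precedes its 'later' event."""
    sorter = TopologicalSorter()
    for earlier, later in links:
        # add(node, *predecessors): 'earlier' must come before 'later'
        sorter.add(later, earlier)
    return list(sorter.static_order())


print(chronological_order(before_links))
```

If the links contain a cycle, `static_order` raises `graphlib.CycleError`, which can double as a simple consistency check on the annotated relations.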
To obtain annotations that will allow more descriptive timelines, we have expanded the THYME annotation scheme.
Anatomical parts are not annotated in THYME even if noun phrases whose head is a body part can be clinically very relevant (as in He had a swollen eye). To annotate them, we have created the new BODY PART tag. In addition, a new ACTOR tag is used to mark the actors (patients, health professionals, etc.) mentioned in the narratives. Finally, RML is a tag we have created to mark test results, results of laboratory analyses, formulaic measurements, and measure values (which are not marked in THYME), as we think that they offer relevant insights into the health status of a patient. Table 3 represents the annotated version of the clinical case in Section 3. The first column contains the original text (one token per line). The second column shows the span of the THYME-driven annotated elements (specifically, examples of time expressions, actors, events, and body parts) in the IOB2 format, where B-LABEL marks the first token of an element of type LABEL, I-LABEL is used for the subsequent tokens (if any), and O is used for tokens that do not belong to an annotated element. The last two columns represent the taxonomy-driven annotation (see below).
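The IOB2 encoding described above can be sketched as follows. The `to_iob2` helper, the label strings (`EVENT`, `BODYPART`, `TIMEX3`) and the spans are assumptions chosen for illustration; they are not the exact label set or the content of Table 3.

```python
def to_iob2(tokens, spans):
    """Encode (start, end_exclusive, label) token spans as IOB2 tags."""
    tags = ["O"] * len(tokens)          # default: token outside any element
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the element
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # subsequent tokens, if any
    return tags


tokens = ["He", "reports", "chest", "pain", "5", "days", "ago", "."]
# Hypothetical spans over token indices for the second sentence of the excerpt.
spans = [(1, 2, "EVENT"), (2, 3, "BODYPART"), (3, 4, "EVENT"), (4, 7, "TIMEX3")]

for token, tag in zip(tokens, to_iob2(tokens, spans)):
    print(f"{token}\t{tag}")
```

The sketch assumes non-overlapping spans within one column, which is what a single IOB2 column can represent.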

Taxonomy-driven Annotation
Clinical coding is widespread in clinical practice; either doctors add the codes for findings, procedures, treatments, etc. to the patients' clinical histories, or large amounts of raw clinical data are automatically coded for the development of clinical prediction systems. The coded concepts are hierarchically classified in taxonomies such as SNOMED-CT and ICD-10.
SNOMED-CT is considered to be the most comprehensive clinical healthcare taxonomy, and is available for most of the languages of the E3C project, i.e. English, French, Spanish, and Basque. There is a validated SNOMED-CT version for the first three languages, while for Basque a partial version has been used (Perez de Viñaspre and Oronoz, 2015). SNOMED-CT offers 19 main categories (and a wide set of subcategories) that range from clinical findings and body structures to social contexts. On the other hand, ICD-10 (International Classification of Diseases, 10th revision) is a classification of diagnoses and procedures. The diseases are classified in 22 categories.
Taxonomy-driven annotation consists of marking in the texts all mentions of clinical entities and mapping them to a code from both international standards. Table 3 represents the annotated version of the clinical case in Section 3. The third and fourth columns show the span of the annotated clinical entities in the IOB2 format, with respect to SNOMED-CT and ICD-10 respectively.
The taxonomy-driven annotation is based, for each concept, on the specific linguistic realization that is coded in the taxonomy, whereas in texts we can find a number of different textual realizations of the same concept. Variability may relate to the alternation between singular and plural and between similar prepositions, or to the presence/omission of a preposition or article. In E3C we have devised a set of rules to account for the variability of linguistic expressions. For instance, looking at the excerpt in Section 3, the textual realization abdominal pains is associated with the singular SNOMED-CT concept [abdominal pain]. In addition, if overlapping portions of text match different concepts, we select the most specific one; for instance, [chest pain] is preferred over [pain].
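The two rules illustrated above (absorbing singular/plural variation and preferring the most specific concept) can be approximated with a small lexicon matcher. This is a simplified sketch, not the E3C rule set: the `CONCEPTS` lexicon is a toy set of labels from the text, `normalize` is a crude plural stripper, and "most specific" is approximated by "longest match", which is an assumption.

```python
# Toy concept lexicon (labels from the text); not actual SNOMED-CT content.
CONCEPTS = {"abdominal pain", "chest pain", "pain"}


def normalize(token: str) -> str:
    """Lowercase and strip a trailing plural 's' (crude stand-in for the real rules)."""
    t = token.lower()
    return t[:-1] if t.endswith("s") and len(t) > 3 else t


def match_concepts(tokens, max_len=3):
    """Greedy longest match: at each position prefer the longest known concept."""
    matches = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(normalize(t) for t in tokens[i:i + n])
            if candidate in CONCEPTS:
                matches.append((i, i + n, candidate))
                i += n          # skip past the matched span
                break
        else:
            i += 1              # no concept starts at this token
    return matches


print(match_concepts(["due", "to", "abdominal", "pains"]))
print(match_concepts(["He", "reports", "chest", "pain"]))
```

Because longer candidates are tried first, [chest pain] wins over [pain] whenever both would match at the same position.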
The E3C guidelines for taxonomy-driven annotation are based on both the ShARe (Elhadad et al., 2012) and the ASSESS CT annotation guidelines (Miñarro-Giménez et al., 2018).

Language-dependent Decisions
Semantic annotation of the E3C corpus is largely language-independent. However, as we are dealing with morpho-syntactically diverse languages, we have defined additional annotation guidelines for each language. These guidelines mainly concern the extent of the temporal and clinical entities, since their semantic features are not altered by the morpho-syntactic features.
Both the THYME-driven and the taxonomy-driven annotation schemes were originally developed for English, a language whose morphology is not particularly rich compared to the other languages of the E3C corpus (especially Basque). For all of these, it was therefore necessary to define language-specific guidelines handling the annotation of semantically complex tokens resulting from the combination of different elements (e.g., a preposition and an article).
In the case of Romance languages (Italian, French and Spanish), we have taken decisions on the annotation of preposition+article contractions. The article may be part of the extent of time expressions, RML, actors and body parts, whereas the preposition should not be included. When a contraction is present, though, we have decided to capture it inside the extent (1-3).
Basque, on the other hand, is a highly agglutinative language in which information expressed by prepositions in Indo-European languages is expressed by postpositions. Most of those postpositions appear attached to the nouns, adjectives, verbs and adverbs they refer to, while there is also a small set of free postpositions. The attached postpositions are taken inside the extent of the tags in E3C (4).

Discussion
The two annotation levels can be mapped to address specific tasks, or to develop applications that need to exploit both. Within the E3C project, we are exploring the main issues that emerge when trying to exploit the two annotation levels at the same time. Our future aim within the project is to select a specific task and implement a mapping tailored to that task. The main mapping issue arises from non-matching annotated spans. Given that more specific (typically longer) taxonomy concepts are preferred to more generic ones, and that in THYME only the syntactic head of events is marked, in many cases the span of the concept is longer than the span of the event. Compare, for example, the SNOMED-CT concept associated with abdominal pains and the THYME event pains in Table 3.
More interestingly, in some cases, we can have two separate THYME annotations within the span of a single taxonomic concept. Back to our example, the SNOMED-CT concept [chest pain] overlaps with the two separate THYME annotations pain and chest.
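One simple way to surface such mismatches is to list, for each taxonomy concept span, the THYME annotations that overlap it. The sketch below is an assumption about how such a mapping step could look, not the project's mapping; the `overlapping` helper, the token indices and the labels are hypothetical.

```python
def overlapping(concept_span, thyme_annotations):
    """Return the THYME annotations whose token span intersects the concept span.

    Spans are (start, end_exclusive) over token indices.
    """
    c_start, c_end = concept_span
    return [a for a in thyme_annotations if a[0] < c_end and c_start < a[1]]


# Tokens: He(0) reports(1) chest(2) pain(3)
# Hypothetical THYME-driven annotations: (start, end, label, text)
thyme = [
    (1, 2, "EVENT", "reports"),
    (2, 3, "BODYPART", "chest"),
    (3, 4, "EVENT", "pain"),
]
chest_pain_span = (2, 4)  # span of the taxonomy concept [chest pain]

print([a[3] for a in overlapping(chest_pain_span, thyme)])
```

A concept with more than one overlapping THYME annotation, as here, signals a case where a task-specific mapping has to decide which annotation the concept attaches to.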
Another issue is the inevitably different classification criteria in medical taxonomies and THYME. For instance, only a minimal part of what is marked as an event in THYME is a child concept of [event] in SNOMED-CT (e.g., abuse and death); in most cases what is marked as an event in THYME belongs to a different subpart of the SNOMED-CT hierarchy (for instance, pain is part of the [finding] subhierarchy, not of [event]).

Conclusions and Future Work
We presented the E3C project, which aims to become a reference European corpus of annotated clinical cases. We focused on two initial achievements: (i) a freely available collection of clinical cases in five languages; and (ii) a comprehensive annotation scheme based both on temporal information and on medical taxonomies.
Our next steps include the extensive manual annotation of the clinical cases in all five languages, and the definition of tasks and baselines on top of the annotated data, taking advantage of neural models derived from training data. More specifically, we plan to target the automatic construction of clinical timelines and question answering over clinical cases.