A Real-World Data Resource of Complex Sensitive Sentences Based on Documents from the Monsanto Trial

Jan Neerbek, Morten Eskildsen, Peter Dolog, Ira Assent

Publication: Contribution to book/anthology/report/conference proceeding; Conference article in proceedings; Research; Peer-reviewed

Abstract

In this work we present a corpus for the evaluation of sensitive information detection approaches that addresses the need for real-world sensitive information in empirical studies. Our sentence corpus contains different notions of complex sensitive information that correspond to different aspects of concern in a current trial of the Monsanto company. This paper describes the annotation process, in which we both employ human annotators and create automatically inferred labels regarding technical, legal and informal communication within and with employees of Monsanto, drawing on a classification of documents by lawyers involved in the Monsanto court case. We release a corpus of high-quality sentences and parse trees with these two types of labels at the sentence level. We characterize the sensitive information via several representative sensitive information detection models, in particular both keyword-based (n-gram) approaches and recent deep learning models, namely recurrent neural networks (LSTM) and recursive neural networks (RecNN). Data and code are made publicly available.
Original language: English
Title: Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, Stelios Piperidis
Number of pages: 10
Place of publication: Marseille, France
Publisher: European Language Resources Association
Publication date: 2020
Pages: 1258-1267
Status: Published - 2020
Event: 12th Language Resources and Evaluation Conference - Marseille, France
Duration: 1 May 2020 to 31 May 2020
Conference number: 12th

Conference

Conference: 12th Language Resources and Evaluation Conference
Number: 12th
Country/Territory: France
City: Marseille
Period: 01/05/2020 to 31/05/2020
