Unleashing tabular content to open data: A survey on PDF table extraction methods and tools

Andreiwid Sheffer Corrêa, Pär Ola Zander

Publication: Contribution to book/anthology/report/conference proceeding; Conference article in proceedings; Research; Peer-reviewed

26 Citations (Scopus)

Abstract

Portable Document Format (PDF) has been a popular way to exchange data in documents since Adobe introduced the format in 1993. Its report-like character, which preserves and prioritizes graphical visualization, was among the main publishing concerns of several segments, including government agencies. As a result, tabular data came to be enclosed within PDF documents and disclosed on government portals. This situation, apart from being surprisingly contradictory to data openness, is still found even in the major open data initiatives. It is estimated that roughly 13% of the files published in some of the main open data portals around the world make their data available in PDF. Thus, there is a need for effective tools capable of extracting tabular content (a main placeholder for data) from PDF so that the data can be published in more open formats, such as the well-known CSV, which complies with the open data principles of accessibility and machine processability. This paper aims to provide a structured and comprehensive overview of the research on tabular content extraction specifically from PDF documents, as well as an overview of the most recent practical results in the literature. The contribution of this work goes beyond theoretical discussion by helping data practitioners understand to what extent methods and tools for tabular content extraction from PDF can benefit open data initiatives in practical and effective ways.

Original language: English
Title: DG.O 2017 - Proceedings of the 18th Annual International Conference on Digital Government Research: Innovations and Transformations in Government
Number of pages: 10
Volume: Part F128275
Publisher: Association for Computing Machinery
Publication date: 7 Jun 2017
Pages: 54-63
ISBN (Electronic): 9781450353175
Status: Published - 7 Jun 2017
Event: 18th Annual International Conference on Digital Government Research, DG.O 2017 - Staten Island, USA
Duration: 7 Jun 2017 to 9 Jun 2017

Conference

Conference: 18th Annual International Conference on Digital Government Research, DG.O 2017
Country/Territory: USA
City: Staten Island
Period: 07/06/2017 to 09/06/2017
Sponsors: City University of New York (CUNY)/School of Business at College of Staten Island (CSI), Emerald Publishing Group, IOS Press, iSecure Lab at College of Staten Island (CSI), Journal of Informatics, Rutgers University I-DSLA institute
