Kraaler: A User-Perspective Web Crawler

Publication: Contribution to book/anthology/report/conference proceeding; Article in proceeding; Research; Peer-reviewed

Abstract

The technologies used on the web change frequently, requiring applications that interact with the web to continuously adapt their ability to parse it. This has led most web crawlers either to inherit simplistic parsing capabilities that differ from those of web browsers, or to use a web browser through high-level interactions that restrict the observable information. We introduce Kraaler, an open source universal web crawler that uses the Chrome Debugging Protocol, enabling the use of the Blink browser engine for parsing while obtaining protocol-level information. The crawler stores information in a database and on the file system, and the implementation has been evaluated in a predictable environment to ensure correctness of the collected data. Additionally, it has been evaluated in a real-world scenario, demonstrating the impact of the parsing capabilities on data collection.
Original language: English
Title of host publication: 2019 Network Traffic Measurement and Analysis Conference (TMA)
Number of pages: 8
Publisher: IEEE
Publication date: 5 Aug 2019
ISBN (Print): 978-3-903176-17-1
ISBN (Electronic): 978-3-903176-17-1
DOI: https://doi.org/10.23919/TMA.2019.8784660
Status: Published - 5 Aug 2019
Event: 2019 Network Traffic Measurement and Analysis Conference (TMA) - Paris, France
Duration: 19 Jun 2019 to 21 Jun 2019

Conference

Conference: 2019 Network Traffic Measurement and Analysis Conference (TMA)
Country: France
City: Paris
Period: 19/06/2019 to 21/06/2019

Cite this

Panum, Thomas Kobber ; Hansen, René Rydhof ; Pedersen, Jens Myrup. / Kraaler: A User-Perspective Web Crawler. 2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE, 2019.
@inproceedings{02e7c7cf787743699da21ca2bf93f18d,
title = "Kraaler: A User-Perspective Web Crawler",
author = "Panum, {Thomas Kobber} and Hansen, {Ren{\'e} Rydhof} and Pedersen, {Jens Myrup}",
year = "2019",
month = "8",
day = "5",
doi = "10.23919/TMA.2019.8784660",
language = "English",
isbn = "978-3-903176-17-1",
booktitle = "2019 Network Traffic Measurement and Analysis Conference (TMA)",
publisher = "IEEE",
address = "United States",

}

Panum, TK, Hansen, RR & Pedersen, JM 2019, Kraaler: A User-Perspective Web Crawler. in 2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE, Paris, France, 19/06/2019. https://doi.org/10.23919/TMA.2019.8784660

Kraaler: A User-Perspective Web Crawler. / Panum, Thomas Kobber; Hansen, René Rydhof; Pedersen, Jens Myrup.

2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE, 2019.

TY - GEN

T1 - Kraaler: A User-Perspective Web Crawler

AU - Panum, Thomas Kobber

AU - Hansen, René Rydhof

AU - Pedersen, Jens Myrup

PY - 2019/8/5

Y1 - 2019/8/5

U2 - 10.23919/TMA.2019.8784660

DO - 10.23919/TMA.2019.8784660

M3 - Article in proceeding

SN - 978-3-903176-17-1

BT - 2019 Network Traffic Measurement and Analysis Conference (TMA)

PB - IEEE

ER -

Panum TK, Hansen RR, Pedersen JM. Kraaler: A User-Perspective Web Crawler. In 2019 Network Traffic Measurement and Analysis Conference (TMA). IEEE. 2019. https://doi.org/10.23919/TMA.2019.8784660