Abstract
The technologies used on the web change frequently, requiring applications that interact with the web to continuously update their ability to parse it. This has led most web crawlers either to inherit simplistic parsing capabilities that differ from those of web browsers, or to use a web browser through high-level interactions that restrict the observable information. We introduce Kraaler, an open-source universal web crawler that uses the Chrome Debugging Protocol, enabling the use of the Blink browser engine for parsing while obtaining protocol-level information. The crawler stores information in a database and on the file system. The implementation has been evaluated in a predictable environment to ensure the correctness of the collected data. Additionally, it has been evaluated in a real-world scenario, demonstrating the impact of the parsing capabilities on data collection.
Original language | English |
---|---|
Title | 2019 Network Traffic Measurement and Analysis Conference (TMA) |
Editors | Stefano Secci, Isabelle Chrisment, Marco Fiore, Lionel Tabourier, Keun-Woo Lim |
Number of pages | 8 |
Publisher | IEEE |
Publication date | 5 Aug 2019 |
Pages | 153-160 |
Article number | 8784660 |
ISBN (Print) | 978-3-903176-17-1 |
ISBN (Electronic) | 978-3-903176-17-1 |
DOI | |
Status | Published - 5 Aug 2019 |
Event | 2019 Network Traffic Measurement and Analysis Conference (TMA) - Paris, France. Duration: 19 Jun 2019 → 21 Jun 2019 |
Conference
Conference | 2019 Network Traffic Measurement and Analysis Conference (TMA) |
---|---|
Country/Territory | France |
City | Paris |
Period | 19/06/2019 → 21/06/2019 |