Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection

Jeremy Steinhauer; Lois M.L. Delcambre; Marianne Lykke; Marit Kristine Ådland

doi:10.1007/s00799-014-0117-z

Department of Communication and Psychology

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at the core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to the same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.

Original language	English
Journal	International Journal on Digital Libraries
Volume	14
Issue number	3/4
Pages (from-to)	167-179
Number of pages	13
ISSN	1432-5012
DOIs	https://doi.org/10.1007/s00799-014-0117-z
Publication status	Published - 2014

FIRE: Facilitating information retrieval (for domain) experts
Lykke, M.
01/01/2009 → 01/01/2012
Project: Research

Cite this

@article{67993176dbfe4db78307220377617e1f,
  title = "Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection",
  abstract = "We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at the core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions, worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.",
  author = "Jeremy Steinhauer and Delcambre, {Lois M.L.} and Marianne Lykke and {\AA}dland, {Marit Kristine}",
  year = "2014",
  doi = "10.1007/s00799-014-0117-z",
  language = "English",
  volume = "14",
  pages = "167--179",
  journal = "International Journal on Digital Libraries",
  issn = "1432-5012",
  publisher = "Physica-Verlag",
  number = "3/4",
}

TY  - JOUR
T1  - Evaluating distance-based clustering for user (browse and click) sessions in a domain-specific collection
AU  - Steinhauer, Jeremy
AU  - Delcambre, Lois M.L.
AU  - Lykke, Marianne
AU  - Ådland, Marit Kristine
PY  - 2014
Y1  - 2014
N2  - We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at the core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions, worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.
AB  - We seek to improve information retrieval in a domain-specific collection by clustering user sessions from a click log and then classifying later user sessions in real time. As a preliminary step, we explore the main assumption of this approach: whether user sessions in such a site are related to the question that they are answering. Since a large class of machine learning algorithms use a distance measure at the core, we evaluate the suitability of common machine learning distance measures to distinguish sessions of users searching for the answer to same or different questions. We found that two distance measures work very well for our task and three others do not. As a further step, we then investigate how effective the distance measures are when used in clustering. For our dataset, we conducted a user study where we had multiple users answer the same set of questions. This data, grouped by question, was used as our gold standard for evaluating the clusters produced by the clustering algorithms. We found that the observed difference between the two classes of distance measures affected the quality of the clusterings, as expected. We also found that one of the two distance measures that worked well to differentiate sessions, worked significantly better than the other when clustering. Finally, we discuss why some distance metrics performed better than others in the two parts of our work.
U2  - 10.1007/s00799-014-0117-z
DO  - 10.1007/s00799-014-0117-z
M3  - Journal article
SN  - 1432-5012
VL  - 14
SP  - 167
EP  - 179
JO  - International Journal on Digital Libraries
JF  - International Journal on Digital Libraries
IS  - 3/4
ER  - 
