Sample-based Attribute Selective AnDE for Large Data

Shenglei Chen; Ana Martinez; Geoffrey Webb; Limin Wang

doi:10.1109/TKDE.2016.2608881

Publication: Contribution to journal › Journal article › Research › peer-review

25 Citations (Scopus)

Abstract

Over the past decade, more and more applications have come with large data sets. However, existing algorithms are not guaranteed to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data by varying the value of n (the number of super parents). Hence AnDE is especially appropriate for learning from large data. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.
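The idea of assessing many nested models with leave-one-out cross validation in a single sweep can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain naive Bayes (the n = 0 case of AnDE) on in-memory categorical data, takes the attribute order as given rather than computing a ranking, and uses Laplace smoothing. The function name `loo_errors_nested_nb` and its parameters are hypothetical.

```python
import math
from collections import Counter


def loo_errors_nested_nb(X, y, order):
    """Leave-one-out error counts for naive Bayes on nested attribute subsets.

    X: list of tuples of categorical values; y: list of class labels;
    order: attribute indices, most preferred first (assumed given).
    Returns errors[k] = LOO misclassifications using attributes order[:k + 1].
    """
    n = len(X)
    classes = sorted(set(y))
    class_count = Counter(y)

    # joint[(a, v, c)] = number of rows with attribute a == v and class c
    joint = Counter()
    for row, c in zip(X, y):
        for a in order:
            joint[(a, row[a], c)] += 1

    # Number of distinct values per attribute, used for Laplace smoothing
    n_values = {a: len({row[a] for row in X}) for a in order}

    errors = [0] * len(order)
    for row, c_true in zip(X, y):
        # Hold the row out by subtracting its own contribution from the counts
        score = {}
        for c in classes:
            held = 1 if c == c_true else 0
            score[c] = math.log((class_count[c] - held + 1) / (n - 1 + len(classes)))

        # Add one attribute at a time; after each one, the running scores
        # are exactly the prediction of the model using that prefix
        for k, a in enumerate(order):
            for c in classes:
                held = 1 if c == c_true else 0
                num = joint[(a, row[a], c)] - held + 1
                den = class_count[c] - held + n_values[a]
                score[c] += math.log(num / den)
            predicted = max(classes, key=lambda c: score[c])
            if predicted != c_true:
                errors[k] += 1
    return errors
```

Because the count tables are built once and each held-out row is handled by subtraction, all prefixes of `order` are scored without retraining. One would then keep the prefix with the lowest error count.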

Original language	English
Article number	7565579
Journal	IEEE Transactions on Knowledge and Data Engineering
Volume	29
Issue number	1
Pages (from-to)	172-185
ISSN	1041-4347
DOI	https://doi.org/10.1109/TKDE.2016.2608881
Status	Published - 2017

Access to document

10.1109/TKDE.2016.2608881

Other files and links

http://www.scopus.com/inward/record.url?scp=84992053080&partnerID=8YFLogxK

Citation formats

@article{dc51d76a6d344ee4827a9a148e02712e,

title = "Sample-based Attribute Selective AnDE for Large Data",

abstract = "More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.",

keywords = "Attribute selection, Averaged n-Dependence Estimators (AnDE), Bayesian network classifiers, Classification learning, Large data, Leave-one-out cross validation",

author = "Shenglei Chen and Ana Martinez and Geoffrey Webb and Limin Wang",

year = "2017",

doi = "10.1109/TKDE.2016.2608881",

language = "English",

volume = "29",

pages = "172--185",

journal = "IEEE Transactions on Knowledge and Data Engineering",

issn = "1041-4347",

publisher = "IEEE",

number = "1",

}

TY - JOUR

T1 - Sample-based Attribute Selective AnDE for Large Data

AU - Chen, Shenglei

AU - Martinez, Ana

AU - Webb, Geoffrey

AU - Wang, Limin

PY - 2017

Y1 - 2017

N2 - More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.

AB - More and more applications come with large data sets in the past decade. However, existing algorithms cannot guarantee to scale well on large data. Averaged n-Dependence Estimators (AnDE) allows for flexible learning from out-of-core data, by varying the value of n (number of super parents). Hence AnDE is especially appropriate for large data learning. In this paper, we propose a sample-based attribute selection technique for AnDE. It needs one more pass through the training data, in which a multitude of approximate AnDE models are built and efficiently assessed by leave-one-out cross validation. The use of a sample reduces the training time. Experiments on 15 large data sets demonstrate that the proposed technique significantly reduces AnDE's error at the cost of a modest increase in training time. This efficient and scalable out-of-core approach delivers superior or comparable performance to typical in-core Bayesian network classifiers.

KW - Attribute selection

KW - Averaged n-Dependence Estimators (AnDE)

KW - Bayesian network classifiers

KW - Classification learning

KW - Large data

KW - Leave-one-out cross validation

UR - http://www.scopus.com/inward/record.url?scp=84992053080&partnerID=8YFLogxK

U2 - 10.1109/TKDE.2016.2608881

DO - 10.1109/TKDE.2016.2608881

M3 - Journal article

SN - 1041-4347

VL - 29

SP - 172

EP - 185

JO - IEEE Transactions on Knowledge and Data Engineering

JF - IEEE Transactions on Knowledge and Data Engineering

IS - 1

M1 - 7565579

ER -
