TY - JOUR
T1 - Scaling up Bayesian variational inference using distributed computing clusters
AU - Masegosa, Andrés R.
AU - Martinez, Ana M.
AU - Langseth, Helge
AU - Nielsen, Thomas Dyhre
AU - Salmerón, Antonio
AU - Ramos-López, Darío
AU - Madsen, Anders Læsø
PY - 2017/9/1
Y1 - 2017/9/1
N2 - In this paper we present an approach for scaling up Bayesian learning using variational methods by exploiting distributed computing clusters managed by modern big data processing tools such as Apache Spark or Apache Flink, which efficiently support iterative map-reduce operations. Our approach is defined as a distributed projected natural gradient ascent algorithm, has excellent convergence properties, and covers a wide range of conjugate exponential family models. We evaluate the proposed algorithm on three real-world datasets from different domains (the PubMed abstracts dataset, a GPS trajectory dataset, and a financial dataset) and using several models (LDA, factor analysis, mixtures of Gaussians, and linear regression models). Our approach compares favorably to stochastic variational inference and streaming variational Bayes, two of the main current proposals for scaling up variational methods. For the scalability analysis, we evaluate our approach over a network with more than one billion nodes, of which approximately 75% are latent variables, using a computer cluster with 128 processing units (AWS). The proposed methods are released as part of an open-source toolbox for scalable probabilistic machine learning (http://www.amidsttoolbox.com), Masegosa et al. (2017) [29].
KW - Apache Flink
KW - Conjugate exponential family
KW - Probabilistic graphical models
KW - Scalable Bayesian learning
KW - Variational inference
UR - http://www.scopus.com/inward/record.url?scp=85021999641&partnerID=8YFLogxK
U2 - 10.1016/j.ijar.2017.06.010
DO - 10.1016/j.ijar.2017.06.010
M3 - Journal article
AN - SCOPUS:85021999641
SN - 0888-613X
VL - 88
SP - 435
EP - 451
JO - International Journal of Approximate Reasoning
JF - International Journal of Approximate Reasoning
ER -