Scaling up Bayesian variational inference using distributed computing clusters

Andrés R. Masegosa; Ana M. Martinez; Helge Langseth; Thomas Dyhre Nielsen; Antonio Salmerón; Darío Ramos-López; Anders Læsø Madsen

doi:10.1016/j.ijar.2017.06.010

Andrés R. Masegosa^*, Ana M. Martinez, Helge Langseth, Thomas Dyhre Nielsen, Antonio Salmerón, Darío Ramos-López, Anders Læsø Madsen

^*Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review

10 Citations (Scopus)

Abstract

In this paper we present an approach for scaling up Bayesian learning using variational methods by exploiting distributed computing clusters managed by modern big data processing tools like Apache Spark or Apache Flink, which efficiently support iterative map-reduce operations. Our approach is defined as a distributed projected natural gradient ascent algorithm, has excellent convergence properties, and covers a wide range of conjugate exponential family models. We evaluate the proposed algorithm on three real-world datasets from different domains (the PubMed abstracts dataset, a GPS trajectory dataset, and a financial dataset) and using several models (LDA, factor analysis, mixture of Gaussians and linear regression models). Our approach compares favorably to stochastic variational inference and streaming variational Bayes, two of the main current proposals for scaling up variational methods. For the scalability analysis, we evaluate our approach over a network with more than one billion nodes and approximately 75% latent variables using a computer cluster with 128 processing units (AWS). The proposed methods are released as part of an open-source toolbox for scalable probabilistic machine learning (http://www.amidsttoolbox.com; Masegosa et al. (2017) [29]).

Original language	English
Journal	International Journal of Approximate Reasoning
Volume	88
Pages (from-to)	435-451
Number of pages	17
ISSN	0888-613X
DOIs	https://doi.org/10.1016/j.ijar.2017.06.010
Publication status	Published - 1 Sept 2017

Keywords

Apache Flink
Conjugate exponential family
Probabilistic graphical models
Scalable Bayesian learning
Variational inference

Cite this

@article{d9877379554642c7b2a54b5bbf2b7e5d,

title = "Scaling up Bayesian variational inference using distributed computing clusters",

abstract = "In this paper we present an approach for scaling up Bayesian learning using variational methods by exploiting distributed computing clusters managed by modern big data processing tools like Apache Spark or Apache Flink, which efficiently support iterative map-reduce operations. Our approach is defined as a distributed projected natural gradient ascent algorithm, has excellent convergence properties, and covers a wide range of conjugate exponential family models. We evaluate the proposed algorithm on three real-world datasets from different domains (the Pubmed abstracts dataset, a GPS trajectory dataset, and a financial dataset) and using several models (LDA, factor analysis, mixture of Gaussians and linear regression models). Our approach compares favorably to stochastic variational inference and streaming variational Bayes, two of the main current proposals for scaling up variational methods. For the scalability analysis, we evaluate our approach over a network with more than one billion nodes and approx. 75% latent variables using a computer cluster with 128 processing units (AWS). The proposed methods are released as part of an open-source toolbox for scalable probabilistic machine learning (http://www.amidsttoolbox.com) Masegosa et al. (2017) [29].",

keywords = "Apache Flink, Conjugate exponential family, Probabilistic graphical models, Scalable Bayesian learning, Variational inference",

author = "Masegosa, {Andr{\'e}s R.} and Martinez, {Ana M.} and Helge Langseth and Nielsen, {Thomas Dyhre} and Antonio Salmer{\'o}n and Dar{\'i}o Ramos-L{\'o}pez and Madsen, {Anders L{\ae}s{\o}}",

year = "2017",

month = sep,

day = "1",

doi = "10.1016/j.ijar.2017.06.010",

language = "English",

volume = "88",

pages = "435--451",

journal = "International Journal of Approximate Reasoning",

issn = "0888-613X",

publisher = "Elsevier",

}

TY - JOUR

T1 - Scaling up Bayesian variational inference using distributed computing clusters

AU - Masegosa, Andrés R.

AU - Martinez, Ana M.

AU - Langseth, Helge

AU - Nielsen, Thomas Dyhre

AU - Salmerón, Antonio

AU - Ramos-López, Darío

AU - Madsen, Anders Læsø

PY - 2017/9/1

Y1 - 2017/9/1

N2 - In this paper we present an approach for scaling up Bayesian learning using variational methods by exploiting distributed computing clusters managed by modern big data processing tools like Apache Spark or Apache Flink, which efficiently support iterative map-reduce operations. Our approach is defined as a distributed projected natural gradient ascent algorithm, has excellent convergence properties, and covers a wide range of conjugate exponential family models. We evaluate the proposed algorithm on three real-world datasets from different domains (the Pubmed abstracts dataset, a GPS trajectory dataset, and a financial dataset) and using several models (LDA, factor analysis, mixture of Gaussians and linear regression models). Our approach compares favorably to stochastic variational inference and streaming variational Bayes, two of the main current proposals for scaling up variational methods. For the scalability analysis, we evaluate our approach over a network with more than one billion nodes and approx. 75% latent variables using a computer cluster with 128 processing units (AWS). The proposed methods are released as part of an open-source toolbox for scalable probabilistic machine learning (http://www.amidsttoolbox.com) Masegosa et al. (2017) [29].

KW - Apache Flink

KW - Conjugate exponential family

KW - Probabilistic graphical models

KW - Scalable Bayesian learning

KW - Variational inference

UR - http://www.scopus.com/inward/record.url?scp=85021999641&partnerID=8YFLogxK

U2 - 10.1016/j.ijar.2017.06.010

DO - 10.1016/j.ijar.2017.06.010

M3 - Journal article

AN - SCOPUS:85021999641

SN - 0888-613X

VL - 88

SP - 435

EP - 451

JO - International Journal of Approximate Reasoning

JF - International Journal of Approximate Reasoning

ER -
