Scalable MCMC for large data problems using data subsampling and the difference estimator
Stockholm University, Faculty of Social Sciences, Department of Statistics.
(English) Manuscript (preprint) (Other academic)
Abstract [en]

We propose a generic Markov Chain Monte Carlo (MCMC) algorithm to speed up computations for data sets with many observations. A key feature of our approach is the use of the highly efficient difference estimator from the survey sampling literature to estimate the log-likelihood accurately using only a small fraction of the data. Our algorithm improves on the O(n) complexity of regular MCMC by operating over local data clusters instead of the full sample when computing the likelihood. The likelihood estimate is used in a pseudo-marginal framework to sample from a perturbed posterior which is within O(m^(-1/2)) of the true posterior, where m is the subsample size. The method is applied to a logistic regression model to predict firm bankruptcy for a large data set. We document a significant speedup in comparison to standard MCMC on the full data set.
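As an illustration of the difference-estimator idea in the abstract, the following is a minimal sketch and not the manuscript's implementation. It assumes a logistic regression likelihood, simple random sampling with replacement, and a deliberately crude proxy: each observation's log-likelihood term evaluated at a fixed reference parameter. The manuscript builds its approximations from local data clusters instead. The names `loglik_term`, `full_loglik`, `difference_estimate`, `proxy` and `proxy_sum` are hypothetical.

```python
import math
import random


def loglik_term(theta, x, y):
    """Log-likelihood contribution of one observation in logistic regression."""
    eta = sum(t * xi for t, xi in zip(theta, x))
    # y*eta - log(1 + exp(eta)), with the log term computed stably
    return y * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))


def full_loglik(theta, X, Y):
    """Exact log-likelihood: O(n) work per evaluation."""
    return sum(loglik_term(theta, x, y) for x, y in zip(X, Y))


def difference_estimate(theta, X, Y, proxy, proxy_sum, m, rng):
    """Estimate the full log-likelihood from m sampled observations.

    proxy[i] is a cheap approximation of term i (an assumption of this
    sketch) and proxy_sum is the sum of all proxies, computed once.
    Only the differences l_i - proxy[i] are estimated by sampling.
    """
    n = len(Y)
    idx = [rng.randrange(n) for _ in range(m)]  # sampling with replacement
    diffs = [loglik_term(theta, X[i], Y[i]) - proxy[i] for i in idx]
    return proxy_sum + n * sum(diffs) / m
```

The estimator is unbiased for the full log-likelihood under this sampling scheme, and its variance shrinks as the proxies track the true terms more closely, which is why a good approximation lets a small m suffice.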

Keyword [en]
Bayesian inference, Markov Chain Monte Carlo, Pseudo-marginal MCMC, estimated likelihood, GLM for large data
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
URN: urn:nbn:se:su:diva-118137
OAI: oai:DiVA.org:su-118137
DiVA: diva2:820454
Funder
VINNOVA, 2010-02635
Available from: 2015-06-12 Created: 2015-06-12 Last updated: 2015-07-30
In thesis
1. Bayesian Inference in Large Data Problems
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the last decade or so, there has been a dramatic increase in storage capacity and in the ability to process huge amounts of data. This has made large high-quality data sets widely accessible to practitioners. This technological innovation seriously challenges traditional modeling and inference methodology.

This thesis is devoted to developing inference and modeling tools to handle large data sets. Four included papers treat various important aspects of this topic, with a special emphasis on Bayesian inference by scalable Markov Chain Monte Carlo (MCMC) methods.

In the first paper, we propose a novel mixture-of-experts model for longitudinal data. The model and inference methodology allow for manageable computations with a large number of subjects. The model dramatically improves the out-of-sample predictive density forecasts compared to existing models.

The second paper aims at developing a scalable MCMC algorithm. Ideas from the survey sampling literature are used to estimate the likelihood on a random subset of data. The likelihood estimate is used within the pseudo-marginal MCMC framework, and we develop a theoretical framework for such algorithms based on subsets of the data.
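How an estimated likelihood enters a pseudo-marginal sampler can be sketched as follows. This is a generic random-walk Metropolis-Hastings illustration, not the thesis's algorithm. It assumes a user-supplied `log_lik_hat(theta, rng)` that returns the log of a likelihood estimate, with any bias correction left to the caller. The names `pseudo_marginal_mh`, `log_prior`, `log_lik_hat` and `step` are hypothetical.

```python
import math
import random


def pseudo_marginal_mh(log_prior, log_lik_hat, theta0, n_iter, step, rng):
    """Random-walk Metropolis-Hastings with an estimated likelihood.

    The estimate attached to the current state is stored and reused,
    never refreshed, until a proposal is accepted.
    """
    theta = list(theta0)
    log_post = log_prior(theta) + log_lik_hat(theta, rng)
    draws = []
    for _ in range(n_iter):
        # Symmetric Gaussian proposal, so the proposal ratio cancels
        prop = [t + step * rng.gauss(0.0, 1.0) for t in theta]
        log_post_prop = log_prior(prop) + log_lik_hat(prop, rng)
        # 1 - random() lies in (0, 1], so the logarithm is always defined
        if math.log(1.0 - rng.random()) < log_post_prop - log_post:
            theta, log_post = prop, log_post_prop
        draws.append(list(theta))
    return draws
```

Reusing the stored estimate for the current state is the design choice that matters here: recomputing it at every iteration would give a different algorithm with a different stationary distribution.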

The third paper further develops the ideas introduced in the second paper. We introduce the difference estimator in this framework and modify the methods for estimating the likelihood on a random subset of data. This results in scalable inference for a wider class of models.

Finally, the fourth paper brings the survey sampling tools for estimating the likelihood developed in the thesis into the delayed acceptance MCMC framework. We compare our algorithm with an existing approach in the literature and document promising results.
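The two-stage structure of delayed acceptance MCMC can be sketched generically as below. This is a textbook-style illustration with a symmetric random-walk proposal on a scalar parameter, not the thesis's algorithm, and it does not show how the subsample-based likelihood estimates are plugged in. The names `delayed_acceptance_mh`, `log_cheap` and `log_target` are hypothetical placeholders for a cheap approximation and the expensive target density.

```python
import math
import random


def delayed_acceptance_mh(log_cheap, log_target, theta0, n_iter, step, rng):
    """Two-stage Metropolis-Hastings for a scalar parameter.

    Stage 1 screens each proposal with the cheap approximation. Only the
    survivors trigger an expensive evaluation in stage 2, which corrects
    for the screening so that the chain still targets log_target.
    """
    theta = float(theta0)
    cheap_cur, target_cur = log_cheap(theta), log_target(theta)
    draws, n_expensive = [], 0
    for _ in range(n_iter):
        prop = theta + step * rng.gauss(0.0, 1.0)
        cheap_prop = log_cheap(prop)
        # Stage 1: accept or reject using only the cheap approximation
        if math.log(1.0 - rng.random()) < cheap_prop - cheap_cur:
            n_expensive += 1
            target_prop = log_target(prop)
            # Stage 2: target ratio divided by the cheap ratio
            log_ratio = (target_prop - target_cur) - (cheap_prop - cheap_cur)
            if math.log(1.0 - rng.random()) < log_ratio:
                theta, cheap_cur, target_cur = prop, cheap_prop, target_prop
        draws.append(theta)
    return draws, n_expensive
```

The returned `n_expensive` counts how often the expensive density was evaluated, which is the quantity the screening stage is meant to reduce relative to `n_iter`.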

Place, publisher, year, edition, pages
Stockholm: Department of Statistics, Stockholm University, 2015. 50 p.
Keyword
Bayesian inference, Large data sets, Markov chain Monte Carlo, Survey sampling, Pseudo-marginal MCMC, Delayed acceptance MCMC
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-118836 (URN)
978-91-7649-199-7 (ISBN)
Public defence
2015-09-07, Ahlmannsalen, Geovetenskapens hus, Svante Arrhenius väg 12, Stockholm, 10:00 (English)
Opponent
Supervisors
Funder
VINNOVA, 2010-02635
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 1: Submitted. Paper 2: Submitted. Paper 3: Manuscript. Paper 4: Manuscript.

Available from: 2015-08-14 Created: 2015-07-08 Last updated: 2015-08-13 Bibliographically approved

Author
Quiroz, Matias