Bayesian Inference in Large Data Problems
Stockholm University, Faculty of Social Sciences, Department of Statistics.
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Over the last decade or so, storage capacity and the ability to process huge amounts of data have increased dramatically, making large, high-quality data sets widely accessible to practitioners. This technological innovation seriously challenges traditional modeling and inference methodology.

This thesis is devoted to developing inference and modeling tools to handle large data sets. Four included papers treat various important aspects of this topic, with a special emphasis on Bayesian inference by scalable Markov Chain Monte Carlo (MCMC) methods.

In the first paper, we propose a novel mixture-of-experts model for longitudinal data. The model and inference methodology allow for manageable computations with a large number of subjects. The model dramatically improves out-of-sample predictive density forecasts compared to existing models.

The second paper aims at developing a scalable MCMC algorithm. Ideas from the survey sampling literature are used to estimate the likelihood on a random subset of data. The likelihood estimate is used within the pseudomarginal MCMC framework and we develop a theoretical framework for such algorithms based on subsets of the data.

The third paper further develops the ideas introduced in the second paper. We introduce the difference estimator in this framework and modify the methods for estimating the likelihood on a random subset of data. This results in scalable inference for a wider class of models.

Finally, the fourth paper brings the survey sampling tools for estimating the likelihood developed in the thesis into the delayed acceptance MCMC framework. We compare to an existing approach in the literature and document promising results for our algorithm.

Place, publisher, year, edition, pages
Stockholm: Department of Statistics, Stockholm University, 2015. 50 pp.
Keywords [en]
Bayesian inference, Large data sets, Markov chain Monte Carlo, Survey sampling, Pseudo-marginal MCMC, Delayed acceptance MCMC
National subject category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
URN: urn:nbn:se:su:diva-118836
ISBN: 978-91-7649-199-7 (print)
OAI: oai:DiVA.org:su-118836
DiVA, id: diva2:840507
Public defence
2015-09-07, Ahlmannsalen, Geovetenskapens hus, Svante Arrhenius väg 12, Stockholm, 10:00 (English)
Research funder
Vinnova, 2010-02635
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 1: Submitted. Paper 2: Submitted. Paper 3: Manuscript. Paper 4: Manuscript.

Available from: 2015-08-14 Created: 2015-07-08 Last updated: 2022-02-23 Bibliographically approved
List of papers
1. Dynamic mixture-of-experts models for longitudinal and discrete-time survival data
(English) Manuscript (preprint) (Other academic)
Abstract [en]

We propose a general class of flexible models for longitudinal data with special emphasis on discrete-time survival data. The model is a finite mixture model where the subjects are allowed to move between components through time. The time-varying probabilities of component memberships are modeled as a function of subject-specific time-varying covariates. This allows for interesting within-subject dynamics and manageable computations even with a large number of subjects. Each parameter in the component densities and in the mixing function is connected to its own set of covariates through a link function. The models are estimated using a Bayesian approach via a highly efficient Markov Chain Monte Carlo (MCMC) algorithm with tailored proposals and variable selection in all sets of covariates. The focus of the paper is on models for discrete-time survival data with an application to bankruptcy prediction for Swedish firms, using both exponential and Weibull mixture components. The dynamic mixture-of-experts models are shown to have an interesting interpretation and to dramatically improve the out-of-sample predictive density forecasts compared to models with time-invariant mixture probabilities.
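The key mechanism above, a link function mapping subject-specific time-varying covariates to component-membership probabilities, can be sketched as follows. This is a minimal illustration assuming a plain multinomial-logit (softmax) link; the function name and the exact parameterization are illustrative assumptions, not the paper's model.

```python
import math

def mixture_weights(x, coefs):
    """Map a subject's covariate vector x to mixture-component
    probabilities via a multinomial-logit (softmax) link.
    coefs: one coefficient vector per mixture component (hypothetical form)."""
    scores = [sum(b * xi for b, xi in zip(beta, x)) for beta in coefs]
    mx = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because the covariates are time-varying, re-evaluating this link at each time point lets a subject's component probabilities, and hence its component membership, move over time.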

Keywords
Bayesian inference, Markov chain Monte Carlo, Bayesian variable selection, Survival analysis, Mixture-of-experts
National subject category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-118133 (URN)
Research funder
VINNOVA, 2010-02635
Available from: 2015-06-12 Created: 2015-06-12 Last updated: 2022-02-23
2. Speeding up MCMC by efficient data subsampling
(English) Manuscript (preprint) (Other academic)
Abstract [en]

The computing time for Markov Chain Monte Carlo (MCMC) algorithms can be prohibitively large for data sets with many observations, especially when the data density for each observation is costly to evaluate. We propose a framework where the likelihood function is estimated from a random subset of the data, resulting in substantially fewer density evaluations. The data subsets are selected using an efficient Probability Proportional-to-Size (PPS) sampling scheme, where the inclusion probability of an observation is proportional to an approximation of its contribution to the log-likelihood function. Three broad classes of approximations are presented. The proposed algorithm is shown to sample from a distribution that is within O(m^(-1/2)) of the true posterior, where m is the subsample size. Moreover, the constant in the O(m^(-1/2)) error bound of the likelihood is shown to be small and the approximation error is demonstrated to be negligible even for a small m in our applications. We propose a simple way to adaptively choose the sample size m during the MCMC to optimize sampling efficiency for a fixed computational budget. The method is applied to a bivariate probit model on a data set with half a million observations, and on a Weibull regression model with random effects for discrete-time survival data.
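The PPS idea above, sampling observations with probability proportional to an approximation of their log-likelihood contribution, can be sketched as a Hansen-Hurwitz style with-replacement estimator. The function name, the with-replacement design, and the size measures are illustrative simplifications, not the paper's exact scheme.

```python
import random

def pps_loglik_estimate(loglik_term, approx, m, seed=0):
    """Unbiased PPS estimate of the full-data log-likelihood sum.

    loglik_term: callable returning observation i's (costly) log-likelihood term
    approx:      cheap positive size measures approximating each term's magnitude
    m:           subsample size (number of costly evaluations performed)
    """
    rng = random.Random(seed)
    n = len(approx)
    total = sum(approx)
    probs = [w / total for w in approx]
    # draw m indices with replacement, probability proportional to size
    idx = rng.choices(range(n), weights=probs, k=m)
    # Hansen-Hurwitz estimator: average of inverse-probability-weighted terms
    return sum(loglik_term(i) / (m * probs[i]) for i in idx)
```

When the size measures track the true contributions closely, the estimator's variance is small, which is the motivation for building good cheap approximations of the log-likelihood terms.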

Keywords
Bayesian inference, Markov Chain Monte Carlo, Pseudo-marginal MCMC, Big Data, Probability Proportional-to-Size sampling, Numerical integration
National subject category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-118134 (URN)
Research funder
VINNOVA, 2010-02635
Available from: 2015-06-12 Created: 2015-06-12 Last updated: 2022-02-23
3. Scalable MCMC for large data problems using data subsampling and the difference estimator
(English) Manuscript (preprint) (Other academic)
Abstract [en]

We propose a generic Markov Chain Monte Carlo (MCMC) algorithm to speed up computations for data sets with many observations. A key feature of our approach is the use of the highly efficient difference estimator from the survey sampling literature to estimate the log-likelihood accurately using only a small fraction of the data. Our algorithm improves on the O(n) complexity of regular MCMC by operating over local data clusters instead of the full sample when computing the likelihood. The likelihood estimate is used in a pseudo-marginal framework to sample from a perturbed posterior which is within O(m^(-1/2)) of the true posterior, where m is the subsample size. The method is applied to a logistic regression model to predict firm bankruptcy for a large data set. We document a significant speed up in comparison to the standard MCMC on the full data set.
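The difference estimator mentioned above corrects an exact sum of cheap proxy terms with a subsampled estimate of the proxy error. A minimal sketch, assuming simple random sampling for the correction term; the function name and the SRS design are illustrative assumptions, not the paper's clustered scheme.

```python
import random

def difference_estimator(loglik, proxy, n, m, seed=0):
    """Difference estimator of sum_i loglik(i).

    loglik: costly per-observation log-likelihood term
    proxy:  cheap approximation of loglik, summed exactly over all n observations
    m:      subsample size used to estimate the total proxy error
    """
    rng = random.Random(seed)
    proxy_sum = sum(proxy(i) for i in range(n))   # cheap full-data pass
    sample = rng.sample(range(n), m)              # simple random sample, no replacement
    # expand the sampled residuals to estimate the total residual
    correction = (n / m) * sum(loglik(i) - proxy(i) for i in sample)
    return proxy_sum + correction
```

The better the proxy tracks the true terms, the smaller the residuals and hence the variance of the correction, which is why this estimator can be far more efficient than directly subsampling the log-likelihood itself.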

Keywords
Bayesian inference, Markov Chain Monte Carlo, Pseudo-marginal MCMC, estimated likelihood, GLM for large data
National subject category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-118137 (URN)
Research funder
VINNOVA, 2010-02635
Available from: 2015-06-12 Created: 2015-06-12 Last updated: 2022-02-23
4. Speeding up MCMC by delayed acceptance and data subsampling
(English) Manuscript (preprint) (Other academic)
Abstract [en]

The complexity of Markov Chain Monte Carlo (MCMC) algorithms arises from the requirement of a likelihood evaluation for the full data set in each iteration. Payne and Mallick (2014) propose to speed up the Metropolis-Hastings algorithm by a delayed acceptance approach where the acceptance decision proceeds in two stages. In the first stage, an estimate of the likelihood based on a random subsample determines if it is likely that the draw will be accepted and, if so, the second stage uses the full data likelihood to decide upon final acceptance. Evaluating the full data likelihood is thus avoided for draws that are unlikely to be accepted. We propose a more precise likelihood estimator which incorporates auxiliary information about the full data likelihood while only operating on a sparse set of the data. It is proved that the resulting delayed acceptance MCMC is asymptotically more efficient compared to that of Payne and Mallick (2014). Furthermore, we adapt the method to handle data sets that are too large to fit in Random-Access Memory (RAM). This adaptation results in an algorithm that samples from an approximate posterior whose theoretical properties are well studied in the literature.
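The two-stage acceptance decision described above can be sketched as a single delayed-acceptance Metropolis-Hastings step. This is a minimal sketch assuming a symmetric proposal; `cheap_logpost` stands in for a subsample-based estimate of the log posterior, and all names are illustrative assumptions rather than the paper's implementation.

```python
import math
import random

def delayed_acceptance_step(theta, propose, cheap_logpost, full_logpost, rng):
    """One delayed-acceptance Metropolis-Hastings step (symmetric proposal).

    Stage 1 screens the proposal with a cheap surrogate posterior;
    stage 2 evaluates the full-data posterior only for promising draws."""
    prop = propose(theta, rng)
    # Stage 1: screening acceptance probability from the cheap surrogate
    a1 = math.exp(min(0.0, cheap_logpost(prop) - cheap_logpost(theta)))
    if rng.random() >= a1:
        return theta        # early rejection: full data never touched
    # Stage 2: full-data ratio, corrected by the stage-1 screening probability
    log_a2 = (full_logpost(prop) - full_logpost(theta)) - math.log(a1)
    if rng.random() < math.exp(min(0.0, log_a2)):
        return prop
    return theta
```

Dividing out the stage-1 probability in stage 2 keeps the overall acceptance probability equal to the standard Metropolis-Hastings one, so the chain still targets the exact full-data posterior while skipping most full-data evaluations.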

Keywords
MCMC, delayed acceptance, data subsampling, large data
National subject category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-118140 (URN)
Research funder
VINNOVA, 2010-02635
Available from: 2015-06-12 Created: 2015-06-12 Last updated: 2022-02-23

Open Access in DiVA

kappa fulltext (321 kB), 1094 downloads
File information
File name: FULLTEXT01.pdf  File size: 321 kB  Checksum: SHA-512
b175f9f408dce5d3f3ff4b8aea05e7bf782bbb1c0ffe0ca55b1fea5779e33ccbae4e42536f67e875a7343765e244b1b5d29bd06e94694def03522640bdf31038
Type: fulltext  Mimetype: application/pdf

Author

Quiroz, Matias

Total: 1094 downloads
The download count is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.
