Classification models for high-dimensional data with sparsity patterns
Stockholm University, Faculty of Social Sciences, Department of Statistics.
2013 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Today's high-throughput data collection devices, e.g. spectrometers and gene chips, create information in abundance. However, this poses serious statistical challenges, as the number of features is usually much larger than the number of observed units. Further, in this high-dimensional setting, only a small fraction of the features are likely to be informative for any specific project. In this thesis, three different approaches to two-class supervised classification in this high-dimensional, low-sample-size setting are considered.

There are classifiers that are known to mitigate the issues of high dimensionality, e.g. distance-based classifiers such as Naive Bayes. However, these classifiers are often computationally intensive, and their computations are considerably less time-consuming for discrete data. Hence, continuous features are often transformed into discrete ones. In the first paper, a discretization algorithm suitable for high-dimensional data is suggested and compared with other discretization approaches. Further, the effect of discretization on the misclassification probability in the high-dimensional setting is evaluated.

Linear classifiers are more stable, which motivates adapting the linear discriminant procedure to the high-dimensional setting. In the second paper, a two-stage procedure for estimating the inverse covariance matrix, applying Lasso-based regularization and Cuthill-McKee ordering, is suggested. The estimation gives a block-diagonal approximation of the covariance matrix, which in turn leads to an additive classifier. In the third paper, an asymptotic framework that represents sparse and weak block models is derived, and a technique for block-wise feature selection is proposed.

Probabilistic classifiers have the advantage of providing, for a new observation, the probability of membership in each class rather than simply assigning it to a class. In the fourth paper, a method is developed for constructing a Bayesian predictive classifier. Given the block-diagonal covariance matrix, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller, tractable problems.

The relevance and benefits of the proposed methods are illustrated using both simulated and real data.

Abstract [sv]

With today's technology, e.g. spectrometers and gene chips, data are generated in vast quantities. This abundance of data is not only an advantage but also causes certain problems: typically, the number of variables (p) is considerably larger than the number of observations (n). This yields so-called high-dimensional data, which require new statistical methods, since the traditional ones were developed for the reverse situation (p < n). Moreover, usually only very few of all these variables are relevant for any given project, and the strength of the information in the relevant variables is often weak. This type of data is therefore referred to as sparse and weak. Identifying the relevant variables is commonly likened to finding a needle in a haystack.

This thesis considers three different ways of performing classification in this type of high-dimensional data. Here, classification means that, given access to a data set containing both explanatory variables and an outcome variable, a function or algorithm is trained to predict the outcome variable from the explanatory variables alone. The real data used in the thesis are microarrays: cell samples showing the activity of the genes in the cell. The goal of the classification is to use the variation in activity across thousands of genes (the explanatory variables) to determine whether a cell sample comes from cancer tissue or normal tissue (the outcome variable).

There are classification methods that can handle high-dimensional data, but these are often computationally intensive and therefore tend to work better for discrete data. By transforming continuous variables into discrete ones (discretization), the computation time can be reduced and the classification made more efficient. The thesis studies how discretization affects the prediction accuracy of the classification, and a very efficient discretization method for high-dimensional data is proposed.

Linear classification methods have the advantage of being stable. Their drawback is that they require an invertible covariance matrix, which the covariance matrix of high-dimensional data is not. The thesis proposes a way of estimating the inverse of sparse covariance matrices by a block-diagonal matrix. This matrix also has the advantage of leading to an additive classification, which makes it possible to select whole blocks of relevant variables. The thesis also presents a method for identifying and selecting the blocks.

There are also probabilistic classification methods, which have the advantage of giving, for an observation, the probability of belonging to each of the possible outcomes, unlike most other classification methods, which only predict the outcome. The thesis proposes such a Bayesian method, given the block-diagonal matrix and normally distributed outcome classes.

The relevance and benefits of the methods proposed in the thesis are demonstrated by applying them to simulated and real high-dimensional data.

Place, publisher, year, edition, pages
Stockholm: Department of Statistics, Stockholm University, 2013. 17 p.
Keyword [en]
High-dimensionality, supervised classification, classification accuracy, sparse, block-diagonal covariance structure, graphical Lasso, separation strength, discretization
National Category
Mathematics
Research subject
Statistics
Identifiers
URN: urn:nbn:se:su:diva-95664
ISBN: 978-91-7447-772-6 (print)
OAI: oai:DiVA.org:su-95664
DiVA: diva2:661069
Public defence
2013-12-05, lecture hall 2, building A, Universitetsvägen 10 A, Stockholm, 10:00 (English)
Available from: 2013-11-13. Created: 2013-10-31. Last updated: 2013-11-04. Bibliographically approved.
List of papers
1. Effect of Data Discretization on the Classification Accuracy in a High-Dimensional Framework
2012 (English). In: International Journal of Intelligent Systems, ISSN 0884-8173, E-ISSN 1098-111X, Vol. 27, no. 4, pp. 355-374. Article in journal (Refereed). Published.
Abstract [en]

We investigate discretization of continuous variables for classification problems in a high-dimensional framework. As the goal of classification is to correctly predict the class membership of an observation, we suggest a discretization method that optimizes the discretization procedure using the misclassification probability as a measure of classification accuracy. Our method is compared to several other discretization methods, as well as to results for continuous data. To compare performance we consider three supervised classification methods, and to capture the effect of high dimensionality we investigate a number of feature variables for a fixed number of observations. Since discretization is a data transformation procedure, we also investigate how the dependence structure is affected by it. Our method performs well, and lower misclassification can be obtained in a high-dimensional framework for both simulated and real data if the continuous feature variables are first discretized. The dependence structure is well maintained for some discretization methods.
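
The paper's method tunes the discretization against the misclassification probability itself. As a rough illustration of that idea (not the paper's actual algorithm), the sketch below picks a bin count by cross-validated error, using scikit-learn's KBinsDiscretizer with quantile cut points and a Naive Bayes classifier; the candidate grid and the choice of classifier are assumptions made for this example.

# Minimal sketch: choose the number of bins by estimated misclassification.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def best_bin_count(X, y, candidates=(2, 3, 4, 5, 8)):
    """Return the bin count with the lowest cross-validated error."""
    best_k, best_err = None, np.inf
    for k in candidates:
        pipe = make_pipeline(
            KBinsDiscretizer(n_bins=k, encode="ordinal", strategy="quantile"),
            GaussianNB(),
        )
        err = 1.0 - cross_val_score(pipe, X, y, cv=5).mean()
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

A single bin count is tuned globally here for brevity; a per-feature search follows the same pattern with an inner loop over features.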

National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-76111 (URN)
10.1002/int.21527 (DOI)
000301654000004
Available from: 2012-05-10. Created: 2012-05-09. Last updated: 2017-12-07. Bibliographically approved.
2. Covariance structure approximation via glasso in high dimensional supervised classification
2012 (English). In: Journal of Applied Statistics, ISSN 0266-4763, E-ISSN 1360-0532, Vol. 39, no. 8, pp. 1643-1666. Article in journal (Refereed). Published.
Abstract [en]

Recent work has shown that Lasso-based regularization is very useful for estimating the high-dimensional inverse covariance matrix. A particularly useful scheme is based on penalizing the l1 norm of the off-diagonal elements to encourage sparsity. We embed this type of regularization into high-dimensional classification. A two-stage estimation procedure is proposed which first recovers structural zeros of the inverse covariance matrix and then enforces block sparsity by moving non-zeros closer to the main diagonal. We show that the block-diagonal approximation of the inverse covariance matrix leads to an additive classifier, and demonstrate that accounting for the structure can yield better classification accuracy. The effect of block size on classification is explored, and a class of asymptotically equivalent structure approximations in a high-dimensional setting is specified. We suggest variable selection at the block level and investigate properties of this procedure in growing-dimension asymptotics. We present a consistency result on the feature selection procedure, establish asymptotic lower and upper bounds for the fraction of separative blocks, and specify constraints under which reliable classification with block-wise feature selection can be performed. The relevance and benefits of the proposed approach are illustrated on both simulated and real data.
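
As a concrete illustration of the two-stage idea, the sketch below first estimates a sparse precision matrix with the graphical lasso and then applies reverse Cuthill-McKee ordering to pull the recovered non-zeros toward the main diagonal, ready to be cut into diagonal blocks. GraphicalLasso and reverse_cuthill_mckee are real scikit-learn/SciPy routines; the alpha value and the support threshold are illustrative assumptions, and this is a sketch of the general scheme rather than the paper's exact procedure.

# Stage 1: recover structural zeros of the inverse covariance (glasso).
# Stage 2: reorder variables so non-zeros cluster near the main diagonal.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee
from sklearn.covariance import GraphicalLasso

def banded_support(X, alpha=0.1, tol=1e-8):
    """Return a variable ordering and the reordered sparsity pattern."""
    precision = GraphicalLasso(alpha=alpha).fit(X).precision_
    support = csr_matrix((np.abs(precision) > tol).astype(int))
    order = reverse_cuthill_mckee(support, symmetric_mode=True)
    return order, support.toarray()[np.ix_(order, order)]

Each diagonal block of the reordered pattern can then be treated as an independent discriminant sub-problem, with the per-block scores summed, which is the additive classifier described in the abstract.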

Keyword
high dimensionality, classification accuracy, sparsity, block-diagonal covariance structure, graphical Lasso, separation strength
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-80118 (URN)
10.1080/02664763.2012.663346 (DOI)
000305486300002
Note

AuthorCount: 3
Available from: 2012-09-18. Created: 2012-09-12. Last updated: 2017-12-07. Bibliographically approved.
3. Empirical evaluation of sparse classification boundaries and HC-feature thresholding in high-dimensional data
2013 (English). Report (Other academic)
Abstract [en]

The analysis of high-throughput data commonly used in modern applications poses many statistical challenges, one of which is the selection of a small subset of features that are likely to be informative for a specific project. This issue is crucial for the success of supervised classification in a very high-dimensional setting with sparsity patterns. In this paper, we derive an asymptotic framework that represents the sparse and weak block model and suggest a technique for block-wise feature selection by thresholding. Our procedure extends standard Higher Criticism (HC) thresholding to the case where the dependence structure underlying the data can be taken into account, and is shown to be optimally adaptive, i.e., it performs well without knowledge of the sparsity and weakness parameters. We empirically investigate the detection boundary of our HC procedure and the performance properties of some estimators of the sparsity parameter. The relevance and benefits of our approach in high-dimensional classification are demonstrated using both simulated and real data.
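
For reference, the sketch below implements the standard Higher Criticism thresholding that the paper extends; the block-wise, dependence-aware version developed in the paper is not reproduced here. It assumes one p-value per feature (e.g. from two-sample t-tests) has already been computed.

# Standard HC thresholding: maximize the HC objective over the sorted
# p-values and keep all features at or below the selected cutoff.
import numpy as np

def hc_threshold(pvals, alpha0=0.5):
    """Return the p-value cutoff chosen by the Higher Criticism objective."""
    p = len(pvals)
    srt = np.sort(pvals)
    i = np.arange(1, p + 1)
    hc = np.sqrt(p) * (i / p - srt) / np.sqrt(srt * (1.0 - srt) + 1e-12)
    search = max(1, int(alpha0 * p))   # search only the smallest p-values
    istar = int(np.argmax(hc[:search]))
    return srt[istar]

Features whose p-values fall at or below the returned cutoff are retained; a block-wise variant would evaluate the same objective on block-level statistics.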

Pages
37 p.
Series
Research Report / Department of Statistics, Stockholm University, ISSN 0280-7564 ; 2013:5
Keyword
Higher criticism, detection boundary, high dimensionality, supervised classification, separation strength
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-95263 (URN)
Available from: 2013-10-24. Created: 2013-10-24. Last updated: 2013-11-01. Bibliographically approved.
4. Bayesian Block-Diagonal Predictive Classifier for Gaussian Data
2013 (English). In: Synergies of Soft Computing and Statistics for Intelligent Data Analysis / [ed.] Rudolf Kruse, Michael R. Berthold, Christian Moewes, María Ángeles Gil, Przemysław Grzegorzewski, Olgierd Hryniewicz. Springer Berlin/Heidelberg, 2013, pp. 543-551. Chapter in book (Refereed).
Abstract [en]

The paper presents a method for constructing a Bayesian predictive classifier in a high-dimensional setting. Given that classes are represented by Gaussian distributions with a block-structured covariance matrix, a closed-form expression for the posterior predictive distribution of the data is established. Due to the factorization of this distribution, the resulting Bayesian predictive and marginal classifier provides an efficient solution to the high-dimensional problem by splitting it into smaller tractable problems. In a simulation study we show that the suggested classifier outperforms several alternative algorithms, such as linear discriminant analysis based on block-wise inverse covariance estimators and the shrunken centroids regularized discriminant analysis.
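
The closed-form predictive distribution factorizes over the covariance blocks, so each block contributes an independent term. The sketch below illustrates that structure with a standard normal-inverse-Wishart conjugate model per block, whose posterior predictive is a multivariate t; the vague hyperparameters (mu0, kappa0, psi0, nu0) are assumptions made for the example, not the paper's choices, and scipy.stats.multivariate_t requires SciPy 1.6 or later.

# Per-block posterior predictive (multivariate t under a NIW prior),
# summed across blocks; classify by the largest total log density.
import numpy as np
from scipy.stats import multivariate_t

def block_log_predictive(x_blk, X_blk, kappa0=1.0):
    """Log posterior predictive density of x_blk given training block X_blk."""
    n, d = X_blk.shape
    nu0 = d + 2.0                            # weakly informative prior df
    mu0, psi0 = np.zeros(d), np.eye(d)       # vague location and scale
    xbar = X_blk.mean(axis=0)
    S = (X_blk - xbar).T @ (X_blk - xbar)
    kappa_n, nu_n = kappa0 + n, nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    diff = (xbar - mu0)[:, None]
    psi_n = psi0 + S + (kappa0 * n / kappa_n) * (diff @ diff.T)
    df = nu_n - d + 1                        # predictive degrees of freedom
    shape = psi_n * (kappa_n + 1) / (kappa_n * df)
    return multivariate_t.logpdf(x_blk, loc=mu_n, shape=shape, df=df)

def classify(x, X_by_class, blocks):
    """Assign x to the class maximizing the summed block log predictives."""
    scores = {c: sum(block_log_predictive(x[b], Xc[:, b]) for b in blocks)
              for c, Xc in X_by_class.items()}
    return max(scores, key=scores.get)

Because the density splits into per-block factors, each small block is handled on its own, which is what makes the high-dimensional problem tractable, as the abstract notes.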

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Advances in Intelligent Systems and Computing, ISSN 2194-5357 ; 190
Keyword
Covariance estimators, discriminant analysis, high-dimensional data, hyperparameters
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:su:diva-95262 (URN)
10.1007/978-3-642-33042-1_58 (DOI)
978-3-642-33041-4 (ISBN)
978-3-642-33042-1 (ISBN)
Available from: 2013-10-24. Created: 2013-10-24. Last updated: 2017-11-13. Bibliographically approved.

Open Access in DiVA

fulltext: FULLTEXT01.pdf (479 kB, application/pdf)

Search in DiVA

By author/editor
Tillander, Annika
By organisation
Department of Statistics