Learning Decision Trees and Random Forests from Histogram Data: An application to component failure prediction for heavy duty trucks
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2017 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Large volumes of data have become commonplace in many domains. Machine learning algorithms can be trained to look for useful hidden patterns in such data. Sometimes, for various reasons, these big data need to be summarized into a manageable size, for example by using histograms. Traditionally, machine learning algorithms are trained on data expressed as real numbers and/or categories, but not on a complex structure such as a histogram. Since machine learning algorithms that can learn from data with histograms have not been explored to a major extent, this thesis intends to further explore this domain.

This thesis is limited to classification algorithms, in particular tree-based classifiers such as decision trees and random forests. Decision trees are among the simplest and most intuitive algorithms to train. A single decision tree might not be the best algorithm in terms of predictive performance, but its performance can be largely enhanced by considering an ensemble of many diverse trees in a random forest. This is why both algorithms were considered. The objective of this thesis is therefore to investigate how these algorithms can be adapted to learn better from histogram data. The proposed approach considers multiple bins of a histogram simultaneously to split a node during tree induction. Treating bins simultaneously is expected to capture dependencies among them, which could be useful. The proposed approaches were evaluated experimentally by comparing them with the standard approach of growing a tree, where a single bin is used to split a node. Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC), along with the average time taken to train a model, were used for comparison. For the experiments, real-world data from a large fleet of heavy duty trucks were used to build a component-failure prediction model. These data contain information about the operation of the trucks over the years, where most operational features are summarized as histograms. Further experiments were performed on synthetically generated datasets. The results show that the proposed approach outperforms the standard approach in predictive performance and model compactness, but lags behind in training time. This thesis was motivated by a real-life problem encountered in the automotive industry while building a data-driven failure-prediction model for heavy duty trucks. All the details about collecting and cleansing the data, and the challenges encountered while making the data ready for training, are therefore presented in detail.
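The standard approach that the thesis compares against treats each histogram bin as an independent numeric feature and searches for the single bin and threshold that best separate the classes. The following is a minimal sketch of that baseline; the function names and the choice of Gini impurity as the split criterion are illustrative assumptions, not the thesis's exact implementation.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def best_single_bin_split(H, y):
    """Standard approach: treat each histogram bin as an independent
    feature and find the (bin, threshold) pair with the lowest
    weighted Gini impurity.  H has shape (n_examples, n_bins)."""
    n, n_bins = H.shape
    best = (None, None, float("inf"))  # (bin index, threshold, impurity)
    for b in range(n_bins):
        for t in np.unique(H[:, b]):
            left = y[H[:, b] <= t]
            right = y[H[:, b] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (b, t, score)
    return best
```

Because each candidate split inspects one bin at a time, this baseline cannot express conditions that depend on several bins jointly, which is the limitation the thesis targets.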

Place, publisher, year, edition, pages
Stockholm: Stockholm University, 2017, p. 66
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-008
Keywords [en]
histogram decision trees, histogram random forest, prognostics
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-149060
OAI: oai:DiVA.org:su-149060
DiVA, id: diva2:1157183
Presentation
2017-11-29, L50, Borgarfjordsgatan 12 (Nod Building), Campus Kista, Stockholm, 10:00 (English)
Available from: 2020-02-17. Created: 2017-11-15. Last updated: 2022-02-28. Bibliographically approved.
List of papers
1. Learning Decision Trees from Histogram Data
2015 (English). In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, CSREA Press, 2015, p. 139-145. Conference paper, Published paper (Refereed).
Abstract [en]

When applying learning algorithms to histogram data, bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from employing the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by number of nodes in the decision tree. These gains are however associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.
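One way to picture a split that uses all bins of a histogram simultaneously, as described above, is to project each histogram onto a single direction in bin-space and then threshold that projection. The sketch below is purely illustrative of the idea of multi-bin splits: the projection direction (difference of class means) and the impurity measure are assumptions for this example, not the paper's exact partitioning rule.

```python
import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) label vector."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 2 * p * (1 - p)

def best_projection_split(H, y):
    """Illustrative multi-bin split: project each histogram onto the
    direction separating the class means, then threshold the 1-D
    projection.  Every bin contributes to the split at once, so
    dependencies among bins can influence where examples go."""
    w = H[y == 1].mean(axis=0) - H[y == 0].mean(axis=0)  # projection direction
    z = H @ w                                            # one value per example
    n = len(y)
    best_t, best_score = None, float("inf")
    for t in np.unique(z):
        left, right = y[z <= t], y[z > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = t, score
    return w, best_t, best_score
```

A split condition of this form (`H @ w <= t`) is more expressive than a single-bin threshold, which matches the paper's observation that multi-bin splits come with more complex split conditions and a higher computational cost.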

Place, publisher, year, edition, pages
CSREA Press, 2015
Keywords
Histogram Learning, Histogram Tree
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-125140 (URN)
978-1-60132-403-0 (ISBN)
Conference
11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015
Available from: 2016-01-08. Created: 2016-01-08. Last updated: 2022-02-23. Bibliographically approved.
2. Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins
2016 (English). In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press, 2016, p. 430-435. Conference paper, Published paper (Refereed).
Abstract [en]

The standard approach to learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we instead propose a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real-world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
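The sliding-window idea described above can be sketched as generating contiguous groups of bins as candidate subsets, rather than enumerating every possible subset. The helper below is a minimal illustration under that assumption; the abstract does not specify the exact window scheme, so the contiguous-window choice here is illustrative.

```python
def sliding_windows(n_bins, width):
    """Candidate bin subsets under a sliding-window scheme: contiguous
    groups of `width` bins.  This yields only n_bins - width + 1
    candidates instead of the 2**n_bins possible subsets, which is
    what makes larger histograms tractable."""
    return [list(range(i, i + width)) for i in range(n_bins - width + 1)]

# Per the abstract, the original single bins are also evaluated
# independently, i.e. width-1 windows are always included:
def candidate_subsets(n_bins, width):
    return sliding_windows(n_bins, 1) + sliding_windows(n_bins, width)
```

For example, `sliding_windows(5, 3)` produces `[[0, 1, 2], [1, 2, 3], [2, 3, 4]]`, so a split search only ever considers three multi-bin subsets for a 5-bin histogram.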

Place, publisher, year, edition, pages
AAAI Press, 2016
Keywords
histogram variables, histogram tree, histogram classifier
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135432 (URN)
978-1-57735-756-8 (ISBN)
Conference
Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016
Available from: 2016-11-08. Created: 2016-11-08. Last updated: 2022-02-28. Bibliographically approved.

Open Access in DiVA

fulltext (3044 kB), 605 downloads
File information
File name: FULLTEXT01.pdf
File size: 3044 kB
Checksum: SHA-512
6e8f6d2ac6a3f039da25041e9192fd1bc308f79d392cdef0cacf9ba69abaa4736abac1a124a6cc3af88c174599163617020ba7373bb822e31693251e76f751ab
Type: fulltext. Mimetype: application/pdf

Authority records

Gurung, Ram Bahadur

Total: 606 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 283 hits