Random Forest for Histogram Data: An application in data-driven prognostic models for heavy-duty trucks
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Data mining and machine learning algorithms are trained on large datasets to find useful hidden patterns. These patterns can yield new insights and support accurate predictions. Usually, the training data is structured in a tabular format, where the rows represent the training instances and the columns represent the features of these instances. The feature values are usually real numbers and/or categories. As very large volumes of digital data become available in many domains, the data is often summarized into manageable sizes for efficient handling. Aggregating data into histograms is one way to reduce its size. However, traditional machine learning algorithms have a limited ability to learn from such data, and this thesis explores extensions of the algorithms to allow for more effective learning from histogram data.

The thesis focuses on the decision tree and random forest algorithms, which are easy to understand and implement. Although a single decision tree may not result in the highest predictive performance, one of its benefits is that it often allows for easy interpretation. By combining many such diverse trees into a random forest, the performance can be greatly enhanced, but at the cost of reduced interpretability. By first finding out how to effectively train a single decision tree from histogram data, these findings could be carried over to building robust random forests from such data. The overarching research question for the thesis is: How can the random forest algorithm be improved to learn more effectively from histogram data, and how can the resulting models be interpreted? An experimental approach was taken, under the positivist paradigm, in order to answer the question. The thesis investigates how the standard decision tree and random forest algorithms can be adapted to make them learn more accurate models from histogram data. Experimental evaluations of the proposed changes were carried out on both real-world data and synthetically generated experimental data. The real-world data was taken from the automotive domain, concerning the operation and maintenance of heavy-duty trucks. Component failure prediction models were built from the operational data of a large fleet of trucks, where the information about their operation over many years has been summarized as histograms. The experimental results showed that the proposed approaches were more effective than the original algorithms, which treat bins of histograms as separate features. The thesis also contributes towards the interpretability of random forests by evaluating an interactive visual tool for assisting users to understand the reasons behind the output of the models.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2020, p. 74
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 20-003
Keywords [en]
Histogram data, random forest, NOx sensor failure, random forest interpretation
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-178776
ISBN: 978-91-7911-024-6 (print)
ISBN: 978-91-7911-025-3 (electronic)
OAI: oai:DiVA.org:su-178776
DiVA, id: diva2:1391878
Public defence
2020-03-20, Ka-Sal C (Sven-Olof Öhrvik), Electrum 1, våningsplan 2, Kistagången 16, KTH Kista, Stockholm, 10:00 (English)
Note

At the time of the doctoral defense, the following paper was unpublished and had the following status: Paper 6: Accepted.

Available from: 2020-02-26. Created: 2020-02-05. Last updated: 2020-02-18. Bibliographically approved
List of papers
1. Learning Decision Trees from Histogram Data
2015 (English). In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, CSREA Press, 2015, p. 139-145. Conference paper, Published paper (Refereed)
Abstract [en]

When applying learning algorithms to histogram data, bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from applying the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by the number of nodes in the decision tree. These gains are, however, associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.
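For orientation, the baseline that the abstract contrasts against (each bin treated as a separate numeric variable) can be sketched as follows. This is only an illustrative sketch, not the method proposed in the paper: the function names, the use of Gini impurity as the split criterion, and the midpoint thresholds are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_single_bin_split(
    histograms: List[List[float]], labels: List[int]
) -> Tuple[int, float, float]:
    """Baseline split search (assumed for illustration): every bin is an
    independent feature, and a split has the form histogram[bin] <= threshold.

    Returns (bin_index, threshold, weighted_gini). bin_index is -1 when no
    split improves on the impurity of the unsplit node.
    """
    n = len(labels)
    best = (-1, 0.0, gini(labels))
    for b in range(len(histograms[0])):
        values = sorted(set(h[b] for h in histograms))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for h, y in zip(histograms, labels) if h[b] <= t]
            right = [y for h, y in zip(histograms, labels) if h[b] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (b, t, score)
    return best
```

Because each candidate split inspects a single bin, any relationship between bins (for example, mass shifting from one bin to its neighbour) is invisible to this search, which is the limitation the abstract points to.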

Place, publisher, year, edition, pages
CSREA Press, 2015
Keywords
Histogram Learning, Histogram Tree
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-125140 (URN)
978-1-60132-403-0 (ISBN)
Conference
11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015
Available from: 2016-01-08. Created: 2016-01-08. Last updated: 2020-02-05. Bibliographically approved
2. Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins
2016 (English). In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press, 2016, p. 430-435. Conference paper, Published paper (Refereed)
Abstract [en]

The standard approach to learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real-world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
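The candidate generation described in this abstract can be sketched roughly as below. It assumes that a window covers consecutive bins of a one-dimensional histogram and that the window size is a fixed parameter; the function names are hypothetical and the paper's actual procedure may differ in detail.

```python
from typing import List, Tuple


def sliding_windows(n_bins: int, window_size: int) -> List[Tuple[int, ...]]:
    """All windows of `window_size` consecutive bin indices (assumed layout)."""
    if not 1 <= window_size <= n_bins:
        raise ValueError("window_size must be between 1 and n_bins")
    return [
        tuple(range(start, start + window_size))
        for start in range(n_bins - window_size + 1)
    ]


def candidate_bin_sets(n_bins: int, window_size: int) -> List[Tuple[int, ...]]:
    """Bin sets to evaluate at a node: every single bin (the standard
    approach) plus every sliding window of bins considered together."""
    singles = [(b,) for b in range(n_bins)]
    if window_size == 1:
        return singles
    return singles + sliding_windows(n_bins, window_size)
```

Under these assumptions the number of candidate sets grows linearly with the number of bins: a histogram with n bins and window size w yields n single bins and n - w + 1 windows.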

Place, publisher, year, edition, pages
AAAI Press, 2016
Keywords
histogram variables, histogram tree, histogram classifier
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135432 (URN)
978-1-57735-756-8 (ISBN)
Conference
Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016
Available from: 2016-11-08. Created: 2016-11-08. Last updated: 2020-02-05. Bibliographically approved
3. Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests
2017 (English). In: International Journal of Prognostics and Health Management, ISSN 2153-2648, E-ISSN 2153-2648, Vol. 8, no 1, article id 008. Article in journal (Refereed), Published
Abstract [en]

Being able to accurately predict the impending failures of truck components is often associated with significant cost savings, customer satisfaction and flexibility in maintenance service plans. However, because of the diversity in the way trucks typically are configured and their usage under different conditions, the creation of accurate prediction models is not an easy task. This paper describes an effort in creating such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified Random Forest learning algorithm is described along with various challenges encountered during this process. The operational data consists of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments are conducted using the updated random forest algorithm, and they clearly show that the modified version is indeed beneficial when compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising and may be adopted for the benefit of operators of heavy trucks.

Keywords
Histogram Features, NOx sensor prognostics, Histogram-based random forest
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149432 (URN)
Available from: 2017-11-30. Created: 2017-11-30. Last updated: 2020-02-05. Bibliographically approved
4. Learning Random Forest from Histogram Data Using Split Specific Axis Rotation
2018 (English). In: International Journal of Machine Learning and Computing, ISSN 2010-3700, Vol. 8, no 1, p. 74-79. Article in journal (Refereed), Published
Abstract [en]

Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of the bins. The standard approach of representing bins as separate variables may cause the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, a sliding window of fixed size is employed by the proposed algorithm to constrain the sets of bins that are considered together. A small number of all possible sets of bins are randomly selected and principal component analysis (PCA) is applied locally on all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests together with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy-duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
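The idea of rotating the axes locally before evaluating splits can be illustrated with a deliberately small sketch. To keep it dependency-free, it is restricted to a window of exactly two bins, where the principal axes of the 2x2 covariance matrix have a closed form; a general window size would need a full PCA routine. The function names, the Gini criterion and the midpoint thresholds are assumptions made for the example, not details taken from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def gini(labels: Sequence[int]) -> float:
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts: Dict[int, int] = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def pca_axes_2d(points: List[Tuple[float, float]]):
    """Mean and the two principal axes of 2-D points (closed form)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    # Orientation of the major axis of the scatter matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    first = (math.cos(theta), math.sin(theta))
    second = (-math.sin(theta), math.cos(theta))
    return (mx, my), [first, second]


def best_rotated_split(
    histograms: List[List[float]], labels: List[int], window: Tuple[int, int]
) -> Tuple[int, float, float]:
    """Apply PCA to the two bins in `window` for the examples in a node and
    evaluate threshold splits on each principal component.

    Returns (component_index, threshold, weighted_gini); component_index is
    -1 when no split improves on the unsplit node.
    """
    i, j = window
    points = [(h[i], h[j]) for h in histograms]
    (mx, my), axes = pca_axes_2d(points)
    n = len(labels)
    best = (-1, 0.0, gini(labels))
    for c, (ax, ay) in enumerate(axes):
        # Projection of each centred example onto this principal axis.
        scores = [(p[0] - mx) * ax + (p[1] - my) * ay for p in points]
        values = sorted(set(scores))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for s, y in zip(scores, labels) if s <= t]
            right = [y for s, y in zip(scores, labels) if s > t]
            w = (len(left) * gini(left) + len(right) * gini(right)) / n
            if w < best[2]:
                best = (c, t, w)
    return best
```

A split on a principal component corresponds to an oblique boundary in the original bin space, so a class that depends on the difference between two neighbouring bins can be separated by one threshold even when neither bin separates it on its own.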

Keywords
Histogram random forest, histogram data, random forest PCA, histogram features
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-156827 (URN)
10.18178/ijmlc.2018.8.1.666 (DOI)
Available from: 2018-05-30. Created: 2018-05-30. Last updated: 2020-02-05. Bibliographically approved
5. Adapted Random Survival Forest for Histograms to Analyze NOx Sensor Failure in Heavy Trucks
2019 (English). In: Machine Learning, Optimization, and Data Science: Proceedings / [ed] Giuseppe Nicosia, Prof. Panos Pardalos, Renato Umeton, Prof. Giovanni Giuffrida, Vincenzo Sciacca, Springer, 2019, p. 83-94. Conference paper, Published paper (Refereed)
Abstract [en]

In heavy-duty truck operation, important components need to be examined regularly so that any unexpected breakdowns can be prevented. Data-driven failure prediction models can be built using operational data from a large fleet of trucks. Machine learning methods such as Random Survival Forest (RSF) can be used to generate a survival model that can predict the survival probabilities of a particular component over time. Operational data from the trucks usually have many feature variables represented as histograms. Although each bin of a histogram can be considered as an independent numeric variable, dependencies among the bins might exist that could be useful but are neglected when bins are treated individually. Therefore, in this article, we propose an extension to the standard RSF algorithm that can handle histogram variables and use it to train survival models for a NOx sensor. The trained model is compared in terms of overall error rate with the standard RSF model where bins of a histogram are treated individually as numeric features. The experimental results show that the adapted approach outperforms the standard approach, and the feature variables considered important are ranked.
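To make the notion of "survival probabilities over time" concrete, the following sketch computes a Kaplan-Meier survival curve from durations with right-censoring. It is a generic textbook estimator shown for illustration only; it is not the adapted RSF from the paper, and the function name and input layout are assumptions.

```python
from typing import List, Tuple


def kaplan_meier(
    durations: List[float], observed: List[bool]
) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate of the survival function.

    durations[i] is the time until failure or until the unit left the study;
    observed[i] is True when a failure was seen and False when the record is
    right-censored. Returns (time, survival_probability) pairs at each
    distinct failure time.
    """
    events = sorted(zip(durations, observed))
    at_risk = len(events)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(events):
        t = events[i][0]
        failures = 0
        leaving = 0
        # Group all records that share the same time.
        while i < len(events) and events[i][0] == t:
            if events[i][1]:
                failures += 1
            leaving += 1
            i += 1
        if failures > 0:
            survival *= 1.0 - failures / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Censored records (for example, trucks whose sensor had not failed when data collection ended) reduce the number of units at risk without lowering the survival estimate, which is why survival models can use them instead of discarding them.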

Place, publisher, year, edition, pages
Springer, 2019
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 11943
Keywords
Histogram survival forest, Histogram features, NOx sensor failure
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178506 (URN)
10.1007/978-3-030-37599-7_8 (DOI)
978-3-030-37598-0 (ISBN)
978-3-030-37599-7 (ISBN)
Conference
5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019
Available from: 2020-01-31. Created: 2020-01-31. Last updated: 2020-02-17. Bibliographically approved
6. An Interactive Visual Tool Enhance Understanding of Random Forest Prediction
2020 (English). In: Archives of Data Science, Series A, E-ISSN 2363-9881. Article in journal (Refereed), Accepted
Abstract [en]

Random forests are known to provide accurate predictions, but the predictions are not easy to understand. In order to provide support for understanding such predictions, an interactive visual tool has been developed. The tool can be used to manipulate selected features to explore what-if scenarios. It exploits the internal structure of decision trees in a trained forest model and presents this information as interactive plots and charts. In addition, the tool presents a simple decision rule as an explanation for the prediction. It also recommends reassignments of feature values of the example that would change the prediction to a preferred class. An evaluation of the tool was undertaken in a large truck manufacturing company, targeting fault prediction for a selected component in trucks. A set of domain experts were invited to use the tool and provide feedback in post-task interviews. The result of this investigation suggests that the tool indeed may aid in understanding the predictions of random forests, and also allows for gaining new insights.
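The kind of decision rule and what-if exploration mentioned in this abstract can be illustrated on a single tree. The sketch below is not the tool described in the paper: the nested-dictionary tree layout, the feature names and the class labels are all hypothetical, and a real forest would combine many such trees.

```python
from typing import Any, Dict, Tuple

Tree = Dict[str, Any]


def explain(tree: Tree, example: Dict[str, float]) -> Tuple[str, str]:
    """Follow the example down the tree and return (predicted_label, rule),
    where the rule is the conjunction of conditions met along the path."""
    conditions = []
    node = tree
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if example[feature] <= threshold:
            conditions.append(f"{feature} <= {threshold}")
            node = node["left"]
        else:
            conditions.append(f"{feature} > {threshold}")
            node = node["right"]
    return node["label"], " AND ".join(conditions)


# Hypothetical tree: feature names and thresholds are invented for the example.
toy_tree: Tree = {
    "feature": "temp_bin_high",
    "threshold": 0.3,
    "left": {"label": "healthy"},
    "right": {
        "feature": "age_months",
        "threshold": 24,
        "left": {"label": "healthy"},
        "right": {"label": "failure"},
    },
}

truck = {"temp_bin_high": 0.5, "age_months": 30}
label, rule = explain(toy_tree, truck)

# What-if scenario: lower one feature value and check the prediction again.
what_if = dict(truck, temp_bin_high=0.2)
new_label, new_rule = explain(toy_tree, what_if)
```

Comparing the rule before and after the change shows which condition on the path was responsible for the original prediction, which is the intuition behind recommending feature reassignments that lead to a preferred class.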

National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178513 (URN)
Available from: 2020-01-31 Created: 2020-01-31 Last updated: 2020-02-17

Open Access in DiVA

Random Forest for Histogram Data (2253 kB)
File information
File name: FULLTEXT01.pdf
File size: 2253 kB
Checksum (SHA-512): f25b7dd8eff5d0867fbc45afd4ebb9f8dc7fad2665e89e33eaf6cb2508838d2623265218cdd94052f9761b650397b4c4458d103dcac9fdeebcf1911a6e6c3fd5
Type: fulltext
Mimetype: application/pdf

Search in DiVA

By author/editor
Gurung, Ram Bahadur
By organisation
Department of Computer and Systems Sciences
Computer Systems
