Gurung, Ram B.
Publications (9 of 9)
Gurung, R. B., Lindgren, T. & Boström, H. (2020). An Interactive Visual Tool to Enhance Understanding of Random Forest Prediction. Archives of Data Science, Series A, 6(1)
2020 (English) In: Archives of Data Science, Series A, E-ISSN 2363-9881, Vol. 6, no. 1. Article in journal (Refereed). Published
Abstract [en]

Random forests are known to provide accurate predictions, but the predictions are not easy to understand. To support understanding of such predictions, an interactive visual tool has been developed. The tool can be used to manipulate selected features to explore what-if scenarios. It exploits the internal structure of the decision trees in a trained forest model and presents this information as interactive plots and charts. In addition, the tool presents a simple decision rule as an explanation for the prediction. It also recommends reassignments of feature values of the example that lead to a change in the prediction to a preferred class. An evaluation of the tool was undertaken at a large truck manufacturing company, targeting fault prediction for a selected component in trucks. A set of domain experts was invited to use the tool and provide feedback in post-task interviews. The results of this investigation suggest that the tool indeed may aid in understanding random forest predictions and also allows for gaining new insights.
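The tool itself is not distributed with this record, but the what-if mechanic described above can be sketched in a few lines of scikit-learn (an assumption on tooling; the synthetic data, feature index, and class definition below are made up): train a forest, reassign one feature of a single example, and watch the predicted class probability respond.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # class driven mainly by feature 0

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

example = X[0].copy()
probs = []
for value in (-2.0, 0.0, 2.0):                  # what-if: reassign feature 0
    example[0] = value
    probs.append(forest.predict_proba(example.reshape(1, -1))[0, 1])
    print(f"feature 0 = {value:+.1f} -> P(class 1) = {probs[-1]:.2f}")
```

The tool described in the paper additionally derives a decision rule from the trees' internal paths; the probe above only shows the response surface a user would explore interactively.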

National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178513 (URN), 10.5445/KSP/1000098011/08 (DOI)
Available from: 2020-01-31. Created: 2020-01-31. Last updated: 2022-03-23. Bibliographically approved
Gurung, R. B. (2020). Random Forest for Histogram Data: An application in data-driven prognostic models for heavy-duty trucks. (Doctoral dissertation). Stockholm: Department of Computer and Systems Sciences, Stockholm University
2020 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Data mining and machine learning algorithms are trained on large datasets to find useful hidden patterns. These patterns can help to gain new insights and make accurate predictions. Usually, the training data is structured in a tabular format, where the rows represent the training instances and the columns represent the features of these instances. The feature values are usually real numbers and/or categories. As very large volumes of digital data are becoming available in many domains, the data is often summarized into manageable sizes for efficient handling. Aggregating data into histograms is one means of reducing the size of the data. However, traditional machine learning algorithms have a limited ability to learn from such data, and this thesis explores extensions of the algorithms to allow for more effective learning from histogram data.

The thesis focuses on the decision tree and random forest algorithms, which are easy to understand and implement. Although a single decision tree may not result in the highest predictive performance, one of its benefits is that it often allows for easy interpretation. By combining many such diverse trees into a random forest, the performance can be greatly enhanced, albeit at the cost of reduced interpretability. The idea was to first find out how to effectively train a single decision tree from histogram data, and then carry these findings over to building robust random forests from such data. The overarching research question for the thesis is: How can the random forest algorithm be improved to learn more effectively from histogram data, and how can the resulting models be interpreted? An experimental approach was taken, under the positivist paradigm, in order to answer the question. The thesis investigates how the standard decision tree and random forest algorithms can be adapted to make them learn more accurate models from histogram data. Experimental evaluations of the proposed changes were carried out on both real-world data and synthetically generated data. The real-world data was taken from the automotive domain, concerning the operation and maintenance of heavy-duty trucks. Component failure prediction models were built from the operational data of a large fleet of trucks, where the information about their operation over many years has been summarized as histograms. The experimental results showed that the proposed approaches were more effective than the original algorithms, which treat the bins of histograms as separate features. The thesis also contributes towards the interpretability of random forests by evaluating an interactive visual tool that assists users in understanding the reasons behind the output of the models.
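As a point of reference, the baseline the thesis argues against, treating each histogram bin as a separate feature, amounts to flattening the histograms into one wide tabular matrix. A minimal sketch with made-up variable names and synthetic data:

```python
import numpy as np

n_trucks = 5
rng = np.random.default_rng(1)
engine_load_hist = rng.random((n_trucks, 10))    # 10 bins (hypothetical variable)
ambient_temp_hist = rng.random((n_trucks, 6))    # 6 bins (hypothetical variable)

# Normalize each histogram so its bins sum to 1 (relative time spent per bin).
engine_load_hist /= engine_load_hist.sum(axis=1, keepdims=True)
ambient_temp_hist /= ambient_temp_hist.sum(axis=1, keepdims=True)

# "Bins as separate features": one wide tabular matrix; dependencies between
# bins of the same histogram are invisible to the learner.
X = np.hstack([engine_load_hist, ambient_temp_hist])
print(X.shape)  # (5, 16)
```

The thesis's contribution is to keep each histogram together as a unit during tree induction instead of handing the learner this flat matrix.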

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2020. p. 74
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 20-003
Keywords
Histogram data, random forest, NOx sensor failure, random forest interpretation
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178776 (URN), 978-91-7911-024-6 (ISBN), 978-91-7911-025-3 (ISBN)
Public defence
2020-03-20, Ka-Sal C (Sven-Olof Öhrvik), Electrum 1, floor 2, Kistagången 16, KTH Kista, Stockholm, 10:00 (English)
Opponent
Supervisors
Note

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 6: Accepted.

Available from: 2020-02-26. Created: 2020-02-05. Last updated: 2022-02-26. Bibliographically approved
Gurung, R. B. (2019). Adapted Random Survival Forest for Histograms to Analyze NOx Sensor Failure in Heavy Trucks. In: Giuseppe Nicosia, Prof. Panos Pardalos, Renato Umeton, Prof. Giovanni Giuffrida, Vincenzo Sciacca (Ed.), Machine Learning, Optimization, and Data Science: Proceedings. Paper presented at 5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019 (pp. 83-94). Springer
2019 (English) In: Machine Learning, Optimization, and Data Science: Proceedings / [ed] Giuseppe Nicosia, Prof. Panos Pardalos, Renato Umeton, Prof. Giovanni Giuffrida, Vincenzo Sciacca, Springer, 2019, p. 83-94. Conference paper, Published paper (Refereed)
Abstract [en]

In heavy-duty truck operation, important components need to be examined regularly so that any unexpected breakdowns can be prevented. Data-driven failure prediction models can be built using operational data from a large fleet of trucks. Machine learning methods such as Random Survival Forest (RSF) can be used to generate a survival model that predicts the survival probabilities of a particular component over time. Operational data from the trucks usually have many feature variables represented as histograms. Although the bins of a histogram can be treated as independent numeric variables, dependencies among the bins might exist that are useful but neglected when bins are treated individually. Therefore, in this article, we propose an extension to the standard RSF algorithm that can handle histogram variables and use it to train survival models for a NOx sensor. The trained model is compared in terms of overall error rate with the standard RSF model, where the bins of a histogram are treated individually as numeric features. The experimental results show that the adapted approach outperforms the standard approach, and the feature variables considered most important are ranked.
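A random survival forest builds on per-node survival estimates that are averaged across trees. As a hedged illustration of that underlying building block (not the paper's adapted algorithm), here is a small Kaplan-Meier estimator over censored failure times, with made-up data:

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate. events: 1 = observed failure, 0 = censored.
    Returns (event times, survival probability after each event time)."""
    order = np.argsort(times)
    times = np.asarray(times, dtype=float)[order]
    events = np.asarray(events)[order]
    at_risk = len(times)
    surv, out_t, out_s = 1.0, [], []
    for t in np.unique(times):
        mask = times == t
        deaths = int(events[mask].sum())
        if deaths > 0:
            surv *= 1.0 - deaths / at_risk   # multiplicative survival update
            out_t.append(float(t))
            out_s.append(surv)
        at_risk -= int(mask.sum())           # failures and censorings both leave the risk set
    return out_t, out_s

t, s = kaplan_meier([2, 3, 3, 5], [1, 1, 0, 1])
print(t, s)  # survival drops to 0.75, then ~0.5, then 0.0
```

In an RSF, an estimate like this (usually the related Nelson-Aalen cumulative hazard) is computed in each terminal node and averaged over the ensemble.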

Place, publisher, year, edition, pages
Springer, 2019
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 11943
Keywords
Histogram survival forest, Histogram features, NOx sensor failure
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178506 (URN), 10.1007/978-3-030-37599-7_8 (DOI), 978-3-030-37598-0 (ISBN), 978-3-030-37599-7 (ISBN)
Conference
5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019
Available from: 2020-01-31. Created: 2020-01-31. Last updated: 2022-02-26. Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2018). Learning Random Forest from Histogram Data Using Split Specific Axis Rotation. International Journal of Machine Learning and Computing, 8(1), 74-79
2018 (English) In: International Journal of Machine Learning and Computing, ISSN 2010-3700, Vol. 8, no. 1, p. 74-79. Article in journal (Refereed). Published
Abstract [en]

Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of bins. The standard approach of representing bins as separate variables may lead the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, the proposed algorithm employs a sliding window of fixed size to constrain the sets of bins that are considered together. A small number of all possible sets of bins is randomly selected, and principal component analysis (PCA) is applied locally to all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy-duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
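The split evaluation described above can be sketched as follows, assuming a fixed window of adjacent bins, a local PCA computed via SVD, and Gini impurity as the split criterion; the data, window placement, and scoring details are illustrative rather than the paper's exact implementation:

```python
import numpy as np

def gini(labels):
    """Gini impurity of an integer label array."""
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels) / len(labels)
    return 1.0 - float(np.sum(p ** 2))

def best_pca_split(hist, y, window_start, window_size):
    """Score thresholds on the first principal component of a bin window
    (a sketch of split-specific axis rotation)."""
    W = hist[:, window_start:window_start + window_size]
    centered = W - W.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # local PCA via SVD
    z = centered @ vt[0]                     # 1-D projection of every example
    best_score, best_thr = np.inf, None
    for thr in np.unique(z)[:-1]:            # candidate thresholds
        left, right = y[z <= thr], y[z > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_score, best_thr = score, float(thr)
    return best_score, best_thr

rng = np.random.default_rng(0)
hist = rng.random((40, 8))                   # 40 examples, 8-bin histograms (made up)
y = (hist[:, 2] + hist[:, 3] > 1.0).astype(int)  # class depends on bins 2 and 3 jointly
score, thr = best_pca_split(hist, y, window_start=2, window_size=2)
print(round(score, 3), "<=", round(gini(y), 3))
```

Projecting the window onto a principal component lets one oblique threshold capture a dependency, such as the sum of two bins, that no single-bin threshold can express.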

Keywords
Histogram random forest, histogram data, random forest PCA, histogram features
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-156827 (URN), 10.18178/ijmlc.2018.8.1.666 (DOI)
Available from: 2018-05-30. Created: 2018-05-30. Last updated: 2022-02-26. Bibliographically approved
Boström, H., Asker, L., Gurung, R. B., Karlsson, I., Lindgren, T. & Papapetrou, P. (2017). Conformal prediction using random survival forests. In: Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani (Ed.), 16th IEEE International Conference on Machine Learning and Applications: Proceedings. Paper presented at 16th IEEE International Conference On Machine Learning And Applications, Cancun, Mexico, December 18-21, 2017 (pp. 812-817). IEEE
2017 (English) In: 16th IEEE International Conference on Machine Learning and Applications: Proceedings / [ed] Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani, IEEE, 2017, p. 812-817. Conference paper, Published paper (Refereed)
Abstract [en]

Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. As with most standard predictive models, no guarantee for the prediction error is provided for this model, which instead is typically evaluated empirically. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user-specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications: predicting component failure in trucks using operational data, and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.
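The calibration mechanics of (split) conformal prediction can be shown independently of the survival model. The sketch below uses 1 - p_hat(true class) as the nonconformity score, with made-up calibration scores; calibrating such scores per class is what gives the separate, class-wise error control mentioned above:

```python
import numpy as np

def conformal_threshold(cal_scores, confidence):
    """Finite-sample corrected quantile of the calibration nonconformity scores."""
    n = len(cal_scores)
    k = int(np.ceil((n + 1) * confidence))      # rank of the threshold score
    return np.sort(cal_scores)[min(k, n) - 1]

def prediction_set(class_probs, threshold):
    """All class labels whose nonconformity score 1 - p stays within the threshold."""
    return [c for c, p in enumerate(class_probs) if 1.0 - p <= threshold]

# Calibration scores: 1 - p_hat(true class) on a held-out set (made-up values).
cal = np.array([0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.90])
thr = conformal_threshold(cal, confidence=0.80)
print(thr)  # 0.7, the 9th smallest score, since ceil(11 * 0.8) = 9
print(prediction_set([0.6, 0.35, 0.05], thr))  # [0, 1]
```

Prediction sets trade precision for the coverage guarantee: an unconfident example yields a larger set rather than a possibly wrong point prediction, which is the effect the abstract notes under heavy class imbalance.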

Place, publisher, year, edition, pages
IEEE, 2017
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149417 (URN), 10.1109/ICMLA.2017.00-57 (DOI), 000425853000130 (), 978-1-5386-1418-1 (ISBN)
Conference
16th IEEE International Conference On Machine Learning And Applications, Cancun, Mexico, December 18-21, 2017
Available from: 2017-11-30. Created: 2017-11-30. Last updated: 2022-02-28. Bibliographically approved
Gurung, R. B. (2017). Learning Decision Trees and Random Forests from Histogram Data: An application to component failure prediction for heavy duty trucks. (Licentiate dissertation). Stockholm: Stockholm University
2017 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Large volumes of data have become commonplace in many domains these days. Machine learning algorithms can be trained to look for useful hidden patterns in such data. Sometimes, these big data need to be summarized into a manageable size, for example by using histograms, for various reasons. Traditionally, machine learning algorithms can be trained on data expressed as real numbers and/or categories, but not on a complex structure such as a histogram. Since machine learning algorithms that can learn from data with histograms have not been explored to any major extent, this thesis intends to further explore this domain.

This thesis has been limited to classification algorithms, in particular tree-based classifiers such as decision trees and random forests. Decision trees are one of the simplest and most intuitive algorithms to train. A single decision tree might not be the best algorithm in terms of predictive performance, but it can be largely enhanced by considering an ensemble of many diverse trees as a random forest. This is the reason why both algorithms were considered. The objective of this thesis is thus to investigate how one can adapt these algorithms to make them learn better from histogram data. Our proposed approach considers the use of multiple bins of a histogram simultaneously to split a node during the tree induction process. Treating bins simultaneously is expected to capture dependencies among them, which could be useful. Experimental evaluation of the proposed approaches was carried out by comparing them with the standard approach of growing a tree where a single bin is used to split a node. Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC), along with the average time taken to train a model, were used for comparison. For experimental purposes, real-world data from a large fleet of heavy-duty trucks were used to build a component-failure prediction model. These data contain information about the operation of the trucks over the years, where most operational features are summarized as histograms. Experiments were further performed on synthetically generated datasets. From the results of the experiments, it was observed that the proposed approach outperforms the standard approach in predictive performance and compactness of the model, but lags behind in terms of training time. This thesis was motivated by a real-life problem encountered in the operation of heavy-duty trucks in the automotive industry while building a data-driven failure-prediction model. Therefore, all the details about collecting and cleansing the data, and the challenges encountered while making the data ready for training the algorithm, are presented in detail.
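One of the comparison metrics used in the thesis, AUC, can be computed directly as the normalized Mann-Whitney U statistic: the probability that a randomly chosen positive example is ranked above a randomly chosen negative one, with ties counting half. A self-contained sketch with made-up scores:

```python
import numpy as np

def auc(scores, labels):
    """AUC as the probability that a random positive example outranks a random
    negative one (ties count half): the normalized Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The pairwise form is O(n^2) but makes the ranking interpretation explicit; production code would use a sort-based O(n log n) variant or a library routine.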

Place, publisher, year, edition, pages
Stockholm: Stockholm University, 2017. p. 66
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-008
Keywords
histogram decision trees, histogram random forest, prognostics
National Category
Computer Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149060 (URN)
Presentation
2017-11-29, L50, Borgarfjordsgatan 12 (Nod Building), Campus Kista, Stockholm, 10:00 (English)
Supervisors
Available from: 2020-02-17. Created: 2017-11-15. Last updated: 2022-02-28. Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2017). Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests. International Journal of Prognostics and Health Management, 8(1), Article ID 008.
2017 (English) In: International Journal of Prognostics and Health Management, E-ISSN 2153-2648, Vol. 8, no. 1, article id 008. Article in journal (Refereed). Published
Abstract [en]

Being able to accurately predict the impending failures of truck components is often associated with significant cost savings, customer satisfaction, and flexibility in maintenance service plans. However, because of the diversity in the way trucks are typically configured and used under different conditions, creating accurate prediction models is not an easy task. This paper describes an effort in creating such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified random forest learning algorithm is described, along with various challenges encountered during this process. The operational data consist of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments conducted using the updated random forest algorithm clearly show that the modified version is indeed beneficial when compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising and may be adopted for the benefit of operators of heavy trucks.
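The histogram features referred to above are built by aggregating raw operational readouts into fixed, fleet-wide bins. A minimal sketch with hypothetical bin edges and made-up temperature readings; the shared edges reflect the identical-structure assumption used throughout these papers:

```python
import numpy as np

# Raw ambient-temperature readings logged over a truck's operation (made up).
readings = np.array([-12.0, -3.5, 0.0, 4.2, 8.9, 15.0, 21.3, 22.8, 30.1, 35.6])

# Fixed bin edges shared by every truck, so histograms are comparable across
# the fleet (the identical-structure assumption).
edges = np.array([-40.0, -20.0, 0.0, 20.0, 40.0])

counts, _ = np.histogram(readings, bins=edges)
feature = counts / counts.sum()   # relative time spent in each temperature band
print(counts.tolist())  # [0, 2, 4, 4]
print(feature.tolist())
```

Normalizing by the total count makes trucks with different mileages comparable; the paper additionally derives change-over-time variables from snapshots of such histograms.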

Keywords
Histogram Features, NOx sensor prognostics, Histogram-based random forest
National Category
Computer Sciences
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149432 (URN), 10.36001/ijphm.2017.v8i1.2535 (DOI)
Available from: 2017-11-30. Created: 2017-11-30. Last updated: 2023-07-24. Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2016). Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins. In: Zdravko Markov, Ingrid Russell (Ed.), Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference: . Paper presented at Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016 (pp. 430-435). AAAI Press
2016 (English) In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press, 2016, p. 430-435. Conference paper, Published paper (Refereed)
Abstract [en]

The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has previously been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we instead propose a sliding-window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Experiments on applying the new algorithm to both synthetic and real-world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
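The combinatorial point of the sliding window can be made concrete: a window of width w over k bins yields only k - w + 1 contiguous candidate subsets, versus C(k, w) arbitrary ones, while the original bins are still evaluated independently. A sketch with hypothetical k and w:

```python
from itertools import combinations

k, w = 20, 3   # bins per histogram and window size (hypothetical values)

# Sliding window: only contiguous bin subsets become multi-bin split candidates.
windows = [tuple(range(s, s + w)) for s in range(k - w + 1)]

# Exhaustive alternative: every size-w subset of bins.
all_subsets = list(combinations(range(k), w))

# Original bins are still evaluated independently, so multi-bin splits can
# only add information on top of the standard single-bin candidates.
single_bins = [(b,) for b in range(k)]

print(len(windows), len(all_subsets), len(single_bins))  # 18 1140 20
```

Contiguity is a reasonable restriction for histograms, since adjacent bins cover adjacent value ranges and are the most likely to be dependent.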

Place, publisher, year, edition, pages
AAAI Press, 2016
Keywords
histogram variables, histogram tree, histogram classifier
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135432 (URN), 978-1-57735-756-8 (ISBN)
Conference
Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016
Available from: 2016-11-08. Created: 2016-11-08. Last updated: 2022-02-28. Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2015). Learning Decision Trees from Histogram Data. In: Robert Stahlbock, Gary M. Weiss (Ed.), Proceedings of the 2015 International Conference on Data Mining: DMIN 2015. Paper presented at 11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015 (pp. 139-145). CSREA Press
2015 (English) In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, CSREA Press, 2015, p. 139-145. Conference paper, Published paper (Refereed)
Abstract [en]

When applying learning algorithms to histogram data, the bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information, as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from employing the algorithm on both synthetic and real-world datasets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by the number of nodes in the decision tree. These gains are, however, associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.
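The approximate speed-up mentioned at the end of the abstract is not specified in detail here; one common way to trade exhaustive split evaluation for speed, shown below as a hedged sketch that may differ from the paper's exact scheme, is to score only a fixed set of quantile-based thresholds instead of every distinct value:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.random(10_000)   # values of one split variable across a node's examples

# Exhaustive evaluation: every distinct value is a candidate threshold.
exhaustive = np.unique(z)[:-1]

# Approximate variant: only a fixed number of quantile-based thresholds,
# independent of the node size.
q = 32
approximate = np.quantile(z, np.linspace(0.0, 1.0, q + 2)[1:-1])

print(len(exhaustive), len(approximate))
```

The cost per candidate split then no longer grows with the number of distinct values in the node, at the price of possibly missing the exact optimal cut point.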

Place, publisher, year, edition, pages
CSREA Press, 2015
Keywords
Histogram Learning, Histogram Tree
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-125140 (URN), 978-1-60132-403-0 (ISBN)
Conference
11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015
Available from: 2016-01-08. Created: 2016-01-08. Last updated: 2022-02-23. Bibliographically approved