Gurung, Ram B.
Publications (9 of 9)
Gurung, R. B., Lindgren, T. & Boström, H. (2020). An Interactive Visual Tool Enhance Understanding of Random Forest Prediction. Archives of Data Science, Series A, 6(1)
2020 (English). In: Archives of Data Science, Series A, E-ISSN 2363-9881, Vol. 6, no. 1. Journal article (Peer-reviewed). Published
Abstract [en]

Random forests are known to provide accurate predictions, but the predictions are not easy to understand. To support the understanding of such predictions, an interactive visual tool has been developed. The tool can be used to manipulate selected features to explore what-if scenarios. It exploits the internal structure of the decision trees in a trained forest model and presents this information as interactive plots and charts. In addition, the tool presents a simple decision rule as an explanation for the prediction. It also recommends reassignments of the example's feature values that would change the prediction to a preferred class. An evaluation of the tool was undertaken at a large truck manufacturing company, targeting fault prediction for a selected truck component. A set of domain experts were invited to use the tool and provide feedback in post-task interviews. The results of this investigation suggest that the tool may indeed aid in understanding the predictions of random forests and also allows new insights to be gained.
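The what-if exploration described in the abstract can be illustrated with a minimal sketch. It is not the tool from the paper: the forest is reduced to a list of hypothetical callables that each return a class label, and the names `forest_predict`, `what_if`, `temp` and `load` are assumptions made only for this example.

```python
from collections import Counter


def forest_predict(trees, example):
    """Majority vote over the trees; each tree is a callable returning a class label."""
    votes = Counter(tree(example) for tree in trees)
    return votes.most_common(1)[0][0]


def what_if(trees, example, feature, candidate_values, preferred):
    """Return the candidate values of `feature` for which the forest predicts `preferred`.

    The original example is left untouched; each candidate is tried on a copy.
    """
    flips = []
    for value in candidate_values:
        modified = {**example, feature: value}
        if forest_predict(trees, modified) == preferred:
            flips.append(value)
    return flips


# Hypothetical three-tree forest over two made-up features.
trees = [
    lambda x: "fail" if x["temp"] > 80 else "ok",
    lambda x: "fail" if x["temp"] > 90 else "ok",
    lambda x: "fail" if x["load"] > 5 else "ok",
]
example = {"temp": 95, "load": 3}
suggested = what_if(trees, example, "temp", [70, 85, 95], preferred="ok")
```

Here `suggested` holds the values of `temp` that would move the majority vote to the preferred class, which mirrors the idea of recommending feature reassignments.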

Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178513 (URN) 10.5445/KSP/1000098011/08 (DOI)
Available from: 2020-01-31 Created: 2020-01-31 Last updated: 2022-03-23 Bibliographically approved
Gurung, R. B. (2020). Random Forest for Histogram Data: An application in data-driven prognostic models for heavy-duty trucks. (Doctoral dissertation). Stockholm: Department of Computer and Systems Sciences, Stockholm University
2020 (English). Doctoral thesis, comprising papers (Other academic)
Abstract [en]

Data mining and machine learning algorithms are trained on large datasets to find useful hidden patterns. These patterns can help to gain new insights and make accurate predictions. Usually, the training data is structured in a tabular format, where the rows represent the training instances and the columns represent the features of these instances. The feature values are usually real numbers and/or categories. As very large volumes of digital data are becoming available in many domains, the data is often summarized into manageable sizes for efficient handling. To aggregate data into histograms is one means to reduce the size of the data. However, traditional machine learning algorithms have a limited ability to learn from such data, and this thesis explores extensions of the algorithms to allow for more effective learning from histogram data.

The thesis focuses on the decision tree and random forest algorithms, which are easy to understand and implement. Although a single decision tree may not result in the highest predictive performance, one of its benefits is that it often allows for easy interpretation. By combining many such diverse trees into a random forest, the performance can be greatly enhanced, but at the cost of reduced interpretability. By first finding out how to effectively train a single decision tree from histogram data, these findings could be carried over to building robust random forests from such data. The overarching research question for the thesis is: How can the random forest algorithm be improved to learn more effectively from histogram data, and how can the resulting models be interpreted? An experimental approach was taken, under the positivist paradigm, in order to answer the question. The thesis investigates how the standard decision tree and random forest algorithms can be adapted to make them learn more accurate models from histogram data. Experimental evaluations of the proposed changes were carried out on both real-world data and synthetically generated experimental data. The real-world data was taken from the automotive domain, concerning the operation and maintenance of heavy-duty trucks. Component failure prediction models were built from the operational data of a large fleet of trucks, where the information about their operation over many years has been summarized as histograms. The experimental results showed that the proposed approaches were more effective than the original algorithms, which treat bins of histograms as separate features. The thesis also contributes towards the interpretability of random forests by evaluating an interactive visual tool for assisting users to understand the reasons behind the output of the models.

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2020. p. 74
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 20-003
Keywords
Histogram data, random forest, NOx sensor failure, random forest interpretation
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178776 (URN) 978-91-7911-024-6 (ISBN) 978-91-7911-025-3 (ISBN)
Public defence
2020-03-20, Ka-Sal C (Sven-Olof Öhrvik), Electrum 1, floor 2, Kistagången 16, KTH Kista, Stockholm, 10:00 (English)
Note

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 6: Accepted.

Available from: 2020-02-26 Created: 2020-02-05 Last updated: 2022-02-26 Bibliographically approved
Gurung, R. B. (2019). Adapted Random Survival Forest for Histograms to Analyze NOx Sensor Failure in Heavy Trucks. In: Giuseppe Nicosia, Prof. Panos Pardalos, Renato Umeton, Prof. Giovanni Giuffrida, Vincenzo Sciacca (Ed.), Machine Learning, Optimization, and Data Science: Proceedings. Paper presented at 5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019 (pp. 83-94). Springer
2019 (English). In: Machine Learning, Optimization, and Data Science: Proceedings / [ed] Giuseppe Nicosia, Prof. Panos Pardalos, Renato Umeton, Prof. Giovanni Giuffrida, Vincenzo Sciacca, Springer, 2019, pp. 83-94. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

In the operation of heavy-duty trucks, important components need to be examined regularly so that unexpected breakdowns can be prevented. Data-driven failure prediction models can be built using operational data from a large fleet of trucks. Machine learning methods such as Random Survival Forest (RSF) can be used to generate a survival model that predicts the survival probabilities of a particular component over time. Operational data from the trucks usually have many feature variables represented as histograms. Although each bin of a histogram can be considered an independent numeric variable, dependencies among the bins might exist that could be useful but are neglected when bins are treated individually. Therefore, in this article, we propose an extension to the standard RSF algorithm that can handle histogram variables and use it to train survival models for a NOx sensor. The trained model is compared in terms of overall error rate with the standard RSF model, where bins of a histogram are treated individually as numeric features. The experimental results show that the adapted approach outperforms the standard approach, and the feature variables considered important are ranked.

Place, publisher, year, edition, pages
Springer, 2019
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 11943
Keywords
Histogram survival forest, Histogram features, NOx sensor failure
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-178506 (URN) 10.1007/978-3-030-37599-7_8 (DOI) 978-3-030-37598-0 (ISBN) 978-3-030-37599-7 (ISBN)
Conference
5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019
Available from: 2020-01-31 Created: 2020-01-31 Last updated: 2022-02-26 Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2018). Learning Random Forest from Histogram Data Using Split Specific Axis Rotation. International Journal of Machine Learning and Computing, 8(1), 74-79
2018 (English). In: International Journal of Machine Learning and Computing, ISSN 2010-3700, Vol. 8, no. 1, pp. 74-79. Journal article (Peer-reviewed). Published
Abstract [en]

Machine learning algorithms for data containing histogram variables have not been explored to any major extent. In this paper, an adapted version of the random forest algorithm is proposed to handle variables of this type, assuming identical structure of the histograms across observations, i.e., the histograms for a variable all use the same number and width of the bins. The standard approach of representing bins as separate variables may lead the learning algorithm to overlook the underlying dependencies. In contrast, the proposed algorithm handles each histogram as a unit. When performing split evaluation of a histogram variable during tree growth, a sliding window of fixed size is employed by the proposed algorithm to constrain the sets of bins that are considered together. A small number of all possible sets of bins are randomly selected, and principal component analysis (PCA) is applied locally on all examples in a node. Split evaluation is then performed on each principal component. Results from applying the algorithm to both synthetic and real-world data are presented, showing that the proposed algorithm outperforms the standard approach of using random forests together with bins represented as separate variables, with respect to both AUC and accuracy. In addition to introducing the new algorithm, we elaborate on how real-world data for predicting NOx sensor failure in heavy-duty trucks was prepared, demonstrating that predictive performance can be further improved by adding variables that represent changes of the histograms over time.
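The split evaluation outlined in the abstract (sliding window over bins, random selection of windows, local PCA, one split search per principal component) can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: windows are taken as contiguous bins, Gini impurity is used as the split criterion, and the names `pca_split`, `best_threshold` and `gini` are hypothetical.

```python
import numpy as np


def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / labels.size
    return 1.0 - float(np.sum(p ** 2))


def best_threshold(values, labels):
    """Return (weighted impurity, threshold) of the best split `values <= threshold`."""
    best = (np.inf, None)
    uniq = np.unique(values)
    for lo, hi in zip(uniq[:-1], uniq[1:]):
        t = (lo + hi) / 2.0
        left, right = labels[values <= t], labels[values > t]
        score = (left.size * gini(left) + right.size * gini(right)) / labels.size
        if score < best[0]:
            best = (score, t)
    return best


def pca_split(hist, labels, window_size, n_windows, rng):
    """Search for a split on one histogram variable at a tree node.

    hist: array of shape (n_examples, n_bins) holding the node's examples.
    """
    n_bins = hist.shape[1]
    starts = np.arange(n_bins - window_size + 1)
    chosen = rng.choice(starts, size=min(n_windows, starts.size), replace=False)
    best = None
    for start in chosen:
        block = hist[:, start:start + window_size]
        centered = block - block.mean(axis=0)
        # Rows of vt are the principal axes of the node's examples in this window.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        scores = centered @ vt.T
        for k in range(scores.shape[1]):
            impurity, threshold = best_threshold(scores[:, k], labels)
            if threshold is not None and (best is None or impurity < best["impurity"]):
                best = {"impurity": impurity, "start": int(start),
                        "component": k, "threshold": float(threshold)}
    return best


# Toy node: the class depends on how mass is shared between neighbouring bins.
hist = np.array([[0.6, 0.2, 0.2], [0.7, 0.1, 0.2], [0.2, 0.6, 0.2], [0.1, 0.7, 0.2]])
labels = np.array([0, 0, 1, 1])
result = pca_split(hist, labels, window_size=2, n_windows=2, rng=np.random.default_rng(0))
```

Because PCA is fitted only on the examples that reach the node, the rotation is specific to each split, which is one way to read the "split specific axis rotation" in the title.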

Keywords
Histogram random forest, histogram data, random forest PCA, histogram features
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-156827 (URN) 10.18178/ijmlc.2018.8.1.666 (DOI)
Available from: 2018-05-30 Created: 2018-05-30 Last updated: 2022-02-26 Bibliographically approved
Boström, H., Asker, L., Gurung, R. B., Karlsson, I., Lindgren, T. & Papapetrou, P. (2017). Conformal prediction using random survival forests. In: Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani (Ed.), 16th IEEE International Conference on Machine Learning and Applications: Proceedings. Paper presented at 16th IEEE International Conference On Machine Learning And Applications, Cancun, Mexico, December 18-21, 2017 (pp. 812-817). IEEE
2017 (English). In: 16th IEEE International Conference on Machine Learning and Applications: Proceedings / [ed] Xuewen Chen, Bo Luo, Feng Luo, Vasile Palade, M. Arif Wani, IEEE, 2017, pp. 812-817. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

Random survival forests constitute a robust approach to survival modeling, i.e., predicting the probability that an event will occur before or on a given point in time. Similar to most standard predictive models, no guarantee for the prediction error is provided for this model, which instead typically is empirically evaluated. Conformal prediction is a rather recent framework, which allows the error of a model to be determined by a user specified confidence level, something which is achieved by considering set rather than point predictions. The framework, which has been applied to some of the most popular classification and regression techniques, is here for the first time applied to survival modeling, through random survival forests. An empirical investigation is presented where the technique is evaluated on datasets from two real-world applications; predicting component failure in trucks using operational data and predicting survival and treatment of heart failure patients from administrative healthcare data. The experimental results show that the error levels indeed are very close to the provided confidence levels, as guaranteed by the conformal prediction framework, and that the error for predicting each outcome, i.e., event or no-event, can be controlled separately. The latter may, however, lead to less informative predictions, i.e., larger prediction sets, in case the class distribution is heavily imbalanced.
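The idea of set predictions at a user-specified confidence level, with the error controlled separately per outcome, can be sketched with a generic class-conditional conformal classifier. This is a simplified illustration and not the procedure from the paper: the nonconformity score (one minus the predicted probability of a label), the use of a separate calibration set, and the names `p_value` and `prediction_set` are all assumptions.

```python
def p_value(calibration_scores, test_score):
    """Share of nonconformity scores at least as large as the test score.

    The test example itself is counted, hence the +1 in numerator and denominator.
    """
    n_ge = sum(1 for s in calibration_scores if s >= test_score)
    return (n_ge + 1) / (len(calibration_scores) + 1)


def prediction_set(prob_event, calibration, confidence):
    """Return the set of labels that cannot be rejected at the given confidence.

    prob_event: model's predicted probability of the event for the test example.
    calibration: dict mapping each label to the nonconformity scores of
        calibration examples whose true label is that label (class-conditional).
    """
    significance = 1.0 - confidence
    probabilities = {"event": prob_event, "no-event": 1.0 - prob_event}
    labels = set()
    for label, prob in probabilities.items():
        score = 1.0 - prob  # a low probability for a label makes it nonconforming
        if p_value(calibration[label], score) > significance:
            labels.add(label)
    return labels


# Made-up calibration scores, kept separate per true outcome.
calibration = {
    "event": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    "no-event": [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45],
}
confident = prediction_set(0.9, calibration, confidence=0.8)
cautious = prediction_set(0.5, calibration, confidence=0.95)
```

Raising the confidence level lowers the rejection threshold, so more labels survive and the prediction set grows, which matches the abstract's remark that tighter error control may yield less informative predictions.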

Place, publisher, year, edition, pages
IEEE, 2017
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149417 (URN) 10.1109/ICMLA.2017.00-57 (DOI) 000425853000130 () 978-1-5386-1418-1 (ISBN)
Conference
16th IEEE International Conference On Machine Learning And Applications, Cancun, Mexico, December 18-21, 2017
Available from: 2017-11-30 Created: 2017-11-30 Last updated: 2022-02-28 Bibliographically approved
Gurung, R. B. (2017). Learning Decision Trees and Random Forests from Histogram Data: An application to component failure prediction for heavy duty trucks. (Licentiate dissertation). Stockholm: Stockholm University
2017 (English). Licentiate thesis, comprising papers (Other academic)
Abstract [en]

Large volumes of data have become commonplace in many domains. Machine learning algorithms can be trained to look for useful hidden patterns in such data. For various reasons, such big data might need to be summarized into a manageable size, for example by using histograms. Traditionally, machine learning algorithms can be trained on data expressed as real numbers and/or categories, but not on a complex structure such as a histogram. Since machine learning algorithms that can learn from data with histograms have not been explored to a major extent, this thesis intends to further explore this domain.

This thesis has been limited to classification algorithms, in particular tree-based classifiers such as decision trees and random forests. Decision trees are among the simplest and most intuitive algorithms to train. A single decision tree might not be the best algorithm in terms of predictive performance, but this can be greatly enhanced by considering an ensemble of many diverse trees as a random forest. This is the reason why both algorithms were considered. The objective of this thesis is thus to investigate how one can adapt these algorithms to make them learn better from histogram data. Our proposed approach considers the use of multiple bins of a histogram simultaneously to split a node during the tree induction process. Treating bins simultaneously is expected to capture dependencies among them, which could be useful. Experimental evaluation of the proposed approaches was carried out by comparing them with the standard approach of growing a tree where a single bin is used to split a node. Accuracy and the area under the receiver operating characteristic (ROC) curve (AUC), along with the average time taken to train a model, were used for comparison. For experimental purposes, real-world data from a large fleet of heavy-duty trucks were used to build a component-failure prediction model. These data contain information about the operation of trucks over the years, where most operational features are summarized as histograms. Further experiments were performed on a synthetically generated dataset. From the results of the experiments, it was observed that the proposed approach outperforms the standard approach in performance and compactness of the model but lags behind in terms of training time. This thesis was motivated by a real-life problem encountered in the operation of heavy-duty trucks in the automotive industry while building a data-driven failure-prediction model. Therefore, the collection and cleansing of the data, and the challenges encountered while making the data ready for training the algorithm, are presented in detail.

Place, publisher, year, edition, pages
Stockholm: Stockholm University, 2017. p. 66
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 17-008
Keywords
histogram decision trees, histogram random forest, prognostics
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149060 (URN)
Presentation
2017-11-29, L50, Borgarfjordsgatan 12 (Nod Building), Campus Kista, Stockholm, 10:00 (English)
Available from: 2020-02-17 Created: 2017-11-15 Last updated: 2022-02-28 Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2017). Predicting NOx sensor failure in heavy duty trucks using histogram-based random forests. International Journal of Prognostics and Health Management, 8(1), Article ID 008.
2017 (English). In: International Journal of Prognostics and Health Management, E-ISSN 2153-2648, Vol. 8, no. 1, article id 008. Journal article (Peer-reviewed). Published
Abstract [en]

Being able to accurately predict the impending failures of truck components is often associated with significant cost savings, customer satisfaction and flexibility in maintenance service plans. However, because of the diversity in the way trucks are typically configured and their usage under different conditions, the creation of accurate prediction models is not an easy task. This paper describes an effort to create such a prediction model for the NOx sensor, i.e., a component measuring the emitted level of nitrogen oxide in the exhaust of the engine. This component was chosen because it is vital for the truck to function properly, while at the same time being very fragile and costly to repair. As input to the model, technical specifications of trucks and their operational data are used. The process of collecting the data and making it ready for training the model via a slightly modified Random Forest learning algorithm is described, along with various challenges encountered during this process. The operational data consists of features represented as histograms, posing an additional challenge for the data analysis task. In the study, a modified version of the random forest algorithm is employed, which exploits the fact that the individual bins in the histograms are related, in contrast to the standard approach that would consider the bins as independent features. Experiments are conducted using the updated random forest algorithm, and they clearly show that the modified version is indeed beneficial when compared to the standard random forest algorithm. The performance of the resulting prediction model for the NOx sensor is promising, and the model may be adopted for the benefit of operators of heavy trucks.

Keywords
Histogram Features, NOx sensor prognostics, Histogram-based random forest
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-149432 (URN) 10.36001/ijphm.2017.v8i1.2535 (DOI)
Available from: 2017-11-30 Created: 2017-11-30 Last updated: 2023-07-24 Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2016). Learning Decision Trees from Histogram Data Using Multiple Subsets of Bins. In: Zdravko Markov, Ingrid Russell (Ed.), Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference. Paper presented at Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016 (pp. 430-435). AAAI Press
2016 (English). In: Proceedings of the Twenty-Ninth International Florida Artificial Intelligence Research Society Conference / [ed] Zdravko Markov, Ingrid Russell, AAAI Press, 2016, pp. 430-435. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

The standard approach of learning decision trees from histogram data is to treat the bins as independent variables. However, as the underlying dependencies among the bins might not be completely exploited by this approach, an algorithm has been proposed for learning decision trees from histogram data by considering all bins simultaneously while partitioning examples at each node of the tree. Although the algorithm has been demonstrated to improve predictive performance, its computational complexity has turned out to be a major bottleneck, in particular for histograms with a large number of bins. In this paper, we propose instead a sliding window approach to select subsets of the bins to be considered simultaneously while partitioning examples. This significantly reduces the number of possible splits to consider, allowing for substantially larger histograms to be handled. We also propose to evaluate the original bins independently, in addition to evaluating the subsets of bins when performing splits. This ensures that the information obtained by treating bins simultaneously is an additional gain compared to what is considered by the standard approach. Results of experiments on applying the new algorithm to both synthetic and real world datasets demonstrate positive results in terms of predictive performance without excessive computational cost.
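To make the reduction in candidate splits concrete, the following sketch enumerates the bin subsets a sliding window would produce and compares their number with the number of all non-empty subsets of bins. It assumes contiguous windows moved one bin at a time and treats windows of size one as the original bins evaluated independently; the function names and the `max_window` parameter are hypothetical and not taken from the paper.

```python
def sliding_windows(n_bins, window_size):
    """Index tuples of contiguous bins covered by a window moved one bin at a time."""
    if not 1 <= window_size <= n_bins:
        raise ValueError("window_size must be between 1 and n_bins")
    return [tuple(range(start, start + window_size))
            for start in range(n_bins - window_size + 1)]


def candidate_bin_sets(n_bins, max_window):
    """Single bins (window size 1) plus every window of size 2..max_window."""
    candidates = []
    for size in range(1, max_window + 1):
        candidates.extend(sliding_windows(n_bins, size))
    return candidates


n_bins = 20
windowed = len(candidate_bin_sets(n_bins, max_window=4))
all_subsets = 2 ** n_bins - 1  # every non-empty subset of bins
```

The windowed count grows linearly with the number of bins for a fixed maximum window size, whereas the number of arbitrary subsets grows exponentially, which is why larger histograms become tractable.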

Place, publisher, year, edition, pages
AAAI Press, 2016
Keywords
histogram variables, histogram tree, histogram classifier
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-135432 (URN) 978-1-57735-756-8 (ISBN)
Conference
Twenty-Ninth International Florida Artificial Intelligence Research Society Conference, FLAIRS, Key Largo, Florida, May 16-18, 2016
Available from: 2016-11-08 Created: 2016-11-08 Last updated: 2022-02-28 Bibliographically approved
Gurung, R. B., Lindgren, T. & Boström, H. (2015). Learning Decision Trees from Histogram Data. In: Robert Stahlbock, Gary M. Weiss (Ed.), Proceedings of the 2015 International Conference on Data Mining: DMIN 2015. Paper presented at 11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015 (pp. 139-145). CSREA Press
2015 (English). In: Proceedings of the 2015 International Conference on Data Mining: DMIN 2015 / [ed] Robert Stahlbock, Gary M. Weiss, CSREA Press, 2015, pp. 139-145. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

When applying learning algorithms to histogram data, bins of such variables are normally treated as separate independent variables. However, this may lead to a loss of information as the underlying dependencies may not be fully exploited. In this paper, we adapt the standard decision tree learning algorithm to handle histogram data by proposing a novel method for partitioning examples using binned variables. Results from employing the algorithm to both synthetic and real-world data sets demonstrate that exploiting dependencies in histogram data may have positive effects on both predictive performance and model size, as measured by number of nodes in the decision tree. These gains are however associated with an increased computational cost and more complex split conditions. To address the former issue, an approximate method is proposed, which speeds up the learning process substantially while retaining the predictive performance.

Place, publisher, year, edition, pages
CSREA Press, 2015
Keywords
Histogram Learning, Histogram Tree
Research program
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-125140 (URN) 978-1-60132-403-0 (ISBN)
Conference
11th International Conference on Data Mining (DMIN'15), Las Vegas, Nevada, USA, July 27-30, 2015
Available from: 2016-01-08 Created: 2016-01-08 Last updated: 2022-02-23 Bibliographically approved