Trinary Decision Trees for handling missing data
Stockholm University, Faculty of Science, Department of Mathematics.
(English) Manuscript (preprint) (Other academic)
Abstract [en]

This paper introduces the Trinary decision tree, an algorithm designed to improve the handling of missing data in decision tree regressors and classifiers. Unlike other approaches, the Trinary decision tree does not assume that missing values contain any information about the response. Both theoretical calculations on estimator bias and numerical illustrations using real data sets are presented to compare its performance with established algorithms in different missing data scenarios: Missing Completely at Random (MCAR) and Informative Missingness (IM). Notably, the Trinary tree outperforms its peers in MCAR settings, especially when data is missing only out-of-sample, while lagging behind in IM settings. A hybrid model, the TrinaryMIA tree, which combines the Trinary tree and the Missing In Attributes (MIA) approach, shows robust performance in all types of missingness. Despite the potential drawback of slower training speed, the Trinary tree offers a promising and more accurate method of handling missing data in decision tree algorithms.
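The difference between the two missingness scenarios named in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the experimental setup of the paper: the function names `mcar_mask` and `informative_mask`, the synthetic data, and the particular IM rule (rows with an above-median response are three times as likely to be missing) are all hypothetical choices made for illustration.

```python
import numpy as np

def mcar_mask(n, rate, rng):
    """MCAR: every value is dropped with the same probability,
    independently of both the feature and the response."""
    return rng.random(n) < rate

def informative_mask(y, rate, rng):
    """IM (illustrative assumption): rows whose response lies above the
    median are more likely to be missing. Requires rate <= 2/3."""
    high = y > np.median(y)
    # About half the rows are "high", so the overall rate stays near `rate`.
    p = np.where(high, 1.5 * rate, 0.5 * rate)
    return rng.random(len(y)) < p

rng = np.random.default_rng(0)
n = 20_000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)  # synthetic response, for illustration only

# Gap between the mean response of rows with and without a missing feature.
gaps = {}
for name, mask in [("MCAR", mcar_mask(n, 0.3, rng)),
                   ("IM", informative_mask(y, 0.3, rng))]:
    gaps[name] = y[mask].mean() - y[~mask].mean()
    print(f"{name}: missing rate={mask.mean():.2f}, response gap={gaps[name]:.2f}")
```

Under MCAR the gap should be close to zero, so the fact that a value is missing carries no information about the response. Under the IM rule assumed here the gap is clearly positive, which is the kind of signal that a method treating missingness as informative can exploit.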

National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:su:diva-226745; OAI: oai:DiVA.org:su-226745; DiVA, id: diva2:1838774
Available from: 2024-02-19 Created: 2024-02-19 Last updated: 2024-02-26 Bibliographically approved
In thesis
1. Tree-based machine learning methods with non-life insurance applications
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Non-life insurance is a field which has been data-driven for a long time, with the statistical framework behind modern-day actuarial science laid out at the beginning of the 20th century. Problems regarding the estimation and prediction of risk are relevant to the insurance industry specifically, but also to society as a whole. The rise of machine learning methods has created a new set of tools that can be used to solve these problems. This thesis contains five individual papers, all of which are related to developing machine learning or data-driven methods and algorithms that can be applied to, but are not limited to, non-life insurance applications.

Paper I takes an existing probabilistic model for claims reserving, the Collective Reserving Model (CRM), and replaces the linear modeling approach of the original paper with non-linear machine learning methods. The paper addresses issues in these applications and provides a framework for how to implement and evaluate machine learning models in a reserving setting. It also discusses how to implement early stopping methods given different levels of data granularity. The models are evaluated on a series of simulated data sets with promising results.

Paper II does not use a machine learning method per se but instead develops the CRM used in Paper I by adding the openness status of the claims to the dynamics, and presents the CRM with Openness (CRMO) as a means to model the non-linear effects implied in Paper I. The paper presents how the model can be estimated using regression methods, and provides recursive formulas for the moments of the predicted reserve. The algorithm is evaluated in terms of accuracy on the same data set as in Paper I and shows results that are comparable to the machine learning implementations of the CRM.

Paper III presents a new boosting algorithm called the Cyclic Gradient Boosting Machine (CGBM). The algorithm extends the classical gradient boosting machine to provide multi-dimensional function approximation. The paper shows how the CGBM can be used to estimate entire probability distributions rather than just the mean of the distribution. The paper also discusses potential problems with hyperparameter tuning in this higher-dimensional hyperparameter space and provides a dimension-wise early stopping method, which is shown to be useful for avoiding overfitting. Numerical illustrations show accurate results on simulated and real data sets.

Paper IV is not directly related to non-life insurance but rather to decision trees used for classification and regression. The paper presents the trinary tree algorithm, a new way to handle missing input data in tree-based models, meant to provide a more regularized model than other suggested methods. The algorithm is benchmarked against standard methods for handling missing data and shows promising results even for high rates of missing data.
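For context, one baseline named in the Paper IV abstract is the Missing In Attributes (MIA) approach. The following is a minimal sketch of an MIA-style split search for a single feature in a regression tree, assuming the commonly described form of MIA in which rows with a missing value are tried in the left child, in the right child, and in a child of their own. It is not the trinary tree algorithm and not code from the thesis; the function names `sse` and `mia_best_split` and the toy data are hypothetical.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the node mean (0 for an empty node)."""
    return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

def mia_best_split(x, y):
    """Search MIA-style splits on one feature x that may contain NaN.

    For each threshold, rows with missing x are tried in the left child and
    in the right child; a split of missing versus observed rows is also
    considered. Returns (loss, threshold, missing_side).
    """
    miss = np.isnan(x)
    # Start from the "missing versus observed" split, which has no threshold.
    best = (sse(y[miss]) + sse(y[~miss]), None, "separate")
    # Skip the largest observed value so the observed right side is never empty.
    for t in np.unique(x[~miss])[:-1]:
        left_obs = ~miss & (x <= t)
        for side, left in (("left", left_obs | miss), ("right", left_obs)):
            loss = sse(y[left]) + sse(y[~left])
            if loss < best[0]:
                best = (loss, float(t), side)
    return best

# Toy data: rows with a missing feature value resemble the rows with large x.
x = np.array([1.0, 2.0, 3.0, 4.0, np.nan, np.nan])
y = np.array([0.0, 0.0, 10.0, 10.0, 10.0, 10.0])
result = mia_best_split(x, y)
print(result)
```

Because the side chosen for missing values is fitted to the training responses, a split of this kind can pick up a relationship between missingness and the response. The Paper IV abstract describes the Trinary tree as not making that assumption.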

Paper V presents a generalized linear model with non-linear effects induced by varying coefficients, with the varying coefficients estimated using the CGBM from Paper III. This is a special case of a varying coefficient model (VCM). The model can handle highly non-linear effects while maintaining local interpretability. The paper also shows how tuning, feature selection, and evaluation of interaction effects can be simplified compared to other VCMs. The model is evaluated on the same data set as in Paper III and shows promising results in terms of accuracy and interpretability.
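One common way to write a varying coefficient model with a link function is sketched below. The notation is an assumption made for illustration and is not taken from the thesis: g is the link function, x_1, ..., x_p are the features entering linearly, and z denotes the features that the coefficient functions depend on (z may coincide with x).

```latex
g\bigl(\mathbb{E}[Y \mid x, z]\bigr) = \beta_0(z) + \sum_{j=1}^{p} \beta_j(z)\, x_j
```

For a fixed value of z the right-hand side is an ordinary generalized linear model in x, which is one way to read the phrase "local interpretability" above.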

Place, publisher, year, edition, pages
Stockholm: Department of Mathematics, Stockholm University, 2024. p. 65
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:su:diva-226748 (URN), 978-91-8014-677-7 (ISBN), 978-91-8014-678-4 (ISBN)
Public defence
2024-04-12, hörsal 4, hus 2, Campus Albano, Greta Arwidssons väg 28, Stockholm, 13:00 (English)
Available from: 2024-03-20 Created: 2024-02-19 Last updated: 2024-03-12 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

https://arxiv.org/abs/2309.03561

Authority records

Zakrisson, Henning
