Shape-aware generalized silhouette analysis to evaluate fuzzy clustering at the point-wise, cluster-wise and global levels
Stockholm University, Faculty of Science, Department of Mathematics.
Stockholm University, Faculty of Science, Department of Mathematics. ORCID iD: 0000-0001-8009-6265
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Validation is an essential part of clustering analysis, used to assess the quality of the detected patterns. One of the best-known validation methods is the silhouette index, which is only applicable to hard clustering results. In this paper, we develop a fuzzy clustering validation framework based on the silhouette index, termed Shape-aware Generalized Silhouette Analysis (SAGSA), which allows for an extensive evaluation and diagnosis of possible problems in the clustering results at the point-wise, cluster-wise and global levels.
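For orientation, the classical silhouette index for hard clustering can be sketched as follows. This is a minimal illustration of the standard definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), not the SAGSA method itself; the function name `silhouette_values` and the toy data are assumptions made for this example, and at least two clusters are required.

```python
import math

def silhouette_values(points, labels):
    """Classical (hard-clustering) silhouette value s(i) for every point."""
    n = len(points)
    clusters = sorted(set(labels))
    values = []
    for i in range(n):
        # Mean distance from point i to the members of each cluster (excluding i).
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(math.dist(points[i], points[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: s(i) is set to 0 by convention.
            values.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

# Toy data (hypothetical): two well-separated pairs of points.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_values(points, [0, 0, 1, 1])
bad = silhouette_values(points, [0, 1, 1, 1])   # point 1 is misassigned

# Point-wise values can be averaged per cluster or over all points.
global_index = sum(good) / len(good)
```

In the classical index, values near 1 indicate a well-placed point, values near 0 a point between clusters, and negative values a likely misassignment; averaging per cluster or over the whole dataset gives the cluster-wise and global summaries.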

In particular, a probabilistic framework is formulated to quantify the cohesion (compactness) and separation of the detected clusters so that fuzzy clustering results can be handled. Furthermore, graph-based (shape-aware) distances are employed to faithfully capture nonlinear structures, enabling an accurate validation of curved clusters. Finally, a graphical tool, the cohesion-separation (CS) plot, is introduced that allows us to visually assess clustering results at different levels regardless of the dimensionality of the dataset. To show its effectiveness in diagnosing problems in clustering results, SAGSA is compared with other fuzzy clustering validation methods on test cases with different types of clustering challenges, namely clusters with arbitrary shapes, imbalanced sizes, overlap, hierarchical structure, noise contamination, etc.
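The abstract does not specify how the graph-based distances are constructed. As a hedged illustration of the general idea only, one common choice is the shortest-path (geodesic) distance on a k-nearest-neighbour graph, as used in Isomap. The sketch below assumes that construction; the helper names `knn_graph` and `graph_distances` and the half-circle data are hypothetical.

```python
import heapq
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph with Euclidean edge weights."""
    n = len(points)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        nearest = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        for d, j in nearest:
            adj[i][j] = d
            adj[j][i] = d  # symmetrise the graph
    return adj

def graph_distances(adj, source):
    """Dijkstra shortest-path lengths from source; unreachable nodes stay inf."""
    dist = {v: math.inf for v in adj}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, w in adj[u].items():
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Hypothetical curved cluster: 21 points along a half circle of radius 1.
arc = [(math.cos(math.pi * i / 20), math.sin(math.pi * i / 20)) for i in range(21)]
adj = knn_graph(arc, k=2)
geodesic = graph_distances(adj, 0)[20]     # distance measured along the curve
euclidean = math.dist(arc[0], arc[20])     # straight-line distance
```

For the two end points of the half circle, the graph distance follows the curve and is therefore close to the arc length π, while the Euclidean distance cuts straight across the gap. This is the sense in which a graph-based distance can respect the shape of a curved cluster.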

Keywords [en]
Clustering validation, Silhouette index, Graph distance, Fuzzy clustering
National Category
Computational Mathematics
Identifiers
URN: urn:nbn:se:su:diva-231994
OAI: oai:DiVA.org:su-231994
DiVA, id: diva2:1883705
Available from: 2024-07-11 Created: 2024-07-11 Last updated: 2024-07-11
In thesis
1. Contemporary developments and applications of unsupervised machine learning methods
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis presents state-of-the-art developments in the field of unsupervised learning, particularly in clustering analysis. Unsupervised learning is a branch of machine learning whose task is to discover hidden patterns and relationships in high-dimensional data without any labels. It is an important step in providing valuable insights, e.g., the existence of important discrete structures and low-dimensional features, for downstream statistical analyses as well as revealing anomalies. The achievements of this thesis detailed below advance our toolboxes in pattern recognition and anomaly detection that have potential applications in many scientific areas with unstructured and unlabelled data.

Paper I presents the application of unsupervised change point (CP) detection to molecular time series to explain the dynamics of motor proteins. Data-driven non-parametric detection of CP enables an objective identification and modelling of stepping patterns in molecular motors. Beyond CP detection, this study provides further tools to analyze molecular motors, such as the reliable extraction of reaction statistics and establishing a predictive model for the reaction rates. The methods developed and applied in this paper are applicable to time series data from a broad range of scientific fields.

Paper II proposes the Graph-based Fuzzy Density Peak Clustering (GF-DPC) method, which comprehensively generalizes existing density-based clustering methods. First, graph-based methods are employed to estimate densities and capture nonlinearities in the data, which enhances the power to detect clusters with arbitrary shapes. Second, a fuzzy extension is formulated to provide a probabilistic framework for assigning data points to clusters. Finally, the identification of cluster centers and the number of clusters is automated by means of a fuzzy clustering validation index. Compared with other well-known fuzzy clustering methods, the superior performance of GF-DPC in discovering clusters with arbitrary shapes, densities, separations and overlap is demonstrated using both intuitive examples and real datasets.

Paper III establishes a versatile validation framework for fuzzy clustering, termed the Shape-aware Generalized Silhouette Analysis (SAGSA), based on the silhouette index. In SAGSA, a probabilistic framework is formulated to quantify the degree of cohesion and separation of the detected fuzzy clusters. In addition, graph-based distances are employed in SAGSA to facilitate an accurate validation of nonlinear clustering structures. Most importantly, a 2-dimensional graphical tool, the cohesion-separation (CS) plot, is introduced to enable visual diagnosis of possible problems in the clustering results at the point-wise, cluster-wise and global levels regardless of the dimensionality of the dataset. Finally, we illustrate the effectiveness of SAGSA in cluster validation compared with other commonly used methods on various test examples of clustering challenges. These include clusters with arbitrary shapes, imbalanced sizes, overlap, hierarchical structure, noise contamination, etc.

Place, publisher, year, edition, pages
Stockholm: Department of Mathematics, Stockholm University, 2024. p. 35
Keywords
Clustering analysis, Fuzzy clustering, Graph-based methods, Clustering validation, Time series analysis, Change point detection
National Category
Computational Mathematics
Research subject
Computational Mathematics
Identifiers
urn:nbn:se:su:diva-231996 (URN)
978-91-8014-869-6 (ISBN)
978-91-8014-870-2 (ISBN)
Public defence
2024-09-13, Hörsal 2, Hus 2, Campus Albano, Albanovägen 18, Stockholm, 13:00 (English)
Available from: 2024-08-21 Created: 2024-07-11 Last updated: 2024-08-13 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Tas Kiper, Busra; Li, Chun-Biu
