Contemporary developments and applications of unsupervised machine learning methods
Stockholm University, Faculty of Science, Department of Mathematics.
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis presents state-of-the-art developments in the field of unsupervised learning, particularly in clustering analysis. Unsupervised learning is a branch of machine learning whose task is to discover hidden patterns and relationships in high-dimensional data without any labels. It is an important step in providing valuable insights for downstream statistical analyses, e.g., the existence of important discrete structures and low-dimensional features, as well as in revealing anomalies. The contributions detailed below extend the toolbox for pattern recognition and anomaly detection, with potential applications in many scientific areas that work with unstructured and unlabelled data.

Paper I presents the application of unsupervised change point (CP) detection to molecular time series to explain the dynamics of motor proteins. Data-driven non-parametric detection of CPs enables objective identification and modelling of stepping patterns in molecular motors. Beyond CP detection, this study provides further tools to analyze molecular motors, such as the reliable extraction of reaction statistics and the establishment of a predictive model for the reaction rates. The methods developed and applied in this paper are applicable to time series data from a broad range of scientific fields.
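
To make the idea of change point detection concrete, here is a minimal sketch in Python. It is not the non-parametric method of Paper I: it assumes a piecewise-constant mean and uses plain binary segmentation with a squared-error criterion. The function names `best_split` and `detect_change_points` and the threshold `min_gain` are hypothetical and chosen for illustration only.

```python
import numpy as np

def best_split(x):
    """Return (index, gain) of the single split that most reduces the
    squared error when one constant mean is replaced by two."""
    total = np.sum((x - x.mean()) ** 2)
    best_k, best_gain = None, 0.0
    for k in range(1, len(x)):  # left = x[:k], right = x[k:]
        left, right = x[:k], x[k:]
        cost = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        gain = total - cost
        if gain > best_gain:
            best_k, best_gain = k, gain
    return best_k, best_gain

def detect_change_points(x, min_gain, offset=0):
    """Binary segmentation: split recursively while the error reduction
    of the best split exceeds min_gain (an assumed, user-chosen threshold)."""
    x = np.asarray(x, dtype=float)
    if len(x) < 2:
        return []
    k, gain = best_split(x)
    if k is None or gain < min_gain:
        return []
    left = detect_change_points(x[:k], min_gain, offset)
    right = detect_change_points(x[k:], min_gain, offset + k)
    return left + [offset + k] + right

# Synthetic stepping trace: three plateaus with small Gaussian noise.
rng = np.random.default_rng(0)
trace = np.concatenate([np.zeros(50), np.full(50, 3.0), np.full(50, 6.0)])
trace = trace + rng.normal(0.0, 0.1, size=trace.size)
print(detect_change_points(trace, min_gain=10.0))
```

On a stepping trace like this one, the detected indices mark the boundaries between plateaus; dwell times between consecutive change points are the kind of quantity from which reaction statistics could then be extracted.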

Paper II proposes the Graph-based Fuzzy Density Peak Clustering (GF-DPC) method, which comprehensively generalizes existing density-based clustering methods. First, graph-based methods are employed to estimate densities and capture nonlinearities in the data, which enhances the power to detect clusters of arbitrary shape. Second, a fuzzy extension is formulated to provide a probabilistic framework for assigning data points to clusters. Finally, the identification of cluster centers and of the number of clusters is automated using a fuzzy clustering validation index. Using both intuitive examples and real datasets, GF-DPC is shown to outperform other well-known fuzzy clustering methods in discovering clusters with arbitrary shapes, densities, separations and overlaps.
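
As a rough illustration of what a graph-based distance can look like, the sketch below builds a symmetric k-nearest-neighbour graph and computes shortest-path (geodesic) distances with the Floyd-Warshall algorithm. This is an assumption about one common construction, not the specific graph used in GF-DPC; the function name `geodesic_distances` and the parameter `k` are hypothetical.

```python
import numpy as np

def geodesic_distances(X, k=2):
    """Shortest-path distances on a symmetric k-nearest-neighbour graph.
    Assumes distinct points; unreachable pairs keep distance inf."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances.
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    graph = np.full((n, n), np.inf)
    np.fill_diagonal(graph, 0.0)
    # Column 0 of the sorted order is the point itself, so skip it.
    nearest = np.argsort(d, axis=1)[:, 1:k + 1]
    for i in range(n):
        graph[i, nearest[i]] = d[i, nearest[i]]
        graph[nearest[i], i] = d[nearest[i], i]  # make the graph symmetric
    # Floyd-Warshall: allow each node in turn as an intermediate stop.
    for m in range(n):
        graph = np.minimum(graph, graph[:, [m]] + graph[[m], :])
    return graph

# Points on a half circle: the straight-line distance between the two
# ends is 2, while the distance along the curve is close to pi.
angles = np.linspace(0.0, np.pi, 20)
arc = np.column_stack([np.cos(angles), np.sin(angles)])
D = geodesic_distances(arc, k=2)
print(D[0, -1])
```

The half-circle example shows why such distances are called shape-aware: two points that are close in a straight line can be far apart when the path has to follow the data.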

Paper III establishes a versatile validation framework for fuzzy clustering, termed the Shape-aware Generalized Silhouette Analysis (SAGSA), based on the silhouette index. In SAGSA, a probabilistic framework is formulated to quantify the degree of cohesion and separation of the detected fuzzy clusters. In addition, graph-based distances are employed in SAGSA to facilitate an accurate validation of nonlinear clustering structures. Most importantly, a 2-dimensional graphical tool, the cohesion-separation (CS) plot, is introduced to enable visual diagnosis of possible problems in the clustering results at the point-wise, cluster-wise and global levels, regardless of the dimensionality of the dataset. Finally, we illustrate the effectiveness of SAGSA in cluster validation compared with other commonly used methods on various test examples of clustering challenges, including clusters with arbitrary shapes, imbalanced sizes, overlaps, hierarchical structures, and noise.

Place, publisher, year, edition, pages
Stockholm: Department of Mathematics, Stockholm University, 2024. p. 35
Keywords [en]
Clustering analysis, Fuzzy clustering, Graph-based methods, Clustering validation, Time series analysis, Change point detection
National Category
Computational Mathematics
Research subject
Computational Mathematics
Identifiers
URN: urn:nbn:se:su:diva-231996
ISBN: 978-91-8014-869-6 (print)
ISBN: 978-91-8014-870-2 (electronic)
OAI: oai:DiVA.org:su-231996
DiVA, id: diva2:1883744
Public defence
2024-09-13, Hörsal 2, Hus 2, Campus Albano, Albanovägen 18, Stockholm, 13:00 (English)
Opponent
Supervisors
Available from: 2024-08-21 Created: 2024-07-11 Last updated: 2024-08-13. Bibliographically approved
List of papers
1. Rotary properties of hybrid F1-ATPases consisting of subunits from different species
2023 (English). In: iScience, E-ISSN 2589-0042, Vol. 26, no 5, article id 106626. Article in journal (Refereed). Published
Abstract [en]

F1-ATPase (F1) is an ATP-driven rotary motor protein ubiquitously found in many species as the catalytic portion of FoF1-ATP synthase. Despite the highly conserved amino acid sequence of the catalytic core subunits, alpha and beta, F1 shows diversity in the maximum catalytic turnover rate Vmax and the number of rotary steps per turn. To study the design principle of F1, we prepared eight hybrid F1s composed of subunits from two of three genuine F1s: thermophilic Bacillus PS3 (TF1), bovine mitochondria (bMF1), and Paracoccus denitrificans (PdF1), differing in the Vmax and the number of rotary steps. The Vmax of the hybrids can be well fitted by a quadratic model highlighting the dominant roles of 0 and the couplings between alpha-beta. Although there exist no simple rules on which subunit dominantly determines the number of steps, our findings show that the stepping behavior is characterized by the combination of all subunits.

National Category
Biochemistry and Molecular Biology
Identifiers
urn:nbn:se:su:diva-229678 (URN)
10.1016/j.isci.2023.106626 (DOI)
001001097500001 ()
37192978 (PubMedID)
2-s2.0-85153262159 (Scopus ID)
Available from: 2024-05-27 Created: 2024-05-27 Last updated: 2024-07-11. Bibliographically approved
2. Automated graph-based fuzzy density peak clustering to detect high-dimensional discrete structures of arbitrary shapes
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Density-based clustering methods are prominent approaches for discovering discrete structures buried in high-dimensional (HD) data in terms of density variations. Among them is the well-known Density Peak Clustering (DPC) proposed by Rodriguez and Laio (2014), which performs fairly well in detecting clusters with nonlinear shapes and varying densities. However, it has several shortcomings: it does not learn the nonlinear shapes of the underlying HD data, lacks a probabilistic framework to handle overlapping clusters, and is not fully automated.

Here we develop comprehensive generalizations of DPC, termed Graph-based Fuzzy Density Peak Clustering (GF-DPC), to circumvent these limitations. In GF-DPC, graph-based methods are employed to robustly estimate densities and capture nonlinearities in the HD data, which enhances its power in detecting clusters with arbitrary shapes. Furthermore, a fuzzy extension is introduced that returns a probabilistic assignment of data points to the detected clusters. Finally, the identification of cluster centers and of the number of clusters is automated and generalized in terms of a fuzzy clustering validation index. Using both intuitive examples and real datasets, GF-DPC is shown to outperform other well-known fuzzy clustering methods in discovering clusters with arbitrary shapes, densities, separations and overlaps.
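
For orientation, the baseline that GF-DPC generalizes can be sketched as follows. This is a minimal version of the original hard-assignment DPC idea (local density rho, distance delta to the nearest point of higher density, centers chosen by large rho times delta), not GF-DPC itself. The Gaussian kernel, the bandwidth `d_c`, the fixed `n_clusters` and the function name are assumptions of this sketch; in particular, the number of clusters is supplied by hand here rather than selected automatically.

```python
import numpy as np

def density_peak_clustering(X, d_c, n_clusters):
    """Hard-assignment density peak clustering sketch.
    Returns (labels, centers, rho, delta)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # Local density with a Gaussian kernel; subtract the self-contribution.
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    order = np.argsort(-rho)  # indices by decreasing density
    delta = np.zeros(n)
    parent = np.full(n, -1)   # nearest neighbour of higher density
    delta[order[0]] = d[order[0]].max()  # convention for the densest point
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]
        j = higher[np.argmin(d[i, higher])]
        delta[i], parent[i] = d[i, j], j
    # Centers: points that are both dense and far from any denser point.
    centers = np.argsort(-(rho * delta))[:n_clusters]
    labels = np.full(n, -1)
    labels[centers] = np.arange(n_clusters)
    # Remaining points inherit the label of their denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers, rho, delta

# Two well-separated Gaussian blobs as a toy dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(5.0, 0.3, (30, 2))])
labels, centers, rho, delta = density_peak_clustering(X, d_c=0.5, n_clusters=2)
print(labels)
```

The sketch makes the three limitations named in the abstract visible: distances are plain Euclidean, each point receives exactly one label, and both `d_c` and `n_clusters` must be chosen by the user.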

Keywords
Density based clustering, Fuzzy clustering, Graph distance, Automatic validation
National Category
Computational Mathematics
Identifiers
urn:nbn:se:su:diva-231993 (URN)
Available from: 2024-07-11 Created: 2024-07-11 Last updated: 2024-07-11
3. Shape-aware generalized silhouette analysis to evaluate fuzzy clustering at the point-wise, cluster-wise and global levels
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Validation is an essential part of clustering analysis to assess the quality of the detected patterns. One of the most well-known validation methods is the silhouette index, which is only applicable to hard clustering results. In this paper, we develop a fuzzy clustering validation framework based on the silhouette index, termed Shape-aware Generalized Silhouette Analysis (SAGSA), which allows for an extensive evaluation and diagnosis of possible problems in the clustering results at the point-wise, cluster-wise and global levels.

In particular, a probabilistic framework to quantify the cohesion (compactness) and separation of the detected clusters is formulated to handle fuzzy clustering results. Furthermore, graph-based (shape-aware) distances are employed to faithfully capture nonlinear structures, enabling an accurate validation of curved clusters. Finally, a graphical tool, the cohesion-separation (CS) plot, is introduced that allows us to visually assess clustering results at different levels regardless of the dimensionality of the dataset. To show its effectiveness in diagnosing problems in clustering results, SAGSA is compared with other fuzzy clustering validation methods on test cases with different types of clustering challenges, namely clusters with arbitrary shapes, imbalanced sizes, overlaps, hierarchical structures, and noise.
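
The classical (hard) silhouette index that SAGSA builds on can be sketched as below. For each point, cohesion a is the mean distance to the other members of its own cluster, separation b is the smallest mean distance to any other cluster, and the silhouette value is (b - a) / max(a, b). This is the standard hard-clustering definition with Euclidean distances, not the fuzzy, graph-based generalization of the paper; the function name `silhouette_values` is hypothetical.

```python
import numpy as np

def silhouette_values(X, labels):
    """Point-wise silhouette values for a hard clustering.
    Assumes at least two clusters; singleton clusters get the value 0."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # singleton cluster: keep s(i) = 0 by convention
        # Cohesion: mean distance to the other members of the own cluster
        # (d[i, i] = 0, so dividing by size - 1 excludes the point itself).
        a = d[i, same].sum() / (same.sum() - 1)
        # Separation: smallest mean distance to any other cluster.
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = silhouette_values(points, [0, 0, 1, 1])  # matches the two groups
bad = silhouette_values(points, [0, 1, 0, 1])   # cuts across the groups
print(good.mean(), bad.mean())
```

Values near 1 indicate a point that sits well inside its cluster, while negative values flag points that are on average closer to another cluster. Averaging over points gives cluster-wise and global scores, which correspond to the three levels of diagnosis mentioned in the abstract.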

Keywords
Clustering validation, Silhouette index, Graph distance, Fuzzy clustering
National Category
Computational Mathematics
Identifiers
urn:nbn:se:su:diva-231994 (URN)
Available from: 2024-07-11 Created: 2024-07-11 Last updated: 2024-07-11

Open Access in DiVA

Contemporary developments and applications of unsupervised machine learning methods (1665 kB), 19 downloads
File information
File name: FULLTEXT01.pdf
File size: 1665 kB
Checksum SHA-512:
32679d307d45a5e0e703be12f584d8a11f4e6c649d3800c8a7ca61b7a4d266d065a0f0124bd2382b07b3a58202c11673b2255928bd3c19a98054524c6bdcdd7e
Type: fulltext
Mimetype: application/pdf

Authority records

Tas Kiper, Busra
