Learning predictive models from graph data using pattern mining
Stockholm University, Faculty of Social Sciences, Department of Computer and Systems Sciences.
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Learning from graphs has become a popular research area due to the ubiquity of graph data representing web pages, molecules, social networks, protein interaction networks, etc. However, standard graph learning approaches are often challenged by the computational cost involved in the learning process, due to the richness of the representation. Attempts to improve their efficiency are often associated with the risk of degrading the performance of the predictive models, creating tradeoffs between the efficiency and effectiveness of the learning. Such a situation is analogous to an optimization problem with two objectives, efficiency and effectiveness, where improving one objective without making the other worse off is a better solution, called a Pareto improvement. This thesis investigates how to improve the efficiency and effectiveness of learning from graph data using pattern mining methods. Two objectives are set: one concerns how to improve the efficiency of pattern mining without reducing the predictive performance of the learning models, and the other concerns how to improve predictive performance without increasing the complexity of pattern mining. The employed research method mainly follows a design science approach, including the development and evaluation of artifacts. The contributions of this thesis include a data representation language that can be characterized as a form in between sequences and itemsets, where the graph information is embedded within items. Several studies, each of which looks for Pareto improvements in efficiency and effectiveness, are conducted using sets of small graphs. Summarizing the findings, some of the proposed methods, namely maximal frequent itemset mining and constraint-based itemset mining, result in a dramatically increased efficiency of learning without decreasing the predictive performance of the resulting models. It is also shown that additional background knowledge can be used to enhance the performance of the predictive models without increasing the complexity of the graphs.
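The representation sketched in the abstract, where graph information is embedded within items, can be illustrated with a short example. The Python snippet below is a minimal, hypothetical sketch of that general idea rather than the thesis' exact encoding: each labelled edge of a toy graph becomes one item, and a naive frequent itemset miner over the resulting itemsets produces candidate features. The graphs, the edge encoding and the support threshold are all invented for illustration.

from itertools import combinations
from collections import Counter

# Hypothetical encoding (not the thesis' own): each labelled edge
# (atom_label, bond_label, atom_label) becomes a single item, so a graph
# collapses into an itemset of edge items.
def graph_to_itemset(edges):
    return frozenset(f"{a}-{bond}-{b}" for a, bond, b in edges)

# Three toy molecule-like graphs given as labelled edge lists.
graphs = [
    [("C", "single", "C"), ("C", "single", "O"), ("O", "single", "H")],
    [("C", "double", "O"), ("C", "single", "O"), ("O", "single", "H")],
    [("C", "single", "C"), ("C", "double", "O")],
]
itemsets = [graph_to_itemset(g) for g in graphs]

# Naive frequent itemset mining: count every sub-itemset of size 1 or 2 and
# keep those whose support reaches the (invented) minimum threshold.
MIN_SUPPORT = 2
counts = Counter()
for items in itemsets:
    for k in (1, 2):
        for combo in combinations(sorted(items), k):
            counts[combo] += 1

frequent = {combo: c for combo, c in counts.items() if c >= MIN_SUPPORT}
for combo, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(support, combo)  # each frequent itemset is a candidate feature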

Place, publisher, year, edition, pages
Stockholm: Department of Computer and Systems Sciences, Stockholm University, 2014. 118 p.
Series
Report Series / Department of Computer & Systems Sciences, ISSN 1101-8526 ; 14-003
Keyword [en]
Machine Learning, Graph Data, Pattern Mining, Classification, Regression, Predictive Models
National Category
Computer Science
Research subject
Computer and Systems Sciences
Identifiers
URN: urn:nbn:se:su:diva-100713; ISBN: 978-91-7447-837-2 (print); OAI: oai:DiVA.org:su-100713; DiVA: diva2:695664
Public defence
2014-03-25, room B, Forum, Isafjordsgatan 39, Kista, 13:00 (English)
Available from: 2014-03-03. Created: 2014-02-11. Last updated: 2014-03-04. Bibliographically approved.
List of papers
1. DIFFER: A Propositionalization Approach for Learning from Structured Data
2006 (English). In: Proceedings of World Academy of Science, Engineering and Technology, ISSN 2010-376X, E-ISSN 2070-3740, Vol. 15, 49-51 p. Article in journal (Refereed), Published.
Abstract [en]

Logic-based methods for learning from structured data are limited with respect to handling large search spaces, preventing large-sized substructures from being considered by the resulting classifiers. A novel approach to learning from structured data is introduced that employs a structure transformation method, called finger printing, to address these limitations. The method, which generates features corresponding to arbitrarily complex substructures, is implemented in a system called DIFFER. The method is demonstrated to perform comparably to an existing state-of-the-art method on some benchmark data sets without requiring restrictions on the search space. Furthermore, learning from the union of features generated by finger printing and the previous method outperforms learning from each individual set of features on all benchmark data sets, demonstrating the benefit of developing complementary, rather than competing, methods for structure classification.

Keyword
WASET, Engineering, Technology
National Category
Engineering and Technology
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-101072 (URN)
Available from: 2014-02-24. Created: 2014-02-24. Last updated: 2017-12-05. Bibliographically approved.
2. Learning from Structured Data by Finger Printing
2006 (English). In: Proceedings of the Ninth Scandinavian Conference on Artificial Intelligence (SCAI 2006), Helsinki: Finnish Artificial Intelligence Society FAIS, 2006, 120-126 p. Conference paper, Published paper (Refereed).
Place, publisher, year, edition, pages
Helsinki: Finnish Artificial Intelligence Society FAIS, 2006
Series
Publications of the Finnish Artificial Intelligence Society, ISSN 1238-4658 ; 22
National Category
Computer Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-37961 (URN); 978-952-5677-00-3 (ISBN)
Conference
The Ninth Scandinavian Conference on Artificial Intelligence, Helsinki University of Technology, Espoo, Finland, October 25-27, 2006
Available from: 2011-01-18. Created: 2010-03-24. Last updated: 2014-02-26. Bibliographically approved.
3. Learning to Classify Structured Data by Graph Propositionalization
2006 (English). In: Proceedings of the Second IASTED International Conference on Computational Intelligence, 2006. Conference paper, Published paper (Refereed).
Abstract [en]

Existing methods for learning from structured data are limited with respect to handling large or isolated substructures and also impose constraints on search depth and induced structure length. An approach to learning from structured data using a graph-based propositionalization method, called finger printing, is introduced that addresses the limitations of current methods. The method is implemented in a system called DIFFER, which is demonstrated to compare favorably to existing state-of-the-art methods on some benchmark data sets. It is shown that further improvements can be obtained by combining the features generated by finger printing with features generated by previous methods.
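Propositionalization, as used in the abstract, means turning each structured example into a fixed-length feature vector that a standard attribute-value learner can consume. The sketch below shows only that final step, under the assumption that a set of substructure features (represented here as plain strings) has already been discovered; the feature names and examples are invented and do not correspond to DIFFER's actual output.

# Assumed input: for each example, the set of substructures found in it.
examples = [
    {"C-C", "C-O"},          # substructures found in graph 1 (illustrative)
    {"C=O", "C-O", "O-H"},   # graph 2
    {"C-C", "C=O"},          # graph 3
]

# The feature vocabulary is the union of all discovered substructures.
vocabulary = sorted(set().union(*examples))

def propositionalize(substructures, vocabulary):
    # One boolean attribute per substructure: 1 if present, 0 otherwise.
    return [1 if feature in substructures else 0 for feature in vocabulary]

feature_matrix = [propositionalize(e, vocabulary) for e in examples]
print(vocabulary)
for row in feature_matrix:
    print(row)  # fixed-length vectors usable by any standard learner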

Keyword
Machine Learning, Graph, Classification, Structured data
National Category
Computer Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-38408 (URN)
Conference
The IASTED International Conference on Computational Intelligence, November 20 – 22, 2006, San Francisco, USA
Available from: 2010-04-13. Created: 2010-04-13. Last updated: 2014-02-26. Bibliographically approved.
4. The effect of background knowledge in graph-based learning in the chemoinformatics domain
2008 (English). In: Trends in Intelligent Systems and Computer Engineering / [ed] Oscar Castillo, Li Xu, Sio-Iong Ao, Springer, 2008, 141-153 p. Chapter in book (Refereed).
Abstract [en]

Typical machine learning systems often use a set of previous experiences (examples) to learn concepts, patterns, or relations hidden within the data [1]. Current machine learning approaches are challenged by the growing size of the data repositories and the growing complexity of those data [1, 2]. In order to accommodate the requirement of being able to learn from complex data, several methods have been introduced in the field of machine learning [2]. Based on the way the input and resulting hypotheses are represented, two main categories of such methods exist, namely, logic-based and graph-based methods [3]. The demarcation line between logic- and graph-based methods lies in the differences of their data representation methods, hypothesis formation, and testing as well as the form of the output produced.

The main purpose of our study is to investigate the effect of incorporating background knowledge into graph learning methods. The ability of graph learning methods to obtain accurate theories with a minimum of background knowledge is of course a desirable property, but not being able to effectively utilize additional knowledge that is available and has been proven important is clearly a disadvantage. Therefore we examine how far additional, already available, background knowledge can be effectively used for increasing the performance of a graph learner. Another contribution of our study is that it establishes a neutral ground to compare classification accuracies of the two closely related approaches, making it possible to study whether graph learning methods actually would outperform ILP methods if the same background knowledge were utilized [9].

The rest of this chapter is organized as follows. The next section discusses related work concerning the contribution of background knowledge when learning from complex data. Section 10.3 provides a description of the graph learning method that is used in our study. The experimental setup, empirical evaluation, and the results from the study are described in Sect. 10.4. Finally, Sect. 10.5 provides conclusions from the experiments and points out interesting extensions of the work reported in this study.

Place, publisher, year, edition, pages
Springer, 2008
Series
Lecture Notes in Electrical Engineering, ISSN 1876-1100 ; 6
National Category
Engineering and Technology
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-101073 (URN); 10.1007/978-0-387-74935-8_10 (DOI); 978-0-387-74934-1 (ISBN); 978-0-387-74935-8 (ISBN)
Available from: 2014-02-24. Created: 2014-02-24. Last updated: 2014-02-26. Bibliographically approved.
5. Graph propositionalization for random forests
2009 (English). In: The Eighth International Conference on Machine Learning and Applications: Proceedings, IEEE Computer Society, 2009, 196-201 p. Conference paper, Published paper (Refereed).
Abstract [en]

Graph propositionalization methods transform structured and relational data into a fixed-length feature vector format that can be used by standard machine learning methods. However, the choice of propositionalization method may have a significant impact on the performance of the resulting classifier. Six different propositionalization methods are evaluated when used in conjunction with random forests. The empirical evaluation shows that the choice of propositionalization method has a significant impact on the resulting accuracy for structured data sets. The results furthermore show that the maximum frequent itemset approach and a combination of this approach and maximal common substructures turn out to be the most successful propositionalization methods for structured data, each significantly outperforming the four other considered methods.
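The paper pairs graph propositionalization with random forests. As a rough, hypothetical illustration of such a pipeline, and not the paper's actual experimental setup, the sketch below trains scikit-learn's RandomForestClassifier on a fabricated binary substructure-presence matrix and reports cross-validated accuracy.

# Illustrative pipeline only: binary substructure features -> random forest.
# Requires numpy and scikit-learn; the data and labels are fabricated.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 40))   # 100 graphs, 40 substructure features
y = (X[:, 0] | X[:, 3]).astype(int)      # toy labels depending on two features

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # 5-fold cross-validated accuracy
print(scores.mean())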

Place, publisher, year, edition, pages
IEEE Computer Society, 2009
National Category
Engineering and Technology
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-101075 (URN); 10.1109/ICMLA.2009.113 (DOI)
Conference
The Eighth International Conference on Machine Learning and Applications (ICMLA), Miami Beach, Florida, 13 - 15 December 2009
Available from: 2014-02-24. Created: 2014-02-24. Last updated: 2014-02-26. Bibliographically approved.
6. Pre-Processing Structured Data for Standard Machine Learning Algorithms by Supervised Graph Propositionalization - a Case Study with Medicinal Chemistry Datasets
2010 (English). In: Ninth International Conference on Machine Learning and Applications (ICMLA), 2010: Proceedings, IEEE Computer Society, 2010, 828-833 p. Conference paper, Published paper (Refereed).
Abstract [en]

Graph propositionalization methods can be used to transform structured and relational data into fixed-length feature vectors, enabling standard machine learning algorithms to be used for generating predictive models. It is however not clear how well different propositionalization methods work in conjunction with different standard machine learning algorithms. Three different graph propositionalization methods are investigated in conjunction with three standard learning algorithms: random forests, support vector machines and nearest neighbor classifiers. An experiment on 21 datasets from the domain of medicinal chemistry shows that the choice of propositionalization method may have a significant impact on the resulting accuracy. The empirical investigation further shows that for datasets from this domain, the use of the maximal frequent item set approach for propositionalization results in the most accurate classifiers, significantly outperforming the two other graph propositionalization methods considered in this study, SUBDUE and MOSS, for all three learning methods.

Place, publisher, year, edition, pages
IEEE Computer Society, 2010
National Category
Information Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-51976 (URN); 10.1109/ICMLA.2010.128 (DOI); 978-1-4244-9211-4 (ISBN)
Conference
Ninth International Conference on Machine Learning and Applications (ICMLA), 12-14 December 2010, Washington D.C., USA
Available from: 2011-01-12. Created: 2011-01-12. Last updated: 2014-02-26. Bibliographically approved.
7. Is Frequent Pattern Mining useful in building predictive models?
2011 (English). In: ECML/PKDD 2011: Workshop of Collective Learning and Inference on Structured Data, 2011. Conference paper, Published paper (Refereed).
Abstract [en]

Recent studies of pattern mining have given more attention to discovering patterns that are interesting, significant, discriminative, and so forth, rather than simply frequent. Does this imply that frequent patterns are not useful anymore? In this paper we carry out a survey of frequent pattern mining and, using an empirical study, show to what extent frequent pattern mining is useful in building predictive models.

Keyword
Frequent pattern mining, predictive models, chemoinformatics
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-67149 (URN)
Conference
ECML/PKDD: Workshop of Collective Learning and Inference on Structured Data, Athens, Greece, 5-9 September 2011
Available from: 2011-12-26. Created: 2011-12-26. Last updated: 2014-02-26. Bibliographically approved.
8. Comparative analysis of the use of chemoinformatics-based and substructure-based descriptors for quantitative structure-activity relationship (QSAR) modeling
2013 (English). In: Intelligent Data Analysis, ISSN 1088-467X, E-ISSN 1571-4128, Vol. 17, no. 2, 327-341 p. Article in journal (Refereed), Published.
Abstract [en]

Quantitative structure-activity relationship (QSAR) models have gained popularity in the pharmaceutical industry due to their potential to substantially decrease drug development costs by reducing expensive laboratory and clinical tests. QSAR modeling consists of two fundamental steps, namely, descriptor discovery and model building. Descriptor discovery methods are either based on chemical domain knowledge or purely data-driven. The former, chemoinformatics-based, and the latter, substructure-based, methods for QSAR modeling have been developed quite independently. As a consequence, evaluations involving both types of descriptor discovery method are rarely seen. In this study, a comparative analysis of chemoinformatics-based and substructure-based approaches is presented. Two chemoinformatics-based approaches, ECFI and SELMA, are compared to five approaches for substructure discovery, CP, graphSig, MFI, MoFa and SUBDUE, using 18 QSAR datasets. The empirical investigation shows that one of the chemoinformatics-based approaches, ECFI, results in significantly more accurate models than all other methods when used on their own. Results from combining descriptor sets are also presented, showing that the addition of ECFI descriptors to any other descriptor set leads to improved predictive performance for that set, while the use of ECFI descriptors can in many cases also be improved by adding descriptors generated by the other methods.
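Combining descriptor sets, as evaluated in the study, amounts in practice to concatenating the corresponding feature matrices column-wise before model building. The sketch below illustrates that idea on fabricated data with a ridge regression model; it is not the study's protocol, and the matrices, activities and parameters are all invented.

# Illustrative comparison on a fabricated regression task: a knowledge-based
# descriptor block, a substructure-based block, and their column-wise union.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_compounds = 120
block_a = rng.random((n_compounds, 16))                        # e.g. chemoinformatics descriptors
block_b = rng.integers(0, 2, (n_compounds, 24)).astype(float)  # e.g. substructure presence
activity = block_a[:, 0] * 2.0 + block_b[:, 1] + rng.normal(0, 0.1, n_compounds)

for name, X in [("block A", block_a),
                ("block B", block_b),
                ("combined", np.hstack([block_a, block_b]))]:
    r2 = cross_val_score(Ridge(alpha=1.0), X, activity, cv=5).mean()
    print(name, round(r2, 3))  # cross-validated R^2 for each descriptor set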

Keyword
QSAR modeling, chemical descriptors, graph mining
National Category
Computer Science
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-91543 (URN); 10.3233/IDA-130581 (DOI); 000319344300010
Funder
Swedish Foundation for Strategic Research, IIS11-0053
Note
AuthorCount: 3
Available from: 2013-07-01. Created: 2013-06-28. Last updated: 2017-12-06. Bibliographically approved.
9. Can frequent itemset mining be efficiently and effectively used for learning from graph data?
2012 (English). In: 11th International Conference on Machine Learning and Applications (ICMLA) / [ed] Juan E. Guerrero, IEEE Computer Society, 2012, 409-414 p. Conference paper, Published paper (Refereed).
Abstract [en]

Standard graph learning approaches are often challenged by the computational cost involved when learning from very large sets of graph data. One approach to overcome this problem is to transform the graphs into less complex structures that can be more efficiently handled. One obvious potential drawback of this approach is that it may degrade predictive performance due to loss of information caused by the transformations. An investigation of the tradeoff between efficiency and effectiveness of graph learning methods is presented, in which state-of-the-art graph mining approaches are compared to representing graphs by itemsets, using frequent itemset mining to discover features to use in prediction models. An empirical evaluation on 18 medicinal chemistry datasets is presented, showing that employing frequent itemset mining results in significant speedups, without sacrificing predictive performance for both classification and regression.

Place, publisher, year, edition, pages
IEEE Computer Society, 2012
Keyword
Graph learning, frequent itemset mining, classification, regression
National Category
Information Systems
Research subject
Computer and Systems Sciences
Identifiers
urn:nbn:se:su:diva-86335 (URN); 10.1109/ICMLA.2012.74 (DOI); 978-0-7695-4913-2 (ISBN)
Conference
ICMLA 2012, December 12-15, Boca Raton, Florida, USA
Available from: 2013-01-12. Created: 2013-01-12. Last updated: 2014-02-26. Bibliographically approved.

Open Access in DiVA

fulltext (PDF, 3507 kB)

Search in DiVA

By author/editor: Karunaratne, Thashmee M.
By organisation: Department of Computer and Systems Sciences
