Large-scale structure prediction by improved contact predictions and model quality assessment
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. Stockholm University, Science for Life Laboratory (SciLifeLab).
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. Stockholm University, Science for Life Laboratory (SciLifeLab). ORCID iD: 0000-0003-3534-2986
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. Stockholm University, Science for Life Laboratory (SciLifeLab). ORCID iD: 0000-0003-2232-3006
Stockholm University, Faculty of Science, Department of Biochemistry and Biophysics. Stockholm University, Science for Life Laboratory (SciLifeLab). ORCID iD: 0000-0002-7115-9751
2017 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 33, no 14, p. 123-129. Article in journal (Refereed). Published
Abstract [en]

Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that reliably can be identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Out of these, 415 have not been reported before.

Availability: Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net. All programs used here are freely available.

Place, publisher, year, edition, pages
2017. Vol. 33, no 14, p. 123-129
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-141945
DOI: 10.1093/bioinformatics/btx239
ISI: 000405289100005
OAI: oai:DiVA.org:su-141945
DiVA, id: diva2:1089815
Available from: 2017-04-21 Created: 2017-04-21 Last updated: 2022-03-23 Bibliographically approved
In thesis
1. From Sequence to Structure: Using predicted residue contacts to facilitate template-free protein structure prediction
2017 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Despite the fundamental role of experimental protein structure determination, computational methods are essential to bridge the ever-growing gap between available protein sequence and structure data. Common structure prediction methods rely on experimental data, which is not available for about half of the known protein families.

Recent advancements in amino acid contact prediction have revolutionized the field of protein structure prediction. Contacts can be used to guide template-free structure predictions that do not rely on experimentally solved structures of homologous proteins. Such methods are now able to produce accurate models for a wide range of protein families.

We developed PconsC2, an approach that improved on existing contact prediction methods by recognizing intra-molecular contact patterns and reducing noise. An inherent problem of contact prediction based on maximum entropy models is that large alignments with over 1000 effective sequences are needed to infer contacts accurately. These are, however, not available for more than 80% of all protein families that do not have a representative structure in PDB. With PconsC3, we could extend the applicability of contact prediction to families as small as 100 effective sequences by combining global inference methods with machine learning based on local pairwise measures.
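The abstract does not define how effective sequences are counted. As a rough illustration only, one commonly used scheme down-weights each aligned sequence by the number of near-identical sequences in the alignment and sums the weights. The sketch below assumes that scheme; the function names, the 0.9 identity threshold and the treatment of gap characters as ordinary symbols are illustrative assumptions, not details taken from the thesis.

```python
from typing import List


def sequence_identity(a: str, b: str) -> float:
    """Fraction of aligned columns in which two equal-length rows agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def effective_sequences(alignment: List[str], threshold: float = 0.9) -> float:
    """Sum of weights 1/n_i over all rows of the alignment.

    n_i is the number of rows (including row i itself) whose identity to
    row i is at least `threshold`. The 0.9 default is an assumption made
    for this example, not a value given in the source.
    """
    total = 0.0
    for row in alignment:
        neighbours = sum(
            1 for other in alignment
            if sequence_identity(row, other) >= threshold
        )
        total += 1.0 / neighbours
    return total


if __name__ == "__main__":
    # Two identical rows share one unit of weight; the third counts fully.
    toy_alignment = ["ACDE", "ACDE", "WYWY"]
    print(effective_sequences(toy_alignment))
```

Under this scheme a family with many redundant sequences can have far fewer effective sequences than raw sequences, which is why thresholds such as 100 or 1000 are stated in effective rather than raw counts. The pairwise loop is quadratic in the number of rows and is meant for illustration, not for large alignments.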

By introducing PconsFold, a pipeline for contact-based structure prediction, we could show that improvements in contact prediction accuracy translate to more accurate models. Finally, we applied a similar technique to Pfam, a comprehensive database of known protein families. In addition to using a faster folding protocol, we employed model quality assessment methods, which are crucial for estimating the confidence in the accuracy of predicted models. We propose models to be accurate for 558 families that do not have a representative known structure. Out of those, over 75% have not been reported before.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2017
Keywords
protein bioinformatics, protein structure prediction, contact prediction, machine learning
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-141946
ISBN: 978-91-7649-811-8
ISBN: 978-91-7649-812-5
Public defence
2017-06-09, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:00 (English)
Note

At the time of the doctoral defense, the following papers were unpublished and had a status as follows: Paper 2: Submitted. Paper 4: In press.

 

Available from: 2017-05-17 Created: 2017-04-25 Last updated: 2022-02-28 Bibliographically approved
2. Structured Learning for Structural Bioinformatics: Applications of Deep Learning to Protein Structure Prediction
2019 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Proteins are the basic molecular machines of the cell, performing a broad range of tasks, from structural support to catalysis of chemical reactions. Their function is determined by their 3D structure, which in turn is dictated by the order of their components, the amino acids.

This thesis is dedicated to applications of machine learning to the problems of contact prediction, ab-initio structure prediction, and model quality assessment. In particular, my research has been focused on developing methods that are both effective and easy to use.

In the first paper, we improved the already state-of-the-art model quality assessment (MQA) program ProQ3 by replacing the underlying machine learning algorithm, a support vector machine (SVM), with deep learning; the resulting method is named ProQ3D. The correlation between predicted and true scores was improved from 0.85 to 0.90, using the same training data and features.

The second paper joined several programs into a single pipeline for ab-initio structure prediction: contact prediction, folding, and model selection. We attempted to predict the structures of all 6379 Pfam families with unknown structure, of which we believe 558 to be accurate. Of these, 415 had not been reported before.

The third paper uses advances in machine learning to build a contact predictor, PconsC4, that is fast and easy to deploy in large-scale studies, since it requires only a single multiple sequence alignment (MSA) and has no external dependencies. The predictions are state-of-the-art, yielding a 12% improvement in precision over PconsC3 while running 244 times faster.

With ProQ4, in the fourth paper, we introduce a novel way of training deep networks for MQA that minimises the bias of the training data and emphasises model ranking, and we demonstrate its viability with a minimal description of the protein. The ranking correlation was improved with respect to ProQ3D from 0.82 to 0.90.
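The summaries of the first and fourth papers quote a correlation between predicted and true scores and a ranking correlation, without saying which coefficients were used. As a hedged illustration, the sketch below assumes Pearson correlation for the former and Spearman rank correlation for the latter; the function names and the toy scores are invented for the example and are not data from the thesis.

```python
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical quality scores for four models of one protein.
    predicted = [0.1, 0.4, 0.35, 0.8]
    true = [0.2, 0.5, 0.3, 0.9]
    print(pearson(predicted, true), spearman(predicted, true))
```

The distinction matters for model selection: a predictor can order the models of a protein perfectly, giving a rank correlation of 1, while its absolute scores are still offset from the true values, giving a lower Pearson correlation.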

Lastly, in the fifth paper, we show the results of ProQ3D and ProQ4 in a completely blind test: CASP13.

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2019. p. 63
National Category
Bioinformatics and Computational Biology
Research subject
Biochemistry towards Bioinformatics
Identifiers
URN: urn:nbn:se:su:diva-172395
ISBN: 978-91-7797-797-1
ISBN: 978-91-7797-798-8
Public defence
2019-10-11, Magnélisalen, Kemiska övningslaboratoriet, Svante Arrhenius väg 16 B, Stockholm, 13:00 (English)
Note

At the time of the doctoral defense, the following paper was unpublished and had a status as follows: Paper 4: Manuscript.

Available from: 2019-09-18 Created: 2019-08-28 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text

Authority records

Michel, Mirco; Menéndez Hurtado, David; Uziela, Karolis; Elofsson, Arne
