Publications (9 of 9)
Shenoy, A., Kalakoti, Y., Sundar, D. & Elofsson, A. (2024). M-Ionic: prediction of metal-ion-binding sites from sequence using residue embeddings. Bioinformatics, 40(1), Article ID btad782.
2024 (English). Part of: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 40, no. 1, article id btad782. Article in journal (Refereed). Published
Abstract [en]

Motivation

Understanding metal–protein interaction can provide structural and functional insights into cellular processes. As the number of protein sequences increases, developing fast yet precise computational approaches to predict and annotate metal-binding sites becomes imperative. Quick and resource-efficient pre-trained protein language model (pLM) embeddings have successfully predicted binding sites from protein sequences despite not using structural or evolutionary features (multiple sequence alignments). Using residue-level embeddings from the pLMs, we have developed a sequence-based method (M-Ionic) to identify metal-binding proteins and predict residues involved in metal binding.

Results

On independent validation of recent proteins, M-Ionic reports an area under the receiver operating characteristic curve (AUROC) of 0.83 (recall = 84.6%) in distinguishing metal-binding from non-binding proteins, compared with an AUROC of 0.74 (recall = 61.8%) for the next-best method. In addition to performing comparably to the state-of-the-art method for identifying metal-binding residues (Ca²⁺, Mg²⁺, Mn²⁺, Zn²⁺), M-Ionic provides binding probabilities for six additional ions (i.e. Cu²⁺, PO₄³⁻, SO₄²⁻, Fe²⁺, Fe³⁺, Co²⁺). We show that the pLM embedding of a single residue contains sufficient information about its neighbours to predict its binding properties.

National Category
Bioinformatics and Computational Biology; Biochemistry; Molecular Biology
Identifiers
urn:nbn:se:su:diva-226508 (URN); 10.1093/bioinformatics/btad782 (DOI); 001148521100004 (); 38175787 (PubMedID); 2-s2.0-85182781206 (Scopus ID)
Available from: 2024-02-19. Created: 2024-02-19. Last updated: 2025-02-20. Bibliographically approved
Shenoy, A. (2024). Unlocking protein sequences: Advances in protein structure and ligand-binding site prediction. (Doctoral dissertation). Stockholm: Department of Biochemistry and Biophysics, Stockholm University
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The protein sequence determines how it will fold into its unique three-dimensional structure. Once folded, proteins perform their functions by interacting with other proteins or molecules called ligands within the cell. Experimental determination of protein structure and function is tedious. Computational approaches aim to accurately predict the properties of proteins to complement experimental efforts of understanding biochemical mechanisms within the cell. This thesis introduces computational techniques that predict the structure of protein complexes and identify protein residues involved in interactions with common biomolecules, such as metal ions and nucleic acids, based on sequence information. 

AlphaFold, a method that predicted protein structure using sequence information with almost experimental accuracy, was a critical breakthrough that shaped the field of protein structure prediction. Subsequently, approaches such as FoldDock adapted the AlphaFold pipeline for dimer complexes. Paper I applies the FoldDock protocol to understand toxin-antitoxin systems. These protein complexes are highly evolutionarily conserved, and high-confidence dimer predictions were generated. Paper II applies the FoldDock protocol to study protein-protein interactions in the human proteome. To verify the reliability of machine-learning-based computational methods, they must be tested on independent data different from the data used to train the method. Paper III involves generating and using a homology-reduced independent test set to benchmark the performance of protein complex structure predictors, including AlphaFold-Multimer, the recent AlphaFold release adapted for multi-chain proteins. A confidence score (pDockQ2) was proposed to estimate the quality of the interfaces within multimers. Paper I, Paper II and Paper III are associated with predicting and evaluating protein-protein interactions.

Representation learning involves finding effective representations of input data to maximise available information, making the data easier to understand and process for downstream prediction tasks. A recent advance in protein representation learning is protein language models (pLMs), where large language models are trained on a massive corpus of protein sequences. Highly contextualised and informative vector representations contained in the last hidden layer of the model have been used to predict numerous properties, such as ligand binding sites, subcellular localisation, and post-translational modifications, among others. Paper IV uses residue-level embeddings to predict whether a protein binds to one or more of the ten most common ions. It also predicts residue-level binding probabilities for multiple ions simultaneously. Paper V expands this approach beyond metals. It explores the impact of structure-informed features alongside sequence embeddings to predict whether a residue binds to nucleic acids, small molecules or metals. Paper IV and Paper V are associated with developing machine learning methods to predict and evaluate protein-ligand interactions.

In summary, the research conducted within this thesis offers valuable insights into three crucial levers to systematically harness the potential of machine learning for protein bioinformatics. These are (1) construction of homology-reduced non-redundant datasets, (2) finding optimal protein representations, and (3) rigorous evaluation and inference. 

Place, publisher, year, edition, pages
Stockholm: Department of Biochemistry and Biophysics, Stockholm University, 2024. p. 55
National Category
Bioinformatics (Computational Biology)
Research subject
Biochemistry with specialisation in bioinformatics
Identifiers
urn:nbn:se:su:diva-224344 (URN); 978-91-8014-613-5 (ISBN); 978-91-8014-614-2 (ISBN)
Public defence
2024-01-26, Air & Fire, SciLifeLab, Tomtebodavägen 23A, Solna, 09:00 (English)
Available from: 2024-01-02. Created: 2023-12-07. Last updated: 2023-12-20. Bibliographically approved
Zhu, W., Shenoy, A., Kundrotas, P. & Elofsson, A. (2023). Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics, 39(7), Article ID btad424.
2023 (English). Part of: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 39, no. 7, article id btad424. Article in journal (Refereed). Published
Abstract [en]

Motivation: Despite near-experimental accuracy on single-chain predictions, there is still scope for improvement in multimeric predictions. Methods like AlphaFold-Multimer and FoldDock can accurately model dimers. However, how well these methods fare on larger complexes is still unclear. Further, methods for evaluating the quality of multimeric complexes are not well established.

Results: We analysed the performance of AlphaFold-Multimer on a homology-reduced dataset of homo- and heteromeric protein complexes. We highlight the differences between the pairwise and multi-interface evaluation of chains within a multimer. We describe why certain complexes perform well on one metric (e.g. TM-score) but poorly on another (e.g. DockQ). We propose a new score, Predicted DockQ version 2 (pDockQ2), to estimate the quality of each interface in a multimer. Finally, we modelled protein complexes (from CORUM) and identified two highly confident structures that do not have sequence homology to any existing structures.

Availability and implementation: All scripts, models, and data used to perform the analysis in this study are freely available at https://gitlab.com/ElofssonLab/afm-benchmark.

National Category
Bioinformatics (Computational Biology); Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:su:diva-219972 (URN); 10.1093/bioinformatics/btad424 (DOI); 001030747300005 (); 2-s2.0-85166268973 (Scopus ID)
Funder
Vetenskapsrådet, 2021-03979; Knut och Alice Wallenbergs Stiftelse
Available from: 2023-08-10. Created: 2023-08-10. Last updated: 2025-02-05. Bibliographically approved
Ernits, K., Saha, C. K., Brodiazhenko, T., Chouhan, B., Shenoy, A., Buttress, J. A., . . . Atkinson, G. C. (2023). The structural basis of hyperpromiscuity in a core combinatorial network of type II toxin–antitoxin and related phage defense systems. Proceedings of the National Academy of Sciences of the United States of America, 120(33), Article ID e2305393120.
2023 (English). Part of: Proceedings of the National Academy of Sciences of the United States of America, ISSN 0027-8424, E-ISSN 1091-6490, Vol. 120, no. 33, article id e2305393120. Article in journal (Refereed). Published
Abstract [en]

Toxin-antitoxin (TA) systems are a large group of small genetic modules found in prokaryotes and their mobile genetic elements. Type II TAs are encoded as bicistronic (two-gene) operons that encode two proteins: a toxin and a neutralizing antitoxin. Using our tool NetFlax (standing for Network-FlaGs for toxins and antitoxins), we have performed a large-scale bioinformatic analysis of proteinaceous TAs, revealing interconnected clusters constituting a core network of TA-like gene pairs. To understand the structural basis of toxin neutralization by antitoxins, we have predicted the structures of 3,419 complexes with AlphaFold2. Together with mutagenesis and functional assays, our structural predictions provide insights into the neutralizing mechanism of the hyperpromiscuous Panacea antitoxin domain. In antitoxins composed of standalone Panacea, the domain mediates direct toxin neutralization, while in multidomain antitoxins the neutralization is mediated by other domains, such as PAD1, Phd-C, and ZFD. We hypothesize that Panacea acts as a sensor that regulates TA activation. We have experimentally validated 16 NetFlax TA systems and used domain annotations and metabolic labeling assays to predict their potential mechanisms of toxicity (such as membrane disruption, and inhibition of cell division or protein synthesis) as well as biological functions (such as antiphage defense). We have validated the antiphage activity of a RosmerTA system encoded by Gordonia phage Kita, and used fluorescence microscopy to confirm its predicted membrane-depolarizing activity. The interactive version of the NetFlax TA network that includes structural predictions can be accessed at http://netflax.webflags.se/.

Keywords
toxin, antitoxin, AlphaFold, phage, Panacea
National Category
Biochemistry; Molecular Biology
Identifiers
urn:nbn:se:su:diva-224333 (URN); 10.1073/pnas.2305393120 (DOI); 37556498 (PubMedID); 2-s2.0-85167528527 (Scopus ID)
Funder
Knut och Alice Wallenbergs Stiftelse, 2020.0037; Vetenskapsrådet, 2019-01085; Vetenskapsrådet, 2022-01603; Vetenskapsrådet, 2021-01146; Vetenskapsrådet, 2021-03979; Carl Tryggers stiftelse för vetenskaplig forskning, CTS19:24; Kempestiftelserna, SMK-2061.1; Cancerfonden, 20 0872 Pj; Crafoordska stiftelsen, 20220562; Ragnar Söderbergs stiftelse, M23/14
Available from: 2023-12-06. Created: 2023-12-06. Last updated: 2025-02-20. Bibliographically approved
Burke, D. F., Bryant, P., Barrio-Hernandez, I., Memon, D., Pozzati, G., Shenoy, A., . . . Elofsson, A. (2023). Towards a structurally resolved human protein interaction network. Nature Structural & Molecular Biology, 30(2), 216-225
2023 (English). Part of: Nature Structural & Molecular Biology, ISSN 1545-9993, E-ISSN 1545-9985, Vol. 30, no. 2, p. 216-225. Article in journal (Refereed). Published
Abstract [en]

Cellular functions are governed by molecular machines that assemble through protein-protein interactions. Their atomic details are critical to studying their molecular mechanisms. However, fewer than 5% of hundreds of thousands of human protein interactions have been structurally characterized. Here we test the potential and limitations of recent progress in deep-learning methods using AlphaFold2 to predict structures for 65,484 human protein interactions. We show that experiments can orthogonally confirm higher-confidence models. We identify 3,137 high-confidence models, of which 1,371 have no homology to a known structure. We identify interface residues harboring disease mutations, suggesting potential mechanisms for pathogenic variants. Groups of interface phosphorylation sites show patterns of co-regulation across conditions, suggestive of coordinated tuning of multiple protein interactions as signaling responses. Finally, we provide examples of how the predicted binary complexes can be used to build larger assemblies helping to expand our understanding of human cell biology.

National Category
Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:su:diva-215904 (URN); 10.1038/s41594-022-00910-8 (DOI); 000928325000001 (); 36690744 (PubMedID); 2-s2.0-85146676554 (Scopus ID)
Available from: 2023-03-29. Created: 2023-03-29. Last updated: 2025-02-07. Bibliographically approved
Quaglia, F., Mészáros, B., Salladini, E., Hatos, A., Pancsa, R., Chemes, L. B., . . . Piovesan, D. (2022). DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation. Nucleic Acids Research, 50(D1), D480-D487
2022 (English). Part of: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962, Vol. 50, no. D1, p. D480-D487. Article in journal (Refereed). Published
Abstract [en]

The Database of Intrinsically Disordered Proteins (DisProt, URL: https://disprot.org) is the major repository of manually curated annotations of intrinsically disordered proteins and regions from the literature. We report here recent updates of DisProt version 9, including a restyled web interface, refactored Intrinsically Disordered Proteins Ontology (IDPO), improvements in the curation process and significant content growth of around 30%. Higher quality and consistency of annotations is provided by a newly implemented reviewing process and training of curators. The increased curation capacity is fostered by the integration of DisProt with APICURON, a dedicated resource for the proper attribution and recognition of biocuration efforts. Better interoperability is provided through the adoption of the Minimum Information About Disorder (MIADE) standard, an active collaboration with the Gene Ontology (GO) and Evidence and Conclusion Ontology (ECO) consortia and the support of the ELIXIR infrastructure.

National Category
Biological Sciences
Identifiers
urn:nbn:se:su:diva-201891 (URN); 10.1093/nar/gkab1082 (DOI); 000743496700059 (); 34850135 (PubMedID); 2-s2.0-85125157608 (Scopus ID)
Available from: 2022-02-10. Created: 2022-02-10. Last updated: 2022-10-07. Bibliographically approved
Bryant, P., Pozzati, G., Zhu, W., Shenoy, A., Kundrotas, P. & Elofsson, A. (2022). Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. Nature Communications, 13(1), Article ID 6028.
2022 (English). Part of: Nature Communications, E-ISSN 2041-1723, Vol. 13, no. 1, article id 6028. Article in journal (Refereed). Published
Abstract [en]

AlphaFold can predict the structure of single- and multiple-chain proteins with very high accuracy. However, the accuracy decreases with the number of chains, and the available GPU memory limits the size of protein complexes which can be predicted. Here we show that one can predict the structure of large complexes starting from predictions of subcomponents. We assemble 91 out of 175 complexes with 10–30 chains from predicted subcomponents using Monte Carlo tree search, with a median TM-score of 0.51. There are 30 highly accurate complexes (TM-score ≥0.8, 33% of complete assemblies). We create a scoring function, mpDockQ, that can distinguish if assemblies are complete and predict their accuracy. We find that complexes containing symmetry are accurately assembled, while asymmetrical complexes remain challenging. The method is freely available and accessible as a Colab notebook: https://colab.research.google.com/github/patrickbryant1/MoLPC/blob/master/MoLPC.ipynb.

National Category
Biological Sciences
Identifiers
urn:nbn:se:su:diva-211010 (URN); 10.1038/s41467-022-33729-4 (DOI); 000867312100019 (); 36224222 (PubMedID); 2-s2.0-85139763194 (Scopus ID)
Available from: 2022-11-09. Created: 2022-11-09. Last updated: 2023-08-10. Bibliographically approved
Shenoy, A. & Elofsson, A. Impact of joint structure and sequence representations for ligand binding site prediction.
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Summary: Accurate ligand-binding site prediction provides insights into molecular interactions, drug discovery and design. Most computational methods can identify residues binding to a single ligand but cannot simultaneously predict binding sites for multiple ligands. Sequence-based methods that predict multiple ligands are fast but have poor performance. Structure-based methods primarily use protein surface properties to predict ligand binding sites. These methods are accurate but slow. We studied the impact of combining structure-informed representations with sequence embeddings to generate a quick yet accurate predictor. While the protein binding surface interacting with each ligand is unique, we find that structure-informed representations do not significantly improve prediction performance.

Availability and Implementation: Source code is available at https://github.com/aditishenoy/ligandbinding

National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-224343 (URN)
Funder
Knut och Alice Wallenbergs Stiftelse
Available from: 2023-12-07. Created: 2023-12-07. Last updated: 2023-12-07
Shenoy, A., Kalakoti, Y., Sundar, D. & Elofsson, A. M-Ionic: Prediction of metal ion binding sites from sequence using residue embeddings.
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Motivation: Understanding metal-protein interaction can provide structural and functional insights into cellular processes. As the number of protein sequences increases, developing fast yet precise computational approaches to predict and annotate metal-binding sites becomes imperative. Quick and resource-efficient pre-trained protein language model (PLM) embeddings have successfully predicted binding sites from protein sequences despite not using structural or evolutionary features (multiple sequence alignments). Using residue-level embeddings from the PLMs, we have developed a sequence-based method (M-Ionic) to identify metal-binding proteins and predict residues involved in metal binding.

Results: On independent validation of recent proteins, M-Ionic reports an area under the receiver operating characteristic curve (AUROC) of 0.83 (recall = 84.6%) in distinguishing metal-binding from non-binding proteins, compared with an AUROC of 0.74 (recall = 61.8%) for the next-best method. In addition to performing comparably to the state-of-the-art method for identifying metal-binding residues (Ca²⁺, Mg²⁺, Mn²⁺, Zn²⁺), M-Ionic provides binding probabilities for six additional ions (i.e., Cu²⁺, PO₄³⁻, SO₄²⁻, Fe²⁺, Fe³⁺, Co²⁺). We show that the PLM embedding of a single residue contains sufficient information about its neighbours to predict its binding properties.

Availability and Implementation: M-Ionic can be used on your protein of interest using a Google Colab Notebook (https://bit.ly/40FrRbK). The GitHub repository (https://github.com/TeamSundar/m-ionic) contains all code and data.

Keywords
Metal-binding, Protein, Protein Language Models
National Category
Bioinformatics (Computational Biology)
Identifiers
urn:nbn:se:su:diva-224342 (URN); 10.1101/2023.04.06.535847 (DOI)
Funder
Knut och Alice Wallenbergs Stiftelse
Available from: 2023-12-07. Created: 2023-12-07. Last updated: 2023-12-07
Identifiers
ORCID iD: orcid.org/0000-0001-7748-2501
