In the present age of genome sequencing, a vast number of predicted genes are initially known only by their putative nucleotide sequence. The newly established field of bioinformatics is concerned with the computational prediction of structural and functional properties of genes and the proteins they encode, based on their nucleotide and amino acid sequences.
Since one of the crucial properties of a protein is its subcellular location, prediction of protein sorting is an important question in bioinformatics. A fundamental distinction in protein sorting is that between secretory and non-secretory proteins, determined by a cleavable N-terminal sorting signal, the secretory signal peptide.
The main part of this thesis, including four of the six papers, concerns prediction of secretory signal peptides in both eukaryotic and bacterial data using two machine learning techniques: artificial neural networks and hidden Markov models. A central result is the SignalP prediction method, which has been made available as a World Wide Web server and is very widely used.
Two additional prediction methods are also included, with one paper each. ChloroP predicts chloroplast transit peptides, another cleavable N-terminal sorting signal; while NetStart predicts start codons in eukaryotic genes. For prediction of all N-terminal signals, the assignment of correct start codon can be critical, which is why prediction of translation initiation from the nucleotide sequence is also important for protein sorting prediction.
This thesis comprises a detailed review of the molecular biology of protein secretion, a short introduction to the most important machine learning algorithms in bioinformatics, and a critical review of existing methods for protein sorting prediction. In addition, it contains general treatment of the principles of data set construction and performance evaluation for prediction methods in bioinformatics.
Stockholm: Stockholm University , 1999. , 96 p.