  • Print publication year: 2009
  • Online publication date: February 2011

5 - Secondary structure prediction with learning methods (nearest neighbors)



Nearest neighbor searching


Binary search trees

Learning methods are, in general, methods that adapt their parameters using historical data. Nearest neighbors (NN) is an extreme case in this direction: all the training data are stored, and the most suitable part of them is used for each prediction. Unless the training data are contradictory, whenever NN is queried with the same information as in the training set, it returns the same answer as in the training set. In this sense it is a perfect method, as it reproduces the training data exactly.
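This exact-recall property can be illustrated with a minimal 1-NN classifier in plain Python (the data points and labels below are made up for illustration): the training set is stored verbatim, and querying with a stored point returns that point's own label.

```python
import math

# Toy 1-NN classifier: store all training points and predict the
# label of the closest stored point. Data and labels are hypothetical.
train_X = [(1.0, 2.0), (3.0, 1.0), (5.0, 4.0)]
train_y = ["helix", "sheet", "coil"]

def nn_predict(x):
    # Sequential scan over all stored training points.
    dists = [math.dist(x, p) for p in train_X]
    return train_y[dists.index(min(dists))]

# Querying with a training point has distance zero to itself,
# so the stored answer is reproduced exactly.
print(nn_predict((3.0, 1.0)))  # sheet
```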

Generalities of nearest neighbor methods (Figure 5.1)

(i) NN methods extract information from close neighbors, which means that a distance function has to be defined between the data points (to determine what is close and what is distant). The data columns may be in very different units, for example kg, years, mm, US$, etc., so the data have to be normalized before distances are computed.
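One common choice (a sketch with made-up columns, not a prescription from the text) is to standardize each column to zero mean and unit variance before taking Euclidean distances, so that a column measured in US$ does not dominate one measured in years:

```python
import math

# Hypothetical rows with columns in different units (kg, years, US$).
data = [
    [70.0, 25.0, 30000.0],
    [80.0, 40.0, 55000.0],
    [60.0, 33.0, 42000.0],
]

def normalize(rows):
    # Standardize each column: subtract its mean, divide by its std dev.
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in rows]

norm = normalize(data)
# After normalization every column has mean 0 and unit variance,
# so Euclidean distances weight all columns comparably.
```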

(ii) Finding neighbors for small sets of training data (e.g. m < 1000) is best done with a sequential search of all the data. So our interest here is in problems where m is very large and we have to compute neighbors many times.
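A sequential (brute-force) scan, sketched below on hypothetical points, costs O(m) distance computations per query. This is perfectly adequate for small m, but it is the cost that motivates tree-based search structures when m is very large and many queries must be answered:

```python
import math

def nearest_sequential(points, query, k=1):
    # Brute force: compute the distance to every stored point (O(m))
    # and return the k closest.
    ranked = sorted(points, key=lambda p: math.dist(p, query))
    return ranked[:k]

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (5.0, 5.0)]
print(nearest_sequential(points, (1.1, 0.9), k=2))
# [(1.0, 1.0), (2.0, 0.5)]
```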

(iii) As opposed to best basis, where extra variables were not obviously harmful (in some cases random columns were even useful), NN deteriorates if we use data columns which are not related to our problem.

(iv) NN methods have a close relationship with clustering, for which our main algorithm for NN can also be used.
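One way to see this connection (a generic sketch, not the chapter's algorithm): the assignment step of k-means clustering is itself a nearest-neighbor search, in which each data point is assigned to its nearest centroid.

```python
import math

def assign_clusters(points, centroids):
    # Each point gets the index of its nearest centroid: this is
    # exactly a nearest-neighbor query against the centroid set.
    labels = []
    for p in points:
        dists = [math.dist(p, c) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels

points = [(0.1, 0.0), (0.2, 0.1), (4.9, 5.1), (5.0, 4.8)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(assign_clusters(points, centroids))  # [0, 0, 1, 1]
```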