Alignments, Algorithms and Statistics


Alignments

Global
  • compares the entire length of 2 sequences, alignment must begin at the beginning and extend to the end of each sequence
  • global similarity scores can be calculated with or without gap penalties at the ends of sequences
  • don't know how to do statistics on global similarity scores
  • most appropriate for building evolutionary trees from homologous sequences
  • hard to detect mosaic and other protein domains

    Local

  • compares a portion of 2 sequences, ignores regions of dissimilarity
  • identifies the most similar region of 2 sequences
  • can never have a negative local similarity score
  • can do statistics on local similarity scores (Karlin-Altschul)
  • usually most appropriate for searching protein and DNA databases
  • can find similar domains in non-homologous proteins
  • can be used to find internal duplications
  • can be used to find exons in genomic DNA from

    Pattern-based (blocks, motifs, profiles), rather than similarity-based (sequence), comparison methods may be preferred when searching for functionally conserved non-homologous domains.

    Classes of Algorithms

    Rigorous - Guarantee to calculate an optimal similarity score
    Require O(MN) number of comparison operations, where
    M = length of query sequence
    N = number of amino acids in the sequence library

    Examples:
    Needleman-Wunsch - global alignment
    Smith-Waterman (SSEARCH, BLITZ, BLITZ) - local alignment

    Heuristic - Much faster becuase they examine only a portion of the potential alignments between 2 sequences, does not quarantee to calculate an optimal similarity score

    Examples:
    FASTA - local alignment, 20x faster than Smith-Waterman
    BLAST - local alignment, 100x faster than Smith-Waterman

    Substition/Scoring Matrices

    The PAM matrix (Point Accepted Mutations) is based on the frequency with which each amino acid will mutate into another over a short evolutionary distance 1% of the time (99% identical). Rare mutations have large negative values and conservative changes have positive values. A PAM250 matrix reflects the substitution frequency for proteins that have changed (diverged) 250% (about 15% identical).

    The BLOSUM matix is calculated from "blocks" of aligned sequences that differ by no more than X%. A BLOSUM62 matrix is derived from blocks of aligned sequences that are at least 62% identical. BLASTP uses the BLOSUM62 matrix and FASTA uses BLOSUM50.

    Matrix values are log-odds scores:

            q        = observed replacement frequency for sequence i to 
    log      ij        sequence j
       b   ----- 
            p        = expected replacement frequence for sequence i to
             ij        sequence j based on residue composition alone
    

    Local Alignment Statistics

    Local alignments are produced by dynamic programming algorithms that first calculate the optimal score between residues and then calculate the three possible ways to extend the alignment. The way that gives the optimal score is choosen.
    1. extend the alignment by one residue in each sequence
    2. extend the alignment by one residue in the first sequence and align it with a gap in the second
    3. extend the alignment by one residue in the second sequence and align it with a gap in the first

    The maximum of many independant random variables (an optimal local alignment) follows an extreme value distribution. Given that the choice of residues at any position in the sequence is random (independent and identically distributed) and the expected score when replacing one residue for another is negative, Karlin-Altschul statistics can be applied.

                               -lambdaS
                       E = KMNe
    
    E = number of matches expected by chance alone scoring S or higher
    M = query sequence length
    N = database sequence length
    S = alignment score of the match = maximal segment pair score (MSP)
    K = dimensionless quantity representing the relative independence of
        each position in a sequence or scoring matrix (ie., 0 < K <= 1) 
    lambda = a scaling factor (information/unit score)
    

    When you combine lambda and S you get a normalized score with dimensions of information (bits). This allows you to compare scores produced by different alignments statistically.

    Edge Effects

    If the alignment begins (and extends) past the ends of the sequences, you can not get a statistically significant result. For non-infinite sequences use a modified equation:
                                   -lambdaS
                         E = KM'N'e
    
    M' = M - L
    N' = N - L
    L = expected length of MSP
    
          lambdaS
    H =  --------    = relative entropy of the observed and expected frequencies
            L
    

    The higher H, the more information is in a short alignment. If H is low, you need a longer alignment to get same amount of information. H decreases with increasing PAM distance, while L (critical length) increases with increasing PAM distance.

    Poisson statistics were later incorporated into BLAST to statistically combine multiple high-scoring segment pairs (HSP). HSPs replace the single MSP in Karlin-Altschul statistics.

    "Sum" statistics were most recently incorporated into BLAST to calculate scores for gapped alignments.

    Smith-Waterman algorithm (SSEARCH, BLITZ, BLAZE)

    Calculates an optimal smilarity score given 2 sequences, a socring matrix and gap penalties. The optimal score may be unique, but there may be several alignments that produce that give the same score. There are no limits on the number of gaps that can be inserted.

    FASTA

    Looks at regions where there are either pairs (ktup = 2) or single aligned (ktup = 1) identities. Sensitivity in version 2.0 of FASTA was improved by calculating an "optimized" score for a majority of the sequences in the library search and selectivity was improved by correcting similarity scores for the expected increase due to library sequence length.

    ktup parameter determines the speed and sensisty of the search: ktup=2 is about 4 times faster than ktup = 1, but not as sensitive.

    FASTA Expectation values = odds of getting a score by chance E()-value of 10e-20 = odds of getting this score by chance E()-value of .02 = 98 times out of 100 of getting this score by chance

    The expectation value is the expected number of times that a sequence would obtain a Z-score as high or higher. Z-scores match the extreme value distribution very closely.

    Steps in FASTA:

    1. Identify regions shared by the 2 sequences with the highest density of identities (ktup = 1) or pairs of identities (ktup = 2).
    2. Rescan the 10 regions with the highest density of identities using the BLOSUM50 matrix. Trim the ends of the region to include only those residues contributing to the highest score. Each region is a partial alignment without gaps.
    3. If there are several initial regions with scores greater than a threshold, check to see it the trimmed initial regions can be joined to form an approximate alignment with gaps. Calculate a similarity score that is the sum of the joined initial regions minus a penalty (usually 20) for each gap.
    4. For sequences with scores greater than a threshold, construct an optimal local alignment of the query sequence and the library sequence, using only those residues that lie in a band centered on the best initial region found in Step 2. For protein serches with kup=2, a 16 residue band is used by default. A 32 residue band is used with kup=1. This is the optimized (opt) score.
    5. After all (or the first 10-20,000) scores have been calculated, the raw similarity scores are normalized against ln (library-sequence length) creating Z-values which are used to rank the library sequences.
    6. Smith-Waterman algorithm (without limitation on gap size) is used to produce the final alignments.

    BLAST

    Looks for triples (ktup = 3) identities and conserved replacements. Accurately estimates for the statistical significancee of similarity scores. Uses a word-based scanning procedure to identify regions of local similarity without gaps.

  • BLASTP values always < 1 because its a probability
  • BLASTP is the most widely used program for rapid sequence comparison.
  • BLASTP combines good sensitivity with exceptional selectivity.

    P-values (BLASTP) are roughly the same as E()-values (FASTA and SSEARCH) when E < 0.1.

    Steps in BLASTP:

    1. For every 3 amino acids in the query sequence, identify all substitutions of each word that have a similarity score greater than a threshold score T = 11.
    2. Build a "table" to recognize the list of identical and substituted 3 letters words.
    3. Use the table to identify all the matching words in sequences in the database. If a match is found, attempt to extend the match forwards and backwards using the BLOSUM62 matrix to produce a score that is higher than a threshold score. Save all high scoring regions shared by the query sequence and each library sequence. The best of these scores is reported to be the best single MSP score. These high scoring regions do not contain gaps.
    4. Attempt to combine multiple MSP regions. For each "consistent" combination, calculate the probability of obtaining the match using either "poisson" or "sum" statistics. Report the lowest probability score.
    5. Report all the significant alignments. Frequently a query and library sequence will contain multiple MSPs because of the requirement that they do not contain gaps.

    Tips on which algorithm and scoring matrix to use

    No statistical difference between results of Smith-Waterman, BLAST and FASTA at ktup = 1. If FASTA is not optimized, BLAST is better.

    BLAST guaranteed not to have any false positives. Noise ratio raises with FASTA and Smith-Waterman. If you improve the score (reduce) for non-homologous sequences, you reduce the signal to noise ratio.

    BLOSUM50/BLOSUM55 are best scoring matrices with gap penalities of (-12, -2; -14, -2).

    FASTA uses default gap penalty of -12, -2 (open gap, extend gap). If obtain too many unrelated high scoring sequences or when partial sequences are compared (ESTs) use a higher gap penalty (-16, -4).

    If the query sequence does not include a region of low complexity, FASTA E()-values and BLASTP P()-values < 0.02 indicate homology.

    For alignments without gaps, as the library sequence length increases the simiarity scores for unrelated sequences also increase. Similarity scores are therefore normalized for library sequence length in order to detect more distant relationships.

    For alignments with gaps, low gap penalities cause similarity scores to lose their selectivity. Use randomly shuffled sequences to test whether similarity scores are reasonable. Since a sequence should not have significant similarity with a random sequence, the expectation values for the highest scoring library sequence search with a random query sequence should be between 0.2-2.0.


    Compiled from lectures by Warren Gish and William Pearson, Cold Spring Harbor Laboratory Course - Computational Genomics, Oct 31-Nov 5, 1996

    References

    1. Altschul, Stephen, F., Mark S. Boguski, Warren Gish and John C. Wootton. (1994). Issues in searching molecular sequence databases. Nature Genetics 6:119-129.
    2. Altschul, Stephen F. and Warren Gish. (1996). Local Alignment Statistics. Meth. Enz. 266:460-480.
    3. Gish, Warren. (1995). BLASTP 2.0a2MP-WashU manual pages.
    4. Pearson, William R. (1996). Protein Sequence Comparison and Protein Evolution. In preparation.