Local
Pattern-based (blocks, motifs, profiles), rather than similarity-based (sequence), comparison methods may be preferred when searching for functionally conserved non-homologous domains.
Require O(MN) number of comparison operations, where
M = length of query sequence
N = number of amino acids in the sequence library
Examples:
Needleman-Wunsch - global alignment
Smith-Waterman (SSEARCH, BLITZ) - local alignment
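The O(MN) cost comes from filling one dynamic-programming cell for every pair of positions in the two sequences. A minimal sketch of Smith-Waterman local scoring follows. The match, mismatch and gap values are illustrative assumptions, not the parameters used by SSEARCH, which scores with a substitution matrix and separate gap-open and gap-extend penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Scoring values are hypothetical; a linear gap cost is assumed
    to keep the sketch short.
    """
    best = 0
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row, column 0 stays 0
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                     # start a new local alignment
                prev[j - 1] + step,    # align a[i-1] with b[j-1]
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

The two nested loops visit len(a) x len(b) cells, which is where the M x N comparison count comes from. Flooring each cell at zero is what makes the alignment local rather than global.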
Heuristic - Much faster because they examine only a portion of the potential alignments between two sequences, but they are not guaranteed to find the optimal similarity score
Examples:
FASTA - local alignment, 20x faster than Smith-Waterman
BLAST - local alignment, 100x faster than Smith-Waterman
The BLOSUM matrix is calculated from "blocks" of aligned sequences in which sequences more than X% identical are clustered and counted as a single sequence. A BLOSUM62 matrix is therefore derived from blocks clustered at 62% identity. BLASTP uses the BLOSUM62 matrix and FASTA uses BLOSUM50.
Matrix values are log-odds scores:
$$ s_{ij} = \log_b \frac{q_{ij}}{p_{ij}} $$
q_ij = observed replacement frequency of residue i by residue j
p_ij = expected replacement frequency of residue i by residue j based on residue composition alone
b = base of the logarithm
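The log-odds definition above can be evaluated directly. The sketch below is only an illustration: the function name, the base-2 default and the frequencies in the usage note are assumptions, and it leaves out the scaling and rounding used when a matrix is published.

```python
import math

def log_odds_score(q_ij, p_ij, base=2.0):
    """Log-odds score for one residue pair.

    q_ij: observed replacement frequency of the pair
    p_ij: frequency expected from residue composition alone
    base: logarithm base (2 gives a score in bits; assumed here)
    """
    if q_ij <= 0 or p_ij <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(q_ij / p_ij, base)
```

A pair seen twice as often as expected, for example `log_odds_score(0.02, 0.01)`, scores positive. A pair seen half as often as expected scores negative, and a pair seen exactly as often as expected scores zero.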
The maximum of many independent random variables (an optimal local alignment) follows an extreme value distribution. Given that the choice of residues at any position in the sequence is random (independent and identically distributed) and the expected score when replacing one residue for another is negative, Karlin-Altschul statistics can be applied.
$$ E = K M N e^{-\lambda S} $$
E = number of matches expected by chance alone scoring S or higher
M = query sequence length
N = database sequence length
S = alignment score of the match = maximal segment pair score (MSP)
K = dimensionless quantity representing the relative independence of
each position in a sequence or scoring matrix (i.e., 0 < K <= 1)
lambda = a scaling factor (information/unit score)
When you combine lambda and S you get a normalized score with dimensions of information (bits). This allows you to compare scores produced by different alignments statistically.
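As a rough sketch, the Karlin-Altschul expectation and the normalized score can be computed as below. The bit-score formula (lambda*S - ln K) / ln 2 is the standard definition rather than something stated in these notes. The lambda, K and length values in the usage note are illustrative assumptions only.

```python
import math

def expect_value(score, m, n, lam, k):
    """E = K * M * N * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)

def bit_score(score, lam, k):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * score - math.log(k)) / math.log(2)
```

For example, `bit_score(100, 0.318, 0.134)` with hypothetical parameters gives a score that can be compared across scoring systems, because the expectation can be rewritten as E = M * N * 2^(-bits), which no longer depends on lambda or K.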
$$ E = K M' N' e^{-\lambda S} $$
M' = M - L
N' = N - L
L = expected length of MSP
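A minimal sketch of the length-corrected expectation, assuming the expected MSP length L is already known. The function and argument names are hypothetical.

```python
import math

def corrected_expect_value(score, m, n, expected_len, lam, k):
    """E = K * M' * N' * exp(-lambda * S), with M' = M - L and N' = N - L."""
    m_eff = m - expected_len   # M' = M - L
    n_eff = n - expected_len   # N' = N - L
    if m_eff <= 0 or n_eff <= 0:
        raise ValueError("expected MSP length must be shorter than both sequences")
    return k * m_eff * n_eff * math.exp(-lam * score)
```

Because M' and N' are smaller than M and N, the corrected E is always lower than the uncorrected one for the same score. The effect is largest when the query is short relative to L.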
$$ H = \frac{\lambda S}{L} $$
H = relative entropy of the observed and expected frequencies
The higher H is, the more information a short alignment carries. If H is low, you need a longer alignment to get the same amount of information. H decreases with increasing PAM distance, while L (critical length) increases with increasing PAM distance.
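Rearranging the relation between H, lambda, S and L gives L = lambda*S / H, which makes the trade-off concrete. The sketch below only rearranges that formula. The numbers in the usage note are assumptions chosen for illustration.

```python
def expected_alignment_length(score, lam, h):
    """Length L needed to reach score S, from H = lambda * S / L."""
    if h <= 0:
        raise ValueError("relative entropy must be positive")
    return lam * score / h
```

With the same lambda and S, halving H doubles the required length: compare `expected_alignment_length(100, 0.3, 0.5)` with `expected_alignment_length(100, 0.3, 0.25)`.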
Poisson statistics were later incorporated into BLAST to statistically combine multiple high-scoring segment pairs (HSP). HSPs replace the single MSP in Karlin-Altschul statistics.
"Sum" statistics were most recently incorporated into BLAST to calculate scores for gapped alignments.
The ktup parameter determines the speed and sensitivity of the search: ktup = 2 is about 4 times faster than ktup = 1, but not as sensitive.
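The trade-off can be seen in a simplified word-lookup sketch. It only counts shared words of length ktup per diagonal and is not the full FASTA procedure. The function names are hypothetical.

```python
from collections import defaultdict

def build_word_index(seq, ktup):
    """Map every word of length ktup in seq to its start positions."""
    index = defaultdict(list)
    for pos in range(len(seq) - ktup + 1):
        index[seq[pos:pos + ktup]].append(pos)
    return index

def diagonal_hits(query, library_seq, ktup):
    """Count shared ktup words on each diagonal (library pos - query pos)."""
    index = build_word_index(query, ktup)
    hits = defaultdict(int)
    for lib_pos in range(len(library_seq) - ktup + 1):
        word = library_seq[lib_pos:lib_pos + ktup]
        for q_pos in index.get(word, ()):
            hits[lib_pos - q_pos] += 1
    return dict(hits)
```

With ktup = 2, a single mismatch removes the two words that overlap it, so scattered identities go unseen. With ktup = 1, every identical residue counts, which is more sensitive but produces many more hits to process.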
FASTA expectation values give the number of times a score this high is expected by chance. An E()-value of 10e-20 means the score is essentially never obtained by chance. An E()-value of .02 means that 98 times out of 100 a score this high would not be obtained by chance.
The expectation value is the expected number of times that a sequence would obtain a Z-score as high or higher. Z-scores match the extreme value distribution very closely.
Steps in FASTA:
P-values (BLASTP) are roughly the same as E()-values (FASTA and SSEARCH) when E < 0.1.
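The agreement at small E follows from the Poisson relation P = 1 - e^(-E), a standard result that is assumed here rather than stated in these notes. A short sketch:

```python
import math

def p_value_from_e(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)
```

For small E the exponential is nearly linear, so P and E are almost identical. As E grows past 1, P saturates toward 1 while E keeps increasing, which is why the two are interchangeable only for low values.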
Steps in BLASTP:
BLAST is tuned to avoid false positives. The noise level rises with FASTA and Smith-Waterman. If you reduce the scores of non-homologous sequences, you reduce the noise and so improve the signal-to-noise ratio.
BLOSUM50/BLOSUM55 are the best scoring matrices with gap penalties of (-12, -2; -14, -2).
FASTA uses a default gap penalty of -12, -2 (open gap, extend gap). If you obtain too many unrelated high-scoring sequences, or when partial sequences (ESTs) are compared, use a higher gap penalty (-16, -4).
If the query sequence does not include a region of low complexity, FASTA E()-values and BLASTP P()-values < 0.02 indicate homology.
For alignments without gaps, as the library sequence length increases, the similarity scores for unrelated sequences also increase. Similarity scores are therefore normalized for library sequence length in order to detect more distant relationships.
For alignments with gaps, low gap penalties cause similarity scores to lose their selectivity. Use randomly shuffled sequences to test whether similarity scores are reasonable. Since a sequence should not have significant similarity with a random sequence, the expectation value for the highest-scoring library sequence in a search with a random query sequence should be between 0.2 and 2.0.
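A minimal sketch of the shuffling control described above. The search itself is run with an external program and is not shown. The helper names and the example sequence are hypothetical.

```python
import random

def shuffled_copy(seq, seed=None):
    """Return seq with its residues in random order (composition preserved)."""
    rng = random.Random(seed)   # seeded generator for repeatable controls
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def best_e_value_is_reasonable(best_e_value, low=0.2, high=2.0):
    """Check the top hit of a shuffled-query search against the 0.2-2.0 range."""
    return low <= best_e_value <= high
```

Shuffling keeps the length and residue composition of the query, so any significant hit for `shuffled_copy("MKTAYIAKQR", seed=1)` reflects the scoring parameters rather than homology. A best E()-value far below 0.2 for such a query suggests the gap penalties are too low.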
Compiled from lectures by Warren Gish and William Pearson, Cold Spring Harbor Laboratory Course - Computational Genomics, Oct 31-Nov 5, 1996
References