Saturday, June 21, 2008

Sequence Alignment

Sequence alignment is one of the crucial steps in deciding whether two genes/proteins are homologous. The two sequences are aligned from one end to the other and the number of identical, or similar, residues is counted. If this number reaches a significant percentage of the total length (usually >25%) then the two sequences are homologous—they descend from a common ancestor.
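To make the counting step concrete, here is a minimal sketch in Python. It assumes the two sequences are already aligned to the same length, with "-" marking gaps, and that the denominator is the full alignment length; other conventions exist. The function name and the toy sequences are invented for illustration and are not real HSP70 data.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as an identity only if the residues match and are not gaps.
    identical = sum(
        1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap
    )
    return 100.0 * identical / len(aligned_a)


# Toy sequences, invented for illustration: 6 of 9 columns are identical.
print(round(percent_identity("MKVLAAGIT", "MKILSAGVT"), 1))
```

A value computed this way says nothing by itself about how many gaps were needed to reach it, which is the problem discussed next.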

Sequence alignment is not straightforward, even for two sequences, because in addition to substitutions the genes might have undergone insertions or deletions (indels). In order to identify conserved residues, one needs to insert gaps in one sequence or the other to compensate for these indel events.

You can't just willy-nilly stick in gaps to maximize the number of aligned residues because the gaps represent true historical events (insertions and deletions). In theory, you can get high identity scores with any two sequences as long as you insert enough gaps but that isn't allowed. When the alignment is done by computer algorithm, each gap is associated with a gap penalty.

The determination of proper gap penalties is a major challenge in multiple sequence alignment. A crude estimate is that each gap comes with a penalty of 3; that is, you have to generate at least three identities in order to make the gap worthwhile. The number of gaps and the gap penalties have to be subtracted from the identity/similarity scores when deciding about homology. (This isn't always done.)
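The trade-off between identities and gaps can be sketched with a small global alignment routine in the style of Needleman-Wunsch. This is an illustration under simplifying assumptions, not the scoring used by any particular program: identities score +1, mismatches score 0, and the penalty is charged for every gapped position rather than once per gap event. The function name and the toy sequences are hypothetical.

```python
def global_align(a: str, b: str, gap_penalty: int = 3):
    """Global alignment with +1 per identity, 0 per mismatch and
    -gap_penalty per gapped position. Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap_penalty * i
    for j in range(1, m + 1):
        score[0][j] = -gap_penalty * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if a[i - 1] == b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,      # align two residues
                score[i - 1][j] - gap_penalty,    # gap in b
                score[i][j - 1] - gap_penalty,    # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        match = 1 if i > 0 and j > 0 and a[i - 1] == b[j - 1] else 0
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + match:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap_penalty:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


# With free gaps the program "finds" three identities by shifting the sequences.
print(global_align("ACGT", "CGTA", gap_penalty=0))
# With a penalty of 3 the two gaps cost more than the identities they buy,
# so the ungapped alignment (no identities) scores higher.
print(global_align("ACGT", "CGTA", gap_penalty=3))
```

The choice of 3 here simply follows the crude estimate above; real programs typically distinguish between opening a gap and extending it.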

Here's an example of a multiple sequence alignment from a region of bacterial HSP70 genes. The letters represent the amino acid residues and the dashes are gaps due to insertions and deletions.


The HSP70 genes are the most highly conserved genes in biology so, in principle, it should be easy to align them. In fact, it is easy in most regions but the one shown above is the most difficult. This is a manual alignment that takes into account the similarities of groups of sequences. Those that are most similar are clustered together and whenever possible the alignment is adjusted so that the positions of the gaps in the most closely related sequences are identical.

This is a procedure known as phylogenetic alignment but it would be better to call it similarity alignment because what we're actually doing is clustering sequences by their overall similarity and not their phylogeny. (The fact that their phylogenetic relatedness closely corresponds to their similarity is a consequence of the analysis and not a cause.) [1]

The placing of gaps in this region of HSP70 sequences is very difficult. No computer program can come close to achieving the quality of alignments that well-trained humans can achieve. That's because the overall alignment has to take into account a number of variables simultaneously and the progressive alignment takes many trial-and-error steps. As a general rule of thumb, if you see a paper where phylogenetic trees are constructed using computer-generated multiple sequence alignments only, then you should assign a low confidence value to that work.

Is this important? Indeed it is. The exact nature and position of the large gap in the above sequences, for example, plays an important role in testing the Three Domain Hypothesis. Different alignments give different trees and the most important variable is the position of gaps.

This brings me to an important paper just published in this week's issue of Science. Löytynoja and Goldman (2008) have developed a new algorithm for multiple sequence alignment. The abstract of their paper describes the problem, and their solution.
Genetic sequence alignment is the basis of many evolutionary and comparative studies, and errors in alignments lead to errors in the interpretation of evolutionary information in genomes. Traditional multiple sequence alignment methods disregard the phylogenetic implications of gap patterns that they create and infer systematically biased alignments with excess deletions and substitutions, too few insertions, and implausible insertion-deletion–event histories. We present a method that prevents these systematic errors by recognizing insertions and deletions as distinct evolutionary events. We show theoretically and practically that this improves the quality of sequence alignments and downstream analyses over a wide range of realistic alignment problems. These results suggest that insertions and sequence turnover are more common than is currently thought and challenge the conventional picture of sequence evolution and mechanisms of functional and structural changes.
The authors test their phylogeny-aware program (PRANK) against several other multiple sequence alignment programs (ClustalW, MAFFT, MUSCLE, and T-COFFEE) using a set of sequences that were "evolved" using a computer program that created substitutions and insertions/deletions. Since the true phylogeny of this artificial set is known, they were able to evaluate the performance of the various programs.
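The idea of a benchmark with a known answer can be sketched as follows. This is not the simulation program used in the paper; it is a toy model, with made-up event probabilities, that mutates one sequence and records the true alignment between parent and child as it goes. An alignment program's output on the ungapped child could then be compared against this record.

```python
import random

ALPHABET = "ACGT"


def evolve(parent, rng, p_sub=0.1, p_ins=0.05, p_del=0.05):
    """Apply random substitutions, insertions and deletions to `parent`.

    Returns (aligned_parent, aligned_child): the true pairwise alignment,
    with "-" marking inserted or deleted positions. The probabilities are
    illustrative assumptions, not estimates from real data.
    """
    top, bottom = [], []
    for residue in parent:
        if rng.random() < p_ins:
            # Insertion in the child just before this site.
            top.append("-")
            bottom.append(rng.choice(ALPHABET))
        r = rng.random()
        if r < p_del:
            # Deletion: the site is missing from the child.
            top.append(residue)
            bottom.append("-")
        elif r < p_del + p_sub:
            # Substitution: replace with a different residue.
            top.append(residue)
            bottom.append(rng.choice([c for c in ALPHABET if c != residue]))
        else:
            # Site is inherited unchanged.
            top.append(residue)
            bottom.append(residue)
    return "".join(top), "".join(bottom)


rng = random.Random(2008)  # fixed seed so the run is repeatable
parent = "".join(rng.choice(ALPHABET) for _ in range(40))
aligned_parent, aligned_child = evolve(parent, rng)
print(aligned_parent)
print(aligned_child)
```

Because insertions and deletions are recorded as separate events here, the simulated history distinguishes them, which is the distinction the paper argues traditional methods lose.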

As you might expect, PRANK came out best in this test. I'm not sure that it would work best with real data but that's not really my point. My point is that this is an ongoing problem that has not been fully solved. It is still best to avoid multiple sequence alignments that have not been manually improved by humans with considerable experience in sequence alignment.

I'll close by quoting from the discussion in Löytynoja and Goldman (2008) just to remind everyone how important this is. They argue that even post-alignment human "refinement" of computer-generated sequence alignments suffers from systematic bias.
Our analyses show that sequence alignment remains a challenging task, and alignments generated with methods based on the traditional progressive algorithm may lead to seriously incorrect conclusions in evolutionary and comparative studies. The main reason for their systematic error is disregard of the phylogenetic implications of gap patterns created—which is not corrected by considering alignment consistency (13) or using post alignment refinement (14, 15)—and this error is intensified by methods that intentionally force gaps into tight blocks. Affected methods can be positively misleading and become increasingly confident of erroneous solutions as more sequences are included. It is not the progressive algorithm as such that is defective, rather, correct alignment requires that we take account of sequences' phylogeny, irrespective of alignment method used or data type, but the original implementations of the progressive algorithm have a flaw that has gone unnoticed as long as different methods have been consistent in the error they create.

That such a significant error has passed undetected may be explained by the alignment field's historical focus on proteins, where these biases tend to be manifested in less-constrained regions such as loops (compare Fig. 1). Alignments with insertions and deletions squeezed compactly between conserved blocks may suffice for, and even be preferred by, some molecular biologists working with proteins. We have shown, however, that these patterns are, in fact, imposed by systematic biases in alignment algorithms, even in cases where they are incorrect and, indeed, phylogenetically unreasonable. We contend that algorithms that impose gap patterns like those found in structural alignments of proteins are inappropriate for the increasingly widespread analysis of genomic DNA and are likely to cause error when the resulting alignments are used for evolutionary inferences.


1. In a sense, phylogenetic alignment creates a circular argument. What we're trying to do is to build a phylogenetic tree from the multiple sequence alignments. If we use the presumed phylogeny to generate the alignments then we have a problem. Part of the problem goes away once we recognize that the alignment is driven by clustering similar sequences rather than phylogenetically related sequences.

Löytynoja, A. and Goldman, N. (2008) Phylogeny-Aware Gap Placement Prevents Errors in Sequence Alignment and Evolutionary Analysis. Science 320:1632-1635. [DOI: 10.1126/science.1158395]

18 comments :

Ron said...

Doesn't homologous mean related by descent regardless of the % identity or % similarity? (Which becomes absurd when taken to extremes just like a lot of other rules.) Stating that some sequences are homologous implies some understanding of the history of the sequences beyond some threshold % identity. At least that's the teaching I got when learning about alignments.

PonderingFool said...

Slightly related, some proteins have added domains or different domains appended to them relative to other members in the protein family. Should those added domains be included in the alignment used as the basis for the phylogenetic analysis or should they be removed? If you remove them, then what are the criteria for when to do it and when not to?

Anonymous said...

Ron is right - assessing homology by percent identity, score or E-value is a misunderstanding. Quality or probability - maybe, but not homology. Deciding whether two aligned sequences are homologous requires the same human expert as in your example of hsp70 alignment.

Rosie Redfield said...

'Homologous' means 'similar because of descent from a common ancestor'. I tell my students that we infer homology for characters whose similarity is too strong to have arisen by chance and too arbitrary to have arisen by convergence (by selection on non-homologous characters to serve the same function).

The usual goal of sequence alignments is to align positions that are truly homologous, not just similar. The states of the aligned positions can then be treated as states of characters, and used to infer the phylogenetic relationships of the sequences.

The paper's authors aren't assuming homology, but using a recursive method (I gather also used by others) where an initial alignment is used to infer a preliminary phylogeny, which is then used to refine estimates of which positions are truly homologous. The refined alignment is then used to infer a refined phylogeny, etc.

I think the authors' improvement is to use the preliminary phylogenies specifically to distinguish between hypothesized insertion and deletion events (to deconstruct 'indels') so they can be treated differently in improving the alignment. They've already published this method, and this new paper demonstrates that the alignments it gives are indeed phylogenetically better.

Larry Moran said...

Pawel Szczesny says,

Ron is right - assessing homology by percent identity, score or E-value is a misunderstanding.

There has to be some way of deciding whether two genes are homologous or not. The best evidence is whether they are similar at the level of nucleotide sequence.

If you align two proteins and they are completely dissimilar (<10% identity) then you'd have a tough time convincing anyone that the genes descend from a common ancestor. Conversely, if the genes are 99% identical then you'd be a fool to deny homology.

Nobody is saying that sequence similarity is the only criterion that can be used to assess homology but to claim that it's a misunderstanding to use it, is bizarre. In fact, it's so bizarre that I have to assume it's you who misunderstands what I'm saying.

Perhaps you thought I was one of those people who say, incorrectly, "25% homology" instead of reading what I actually wrote?

Anonymous said...

The third sentence of this post says:

If this number reaches a significant percentage of the total length (usually >25%) then the two sequences are homologous—they descend from a common ancestor.

And that's the only point I don't agree with. If you align low complexity sequences or coiled-coils, or transmembrane helices, or whatever with a residue composition bias, sequence _identity_ is likely to be over 25% (similarity much more). I know that you know that in such a case these sequences are not homologous. So why such a strong statement?

Sequence identity is the worst measure of similarity between two sequences, because the value alone tells you nothing. Unless you see the alignment, there's no way to distinguish false from true hits (XAXAXAXA and XXXXXXXX aligned have 50% identity). I'm aware of the fact that it's the most widely used measure. However it's used rather as an indication of difficulty of aligning two sequences or probability that two sequences are homologous (your 99% identity example), but not as a binary indication of homology.

I actually HAVE a tough time convincing people that 10%-15% identity or an E-value of 90 can mean homology.

Larry Moran said...

Pawel Szczesny says,

And that's the only point I don't agree with. If you align low complexity sequences or coiled-coils, or transmembrane helices, or whatever with a residue composition bias, sequence _identity_ is likely to be over 25% (similarity much more). I know that you know that in such a case these sequences are not homologous. So why such a strong statement?

This wasn't a posting about how to decide if two genes are homologous. If it had been, I would also have specified that the similarity has to extend over the whole gene. The minimum length is about 100 amino acid residues and more than 200 nucleotides.

Sequence identity is the worst measure of similarity between two sequences, because the value alone tells you nothing. Unless you see the alignment, there's no way to distinguish false from true hits (XAXAXAXA and XXXXXXXX aligned have 50% identity).

Russell Doolittle, among others, recognized the problem of false positives three decades ago. That's why we don't pay any attention to similarities between short sequences.

I actually HAVE a tough time convincing people that 10%-15% identity or an E-value of 90 can mean homology.

Give me an example.

Anonymous said...

this was an excellent post. that's what science blogging is about!

wicker

Peter said...

If you align two proteins and they are completely dissimilar (<10% identity) then you'd have a tough time convincing anyone that the genes descend from a common ancestor.

Not really. But this is where you start to get into more complexity than you wanted to address in your post, perhaps.

It depends also to a degree on what you mean by "homology". If you mean simply that they descend from a common ancestor, then you and I are both homologous to bananas. Life is monophyletic. What an analysis of sequence similarity does is tell you something about _how recently_ two genes are related to each other.

For two genes that share a recent common ancestor, then the nucleotide sequence similarity will be high along the complete length of the gene. For more distantly related genes, the similarity may not be apparent at all at the nucleotide level, but only be seen when you compare the _protein_ sequence. For even more distantly related genes, the homology may not even be apparent at the protein sequence level, but only be seen when you look at higher-level structural aspects of the protein and its folding.

To take a good example of this, look at histone genes.

Two very closely related genes such as rat H2A and mouse H2A (XM_577577 and NM_175660) will show high nucleotide sequence identity. In this case it's 93% over 393 nt sequence length.

More distant relatives such as Giardia H2A (XM_001706865) show no detectable identity at the nucleotide level, or at least no better than chance. However there's a clear alignment at the protein level - 48% identity and 67% similarity over a length of 85 amino acids.

More distantly, it's a good bet that several of the different histone subunits (or at least substantial parts of each subunit) within the nucleosome are in fact descended from a common ancestor. However at this point you can't even easily detect the sequence similarity at the protein level, and homology is inferred from considerations such as shared function, similar fold structure, number of different nucleosome components in different taxa, etc.

Anonymous said...

I wasn't referring to short matches but rather giving example of residue compositional bias that can give false high sequence identity.

Anyway, literature of last five or ten years is full of examples of low sequence identity of proteins that are homologous (chaperones, histones etc.). Example? Here you are: I took two hsp70 proteins from your database and made a global alignment (similarity over the whole gene, as you wrote in the comment). Sequence identity was 13.8%. Local alignment was slightly better, 19.4%. According to the criteria from the post, these sequences are not homologous, period.

As Peter wrote, it's not all that simple. Even if details are beyond the scope of this post, writing that you can say that two sequences are not homologous by looking at sequence identity is misleading.

Larry Moran said...

Peter says,

For even more distantly related genes, the homology may not even be apparent at the protein sequence level, but only be seen when you look at higher-level structural aspects of the protein and its folding.

This is controversial. If there's no significant sequence similarity then structural similarity may be due to convergence. Right now you can't decide except in a few rare cases where the conclusion of homology looks convincing. In all those cases that I'm aware of, the sequence similarity is in the twilight zone or close to it.

Larry Moran said...

Pawel Szczesny says,

Anyway, literature of last five or ten years is full of examples of low sequence identity of proteins that are homologous (chaperones, histones etc.).

Correction. The literature is full of claims that two genes/proteins are homologous when the sequence similarity is low.

The question we should all be concerned about is whether those claims are valid or not.

Example? Here you are: I took two hsp70 proteins from your database and made a global alignment (similarity over the whole gene, as you wrote in the comment). Sequence identity was 13.8%. Local alignment was slightly better, 19.4%. According to the criteria from the post, these sequences are not homologous, period.

Which two sequences are you referring to? I'm not aware of any two HSP70 sequences with such low sequence identity.

BTW, I don't want to be held precisely to the claim that the entire amino acid sequence has to be included. We have many examples of N-terminal leader sequences in the HSP70 database and the C-terminal region of about 50 aa or so is usually excluded.

It's the central 580 aa's that count.

As Peter wrote, it's not all that simple. Even if details are beyond the scope of this post, writing that you can say that two sequences are not homologous by looking at sequence identity is misleading.

I didn't say any such thing. I choose my words very carefully. Here they are again ...

If this number reaches a significant percentage of the total length (usually >25%) then the two sequences are homologous—they descend from a common ancestor.

Proving the negative is much more difficult and that's why I did not say that anything below 25% can't be homologous. You were the one that jumped to that false conclusion.

There's one important point that we all have to keep in mind. Once you get below a certain level of sequence similarity then any conclusions about possible homology must be qualified. Be skeptical of claims that two genes are homologous if they do not share significant sequence similarity.

Anonymous said...

So, starting from the middle... I assumed that we had agreed, that exactly what you had written is not necessarily valid (there can be high sequence identity false positives due to residue composition bias), so I wrote only about reverse.

As for the hsp70 proteins: I used Q4EV69_LISMO and Q0V0K5_PHANO (I'm not sure under which id they exist on your page, some of the links didn't work properly). And of course this example doesn't prove anything. I chose the least similar proteins from the set. The procedure I used applied to any kind of universally conserved protein family will give a similar outcome - sequence identity of the most dissimilar family members at around 10% - because tools for finding remote similarity improved a lot over the last couple of years. Of course I have noticed that the second sequence is significantly longer, and the local alignment makes much more sense etc. What is more, any kind of modern annotation system would say that these two sequences are homologous with a high probability. But all that is kind of obvious. I fully agree with your point that under certain conditions, conclusions about homology must be qualified. However my point is that sequence identity alone is not suitable for that purpose unless it's very high. And while this measure is easy to use and understand, it's also often misleading (when explaining that to people I often show circularly permuted and multidomain proteins - two classic examples where seq. id. goes bananas).

Larry Moran said...

Pawel Szczesny says,

As for the hsp70 proteins: I used Q4EV69_LISMO and Q0V0K5_PHANO (I'm not sure under which id they exist on your page, some of the links didn't work properly). And of course this example doesn't prove anything. I chose the least similar proteins from the set.

As far as I'm concerned, neither of these proteins is homologous to HSP70. You appear to have made the common mistake of trusting database annotations in spite of the fact that your own analysis contradicts them.

The procedure I used applied to any kind of universally conserved protein family will give a similar outcome - sequence identity of the most dissimilar family members at around 10% - because tools for finding remote similarity improved a lot over the last couple of years.

I don't believe those tools and I reject the conclusion that proteins with a similarity of 10% can be considered homologous. Unless, of course, there is overwhelming evidence from other sources. I don't know of a single example, do you?

What is more, any kind of modern annotation system would say that these two sequences are homologous with a high probability.

I argue that modern annotations are full of errors and can't be trusted. My kind of "modern annotation" would conclude that these two sequences are not related by common descent, with high probability.

Pawel, you are behaving as though this were a done deal and everyone agrees that sequences with such low similarities are homologous. That's just not true. There are lots of us who disagree.

Anonymous said...

I must admit: the first one is not a hsp70 - I mapped one of the viral protein identifiers from your database to uniprot and forgot to check the species. The first one is some membrane protein from Listeria, not a hsp70 from a virus.

However, it would be interesting to hear why you think the second protein is not homologous to hsp70 either, given its BLAST hits, annotations, etc. I'm willing to learn what makes this protein not homologous, when it has more than 40% identical residues over the whole length with other hsp70 proteins.

And the last general comment also concerning the line of my arguments: up to your last comment I thought we were really discussing sequence identity as an indicator of homology (BTW, see http://www.ncbi.nlm.nih.gov/pubmed/16984632). And I argued that in the case of low sequence identity homology is possible but deciding requires an expert. However, in your last comment you clearly stated that you _reject_ a possibility of two sequences with low similarity being homologous, unless there's overwhelming evidence. I find it completely surprising. Intro to SCOP database (Structural Classification Of Proteins) clearly states that for some of the families (level of clear homology) members have sequence identity of 15% (http://scop.berkeley.edu/intro.html, less than 15% also happens). Majority of the modern tools easily find such sequences homologous despite low sequence identity. Don't believe these tools? Fine. Many other people don't believe them, but usually they are open enough to be convinced. Want to reject homology of sequences with 10% sequence identity _by definition_ (unless there's overwhelming evidence)? Good luck, but in such a case that simply means end of topic for me. I don't discuss other people's belief systems.

Larry Moran said...

Pawel Szczesny says,

However, in your last comment you clearly stated that you _reject_ a possibility of two sequences with low similarity being homologous, unless there's overwhelming evidence.

Let's not quibble about semantics. What I'm saying is that you cannot conclude that two genes/proteins are homologous unless there's very good evidence. Usually this means significant sequence similarity.

It's not that I categorically *reject* the possibility of homology it's more like I reject the definite conclusion that they are homologous. In some cases I'll put the decision in the "possible" category.

This is different from those who made a firm conclusion of homology and enter their decision into the online databases. That's not good science, in my opinion.

I find it completely surprising. Intro to SCOP database (Structural Classification Of Proteins) clearly states that for some of the families (level of clear homology) members have sequence identity of 15% (http://scop.berkeley.edu/intro.html, less than 15% also happens).

I have very little confidence in the evolutionary conclusions of the SCOP database. They have arbitrarily decided that structural similarity trumps sequence similarity when it comes to making decisions about homology. One can make a case but it's not as open and shut as the SCOP curators make out.

Majority of the modern tools easily find such sequences homologous despite low sequence identity.

No, that's wrong. What you mean to say is that some "modern tools" use different, and controversial, criteria for establishing homology. Since their criteria are not universally accepted it would be more appropriate to say that their conclusions are "tentative" or "speculative."

There are many sophisticated tools that detect very low levels of sequence similarity. What we need to do is decide whether this is due to chance or whether it represents divergence from a common ancestor or convergence from separate ancestors. Some workers have made a knee-jerk, and somewhat arbitrary, decision that it must be divergence from a common ancestor. That's not good science.

This is the issue we're debating.

Don't believe these tools?

It's not that I don't believe the "tools" it's that I don't believe the interpretations.

Fine. Many other people don't believe them, but usually they are open enough to be convinced. Want to reject homology of sequences with 10% sequence identity _by definition_ (unless there's overwhelming evidence)? Good luck, but in such a case that simply means end of topic for me. I don't discuss other people's belief systems.

Hmmm ... That sounds very much like someone who is close-minded. Why aren't you willing to debate whether these sequences are homologous or not? Is it because you realize that you have a weak case and you are embarrassed because you never stopped to question your assumptions? :-)

Anonymous said...

First you say that you carefully choose your words, and little later that we shouldn't quibble about semantics? :) I wouldn't pay so much attention to details of your previous comment, if I didn't take it verbatim :P

Anyway, I'm not embarrassed and I'm willing to continue the discussion. I'll be back later.