Given that the reason for the investigation’s failure to align 0.6 to 0.8 billion base pairs in the two genomes stems from the extensive genetic differences, it is unlikely that these regions display only a 3 percent difference, as is the case for the rest of the genomes. Instead the genetic difference in these regions must be greater. When this greater genetic difference is considered, it is reasonable to conclude that the overall difference between humans and chimpanzees is less than 97 percent and may well be as low as about 90 percent.
In a nutshell, Rana claims that there is a large portion of the chimpanzee genome that "does not align" with the human genome. That must mean that those unalignable sequences are very different, right? Otherwise you should be able to align them and calculate a similarity score. Since that unalignable portion is always conveniently ignored by those who tout 95-99% identity between the human and chimpanzee genomes, the true similarity of the whole genomes must be very much lower than 95-99%. I have to admit that this reasoning sounds good, and I began this analysis wondering if Rana might be on to something. Unfortunately for Rana, my preliminary investigation of this unalignable DNA indicates that the alignment problem is not because the chimp DNA is too different from the human DNA. It's too similar.
Wait ... how could two sequences that were too similar fail to align? How can two sequences be too similar? To understand this phenomenon, it's helpful to remember that the chimp genome was generated by a random sequencing process. Thousands and thousands of small segments (500-2000 nucleotides) of the chimpanzee genome were targeted for sequencing at random. The current draft of the chimp genome is based on "6x coverage," meaning that the total amount of DNA sequenced was about six times the size of the chimp genome (6 x ~3 billion = ~18 billion nucleotides). Why sequence so much extra? To make sure that most parts of the genome get sequenced at least once.
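To put rough numbers on that coverage idea, here is a minimal sketch in Python. It relies on an idealized assumption that is mine, not the genome project's: reads land uniformly at random, so the chance that any given nucleotide is missed entirely is about e^(-coverage). Real sequencing is biased, so treat the output as a ballpark only. The function names are made up for illustration.

```python
import math

GENOME_SIZE = 3_000_000_000  # ~3 billion nucleotides, as used in this post


def total_sequenced_bases(genome_size, coverage):
    """Total nucleotides of raw sequence needed for a given fold coverage."""
    return genome_size * coverage


def expected_fraction_unsequenced(coverage):
    """Idealized (Poisson) chance that a given nucleotide is hit by no read.

    Assumes reads are placed uniformly at random, which real data only
    approximates.
    """
    return math.exp(-coverage)


for coverage in (1, 4, 6):
    total = total_sequenced_bases(GENOME_SIZE, coverage)
    missed = expected_fraction_unsequenced(coverage)
    print(f"{coverage}x: {total:,} nucleotides sequenced, "
          f"~{missed:.2%} of the genome expected to be missed")
```

Under this simple model, going from 4x to 6x shrinks the expected unsequenced fraction from roughly 2% to well under 1%, which is the whole point of sequencing "extra."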
The challenge with this type of genome sequencing is the assembly stage. That's where you have a computer read all those random pieces and try to arrange them into contiguous segments of DNA based on overlapping fragments. The problem is that the human and chimpanzee genomes are highly repetitive. Think of it like this: Imagine you're at a party where there are lots of guests, and the host is giving away door prizes by drawing names at random from a hat. The host draws the name "Bill," and 10 different guys come forward to get their prize. Without any additional information about which Bill it is, you can't tell who should get the prize. That's exactly the problem with the chimp genome. Whenever you sequence lots of fragments of DNA from the chimp genome, chances are you'll eventually end up sequencing a piece that's found in lots of different places in the genome. So you can't tell the exact place it should go.
When comparing the chimp and human genomes, then, you want to compare only parts that correspond to the "same" (orthologous) positions in the two genomes. If you have a piece of chimp DNA that you can't place unambiguously, you have to leave it out of the comparison. So your alignment fails not because the chimp DNA is too different, but because it's too similar to different parts of the human genome and you can't be certain which part it really corresponds to.
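The "too many Bills" problem can be shown with a toy example. This is only a sketch: the sequences are invented, and it uses exact string matching where a real aligner tolerates mismatches and gaps. The point is just that a fragment with one possible location can be placed, while a fragment with several equally good locations cannot.

```python
def find_placements(reference, fragment):
    """Return every start position where fragment occurs exactly in reference."""
    positions = []
    start = reference.find(fragment)
    while start != -1:
        positions.append(start)
        # Step forward one base so overlapping occurrences are also found.
        start = reference.find(fragment, start + 1)
    return positions


def classify(reference, fragment):
    """Label a fragment as 'unique', 'ambiguous', or 'no match'."""
    hits = find_placements(reference, fragment)
    if not hits:
        return "no match"
    return "unique" if len(hits) == 1 else "ambiguous"


# Invented toy "genome" containing the repeat ACGTACGT twice.
reference = "TTGACA" + "ACGTACGT" + "CCGGAA" + "ACGTACGT" + "GATTACA"

for fragment in ("GATTACA", "ACGTACGT", "CCCCCC"):
    print(fragment, classify(reference, fragment), find_placements(reference, fragment))
```

The repeated fragment is a perfect match to the reference, yet it is the one that has to be left out of a position-by-position comparison, because there is no way to say which copy it belongs to.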
OK, well that's some nice handwaving, but yesterday I promised data. What I'm going to present here is preliminary data from a larger project I'm working on. (Stay tuned for more information when that project is published.) I went to the chimp genome repository at Ensembl and downloaded the chimp genome sequence that corresponds to the "unassigned" and "nonchromosomal" sequences, i.e., the stuff that "doesn't align." I then used Megablast to compare this "unalignable" DNA to the human genome to see how much of it aligns somewhere and how much doesn't align at all. If I'm right, I should find that most of this DNA will have a match somewhere in the human genome. If Rana is right, Megablast should return no results at all.
The "unalignable" DNA consists of 48,975 fragments totaling 157.0 million nucleotides (157.0 Mb). That's about 5.2% of a 3-billion-nucleotide (3 Gb) genome. That differs from the original chimp genome paper, which indicated that ~20% didn't align. Why the discrepancy? I'm not entirely sure, but my guess at this point is that the original paper was based on a 4x genome rather than the more accurate and complete 6x genome that I'm using.
The Megablast results indicated that 42,570 of these 48,975 "unalignable" sequences (about 87%) aligned to some region of the human genome. How similar are those fragments? The median percent identity is 97.4%. So they're just as similar as everything else in the chimp and human genomes. The only difference is that you can't specify which precise region of the human genome the "unalignable" chimp fragments should align to.
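For readers who want to try this kind of summary themselves, here is one hedged way to do it. This is not the script behind the numbers above. It assumes the search results were saved in the standard 12-column tab-delimited BLAST format (query ID in column 1, percent identity in column 3, bit score in column 12), and it assumes that each fragment is represented by its single best-scoring hit. The sample lines are invented purely to exercise the code.

```python
import statistics


def summarize_hits(tabular_lines, total_queries):
    """Summarize BLAST tabular output, keeping the best hit per query.

    Assumes 12 tab-separated columns per line, with the query ID in
    column 1, percent identity in column 3, and bit score in column 12.
    """
    best = {}  # query ID -> (bit score, percent identity)
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query = fields[0]
        pident = float(fields[2])
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, pident)

    identities = [pident for _, pident in best.values()]
    return {
        "queries_with_hit": len(best),
        "fraction_with_hit": len(best) / total_queries,
        "median_identity": statistics.median(identities) if identities else None,
    }


# Invented sample: 4 fragments were searched, 3 of them found a match.
sample = [
    "frag1\tchr1\t97.5\t1000\t25\t0\t1\t1000\t5001\t6000\t0.0\t900",
    "frag1\tchr5\t96.0\t1000\t40\t0\t1\t1000\t7001\t8000\t0.0\t850",
    "frag2\tchr2\t98.0\t800\t16\t0\t1\t800\t101\t900\t0.0\t700",
    "frag3\tchr7\t95.0\t500\t25\t0\t1\t500\t301\t800\t0.0\t400",
]
print(summarize_hits(sample, total_queries=4))
```

Note that frag1 matches two different human chromosomes almost equally well. That is exactly the kind of fragment that has a fine match but no unambiguous home.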
What does that mean for the actual amount of sequence that doesn't align because it's too different? I'm still crunching the numbers right now, but my preliminary estimate is that around 25.3 Mb of chimp DNA truly has no similar match in the sequenced human genome. That's about 0.84% of the length of the human genome. As I said, these are preliminary numbers, but I have every reason to expect that they will hold as I continue to process the data.
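The percentages in this post are simple ratios, and anyone can check them. The short sketch below redoes the arithmetic using only figures quoted above and a round 3-billion-nucleotide genome size; the helper name is mine.

```python
GENOME_SIZE = 3_000_000_000  # round 3 Gb figure used throughout this post


def percent_of(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


unplaced_total = 157.0e6      # nucleotides in the "unalignable" fragments
no_match_estimate = 25.3e6    # preliminary estimate with no human match
fragments_total = 48_975
fragments_with_hit = 42_570

print(f"Unplaced DNA: {percent_of(unplaced_total, GENOME_SIZE):.1f}% of the genome")
print(f"No human match: {percent_of(no_match_estimate, GENOME_SIZE):.2f}% of the genome")
print(f"Fragments with a human match: "
      f"{percent_of(fragments_with_hit, fragments_total):.1f}%")
print(f"Fragments without a match: {fragments_total - fragments_with_hit:,}")
```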
So Rana's interpretation of the chimp genome is wrong. The "unaligned" chimp DNA is not too different; it's too similar. And the parts of the chimp genome that don't align because there is no corresponding sequence in the human genome are just a tiny fraction of the length of the human genome.
What does this mean? In my next post, I'm going to evaluate Rana's claims about the "meaning" of genomic similarity.
Feedback? Email me at toddcharleswood [at] gmail [dot] com.