RTB and the chimp genome Part 3

In my previous posts in this series (one and two), I've examined claims made by Fuz Rana of Reasons to Believe about a blog post by Dennis Venema that raised important questions about RTB's presentation of research on the similarity of the human and chimpanzee genomes. In this post, I want to examine the details of Rana's technical justification for RTB's position on the human/chimp genome similarity, as summarized in Rana's second post responding to Venema.

According to Rana,
As discussed in Who Was Adam?, researchers have performed a number of studies that indicate a 98 to 99 percent genetic similarity between humans and chimps. But as Hugh Ross and I point out, these comparisons were based on relatively limited genetic regions and focused on a single type of genetic difference (called substitutions or single nucleotide polymorphisms, SNPs). Comparisons that encompass larger regions of the genomes and include other types of genetic differences (like indels) show that the DNA similarity between humans and chimpanzees is much less than 98 to 99 percent.
One point of clarification about this passage: Substitutions are not the same thing as single nucleotide polymorphisms (SNPs). A substitution occurs whenever one nucleotide is substituted for another in a DNA sequence. A SNP is a substituted nucleotide that is present in some members of a species but not in other members of the same species. When comparing different species, though, you are looking for differences that every member of each species possesses (called fixed differences). By definition, fixed differences cannot be SNPs. Furthermore, creationists should really refer to the fixed differences between the human and chimp genomes as "differences," not as "substitutions," since the latter term implies a human/chimp common ancestor.

So what evidence does Rana give for his assertion that other types of genetic differences indicate a genomic similarity between humans and chimps that is lower than 98-99% identity?

First is a paper from Britten that argued that the human and chimp genomes should actually be considered about 95% identical rather than 98-99%, because insertions and deletions (indels) are differences that need to be counted too. I've addressed Britten's approach before:
Britten was wrong. His strategy of counting indels doesn't actually make any sense at all. Consider a simple example. Say you have two sequences, one 50,000 nucleotides long and the other 55,000 nucleotides long. The only difference between them is a single insertion of 5,000 nucleotides. Otherwise, the sequences are identical. What then should the percent identity be? Should it be 90%, counting the 5000 nucleotide difference as 10% of the smaller sequence? Or should it be 91%, counting the 5000 nucleotide difference as 9% of the total sequence in comparison (55,000)? Neither one makes any sense, since the reality is that there is only one difference between the sequences. It's a single insertion or deletion, representing one mutation. Why should we count that as 5000 differences when there's only one mutation?
That's a technical issue, though, and Rana has faithfully summarized Britten's work.
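
The arithmetic in the quoted 50,000/55,000 example can be checked with a short sketch. It only restates the toy numbers from that passage; the variable names are made up for illustration, and this is not Britten's actual method.

```python
# Toy example from the quoted passage: two sequences that differ by a
# single 5,000-nucleotide insertion and are otherwise identical.
shorter = 50_000
longer = 55_000
indel_length = longer - shorter  # 5,000 nucleotides, but one mutational event

# Per-nucleotide counting, relative to the shorter sequence.
identity_vs_shorter = 1 - indel_length / shorter

# Per-nucleotide counting, relative to the longer sequence.
identity_vs_longer = 1 - indel_length / longer

# Per-event counting: the whole indel is a single difference.
differences_as_events = 1

print(f"identity vs. shorter sequence: {identity_vs_shorter:.1%}")
print(f"identity vs. longer sequence:  {identity_vs_longer:.1%}")
print(f"differences counted as events: {differences_as_events}")
```

Either per-nucleotide figure (roughly 90% or 91%) turns one mutation into thousands of differences, which is the objection raised in the passage.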

For his second supporting evidence, Rana cites a paper on the striking differences between the MHC class I region in humans and chimps, which is an inherently variable sequence and should not be taken as representative of the entire genome. Venema made that point in his post, and Rana seems to acknowledge that,
He makes a good point. However, he also overlooks the larger point we are trying to make: namely, that including indels reduces the genetic similarity between humans and chimps.

Rana's third supporting evidence comes from comparison of chimp chromosome 22 and human chromosome 21, where he notes that there are "68,000 indels in the two sequences with some indels up to 54,000 nucleotides in length." He's right about that.

For his fourth supporting evidence, Rana cites a paper by Thomas et al. (2003) that looked at sequence similarity at the CFTR locus in multiple species. (CFTR is the gene linked to cystic fibrosis.) Here's Rana's summary of the paper:
Only a third of the differences between humans and chimpanzees involved substitutions. Indels accounted for roughly two-thirds of the sequence differences between these two primates and about one-half of these were greater than 100 base-pairs long.
Notice that he identifies these "differences" as insertion or deletion events, since he notes that "one-half of these were greater than 100 base-pairs long." Here's how Thomas et al. state it in their paper:
Although most mutational events leading to human-chimpanzee differences are single-nucleotide changes, they account for only 33% of the bases that differ between these two species
Thus Thomas et al. are counting those differences not as insertion or deletion events but by the number of nucleotides contained in each indel. Rana confused the two statistics.
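
To make the distinction between counting events and counting bases concrete, here is a minimal sketch. The list of differences is invented for illustration (it is not data from Thomas et al.), with lengths chosen so that substitutions are most of the events but only about a third of the differing bases.

```python
# Hypothetical differences between two aligned sequences, invented for
# illustration only. Each entry is (kind, length_in_bases).
differences = [("substitution", 1)] * 8 + [("indel", 4), ("indel", 12)]


def share_by_event(diffs, kind):
    """Fraction of mutational events that are of the given kind."""
    return sum(1 for k, _ in diffs if k == kind) / len(diffs)


def share_by_base(diffs, kind):
    """Fraction of differing bases contributed by the given kind."""
    total_bases = sum(n for _, n in diffs)
    return sum(n for k, n in diffs if k == kind) / total_bases


print(f"substitutions as share of events: {share_by_event(differences, 'substitution'):.0%}")
print(f"substitutions as share of bases:  {share_by_base(differences, 'substitution'):.0%}")
```

In this toy set, substitutions are 8 of 10 events but only 8 of 24 differing bases. The two percentages answer different questions and can't be swapped.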

Rana then cites the chimpanzee genome paper. Here at last we should get some inkling of what might have caused RTB to alter the text in MTT to deny that this very study had taken place. Here's what Rana says:
...the actual genetic similarity is around 97 percent. This figure, however, overestimates genetic similarity. When performing the comparison, the researchers examined only about 2.4 billion base pairs, which represent around 75 to 80 percent of the genomes. ... The reason for this limited comparison stems from the fact that they struggled to get a significant fraction of the genomes to align, in part, because of differences. ... Given that the reason for the investigation’s failure to align 0.6 to 0.8 billion base pairs in the two genomes stems from the extensive genetic differences, it is unlikely that these regions display only a 3 percent difference, as is the case for the rest of the genomes. Instead the genetic difference in these regions must be greater. When this greater genetic difference is considered, it is reasonable to conclude that the overall difference between humans and chimpanzees is less than 97 percent and may well be as low as about 90 percent.

Note that last section: The "genetic difference" in the unaligned regions of the chimp and human genomes "must be greater" than the 3 percent difference in the rest of the genome because they can't be aligned. So that's the missing piece, which is what I suspected all along. If I had to summarize what I think RTB's response should have been, I would have said it this way: The chimp genome paper does not represent a true "complete" genome comparison, since they compared only 80% of the two genomes. Though Rana never spells it out, I think the text in MTT was changed because RTB does not consider the chimp genome paper an example of "comparisons between the complete human and chimpanzee genomes." If that's the case, I don't know why Rana didn't just say that in the first place.
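
It may help to see what Rana's numbers imply if taken at face value. The sketch below assumes, purely for illustration, that overall similarity is a length-weighted average of the aligned and unaligned portions. That is my simplification, not a calculation from Rana's post or the chimp genome paper, and the function name is hypothetical.

```python
def implied_unaligned_identity(overall, aligned_identity, aligned_fraction):
    """Solve overall = f * aligned + (1 - f) * unaligned for the unaligned term.

    Assumes a simple length-weighted average, which is an illustrative
    simplification rather than anyone's published method.
    """
    unaligned_fraction = 1 - aligned_fraction
    return (overall - aligned_fraction * aligned_identity) / unaligned_fraction


# Rana's figures: 97% similarity in the aligned 75 to 80 percent of the
# genomes, and an overall figure that "may well be as low as about 90 percent."
for aligned_fraction in (0.75, 0.80):
    value = implied_unaligned_identity(0.90, 0.97, aligned_fraction)
    print(f"aligned fraction {aligned_fraction:.0%}: unaligned identity {value:.0%}")
```

Under that assumption, an overall figure of 90% would require the unaligned portion to be only about 62% to 69% identical.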

I will return to this hypothesis of unalignable DNA in a future post, but first, Rana makes a few more claims in that same post to reinforce this notion that the human and chimpanzee genomes have regions that just can't be aligned. Rana first cites a paper by Fujiyama et al. that looked at similarities of chimpanzee BAC end sequences (a notoriously messy method of looking at similarity), where he claims, "the researchers also found that about 15,000 of the 65,000 chimp DNA fragments did not align with any sequence in the Human Genome Database." That's not correct. According to Table 1 of their paper, they found that ~15,000 sequences from a collection of 114,421 chimp BAC end sequences had no match in human DNA. The "65,000" number that Rana cites is actually the number of BAC clones, not the number of sequences compared.
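
The choice of denominator matters here, as a quick calculation shows. The counts are the rounded figures quoted above; the variable names are mine.

```python
no_match = 15_000            # approximate count of BAC end sequences with no human match
bac_clones = 65_000          # the denominator in Rana's summary (rounded)
bac_end_sequences = 114_421  # the denominator from Table 1, as cited above

fraction_rana = no_match / bac_clones
fraction_table = no_match / bac_end_sequences

print(f"using BAC clones:        {fraction_rana:.1%}")
print(f"using BAC end sequences: {fraction_table:.1%}")
```

With the wrong denominator the unmatched share looks like roughly 23%; with the right one it is roughly 13%.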

Next Rana summarizes the findings of Ebersberger et al.:
... a team from the Max Planck Institute achieved a similar result when they compared over 10,000 regions (encompassing nearly 3,000,000 nucleotide base pairs). Only two-thirds of the sequences from the chimp genome aligned with the sequences in the human genome. As expected in those that did align, a 98.76 percent genetic similarity was measured - yet one-third found no matches.
Here's what Ebersberger et al. reported: "For 7% of the chimpanzee sequences, no region with similarity could be detected in the human genome." How does Rana get a third out of that? I think he merely divided the test set of 2 Mb of aligned DNA by the total 3 Mb sequenced. That would leave one third unaligned, but Ebersberger et al. explained in their methods that they used very strict alignment criteria to make sure they were comparing corresponding (orthologous) positions. Why? Because the chimp and human genomes contain lots of highly similar repeat sequences. To make sure they were comparing the right parts of the genome, Ebersberger et al. only looked at alignments of at least 60 nucleotides with at least a 10% sequence identity difference between the top two matches in the human genome. That doesn't mean that the other sequences don't align at all. It just means that they can't be aligned unambiguously with a single region of the human genome.
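
The filtering criterion described above can be sketched as follows. This is a simplified reading of the rule as I've summarized it (at least 60 nucleotides, and a gap of at least 10 percentage points of identity between the top two human matches), not Ebersberger et al.'s actual pipeline. The `Hit` class, the function name, and the handling of a sequence with only one hit are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    length: int        # aligned length in nucleotides
    identity: float    # percent identity, 0 to 100


def is_unambiguous(hits, min_length=60, min_gap=10.0):
    """Keep a chimp sequence only if its best human hit is long enough
    and clearly better than the runner-up (hypothetical simplification)."""
    ranked = sorted(hits, key=lambda h: h.identity, reverse=True)
    if not ranked or ranked[0].length < min_length:
        return False  # no hit at all, or best hit too short
    if len(ranked) == 1:
        return True   # assumption: a lone hit counts as unambiguous
    return ranked[0].identity - ranked[1].identity >= min_gap


# A sequence from a repeat family: two near-identical human matches.
repeat_like = [Hit(length=300, identity=99.0), Hit(length=300, identity=98.5)]
# A sequence with one clear best match.
unique_like = [Hit(length=300, identity=99.0), Hit(length=300, identity=80.0)]

print(is_unambiguous(repeat_like))
print(is_unambiguous(unique_like))
```

The repeat-like sequence is rejected even though it aligns very well to the human genome. Failing this filter is not the same as failing to align.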

You might be thinking by now that I'm being really nitpicky in my response so far, and I can sympathize with that. In my next post, I will definitely examine Rana's main point: "as much as 25 percent of the two genomes won’t align." At the same time, though, I think it's reasonable to expect that Rana should be able to accurately summarize the things that he reads. Unfortunately, these nitpicks only add to other evidence of Rana's misunderstanding of published claims. Previously on this blog, I noted that he (and Hugh Ross) made false claims about the Neandertal genome paper, and later I discovered an example of Rana misquoting what he himself wrote. In my second post in this series, I showed how Rana misunderstood Venema's critique, and here I've documented four more errors (not knowing the difference between a SNP and a substitution, and misrepresenting the work of Thomas et al., Fujiyama et al., and Ebersberger et al.). Notice that these problems I've documented here aren't technical disagreements. My appraisal of Britten's work is a technical discussion that I think even experts in the field can and do disagree on. The problems I've documented are all basic factual errors, and you can verify them yourself if you don't believe me.

Tomorrow: Actual DATA from a brand new comparison of the human and chimp genomes. Is there really a big fraction of the chimp genome that won't align to the human genome? I intend to find out.

Feedback? Email me at toddcharleswood [at] gmail [dot] com.