Spaced seeds puzzled me the first time I came across the claim that they increase sensitivity. The claim is correct and not actually that mysterious, but I intuitively thought about it the wrong way. My initially fallacy was to consider the probability of an exact match which clearly only depends on the length of the seed (assuming equal character probabilities). However, to detect homology, what matters is not the expected number of matches but the probability of finding at least one match. Different seeds differ in the way matches are clustered along the genome. When moving a seed without gaps by one letter, only one additional letter is queried. Hence if position \(i\) is a match, it is very likely that position \(i+1\) is a match as well -- in other words seed matches cluster. This clustering makes it more likely that homology is missed in a particular gene. When moving a spaced seed by one letter, several new positions contribute to the seed. The code below illustrates this behavior.
import numpy as np
import matplotlib.pyplot as plt
L = 1000
s1 = np.array(np.fromstring('1111111111', 'S1')==b'1', dtype=int)
s2 = np.array(np.fromstring('101100111011101', 'S1')==b'1', dtype=int)
ps = np.linspace(0.0, 0.8, 31)
sensitivity = []
mean_match = []
std_match = []
for p in ps:
matches = []
for i in range(1000):
aln = np.random.random(size=L)>p
matches.append([(np.convolve(aln, mask, mode='valid')==np.sum(mask)).sum() for mask in [s1, s2]])
matches = np.array(matches)
print("average number of matches:", matches.mean(axis=0))
print("fraction of alignments with match:", (matches>=1).mean(axis=0))
sensitivity.append((matches>=1).mean(axis=0))
mean_match.append(matches.mean(axis=0))
std_match.append(matches.std(axis=0))
sensitivity = np.array(sensitivity)
mean_match = np.array(mean_match)
std_match = np.array(std_match)
fig, axs = plt.subplots(1,3, figsize=(12,6))
axs[0].plot(1-ps, sensitivity[:,0], label='not spaced')
axs[0].plot(1-ps, sensitivity[:,1], label='spaced seed')
axs[0].legend()
axs[0].set_yscale('log')
axs[0].set_ylabel('sensitivity')
axs[0].set_xlabel('similarity')
axs[1].plot(1-ps, mean_match[:,0], label='not spaced')
axs[1].plot(1-ps, mean_match[:,1], label='spaced seed')
axs[1].legend()
axs[1].set_yscale('log')
axs[1].set_ylabel('mean number of matches')
axs[1].set_xlabel('similarity')
axs[2].plot(1-ps, std_match[:,0], label='not spaced')
axs[2].plot(1-ps, std_match[:,1], label='spaced seed')
axs[2].legend()
axs[2].set_yscale('log')
axs[2].set_ylabel('standard deviation of number of matches')
axs[2].set_xlabel('similarity')