I was reading Tajima's 1989 paper on his test for neutrality.

Tajima, Fumio. "Statistical method for testing the neutral mutation hypothesis by DNA polymorphism." Genetics 123.3 (1989): 585-595.

Here is the question: Suppose we have three sequences labelled $C,D,E$, and their genealogy follows $\{\{CD\}E\}$ (i.e. $C$ and $D$ coalesce first before they coalesce with $E$). Let $B$ be the most recent common ancestor of $C$ and $D$.

Now, let the random variable $k_{ij}$ be the number of nucleotide differences between sequence $i$ and sequence $j$. Tajima shows that $k_{BC}$ and $k_{BD}$ have a nonzero covariance.

But aren't mutations in the branch $BC$ independent of mutations in the branch $BD$? I was thinking that the total numbers of mutations in the branches $BC$ and $BD$ are two independent, identically distributed variables, so $k_{BC}$ and $k_{BD}$ are independent. Why, then, do they have a nonzero covariance?

============Update=============

Now I have some basic ideas, but haven't worked out the full answer.

Tajima's $k_{ij}$ is defined neither independently of the sample size nor conditional on a fixed coalescent time. (See his 1983 paper: Tajima, Fumio. "Evolutionary relationship of DNA sequences in finite populations." Genetics 105.2 (1983): 437-460.)

For example, in a sample of size 3, if you pick 2 individuals and condition on those two coalescing first, their coalescent time will follow
\begin{align*}
\mathbb{P}(t=T)=p(T)=\frac{3}{2N}e^{-\frac{3}{2N}T}.
\end{align*}
Now, conditioning on a fixed coalescent time $t$, the number of mutations under the infinite sites model in each branch, either from $B$ to $C$ or from $B$ to $D$, will follow a Poisson distribution with parameter $\mu t$, where $\mu$ is the mutation rate per sequence per generation. Let this Poisson random variable be $\xi_t$ in branch $BC$ and $\eta_t$ in branch $BD$. Then
\begin{align*}
k_{BC}&=\sum_{t=0}^{\infty}\xi_t\,p(t),\\
k_{BD}&=\sum_{t=0}^{\infty}\eta_t\,p(t).
\end{align*}
If we only consider the partial sums of the above series,
\begin{align*}
k_{BC}^{(n)}&=\sum_{t=0}^{n}\xi_t\,p(t),\\
k_{BD}^{(n)}&=\sum_{t=0}^{n}\eta_t\,p(t),
\end{align*}
then $k_{BC}^{(n)}$ and $k_{BD}^{(n)}$ clearly have a zero covariance, because $\xi_t$ and $\eta_t$ are independent Poisson variables. Hence
\begin{align*}
\mathbb{E}\left(k_{BC}^{(n)}k_{BD}^{(n)}\right)&=\mathbb{E}\left(\sum_{t=0}^{n}\xi_t\,p(t)\sum_{t=0}^{n}\eta_t\,p(t)\right)=\sum_{i=0}^{n}\sum_{j=0}^{n}p(i)p(j)\,\mathbb{E}(\xi_i\eta_j)=\sum_{i=0}^{n}\sum_{j=0}^{n}p(i)p(j)\,\mathbb{E}\xi_i\,\mathbb{E}\eta_j\\
&=\mathbb{E}k_{BC}^{(n)}\,\mathbb{E}k_{BD}^{(n)},
\end{align*}
so their covariance is zero.

But as $n\to+\infty$, how $k_{BC}^{(n)}k_{BD}^{(n)}$ converges to $k_{BC}k_{BD}$ is questionable. It cannot converge uniformly to $k_{BC}k_{BD}$, because otherwise we could first compute the expectation and then take the limit, which would give a zero covariance. Tajima did not explicitly show how he computed the covariance by summing three infinite series (line 7, p. 448 of the 1983 paper). I tried working directly on that series but failed at the last sum. His result is correct, though. I hope someone can give a hint on why there is an inherent correlation between these seemingly independent random variables.

=======Update: a simple explanation has been posted==============

Here is a simple answer to my question. The total numbers of mutations accumulated on the two divergent branches are not independent of each other because both branches span the same coalescent time.
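To make this concrete, here is a short sketch using the law of total covariance. It assumes, as in the setup above, that conditional on the shared coalescent time $T$ the two counts are independent Poisson variables with mean $\mu T$, and that $T$ is exponential with rate $3/(2N)$. I have not matched it term by term against Tajima's series, so treat it as a consistency check rather than his derivation.
\begin{align*}
\operatorname{Cov}(k_{BC},k_{BD})&=\mathbb{E}\left[\operatorname{Cov}(k_{BC},k_{BD}\mid T)\right]+\operatorname{Cov}\left(\mathbb{E}[k_{BC}\mid T],\,\mathbb{E}[k_{BD}\mid T]\right)\\
&=0+\operatorname{Cov}(\mu T,\mu T)\\
&=\mu^{2}\operatorname{Var}(T)=\mu^{2}\left(\frac{2N}{3}\right)^{2}=\frac{4N^{2}\mu^{2}}{9}.
\end{align*}
The first term vanishes because the two branches mutate independently once $T$ is given. The second term is strictly positive whenever $T$ is random, and it disappears only if $T$ is held fixed.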

While the stationary Poisson mutation processes *are* independent of each other as long as they occur on different branches of a genealogy, they are likely to produce similar numbers of mutations when both run for the same amount of time. Thus, the nonzero covariance between $k_{BC}$ and $k_{BD}$ comes not from the mutation processes themselves, but from the shared coalescent time $T$.
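A quick simulation can illustrate the point. The sketch below is my own check, not anything from Tajima's papers: it assumes $T$ is exponential with rate $3/(2N)$ and that, given $T$, each branch accumulates mutations as an independent Poisson process with rate $\mu$. The function names and the parameter values ($N=1000$, $\mu=0.003$) are hypothetical and chosen only so that $\mu\,\mathbb{E}T=2$.

```python
import random


def poisson_count(rate, duration, rng):
    """Count events of a Poisson process with the given rate in [0, duration]."""
    count = 0
    elapsed = rng.expovariate(rate)  # waiting time to the first mutation
    while elapsed <= duration:
        count += 1
        elapsed += rng.expovariate(rate)
    return count


def sample_covariance(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def simulate_branch_covariance(pop_size, mu, replicates, rng, fixed_time=None):
    """Estimate Cov(k_BC, k_BD) over many simulated genealogies.

    If fixed_time is None, T is redrawn for every replicate from an
    exponential distribution with rate 3 / (2 * pop_size) (assumed model).
    Otherwise every replicate uses T = fixed_time.
    """
    k_bc, k_bd = [], []
    for _ in range(replicates):
        if fixed_time is None:
            t = rng.expovariate(3 / (2 * pop_size))
        else:
            t = fixed_time
        # Given t, the two branches mutate independently of each other.
        k_bc.append(poisson_count(mu, t, rng))
        k_bd.append(poisson_count(mu, t, rng))
    return sample_covariance(k_bc, k_bd)


if __name__ == "__main__":
    rng = random.Random(1989)
    N, MU, REPS = 1000, 0.003, 100_000  # hypothetical illustration values
    mean_t = 2 * N / 3
    print("random shared T:", simulate_branch_covariance(N, MU, REPS, rng))
    print("fixed T        :", simulate_branch_covariance(N, MU, REPS, rng, fixed_time=mean_t))
    print("mu^2 * Var(T)  :", (MU * mean_t) ** 2)
```

With a random shared $T$ the estimated covariance should land near $\mu^2\operatorname{Var}(T)$, while with $T$ held fixed it should hover around zero, which is the distinction the explanation above relies on.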