{ "cells": [ { "cell_type": "markdown", "id": "224b2831", "metadata": {}, "source": [ "# Genetic diversity, with trees" ] }, { "cell_type": "markdown", "id": "f7376884", "metadata": {}, "source": [ "Before we measured genetic diversity\n", "using expected heterozygosity,\n", "which is the proportion of sites that differ between two randomly chosen genomes:\n", "for a genome of length $L$, with allele frequency $p_i$ at the $i^\\text{th}$ site, this is:\n", "\n", "$$\n", " \\pi = \\frac{1}{L}\\sum_{i=1}^L 2 p_i (1-p_i) .\n", "$$\n", "\n", "We'll now derive this in a different way.\n", "First, let's think about where the differences between trees come from.\n", "As we saw before, they come from mutation, somewhere -\n", "concretely, they come from mutations that happened\n", "somewhere on the path from the two genomes\n", "back up to their common ancestor.\n", "(If there weren't any mutations, then they'd be identical;\n", "if there's only one mutation, then they differ,\n", "and if there was more than one mutation then it depends,\n", "but this is rare and we mostly ignore it.)\n", "\n", "Let's have a look at this in a small example." ] }, { "cell_type": "code", "execution_count": 1, "id": "1c8d84e1", "metadata": {}, "outputs": [], "source": [ "%load_ext slim_magic\n", "\n", "import tskit, pyslim\n", "import pandas as pd\n", "import numpy as np\n", "from matplotlib import pyplot as plt\n", "from IPython.display import display, SVG" ] }, { "cell_type": "code", "execution_count": 2, "id": "554183e0", "metadata": {}, "outputs": [], "source": [ "%%slim_ts --out ts\n", "initialize()\n", "{\n", " setSeed(123);\n", " initializeTreeSeq();\n", " initializeMutationRate(7e-8);\n", " initializeMutationType(\"m1\", 0.5, \"f\", 0.0);\n", " initializeGenomicElementType(\"g1\", c(m1), c(1.0));\n", " initializeGenomicElement(g1, 0, 99999);\n", " initializeRecombinationRate(1e-8);\n", " suppressWarnings(T);\n", "}\n", "\n", "1 {\n", " sim.addSubpop(\"p1\", 500);\n", "}\n", "\n", "3000 late() {\n", " sim.treeSeqOutput(\"tmp.trees\");\n", " sim.simulationFinished();\n", "}" ] }, { "cell_type": "markdown", "id": "5a56a644", "metadata": {}, "source": [ "What we get here is a *tree sequence*:\n", "see [this tutorial](https://tskit.dev/tutorials/intro.html) for an introduction,\n", "and [the documentation](https://tskit.dev/tskit/docs/stable/introduction.html)\n", "for what you can do with them." ] }, { "cell_type": "code", "execution_count": 3, "id": "719c6633", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " Tree Sequence \n", "
Trees15
Sequence Length100000.0
Time Unitsgenerations
Sample Nodes1000
Total Size208.9 KiB
Metadata\n", "
\n", " \n", "
\n", " dict\n", " \n", "
\n", " SLiM:\n", "
\n", " dict\n", " file_version: 0.7
generation: 3000
model_type: WF
nucleotide_based: False
separate_sexes: False
spatial_dimensionality:
spatial_periodicity:
stage: late
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TableRowsSizeHas Metadata
Edges185558.0 KiB\n", " \n", "
Individuals50050.6 KiB\n", " ✅\n", "
Migrations08 Bytes\n", " \n", "
Mutations1298.5 KiB\n", " ✅\n", "
Nodes181868.1 KiB\n", " ✅\n", "
Populations22.4 KiB\n", " ✅\n", "
Provenances12.1 KiB\n", " \n", "
Sites1293.0 KiB\n", " \n", "
\n", "
\n", "
\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ts" ] }, { "cell_type": "markdown", "id": "77eaf4f3", "metadata": {}, "source": [ "Let's suppose we take a sample of just 3 diploids from this population,\n", "so we can easily look at their genealogies and genotypes.\n", "We can do this with the `simplify( )` method:" ] }, { "cell_type": "code", "execution_count": 4, "id": "f2c99755", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", " \n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", " \n", " Tree Sequence \n", "
Trees6
Sequence Length100000.0
Time Unitsgenerations
Sample Nodes6
Total Size112.7 KiB
Metadata\n", "
\n", " \n", "
\n", " dict\n", " \n", "
\n", " SLiM:\n", "
\n", " dict\n", " file_version: 0.7
generation: 3000
model_type: WF
nucleotide_based: False
separate_sexes: False
spatial_dimensionality:
spatial_periodicity:
stage: late
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", "
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
TableRowsSizeHas Metadata
Edges25808 Bytes\n", " \n", "
Individuals32.1 KiB\n", " ✅\n", "
Migrations08 Bytes\n", " \n", "
Mutations393.3 KiB\n", " ✅\n", "
Nodes141.2 KiB\n", " ✅\n", "
Populations12.3 KiB\n", " ✅\n", "
Provenances22.5 KiB\n", " \n", "
Sites39991 Bytes\n", " \n", "
\n", "
\n", "
\n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sts = ts.simplify(range(6))\n", "sts = pyslim.generate_nucleotides(sts)\n", "sts = pyslim.convert_alleles(sts)\n", "sts" ] }, { "cell_type": "code", "execution_count": 5, "id": "cc0d5f9b", "metadata": { "scrolled": false }, "outputs": [ { "data": { "image/svg+xml": [ "Genome position042588479656074874448966801000004013176691411832101510125341120151213450161681132121713415089112120183212191342412505276911326212282322133234354056113303112337123629133127371243805611" ], "text/plain": [ "" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SVG(sts.draw_svg(size=(1000, 400)))" ] }, { "cell_type": "markdown", "id": "351ae6cb", "metadata": {}, "source": [ "There are 38 sites (out of 100kb) at which we have mutations.\n", "Some of these (shown on the roots of the trees above) are fixed in this sample,\n", "i.e., don't differ between these six genomes.\n", "\n", "Now, let's look at the six genotypes at these 38 sites:" ] }, { "cell_type": "code", "execution_count": 6, "id": "88370e91", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sample 0: GCCGTGAAACGCGTCCTAAAACTGGACCGAGGCAGGGAG\n", "sample 1: GCCGTGAAACGCGACCTAAAACTGTGCAGATACGGGGCA\n", "sample 2: GCCGTGTTATGCGTTCCAAAACTGGGGAGAGGCGGGGCA\n", "sample 3: GCCGTGTTCTGCGTTCCATAGGTGGGCAGAGGCAGGGCA\n", "sample 4: GCCGTGTTACGCGTCCCAAAACTGGGCAGAGGAACCGAA\n", "sample 5: GGCGTGTTATCCGTTCTAAAACTGGGCCGAGGCAGGGAA\n" ] } ], "source": [ "for i, h in enumerate(sts.haplotypes()):\n", " print(f\"sample {i}: \", h)" ] }, { "cell_type": "code", "execution_count": 7, "id": "99156b1d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Expected heterozygosity in the subsample is 9.9e-05, while in the entire population it was 0.00012.\n" ] } ], "source": [ "print(f\"Expected heterozygosity in the subsample is {sts.diversity():.2g}, \"\n", " f\"while in the entire population it was {ts.diversity():.2g}.\")" ] }, { "cell_type": "markdown", "id": "3b00c326", "metadata": {}, "source": [ "**Exercise:**\n", "Compute heterozygosity in this subsample by hand\n", "by looking at the genotypes above.\n", "(*Hint:* sum up $2p_i(1-p_i)$ and divide by the genome length.)" ] }, { "cell_type": "markdown", "id": "24b1c7ad", "metadata": {}, "source": [ "## Drift-mutation balance, again" ] }, { "cell_type": "markdown", "id": "f4300f01", "metadata": {}, "source": [ "Let's derive the expression we got for equilibrium genetic diversity\n", "again, but this time looking at the trees.\n", "We want to know *what's the probability that two genomes differ at a given site?*\n", "Well, suppose the two genomes live at the same time (call it \"today\"),\n", "and their most recent common ancestor lived $T$ generations ago.\n", "So, there were $2T$ reproduction events during which a mutation would have been inherited by\n", "one and not the other of the genomes.\n", "Let's use a different simplified mutation model this time:\n", "the *infinite sites* model,\n", "assuming that we can count up *all* mutations that differ between the two genomes.\n", "(This differs from the previous model, where two mutations could cancel each other out.)\n", "Afterwards, we can see how much of an effect the unrealistic parts\n", "of this model might have.\n", "\n", "The number of mutations, $M$, has a Binomial distribution\n", "with sample size $2T$ and probability $\\mu$,\n", "so the expected number of mutations per site\n", "(i.e., the heterozygosity) is\n", "\n", "$$\\begin{aligned}\n", " \\pi = \\mathbb{E}\\left[ 2 T \\mu \\right] .\n", "\\end{aligned}$$\n", "\n", "So, what is $\\mathbb{E}[T]$?\n", "Well, under the Wright-Fisher model,\n", "the probability that two genomes find their common ancestor each generation is $1/2N$,\n", "and so\n", "\n", "$$\n", " \\mathbb{P}\\left\\{ T > t \\right\\} = (1 - 1/2N)^t .\n", "$$\n", "\n", "This is the Geometric distribution,\n", "and has mean $\\mathbb{E}[T] = 2N$.\n", "This implies that, precisely under this model of mutation,\n", "\n", "$$\n", " \\pi = 4 N \\mu .\n", "$$\n", "\n", "This agrees with our previous calculation!\n", "The advantage to this one is that it brings the trees into focus:\n", "we see that heterozygosity\n", "is equal, on average, to $4 \\mu$ times the mean time to most recent common ancestor.\n", "This fact is much more general than the Wright-Fisher model.\n", "\n", "And, along the way, we've learned that the mean pairwise time to most recent common ancestor\n", "in a Wright-Fisher population of size $N$ is $2N$ generations.\n", "This is very useful!\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.9" } }, "nbformat": 4, "nbformat_minor": 5 }