Research
We’re interested in a lot of things, and we’ve always got some side projects going on. Some of those turn into main projects eventually. We do a lot of methods development, meaning a mix of theory, software, and data analysis. We run a dry lab, meaning we use mostly whiteboards and keyboards instead of test tubes and pipettes, and try not to spill tea on our computers. Here’s some of our main threads of research:
Spatial Population Genetics
The world is big, and organisms don’t all live in the same place (happily). Geography can be a problem for popgen methods that make simpler assumptions, but - more excitingly - is a promising source of information when tied with modern georeferenced genomic data. Where’d that genome come from, on a map? How far away from each other do close relatives tend to live? What’s a map of population density look like? How do organisms (or, pollen, or seeds) move around on the map? We’re working on these and other questions, using various tools from math to machine learning, as well as on good ways to simulate populations that live, reproduce, die, and evolve across interesting and realistic geographies.

Machine learning approaches to population genetics
As our ability to sequence genomes has exploded, so has our ability to throw machine learning at the data they produce. The trick is that a genome isn’t really like a photograph — it carries the tangled imprint of mutation, recombination, selection, and the movement of ancestors across a landscape, all correlated in ways that off-the-shelf methods don’t expect. A lot of our work is about building models that respect that structure, and figuring out what to feed them and what to ask.
A recurring theme is simulation-based inference: rather than deriving a likelihood by hand, we simulate genomes under realistic evolutionary scenarios, then train a neural network to recover the parameters we care about. Done well, this lets us ask questions that classical methods find too slow, too rigid, or simply intractable. Some of the tools we’ve built this way:
- locator — predicts where on a map a sample came from, straight from its genotype.
- disperseNN (and its successor disperseNN2) — estimates how far organisms disperse each generation from georeferenced genetic variation.
- ReLERNN — infers the recombination landscape along a chromosome using a recurrent neural network.
- cxt — a transformer that reads genomes a bit like a language model reads text, learning the genealogical history written into real sequences from training on simulated ones, and learning to cope with the missing and messy data that trips up classical methods.
Doing any of this well takes real know-how: deciding what information to put in, how to represent it, and what questions to ask shapes the answer as much as the network architecture does.
Understanding the influence of selection on genomic variation
We understand really quite well how evolution works in a lot of ways, especially on a microscopic level: we know a lot about how natural selection acts on heritable traits and the underlying genetic variation. But there’s a lot of outstanding questions of scale: How much does natural selection act? How strongly does it constrain organisms’ genomes? Is natural selection mostly keeping organisms near a fitness peak, or is it constantly moving different traits around, maybe even in different directions in different places? A unifying question here is: How much is genetic variation influenced by selection as opposed to drift (i.e., random genealogical noise). We’re working on this by developing methods to identify the action of selection, theory to describe the process of adaptation, and by plain old fashioned digging into the data.

Simulation methods
Fundamental to lots of our work are simulations: in studying complex things like how natural selection affects big genomes, or how big populations evolve across geography, it’s important to be able to try out different situations and see what happens. And, simulations are the basic ingredient to lots of modern machine learning methods (i.e., simulation-based inference). It’s challenging to make a good simulation that reasonably realistically describes the various parts of reality we’d like to study, from organisms moving around and interacting down to the details of how DNA is inherited. So, we put a lot of time in collaborative work developing these tools that we and many other people use:
- We’re core members ofthe PopSim consortium,
- as well as the tskit group,
- and contribute regularly to SLiM.