We are searching data for your request:

**Forums and discussions:**

**Manuals and reference books:**

**Data from registers:**

**Wait the end of the search in all databases.**

Upon completion, a link will appear to access the found materials.

Upon completion, a link will appear to access the found materials.

In this chapter, I describe how to estimate parameters from birth-death models using data on species diversity and ages, and how to use patterns of tree balance to test hypotheses about changing birth and death rates. I also describe how to calculate the likelihood for birth-death models on trees, which leads directly to both ML and Bayesian methods for estimating birth and death rates. In the next chapter, we will explore elaborations on birth-death models, and discuss models that go beyond constant-rates birth-death models to analyze the diversity of life on Earth.

- 11.1: Hotspots of Diversity
- Some parts of the tree of life have more species than others. This imbalance in diversity tells us that speciation is much more common in some lineages than others. Likewise, numerous studies have argued that certain habitats are "hotbeds" of speciation.

- 11.2: Clade Age and Diversity
- There are two different ways that one can measure the age of a clade: the stem age and the crown age. A clade's crown age is the age of the common ancestor of all species in the clade. By contrast, a clade's stem age measures the time that that clade descended from a common ancestor with its sister clade. The stem age of a clade is always at least as old, and usually older, than the crown age.

- 11.3: Tree Balance
- Birth-death trees have a certain amount of "balance," perhaps a bit less than your intuition might suggest. We can look to real trees to see if the amount of balance matches what we expect under birth-death models. A less balanced pattern in real trees would suggest that speciation and/or extinction rate vary among lineages more than we would expect. By contrast, more balanced trees would suggest more even and predictable diversification across the tree of life than expected under birth-death mo

- 11.4: Fitting birth-death models to branching times
- Most modern approaches to fitting birth-death models to phylogenetic trees use the intervals between speciation events on a tree - the "waiting times" between successive speciation - to estimate the parameters of birth-death models. Frequently, information about the pattern of species accumulation in a phylogenetic tree is summarized by a lineage-through-time (LTT) plot, which is a plot of the number of lineages in a tree against time

- 11.5: Sampling and birth-death models
- It is important to think about sampling when fitting birth-death models to phylogenetic trees. If any species are missing from your phylogenetic tree, they will lead to biased parameter estimates. This is because missing species are disproportionally likely to connect to the tree on short, rather than long, branches. If we randomly sample lineages from a tree, we will end up badly underestimating both speciation and extinction rates (and wrongly inferring slowdowns).

- 11.S: Fitting Birth-Death Models (Summary)

## Locally adaptive Bayesian birth-death model successfully detects slow and rapid rate shifts

Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Affiliation Department of Biology, University of Washington, Seattle, WA, 98195, USA

Roles Formal analysis, Funding acquisition, Software, Visualization, Writing – review & editing

Affiliations GeoBio-Center, Ludwig-Maximilians-Universität München, 80333 Munich, Germany, Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, 80333 Munich, Germany

Roles Formal analysis, Funding acquisition, Resources, Writing – review & editing

Affiliation Department of Zoology, University of Oxford, Oxford, United Kingdom

Roles Conceptualization, Funding acquisition, Supervision, Writing – review & editing

Affiliation Department of Biology, University of Washington, Seattle, WA, 98195, USA

Roles Conceptualization, Formal analysis, Funding acquisition, Methodology, Project administration, Supervision, Visualization, Writing – original draft, Writing – review & editing

Affiliation Department of Statistics, University of California, Irvine, CA, 92697, USA

## Background

Morphology provides the only source of data to understand the relationships and broadscale evolutionary patterns across the vast majority of life that has ever existed on this planet [1]. Morphological characters and fossil data (both specimens and their ages) also contribute to improved divergence time estimates in total-evidence phylogenies of extant organisms [2, 3], providing a more holistic reconstruction of the tree of life. Additionally, it constitutes the sole source of data to investigate the phylogeny and macroevolution of most extinct lineages. Therefore, improvements on the phylogenetic analysis of morphological data and divergence time estimates for evolutionary trees are essential for the entire field of evolutionary and comparative biology.

Currently, one of the most widespread methods to time-calibrate trees and estimate macroevolutionary parameters is the utilization of relaxed molecular clocks in Bayesian phylogenetic inference. By using either a fixed tree topology, or by co-estimating tree topologies (among species relationships) with divergence times and rates of evolution, clock-based Bayesian inference has become the workhorse for time-calibrated trees in evolutionary biology [4, 5]. Importantly, advances in the last decade have also enabled the utilization of morphological data in relaxed clock Bayesian inference analysis, especially through the development of tree models (modeling diversification parameters) that take into account fossil sampling probabilities—the fossilized birth-death (FBD) tree model [6, 7]. Since then, there has been rapidly growing interest to implement relaxed clock Bayesian inference analyses with morphological data (morphological clocks). Morphological clocks have been used to estimate divergence times for entirely extinct lineages [e.g., [8,9,10]], or in combination with molecular data (“morpho-molecular clocks” in total-evidence dating) to estimate evolutionary trees and divergence times for clades with extant representatives [e.g., [11,12,13,14,15]].

Nearly all relaxed clock Bayesian inference models were initially developed for molecular sequence data to reconstruct epidemiological and/or phylogenetic trees [16, 17], being later co-opted for applications with morphological datasets. As a result, most studies assessing the impact of model choice on species relationships and macroevolutionary parameters (divergence times and evolutionary rates) have only been conducted on molecular datasets (either empirical or simulated) (e.g., [18,19,20,21,22]), with little to no assessment of available models on morphological data. Some exceptions include studies that have included morphological data, but combined with molecular data in total-evidence dating, which creates different phylogenetic inference problems given the interplay between morphological and molecular signals (e.g., [11, 12, 14, 15]). Importantly, several features of morphological datasets make them quite distinct from molecular datasets. Such features include the following: the greater influence of natural selection on the phenotype and its potential impact on developing realistic morphological models [23] the lack of direct comparability of character states among different characters (as opposed to nucleotides in molecular sequences), thus hampering the development of more complex morphological substitution models [24] and the much smaller size of the vast majority of morphological datasets compared to molecular sequences. Therefore, it is expected that the impact of model specification on tree topology, divergence times, and evolutionary rates on morphological datasets may vary substantially from molecular datasets.

Recently, some studies have started to direct their attention to the performance of morphology only datasets in a probabilistic framework, especially so as to compare the performance of software and optimality criteria under various conditions using non-clock Bayesian inference—e.g., [23, 25,26,27,28]. However, very few studies have directed their attention to the performance of model choice using relaxed Bayesian clocks with morphological data only. Some of this pioneer work assessed the impact of uncertainty in fossil ages when calibrating the morphological clock [29, 30] different software/packages on divergence time estimates (using standard parameters) [31], sampling of autapomorphies [32], and using mechanistic tree priors (e.g., fossilized birth-death tree model) instead of the uniform tree prior [33]. Yet, several parameters available for character evolution, tree models, and clock models have never been tested to assess either their individual or combined impact on tree topology, divergence times, and evolutionary rates.

Here, we follow the approach taken by previous molecular studies to assess the impact of model choice (e.g., [20,21,22]), but applied to a new empirical morphological data of sphenodontian reptiles. Sphenodontians are a unique lineage of diapsid reptiles, with an extremely long evolutionary history dating at least as far back as the Middle Triassic at

230Mya [34], but currently represented by a single living species, *Sphenodon punctatus*, inhabiting small islands off the coast of New Zealand [35]. In contrast, their sister clade, squamates (lizards and snakes), is currently represented by

10,650 extant species [36], indicating both lineages had very different evolutionary histories after their split from a common ancestor at about 260–270 Mya [13, 37]. Despite considerable efforts to understand broad scale phylogenetic relationships, divergence times and evolutionary patterns among the various families of squamates (e.g., [13, 38,39,40,41]), there has been comparatively less effort to understand the species level relationships and macroevolutionary patterns in sphenodontians. For instance, *Sphenodon punctatus* has long been characterized as a “living fossil” [42], implying a relatively conserved morphology and low rates of evolution of its lineage for several million years [43]. However, only recently have quantitative tools been used to assess such assumptions, although limited to studying skull or mandibular shape evolution, and finding contrasting results [42, 43]. Furthermore, there has been no attempt to provide a detailed assessment of divergence times for the main sphenodontian clades using relaxed clock models and the FBD model. Finally, there has never been an assessment of evolutionary rates within sphenodontians taking into account information from all body regions, nor contrasting rates of evolution among different subdivisions of their phenotype.

Our study provides a new morphological phylogenetic data matrix of sphenodontian reptiles, constructed under strict criteria to avoid potential sources of biases in morphological datasets, such as logical or biological dependencies among characters, inherent to problematic character constructions [44,45,46]. Using this new empirical dataset (see Additional files 1, 2, 3, 4) and taking into account recent advances in the implementation of morphological clocks (e.g., the benefits of accounting for fossil age uncertainty, sampling autapomorphies, and using mechanistic tree models [29, 32, 33]), we investigated the impact of several additional available prior model choices (Fig. 1) on morphological data to answer the following questions: (1) What is the impact of distinct taxon sampling prior strategies for the FBD model with morphological data only? (2) Under the correct taxon sampling assumption, what is the impact of different clock models on divergence times and evolutionary rates? (3) What is the impact of allowing FBD model parameters to vary across time? (4) What is the impact of partitioned morphological clocks on divergence times and evolutionary rates? and (5) To what extent do the limitations of morphological data (i.e., low number of characters) prevent complex model implementations? We discuss important biases stemming from various model choices that may lead to unreliable divergence times and rates of evolution using morphological clocks. We also compare our results to previous assessments of the behavior of those same model parameters in molecular or combined evidence datasets. Finally, we provide potential sources of correction to alleviate some of the biases introduced by inappropriate modeling of morphological datasets, along with a new hypothesis of sphenodontian relationships and macroevolutionary patterns.

Summary of models and parameters available for morphological characters that were tested and implemented herein. ACRV, among character rate variation Asym, asymmetric state frequencies BI, Bayesian inference FBD, fossilized birth-death model GA *or* ga, Gamma distribution IGR (independent gamma rates uncorrelated clock model) LN *or* ln, lognormal distribution MML (SS), marginal model likelihoods (using the stepping-stone procedure) NoSA, no sampling of ancestors PMF, probability mass function SA, sampling of ancestors SFBD, skyline fossilized birth-death model Sym, symmetric state frequencies TK02 (Thorne and Kishino continuous autocorrelated clock model). See the “Methods” section for additional details and explanation for different models

## Results and Discussion

### Mathematical theory and model

#### Fundamental definitions and assumptions

A genome is treated as a "bag" of coding sequence for protein domains, which we simply call ** domains** for brevity. Domains are treated as independently evolving units disregarding the dependence between domains that tend to belong to the same multidomain protein. Each domain is considered to be a member of a

**(including single-member families). We consider three types of elementary evolutionary events: i) domain**

*family***which generates a new member within a family the principal mechanism of birth is duplication with divergence but additional mechanisms may be considered, including acquisition of a family member from a different species via horizontal gene transfer [32], ii) domain**

*birth,***which results from domain inactivation and/or deletion, and c) domain**

*death,***which generates a new family with one member. Innovation may occur via horizontal gene transfer from another species, via domain evolution from a non-coding sequence or a sequence of a non-globular protein, or via major change of a domain from a pre-existing family after a duplication, which makes the relationship between the given domain and its family of origin undetectable (this latter process formally combines domain birth, death and innovation in a single event). The innovation rate (ν), is considered constant for a given genome. The rates of elementary events are considered to be independent of time (i. e. only homogeneous models are considered) and of the nature (structure, biological function etc.) of individual families.**

*innovation,*In a finite genome, the maximal number of domains in a family cannot exceed the total number of domains and, in reality, is probably much smaller let *N* be the maximal possible number of domain family members. We consider classes of domain families, which have only one common feature, namely the number of members (Fig. (Fig.1). 1 ). Let *f*_{i} be the number of domain families in *i*-th class, i.e. families that are represented by exactly *i* domains in the given genome, *i* = 1,2. *N*. Birth of a domain in a family of class *i* results in the relocation of this family from class *i* to class *i*+1 (decrease of *f*_{i} and increase of *f*_{i+1} by 1). Conversely, death of a domain in a family of class *i* relocates the family to class *i*-1 death of a domain in class 1 results in the elimination of the corresponding family from the given genome, this being the only considered mechanism of family death. We consider time to be continuous and suppose it very unlikely that more than one elementary event occur during a short time interval formally, the probability that more than one event occurs during an interval Δ*t* is o(Δ*t*).

Domain dynamics and elementary evolutionary events under BDIM.

#### The formulation of the model

##### The simple BDIM

Let us formulate the following **independence assumption**: i) all elementary events are independent of each other ii) the rates of individual domain birth (λ) and death (δ) do not depend on *i* (number of domains in a family). Under this assumption, the instantaneous rate, at which a domain family leaves class *i*, is proportional to *i* and the following simple BDIM describes the evolution of such a system of domain family classes:

Similar models have been considered previously in several different contexts [33 v. 1, ch. 17, 34]. We will see in 3.2 that the solution of model (2.1) evolves to equilibrium, with a unique distribution of domain family sizes, *f*_{i}

1/*i*. Thus, under the simple BDIM, if the birth rate equals the death rate, the abundance of a domain class is inversely proportional to the size of the families in this class. When the observations do not fit this particular asymptotic (as observed in several studies on distributions of protein family sizes), a different, more general model needs to be developed.

##### The Master BDIM

A more general BDIM emerges when the independence assumption is abandoned. Instead of constructing specific hypotheses regarding the dependence between the elementary events, let us simply suppose that the domain birth and death rates for a family of class *i* do not necessarily show proportionality to *i*. For the general case, we designate these rates, respectively, λ_{i} and δ_{i} in the specific case of the simple BDIM (2.1), λ_{i} = λ*i* and δ_{i} = δ*i*. Then we have the following ** master BDIM**:

Let *F*(*t*)= *f*_{i}(*t*) be the total number of domain families at instant *t* it follows from (2.2) that

The system (2.2) has an equilibrium solution *f*_{1}. *f*_{N} defined by the equality d*f*_{i}(*t*)/d*t* = 0 for all *i* this solution is described below under Proposition 1. Accordingly, there exists an equilibrium solution of equation (2.3), which we will designate *F*_{eq} (the total number of domain families at equilibrium). At equilibrium, ν = δ_{1}*f*_{1}, i.e. the processes of innovation and death of *single* domains (more precisely, the death of domain families of class 1, i.e. singletons) are balanced.

Applying this identity to *p*_{i}(*t*) and rewriting equation (2.3) in the form

we obtain the following model for frequencies of the domain family (*master BDIM for frequencies*), which is equivalent to (2.2):

System (2.4) should be solved together with equation (2.3).

##### The Master BDIM and Markov processes

Let us note that system (2.4) for frequencies is *non-linear,* so it is not a system of Kolmogorov equations for state probabilities of any homogeneous Markov process. Let us further suppose that a genome had ample time to arrive at an equilibrium with respect to the total number of domain families, such that *F*(*t*) = *F*_{eq}. This does not imply d*p*_{i}(*t*)/d*t* = 0 or d*f*_{i}(*t*)/d*t* = 0 in other words, the system might rearrange the frequencies of individual families, although the total number of families remains stable. If *F*(*t*) = *F*_{eq}, the master system (2.4) turns into

System (2.5) can be rewritten as a matrix equation

It is easy to see that the sum of elements of each row (except for the first one) of the matrix *Q* is equal to ν/*F*_{eq} > 0. Therefore the matrix *Q* cannot be a matrix of transition rates for any Markov process (the sum of elements of each row of a matrix of transition rates for Markov process with continuous time should be non-positive [33 v. 1, ch.17, s. 8, 35 v. 2, ch. 3, s. 2] in other words, there is no Markov process with continuous time and state space <1,2. *N*> whose state probabilities satisfy system (2.5).

Thus, neither the initial BDIMs (2.1) or (2.2) nor the equilibrium model (2.5) can be described by any Markov process with continuous time.

**Remark**. If, in system (2.5), ν = 0, then this system turns into a system of state probabilities for a Markov birth and death process with continuous time.

#### Equilibrium in BDIMs

##### Equilibrium sizes and frequencies of the domain family system

Let us suppose that the genome had ample time to arrive at a complete equilibrium state, in which not only d*F*(*t*)/d*t* = 0, but also d*f*_{i}(*t*)/d*t* = 0 for all *i.* Thus, the equilibrium sizes of domain families *f*_{i} satisfy the system

It should be emphasized that the master model does not assign *a priori* the value of *F*_{eq} this value has to be computed depending on the model parameters.

The following statement is central for further analysis.

**Proposition 1**. *The master BDIM (2.2) has a unique equilibrium state* (*f*_{1}. *f*_{N}), *which is the sole solution* of *system (3.1):*

The unique equilibrium state (3.2) is globally asymptotically stable.

This proposition ascertains that all evolutionary trajectories of the system (2.2) exponentially (with respect to time) approach the equilibrium state (3.2). The proof is given in the Mathematical Appendix.

**Remark**. Let us denote the ratio of the birth rate to the innovation rate

and the ratio of the death rate to the innovation rate

Then, according to Proposition 1, for any BDIM in equilibrium,

The principal goal of the treatment that follows is the analysis of the asymptotic behavior of equilibrium frequencies and sizes of domain families (*f*_{1}. *f*_{N}) at large *N*. We will differentiate two cases of asymptotic behavior according to the following

##### Equilibrium frequencies for the simple BDIM

**Definition**. A simple BDIM is *balanced* if θ = λ/δ = 1, i.e. if the rates of individual domain birth and death are equal.

Let us recall that a random discrete variable ξ has the *logarithmic* distribution with parameter θ < 1 if

A random variable ξ has the *truncated logarithmic* distribution with parameter θ if

*that is, the equilibrium frequencies have the truncated logarithmic distribution if* θ < 1.

2) *If a simple BDIM is balanced, then*

The proof is given in the Mathematical Appendix.

Thus, a simple BDIM can have equilibrium frequencies only of the form *p*_{i} = *C*θ *i* /*i*, where *C* = *const* and θ is the distribution parameter. In particular, the equilibrium frequencies for a balanced simple BDIM have the power distribution with the degree equal to -1.

Simple methods exist for preliminary graphical estimation of the single distribution parameter θ [36 ch. 7, s. 7]. We will prove in the following section that, if we observe a power asymptotic for empirically observed equilibrium frequencies, then (assuming that the system can be described by a BDIM), the rates λ_{i} and δ_{i} should be asymptotically equal at large *i*. If, additionally, the degree of the asymptotic is *not equal* to -1, then the system dynamics *cannot* be described by a *simple* BDIM. In this case, it is necessary to consider more general models, such as the Master BDIM (2.2).

##### Asymptotic behavior of equilibrium frequencies of a Master BDIM: Main Theorems

Let us consider the master BDIM (2.2) we showed in 3.1 that its equilibrium frequencies are the solution of the system

The following theorem gives all possible types of asymptotic behavior of the equilibrium frequencies and defines the connections between these asymptotics depending on model parameters. In particular, if there is no information on the exact form of dependence of the rates of birth and death of domains on the size of a domain family, the theorem can be used to qualitatively describe the dynamics of the asymptotic behavior of the equilibrium frequencies.

We will prove that the asymptotic behavior of a solution of system (3.9) is completely defined by the asymptotic relation between λ_{i} and δ_{i}. More precisely, let us define a function χ (*i*)= λ_{i-1}/δ_{i} we consider only functions of power growth, i.e. χ (*i*)

*i* *s* at *i*→∞ for a real *s*. We will see that this is not a serious restriction because the most realistic situations correspond to the case of *s* = 0. So, let us suppose that, for large *i*, the following expansion is valid:

where *s, a* are real numbers and θ > 0. Evidently, if *s* ≠ 0, χ (*i*) tends either to 0 (*s* < 0) or to ∞ (*s* > 0) with the increase of *i*.

**Definition**. Let us refer to a BDIM (2.2), (3.10) as

*ii. first-order balanced*, if *s* = 0 and θ ≠ 1, i.e.

*iii. second-order balanced*, if *s* = 0, θ = 1 and *a* ≠ 0, i.e.

*iv. high-order balanced*, if *s* = 0, θ = 1 and *a* = 0, i.e.

We will show that the first three coefficients, *s*, θ and *a,* of asymptotic expansion (3.10) for χ (*i*) = λ_{i-1}/δ_{I} exactly specify all possible asymptotic behaviors of BDIM equilibrium frequencies.

**Theorem 1.** *The equilibrium frequencies p*_{i} *of BDIM (2.2) have the following asymptotics*

i. if the model is non-balanced, then

ii. if the model is first-order balanced, then

iii. if the model is second-order balanced, then

iv. if the model is high-order balanced, then

The proof is given in the Mathematical Appendix. The classification of BDIM according to the order of balance is illustrated in Fig. Fig.2 2 and the asymptotics for different types of BDIMs are shown in Fig. Fig.3 3 .

Different orders of balance in BDIMs.

Asymptotics of equilibrium distributions for balanced BDIMs of different orders.

It follows from this theorem that, if a BDIM is non-balanced, then its equilibrium frequencies *p*_{i} (and equilibrium family sizes *f*_{i}) increase or decrease extremely fast (hyper-exponentially) with the increase of *i*. In contrast, if a BDIM has a non-zero order of balance, asymptotic behavior is observed.

**Corollary 1**.*For a first-order balanced BDIM with* θ < 1,

*i. if a* > -1, *the equilibrium frequencies p*_{i} *follow Pascal distribution with parameters* (*a*+1,θ)

*ii. if a = -1, the equilibrium frequencies follow truncated logarithmic distribution with parameter* θ

*iii. if a = 0, the equilibrium frequencies follow geometric distribution with parameter* θ.

The following implication of Theorem 1 is of principal interest.

**Corollary 2**. *Equilibrium frequencies of a BDIM have a power asymptotic behavior if and only if the BDIM is second-order balanced.*

**Corollary 3**. *For high-order balanced BDIM, if* λ_{i-1}/δ_{i} = 1 *for all i, the only possible distribution of equilibrium frequencies is uniform, p*_{i} = *const for all i. Moreover, even if* λ_{i-1}/δ_{i} = 1 + *O*(1/*i* 2 ), *the equilibrium frequencies asymptotically tend to the uniform distribution.*

#### Rational BDIM

*Rational* models comprise a general class of BDIM (Fig. (Fig.4), 4 ), for which the asymptotic behavior of the equilibrium frequencies and equilibrium sizes of domain families can be completely investigated.

The hierarchy of BDIM types.

Let us suppose that the birth and death rates are of the form

We will refer to BDIM (2.2.), (4.1) as *rational* BDIM.

It is known that a wide class of mathematical functions can be well approximated by rational functions of the form (4.1) (see, e.g. [37]).

The following theorem describes all possible asymptotic behaviors of the equilibrium frequencies of a rational BDIM. Let us denote

**Theorem 2**. *The equilibrium sizes of domain families of a rational BDIM have the following asymptotics*

The proof is given in the Mathematical Appendix.

**Corollary 1**. *If* η = 0, *then the rational BDIM is first-order balanced and the sequence of equilibrium numbers of domain families* <*f*_{i}> *has a power-exponential asymptotics*

*In particular, if ρ - β > -1, the equilibrium frequencies p _{i} follow the Pascal distribution with parameters* (ρ - β + 1, θ)

*if* ρ - β = -1, *then frequencies p*_{i} *follow the truncated logarithmic distribution*

**Corollary 2**. *The equilibrium sizes of domain families f _{i} and equilibrium frequencies p_{i} for a rational BMID have the power asymptotics if and only if η = 0 and λ = δ, i.e. the BDIM is second-order balanced, in which case*

Formula (4.4) gives the asymptotics for the equilibrium sizes of domain families *f*_{i} and, accordingly, for the total number of families *F*_{eq}. The exact expressions for these quantities are given in the proofs of Theorem 2 and Lemma (see Mathematical Appendix).

**Proposition 3**.

i. The equilibrium sizes of domain families f_{i} of a balanced (first or higher order) rational BDIM are

ii. The total number of domain families at equilibrium is

For the rational, second-order balanced BDIM, the ratio of the birth rate to the innovation rate is

The asymptotic formulas for equilibrium frequencies of rational BDIM could be considered as particular cases of the corresponding formulas of general theorem 1. Proposition 3 allows one to calculate the constants in the corresponding asymptotic formulas for the sizes of domain families for a rational BDIM. If only equilibrium frequencies are analyzed, the values of these constants become irrelevant because they contract. However, if the actual values of *f*_{i} and *F*_{eq} are of interest, the values of the constants are required.

#### Properties of the main types of rational BDIM

##### Simple BDIM

As shown above, a simple BDIM can have equilibrium frequencies only of the form *p*_{i} = *C*θ *i* /*i*, *C* = *const*in particular, if the distribution parameter θ < 1, we get the (truncated) logarithmic distribution. Logarithmic distributions are seen in many biological contexts, e.g., the distribution of species by the number of individuals in populations or, what is more relevant, the distribution of protein folds by the number of families per fold [38]. Thus, a simple BDIM could be potentially used for modeling the dynamics of biological systems with a logarithmic distribution of equilibrium densities. We examine this possibility in greater detail starting with the case λ = δ (second-order balanced simple BDIM).

We can extract from Proposition 2 some additional information, which could be helpful for estimating the model parameters. It is known that

More precisely, the approximation 1/*i* = ln*N* + *C*_{E} + *N* -1 /2 - *N* -2 /12 has an error less then 10 -6 for *N* > 10. Thus, from (3.7), we obtain an interesting formula

This means that, in the equilibrium state of the system, the total number of domain families grows only slowly (

ln *N*) with the increase of the maximal number (*N*) of domains in a family (which is equal to the maximal possible number of domain family size classes).

Furthermore, according to equation (2.3), in the equilibrium state of a simple BDIM ν/δ = *f*_{1}, so we have

Formula (5.1) can be used for estimating the model parameters on the basis of empirical data.

In the more general case λ ≠ δ, we can also obtain an estimate of the rate of innovation ν. If λ < δ (θ < 1), then the series in the right part of (3.5) quickly converges,

so -ln(1-θ)/θ is a good approximation for the sum θ *i*-1 /*i* for large *N*. Then

Taking into account that ν/δ = *f*_{1} (2.3), we have a relation

which allows the parameter θ to be estimated on the basis of empirical data.

If *N* can be estimated independently and is not very large, we can use more exact relations:

Further, if (1-θ)*N* is small (i.e., θ is very close to 1), then the approximation

has an error less then [*N*(1-θ)] 2 /4 and, in this case,

For the simple BDIM, the ratio of the rate of duplications to the innovation rate is

If the simple BDIM is the 2 nd order balanced, θ = 1, then *G*(*N*) = *N* - 1.

Thus, for the simple, second-order balanced BDIM, the number of duplications per time unit is *N*-1 times greater than the number of innovations.

The total number of domains in the equilibrium state for the simple BDIM is

If a simple BDIM is second-order balanced, then *G*(*N*) = ν/λ *N*.

##### Linear BDIM

We saw that the assumption of independence of birth and death rates of individual domains on each other and on the size of domain families is incompatible with any power distribution of the equilibrium frequencies with the degree not equal to -1. The simplest case of a BDIM, which can have, depending on the parameters, three types of asymptotic behavior described by Theorem 1 (excluding the first one, hyper-exponential, which corresponds to a non-balanced BDIM all linear BDIMs are balanced) and, in particular, any power asymptotics, is a model with *linear birth and death* rates of the form:

The parameters *a* and *b* account, in the simplest possible form, for the deviation of the domain birth and death rates from those under the independence assumption. More precisely, according to (5.7), the average birth rate per domain in a family of size *i* is λ_{i}/*i* = λ + λ*a*/*i.* So, for small *i,* the average birth rate is close to λ + λ*a*, whereas, for large *i,* it tends to λ. Similarly, the average death rate changes from δ + δ*b* in a small family to the limit value δ in a large family. Thus, if *a* and *b* are positive (which seems to be the case for the available data see below), both the birth rate and the death rate per domain decrease with the increase of the class number (size of the respective domain families) conversely, if *a* and *b* are negative, these rates increase with the class number (Fig. (Fig.5 5 ).

Dependence of per domain birth and death rates on the domain family size for the second-order balanced linear BDIM.

Corollary 1 of Theorem 2 implies that equilibrium frequencies *p*_{i} of a linear BDIM have asymptotics

In particular, if λ ≠ δ and *a* = *b*, the linear BDIM is first-order balanced and the equilibrium frequencies *p*_{i} follow the logarithmic distribution (in this case, the linear BDIM is asymptotically equivalent to the simple BDIM). If λ = δ, the linear BDIM is second-order balanced and the equilibrium frequencies *p*_{i} follow the power distribution

Thus, the dependence of the domain frequency on the family size is actually determined by the difference *a* - *b*. If *a* >*b*, the birth rate decreases faster than the death rate with the increase of family size, i. e. there seems to be a "competition" between domains in a family in contrast, if *a* <*b*, the death rate drops faster, i.e. a "synergy" between domains appears to exist (Fig. (Fig.4 4 ).

More detailed information can be obtained using Proposition 4:

i) for a first-order balanced linear BDIM, the equilibrium sizes f_{i} of domain families are

and the total number of domain families at equilibrium is

*ii) for a second-order balanced linear BDIM* (θ = 1),

According to (2.3), in the equilibrium state of a linear BDIM, *f*_{1} = ν/δ_{1} = ν/(δ(1 + *b*)) and so, for a *second-order balanced* linear BDIM, we have the formula

Suppose that equilibrium frequencies obtained from empirical data follow the power distribution *p*_{i}

*i* -γ in this case, -γ is the slope of the empirical curve (ln*f*_{i} versus ln*i*) and can be estimated from the data. Assuming that the system is well described by a linear BDIM, it follows from (5.9) that *a* - *b* = 1 - γ and λ = δ. Thus,

where *a* is the single free parameter.

For the linear second-order balanced BDIM, the ratio of the birth rate to the innovation rate is

The case 1 + *a* - *b* = 0 (slope of the asymptote in double logarithmic coordinates equal to *a* - *b* - 1 = -2) is a critical one.

The total number of domains in the equilibrium state for a second-order balanced linear BDIM is

If the slope of the asymptote γ = -1, the linear second-ordered BDIM shows the same asymptotic behavior as a simple BDIM (2.1), but behaves differently at small *i*. If γ ≠ -1, the system cannot be described by a *simple* BDIM even asymptotically, but can be described by a linear BDIM. As indicated above, in this case, the average per-domain birth and death rates depend on the size of the domain family and the difference *a*-*b* characterizes this dependence.

##### Quadratic BDIM

The linear BDIM takes into account the dependence of average birth and death rates of individual domains on the size of domain family, but does not imply a specific form of interaction between domains. Let us consider the simplest, pairwise interaction, which leads to λ_{i}

*i* 2 , i.e. one or both rates are polynomials on *i* of the second degree. If these degrees are different (i.e., λ_{i}

*i* 2 ), then the corresponding BDIM is non-balanced and equilibrium frequencies have hyper-exponential asymptotics. Thus, let

According to theorem 3 and Proposition 3, the quadratic BDIM with rates (5.13) has equilibrium sizes of domain families

Note that the asymptotic behavior of frequencies *p*_{i} does not depend on free coefficients *r*_{2}, *q*_{2} in (5.13), but only on θ and *r*_{1}-*q*_{1} (as follows from (5.14)), although the values of *f*_{i} are proportional to the constant *c*_{2}, which could depend on the free coefficients *r*_{2}, *q*_{2}. Let us consider the case *r*_{2} = *q*_{2} = 0 in more detail.

Formula (5.17) allows estimating parameter θ from empirical data if *N* is large enough.

More precisely, *F*_{eq} = ν/λ θ *j* /*j* 2 = ν/λ (Polylog(2,θ)-θ 1+*N* LerchPhi(θ,2,1+*N*)), where LerchPhi is a special function (these and other special functions used below can be computed using program packages Mathematika or Maple).

If, additionally, θ = 1 (the BDIM is second-order balanced), then

From formulas (5.8), (5.15), we can extract some additional information, which could be helpful for estimating the model parameters at relatively small *N*. Let us recall definitions of some special functions.

The function PolyGamma(*n*,z) is *n* *th* derivative of φ(z), PolyGamma(*n*,z) = d *n* φ(z)/dz *n* , such that φ(z)= PolyGamma(0,z).

which can be used for estimating unknown parameters of the model.

The values of PolyGamma(1,*x*) are tabulated and can be computed using standard program packages for a rough preliminary estimate, PolyGamma(1,*x*) = 1/*x*+1/2*x* 2 +O(1/*x* 4 ).

##### Polynomial BDIMs

The quadratic models take into account the dependence of birth and death rates of individual domains on the simplest, pairwise interactions. If interactions of higher orders are postulated, λ_{i}

*Q*_{m}(*i*), where *P*_{n}(*i*), *Q*_{m}(*i*) are polynomials on *i* of the *n*-th and *m*-th degrees. Again, if the degrees *n* and *m* are different, then the BDIM is non-balanced and equilibrium frequencies have hyper-exponential asymptotics. Thus, let *n* = *m,*

According to Theorem 3, the polynomial BDIM with rates (5.21) has equilibrium sizes of domain families with *power-exponential asymptotics*

In particular, if ρ - *m* > -1, the equilibrium frequencies *p*_{i} follow the Pascal distribution with parameters (ρ - *m* + 1, θ)

if ρ - *m* = -1, the equilibrium frequencies *p*_{i} follow the (truncated) logarithmic distribution

if ρ - *m* = 0, the equilibrium frequencies *p*_{i} follow the geometric distribution

if λ = δ, the polynomial BDIM is second-order balanced and the equilibrium frequencies *p*_{i} follow the power distribution

Note that the degree of the power distribution (5.23) depends only on *m,* the common degree of the polynomials (5.21), and on ρ, the difference between the coefficients *r*_{1} and *q*_{1}, and does not depend on other coefficients. In particular, if *r*_{1} = *q*_{1}, then *p*_{i}

*i* -*m* . This relation could be interpreted as follows: if the first two coefficients of polynomial rates λ_{i} and δ_{i} are equal, then the degree of the power distribution (5.19) is equal to the "order of interactions" of domains.

Then (see Proposition 3) the equilibrium numbers of domain families *f*_{i} of the polynomial BDIM (5.18) are

This formula can be used for estimating the model parameters.

For the polynomial second-order balanced BDIM, the ratio of the death rate to the innovation rate is

#### Approximation of the observed domain family size distributions in prokaryotic and eukaryotic genomes with different BDIMs

Having developed the mathematical theory of BDIMs, we sought to determine which of these models, if any, adequately described the empirical data on domain family size distribution. To identify the domain sets of domains encoded in each of the genomes, the CDD library of position-specific scoring matrices (PSSMs), which includes the domains from the Pfam and SMART databases, was used in RPS-BLAST searches [12] against the protein sequences from a set of completely sequenced eukaryotic and prokaryotic genomes http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome. The CDD domain library is partially redundant, so when the results obtained from individual PSSMs showed significant overlap (more than 50% of hits overlapped for more than 50% of their length), the corresponding domains were examined case-by-case for redundancy. PSSMs representing structurally similar domains and producing overlapping lists of hits were joined into "synonymy clusters".

The results of RPS-BLAST searches against the sets of protein sequences from individual genomes were interpreted as follows: non-overlapping hits to the same protein were treated independently among overlapping hits, only the strongest one (lowest *E*-value) was recorded all hits from a synonymy cluster were assigned to one representative domain family. The number of hits that a domain family had in a genome, with the cut-off of *E* = 0.001, was recorded as the number of domains of the given family encoded in the given genome. The CDD domain library certainly does not include all existing domains. In practice, domains from this collection were detected in 㹐% in each of the analyzed species, with the only exception of human, for which the analyzed protein set is likely to contain a substantial fraction of false predictions (Table (Table1 1 ).

### Table 1

Domain families in sequenced genomes and parameters of the best-fit second-order balanced linear BDIM

a: *f*_{1}(*f*'_{1}), the observed (predicted) number of domains in class 1 (represented only once in the genome) *a*, *b*, parameters of the second-order balanced linear BDIM *k*, slope of the power asymptotic ν/δ = ν/λ, ratio of the innovation rate to the per domain death (birth) rate *G*, ratio of the innovation rate to the total (per genome) birth rate. b: Species abbreviations: Sce, *Saccharomyces cerevisiae*, Dme, *Drosophila melanogaster*, Cel, *Caenorhabditis elegans*, Ath, *Arabidopsis thaliana*, Hsa, *Homo sapiens*, Tma, *Thermotoga maritima*, Mth, *Methanothermobacter thermoautotrophicum*, Sso, *Sulfolobus solfataricus*, Bsu, *Bacillus subtilis*, Eco, *Escherichia coli*.

To enable statistical analysis using the χ 2 -method for the entire range of the data, including the sparsely distributed classes corresponding to large families, the data needed to be combined. For each genome, the observed domain family frequencies were grouped into bins, each containing at least 10 families typically, bins corresponding to families with small number of members included a single size class (e.g. all single-member families, two-member families etc), whereas bins corresponding to large families may span hundreds of size classes, most of them empty. Theoretical distribution values for a bin combining observed data from *m*-th to *n*-th class were computed as , where *f*'_{i} is the predicted number of families in the *i*-th class and depends on the model parameters. Since the model displays only a weak dependence on the maximum number of domains in a family (*N*), instead of including *N* as a model parameter, the sum (where *i*_{max} is the number of domains in the most abundant of the detected families), was normalized to equal the total number of families detected in the given genome (a requirement for the χ 2 analysis). χ 2 values were computed to measure the quality of fit between the observed and the theoretical distributions. The distribution parameters (θ for the simple BDIM, *a* and *b* for the second-order balanced linear BDIM) were adjusted to minimize the χ 2 value.

The simplest model that resulted in a good fit to the observed domain family size distributions was the second-order balanced linear BDIM (Fig. (Fig.6 6 , ,7 7 , ,8 8 , ,9 9 , ,10 10 , ,11 11 , ,12 12 , ,13 13 , ,14 14 , ,15). 15 ). For all analyzed genomes, *P*(χ 2 ) for this model was Ϡ.05, i.e. no significant difference between the model predictions and the observed data was detected. Considering the first-order balanced linear BDIM, which involves varying the parameter θ, did not result in a significant improvement of fit for any of the analyzed genomes (data not shown). In contrast, a fit to a truncated logarithmic distribution (prediction of a simple BDIM) failed for all genomes (*P*(χ 2 ) < 10 -5 Fig. Fig.16, 16 , ,17, 17 , and data not shown). An exact power-law distribution, which is often used to approximate protein family frequency distributions, similarly failed to adequately fit the observed data, even when the most deviant class 1 families were excluded (*P*(χ 2 ) = 0.0013 for *T. maritima* *P*(χ 2 ) < 10 -5 for the rest of the genomes Fig. Fig.16, 16 , ,17 17 and data not shown). Notably, second-order balanced linear BDIM results in a correct prediction of the number of very large families, whereas simple BDIM systematically underestimates the number of families in the highest bins. Conversely, the power-law fit underestimates the slope of the best-fit line (in double logarithmic coordinates) compared to the asymptote of the linear BDIM prediction and, accordingly, significantly overestimates the number of families in the highest bins (Fig. (Fig.16, 16 , ,17). 17 ). These results are compatible with the recent observation that the domain family size distributions are better described by the generalized Pareto distribution than by power laws [31].

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the yeast *Saccharomyces cerevisiae*. A. Distribution of the size of domain families grouped into bins B. Domain family size distribution in double logarithmic coordinates. Magenta line: *f*_{i} = 11521Γ(*i*+1.55)/Γ(*i*+4.27) C. Cumulative distribution function of domain family size. The line shows the prediction of the second-order balanced linear BDIM.

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the fruit fly *Drosophila melanogaster*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 5258Γ(*i*+1.62)/Γ(*i*+3.79)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the nematode worm *Caenorhabditis elegans*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 2453Γ(*i*+1.13)/Γ(*i*+3.03)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thale cress *Arabidopsis thaliana*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 10750Γ(*i*+3.80)/Γ(*i*+5.98)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: *Homo sapiens*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 22030Γ(*i*+5.16)/Γ(*i*+7.43)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic bacterium *Thermotoga maritima*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 4256Γ(*i*+0.14)/Γ(*i*+3.22)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the thermophilic euryarchaeon *Methanothermobacter thermautotrophicus*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 2753Γ(*i*+0.12)/Γ(*i*+3.00)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the hyperthermophilic crenarchaeon *Sulfolobus solfataricus*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 2714Γ(*i*+0.36)/Γ(*i*+3.04)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium *Bacillus subtilis*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 3489Γ(*i*+0.48)/Γ(*i*+3.01)

Fit of empirical domain family size distributions to the second-order balanced linear BDIM: the bacterium *Escherichia coli*. The panels and the designations are as in Fig. Fig.6. 6 . B. Magenta line: *f*_{i} = 6776Γ(*i*+0.84)/Γ(*i*+3.54)

Comparison of different approximations of the empirical domain family size distribution: *Escherichia coli*. Magenta line: second-order balanced linear BDIM, *f*_{i} = 6776Γ(*i*+0.84)/Γ(*i*+3.54), Red line: simple BDIM, *f*_{i} = 528 × 0.87 *i* /*i*, Cyan line: power law, *f*_{i} = 602*i* -1.76 .

Comparison of different approximations of the empirical domain family size distribution: *Arabidopsis thaliana*. Magenta line: second-order balanced linear BDIM, *f*_{i} = 10750Γ(*i*+3.80)/Γ(*i*+5.98), Red line: simple BDIM, *f*_{i} = 344 × 0.98 *i* /*i*, Cyan line: power law, *f*_{i} = 516*i* -1.36 .

Fitting the observed domain family size distribution with the second-order balanced linear BDIM resulted in positive values of the parameters *a* and *b*, with *a* <*b*, for all analyzed genomes (Table (Table1). 1 ). Accordingly, domain family size distributions in all cases asymptotically tend to the power law with the power *k* < -1 and, for all species with the exception of *C. elegans*, *k* < -2 (Table (Table1 1 and Fig. Fig.8). 8 ). As discussed above, this seems to indicate the existence of "synergy" between domains in a family whereby the likelihood of survival is greater for a domain that belongs to a large family than for a domain from a small family (Fig. (Fig.5). 5 ). For all species, we find that the innovation rate is approximately three orders of magnitude greater than the per domain birth (death) rate. Accordingly, the total per genome birth (duplication) rate is comparable to but, typically, several times greater than the innovation rate (Table (Table1). 1 ). The ratio of the per genome birth rate to the innovation rate increases with the number of genes in a genome or the number of detected domains, with nearly identical rates seen for small prokaryotic genomes and values as high as 20 for the largest plant and animal genomes (Table (Table1 1 ).

The data used to fit the BDIM typically included 50�% of the proteins encoded in a given genome (Table (Table1) 1 ) the remaining proteins were not represented by sufficiently similar domains in the current CDD collection. It cannot be ruled out that the fit would be significantly affected as a result of including all proteins encoded in the genome, in case the proteins currently not recognized in CDD searches have a family size distribution substantially different from that of the recognized ones. However, second-order balanced linear BDIM can accommodate considerable perturbations of the distribution through adjustment of the parameters, so we believe that this model is likely to approximate well also the size distribution of domain families for complete sets of proteins encoded in a genome. An alternative approach that at least partially circumvents the sampling problem involves analysis of all families of paralogs detectable using clustering by sequence similarity, with employing a predefined library of domains this analysis is beyond the scope of the present work but may be a subject of further investigation.

### General discussion and conclusions

Here, we presented a complete mathematical description of the size distribution of protein domain families encoded in genomes for simple but not unrealistic models of evolution, which include three types of events: domain duplication (birth), domain elimination (death), and domain innovation. In biological terms, innovation could involve gene acquisition via horizontal gene transfer, emergence of a new domain from a non-coding sequence or a non-globular protein sequence, or major modification of a domain obliterating its connection with a pre-existing family. Innovation via horizontal gene transfer appears to be particularly common in prokaryotes [32,39], which might account for the apparent higher relative innovation rate in prokaryotic genomes observed in the present analysis (Table (Table1 1 ).

We showed that birth-death-innovation models (BDIMs) with different levels of complexity lead to readily distinguishable predictions regarding the distribution of domain family sizes in genomes. In particular, we defined the exact analytic conditions that lead, exactly or asymptotically, to power law distributions, which have recently received ample attention, as they were uncovered in various biological and social contexts [20,25]. In contrast to previous analyses [16,17,30] but in agreement with the results of a recent re-examination [31], we showed that the power law only asymptotically approximates the domain family size distributions.

Three groups of observations made in this work seem to have the greatest potential of enhancing our understanding of genome evolution and, perhaps, evolution of other complex systems. First, we proved that, within the BDIM framework, there is a unique equilibrium state of the system, which is approached exponentially, with respect to time, from any initial state. In this equilibrium state, the number of domain families in each size class remains constant and follows a unique distribution depending on the type and parameters of the BDIM. In particular, power asymptotics emerges when and only when a BDIM is second-order balanced, i.e. the rates of domain birth and death are asymptotically equal. Since we showed that the observed size distributions of domain families in all analyzed genomes indeed tend to power law asymptotics, the results are compatible with the notion that the genomes are close to a steady state with respect to the domain diversity (*F*_{eq}, the number of distinct domain families at equilibrium, under the using the BDIM convention) and distribution (*f*_{i}). Taking a broader biological perspective, this result might indicate that evolving lineages go through lengthy periods of relative stasis when the level of genomic complexity remains more or less the same. Under this view, the stasis epochs are punctuated by relatively short periods of dramatic changes when the complexity either greatly increases (the emergence of eukaryotes is the most obvious case in point) or decreases (e.g. evolution of parasites). These bursts of evolution might be described as transitions between different BDIMs in the parameter space, with some of the trajectories potentially involving non-balanced BDIMs. The analogy between this emerging picture of genome evolution and the punctuated equilibrium concept of species evolution, which has been developed through analysis of the paleontological record [40], is obvious.

Second, we showed that the simplest model that adequately describes the observed domain family size distributions is the second-order balanced linear BDIM in contrast, simple BDIMs do not show a good fit to the observations. This has potentially important implications for the mode of domain family evolution. Simple BDIMs are based on the notion that the likelihood of duplication (birth) or elimination (death) of a domain is uniform across the genome and does not depend on the size or other characteristics of domain families (the independence assumption). Clearly, under the independence assumption, a duplication (birth) as well as elimination (death) of a domain is more likely to occur in a large family than in a small one, but only because the overall probability of such an event is proportional to the number of family members, whereas the birth (death) rate *per domain* remains the same. The key observation of this work, that the actual domain frequency distributions are well described by a linear but not by a simple BDIM, suggests that the independence assumption is an oversimplification. Instead, the linear BDIM includes parameters that describe the dependence of the per domain birth (death) rate on the family size. The asymptotics of the theoretical distribution that is the best fit for the actual data is a power law, with the power equal to *a*-*b*-1, where *a* and *b* are the parameters of a linear BDIM. We observed that, for all analyzed genomes, *a*-*b*-1 < -1 (*a* <*b*), which corresponds to "synergy" between domains in a family. Both the domain birth rate and the death rate drop with the increase of the size of a domain family, but the death rate decreases faster (Fig. (Fig.5). 5 ). In general terms, this suggests that small families are more dynamic during evolution than large families. In particular, under the BDIM formalism, innovation contributes only to single-member families (class 1), which have the greatest evolutionary mobility, and either quickly proliferate and are stabilized or perish. An implication of these observations is that, in general, large families are older than small ones. Exceptions to this generalization, i.e. the existence of small, ancient families, probably point to selection for a specific family size for example, it seems likely that selection acts against proliferation of certain essential proteins, e.g. ribosomal proteins, which typically form single-member families [41]. Another pertinent observation is that the linear BDIM seems to adequately accommodate even the largest of the identified domain families. Lineage-specific expansion of paralogous families appears to be one of the principal modes of organismic adaptation during evolution [13,14,42]. Thus, quantitatively, adaptive family expansion appears to fit within the BDIM framework, although these models do not explicitly incorporate the notion of selection. Of course, for BDIMs, it is irrelevant which families expand, and this choice is determined by selection.

Third, the BDIM equilibrium condition with respect to the total number of domain families, ν = δ_{1}*f*_{1} (ν is the innovation rate, δ_{1} is the domain death rate for class 1 families, and *f*_{1} is the number of domain families in class 1) allows us to estimate the ratio between domain innovation rate and the domain death and birth rates. Indeed, according to the above and the definition of a second-order linear BDIM, which is the best fit for the actual data, λ = δ = ν/*f*_{1}(1+*b*). Since the number of domain families in class 1 (families with only one member) is in the hundreds for each genome, this translates into an innovation rate that is much greater than the duplication or elimination rate *per domain* (Table (Table1). 1 ). Such high innovation rates might appear counter-intuitive, but let us note that the duplication rate over all domain families is a number that tends to be nearly identical to ν for small prokaryotic genomes and several-fold greater than ν for large eukaryotic genomes (Table (Table1). 1 ). Thus, under the second-order balanced linear BDIM, the likelihood of appearance of a new domain in a genome is close to or several times less than the likelihood of a duplication or elimination of an existing domain. Nevertheless, the finding that the innovation rate is comparable to the overall duplication/elimination rate seems surprising. If the linear BDIM is indeed a realistic evolutionary model, this emphasizes the critical role of innovation in maintaining the balance (steady state) in genome evolution. In prokaryotes, innovation via horizontal gene transfer appears to be particularly extensive [32,39], which might underlie the greater relative innovation rate in these organisms (Table (Table1 1 ).

As already indicated, BDIMs do not explicitly incorporate selection. However, the present analysis shows that only models with precisely balanced domain birth, death and innovation rates can account for the observed distribution of domain family size in each of the analyzed genomes. It seems likely that the balance between these rates is itself a product of selection. There is little doubt that BDIMs will be eventually replaced by more sophisticated formalisms that will more realistically capture the mechanisms of genome evolution. Nevertheless, even the crude modeling described here seems to reveal several potentially interesting and non-trivial aspects of the evolutionary process.

### Mathematical Appendix. Proofs of some statements

#### Proof of Proposition 1

When system (3.1) is solved consecutively from the last equation to the second one, it becomes obvious that the solution is unique up to a constant multiplier.

Since system (2.2) is linear, the equilibrium state (*f*_{1}. *f*_{N}) is asymptotically stable if the real parts of all characteristic values of the matrix

The following theorem (see [43]) gives the desired criterion: the real part of all the characteristic values of a real matrix *C* = |*c*_{ij}|, *i,j* = 1. *n* with non-negative non-diagonal elements are negative if and only if (-1) *k* *D*_{k} > 0 for all *k* = 1. *n*, where *D*_{k} is the main minor of the matrix *C* of the *k*-th order.

To apply this theorem, let us consider the *n* × *n* matrix, *n* ≤ *N*

Using these equalities, we can prove that for any *n*

Further, it is easy to see that for any *n*

Taking into account that *B*_{1} = -(λ_{1} + δ_{1}) < 0 and that the sign of det *A*_{n} coincides with (-1) *n* , it is easy to prove that

Thus, the sign of det *B*_{n} coincides with the sign of det *A*_{n} and so (-1) *n* *B*_{n} > 0 for all *n* = 1. *N*. According to the aforementioned theorem, the real parts of all the characteristic values of a real matrix *A*_{N} are negative and so the single equilibrium is asymptotically stable, QED.

## Contents

The **SIR model** [6] [7] [8] [9] is one of the simplest compartmental models, and many models are derivatives of this basic form. The model consists of three compartments:-

**S**: The number of **s**usceptible individuals. When a susceptible and an infectious individual come into "infectious contact", the susceptible individual contracts the disease and transitions to the infectious compartment. **I**: The number of **i**nfectious individuals. These are individuals who have been infected and are capable of infecting susceptible individuals. **R** for the number of **r**emoved (and immune) or deceased individuals. These are individuals who have been infected and have either recovered from the disease and entered the removed compartment, or died. It is assumed that the number of deaths is negligible with respect to the total population. This compartment may also be called "**r**ecovered" or "**r**esistant".

This model is reasonably predictive [10] for infectious diseases that are transmitted from human to human, and where recovery confers lasting resistance, such as measles, mumps and rubella.

These variables (**S**, **I**, and **R**) represent the number of people in each compartment at a particular time. To represent that the number of susceptible, infectious and removed individuals may vary over time (even if the total population size remains constant), we make the precise numbers a function of *t* (time): **S**(*t*), **I**(*t*) and **R**(*t*). For a specific disease in a specific population, these functions may be worked out in order to predict possible outbreaks and bring them under control. [10]

As implied by the variable function of *t*, the model is dynamic in that the numbers in each compartment may fluctuate over time. The importance of this dynamic aspect is most obvious in an endemic disease with a short infectious period, such as measles in the UK prior to the introduction of a vaccine in 1968. Such diseases tend to occur in cycles of outbreaks due to the variation in number of susceptibles (S(*t*)) over time. During an epidemic, the number of susceptible individuals falls rapidly as more of them are infected and thus enter the infectious and removed compartments. The disease cannot break out again until the number of susceptibles has built back up, e.g. as a result of offspring being born into the susceptible compartment.

Each member of the population typically progresses from susceptible to infectious to recovered. This can be shown as a flow diagram in which the boxes represent the different compartments and the arrows the transition between compartments, i.e.

### Transition rates Edit

For the full specification of the model, the arrows should be labeled with the transition rates between compartments. Between *S* and *I*, the transition rate is assumed to be *d(S/N)/dt = -βSI/N 2* , where *N* is the total population, β is the average number of contacts per person per time, multiplied by the probability of disease transmission in a contact between a susceptible and an infectious subject, and *SI/N 2* is the fraction of those contacts between an infectious and susceptible individual which result in the susceptible person becoming infected. (This is mathematically similar to the law of mass action in chemistry in which random collisions between molecules result in a chemical reaction and the fractional rate is proportional to the concentration of the two reactants).

Between *I* and *R*, the transition rate is assumed to be proportional to the number of infectious individuals which is γ*I*. This is equivalent to assuming that the probability of an infectious individual recovering in any time interval *dt* is simply γ*dt*. If an individual is infectious for an average time period *D*, then γ = 1/*D*. This is also equivalent to the assumption that the length of time spent by an individual in the infectious state is a random variable with an exponential distribution. The "classical" SIR model may be modified by using more complex and realistic distributions for the I-R transition rate (e.g. the Erlang distribution [11] ).

For the special case in which there is no removal from the infectious compartment (γ=0), the SIR model reduces to a very simple SI model, which has a logistic solution, in which every individual eventually becomes infected.

### The SIR model without vital dynamics Edit

The dynamics of an epidemic, for example, the flu, are often much faster than the dynamics of birth and death, therefore, birth and death are often omitted in simple compartmental models. The SIR system without so-called vital dynamics (birth and death, sometimes called demography) described above can be expressed by the following set of ordinary differential equations: [7] [12]

This model was for the first time proposed by William Ogilvy Kermack and Anderson Gray McKendrick as a special case of what we now call Kermack–McKendrick theory, and followed work McKendrick had done with Ronald Ross.

This system is non-linear, however it is possible to derive its analytic solution in implicit form. [6] Firstly note that from:

Secondly, we note that the dynamics of the infectious class depends on the following ratio:

the so-called basic reproduction number (also called basic reproduction ratio). This ratio is derived as the expected number of new infections (these new infections are sometimes called secondary infections) from a single infection in a population where all subjects are susceptible. [13] [14] This idea can probably be more readily seen if we say that the typical time between contacts is T c = β − 1 *before* the infectious has been removed is: T r / T c .

By dividing the first differential equation by the third, separating the variables and integrating we get

(note that the infectious compartment empties in this limit). This transcendental equation has a solution in terms of the Lambert W function, [15] namely

The role of both the basic reproduction number and the initial susceptibility are extremely important. In fact, upon rewriting the equation for infectious individuals as follows:

i.e., there will be a proper epidemic outbreak with an increase of the number of the infectious (which can reach a considerable fraction of the population). On the contrary, if

i.e., independently from the initial size of the susceptible population the disease can never cause a proper epidemic outbreak. As a consequence, it is clear that both the basic reproduction number and the initial susceptibility are extremely important.

#### The force of infection Edit

Note that in the above model the function:

models the transition rate from the compartment of susceptible individuals to the compartment of infectious individuals, so that it is called the force of infection. However, for large classes of communicable diseases it is more realistic to consider a force of infection that does not depend on the absolute number of infectious subjects, but on their fraction (with respect to the total constant population N

Capasso [16] and, afterwards, other authors have proposed nonlinear forces of infection to model more realistically the contagion process.

#### Exact analytical solutions to the SIR model Edit

In 2014, Harko and coauthors derived an exact so-called analytical solution (involving an integral that can only be calculated numerically) to the SIR model. [6] In the case without vital dynamics setup, for S ( u ) = S ( t )

An equivalent so-called analytical solution (involving an integral that can only be calculated numerically) found by Miller [17] [18] yields

Effectively the same result can be found in the original work by Kermack and McKendrick. [4]

A highly accurate analytic approximant of the SIR model as well as exact analytic expressions for the final values S ∞

### The SIR model with vital dynamics and constant population Edit

Consider a population characterized by a death rate μ

for which the disease-free equilibrium (DFE) is:

In this case, we can derive a basic reproduction number:

which has threshold properties. In fact, independently from biologically meaningful initial values, one can show that:

## Inference of epidemiological dynamics based on simulated phylogenies using birth-death and coalescent models

Quantifying epidemiological dynamics is crucial for understanding and forecasting the spread of an epidemic. The coalescent and the birth-death model are used interchangeably to infer epidemiological parameters from the genealogical relationships of the pathogen population under study, which in turn are inferred from the pathogen genetic sequencing data. To compare the performance of these widely applied models, we performed a simulation study. We simulated phylogenetic trees under the constant rate birth-death model and the coalescent model with a deterministic exponentially growing infected population. For each tree, we re-estimated the epidemiological parameters using both a birth-death and a coalescent based method, implemented as an MCMC procedure in BEAST v2.0. In our analyses that estimate the growth rate of an epidemic based on simulated birth-death trees, the point estimates such as the maximum a posteriori/maximum likelihood estimates are not very different. However, the estimates of uncertainty are very different. The birth-death model had a higher coverage than the coalescent model, i.e. contained the true value in the highest posterior density (HPD) interval more often (2-13% vs. 31-75% error). The coverage of the coalescent decreases with decreasing basic reproductive ratio and increasing sampling probability of infecteds. We hypothesize that the biases in the coalescent are due to the assumption of deterministic rather than stochastic population size changes. Both methods performed reasonably well when analyzing trees simulated under the coalescent. The methods can also identify other key epidemiological parameters as long as one of the parameters is fixed to its true value. In summary, when using genetic data to estimate epidemic dynamics, our results suggest that the birth-death method will be less sensitive to population fluctuations of early outbreaks than the coalescent method that assumes a deterministic exponentially growing infected population.

### Conflict of interest statement

The authors have declared that no competing interests exist.

### Figures

Figure 1. Comparison of the birth-death model…

Figure 1. Comparison of the birth-death model and the coalescent model in estimating epidemic growth…

For each plot, 100 trees simulated under the constant rate birth-death (BD) model with incomplete sampling (subfigure A) or coalescent (CE) model with exponential growth of the infected population (subfigure B) were analyzed assuming a birth-death model (blue bars) or a coalescent model with deterministic exponential population growth (red bars). 95% highest posterior density (HPD) intervals of the growth rate parameter are shown (y-axis). The trees are ordered (x-axis) by the median value of the posterior distribution of the growth rate parameter estimated by the coalescent (orange dot within the red bar) from the birth-death trees. Median of the posterior estimates for the growth rate parameter estimated by birth-death model is indicated as light blue dot within each blue interval. The true value of the growth rate parameter, i.e. the value under which the trees were simulated, is displayed as black horizontal bar. Here, we used and (). See Figure S1 for the plots of other parameter settings.Figure 2. Influence of branch length extension…

Figure 2. Influence of branch length extension in various parts of the tree on the…

For setting and (), we modified all 100 birth-death trees (A) and all 100 coalescent trees (B) by branch extension. The unchanged tree is denoted as “orig” on x-axis. We added 48 units of time, roughly corresponding to the full length of the longest trees, to the branches. We extended the branches that were present in the tree at 10% of the tree (going from the root), at 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% (see x-axis from left to right). We then re-estimated the growth rate parameter for each such tree. Unlike in previous plot, here we display a summary in form of the median values of the start and the end of the 95% HPD intervals, and the median of the medians of the posterior estimates for all 100 trees per setting.Figure 3. Influence of sampling scheme on…

Figure 3. Influence of sampling scheme on the growth rate parameter estimation.

For setting (…

For setting (), we modified the birth-death tree simulations to include periods of higher () and lower sampling (either , subfigures A and B, or , subfigures C and D). We simulated 100 birth-death trees (A and C) and corresponding coalescent trees (B and D) under various sampling schemes (see x-axis annotation). We display a summary in form of the median values of the start and the end of the 95% HPD intervals, and the median of the medians of the posterior estimates for all 100 trees per setting. For the settings where the constant rate birth-death method produced very severe biases, we also analysed the trees with the birth-death skyline model with 10 intervals for the sampling probability (BDSKY, light-blue lines). The summary for trees simulated under constant sampling throughout, is represented on the very left of each figure ( on the x-axis). Next, we varied the sampling as to e.g. sample no tips () in the early phases ( until ) when going forward in time and then sampling all the tips that die () from onward (corresponding to the setting denoted as “p = 0 from t = 0 to t = 9”).Figure 4. Error on as a function…

Figure 4. Error on as a function of sampling probability for fixed and .

In (A) the relationship between the error on , i.e. estimated /true , and the sampling probability is plotted. The values and are fixed. For different , , and and , we calculate and , and plot the impact on error when changing during inference using Equation (3) in the Supplementary Material S1. In (B) we display how error on depends on different assumptions of during inference for , and and an array of true sampling probability used for calculating and .Figure 5. Effect of different information used…

Figure 5. Effect of different information used in the parameter inference.

For setting and (), we estimated the parameter from the birth-death trees (A) and the coalescent trees (B) using four methods. First, using the coalescent posterior estimates of the growth rate and the true , we obtained (red bars). Second, we used the birth-death posterior estimates of (trees analysed under uniform priors for , , and ), and the true in the post-processing (blue bars), similar to the procedure used for the coalescent. Third, we also analyzed the trees by fixing the prior on the death rate to the true value, (green bars) or by fixing the prior on the sampling probability to the true value, (purple bars) during the MCMC analysis. Note that y-axis now displays 95% HPD of the parameter, and within each figure, the trees (simulations) are ordered (x-axis) by the median estimate of growth rate parameter estimated by the coalescent on the birth-death trees.Figure 6. Comparison of the birth-death model…

Figure 6. Comparison of the birth-death model and the coalescent model in estimating epidemic growth…

For simulated trees where all 100 tips are sampled at one point in time, we estimated the growth rate parameter assuming a birth-death model with fixed sampling probability (blue bars) and the coalescent model with a deterministic exponentially growing population (red bars). Here we used and sampling probability (). See Figure S15 for the plots of other parameter settings.Figure 7. Comparison of growth rate point…

Figure 7. Comparison of growth rate point estimates of the birth-death model and the coalescent…

For setting and (), we display the ML and MAP estimates for the birth-death trees (A) and the coalescent trees (B). As a comparison, the median values of the start and the end of the 95% HPD intervals, and the median of the medians of the posterior estimates for all 100 trees per setting are also displayed. The true value of the growth rate parameter, i.e. the value under which the trees were simulated, is displayed as a black horizontal bar. See Figures S17 and S18 for the plots of other parameter settings.## Contents

For recurrence and transience in Markov processes see Section 5.3 from Markov chain.

### Conditions for recurrence and transience Edit

Conditions for recurrence and transience were established by Samuel Karlin and James McGregor. [1]

By using **Extended Bertrand's test** (see Section 4.1.4 from Ratio test) the conditions for recurrence, transience, ergodicity and null-recurrence can be derived in a more explicit form. [2]

Then, the conditions for recurrence and transience of a birth-and-death process are as follows.

### Application Edit

The random walk described here is a discrete time analogue of the birth-and-death process (see Markov chain) with the birth rates

So, recurrence or transience of the random walk is associated with recurrence or transience of the birth-and-death process. [2]

These limiting probabilities are obtained from the infinite system of differential equations for p k ( t ) :

A **pure death process** is a birth–death process where λ i = 0

*M/M/1 model* and *M/M/c model*, both used in queueing theory, are birth–death processes used to describe customers in an infinite queue.

Birth-death processes are in phylodynamics as a generative model of phylogenies, ie a binary tree in which birth events correspond to branches of the tree and death events correspond to leaf nodes. [3] Notably, they are used in viral phylodynamics [4] to understand the transmission process.

In queueing theory the birth–death process is the most fundamental example of a queueing model, the *M/M/C/K/ ∞ /FIFO* (in complete Kendall's notation) queue. This is a queue with Poisson arrivals, drawn from an infinite population, and

*C*servers with exponentially distributed service times with

*K*places in the queue. Despite the assumption of an infinite population this model is a good model for various telecommunication systems.

### M/M/1 queue Edit

The **M/M/1** is a single server queue with an infinite buffer size. In a non-random environment the birth–death process in queueing models tend to be long-term averages, so the average rate of arrival is given as λ

The differential equations for the probability that the system is in state *k* at time *t* are

### Pure birth process associated with an M/M/1 queue Edit

That is, a (homogeneous) Poisson process is a pure birth process.

### M/M/c queue Edit

The M/M/C is a multi-server queue with *C* servers and an infinite buffer. It characterizes by the following birth and death parameters:

The system of differential equations in this case has the form:

### Pure death process associated with an M/M/C queue Edit

that presents the version of binomial distribution depending of time parameter t

### M/M/1/K queue Edit

The M/M/1/K queue is a single server queue with a buffer of size *K*. This queue has applications in telecommunications, as well as in biology when a population has a capacity limit. In telecommunication we again use the parameters from the M/M/1 queue with,

In biology, particularly the growth of bacteria, when the population is zero there is no ability to grow so,

Additionally if the capacity represents a limit where the individual dies from over population,

The differential equations for the probability that the system is in state *k* at time *t* are

## References

Nickerson DP, Hunter PJ: The Noble cardiac ventricular electrophysiology models in CellML. Prog Biophys Mol Biol. 2006, 90: 346-359.

Smith NP, Crampin EJ, Niederer SA, Bassingthwaighte JB, Beard DA: Computational biology of cardiac myocytes: proposed standards for the physiome. J Exp Biol. 2007, 210 (Pt 9): 1576-1583.

Niederer SA, Fink M, Noble D, Smith NP: A meta-analysis of cardiac electrophysiology computational models. Exp Physiol. 2009, 94: 486-495.

Hunter PJ, Borg TK: Integration from proteins to organs: the Physiome Project. Nat Rev Mol Cell Biol. 2003, 4: 237-243.

Tøndel K, Indahl UG, Gjuvsland AB, Omholt SW, Martens H: Multi-way metamodelling facilitates insight into the complex input–output maps of nonlinear dynamic models. BMC Syst Biol. 2012, 6: 88-

Tøndel K, Indahl UG, Gjuvsland AB, Vik JO, Hunter P, Omholt SW, Martens H: Hierarchical Cluster-based Partial Least Squares Regression is an efficient tool for metamodelling of nonlinear dynamic models. BMC Syst Biol. 2011, 5: 90-

Isaeva J, Martens M, Sæbø S, Wyller JA, Martens H: The modelome of line curvature: Many nonlinear models approximated by a single bi-linear metamodel with verbal profiling. Phys Nonlinear Phenom. 2012, 241: 877-889.

Isaeva J, Sæbo S, Wyller JA, Nhek S, Martens H: Fast and comprehensive fitting of complex mathematical models to massive amounts of empirical data. Chemometr Intell Lab. 2012, 117: 13-21.

Isaeva J, Sæbø S, Wyller JA, Wolkenhauer O, Martens H: Nonlinear modelling of curvature by bi-linear metamodelling. Chemometr Intell Lab. 2012, 117: 2-12.

Pearson K: On lines and planes of closest fit to systems of points in space. Philos Mag. 1901, 2: 559-572.

Jolliffe IT: A Note on the Use of Principal Components in Regression. J R Stat Soc: Ser C: Appl Stat. 1982, 31: 300-303.

Murty KG: Linear Programming. 1983, New York, USA: John Wiley & Sons, Inc

Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by Simulated Annealing. Science. 1983, 220: 671-680.

Marquardt DW: An Algorithm for Least-Squares Estimation of Nonlinear Parameters. J Soc Ind Appl Math. 1963, 11: 431-441.

Niederer SA, Hunter PJ, Smith NP: A Quantitative Analysis of Cardiac Myocyte Relaxation: A Simulation Study. Biophys J. 2006, 90: 1697-1722.

Land S, Niederer SA, Aronsen JM, Espe EKS, Zhang L, Louch WE, Sjaastad I, Sejersted OM, Smith NP: An analysis of deformation-dependent electromechanical coupling in the mouse heart. J Physiol. 2012, 590 (Pt 18): 4553-4569.

Land S:An integrative framework for computational modelling of cardiac electromechanics in the mouse. 2013, UK: Doctoral Thesis, University of Oxford,

Gordon AM, Homsher E, Regnier M: Regulation of contraction in striated muscle. Physiol Rev. 2000, 80: 853-924.

Rice JJ, Wang F, Bers DM, de Tombe PP: Approximate model of cooperative activation and crossbridge cycling in cardiac muscle using ordinary differential equations. Biophys J. 2008, 95: 2368-2390.

Hunter PJ, McCulloch AD, ter Keurs HE: Modelling the mechanical properties of cardiac muscle. Prog Biophys Mol Biol. 1998, 69: 289-331.

McKay MD, Los ASL, Beckman RJ, Conover WJ: Comparison the three methods for selecting values of input variable in the analysis of output from a computer code. Technometrics. 1979, 21: 239-245.

Land S, Louch WE, Niederer SA, Aronsen JM, Christensen G, Sjaastad I, Sejersted OM, Smith NP: Beta-adrenergic stimulation maintains cardiac function in Serca2 knockout mice. Biophys J. 2013, 104: 1349-1356.

MATLAB®. 2012, Natick, Massachusetts, USA: The MathWorks, Inc

Lawson CL, Hanson RJ: Solving Least Squares Problems. 1974, New Jersey, USA: Society for Industrial and Applied Mathematics, Prentice-Hall, Inc.

Martens M, Martens H: Partial Least Squares regression. Stat Proced Food Res JR Piggott Ed. 1986, London: Elsevier Applied Sciences, 293-360.

Martens H, Næs T: Multivariate Calibration. 1989, Chichester, UK: John Wiley and Sons

Martens H, Martens M: Multivariate Analysis of Quality: An Introduction. 2001, Chichester, UK: John Wiley & Sons Ltd, 1

Wold S, Martens H, Wold H: The multivariate calibration method in chemistry solved by the PLS method. Lect Notes Math Matrix Pencils. 1983, Heidelberg: Springer-Verlag, 286-293.

Tøndel K, Vik JO, Martens H, Indahl UG, Smith N, Omholt SW: Hierarchical multivariate regression-based sensitivity analysis reveals complex parameter interaction patterns in dynamic models. Chemometr Intell Lab. 2013, 120: 25-41.

Tøndel K, Gjuvsland AB, Måge I, Martens H: Screening design for computer experiments: Metamodelling of a deterministic mathematical model of the mammalian circadian clock. J Chemometr. 2010, 24: 738-747.

Bezdek JC: Pattern Recognition with Fuzzy Objective Function Algorithms. 1981, Norwell, USA: Kluwer Academic Publishers

Berget I, Mevik B-H, Næs T: New modifications and applications of fuzzy C-means methodology. Comput Stat Data Anal. 2008, 52: 2403-2418.

Næs T, Kubberød E, Sivertsen H: Identifying and interpreting market segments using conjoint analysis. Food Qual Prefer. 2001, 12: 133-143.

Næs T, Isaksson T: Splitting of calibration data by cluster analysis. J Chemometr. 1991, 5: 49-65.

Gutenkunst RN, Waterfall JJ, Casey FP, Brown KS, Myers CR, Sethna JP: Universally Sloppy Parameter Sensitivities in Systems Biology Models. PLoS Comput Biol. 2007, 3: e189-

Cacuci DG, Ionescu-Bujor M, Navon IM: Sensitivity and Uncertainty Analysis: Applications to Large-Scale Systems Vol 2. 2005, Boca Raton, USA: CRC Press, 1

Kleijnen JPC: Design and Analysis of Simulation Experiments. 2007, New York, USA: Springer, 1

Vik JO, Gjuvsland AB, Li L, Tøndel K, Niederer SA, Smith N, Hunter PJ, Omholt SW: Genotype-phenotype map characteristics of an in silico heart cell. Front Physiol. 2011, 2: 106-

Sobol IM: Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math Comput Simul. 2001, 55: 271-280.

Nelder JA, Mead R: A Simplex Method for Function Minimization. Comput J. 1965, 7: 308-313.

El Tabach E, Lancelot L, Shahrour I, Najjar Y: Use of artificial neural network simulation metamodelling to assess groundwater contamination in a road project. Math Comput Model. 2007, 45: 766-776.

Hamilton J: The Kalman Filter. Time Ser Anal. 1994, New Jersey, USA: Princeton University Press, 13:

Sarkar AX, Sobie EA: Regression Analysis for Constraining Free Parameters in Electrophysiological Models of Cardiac Cells. PLoS Comput Biol. 2010, 6:

Radulescu O, Gorban AN, Zinovyev A, Noel V: Reduction of dynamical biochemical reactions networks in computational biology. Front Gene. 2012, 3: 131-

## 6. The Problem of Selection

Throughout the above, we have alluded to the problems that systematic selection for one or more mtDNA types can cause in these analyses. Theory describing the influence of selection has been established, but is complicated (Johnston et al., 2015). In particular, if approaches that assume the absence of selection are used when it is in fact present, errors can arise in estimates of genetic properties and physical mechanisms. As pointed out above, these issues may lead to dramatic underestimation of 𠇋ottleneck size,” and cannot be assumed to �ncel out.”

Several of the results above are valid only in the absence of selection: when no mtDNA type experiences an advantage over any other. This is known to be false for many mtDNA pairings in many somatic tissues, where selection for one mtDNA type over another is often observed (reviewed in Burgstaller et al., 2014). Selection in the germline has been more debated, but evidence is increasing. In several studies, the transmission of pathological mutations seems to be subject to selective pressure. The maximum level of transmission for the 3243 mutation in humans has appeared to be limited (Monnot et al., 2011 Otten et al., 2018), and selection against severe mtDNA mutations has been observed in mice (Fan et al., 2008). Recent observations in mice (Burgstaller et al., 2018 Latorre-Pellicer et al., 2019) and humans (Wei et al., 2019) have indeed observed selection at different loci. Burgstaller et al. (2018) suggest that selection may act in different directions at different developmental stages (very recently supported by Latorre-Pellicer et al., 2019), and that these directions may either cancel out or provide a net selective shift. Mathematical theory for the behavior of mutant proportion spread when selection is present remains less well-developed and represents an important future theoretical target. The birth-death-partitioning approach in references (Johnston et al., 2015 Johnston and Jones, 2016) can account for selection but are mathematically complicated. Otten et al. (2018) have proposed a truncated Kimura distribution to describe a selective regime where mutant proportions above a certain value are prohibited, and found that it is supported by observations of the 3243 mutation.

Comparison to the Kimura distribution is often used to argue for an absence of selection. However, this approach must be interpreted with caution. Depending on the mechanism of selection, Kimura-distributed samples may be observed even when selection has occurred. In particular, as approaches using the Kimura distribution sometimes use several �ter” but no �ore” measurements, it is possible that an early shift in mutant proportion will not be detected.

## Random walk models in biology

Mathematical modelling of the movement of animals, micro-organisms and cells is of great relevance in the fields of biology, ecology and medicine. Movement models can take many different forms, but the most widely used are based on the extensions of simple random walk processes. In this review paper, our aim is twofold: to introduce the mathematics behind random walks in a straightforward manner and to explain how such models can be used to aid our understanding of biological processes. We introduce the mathematical theory behind the simple random walk and explain how this relates to Brownian motion and diffusive processes in general. We demonstrate how these simple models can be extended to include drift and waiting times or be used to calculate first passage times. We discuss biased random walks and show how hyperbolic models can be used to generate correlated random walks. We cover two main applications of the random walk model. Firstly, we review models and results relating to the movement, dispersal and population redistribution of animals and micro-organisms. This includes direct calculation of mean squared displacement, mean dispersal distance, tortuosity measures, as well as possible limitations of these model approaches. Secondly, oriented movement and chemotaxis models are reviewed. General hyperbolic models based on the linear transport equation are introduced and we show how a reinforced random walk can be used to model movement where the individual changes its environment. We discuss the applications of these models in the context of cell migration leading to blood vessel growth (angiogenesis). Finally, we discuss how the various random walk models and approaches are related and the connections that underpin many of the key processes involved.

### 1. Introduction

The basis of random walk theory can be traced back to the irregular motion of individual pollen particles, famously studied by the botanist Brown (1828), now known as *Brownian motion*. Classical works on probability have been in existence for centuries, so it is somewhat surprising that it was only at the beginning of the twentieth century that a random walk was described in the literature, when the journal *Nature* published a discussion between Pearson (1905) and Rayleigh (1905). Physicists, such as Einstein (1905, 1906) and Smoluchowski (1916), were then drawn to the subject and many important fields, such as random processes, random noise, spectral analysis and stochastic equations, were developed during the course of research on random walks. Random walk theory was further developed with the mean-reversion process (Uhlenbeck & Ornstein 1930).

The first simple models of movement using random walks are uncorrelated and unbiased. In this context, uncorrelated means the direction of movement is completely independent of the previous directions moved: the location after each step taken in the random walk is dependent only on the location in the previous step and the process is Markovian with regard to the location (Weiss 1994). Unbiased means there is no preferred direction: the direction moved at each step is completely random. Assuming that movement in any direction is allowed, this process is essentially Brownian motion and such models can be shown to produce the standard diffusion (or heat) equation.

Correlated random walks (CRWs) involve a correlation between successive step orientations, which is termed ‘persistence’ (Patlak 1953). This produces a local directional bias: each step tends to point in the same direction as the previous one, although the influence of the initial direction of motion progressively diminishes over time and step orientations are uniformly distributed in the long term (Benhamou 2006). Since most animals have a tendency to move forwards (persistence), CRWs have been frequently used to model animal paths in various contexts (e.g. Siniff & Jessen 1969 Skellam 1973 Kareiva & Shigesada 1983 Bovet & Benhamou 1988 Turchin 1998).

A global directional bias can be introduced by making the probability of moving in a certain direction greater, leading to the drift–diffusion (or advection–diffusion) equation. Paths that contain a consistent bias in the preferred direction or towards a given target are termed biased random walks (BRWs), or biased and CRWs (BCRWs) if persistence is also observed. The bias may be due to the fixed external environmental factors (e.g. bottom heavy micro-organisms moving upwards under gyrotaxis Hill & Häder 1997), to spatially varying factors, such as chemical gradients (Alt 1980 Othmer *et al*. 1988), to mean-reversion mechanisms, such as movement within a home range (Blackwell 1997), or to a choice of direction by individuals at each step (Benhamou 2003). The target direction and strength of bias are not necessarily fixed over the whole path and may vary with location and time (e.g. fish larvae aiming for a reef Codling *et al*. 2004). When the target direction is fixed for all individuals in the population, it is possible to quantify the direction, functional form and magnitude of the introduced bias (Hill & Häder 1997 Codling & Hill 2005*a*). However, due to the localized directional bias (persistence) in CRW, it is a non-trivial problem to distinguish between the CRW and BCRW when individuals have different target directions (Benhamou 2006).

In the context of population redistribution, uncorrelated random walks are classed as position jump processes (Othmer *et al*. 1988). In general, they are valid only for large time scales and can be thought of as an asymptotic approximation to the true equations governing movement that include correlation effects. In turn, the CRW and BCRW are often referred to as velocity jump processes (since the process involves random changes in velocity) and have been extensively studied leading to a general framework to describe these processes (Othmer *et al*. 1988 Hillen & Othmer 2000 Hillen 2002 Othmer & Hillen 2002).

The random walks discussed in this paper can have either a fixed or variable step length. In the case of a variable step length, only walks where the distribution of step lengths has finite variance (e.g. the exponential distribution) are considered. By the central limit theorem, this means that, after a sufficient length of time, the location coordinate of an individual walker on any axis converges to a Gaussian distribution. A different type of random walk that has generated much recent interest is the *Lévy walk*, in which the distribution of step lengths is heavy tailed, i.e. has infinite variance. In this case, the walk exhibits scale-invariant (i.e. fractal) characteristics. Several recent studies (Viswanathan *et al*. 1996, 2000) have claimed that Lévy walks provide a suitable model for animal movements, although this is still the subject of some controversy (Benhamou 2007 Edwards *et al*. 2007 James & Plank 2007) and they may not be as generally applicable as once thought. In fact, many of the observed patterns that are attributed to Lévy processes can be generated by a simpler composite random walk process where the turning behaviour is spatially dependent (Benichou *et al*. 2006 Benhamou 2007). A full review of Lévy walks is outside the scope of this paper.

Our aim in this review paper is to produce a comprehensive reference that can be used by both mathematicians and biologists. Other books and review papers are available in this field (e.g. Weiss 1994 Turchin 1998 Okubo & Levin 2001 Hillen 2002), but we have yet to find one paper including all the key results while also explaining how the models used in different biological contexts are related to one another. Random walk theory is generally applied in two main biological contexts (although these are by no means exclusive and there are many other relevant contexts): the movement and dispersal of animals and micro-organisms, and chemotaxis models of cell signalling and movement. As we aim to demonstrate in this review, most of the modelling approaches used in these different contexts are based on the same underlying models. The paper is split into two main sections. In §2, the fundamental theory and equations of random walks are introduced, and the central ideas of bias and persistence are developed in some detail. Section 3 reviews the use of random walks in relation to animal and cell movements, by showing how the theory can be used to predict information about rates of spread and tortuosity, and movements that are governed by chemical signalling agents (or other stimuli). We discuss the benefits and limitations of the different approaches used, and the connections that underpin the models and results. We briefly explain some of the more complex models available in the literature, where relevant. Our aim is to be comprehensive but not exhaustive the reader is referred to key papers for further details and extensions of the basic ideas.

### 2. Fundamentals of random walks

The simple isotropic random walk model (SRW) is the basis of most of the theory of diffusive processes. The walk is isotropic, or unbiased, meaning that the walker is equally likely to move in each possible direction and uncorrelated in direction, meaning that the direction taken at a given time is independent of the direction at all preceding times. Using this model, it is straightforward to derive an equation for the probability density function (PDF) for the location of the walker in one dimension, either by considering the limit as the number of steps gets very large or by using a difference equation (we illustrate these approaches in §§2.1 and 2.2). In §§2.3 and 2.4, we show how the SRW is extended to higher dimensions and, in §2.5, we discuss the general properties of diffusive processes and some limitations of models that use such an approach. In §2.6, we give a simple example of a random walk to a barrier to demonstrate how the SRW can form the basis of more complex models of movement.

An important extension to the SRW is the CRW, in which there is a persistence in the walker's direction of movement. In §2.7, an equation, known as the telegraph equation, describing the simplest one-dimensional CRW is derived and, in §2.8, its relation to the standard diffusion process is discussed. In §2.9, we demonstrate how this analysis can be extended to include bias in the global direction of movement. However, in two (or more) dimensions a similar analysis (for both the unbiased and biased cases) does not lead to a closed-form equation for the probability density in space and time. We discuss this problem and its implications in §2.10.

#### 2.1 The simple isotropic random walk

Consider a walker moving on an infinite one-dimensional uniform lattice (i.e. a line split into discrete points). Suppose the walker starts at the origin (*x*=0) and then moves a short distance *δ* either left or right in a short time *τ*. The motion is assumed to be completely random, so the probabilities of moving both left and right are 1/2. After one time step, the walker can either be at a distance *δ* to the left or right of the origin, with probability 1/2 each. After the next time step, the walker will either be at a distance 2*δ* to the left or right of the origin (with probability 1/4 each) or will have returned to the origin (with probability 1/2). Note that, after an even (odd) number of steps, the walker can only be at an even (odd) distance away from the origin. Continuing in this way, the probability that a walker will be at a distance *mδ* to the right of the origin after *n* time steps (where *m* and *n* are even) is given by

Useful time-dependent statistics of this process are the mean location *E*(*X*_{t}) and the mean squared displacement (MSD) , defined as

#### 2.2 A BRW with waiting times

It is also a standard procedure (Lin & Segel 1974 Okubo & Levin 2001) to derive the governing equation for a SRW using a difference equation, and subsequently solve to find solutions of the form (2.2). This method also allows more complex processes to be modelled as part of the random walk and we illustrate this in the following example by including both a preferred direction (or bias) and a possible waiting time between movement steps.

Consider a walker moving on a one-dimensional lattice, where, at each time step *τ*, the walker moves a distance *δ* to the left or right with probabilities *l* and *r*, respectively, or stays in the same location (‘waits’), with probability 1−*l*−*r* (the isotropic random walk in §2.1 has *r*=*l*=1/2 and there is no waiting time). Now, if the walker is at location *x* at time *t*+*τ*, then there are three possibilities for its location at time *t*: (i) it was at *x*−*δ* and then moved to the right, (ii) it was at *x*+*δ* and then moved to the left, and (iii) it was at *x* and did not move at all. Thus we have

It is worth noting that, owing to the waiting time between movements, the value of the diffusion constant *D* is smaller (*k*=*l*+*r*<1 in this case) than that in §2.1 and hence, as expected, diffusion is less rapid. Note also that, due to the way the limits are taken in (2.6), the term *ϵ*=*r*−*l* must tend to zero in order to obtain a finite drift rate *u*.

The solution of (2.7) subjected to the initial condition *p*(*x*, 0)=*δ*_{d}(*x*) (the Dirac delta function), which means that the walker is at *x*=0 at time *t*=0, is (Montroll & Shlesinger 1984 Grimmett & Stirzaker 2001)

The mean location and the MSD can be easily calculated by substituting (2.8) into (2.3). Alternatively, it is possible to calculate the moments directly from the governing PDE (2.7). This is a technique that is particularly useful in more complex models when the governing differential equation is known, but the solution for *p*(*x*, *t*) may be difficult or impossible to find (see §3.5). First, we multiply equation (2.7) either by *x* (to find mean location) or by *x* 2 (to find MSD) and integrate by parts. Then, using the definitions in (2.3), the fact that

#### 2.3 The BRW in higher dimensions

A similar derivation to that in §2.2 can be completed using an *N*-dimensional lattice to give the standard drift–diffusion equation

*R*

_{t}=|

*X*_{t}| and, for (2.12), is given by

Figure 1 (*a*,*c*,*e*) PDFs and (*b*,*d*,*f*) sample paths of different random walks. (*a*,*b*) A lattice BRW with probabilities of moving a distance *δ* right or left of *τ*(*D*/*δ* 2 ±*u*/(2*δ*)) and up or down of *τD*/*δ* 2 . (*c*,*d*) A non-lattice CRW with probabilities of turning an angle *δ*_{θ} clockwise or anticlockwise of . (*e*,*f*) A non-lattice BCRW with probabilities of turning clockwise or anticlockwise of (cf. the linear reorientation model of §3.4). In the BRW and BCRW, the global preferred direction is *θ*_{0}=0 in the CRW and BCRW, the initial direction is *θ*=*π*/2 and the walker moves with constant speed *v*. In all cases, the walker starts at (*x*, *y*)=(0, 0) at *t*=0 and is allowed to move until *t*=10. The PDFs *p*(*x*, *y*, *t*=10) were calculated from 10 6 realizations of the walk. In (*a*), the white lines show the contours of the corresponding theoretical PDF (2.12). In the sample paths for the BRW, at each step the walker either stays still or moves right, left, up or down by a distance *δ*. In the CRW and BCRW, at each step the walker's direction of motion *θ* either stays the same or turns clockwise or anticlockwise by an angle *δ*_{θ}, and the walker's movement is given by the vector *vτ*(cos *θ*, sin *θ*). Parameter values: *D*=0.2, *u*=0.5, , *B*=2.5, *v*=0.5.

It is interesting to note that, if two-dimensional movement is not restricted to a lattice (box 1), then diffusion in the direction of the bias is lower than that in the perpendicular direction (see ch. 2 of Codling 2003 Coscoy *et al*. 2007). This result is not always properly considered in spatial population dynamics models, where diffusion is usually assumed to be the same in all directions even when there is an average drift in a particular direction.

#### Box 1. Stepping off the lattice.

The von Mises distribution

The wrapped normal distribution

The wrapped Cauchy distribution

#### 2.4 The Fokker–Planck equation

It is also possible to extend the SRW in two (or more) dimensions to include movement probabilities that are spatially dependent. This results in the Fokker–Planck equation for *p*(*x*, *y*, *t*) (Okubo & Levin 2001). Suppose that an individual moves on a two-dimensional lattice. At each time step *τ*, an individual can move a distance *δ* either up, down, left or right with probabilities dependent on location, given by *u*(*x*, *y*), *d*(*x*, *y*), *l*(*x*, *y*) and *r*(*x*, *y*), respectively (with *u*+*d*+*l*+*r*≤1), or remain at the same location with probability 1−*u*(*x*, *y*)−*l*(*x*, *y*)−*d*(*x*, *y*)−*r*(*x*, *y*). We now use a difference equation as in §2.2, expand as a Taylor series and define the following parameters:

#### 2.5 General diffusive properties and model limitations

Closer inspection of the diffusion equation solutions (2.2) and (2.8) shows that the following important property holds: *p*(*x*, *t*)>0 for all locations *x* and all positive times *t*. Hence, for any positive time (no matter how small), there is a positive probability of being at *any* finite location.

In practice, the probability of being at an exceptionally large distance away from the origin after an infinitesimal time step is extremely small. However, the fact that this property of ‘infinite propagation speed’ exists suggests that the limiting process involved in deriving the governing equation of the SRW model has some limitations. As shown in §2.2, to derive the governing equation of a simple BRW, we take limits assuming that *δ* 2 /*τ*=2*D* is constant as *δ*, *τ*→0 (where *δ* is the distance moved at each time step *τ*). Clearly, if this property holds then, in the same limit, we get *δ*/*τ*→∞, i.e. the effective instantaneous speed is infinite (Okubo & Levin 2001). The solution of the diffusion equation should hence be considered only as an asymptotic approximation, valid for large times (*t*≫*τ*), of equations that more accurately describe the correlations in movement and finite speeds that are present when considering movement at shorter time scales.

A more detailed discussion of the infinite propagation speed in diffusive processes can be found in Okubo & Levin (2001), while in §2.7 we present a simple model based on a velocity jump process (rather than a position jump process), which avoids this problem.

#### 2.6 Random walks with a barrier

To model movement in a confined domain, one can introduce a repelling or reflecting barrier into the random walk: a walker reaching the barrier will automatically turn around and move away in the opposite direction. Similarly, to model movement where walkers leave the system upon reaching a given point, one can introduce an absorbing barrier. Models such as these are not only appropriate to model movement in space, but they are also suited to modelling development and growth where critical life stages are reached (Pitchford & Brindley 2001 Pitchford *et al*. 2005 Mullowney & James 2007) and to ‘integrate and fire’ models of nerve responses (Iyengar 2000). These are examples of *first passage time* (or *first hitting time*) problems, in which the distribution of the time taken to reach an absorbing barrier is of primary interest (Condamin *et al*. 2007).

The simple example of an absorbing barrier given below is adapted from an example by Grimmett & Stirzaker (2001). Suppose we have a one-dimensional random walk process that satisfies the drift–diffusion equation (2.7) for *x*>0. Suppose the walker starts at location *x*=*x*_{0}>0, and there is an absorbing barrier at *x*=0 such that, if the walker reaches the point *x*=0, it is removed from the system. The appropriate boundary and initial conditions in this case are

A similar analysis is possible for a reflecting barrier (Grimmett & Stirzaker 2001) and the SRW model can be extended or analysed in many other ways. Montroll & Shlesinger (1984) give a good review of the general theory of random walks and discuss a wide variety of ideas and problems.

#### 2.7 CRWs and the telegraph equation

A CRW takes into account short-term correlations in the direction of movement. In most cases, this means that the walker is more likely to move in the same or a similar direction to its previous movement direction. This tendency to continue in the same direction is known as persistence (Patlak 1953 see for example figure 1*d*). By explicitly including persistence and a fixed speed of movement in the random walk process, the problem of infinite propagation speed discussed in §2.5 is avoided (see also the discussion in Turchin 1998). The location at each step of the random walk is no longer a Markov process (as it depends on the sequence of previous locations). Hence the usual framework for describing a CRW is a *velocity jump process*, in which the variable following a Markov process is the walker's velocity rather than the location (Othmer *et al*. 1988).

Consider a population of individuals moving either left or right along an infinite line at a constant speed *v*. Denote the density of right- and left-moving individuals at location *x* and time *t* by *α*(*x*, *t*) and *β*(*x*, *t*), respectively. The total population density is *p*(*x*, *t*)=*α*(*x*, *t*)+*β*(*x*, *t*). At each time step *τ*, each individual either changes direction and moves a distance *δ* in this new direction (with probability *r*=*λτ*), or moves a distance *δ* in the previous direction (with probability *q*=1−*λτ*). Hence, turning events occur as a Poisson process with rate *λ*. If we take a forward time step, then the number density of individuals at location *x* moving right and left, respectively, is given by

Note that, although (2.27) describes a correlated movement process, the random walk is globally unbiased in the sense that there is no overall preferred direction, simply a tendency for individuals to persist in their present direction of motion (a localized bias). Although the original process uses a fixed time step *τ*, the mean time between turning events, , is different since the telegraph process does not have a turning event at every time step.

Equation (2.27) can be solved given the initial conditions *p*(*x*, 0) and (∂*p*/∂*t*)(*x*, 0), but the full solution is quite complex (see Morse & Feshbach 1953, for details). Since there is a fixed speed *v*, the solution does not imply infinite propagation speeds (as found with the solutions (2.2) and (2.8) of the diffusion equation).

It is possible to use a similar method to that used in §2.2 to derive equations for the moments of the solution: *E*(*X*_{t})=0 and

#### Box 2. Anomalous diffusion.

*μ*=0. This corresponds to a stationary process with no movement over the period of observation.

0<*μ*<1. This situation is known as *sub-diffusion* since MSD increases at a slower rate than in the case of standard diffusion. Such situations typically occur when waiting times between steps are included in the models of movement (e.g. Weeks *et al*. 1996 although a model with waiting times may not always be sub-diffusive, see §2.2), or if the spatial domain is constrained in some way (Coscoy *et al*. 2007), e.g. with the presence of a barrier (as in §2.6).

*μ*=1. This is the standard relation between MSD and time for diffusive movement.

1<*μ*<2. This situation is known as *super-diffusion* since MSD increases at a faster rate than in the case of standard diffusion (although not so fast as with ballistic movement). Such situations typically occur when the step lengths in the walk are drawn from a distribution with infinite variance. Such a process is known as a Lévy walk and has been extensively studied by physicists and more recently ecologists (Viswanathan *et al*. 1996) although its general applicability to animal movement is still open to debate (Benhamou 2007 Edwards *et al*. 2007).

*μ*=2. In this situation, the movement is described as *ballistic* or *wavelike*, and MSD increases quadratically with *t*. This corresponds to the absolute displacement (cf. MDD) increasing linearly with time, which is a standard property of a wave process. In such cases, the characteristic backtracking and random movement associated with diffusive processes is not present, and each individual effectively moves in a straight line (in a random direction) away from the origin for the whole time period. (Note that, as seen is §2.2, MSD scales with *t* 2 for large *t* in a BRW here, we are concerned only with an unbiased movement process.)

#### 2.8 Diffusion limit of the telegraph equation

In §2.1, we showed that the diffusion coefficient of a SRW is *D*=*δ* 2 /(2*τ*), where *δ* is the distance moved at each jump and *τ* is the time step between jumps. Since the telegraph turning process is a Poisson process of intensity *λ*, the mean time between turning events is and the average distance moved between turning events is . Hence the effective diffusion coefficient for the telegraph process is indeed given by

#### Box 3. Hyperbolic or parabolic?

#### 2.9 The biased telegraph equation

The derivation of the biased one-dimensional telegraph equation is similar to the unbiased case, except that we use different turning probabilities depending on the direction of movement. Denoting the probability of turning by *r*_{1}=*λ*_{1}*τ* for right-moving individuals and *r*_{2}=*λ*_{2}*τ* for left-moving individuals, it can be shown that the governing equation, called the biased telegraph equation, is

Note that, in this example, the bias is introduced through the different rate of turning in each direction, which is a form of *klinokinesis* this contrasts with the uncorrelated BRW in §2.2, where the bias came about through the probability of moving in each direction, which is a form of *taxis* (see §3.8).

#### 2.10 The telegraph equation in higher dimensions

In §2.5, the solution to the two-dimensional diffusion process with or without drift was shown to be valid only as a long-time approximation. One can introduce correlation by completing a similar derivation as in §2.7, but working with a two-dimensional lattice rather than a line. There are now four possible directions of movement: right, left, up and down. Initially, we assume that the probability of turning is independent of the direction of movement (so there is no bias). As before, we assume a constant speed *v*. We split the population into individuals moving in each of the four directions *α*_{1}, …, *α*_{4}. At each time step *τ*, an individual can turn *π*/2 rad anticlockwise or clockwise (with probabilities *λ*_{1}*τ* and *λ*_{2}*τ,* respectively), turn *π* rad (with probability *λ*_{3}*τ*), or continue in the previous direction (with probability 1−(*λ*_{1}+*λ*_{1}+*λ*_{3})*τ*).

Completing a similar analysis to §2.7 leads to a set of differential equations for each of the sub-populations moving in the four directions. Further manipulation similar to that in §2.7 then leads to

### 3. Random walks as models of animal and cell movement

It was demonstrated in §2 how expressions may be derived for the PDF, *p*(** x**,

*t*), of an uncorrelated random walk in one or more dimensions where the motion is either a purely random or a biased diffusive process. If

*p*(

**,**

*x**t*) is known, it is straightforward to calculate the moments such as the mean location,

*E*(

*X*_{t}), or MSD, . Animal and cell movements are often characterized by some directional correlation (persistence) and, unfortunately, with a CRW, it is not usually possible to calculate

*p*(

**,**

*x**t*) directly, or even to derive a system of differential equations for

*p*(

**,**

*x**t*). Variations of the telegraph equation can be used to model a one-dimensional CRW and

*p*(

**,**

*x**t*) (and associated moments) can be found. However, as discussed in §2.10, it is still a non-trivial problem to derive a solution for

*p*(

**,**

*x**t*) for a CRW in higher dimensions (Othmer

*et al*. 1988 Hillen 2002 Keller 2004). Nevertheless, it is, in many cases, still possible to calculate statistics of the CRW directly through the analysis of paths, as discussed in §§3.1 and 3.2. In §3.3, some measures of the tortuosity of a path are introduced and their relation to MSD is discussed.

The situation is slightly more complex when movement is both correlated and biased in a global preferred direction (i.e. a BCRW). In §3.4, we discuss how it is possible to detect bias in observed paths when there is either a fixed global preferred direction (as in gyrotaxis where bias is due to gravity) or when the preferred direction is individual dependent (as in navigation to a fixed target in space). Simple extensions of the BRW model are also discussed. In §3.5, we introduce a generalized mass-balance equation (the transport equation) that describes hyperbolic movement and discuss how this can be used as a general framework for modelling BCRW.

In general, for organisms moving in an environment that is varying spatially and/or temporally, transition probabilities will depend explicitly on the time *t* and walker's location ** x**. Typically, this dependence is via some ‘control signal’, such as a chemical substance, light, heat, humidity or odour. A control signal can stimulate the organism in four main ways: the stimulus may be an attractant (or repellent), providing a directional bias that stimulates the organisms to migrate up (or down) a concentration gradient field, or may be an inducer (or inhibitor), causing the rate of diffusive unbiased movement to increase (or decrease). Of course, a specific stimulus may combine more than one of these four properties, and it is not always straightforward to distinguish between the different effects, and the underlying mechanisms responsible for them, on the basis of experimental observations (Cai

*et al*. 2006).

As a further complication, it is common for migrating cells to modify their own chemical environment by producing or degrading the control substance. For example, the slime mould *Dictyostelium discoideum* secretes cyclic adenosine monophosphate (cAMP), which acts as a chemoattractant, leading to the aggregation of cells from a wide area (Höfer *et al*. 1995) certain types of bacteria secrete slime trails, which provide directional guidance for other cells (Othmer & Stevens 1997). In §3.6, the basic theory of reinforced random walks (RRWs) is introduced, together with some models for the transition probabilities which lie at the heart of the RRW description. Some applications of RRW modelling are reviewed in §3.7. Finally, in §3.8, the link between non-lattice random walks and tactic and kinetic movement mechanisms is discussed.

#### 3.1 Mean squared displacement of CRWs

The MSD, defined in *N* dimensions by (2.13), gives a measure of the spatial spread of the population with time and, owing to its relationship with the diffusion coefficient *D* via, for example (2.14), is of great importance to those studying dispersal in biological systems (Okubo & Levin 2001). Interestingly, however, many of the results discussed in this section were first derived through studies in molecular chemistry. The growth and space-filling properties of polymer chains and larger molecules have been modelled as a CRW by, for example, Tchen (1952) and Flory (1969), who both derived equations for MSD. Tchen (1952) also demonstrated the important result that the location coordinates after a large number of steps of a CRW are normally distributed.

Taylor (1921) derived the following expression for the MSD in a one-dimensional correlated walk (see also Tchen 1952 Flory 1969 Hanneken & Franceschetti 1998 Okubo & Levin 2001). Suppose the walker takes a series of steps *y*_{j} (*j*=1, …, *n*) of constant length (|*y*_{j}|=*δ*). A correlation is explicitly introduced between the directions of successive steps (although this correlation propagates in a gradually diminishing way through the Markov process). As the number of steps *n* becomes large, the MSD for this discrete process tends asymptotically to

Kareiva & Shigesada (1983) were the first to set up and analyse a generalized model of a two-dimensional CRW for animal movement that included a variable step length and a general angular distribution for the direction moved at each step (see also Skellam 1973 Nossal & Weiss 1974 Lovely & Dahlquist 1975 Hall 1977 Dunn 1983 Marsh & Jones 1988). The CRW consists of a series of discrete steps of length *L*_{j} and direction *Θ*_{j}. The length *L*_{j} of the *j*th move and the turning angle *Φ*_{j}=*Θ*_{j+1}−*Θ*_{j} are assumed to be random variables with no autocorrelation or cross-correlation (and no correlation between step length and step direction). The CRW thus consists of a series of independent draws from the step length PDF, *p*(*l*), and the turning angle PDF, *g*(*ϕ*) (box 1), for each step (i.e. the process is a first-order Markov chain, see Grimmett & Stirzaker (2001)). We define the mean cosine *c* and mean sine *s* of the turning angle as (see also box 1)

Using this model, Kareiva & Shigesada (1983) derived the following equation for the MSD after *n* steps

The general result (3.3) reduces to a much simpler form in particular cases. For example, if there is no persistence (i.e. the walk is uncorrelated), then *g*(*ϕ*) has a uniform density and both *c* and *s* are zero, and (3.3) reduces to (cf. (2.14) with no drift, ** u**=

**0**and with

*N*=2). As mentioned, a more realistic case is when organisms exhibit equal probabilities of turning clockwise or anticlockwise, so

*g*(

*ϕ*) is symmetric about

*ϕ*

_{0}=0 (although it is worth noting that this is not always the case—for example, bacteria can exhibit an inherent rotational bias due to the handedness of their flagellar motor). In the case of equal turning probabilities, we get

*s*=0 and (3.3) may be written in terms of the coefficient of variation

*b*of the step length

*L*(

*b*

_{2}=

*E*(

*L*2 )/(

*E*(

*L*)) 2 −1)

In the case of random walks with bias, it can be more complicated to derive expressions for MSD. It is possible to write down the MSD of an uncorrelated BRW using a similar approach to the above. Consider a walk consisting of steps of length *L*_{j} and direction *Θ*_{j} (note the important difference between the direction of movement *Θ*_{j} and the turning angle *Φ*_{j}=*Θ*_{j+1}−*Θ*_{j}), whose mean sine is zero and whose mean cosine is *q*. After *n* steps, the MSD is given by (Marsh & Jones 1988 Benhamou 2006)

An equation for MSD in a biased velocity jump process can be generated for both the one-dimensional case and higher dimensions, using a generalized transport equation (see §3.5). However, these are special cases of a BCRW where the turning events occur as a Poisson process. In the case of a BCRW with a fixed time step between turning events (where the spatial step lengths can be either fixed or variable), it remains an open problem to calculate a direct equation for MSD.

#### 3.2 Mean dispersal distance of unbiased CRWs

The MSD, defined in (2.13), is of interest to ecologists and biologists owing to its relation to the diffusion coefficient *D* via (2.14). The MSD is also a statistic that is reasonably mathematically tractable, as illustrated in §3.1. However, because the MSD deals with the squared dispersal distance, it has been suggested that a more intuitive statistic is the mean dispersal distance (MDD Bovet & Benhamou 1988 McCulloch & Cain 1989 Wu *et al*. 2000 Byers 2001). The MDD of a dispersing population is defined in *N* dimensions as

Consider a two-dimensional, unbiased, CRW with mean step length *E*(*L*), and a symmetrical about zero (i.e. zero mean sine) probability distribution *g*(*ϕ*) for the turning angle at each step. After a sufficiently large number of steps *n*, the location coordinates *X*_{n} and *Y*_{n} are independently normally distributed with equal variance. Bovet & Benhamou (1988) used this to derive the following approximate relationship between MDD and MSD:

As discussed in §2.7, it is possible to derive differential equations for the MSD of a BCRW where the times between turning events are distributed as a Poisson process but, owing to the presence of an absolute value in (3.6), this approach cannot be used to derive an equation for the MDD of a BCRW. This remains an open problem but, for particular cases, it may be possible to use the result of Bovet & Benhamou (1988), together with an equation for the MSD derived from a moment closure method (see Codling 2003).

#### 3.3 Tortuosity of CRWs

The *tortuosity* of a path describes the amount of turning in a given space or time. Clearly, tortuosity is related to the MSD and MDD: highly tortuous paths will spread out in space slowly (small MSD), while straight paths will spread out in space quickly (high MSD). Hence, it can be useful to measure and study the tortuosity of observed paths in order to understand the processes involved, estimate the area searched by an organism and predict spatial dispersal. Several measures of tortuosity are available but most have some limitations.

The *straightness index* (sometimes called the *net-to-gross displacement ratio*) is a relative measure that compares the overall net displacement *G* of a path with the total path length *T* (Batschelet 1981). For example, if a random walk starts at location (*x*_{0}, *y*_{0}) and, after *n* steps with lengths *l*_{j} (*j*=1, …, *n*), ends at (*x*_{n}, *y*_{n}), then the straightness index is given by

Unfortunately the straightness index is not a reliable measure of tortuosity of a CRW, because the mean of *G* corresponds to the MDD and hence increases with the square root of *n*. Consequently, the ratio *G*/*T* tends to zero as the number of steps increases (Benhamou 2004). Thus, when observing a CRW and measuring the straightness index, a different result will be found depending on the number of steps considered (the total time observed or the total path length used). It is, therefore, very difficult to compare the tortuosity of different CRWs using this method, unless they all consist of a similar number of steps.

The tortuosity of a CRW corresponds to the amount of turning associated with a given path length and, in some way, measures its long-term diffusion potential. On this basis, Benhamou (2004) defined the path *sinuosity* in terms of the mean step length and MSD after a large number of steps as . Using equation (3.4) simplified by removing the second term on the right-hand side, which becomes negligible in the long term, gives

Equation (3.9) can be applied directly to actual paths when animals naturally move in a discrete way (e.g. a bee flying from one flower to another), provided the basic requirements of a CRW are respected (independence of step lengths and turns). For animals moving in a continuous way, the path is usually *discretized* when recorded (i.e. the raw data consist of a set of locations rather than of a continuous track), and can be rediscretized for analysis purposes. (Re)discretization has been shown to alter the turning angle and step length distributions, and hence involves the use of corrected formulae (Bovet & Benhamou 1988 Benhamou 2004 Codling & Hill 2005*a*). Note that rediscretization can be done either spatially (Bovet & Benhamou 1988), which is easier to fit with the analysis above, or temporally (Codling & Hill 2005*a*), which is perhaps more natural for experiments in the field.

It has also been proposed to measure the tortuosity of animals' random search paths by a fractal dimension. This would be useful for actual paths that can be reliably represented using fractioned Brownian motion, which is a fractal movement model where the persistence is related to the fractal dimension through the use of a parameter called the Hurst coefficient. To our knowledge, however, it has not yet been shown that this kind of model can provide a better representation of animals' random search paths than CRW, whose true fractal dimension is equal to 2 for two-dimensional movements (i.e. they will eventually fill the entire plane) for any positive sinuosity (for *S*=0, the CRW reduces to a straight line, with fractal dimension equal to 1). Applying the classical ‘divider’ method, initially developed to measure the fractal dimension of fractal lines, the CRW provides pseudo-fractal dimension values (Turchin 1998 Benhamou 2004), which correspond to indirect measures of the mean cosine of turns *c* (Nams 1996, 2005). These pseudo-fractal values cannot reliably estimate the path tortuosity of a CRW, which depends not only on the mean cosine of turns *c*, but also on the mean step length *E*(*L*) and coefficient of variation *b*.

#### 3.4 Bias in observed paths

The next step up in model complexity from a CRW is a BCRW. We will deal with modelling approaches for BCRWs in §§3.5 and 3.6, while in this section we will consider the ways in which bias may be detected in an observed path (see also Coscoy *et al*. 2007). Assuming the simplest possible environment and behaviour, there are two main ways in which individuals may respond to a signal and hence introduce bias into their movement. Firstly, there may be a fixed sensory gradient such that the preferred direction is always the same for all individuals at all locations in space (e.g. micro-organisms moving under the influence of gravity). Secondly, there may be a target or point source fixed in space, such that the preferred direction is towards a fixed point (e.g. animals searching for a food source). Note that the first scenario can be considered as a special case of the second where the target is fixed at infinity.

In order to parametrize continuum models of bioconvection in micro-organisms (Kessler 1986 Hill & Pedley 2005), Hill & Häder (1997) analysed the paths of swimming algae undergoing *gyrotaxis* (upward swimming due to a gravitational torque) or *phototaxis* (swimming towards a light source). In both situations, it was assumed that the preferred absolute direction of movement was independent of location (i.e. a ‘target at infinity’), but that the *turning angle* was dependent on the most recent direction of movement. To help inform experimental observations, Hill & Häder (1997) set up a random walk on the unit circle as follows: at each time step *τ*, the walker makes a small turn *δ* clockwise with probability *a*(*θ*) or anticlockwise with probability *b*(*θ*), or continues in the same direction with probability 1−*a*(*θ*)−*b*(*θ*). Following a similar method to that used in §2.2 and taking the limit *δ*, *τ*→0 such that *δ* 2 /*τ* is constant yields the Fokker–Planck equation for the PDF *p*(*θ*, *t*) of travelling in direction *θ* at time *t*

By rediscretizing the observed paths and binning the data, Hill & Häder (1997) estimated using a similar method to Bovet & Benhamou (1988 as is essentially the same measure as sinuosity, see §3.3), and also the functional form of *μ*_{0}. For gyrotaxis, a sinusoidal reorientation model was found to fit well: *μ*_{0}(*θ*, *t*)=−sin (*θ*−*θ*_{0}(*t*))/*B*, where *θ*_{0}(*t*) is the preferred direction and *B* is the average time taken to reorient to the preferred direction. If both and *θ*_{0} are assumed to be constant then the steady state of the Fokker–Planck equation (3.10) is a von Mises distribution (box 1). This sinusoidal response was predicted by Kessler (1986) and can be explained by the gravitational torque that acts on the individual alga: the torque is zero if the alga is moving in the preferred direction or in the opposite direction, and is the greatest when the alga is moving perpendicular to the gravitational force. For phototaxis, a better fit was given by a simple linear response: *μ*_{0}(*θ*, *t*)=−(*θ*−*θ*_{0}(*t*))/*B*. This case is more generally applicable since there are only a few physical situations likely to result in a sinusoidal response. However, if both and *θ*_{0} are allowed to be variable, then the steady state of the Fokker–Planck equation (3.10) is more complicated (Hill & Häder 1997).

This ad hoc method used by Hill & Häder (1997) was tested using simulations by Codling & Hill (2005*a*) and found to be valid (allowing for smoothing errors not accounted for by Hill & Häder (1997) and assuming that sinuosity is low enough to avoid ‘wrapping’ problems leading to an underestimate of (Bovet & Benhamou 1988 Benhamou 2004)). Note that this method allows for both the functional form of the bias response to be estimated and the relevant parameters (mean reorientation time and angular diffusivity) to be quantified.

The situation is more difficult when the preferred direction changes with spatial location (and is thus individual dependent), a common scenario when animals are moving towards a target in space. The main problem in such a scenario is distinguishing between localized bias due to forward persistence and true biased movement towards a target. This is particularly true if there is more than one target or if the target moves in space. Benhamou (2006) suggested a new procedure based on the backward evolution of the beeline distance from the end of the path (the goal) to each animal's preceding locations. This procedure is efficient, as it requires only a small sample of short paths for detecting a possible orientation component, but is not perfect as there remains a relatively high (approx. 30%) probability of misidentifying a true CRW as a BCRW. This type I error can be reduced by considering together a number of paths assumed to be of the same kind.

The above scenarios assume that bias is introduced through reorientation towards the preferred direction. However, there are many other ways to model mechanisms that produce a directional drift, including the case where the mean turning rate is null and the directional bias comes about through variations in (see Benhamou 2006, and §3.8). For example, the classical *run-and-tumble* behaviour observed in chemotactic bacteria is usually modelled through a low rate of turning when moving in the preferred direction (runs) and a high rate of turning (tumbles) otherwise (Berg 1983). We discuss the relation between these various mechanisms in §3.8.

#### 3.5 The transport equation and general hyperbolic models of movement

In §2.7, it was shown how the telegraph process could be used in one dimension to model a CRW and how this resulted in a hyperbolic governing equation (box 3). A similar process does not produce a closed equation in two or more dimensions, but it is still possible to work with a generalized hyperbolic governing equation to model the CRW and BCRW (velocity jump processes) in higher dimensions. Othmer *et al*. (1988) introduced the idea of a governing mass balance equation, the *linear transport equation*, which can be used to describe the movement and reorientation of cells (and animals see also Alt 1980).

Let *p*(** x**,

**,**

*v**t*) be the density function for individuals moving in

*N*-dimensional space, where

**∈**

*x**N*is the location of an individual and

**∈**

*v**N*is its velocity. The total number density of individuals at location

**, regardless of velocity, is given by integrating**

*x**p*over all possible velocities

In general, we are interested in the first few velocity moments (as from these we can calculate the statistics of interest such as *E*(*X*_{t}) and *E*(*R*_{t} 2 )), including the number density *n*(** x**,

*t*) introduced in (3.11), and the average velocity

**(**

*u***,**

*x**t*), which is defined by

Integrating (3.12) over ** v** gives an evolution equation for

*n*in terms of

*u*It is worth noting that equation (3.12) describes a general process, of which many of the basic random walk models discussed previously are special cases. For example, in one spatial dimension, and under the assumption that individuals move with constant speed *v*, there are only two possible velocities: +*v* and −*v*. With the additional assumption that the turning frequency *λ* is constant, it can be shown that (3.13) and (3.14) reduce to equations (2.25) and (2.26) derived in §2.7, which lead to the one-dimensional telegraph equation.

The transport equation also provides a natural extension of the basic telegraph process to two dimensions, without the need to restrict the population to a lattice as in §2.10. Retaining the assumptions of constant turning frequency *λ* and constant speed *v*, an individual's velocity may be described simply by the angle *θ* between its direction of motion and the positive *x*_{1}-axis. The appropriate density function is now *p*(** x**,

*θ*,

*t*). The transport equation (3.12) reduces to

Othmer *et al*. (1988) used the following example to illustrate how a random walk in an external field can be modelled using (3.15). Suppose individuals are moving with a taxis-inducing gradient in the direction *θ*_{0}=0 (i.e. the positive *x*_{1}-direction), under the assumption that the gradient influences only the reorientation kernel *T*(*θ*, *θ*′). Suppose also that the reorientation kernel *T*(*θ*, *θ*′) is the sum of a symmetric probability distribution *h*(*ϕ*), where *ϕ*=*θ*−*θ*′, and a bias term *k*(*θ*) that results from the taxis-inducing gradient. Since the gradient is directed along the *x*_{1}-axis, the bias term takes its maximum at *θ*=0 and is symmetric about *θ*=0. Thus

Assuming that all individuals start at the origin (0, 0), with initial directions uniformly distributed around the unit circle, one can derive a system of differential equations, known as *moment equations*, for the statistics of interest. For the above choice of *T*(*θ*, *θ*′), this system is straightforward to solve (Othmer *et al*. 1988 Codling 2003). The mean velocity and mean location are given by

The choice of reorientation kernel in (3.16) is crucial in the above analysis as it results in a closed system of differential equations for the statistics of interest. Codling & Hill (2005*b*) used an arguably more realistic reorientation kernel based on a single symmetric distribution with mean turning angle dependent on the direction of movement (see §3.4). However, this results in a ‘cascade’ of higher moment equations and further assumptions to close the system need to be made.

A general discussion of the issues relating to the moment closure of systems of moment equations resulting from transport equations is given in Hillen (2002). The transport equation can be developed further to allow more complex scenarios to be modelled (spatially dependent parameters, etc.) although this is likely to make finding a closed-form solution more difficult. The hyperbolic movement model has been used in the place of diffusion models to create *reaction–transport* systems (Hillen 1996, 2002), which can be used in place of the classical reaction–diffusion models (Turing 1952 Murray 1993) used in pattern formation and developmental biology.

#### 3.6 Reinforced random walks

The types of processes whereby walkers modify the chemical environment of themselves and of other individuals in the population have led modellers of cell locomotion to employ the theory of RRWs, which allows the walker to modify (reinforce) the transition probabilities associated with the grid points, or interval, it traverses (Davis 1990). The most common way of representing a one-dimensional RRW is the so-called master equation

There are many possible models for the transition rates *T* ± in terms of the concentration *w*(*x*, *t*) of a control substance, such as the ‘local model’, ‘barrier model’ and ‘normalized barrier model’ proposed by Othmer & Stevens (1997). For a specific choice of transition rates, the continuum limit of the master equation (3.17) can often be found by a similar method to that used to obtain equation (2.5), and may usually be written in the form

In the local model of Othmer & Stevens (1997), the transition rates depend only on the local concentration of control substance: for some constant *ρ* and positive function *F*, called the transition probability function. Taking the limit *δ*→0 and *ρ*→∞, such that *D*=*ρδ* 2 is a constant, leads to a continuum limit equation (3.18) with diffusivity *d*(*w*)=*DF*(*w*) and chemotactic sensitivity *Χ*(*w*)=−*D*(d*F*/d*w*). Hence, as well as modulating the rate of random motility, the control substance provides a directional bias because the cell is more likely to move from an area of high motility to low motility than vice versa.

In the barrier model, the transition rates depend on the concentration of control substance in the interval to be traversed: . Hence, the cell can sense concentrations at *x*±*δ*/2 (so *δ* is the effective sensory range of the cell) and uses these to decide its direction of movement. The continuum limit equation is (3.18) with diffusivity *d*(*w*)=*Dτ*(*w*) and chemotactic sensitivity *Χ*(*w*)=0. Hence, the control substance has an effect of random motility, as in the local model, but there is no directional bias. This is expected because the transition probability from *x* to *x*+*δ*, determined by the control substance concentration at *x*+*δ*/2, is the same as the transition probability from *x*+*δ* to *x*, so there is no directional bias.

In the normalized barrier model, the transition rates are as in the barrier model, but are normalized so that *T* + +*T* − is a constant,

The actions of the control substance under each of these three models are summarized in table 1. Note that the normalized barrier model represents a control substance that is an attractant or repellent, whereas the barrier model represents an inducer or inhibitor. The local model allows both the motility and the directional bias to be varied. However, the constraint on the relationship between *d*(*w*) and *Χ*(*w*) in the local model means that the substance must either be an inducer and a repellent, or an inhibitor and an attractor, which is not always realistic. Motivated by this limitation, the normalized barrier model may be generalized to allow for a variable mean waiting time (VMWT). This may be achieved in two ways. Firstly, the transition rates take the form (3.19), but *ρ* is permitted to be a function of the control substance concentration *w* (Plank 2003). This results in a diffusivity of *d*(*w*)=*Dλ*(*w*) and a chemotactic sensitivity of , thus allowing both non-directional (random) and directional effects of the control substance to be specified independently. A second possible method is to combine the normalized and unnormalized barrier models by taking the transition rates as

Table 1 Summary of the biological interpretation of different models for the RRW transition rates, in terms of the non-directional (*d*(*w*)) and directional (*Χ*(*w*)) effects of the control substance.

Partial differential equations of the form (3.18) have been studied analytically for certain choices of the functions *Χ*(*w*) and *d*(*w*) and of the dynamics for *w*. For example, Keller & Segel (1971) obtained travelling wave solutions and Rascle & Ziti (1995) obtained similarity solutions in the case where chemotactic sensitivity is inversely related to *w*, and the control substance is consumed linearly by the cells. Othmer & Stevens (1997) considered three main cases of cell behaviour, which they termed ‘aggregation’ (meaning solutions in which the cell density aggregates to a finite positive value in certain regions), ‘blow-up’ (cell density tends to infinity in finite time) and ‘collapse’ (cell density tends to zero everywhere). Analogous results for the discrete RRW were originally shown by Davis (1990), and were linked to the continuum-level results by Plank (2003). Analytical conditions for the cases of aggregation, blow-up and collapse were developed by Levine & Sleeman (1997) and further theoretical work on mixed parabolic–hyperbolic equations was done by Takase & Sleeman (2002).

#### 3.7 Applications of reinforced random walks

The theory of RRWs has been presented in terms of one-dimensional walks for simplicity, but is straightforward to extend to higher dimensions. Furthermore, the actions of multiple control substances (*w*_{1}, …, *w*_{n}) can be modelled by taking the transition probability function *τ* to be of the form *F*(*w*)=*F*_{1}(*w*_{1}) … *F*_{n}(*w*_{n}) (Levine *et al*. 2001). Under the normalized barrier model, this leads to a sum of *n* chemotactic terms in the continuum limit equation (3.18).

RRWs have been used to model the movement of myxobacteria (Othmer & Stevens 1997) and the migration of endothelial cells during tumour-induced angiogenesis (growth of new blood vessels Levine *et al*. 2001 Sleeman & Wallis 2002 Plank & Sleeman 2003). These studies adopted the normalized barrier model, chiefly because this models a control substance that offers a directional stimulus (which is of prime importance for successful angiogenesis), but has no direct effects on cell motility. Nevertheless, the growth factors regulating angiogenesis do also influence cell motility and an extension to a VMWT model, which allows for both directional and non-directional effects, was proposed by Plank *et al*. (2004). A non-lattice RRW, similar to that outlined in §3.4, was subsequently used (Plank 2003 Plank & Sleeman 2004) to model angiogenesis in two and three dimensions. In addition to choosing an appropriate functional form for the reorientation rate *μ*_{0}(*θ*), one must define the preferred direction *θ*_{0}. In the case of a chemoattractant, the natural choice is to define *θ*_{0} to be in the direction of increasing chemical concentration (i.e. in the same direction as ∇*w*), which may be a function of the cell's location ** x** and time

*t*.

In models such as these, the control substance(s) may be assumed to be in a steady state, or the random walk model for cell movement may be coupled to one or more time-dependent PDEs for the concentration of the control substance(s). If the cells modulate their chemical environment by producing or consuming control substances, these PDEs will contain source/sink terms depending on cell density. In the case of angiogenesis, the basic cellular motion driving capillary growth is modelled by a random walk, but further rules must be specified to allow for capillary branching and looping (anastomosis). In addition to the transition probabilities, each cell has a probability, which may depend on local control substance concentration, of branching to create two daughter capillaries. If a capillary tip collides with an existing vessel, an anastomosis is formed and the colliding cell ceases to take part in the random walk.

An alternative to the RRW approach described above is to take a continuum Fokker–Planck equation of the form (3.18) as the starting point. This may be discretized using standard techniques to obtain a finite-difference equation of the form (2.4) (with the obvious extension to higher dimensions), which can be used to identify the relevant transition probabilities of moving left and right, and staying still. This method was developed by Anderson *et al*. (1997) to model nematode movement and has since been used to study tumour angiogenesis (Anderson & Chaplain 1998) and tumour cell invasion (Anderson *et al*. 2000), and to distinguish between the inhibitory and repellent effects of a signalling molecule called ‘Slit’ in experimental work (Cai *et al*. 2006). The random walk model for angiogenesis has also been coupled to a flow model to study the effects of blood flow and nutrient delivery in a nascent capillary network (McDougall *et al*. 2006).

Both the ‘bottom-up’ approach of the RRW and the ‘top-down’ approach of discretizing a PDE have the same governing equation (3.18) in the continuum limit. However, the two methods will, in general, result in random walks with different transition probabilities, illustrating the fact that, usually, there is not a unique random walk model corresponding to a given continuum equation. The RRW method has the advantage that the transition probabilities are derived mechanistically from the underlying biology, using a transition probability model of the type summarized in table 1, rather than via a mathematical discretization and normalization procedure (see Plank & Sleeman 2004, for details).

#### 3.8 Biological orientation mechanisms

There are two main types of mechanisms for movement in response to a stimulus. Kinesis refers to the situation where the organism samples only the stimulus intensity at a single point and modulates its speed of movement (orthokinesis or *O*-kinesis) or its path sinuosity (klinokinesis or *K*-kinesis) accordingly. The terms *O*-kinesis and *K*-kinesis were first defined by Gunn *et al*. (1937) in terms of linear and angular speeds, but the redefinition of *K*-kinesis in purely spatial terms (Benhamou & Bovet 1989), by using sinuosity rather than rate of change of direction, has led to a much clearer view of the properties of kineses. By contrast, taxis is where the organism is able to detect a preferential direction of movement, and bias its turns accordingly, without necessarily altering its overall speed of movement or rate of turning.

The random walks described above are all couched in terms of transition rates, which are implicitly assumed to be under the full control of the migrating organism. This raises questions concerning the information that the organism requires in order to be able to exert this control, and the probable realism of assuming that such information is available. In the case of the barrier-type models (see §3.6), it is clear that the walker needs at least two sensors in order to compute the transition rates ‘on the spot’ (i.e. without moving) this type of mechanism is referred to as tropotaxis. The same is also true for the non-lattice random walk (§3.5) if the walker needs to resolve the target direction *θ*_{0}, for example by measuring ∇*w*. The same taxis may also be produced if the organism has only one sensor (or many sensors that are too close together to allow it reliably to detect differences), but moves its body in various directions to sample the local variations in stimulus this is termed klinotaxis. In its simplest form, taxis does not incorporate a correlation between successive step directions, and hence corresponds to a BRW: individuals directly settle their local direction with respect to the target direction. More realistic taxis models can be obtained in the form of BCRW by introducing such a directional correlation (Benhamou & Bovet 1992 Benhamou 1994). The result of the taxis is then determined by the relative weights of the local directional bias (persistence, which controls the motility) and the global directional bias (goal attractiveness, which controls the advection), and by the level of random noise in the system.

By contrast, the local model in §3.6 comes under the category of kinesis, as opposed to taxis, because the walker measures only the stimulus intensity at a single point in order to compute the transition rates. Although the probabilities of moving in different directions are always equal at any given time, this mechanism results in an effective directional bias towards areas of low motility, illustrating the fact that a kinetic mechanism can produce a directional bias.

There are two working modes of kinesis (Benhamou & Bovet 1989): absolute (A) and differential (D). The movement is controlled, in A mode, with respect to the local stimulus intensity experienced at any given location or, in D mode, with respect to the change in stimulus intensity experienced during a step (e.g. through sensory adaptation). Both AO-kinesis and AK-kinesis are space-use mechanisms for exploiting patchy environments, enabling the animal to reduce locally the diffusion coefficient (*D*=*v*/*S* 2 ) of its movement (by decreasing mean speed *v* or increasing the sinuosity *S*) in the most suitable areas (Benhamou & Bovet 1989). Unlike AO-kinesis, however, AK-kinesis is able, when applied to a gradient field, to generate a slight drift in the gradient direction based on simple differences in angular diffusivity , with a null mean turning rate (*μ*_{0}=0). This property was used by Jamon & Bovet (1987) to account for the homing behaviour of mice. However, the advection component is very weak (and accordingly most animals get lost), so that AK-kinesis cannot be considered as an efficient orientation mechanism. By contrast, DK-kinesis can reach 90% of the efficiency of a pure taxis (BRW although taxis with persistence can outperform it in a noisy environment because persistence can be used to smooth the noise (Benhamou & Bovet 1992)). Indeed the so-called chemotaxis of bacteria (Alt 1980 Berg 1983) is mainly a DK-kinesis where the sinuosity is modulated through step length rather than turning angle variance (called a run-and-tumble mechanism). Finally, DO-kinesis seems to have no biological applications.

### 4. Conclusion

The field of random walks is a large and growing area of applied mathematics that is being increasingly used to model biological systems, notably in ecology (animal movements) and pathophysiology (cell movements in, for example, blood vessel formation and cancer cell invasion). In this review paper, the fundamental mathematical theory behind the unbiased and biased, and uncorrelated and CRWs has been developed. Limitations and extensions of these basic models have been discussed, and the progress and pitfalls associated with the application of random walk models to biological scenarios have been examined. A notable advantage of random walk models lies in their ability to distinguish, in a systematic way, underlying mechanisms (such as persistence, kineses and taxes) from observed data, in a way that would not be possible without the insight that rigorous mathematics provides. As a result of this, understanding of the various movement mechanisms that occur in nature has been greatly improved.

Research in the area of random walks is far from complete. There remains a wealth of mathematical problems relating to random walks that have yet to be solved (for example, formulae for the MSD and MDD in a general BCRW) and, of course, an almost endless supply of biological systems that are amenable to modelling using random walk techniques. We have discussed very simple environmental interactions (relating to simple changes in transition probabilities in a RRW) but, in reality, most animals (and many micro-organisms) are highly developed and able to interact extensively with their environment to optimize search strategies. Furthermore, most of the simple models discussed here implicitly assume homogeneous (or at most very simple heterogeneous) environments, whereas most real environments are highly complicated with barriers and differential terrain over all three dimensions that will affect movement behaviour and speed (Vuilleumier & Metzger 2006). Distinguishing between changes in behaviour due to environmental or spatial interactions simply by observing and analysing movements will always be difficult without further biological information. In general, we have considered only movements at an individual level, with population-level effects being subsequently extrapolated under the assumption that there are no interactions between individuals. However, in reality, these interactions can have an important effect on the overall behaviour and subsequent dispersal or orientation of a population (Couzin *et al*. 2005 Codling *et al*. 2007). Statistical techniques for the analysis of simple CRWs, for instance by measuring their MSD or their tortuosity, are relatively well developed, but there is little work in this area for the more general case of reinforced random walks. This is a potential avenue for future research.

One of the main issues that is causing many notable problems in the literature is the confusion between observed *pattern* and the underlying *process* that generated it. For example, Benhamou (2006) illustrated the high error rate in distinguishing between localized and global bias, while Parrish *et al*. (2002) and Codling *et al*. (2007) showed that it was very difficult to distinguish between individuals independently orienting towards a common target and an interacting group. Another example is aggregation, i.e. the increase in animal density in some places. This has often been considered as a mechanism, but is just a pattern that can be generated by two kinds of mechanisms: exploitation mechanisms (e.g. AO- and AK-kinesis) by which the animals spend more time in certain places and directional mechanisms (e.g. DK-kinesis and taxis) by which animals orient towards these places. This confusion between pattern and process has arguably led to many of the results in the literature (e.g. Viswanathan *et al*. 1996, 2000) interpreting animal movement as a Lévy process, a different class of random walk outside the scope of this review paper. Recent studies (Benhamou 2007 Edwards *et al*. 2007 Plank & James in press) have suggested that the Lévy model is less applicable than first thought and hence a better modelling approach may be to use extensions of the random walk models discussed here.