Robinson–Foulds metric

The Robinson–Foulds metric, also known as the symmetric difference metric, is a way to measure the distance between unrooted phylogenetic trees. It is defined as A + B, where A is the number of partitions of data implied by the first tree but not the second, and B is the number of partitions of data implied by the second tree but not the first. The partitions are calculated for each tree by removing each branch in turn, so the number of eligible partitions for a tree is equal to the number of branches in that tree.
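The definition can be illustrated with a minimal Python sketch. It assumes that both trees carry the same set of taxa and that a tree is written as nested tuples of leaf names; this representation, and the function names `leaf_set`, `splits` and `robinson_foulds`, are illustrative choices rather than part of any library. Branches leading to a single leaf are skipped, because they induce the same partition in every tree on the same taxa and so never contribute to the distance.

```python
def leaf_set(node):
    """Return the set of leaf names found at or below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Return the partitions induced by the internal branches of a tree."""
    everything = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return
        below = leaf_set(node)
        above = everything - below
        # Keep only partitions with at least two taxa on each side.
        if len(below) > 1 and len(above) > 1:
            # Storing both sides as an unordered pair makes the
            # partition independent of where the tuple is "rooted".
            found.add(frozenset((below, above)))
        for child in node:
            visit(child)

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """A + B: partitions found in one tree but not in the other."""
    a, b = splits(tree_a), splits(tree_b)
    return len(a - b) + len(b - a)


tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "C"), "B", ("D", "E"))
print(robinson_foulds(tree_1, tree_2))  # 2
```

Here the two trees share the partition DE|ABC, while AB|CDE occurs only in the first and AC|BDE only in the second, giving A = 1 and B = 1.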

Explanation

Given two unrooted trees of nodes and a set of labels (i.e., taxa) for each node (which may be empty, but only for nodes of degree three or more), the Robinson–Foulds metric counts the $\alpha$ and $\alpha ^{-1}$ operations needed to convert one tree into the other. This number of operations defines their distance. The authors define two trees to be the same if they are isomorphic and the isomorphism preserves the labeling. The proof is built on a function called $\alpha$ , which contracts an edge (merging its two end nodes and taking the union of their label sets). Conversely, $\alpha ^{-1}$ expands an edge (decontraction), where the label set can be split in any fashion.

The $\alpha$ function is first applied to remove all edges from $T_{1}$ that are not in $T_{2}$ , creating $T_{1}\wedge T_{2}$ ; $\alpha ^{-1}$ is then used to add the edges found only in $T_{2}$ to the tree $T_{1}\wedge T_{2}$ to build $T_{2}$ . The total number of operations across these two procedures equals the number of edges in $T_{1}$ that are not in $T_{2}$ plus the number of edges in $T_{2}$ that are not in $T_{1}$ . Together, the operations transform $T_{1}$ into $T_{2}$ , or vice versa.
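The count described above can be summarised in one formula. The notation $E(T)$ for the set of partitions induced by the edges of a tree $T$ is introduced here for illustration and is not taken from the original paper:

$$d(T_{1},T_{2}) = \underbrace{|E(T_{1})\setminus E(T_{2})|}_{\text{contractions } \alpha} + \underbrace{|E(T_{2})\setminus E(T_{1})|}_{\text{expansions } \alpha ^{-1}}$$

As a small worked case, suppose $T_{1}$ has internal edges inducing $AB|CDE$ and $DE|ABC$ , while $T_{2}$ has internal edges inducing $AC|BDE$ and $DE|ABC$ . One contraction removes $AB|CDE$ , leaving $T_{1}\wedge T_{2}$ with the single internal edge $DE|ABC$ , and one expansion adds $AC|BDE$ , so $d(T_{1},T_{2}) = 1 + 1 = 2$ .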

Properties

The RF distance corresponds to an equivalent similarity metric that reflects the resolution of the strict consensus of two trees, first used to compare trees in 1980.[1]
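The relationship between the distance and the shared resolution can be sketched as follows. The strict consensus of two trees retains exactly the partitions present in both, so the number of shared partitions acts as a similarity score, and the distance follows from it by simple set arithmetic. The helper `split` and the two example trees are hypothetical and serve only to illustrate the identity.

```python
def split(side, taxa):
    """Build a partition from the taxa on one side of a branch."""
    side = frozenset(side)
    return frozenset((side, frozenset(taxa) - side))


taxa = "ABCDEF"
# Internal branches of ((A,B),C),(D,(E,F)) and of ((A,B),D),(C,(E,F)).
tree_1 = {split("AB", taxa), split("ABC", taxa), split("EF", taxa)}
tree_2 = {split("AB", taxa), split("ABD", taxa), split("EF", taxa)}

shared = tree_1 & tree_2   # partitions kept by the strict consensus
rf = len(tree_1 ^ tree_2)  # symmetric difference of the two sets

# Distance and similarity carry the same information.
assert rf == len(tree_1) + len(tree_2) - 2 * len(shared)
print(len(shared), rf)  # 2 2
```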

In their 1981 paper, Robinson and Foulds proved that the distance is in fact a metric.

Algorithms for computing the metric

In 1985, Day gave an algorithm based on perfect hashing that computes this distance in time linear in the number of nodes in the trees. A randomized algorithm that uses hash tables that are not necessarily perfect has been shown to approximate the Robinson–Foulds distance with a bounded error in sublinear time.
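The general idea of hashing partitions can be illustrated with a short sketch. This is neither Day's algorithm nor the published randomized algorithm; it is a simplified illustration, under the assumption that each taxon receives a random 64-bit code and that a partition is identified by the XOR of the codes on one side. Each tree is processed in a single traversal, and the result can be wrong in the unlikely event of a hash collision. Trees are assumed to be nested tuples of leaf names, and the function names are hypothetical.

```python
import random


def split_hashes(tree, leaf_code):
    """Hash every internal branch of a tree in one post-order pass."""
    total = 0
    for code in leaf_code.values():
        total ^= code
    n_leaves = len(leaf_code)
    hashes = set()

    def visit(node):
        # Returns (XOR of leaf codes below node, number of leaves below).
        if isinstance(node, str):
            return leaf_code[node], 1
        code, count = 0, 0
        for child in node:
            child_code, child_count = visit(child)
            code ^= child_code
            count += child_count
        # Skip partitions with fewer than two taxa on either side.
        if 1 < count < n_leaves - 1:
            # Both sides of a branch must map to the same value.
            hashes.add(min(code, total ^ code))
        return code, count

    visit(tree)
    return hashes


def hashed_rf(tree_a, tree_b, leaves, seed=0):
    """Estimate the distance by comparing sets of partition hashes."""
    rng = random.Random(seed)
    leaf_code = {name: rng.getrandbits(64) for name in leaves}
    a = split_hashes(tree_a, leaf_code)
    b = split_hashes(tree_b, leaf_code)
    return len(a ^ b)


tree_1 = (("A", "B"), "C", ("D", "E"))
tree_2 = (("A", "C"), "B", ("D", "E"))
print(hashed_rf(tree_1, tree_2, "ABCDE"))
```

The recursive traversal is adequate for small examples; very deep trees would need an iterative traversal to stay within Python's recursion limit.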

Specific applications

In phylogenetics, the metric is often used to compute a distance between two trees. The treedist program in the PHYLIP suite offers this function, as do the RAxML_standard package, the DendroPy Python library (under the name "symmetric difference metric"), and the R package phangorn (treedist function). For comparing groups of trees, the fastest implementations include HashRF and MrsRF.

The Robinson–Foulds metric has also been used in quantitative comparative linguistics to compute distances between trees that represent how languages are related to each other.

Shortcomings

The RF metric suffers from a number of shortcomings:[2]

Relative to other metrics, it is imprecise; it can take two fewer distinct values than there are taxa in a tree.[2]

It is rapidly saturated; very similar trees can be allocated the maximum distance value.[2]

Its value can be counterintuitive. One example is that moving a tip and its neighbour to a particular point on a tree generates a _lower_ difference value than if just one of the two tips were moved to the same place.[2]

Its range of values can depend on tree shape: trees that contain many uneven partitions will command relatively lower distances, on average, than trees with many even partitions.[2]
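The first two points can be made concrete with a small sketch. For two binary trees on n taxa, each tree has n − 3 internal branches, so the distance is an even number between 0 and 2(n − 3), which gives only n − 2 possible values. The example below assumes "caterpillar" (ladder-shaped) trees, described simply by the order of their leaves along the backbone; the function name `caterpillar_splits` is hypothetical. Moving a single leaf from one end of the tree to the other already yields the maximum distance.

```python
def caterpillar_splits(order):
    """Partitions induced by the internal branches of a caterpillar tree."""
    taxa = frozenset(order)
    return {
        frozenset((frozenset(order[:i]), taxa - frozenset(order[:i])))
        for i in range(2, len(order) - 1)
    }


leaves = list("ABCDEFGH")
original = caterpillar_splits(leaves)
# Same tree with leaf A pruned and reattached at the opposite end.
moved = caterpillar_splits(leaves[1:] + leaves[:1])

rf = len(original ^ moved)
maximum = 2 * (len(leaves) - 3)
print(rf, maximum)  # 10 10
```

Every partition in the original tree places A with its former neighbours, and none of these groupings survives the move, so the two trees share no partitions even though they differ in the position of only one leaf.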

The first two of these issues can be addressed by using less conservative metrics such as the "Generalized RF distance" or the Matching Splits / Matching Clades measures, which aim to reward similar (but not quite identical) groupings shared between two trees. The plain Robinson–Foulds distance takes no account of how similar two groupings are: if they are not identical, both are counted as entirely different.[3]

There is an argument, however, that partitions are not the best basis for tree comparison, and that other metrics, such as the quartet distance or path difference, would be preferable.[2]

Software implementations

Language/Program	Function
R	`dist.dendlist(dendlist(x,y))` from dendextend
Python	`tree_1.robinson_foulds(tree_2)` from ete3

References

Schuh, R. T. & Polhemus, J. T. (1980). "Analysis of taxonomic congruence among morphological, ecological and biogeographic data sets for the Leptopodomorpha (Hemiptera)". Systematic Biology. 29 (1): 1–26. doi:10.1093/sysbio/29.1.1. ISSN 1063-5157.
Smith, Martin R. (2019). "Bayesian and parsimony approaches reconstruct informative trees from simulated morphological datasets" (PDF). Biology Letters. 15 (2). 20180632. doi:10.1098/rsbl.2018.0632. PMC 6405459. PMID 30958126.
- Böcker, S., Canzar, S. & Klau, G. W. (2013). "The generalized Robinson-Foulds metric". In Darling, A. & Stoye, J. (eds.). Algorithms in Bioinformatics. WABI 2013. Lecture Notes in Computer Science. Vol. 8126. Berlin, Heidelberg: Springer. pp. 156–169.
- Bogdanowicz, D. & Giaro, K. (2012). "Matching split distance for unrooted binary phylogenetic trees". IEEE/ACM Trans. Comput. Biol. Bioinforma. 9: 150–160.
- Bogdanowicz, D. & Giaro, K. (2013). "On a matching distance between rooted phylogenetic trees". Int. J. Appl. Math. Comput. Sci. 23: 669–684.
- Nye, T. M. W., Liò, P. & Gilks, W. R. (2006). "A novel algorithm and web-based tool for comparing two alternative phylogenetic trees". Bioinformatics. 22: 117–119.