No free lunch theorem

In mathematical folklore, the "no free lunch" (NFL) theorem (sometimes pluralized) of David Wolpert and William Macready appears in their 1997 paper "No Free Lunch Theorems for Optimization".[1] Wolpert had previously derived no free lunch theorems for machine learning (statistical inference).[2]

In 2005, Wolpert and Macready themselves indicated that the first theorem in their paper "state[s] that any two optimization algorithms are equivalent when their performance is averaged across all possible problems".[3] The 1997 theorems of Wolpert and Macready are mathematically technical.

The folkloric "no free lunch" (NFL) theorem is an easily stated and easily understood consequence of theorems Wolpert and Macready actually prove. It is weaker than the proven theorems, and thus does not encapsulate them. Various investigators have extended the work of Wolpert and Macready substantively. See No free lunch in search and optimization for treatment of the research area.

While some scholars argue that NFL conveys important insight, others argue that NFL is of little relevance to machine learning research.[4][5]

Example

To find the highest point on Earth, Alice uses a steepest-ascent local search which restarts when a local peak is fully climbed. Bob uses a steepest-descent local search which restarts when it bottoms out at a local trough. On our Earth, Alice will find the highest point much faster than Bob will. However, in some ensembles, each world such as ours can be paired 1-to-1 with a hypothetical "Bobworld" which is identical to ours, except that the elevations of the peak of Mount Everest and of the lowest point in the Marianas Trench are swapped, as if a tall pole had been stuck in the Trench. In Bobworld, Bob's strategy of descending outperforms Alice's strategy of climbing; in fact, given further 1-to-1 pairing assumptions that are reasonable in a universe of bounded size, neither Alice's strategy nor Bob's strategy performs better on average, in the absence of some systematic natural bias towards worlds like Earth and away from worlds like Bobworld.[4]

Original NFL theorems

Wolpert and Macready give two NFL theorems that are closely related to the folkloric theorem. In their paper, they state:

We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.[1]

The first theorem hypothesizes objective functions that do not change while optimization is in progress, and the second hypothesizes objective functions that may change.[1]

Theorem 1: For any algorithms $a_1$ and $a_2$, at iteration step $m$

$$\sum_f P(d_m^y \mid f, m, a_1) = \sum_f P(d_m^y \mid f, m, a_2),$$

where $d_m^y$ denotes the ordered set of size $m$ of the cost values $y$ associated to input values $x \in X$, $f \colon X \to Y$ is the function being optimized and $P(d_m^y \mid f, m, a)$ is the conditional probability of obtaining a given sequence of cost values from algorithm $a$ run $m$ times on function $f$.

The theorem can be equivalently formulated as follows:

Theorem 1: Given a finite set $V$ and a finite set $S$ of real numbers, assume that $f \colon V \to S$ is chosen at random according to the uniform distribution on the set $S^V$ of all possible functions from $V$ to $S$. For the problem of optimizing $f$ over the set $V$, no algorithm performs better than blind search.

Here, blind search means that at each step of the algorithm, the element $v \in V$ is chosen at random with uniform probability distribution from the elements of $V$ that have not been chosen previously.

In essence, this says that when all functions f are equally likely, the probability of observing an arbitrary sequence of m values in the course of optimization does not depend upon the algorithm. In the analytic framework of Wolpert and Macready, performance is a function of the sequence of observed values (and not e.g. of wall-clock time), so it follows easily that all algorithms have identically distributed performance when objective functions are drawn uniformly at random, and also that all algorithms have identical mean performance. But identical mean performance of all algorithms does not imply Theorem 1, and thus the folkloric theorem is not equivalent to the original theorem.
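Theorem 1 can be checked directly by exhaustive enumeration on a toy problem. The sketch below is only illustrative: the four-point search space, the three cost values, and the two deterministic non-revisiting algorithms (a greedy nearest-neighbour rule and a fixed left-to-right sweep) are assumptions chosen for brevity, not constructions from Wolpert and Macready's paper. Summing over every possible objective function, both algorithms produce exactly the same histogram of observed cost-value sequences.

```python
from itertools import product
from collections import Counter

X = [0, 1, 2, 3]   # finite search space
Y = [0, 1, 2]      # finite set of cost values
m = 3              # number of distinct points each algorithm evaluates

def greedy_neighbour(f, m):
    """Non-revisiting search: evaluate the unvisited point nearest to the best
    point seen so far (an arbitrary deterministic rule)."""
    visited, values = [0], [f[0]]
    while len(visited) < m:
        best = visited[values.index(max(values))]
        nxt = min((x for x in X if x not in visited), key=lambda x: abs(x - best))
        visited.append(nxt)
        values.append(f[nxt])
    return tuple(values)

def left_to_right(f, m):
    """Non-revisiting search: evaluate points in the fixed order 0, 1, 2, ..."""
    return tuple(f[x] for x in X[:m])

# Histogram the observed cost-value sequences d_m^y over every objective f: X -> Y.
hist_1, hist_2 = Counter(), Counter()
for f in product(Y, repeat=len(X)):
    hist_1[greedy_neighbour(f, m)] += 1
    hist_2[left_to_right(f, m)] += 1

assert hist_1 == hist_2    # identical distributions, as Theorem 1 asserts
print(len(hist_1), "distinct sequences, same counts for both algorithms")
```

The tie is structural: because neither algorithm revisits a point, an observed sequence of m values fixes f on exactly the m points visited and leaves the other |X| - m points free, so every sequence arises from |Y|^(|X| - m) functions regardless of which algorithm produced it.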

Theorem 2 establishes a similar, but "more subtle", NFL result for time-varying objective functions.[1]

Motivation

The NFL theorems were explicitly not motivated by the question of what can be inferred (in the case of NFL for machine learning) or found (in the case of NFL for search) when the "environment is uniform random". Rather, uniform randomness was used as a tool to compare the number of environments for which algorithm A outperforms algorithm B with the number of environments for which B outperforms A. NFL tells us that (appropriately weighted) there are just as many environments in both of those sets.

This is true for many definitions of what precisely an "environment" is. In particular, there are just as many prior distributions (appropriately weighted) in which learning algorithm A beats B (on average) as vice versa. This statement about sets of priors is what is most important about NFL, not the fact that any two algorithms perform equally for the single, specific prior distribution that assigns equal probability to all environments.

While NFL is important for understanding the fundamental limitations over a set of problems, it says nothing about each particular problem instance that can arise in practice. That is, NFL states exactly what its mathematical statements say and nothing more. For example, it applies to situations where the algorithm is fixed first and nature can then choose a worst-case problem instance for that fixed algorithm. Therefore, if we have a "good" problem in practice, or if we can choose a "good" learning algorithm for a given particular problem instance, then NFL imposes no limitation on this particular problem instance; see for example.[6] To understand the NFL results alongside results from other papers that may seem to contradict them, it is important to follow the mathematical logic of NFL rather than an intuitive notion of it; the NFL results and results such as those of[6] are in fact consistent.

Implications for computing and for the scientific method

To illustrate one of the counter-intuitive implications of NFL, suppose we fix two supervised learning algorithms, C and D. We then sample a target function f to produce a set of input-output pairs, d. How should we choose whether to train C or D on d, in order to make predictions for what output would be associated with a point lying outside of d?

It is common in almost all of science and statistics to answer this question - to choose between C and D - by running cross-validation on d with those two algorithms. In other words, to decide whether to generalize from d with either C or D, we see which of them has better out-of-sample performance when tested within d.

Note that since C and D are fixed, this use of cross-validation to choose between them is itself an algorithm, i.e., a way of generalizing from an arbitrary dataset. Call this algorithm A. (Arguably, A is a simplified model of the scientific method itself.)

Note as well though that we could also use anti-cross-validation to make our choice. In other words, we could choose between C and D based on which has worse out-of-sample performance within d. Again, since C and D are fixed, this use of anti-cross-validation is itself an algorithm. Call that algorithm B.

NFL tells us (loosely speaking) that B must beat A on just as many target functions (and associated datasets d) as A beats B. In this very specific sense, the scientific method will lose to the "anti" scientific method just as readily as it wins.[7]
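A brute-force enumeration makes this concrete. The sketch below is a toy stand-in, not Wolpert's construction: a five-point Boolean domain, a majority-label learner playing the role of C, a minority-label learner playing the role of D, and leave-one-out error inside d playing the role of cross-validation. Algorithm A picks whichever learner has the lower leave-one-out error, algorithm B whichever has the higher; counted over every possible target function, they predict the held-out point correctly equally often.

```python
from itertools import product

train_x = [0, 1, 2, 3]   # training inputs (identified by index)
test_x = 4               # the held-out, off-training-set input

def learner_C(labels):
    """Predict the majority label of the training labels (ties go to 0)."""
    return int(sum(labels) > len(labels) / 2)

def learner_D(labels):
    """Predict the minority label of the training labels (ties go to 1)."""
    return 1 - learner_C(labels)

def loo_error(learner, labels):
    """Leave-one-out error of a learner, measured inside the training set d."""
    return sum(learner(labels[:i] + labels[i + 1:]) != labels[i]
               for i in range(len(labels)))

wins_A = wins_B = 0
for f in product([0, 1], repeat=len(train_x) + 1):    # every target function
    labels = [f[x] for x in train_x]                  # the dataset d
    truth = f[test_x]                                 # label of the held-out point
    # Algorithm A (cross-validation): trust the learner with the LOWER LOO error.
    # Algorithm B (anti-cross-validation): trust the learner with the HIGHER LOO error.
    cv_pick = (learner_C
               if loo_error(learner_C, labels) <= loo_error(learner_D, labels)
               else learner_D)
    anti_pick = learner_D if cv_pick is learner_C else learner_C
    wins_A += cv_pick(labels) == truth
    wins_B += anti_pick(labels) == truth

print(wins_A, wins_B)   # 16 16 -- averaged over all targets, A and B tie exactly
```

The tie is forced: for each labelling of the training inputs, the learner chosen by A (or by B) makes a fixed prediction, and the held-out label is 0 for exactly half of the target functions consistent with that labelling and 1 for the other half.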

However, note that NFL only applies if the target function is chosen from a uniform distribution of all possible functions. If this is not the case, and certain target functions are more likely to be chosen than others, then A may perform better than B overall. The contribution of NFL is that it tells us choosing an appropriate algorithm requires making assumptions about the kinds of target functions the algorithm is being used for. With no assumptions, no "meta-algorithm", such as the scientific method, performs better than random choice.

While some scholars argue that NFL conveys important insight, others argue that NFL is of little relevance to machine learning research.[4][5] If Occam's razor is correct, for example if sequences of lower Kolmogorov complexity are more probable than sequences of higher complexity, then (as is observed in real life) some algorithms, such as cross-validation, perform better on average on practical problems (when compared with random choice or with anti-cross-validation).[8]

Notes

  1. Wolpert, D.H., and Macready, W.G. (1997), "No Free Lunch Theorems for Optimization", IEEE Transactions on Evolutionary Computation 1, 67.
  2. Wolpert, David (1996), "The Lack of A Priori Distinctions between Learning Algorithms", Neural Computation, pp. 1341-1390.
  3. Wolpert, D.H., and Macready, W.G. (2005), "Coevolutionary free lunches", IEEE Transactions on Evolutionary Computation, 9(6): 721-735.
  4. Whitley, Darrell, and Jean Paul Watson (2005), "Complexity theory and the no free lunch theorem", in Search Methodologies, pp. 317-339, Springer, Boston, MA.
  5. Giraud-Carrier, Christophe, and Foster Provost (2005), "Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper?", in Proceedings of the ICML-2005 Workshop on Meta-learning, pp. 12-19.
  6. Kawaguchi, K., Kaelbling, L.P., and Bengio, Y. (2017), "Generalization in deep learning", https://arxiv.org/abs/1710.05468
  7. Wolpert, D.H. (2013), "What the no free lunch theorems really mean", Ubiquity, Volume 2013, December 2013, DOI: 10.1145/2555235.2555237.
  8. Lattimore, Tor, and Marcus Hutter (2013), "No free lunch versus Occam's razor in supervised learning", in Algorithmic Probability and Friends: Bayesian Prediction and Artificial Intelligence, pp. 223-235, Springer, Berlin, Heidelberg.