Undefined behavior

In computer programming, undefined behavior (UB) is the result of executing computer code whose behavior is not prescribed by the language specification to which the code adheres, for the current state of the program. This happens when the translator of the source code makes certain assumptions, but these assumptions are not satisfied during execution.

The behavior of some programming languages—most famously C and C++—is undefined in some cases.[1] In the standards for these languages the semantics of certain operations is described as undefined. These cases typically represent unambiguous bugs in the code, for example indexing an array outside of its bounds. An implementation is allowed to assume that such operations never occur in correct standard-conforming program code. In the case of C/C++, the compiler is allowed to give a compile-time diagnostic in these cases, but is not required to: the implementation will be considered correct whatever it does in such cases, analogous to don't-care terms in digital logic. It is the responsibility of the programmer to write code that never invokes undefined behavior, although compiler implementations are allowed to issue diagnostics when this happens. This assumption can make various program transformations valid or simplify their proof of correctness, giving flexibility to the implementation. As a result, the compiler can often make more optimizations. It also allows more compile-time checks by both compilers and static program analysis.

In the C community, undefined behavior may be humorously referred to as "nasal demons", after a comp.std.c post that explained undefined behavior as allowing the compiler to do anything it chooses, even "to make demons fly out of your nose".[2] Under some circumstances there can be specific restrictions on undefined behavior. For example, the instruction set specifications of a CPU might leave the behavior of some forms of an instruction undefined, but if the CPU supports memory protection then the specification will probably include a blanket rule stating that no user-accessible instruction may cause a hole in the operating system's security; so an actual CPU would be permitted to corrupt user registers in response to such an instruction, but would not be allowed to, for example, switch into supervisor mode.

Benefits

Documenting an operation as undefined behavior allows compilers to assume that this operation will never happen in a conforming program. This gives the compiler more information about the code and this information can lead to more optimization opportunities.

An example for the C language:

int foo(unsigned char x)
{
     int value = 2147483600; /* assuming 32 bit int */
     value += x;
     if (value < 2147483600)
        bar();
     return value;
}

The value of x cannot be negative and, given that signed integer overflow is undefined behavior in C, the compiler can assume that at the line of the if check value >= 2147483600. Thus the if and the call to the function bar can be ignored by the compiler since the if has no side effects and its condition will never be satisfied. The code above is therefore semantically equivalent to:

int foo(unsigned char x)
{
     int value = 2147483600;
     value += x;
     return value;
}

Had the compiler been forced to assume that signed integer overflow has wraparound behavior, then the transformation above would not have been legal.

Such optimizations become hard to spot by humans when the code is more complex and other optimizations, like inlining, take place.

Another benefit from allowing signed integer overflow to be undefined is that it makes it possible to store and manipulate a variable's value in a processor register that is larger than the size of the variable in the source code. For example, if the type of a variable as specified in the source code is narrower than the native register width (such as "int" on a 64-bit machine, a common scenario), then the compiler can safely use a signed 64-bit integer for the variable in the machine code it produces, without changing the defined behavior of the code. If the behavior of a 32-bit integer under overflow conditions was depended upon by the program, then a compiler would have to insert additional logic when compiling for a 64-bit machine, because the overflow behavior of most machine code instructions depends on the register width.[3]

A further important benefit of undefined signed integer overflow is that it enables, though does not require, such erroneous overflows to be detected at compile-time or by static program analysis, or by run-time checks such as the Clang and GCC sanitizers and valgrind; if such overflow was defined with a valid semantics such as wrap-around then compile-time checks would not be possible.

Risks

C and C++ standards have several forms of undefined behavior throughout, which offers increased liberty in compiler implementations and compile-time checks at the expense of undefined run-time behavior if present. In particular, there is an appendix section dedicated to a non-exhaustive listing of common sources of undefined behavior in C.[4] Moreover, compilers are not required to diagnose code that relies on undefined behavior, due to current static analysis limitations. Hence, it is common for programmers, even experienced ones, to unintentionally rely on undefined behavior either by mistake, or simply because they are not well-versed in the rules of the language that can span over hundreds of pages. This can result in bugs that are exposed when optimizations are enabled on the compiler, or when a compiler of a different vendor or version is used. Testing or fuzzing with dynamic undefined behavior checks enabled, e.g. the Clang sanitizers, can help to catch undefined behavior not diagnosed by the compiler or static analyzers[5].

In scenarios where security is critical, undefined behavior can lead to security vulnerabilities in software. When GCC's developers changed their compiler in 2008 such that it omitted certain overflow checks that relied on undefined behavior, CERT issued a warning against the newer versions of the compiler.[6] Linux Weekly News pointed out that the same behavior was observed in PathScale C, Microsoft Visual C++ 2005 and several other compilers;[7] the warning was later amended to warn about various compilers.[8]

Examples in C and C++

The major forms of undefined behavior in C can be broadly classified as [9]: spatial memory safety violations, temporal memory safety violations, integer overflow, strict aliasing violations, alignment violations, unsequenced modifications, data races, and loops that neither perform I/O nor terminate.

In C the use of any automatic variable before it has been initialized yields undefined behavior, as does integer division by zero, signed integer overflow, indexing an array outside of its defined bounds (see buffer overflow), or null pointer dereferencing. In general, any instance of undefined behavior leaves the abstract execution machine in an unknown state, and causes the behavior of the entire program to be undefined.

Attempting to modify a string literal causes undefined behavior:[10]

char *p = "wikipedia"; // valid C, deprecated in C++98/C++03, ill-formed as of C++11
p[0] = 'W'; // undefined behavior

Integer division by zero results in undefined behavior:[11]

int x = 1;
return x / 0; // undefined behavior

Certain pointer operations may result in undefined behavior:[12]

int arr[4] = {0, 1, 2, 3};
int *p = arr + 5;  // undefined behavior for indexing out of bounds
p = 0;
int a = *p;        // undefined behavior for dereferencing a null pointer

In C and C++, the comparison of pointers to objects is only strictly defined if the pointers point to members of the same object, or elements of the same array.[13] Example:

int main(void)
{
  int a = 0;
  int b = 0;
  return &a < &b; /* undefined behavior */
}

Reaching the end of a value-returning function (other than main()) without a return statement results in undefined behavior if the value of the function call is used by the caller:[14]

int f()
{
}  /* undefined behavior if the value of the function call is used*/

Modifying an object between two sequence points more than once produces undefined behavior.[15] It is worth mentioning that there are considerable changes in what causes undefined behavior in relation to sequence points as of C++11.[16] The following example will however cause undefined behavior in both C++ and C.

i = i++ + 1; // undefined behavior

When modifying an object between two sequence points, reading the value of the object for any other purpose than determining the value to be stored is also undefined behavior.[17]

a[i] = i++; // undefined behavior
printf("%d %d\n", ++n, power(2, n)); // also undefined behavior

See also

References

  1. Lattner, Chris (May 13, 2011). "What Every C Programmer Should Know About Undefined Behavior". LLVM Project Blog. LLVM.org. Retrieved May 24, 2011.
  2. "nasal demons". The Jargon File. Retrieved 12 June 2014.
  3. https://gist.github.com/rygorous/e0f055bfb74e3d5f0af20690759de5a7#file-gistfile1-txt-L166
  4. ISO/IEC 9899:2011 §J.2.
  5. John Regehr. "Undefined behavior in 2017, cppcon 2017".
  6. "Vulnerability Note VU#162289 — gcc silently discards some wraparound checks". Vulnerability Notes Database. CERT. 4 April 2008. Archived from the original on 9 April 2014.
  7. Jonathan Corbet (16 April 2008). "GCC and pointer overflows". Linux Weekly News.
  8. "Vulnerability Note VU#162289 — C compilers may silently discard some wraparound checks". Vulnerability Notes Database. CERT. 8 October 2008 [4 April 2008].
  9. Pascal Cuoq and John Regehr (4 July 2017). "Undefined Behavior in 2017, Embedded in Academia Blog".
  10. ISO/IEC (2003). ISO/IEC 14882:2003(E): Programming Languages - C++ §2.13.4 String literals [lex.string] para. 2
  11. ISO/IEC (2003). ISO/IEC 14882:2003(E): Programming Languages - C++ §5.6 Multiplicative operators [expr.mul] para. 4
  12. ISO/IEC (2003). ISO/IEC 14882:2003(E): Programming Languages - C++ §5.7 Additive operators [expr.add] para. 5
  13. ISO/IEC (2003). ISO/IEC 14882:2003(E): Programming Languages - C++ §5.9 Relational operators [expr.rel] para. 2
  14. ISO/IEC (2007). ISO/IEC 9899:2007(E): Programming Languages - C §6.9 External definitions para. 1
  15. ANSI X3.159-1989 Programming Language C, footnote 26
  16. "Order of evaluation - cppreference.com". en.cppreference.com. Retrieved 2016-08-09.
  17. ISO/IEC (1999). ISO/IEC 9899:1999(E): Programming Languages - C §6.5 Expressions para. 2

Further reading

  • Peter van der Linden, Expert C Programming. ISBN 0-13-177429-8
  • UB Canaries (April 2015), John Regehr (University of Utah, USA)
  • Undefined Behavior in 2017 (July 2017) Pascal Cuoq (TrustInSoft, France) and John Regehr (University of Utah, USA)
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.