Brzozowski derivative

Brzozowski derivative (on red background) of a dictionary string set with respect to "con"

In theoretical computer science, in particular in formal language theory, the Brzozowski derivative u−1S of a set S of strings and a string u is defined as the set of all strings obtainable from a string in S by cutting off a prefixing u, formally: u−1S = { v ∈ Σ*: uvS }, cf. picture.[1] It is named after the computer scientist Janusz Brzozowski who investigated their properties and gave an algorithm to compute the derivative of a generalized regular expression.

Derivative of a regular expression

Given a finite alphabet A of symbols,[2] a generalized regular expression denotes a possibly infinite set of finite-length strings of symbols from A. It may be built of:

  • ∅ (denoting the empty set of strings),
  • ε (denoting the singleton set containing just the empty string),
  • a symbol a from A (denoting the singleton set containing the single-symbol string a),
  • RS (where R and S are, in turn, generalized regular expressions; denoting their set's union),
  • RS (denoting the intersection of R 's and S 's set),
  • ¬R (denoting the complement of R 's set with respect to the set of all strings of symbols from A),
  • RS (denoting the set of all possible concatenations of strings from R 's and S 's set),
  • R* (denoting the set of n-fold repetitions of strings from R 's set, for any n≥0, including the empty string).

In an ordinary regular expression, neither ∧ nor ¬ is allowed. The string set denoted by a generalized regular expression R is called its language, denoted as L(R).

Computation

For any given generalized regular expression R and any string u, the derivative u−1R is again a generalized regular expression.[3] It may be computed recursively as follows.[4]

(ua)−1R= a−1(u−1R)      for a symbol a and a string u
ε−1R= R

Using the previous two rules, the derivative with respect to an arbitrary string is explained by the derivative with respect to a single-symbol string a. The latter can be computed as follows:[5]

a−1a= ε
a−1b= ∅for each symbol ba
a−1ε= ∅
a−1= ∅
a−1(R*)= a−1RR*
a−1(RS)= (a−1R)S ∨ ν(R)a−1S
a−1(RS)= (a−1R) ∧ (a−1S)
a−1(RS)= (a−1R) ∨ (a−1S)
a−1R)= ¬(a−1R)

Here, ν(R) is an auxiliary function yielding a generalized regular expression that evaluates to the empty string ε if R 's language contains ε, and otherwise evaluates to ∅. This function can be computed by the following rules:[6]

ν(a)= ∅for any symbol a
ν(ε)= ε
ν(∅)= ∅
ν(R*)= ε
ν(RS)= ν(R) ∧ ν(S)
ν(RS)= ν(R) ∧ ν(S)
ν(RS)= ν(R) ∨ ν(S)
ν(¬R)= εif ν(R) = ∅
ν(¬R)= ∅if ν(R) = ε

Properties

A string u is a member of the string set denoted by a generalized regular expression R if and only if ε is a member of the string set denoted by the derivative u−1R.[7]

Considering all the derivatives of a fixed generalized regular expression R results in only finitely many different languages. If their number is denoted by dR, all these languages can be obtained as derivatives of R with respect to string of length below dR.[8] Furthermore, there is a complete deterministic finite automaton with dR states which recognises the regular language given by R, as laid out by the Myhill–Nerode theorem.

See also

References

  1. Janusz A. Brzozowski (1964). "Derivatives of Regular Expressions". J ACM. 11: 481–494. doi:10.1145/321239.321249.
  2. Brzozowski (1964), p.481, required A to consist of the 2n combinations of n bits, for some n.
  3. Brzozowski (1964), p.483, Theorem 4.1
  4. Brzozowski (1964), p.483, Theorem 3.2
  5. Brzozowski (1964), p.483, Theorem 3.1
  6. Brzozowski (1964), p.482, Definition 3.2
  7. Brzozowski (1964), p.483, Theorem 4.2
  8. Brzozowski (1964), p.484, Theorem 4.3
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.