RE/flex

Developer(s)	Robert van Engelen
Stable release	1.6.4 / March 22, 2020
Written in	C++
Operating system	Cross-platform
Type	Lexical analysis
License	BSD license
Website	https://github.com/Genivia/RE-flex

RE/flex (regex-centric, fast lexical analyzer)[1][2] is a free and open-source computer program written in C++ that generates fast lexical analyzers (also known as "scanners" or "lexers")[3][4] in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy (non-greedy) quantifiers, and performance tuning options. It accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. It also includes a fast C++ regular expression library.

History

The RE/flex project was designed and implemented by professor Robert van Engelen in 2016 and released in 2017 as free, open-source software. The software has since evolved with contributions made by others. The RE/flex tool generates lexical analyzers based on regular expression ("regex") libraries, instead of the fixed DFA tables generated by traditional lexical analyzer generators.[5]

Lexer specification

The RE/flex lexical analyzer generator accepts a lexer specification as input. The RE/flex specification syntax is more expressive than traditional Flex lexer specification syntax: it may include indentation anchors, word boundaries, lazy (non-greedy) quantifiers, and new actions such as wstr() to retrieve Unicode wide-string matches.

A lexer specification is of the form:

Definitions
%%
Rules
%%
User Code

The Definitions section includes declarations and customization options, followed by name-pattern pairs that define names for regular expression patterns. A named pattern can be referenced in other patterns by enclosing its name in { and }. The following example defines two named patterns, where the second pattern, number, uses the previously named pattern digit:

%top{
  #include <cstdlib>  // strtol()
  #include <iostream> // std::cout, std::cerr used in main()
%}

%class{
  public:
    int value; // yyFlexLexer class public member
%}

%init{
  value = 0; // yyFlexLexer initializations
%}

%option flex

digit     [0-9]
number    {digit}+
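
How a reference such as {digit} is resolved can be pictured as plain text substitution. The following C++ function is a minimal sketch under stated assumptions, not the RE/flex implementation: the name expand is hypothetical, each expansion is wrapped in a non-capturing group as an assumption to keep operator precedence intact, definitions are assumed to be non-cyclic, and escaped braces, quoted strings and bracket lists are not handled.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Replace every {name} in `pattern` by the (recursively expanded) definition of
// `name`, wrapped in a non-capturing group. Text in braces that is not a defined
// name, such as the repeat {2,3}, is copied unchanged.
// Hypothetical helper for illustration only; assumes definitions are not cyclic.
std::string expand(const std::string& pattern,
                   const std::map<std::string, std::string>& defs)
{
  std::string out;
  std::size_t i = 0;
  while (i < pattern.size())
  {
    if (pattern[i] == '{')
    {
      std::size_t close = pattern.find('}', i);
      if (close != std::string::npos)
      {
        auto it = defs.find(pattern.substr(i + 1, close - i - 1));
        if (it != defs.end())
        {
          out += "(?:" + expand(it->second, defs) + ")";
          i = close + 1;
          continue;
        }
      }
    }
    out += pattern[i++];
  }
  return out;
}
```

With the two definitions above, expanding {number} with this sketch yields (?:(?:[0-9])+), which matches the same text as [0-9]+.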

The Rules section defines pattern-action pairs. The following example defines a rule to translate a number to the lexer class integer value member:

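{number}  { value = strtol(yytext, NULL, 10); return 1; } // return nonzero so the caller sees each number
\s        // skip white space
.         throw static_cast<int>(*yytext); // throw an int so that catch (int) handles it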

The User Code section typically defines C/C++ functions, for example a main program:

int main()
{
  yyFlexLexer lexer;
  try
  {
    while (lexer.yylex() != 0)
      std::cout << "number=" << lexer.value << std::endl;
  }
  catch (int ch)
  {
    std::cerr << "Error: unknown character code " << ch << std::endl;
  }
}

The yyFlexLexer class is generated by RE/flex as instructed by the %option flex directive in the lexer specification. The generated lex.yy.cpp source code contains the algorithm for lexical analysis, which is linked with the libreflex library.

Source code output

The generated algorithm for lexical analysis is based on the concept that any regular expression engine can in principle be used to tokenize input: given a set of $n$ regular expression patterns $p_{i}$ for $i=1,\ldots,n$, a regular expression of the form "( $p_{1}$ )|( $p_{2}$ )|...|( $p_{n}$ )" with $n$ alternatives can be used to match and tokenize the input. The group capture index $i$ returned by the regular expression matcher then identifies the pattern $p_{i}$ that matched the next part of the input. Each match is partial (it need not cover the whole input) and continuous (it starts exactly where the previous match ended).

This approach makes it possible for any regex library that supports group captures to be utilized as a matcher. However, note that all groupings of the form (X) in patterns must be converted to non-capturing groups of the form (?:X) to avoid any unwanted group capturing within sub-expressions.
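
The idea can be tried with the C++ standard library's std::regex, which supports group captures and non-capturing groups. The following is a minimal sketch, not RE/flex code: the function name tokenize and the three rules are illustrative assumptions. Note also that std::regex with its default ECMAScript grammar takes the first alternative that matches rather than the longest one, so the order of the rules matters here.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Tokenize `input` with one combined regex "(p1)|(p2)|(p3)".
// Returns (rule index, matched text) pairs; rule index 0 means that no rule
// matched and a single character was consumed instead.
// Illustrative rules: 1 = number, 2 = identifier, 3 = white space.
std::vector<std::pair<int, std::string>> tokenize(const std::string& input)
{
  // The optional fraction in rule 1 is a non-capturing group (?:...), so only
  // the three top-level groups receive capture indices 1, 2 and 3.
  static const std::regex combined(
      "([0-9]+(?:\\.[0-9]+)?)|([A-Za-z_][A-Za-z0-9_]*)|([ \\t\\n]+)");
  std::vector<std::pair<int, std::string>> tokens;
  auto pos = input.cbegin();
  while (pos != input.cend())
  {
    std::smatch m;
    // match_continuous: the match must begin exactly at pos
    if (std::regex_search(pos, input.cend(), m, combined,
                          std::regex_constants::match_continuous) &&
        m.length(0) > 0)
    {
      for (std::size_t i = 1; i < m.size(); ++i)
      {
        if (m[i].matched) // the capture index identifies the rule
        {
          tokens.emplace_back(static_cast<int>(i), m.str(0));
          break;
        }
      }
      pos += m.length(0);
    }
    else
    {
      tokens.emplace_back(0, std::string(1, *pos)); // no rule matched
      ++pos;
    }
  }
  return tokens;
}
```

For the input x1 4.2? this sketch reports rule 2 for x1, rule 3 for the space, rule 1 for 4.2, and rule index 0 for the unmatched ?.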

The following RE/flex-generated yyFlexLexer class yylex method repeatedly invokes the matcher's scan (continuous partial matching) operation to tokenize input:

int yyFlexLexer::yylex()
{
  if (!has_matcher())
    matcher("(p1)|(p2)|...|(pn)"); // new matcher engine for regex pattern (p1)|(p2)|...|(pn)
  while (true)
  {
    switch (matcher().scan()) // scan and match next token, get capture index
    {
      case 0: // no match
        if (... EOF reached ...)
          return 0;
        output(matcher().input()); // echo the current input character
        break;
      case 1: // pattern p1 matched
        ... // Action for pattern p1
        break;
      case 2: // pattern p2 matched
        ... // Action for pattern p2
        break;
      ... // and so on for patterns up to pn
    }
  }
}

If none of the $n$ patterns matches and the end of file (EOF) has not been reached, the so-called "default rule" is invoked. The default rule echoes the current input character and advances the scanner to the next character in the input.

The regular expression pattern "( $p_{1}$ )|( $p_{2}$ )|...|( $p_{n}$ )" is produced by RE/flex from a lexer specification with $n$ rules of pattern-action pairs:

%%
p1    Action for pattern p1
p2    Action for pattern p2
...
pn    Action for pattern pn
%%

From this specification, RE/flex generates the aforementioned yyFlexLexer class with the yylex() method that executes actions corresponding to the patterns matched in the input. The generated yyFlexLexer class is used in a C++ application, such as a parser, to tokenize the input into the integer-valued tokens returned by the actions in the lexer specification. For example:

std::ifstream ifs(filename, std::ios::in);
yyFlexLexer lexer(ifs);
int token;
while ((token = lexer.yylex()) != 0)
  std::cout << "token = " << token << std::endl;
ifs.close();

Note that yylex() returns to its caller only when an action executes a return statement with a token value. After an action without a return statement, yylex() continues scanning the input, which is commonly used by rules that ignore input such as comments.

This example tokenizes a file. A lexical analyzer often serves as a tokenizer for a parser generated by a parser generator such as Bison.

Compatibility

RE/flex is compatible with Flex specifications when %option flex is used. This generates a yyFlexLexer class with a yylex() method. RE/flex is also compatible with Bison, using a range of RE/flex options for complete coverage of Bison options and features.

In contrast to Flex, RE/flex scanners are thread-safe by default and work with reentrant Bison parsers.

Unicode support

RE/flex supports Unicode regular expression patterns in lexer specifications and automatically tokenizes UTF-8, UTF-16, and UTF-32 input files. Code pages can be specified to tokenize input files encoded in ISO/IEC 8859-1 to 8859-16, Windows-1250 to Windows-1258, CP-437, CP-850, CP-858, MacRoman, KOI-8, EBCDIC, and so on. Input is automatically normalized to UTF-8 by internal incremental buffering, so that (partial) pattern matching with Unicode regular expression patterns always operates on UTF-8.
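
Normalizing to UTF-8 means that every decoded code point, whatever the input encoding was, is re-encoded as one to four bytes. The following C++ function is a minimal sketch of that encoding step only. The name to_utf8 is hypothetical and not part of the RE/flex API, and replacing invalid code points with U+FFFD is an assumption made for this example.

```cpp
#include <string>

// Encode one Unicode code point as a UTF-8 byte sequence.
// Hypothetical helper for illustration; not part of the RE/flex API.
std::string to_utf8(char32_t cp)
{
  // Assumption: surrogates and out-of-range values become U+FFFD.
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    cp = 0xFFFD;
  std::string out;
  if (cp < 0x80) // 1 byte: 0xxxxxxx
  {
    out += static_cast<char>(cp);
  }
  else if (cp < 0x800) // 2 bytes: 110xxxxx 10xxxxxx
  {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000) // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

For example, U+00E9 (é), which is the single byte E9 in ISO/IEC 8859-1, becomes the two bytes C3 A9.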

Indent, nodent, and dedent matching

RE/flex integrates indent and dedent matching directly in the regular expression syntax with new \i and \j anchors. These indentation anchors detect changes of line indentation in the input. This allows many practical scenarios to be covered to tokenize programming languages with indented block structures. For example, the following lexer specification detects and reports indentation changes:

%%
^[ \t]+        std::cout << "| "; // nodent: text is aligned to current indent margin
^[ \t]*\i      std::cout << "> "; // indent: matched with \i
^[ \t]*\j      std::cout << "< "; // dedent: matched with \j
\j             std::cout << "< "; // dedent: for each extra level dedented
%%
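
What the anchors detect can be pictured with a stack of indent margins. The following C++ function is a minimal sketch of that bookkeeping and makes no claim about how RE/flex implements \i and \j. The name indent_events is hypothetical, a tab is assumed to count as one column, and blank lines are assumed to be ignored.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Classify each non-blank line by its indentation relative to the previous one:
// '>' for an indent, '<' for each level dedented, '|' for unchanged indentation.
// Hypothetical helper for illustration; a tab counts as one column here.
std::string indent_events(const std::vector<std::string>& lines)
{
  std::vector<std::size_t> stops; // stack of active indent margins
  std::string events;
  for (const std::string& line : lines)
  {
    std::size_t col = line.find_first_not_of(" \t");
    if (col == std::string::npos)
      continue; // blank line: indentation is not meaningful
    std::size_t margin = stops.empty() ? 0 : stops.back();
    if (col > margin)
    {
      stops.push_back(col); // deeper than the current margin: one indent
      events += '>';
    }
    else if (col < margin)
    {
      while (!stops.empty() && stops.back() > col)
      {
        stops.pop_back(); // one dedent per margin that is left behind
        events += '<';
      }
    }
    else
    {
      events += '|'; // aligned to the current margin
    }
  }
  return events;
}
```

A line that returns from two nested blocks at once produces two '<' events, which corresponds to the extra \j rule in the specification above.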

Lazy quantifiers

Lazy quantifiers can be associated with repeats in RE/flex regular expression patterns to simplify the expressions using non-greedy repeats, when applicable. Normally matching is "greedy", meaning that the longest possible match is taken. For example, the pattern a.*b with the greedy * repeat matches aab, and it matches all of abab rather than only the leading ab, because .* matches any characters except newline and abab is longer than ab. With the lazy quantifier ? applied to the repeat, as in a.*?b, the pattern matches ab but not abab, because the match ends at the first b.

As a practical application of lazy quantifiers, consider matching C/C++ multiline comments of the form /*...*/. The lexer specification pattern "/*"(.|\n)*?"*/" with lazy repeat *? matches multiline comments. Without lazy repeats, the pattern "/*"([^*]|(\*+[^*/]))*\*+"/" should be used. Note that quotation of the form "..." is allowed in lexer specifications only; this construct is comparable to the \Q...\E quotations supported by most regex libraries.
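
The difference between greedy and lazy repeats can be checked with std::regex from the C++ standard library, whose default ECMAScript grammar also supports *?. This is a minimal sketch that does not use RE/flex itself: the helper first_match and the two pattern constants are names introduced for this example, and the comment patterns are written in ordinary regex syntax with \* instead of the "..." quotation of lexer specifications.

```cpp
#include <regex>
#include <string>

// Comment patterns in std::regex (ECMAScript) syntax, lazy and greedy variants.
const std::string lazy_comment = "/\\*(?:.|\\n)*?\\*/";
const std::string greedy_comment = "/\\*(?:.|\\n)*\\*/";

// Return the text of the first match of `pattern` in `text`, or an empty
// string if there is no match. Hypothetical helper for illustration.
std::string first_match(const std::string& pattern, const std::string& text)
{
  std::smatch m;
  if (std::regex_search(text, m, std::regex(pattern)))
    return m.str(0);
  return std::string();
}
```

With this sketch, a.*b applied to abab yields abab while a.*?b yields ab. Likewise, the lazy comment pattern stops at the first */, whereas the greedy variant runs on to the last */ in the text and would swallow any code between two comments.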

Performance

RE/flex offers a choice of regular expression pattern matchers: RE/flex regex and Boost.Regex. The RE/flex regex engine is based on a DFA and usually has a time complexity of $O(n)$ in the length $n$ of the input. Boost.Regex offers a richer regular expression pattern syntax, but is slower due to its NFA-based matching algorithm.

By default RE/flex generates DFA tables to speed up the matcher's scan operation. A faster DFA for pattern matching is generated with the --fast option. This DFA is expressed in direct code instead of in tables. For example, the following coded DFA for pattern \w+ runs very efficiently to match words:

void reflex_code_FSM(reflex::Matcher& m)
{
  int c0 = 0, c1 = c0;
  m.FSM_INIT(c1);
S0:
  c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);
S5:
  m.FSM_TAKE(1);
  c1 = m.FSM_CHAR();
  if (97 <= c1 && c1 <= 122) goto S5;
  if (c1 == 95) goto S5;
  if (65 <= c1 && c1 <= 90) goto S5;
  if (48 <= c1 && c1 <= 57) goto S5;
  return m.FSM_HALT(c1);
}
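
For contrast, a table-driven matcher for the same pattern \w+ looks up every transition in an array instead of branching in code. The following is a minimal, self-contained sketch under stated assumptions: it is not the table format that RE/flex generates, the name match_word is hypothetical, and \w is taken to mean the ASCII characters tested in the coded DFA above.

```cpp
#include <array>
#include <cstddef>
#include <string>

// Length of the longest match of \w+ (ASCII) in `s` starting at `pos`,
// or 0 if there is no match. Hypothetical helper for illustration.
std::size_t match_word(const std::string& s, std::size_t pos = 0)
{
  // Transition table: table[state][byte] is the next state, -1 means halt.
  // State 0 is the start state and state 1 is the accepting state.
  static const auto table = [] {
    std::array<std::array<signed char, 256>, 2> t{};
    for (auto& row : t)
      row.fill(-1);
    for (int state = 0; state < 2; ++state)
      for (int c = 0; c < 256; ++c)
        if ((97 <= c && c <= 122) || c == 95 || (65 <= c && c <= 90) ||
            (48 <= c && c <= 57))
          t[state][c] = 1;
    return t;
  }();
  int state = 0;
  std::size_t end = pos;      // current scan position
  std::size_t accepted = pos; // end of the longest accepted match so far
  while (end < s.size())
  {
    int next = table[state][static_cast<unsigned char>(s[end])];
    if (next < 0)
      break; // no transition: halt
    state = next;
    ++end;
    if (state == 1)
      accepted = end; // remember the accepting position
  }
  return accepted - pos;
}
```

The table version performs one memory lookup per input character, while the direct-coded version replaces the lookups with comparisons and jumps that the compiler can optimize.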

Performance can be tuned using the RE/flex built-in profiler. The profiler instruments the generated source code. When the scanner terminates, the profiler reports the number of times each rule was matched and the cumulative time consumed by each matching rule. Profiling includes the time spent in the parser when the rule returns control to the parser. This allows fine-tuning the performance of the generated scanners and parsers. First, the lexer rules that are hot spots, i.e. computationally expensive, are detected. Then, optimization effort should focus on these expensive rules and on the parser.

References

[1] van Engelen, Robert (April 15, 2017). "Constructing Fast Lexical Analyzers with RE/flex - Why Another Scanner Generator?". CodeProject.
[2] Heng, Christopher (December 27, 2018). "Free Compiler Construction Tools".
[3] Levine, John R.; Mason, Tony; Brown, Doug (1992). lex & yacc (2nd ed.). O'Reilly. pp. 1–2. ISBN 1-56592-000-7.
[4] Levine, John (August 2009). flex & bison. O'Reilly Media. p. 304. ISBN 978-0-596-15597-1.
[5] Aho, Alfred; Sethi, Ravi; Ullman, Jeffrey (1986). Compilers: Principles, Techniques and Tools. Addison-Wesley.

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
