Comprehensive Tutorial on Lex Programming (FLEX Tool)

Compiler Design and Principles: Intro to Lex Programming

Posted by tintin_2003 on 2025-10-28 14:48:31




🔷 Introduction to Lex

Lex (short for Lexical Analyzer Generator) is a powerful tool used in compiler design to generate lexical analyzers: programs that identify tokens from an input stream.
It is often used alongside Yacc (Yet Another Compiler Compiler) to build complete compilers or interpreters.

The main purpose of Lex is to simplify the process of writing a lexical analyzer by allowing the developer to specify patterns using regular expressions. These patterns represent the various tokens (like identifiers, numbers, keywords, operators, etc.) of a programming language.

The tool FLEX (Fast Lexical Analyzer Generator) is an improved, open-source implementation of Lex.
When you write a .l (Lex source file) and run it through flex, it generates a C source file (lex.yy.c), which can be compiled using a C compiler to create an executable lexical analyzer.


🔷 Lex and Its Role in Compiler Design

In compiler design, the lexical analyzer (also called scanner) is the first phase of the compiler.
Its function is to:

  1. Read the source program character by character.

  2. Group characters into meaningful sequences called tokens.

  3. Pass these tokens to the syntax analyzer (parser).

For example, in the expression:

sum = total + 100;

Lex identifies the following tokens:

  • Identifier: sum

  • Operator: =

  • Identifier: total

  • Operator: +

  • Number: 100

  • Punctuation: ;

Each token type (identifier, operator, number, etc.) is defined using patterns in Lex.


🔷 Structure of a Lex Program

A Lex (FLEX) program consists of three well-defined sections separated by %% symbols.

%{
    Declarations (C code, headers, and variables)
%}

%%
    Rules Section (patterns and actions)
%%

    User Code Section (C functions like main(), display(), etc.)

1️⃣ Definition Section

  • Enclosed between %{ and %}.

  • Used for C declarations like:

    • #include <stdio.h>

    • Function prototypes

    • Variable declarations

  • Anything written here is copied directly into the generated C file (lex.yy.c).

Example:

%{
#include <stdio.h>
void display(int);
int flag;
%}

2️⃣ Rules Section

This section defines pattern-action pairs.
Each pattern is a regular expression, and each action is a block of C code.

Syntax:

pattern   { action }

Example:

[0-9]+     { printf("Digit detected: %s\n", yytext); }
[a-zA-Z]+  { printf("Word detected: %s\n", yytext); }

Here:

  • [0-9]+ → matches one or more digits.

  • [a-zA-Z]+ → matches one or more alphabetic characters.

  • yytext → a built-in Lex variable that holds the matched token.

3️⃣ User Code Section

This section contains custom C functions, such as main() or helper functions (like the display() function used in the examples below).

Example:

int main() {
    printf("Enter text:\n");
    yylex();   // Calls the lexer
    return 0;
}

🔷 The Workflow of a Lex Program

  1. Write a Lex program in a file named program.l.

  2. Run the Lex/Flex tool:

    flex program.l
    

    This generates lex.yy.c.

  3. Compile the generated C file:

    gcc lex.yy.c -o program

    (Add -lfl if the program does not define yywrap() or use %option noyywrap; the flex library supplies a default yywrap().)

  4. Execute the program:

    ./program
    

When executed, the program reads input from standard input (stdin) and applies rules to match patterns and execute actions.


🔷 Important Built-in Variables and Functions

Symbol      Meaning
yytext      Holds the current matched text (token)
yyleng      Length of yytext
yylex()     The main scanning function generated by Lex
ECHO        Prints the matched token to standard output
yymore()    Appends the next matched token to yytext
yyless(n)   Pushes back all but the first n characters of the current token
unput(c)    Pushes the character c back into the input stream
yywrap()    Called when the end of input is reached; return 1 to stop scanning

🔷 Example 1: Checking Whether a Word Begins with a Vowel

📘 Program 3.4

%{
#include <stdio.h>
void display(int);
%}

%option noyywrap

%%
[aeiouAEIOU][a-zA-Z]+  { display(1); return 0; }   /* word starts with a vowel */
.+                     { display(0); return 0; }   /* anything else */
%%

void display(int flag) {
    if (flag)
        printf("\nThe given word begins with a vowel\n");
    else
        printf("\nThe given word does NOT begin with a vowel\n");
}

int main() {
    printf("Enter a word:\n");
    yylex();
    return 0;
}

πŸ” Explanation:

  • Pattern 1: [aeiouAEIOU][a-zA-Z]+

    • Matches words starting with a vowel.

    • Calls display(1).

  • Pattern 2: .+

    • Matches all other inputs.

    • Calls display(0).


🔷 Example 2: Identifying Words and Numbers

📘 Program 3.5

%{
#include <stdio.h>
#include <string.h>
void display(char string[], int flag);
%}

%option noyywrap

%%
[a-zA-Z]+[\n]  { display(yytext, 1); return 0; }
[0-9]+[\n]     { display(yytext, 0); return 0; }
.+             { display(yytext, -1); return 0; }
%%

void display(char string[], int flag) {
    string[strcspn(string, "\n")] = '\0';   /* drop the trailing newline matched by the pattern */
    if (flag == 1)
        printf("The string \"%s\" is a word\n", string);
    else if (flag == 0)
        printf("The string \"%s\" is a digit\n", string);
    else
        printf("The string \"%s\" is neither a word nor a digit\n", string);
}

int main() {
    printf("Enter a string:\n");
    yylex();
    return 0;
}

πŸ” Explanation:

  • yytext is passed to the display() function.

  • Flags differentiate between words, digits, or invalid strings.


🔷 Example 3: Using the Special Function yymore()

📘 Program 3.10

%{
#include <stdio.h>
%}

%option noyywrap

%%
[a-z]+  {
    printf("\nIt's a lowercase word: %s", yytext);
    printf("\nBeginning of yymore()");
    yymore();
    printf("\nEnd of yymore()\n");
}

[A-Z]+  {
    printf("\nIt's an uppercase word: %s", yytext);
    printf("\nBeginning of yymore()");
    yymore();
    printf("\nEnd of yymore()\n");
}
%%

int main() {
    yylex();
    return 0;
}

πŸ” Explanation:

  • The yymore() function tells Lex to append the next token to yytext instead of overwriting it.

  • Used when you want to accumulate tokens into a single string.


🔷 Example 4: Using yyless(n)

📘 Program 3.11

%{
#include <stdio.h>
%}

%option noyywrap

%%
[a-z]+    { printf("\nThe word is: "); ECHO; }
[a-zA-Z]+ { yyless(2); printf("\nAfter yyless: "); ECHO; }
%%

int main() {
    yylex();
    return 0;
}

πŸ” Explanation:

  • yyless(2) keeps only the first two characters of the current token.

  • The remaining characters are pushed back into the input stream.

  • Useful for controlling how much of a token is processed.


🔷 Example 5: Using unput(c)

📘 Program 3.12

%{
#include <stdio.h>
#include <string.h>
%}

%option noyywrap

%%
[a-zA-Z]+  {
    static int done = 0;
    char copy[256];
    strncpy(copy, yytext, sizeof copy - 1);   /* unput() may clobber yytext, */
    copy[sizeof copy - 1] = '\0';             /* so save a copy first        */
    printf("Original: %s\n", copy);
    if (!done) {          /* push 'a' back only once; doing it on every  */
        done = 1;         /* match would re-read 'a' forever             */
        unput('a');       /* push 'a' back into the input stream         */
    }
}
%%

int main() {
    yylex();
    return 0;
}

πŸ” Explanation:

  • The function unput(a) pushes 'a' back into the input stream.

  • The next scan will re-read 'a' as if it were part of the input.


🔷 How Lex Translates to C

When you run flex program.l, it generates lex.yy.c, which contains:

  1. yylex(): the main scanning function.

  2. A finite automaton built from your regular expressions.

  3. C code blocks you wrote in the user section.

This generated C code uses tables and transition states internally to recognize tokens efficiently.
When compiled, it creates an executable that reads from standard input and performs your defined actions.


🔷 Regular Expressions in Lex

Lex uses regular expressions (regex) to define tokens.

Expression   Meaning
a            Matches the character 'a'
.            Any character except newline
[abc]        Any one of 'a', 'b', or 'c'
[^abc]       Any character except 'a', 'b', or 'c'
[a-z]        Any lowercase letter
r*           Zero or more occurrences of r
r+           One or more occurrences of r
r?           Zero or one occurrence of r
r1|r2        Either r1 or r2 (alternation)
(r)          Grouping of expressions
^            Beginning of line
$            End of line

These patterns are combined with C code actions to define behavior for each token.


🔷 Special Directives

Lex allows several special directives to enhance control:

Directive            Description
%{ ... %}            Encloses C declarations
%%                   Separates program sections
%option              Sets scanner behavior (e.g., %option noyywrap)
%{ ... %} in rules   Allows an inline block of C code
ECHO                 Prints the matched text
yywrap()             Handles end of input

🔷 Integration of Lex with Yacc

  • Lex is used for tokenizing input.

  • Yacc (Yet Another Compiler Compiler) is used for parsing grammar.

  • Typically, yylex() from Lex provides tokens to the parser generated by Yacc.

Example:

%{
#include "y.tab.h"
%}

%%
[0-9]+   { yylval = atoi(yytext); return NUMBER; }
[+]      { return PLUS; }
[-]      { return MINUS; }
[\n]     { return 0; }
.        { return yytext[0]; }
%%

This Lex program works with a Yacc parser to perform arithmetic expression evaluation.


🔷 Advantages of Using Lex

  1. Automatic Token Generation: Saves time and reduces errors.

  2. Readable Syntax: Uses clear pattern-action pairs.

  3. Efficient Scanning: Generates optimized finite automata.

  4. Easy Integration with Yacc: Works seamlessly with parser tools.

  5. Reusable Components: Functions and tokens can be easily reused across programs.


🔷 Common Errors and Debugging Tips

Error                  Cause                                    Solution
Unrecognized pattern   Typo or bad regex                        Check the regex syntax
Undefined function     Function missing from the C section      Declare it in the definition section
Unterminated string    Missing quotes or braces                 Verify the string delimiters
yywrap undefined       No yywrap() defined                      Add int yywrap() { return 1; } or use %option noyywrap
Segmentation fault     Uninitialized pointer or buffer overrun  Check pointer use and buffer sizes in your actions

🔷 Practical Applications of Lex

  1. Syntax Highlighting: Detect keywords, numbers, and strings.

  2. Preprocessing: Filter comments or whitespace from source code.

  3. Data Validation: Check format of identifiers, emails, or numbers.

  4. Compilers: Tokenize high-level language constructs.

  5. Log File Analysis: Extract patterns or keywords from large files.


🔷 Summary

Lex (and its variant FLEX) is an essential tool for compiler designers and developers who want to automate lexical analysis using regular expressions and pattern matching.

A Lex program follows a clear three-section structure:

%{
   Declarations
%}

%%
   Patterns and actions
%%

   C code / user functions

Its integration with C makes it extremely flexible.
With practice, Lex becomes a gateway to understanding deeper compiler concepts like tokenization, finite automata, and parsing.
