Posted by tintin_2003 on 2025-10-28 14:48:31 |
Lex (short for Lexical Analyzer Generator) is a powerful tool used in compiler design to generate lexical analyzers: programs that identify tokens from an input stream.
It is often used alongside Yacc (Yet Another Compiler Compiler) to build complete compilers or interpreters.
The main purpose of Lex is to simplify the process of writing a lexical analyzer by allowing the developer to specify patterns using regular expressions. These patterns represent the various tokens (like identifiers, numbers, keywords, operators, etc.) of a programming language.
The tool FLEX (Fast Lexical Analyzer Generator) is an improved, open-source implementation of Lex.
When you write a .l (Lex source file) and run it through flex, it generates a C source file (lex.yy.c), which can be compiled using a C compiler to create an executable lexical analyzer.
In compiler design, the lexical analyzer (also called scanner) is the first phase of the compiler.
Its function is to:
Read the source program character by character.
Group characters into meaningful sequences called tokens.
Pass these tokens to the syntax analyzer (parser).
For example, in the expression:
sum = total + 100;
Lex identifies the following tokens:
Identifier: sum
Operator: =
Identifier: total
Operator: +
Number: 100
Punctuation: ;
Each token type (identifier, operator, number, etc.) is defined using patterns in Lex.
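The classification above can be sketched in plain C. This is a hand-rolled illustration of the idea, not what flex actually generates; the function name `classify` is made up for this example.

```c
#include <ctype.h>
#include <string.h>

/* Classify a single token string into the token kinds listed above.
 * Hypothetical helper for illustration only. */
const char *classify(const char *tok) {
    if (isalpha((unsigned char)tok[0]) || tok[0] == '_') {
        for (const char *p = tok + 1; *p; p++)
            if (!isalnum((unsigned char)*p) && *p != '_') return "unknown";
        return "identifier";
    }
    if (isdigit((unsigned char)tok[0])) {
        for (const char *p = tok + 1; *p; p++)
            if (!isdigit((unsigned char)*p)) return "unknown";
        return "number";
    }
    if (strchr("=+-*/", tok[0]) && tok[1] == '\0') return "operator";
    if (tok[0] == ';' && tok[1] == '\0') return "punctuation";
    return "unknown";
}
```

Running this over the tokens of `sum = total + 100;` reproduces the classification shown above; Lex automates exactly this kind of logic from declarative patterns.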
A Lex (FLEX) program consists of three well-defined sections separated by %% symbols.
%{
Declarations (C code, headers, and variables)
%}
%%
Rules Section (patterns and actions)
%%
User Code Section (C functions like main(), display(), etc.)
The definitions section is enclosed between %{ and %}.
Used for C declarations like:
#include <stdio.h>
Function prototypes
Variable declarations
Anything written here is copied directly into the generated C file (lex.yy.c).
Example:
%{
#include <stdio.h>
void display(int);
int flag;
%}
The rules section defines pattern-action pairs.
Each pattern is a regular expression, and each action is a block of C code.
Syntax:
pattern { action }
Example:
[0-9]+ { printf("Digit detected: %s\n", yytext); }
[a-zA-Z]+ { printf("Word detected: %s\n", yytext); }
Here:
[0-9]+ matches one or more digits.
[a-zA-Z]+ matches one or more alphabetic characters.
yytext is a built-in Lex variable that holds the matched token.
The user code section contains custom C functions, such as main() or any helper functions (like display() in the examples below).
Example:
int main() {
printf("Enter text:\n");
yylex(); // Calls the lexer
return 0;
}
Write a Lex program in a file named program.l.
Run the Lex/Flex tool:
flex program.l
This generates lex.yy.c.
Compile the generated C file:
gcc lex.yy.c -o program
(On some systems you may also need to link the flex support library with -lfl, or add %option noyywrap to the .l file.)
Execute the program:
./program
When executed, the program reads input from standard input (stdin) and applies rules to match patterns and execute actions.
| Symbol | Meaning |
|---|---|
| yytext | Holds the current matched text (token) |
| yyleng | Length of yytext |
| yylex() | The main function generated by Lex to scan input |
| ECHO | Prints the matched token to standard output |
| yymore() | Appends the next matched token to yytext |
| yyless(n) | Pushes back all but the first n characters of the current token |
| unput(c) | Pushes a character c back into the input stream |
| yywrap() | Function called when end of input is reached |
%{
#include <stdio.h>
void display(int);
%}
%%
[aeiouAEIOU][a-zA-Z]+ { display(1); return 0; }
.+ { display(0); return 0; }
%%
void display(int flag) {
if(flag)
printf("\nThe given word starts with a vowel\n");
else
printf("\nThe given word does NOT start with a vowel\n");
}
int main() {
printf("Enter a word:\n");
yylex();
return 0;
}
Pattern 1: [aeiouAEIOU][a-zA-Z]+
Matches words starting with a vowel.
Calls display(1).
Pattern 2: .+
Matches all other inputs.
Calls display(0).
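The test performed by Pattern 1 can be mirrored in plain C to see exactly what the regular expression accepts. This is an illustrative sketch; the function name `starts_with_vowel` is invented for this example.

```c
#include <ctype.h>
#include <string.h>

/* Mirrors the Lex pattern [aeiouAEIOU][a-zA-Z]+ : the word must start
 * with a vowel and be followed by at least one more letter.
 * Hypothetical helper for illustration only. */
int starts_with_vowel(const char *word) {
    if (!word[0] || !strchr("aeiouAEIOU", word[0]))
        return 0;                                  /* first char not a vowel */
    if (!isalpha((unsigned char)word[1]))
        return 0;                                  /* the + needs one more letter */
    for (const char *p = word + 1; *p; p++)
        if (!isalpha((unsigned char)*p))
            return 0;                              /* non-letter breaks the match */
    return 1;
}
```

Note that because the pattern uses `+` rather than `*`, a single-letter word like "a" falls through to Pattern 2.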
%{
#include <stdio.h>
#include <stdlib.h>
void display(char string[], int flag);
%}
%%
[a-zA-Z]+[\n] { display(yytext, 1); return 0; }
[0-9]+[\n] { display(yytext, 0); return 0; }
.+ { display(yytext, -1); return 0; }
%%
void display(char string[], int flag) {
if(flag == 1)
printf("The string \"%s\" is a word\n", string);
else if(flag == 0)
printf("The string \"%s\" is a digit\n", string);
else
printf("The string \"%s\" is neither a word nor a digit\n", string);
}
int main() {
printf("Enter a string:\n");
yylex();
return 0;
}
yytext is passed to the display() function.
Flags differentiate between words, digits, or invalid strings.
yymore()
%{
#include <stdio.h>
%}
%%
[a-z]+ {
printf("\nIt's a lowercase word: %s", yytext);
printf("\nBeginning of yymore()");
yymore();
printf("\nEnd of yymore()\n");
}
[A-Z]+ {
printf("\nIt's an uppercase word: %s", yytext);
printf("\nBeginning of yymore()");
yymore();
printf("\nEnd of yymore()\n");
}
%%
int main() {
yylex();
return 0;
}
The yymore() function tells Lex to append the next token to yytext instead of overwriting it.
Used when you want to accumulate tokens into a single string.
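The append-versus-overwrite behaviour of yymore() can be sketched in plain C. This is a simulation of the semantics, not flex's actual implementation; the names `set_match` and `yymore_sim` are invented for this example.

```c
#include <string.h>

static char buf[64];   /* stands in for yytext            */
static int  more = 0;  /* stands in for the yymore() flag */

/* Simulate delivering a new match to the scanner's token buffer:
 * normally the buffer is overwritten; after yymore_sim() it is
 * appended to instead. Hypothetical helpers for illustration only. */
void set_match(const char *text) {
    if (more)
        strcat(buf, text);   /* yymore() was requested: append      */
    else
        strcpy(buf, text);   /* default behaviour: overwrite buffer */
    more = 0;                /* the flag applies to one match only  */
}

void yymore_sim(void) { more = 1; }  /* request appending next time */
```

After `set_match("hello")`, calling `yymore_sim()` and then `set_match("WORLD")` leaves the buffer holding "helloWORLD", which is exactly how yytext accumulates across matches in the real scanner.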
yyless(n)
%{
#include <stdio.h>
%}
%%
[a-z]+ { printf("\nThe word is: "); ECHO; }
[a-zA-Z]+ { yyless(2); printf("\nAfter yyless: "); ECHO; }
%%
int main() {
yylex();
return 0;
}
yyless(2) keeps only the first two characters of the current token.
The remaining characters are pushed back into the input stream.
Useful for controlling how much of a token is processed.
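The effect of yyless(n) can be modelled in a few lines of C: keep the first n characters of the current match and hand the rest back for rescanning. This is a simulation of the semantics only; `yyless_sim` is an invented name.

```c
#include <string.h>

/* Simulate yyless(n): truncate the current token to its first n
 * characters and report how many characters were pushed back for
 * rescanning. Hypothetical helper for illustration only. */
int yyless_sim(char *token, int n) {
    int len = (int)strlen(token);
    if (n >= len)
        return 0;              /* nothing to push back */
    int pushed = len - n;      /* these go back to the input stream */
    token[n] = '\0';           /* the token now holds only n chars  */
    return pushed;
}
```

For the match "abcdef", `yyless_sim(token, 2)` leaves the token as "ab" and reports 4 characters pushed back, matching the description above.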
unput(c)
%{
#include <stdio.h>
%}
%%
[a-zA-Z]+ {
printf("Original: %s\n", yytext);
unput('a'); // Push 'a' back to input
printf("After unput(a): ");
ECHO;
}
%%
int main() {
yylex();
return 0;
}
The call unput('a') pushes 'a' back into the input stream.
The next scan will re-read 'a' as if it were part of the input.
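Standard C has a close analogue of this pushback mechanism: ungetc() from stdio, which pushes one character back onto a stream so the next read returns it again. The sketch below uses it to demonstrate the idea; `demo_pushback` is an invented name, and flex's unput() works on its own internal buffer rather than a FILE stream.

```c
#include <stdio.h>

/* Plain-C analogue of unput(c): read a character, push it back with
 * ungetc, and read again -- the second read sees the same character.
 * Hypothetical helper for illustration only. */
int demo_pushback(FILE *in) {
    int c = fgetc(in);   /* read one character            */
    ungetc(c, in);       /* push it back onto the stream  */
    return fgetc(in);    /* read again: same character    */
}
```

Note that ungetc() only guarantees one character of pushback, while flex's unput() can push back several.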
When you run flex program.l, it generates lex.yy.c, which contains:
yylex() β The main scanning function.
A finite automaton built from your regular expressions.
C code blocks you wrote in the user section.
This generated C code uses tables and transition states internally to recognize tokens efficiently.
When compiled, it creates an executable that reads from standard input and performs your defined actions.
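The table-driven recognition inside lex.yy.c can be illustrated with a miniature hand-written version. The sketch below is a toy DFA for the single pattern [0-9]+, far smaller than what flex emits but built on the same principle; `matches_digits` is an invented name.

```c
/* A miniature table-driven DFA recognizing the pattern [0-9]+.
 * Column 0 = non-digit input, column 1 = digit input;
 * -1 means "no transition" (reject). Illustration only. */
static const int next_state[2][2] = {
    /* state 0 (start):     */ { -1, 1 },  /* digit -> state 1 */
    /* state 1 (accepting): */ { -1, 1 },  /* digit -> stay    */
};

int matches_digits(const char *s) {
    if (!*s) return 0;                 /* empty input: [0-9]+ needs one digit */
    int state = 0;
    for (; *s; s++) {
        int col = (*s >= '0' && *s <= '9');
        state = next_state[state][col];
        if (state < 0) return 0;       /* dead state: reject */
    }
    return state == 1;                 /* accept iff we end in an accepting state */
}
```

Flex builds one such table covering all of your patterns at once, which is why the generated scanner can decide on each character in constant time.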
Lex uses regular expressions (regex) to define tokens.
| Expression | Meaning |
|---|---|
| a | Match the character 'a' |
| . | Any character except newline |
| [abc] | Any one of 'a', 'b', or 'c' |
| [^abc] | Any character except 'a', 'b', or 'c' |
| [a-z] | Any lowercase letter |
| r* | Zero or more occurrences of r |
| r+ | One or more occurrences of r |
| r? | Zero or one occurrence of r |
| r1\|r2 | Either r1 or r2 (alternation) |
| (r) | Grouping expressions |
| ^ | Beginning of line |
| $ | End of line |
These patterns are combined with C code actions to define behavior for each token.
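You can experiment with these basic patterns from plain C using the POSIX regcomp/regexec API, which accepts compatible syntax for the constructs in the table (it is a different engine from flex's, and this sketch assumes a POSIX system; `full_match` is an invented name).

```c
#include <regex.h>
#include <stdio.h>

/* Compile a pattern, anchor it with ^(...)$ so the whole string must
 * match (as a Lex token would), and test it against some text.
 * Returns 1 on match, 0 on no match, -1 on a bad pattern.
 * Hypothetical helper for illustration only. */
int full_match(const char *pattern, const char *text) {
    regex_t re;
    char anchored[128];
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                                   /* pattern failed to compile */
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);  /* 0 means matched */
    regfree(&re);
    return ok;
}
```

For example, `full_match("[0-9]+", "123")` succeeds while `full_match("[a-z]+", "abc1")` fails, just as the corresponding Lex rules would (or would not) fire on those tokens.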
Lex allows several special directives to enhance control:
| Directive | Description |
|---|---|
| %{ ... %} | Encloses C declarations |
| %% | Separates program sections |
| %option | Sets Lex behavior (e.g., %option noyywrap) |
| %{ ... %} in rules | Allows inline C blocks |
| ECHO | Prints the matched text |
| yywrap() | Handles end-of-input |
Lex is used for tokenizing input.
Yacc (Yet Another Compiler Compiler) is used for parsing grammar.
Typically, yylex() from Lex provides tokens to the parser generated by Yacc.
Example:
%{
#include "y.tab.h"
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[+] { return PLUS; }
[-] { return MINUS; }
[\n] { return 0; }
. { return yytext[0]; }
%%
This Lex program works with a Yacc parser to perform arithmetic expression evaluation.
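The way yyparse() pulls tokens from yylex() can be sketched in plain C with a stub scanner reading from a string instead of stdin. The names `yylex_sim`, `yylval_sim`, and `eval_sum` are invented for this illustration, and the token codes are placeholders for what y.tab.h would define.

```c
#include <ctype.h>
#include <stdlib.h>

/* Placeholder token codes; in a real build these come from y.tab.h. */
enum { END = 0, NUMBER = 258, PLUS = 259 };

static const char *src;   /* input cursor (stands in for stdin)  */
static int yylval_sim;    /* stands in for Yacc's yylval         */

/* Stub standing in for the flex-generated yylex(), scanning a string. */
int yylex_sim(void) {
    while (*src == ' ') src++;
    if (!*src) return END;
    if (isdigit((unsigned char)*src)) {
        char *end;
        yylval_sim = (int)strtol(src, &end, 10);  /* like yylval = atoi(yytext) */
        src = end;
        return NUMBER;
    }
    if (*src == '+') { src++; return PLUS; }
    return (unsigned char)*src++;   /* fall-through rule: return the char itself */
}

/* A minimal "parser" driving the scanner: sums PLUS-separated numbers. */
int eval_sum(const char *expr) {
    src = expr;
    int total = 0, tok;
    while ((tok = yylex_sim()) != END)
        if (tok == NUMBER)
            total += yylval_sim;
    return total;
}
```

A real Yacc grammar would do the same thing declaratively, with yyparse() issuing the calls to yylex() and yylval carrying each NUMBER's value.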
Automatic Token Generation: Saves time and reduces errors.
Readable Syntax: Uses clear pattern-action pairs.
Efficient Scanning: Generates optimized finite automata.
Easy Integration with Yacc: Works seamlessly with parser tools.
Reusable Components: Functions and tokens can be easily reused across programs.
| Error | Cause | Solution |
|---|---|---|
| Unrecognized pattern | Typo or bad regex | Check regex syntax |
| Undefined function | Function missing in C section | Declare in definition section |
| Unterminated string | Missing quotes or braces | Verify string delimiters |
| yywrap undefined | Missing definition | Add int yywrap() { return 1; } |
| Segmentation fault | Uninitialized pointer or missing return | Ensure all rules return properly |
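As the "yywrap undefined" row notes, the generated scanner calls yywrap() when input runs out; if you neither link with -lfl nor use %option noyywrap, you must define it yourself. A minimal definition:

```c
/* Called by the scanner at end of input. Returning 1 means
 * "no more input; stop scanning". Returning 0 would tell the
 * scanner that more input has been arranged (e.g., yyin was
 * switched to another file). */
int yywrap(void) {
    return 1;
}
```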
Syntax Highlighting: Detect keywords, numbers, and strings.
Preprocessing: Filter comments or whitespace from source code.
Data Validation: Check format of identifiers, emails, or numbers.
Compilers: Tokenize high-level language constructs.
Log File Analysis: Extract patterns or keywords from large files.
Lex (and its variant FLEX) is an essential tool for compiler designers and developers who want to automate lexical analysis using regular expressions and pattern matching.
A Lex program follows a clear three-section structure:
%{
Declarations
%}
%%
Patterns and actions
%%
C code / user functions
Its integration with C makes it extremely flexible.
With practice, Lex becomes a gateway to understanding deeper compiler concepts like tokenization, finite automata, and parsing.