Posted by tintin_2003 on 2025-10-28 14:48:31 |
Lex (short for Lexical Analyzer Generator) is a powerful tool used in compiler design to generate lexical analyzers: programs that identify tokens from an input stream.
It is often used alongside Yacc (Yet Another Compiler Compiler) to build complete compilers or interpreters.
The main purpose of Lex is to simplify the process of writing a lexical analyzer by allowing the developer to specify patterns using regular expressions. These patterns represent the various tokens (like identifiers, numbers, keywords, operators, etc.) of a programming language.
The tool FLEX (Fast Lexical Analyzer Generator) is an improved, open-source implementation of Lex.
When you write a .l (Lex source file) and run it through flex, it generates a C source file (lex.yy.c), which can be compiled using a C compiler to create an executable lexical analyzer.
In compiler design, the lexical analyzer (also called scanner) is the first phase of the compiler.
Its function is to:
Read the source program character by character.
Group characters into meaningful sequences called tokens.
Pass these tokens to the syntax analyzer (parser).
For example, in the expression:
sum = total + 100;
Lex identifies the following tokens:
Identifier: sum
Operator: =
Identifier: total
Operator: +
Number: 100
Punctuation: ;
Each token type (identifier, operator, number, etc.) is defined using patterns in Lex.
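The classification above can be sketched in plain C. This is a hand-rolled illustration of the idea, not what flex actually generates; the function name `classify` is made up for this example.

```c
#include <ctype.h>
#include <string.h>

/* Classify a single token string into the token kinds listed above.
 * Hypothetical helper for illustration only. */
const char *classify(const char *tok) {
    if (isalpha((unsigned char)tok[0]) || tok[0] == '_') {
        for (const char *p = tok + 1; *p; p++)
            if (!isalnum((unsigned char)*p) && *p != '_') return "unknown";
        return "identifier";
    }
    if (isdigit((unsigned char)tok[0])) {
        for (const char *p = tok + 1; *p; p++)
            if (!isdigit((unsigned char)*p)) return "unknown";
        return "number";
    }
    if (strchr("=+-*/", tok[0]) && tok[1] == '\0') return "operator";
    if (tok[0] == ';' && tok[1] == '\0') return "punctuation";
    return "unknown";
}
```

Running this over the tokens of `sum = total + 100;` reproduces the classification shown above; Lex automates exactly this kind of logic from declarative patterns.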
A Lex (FLEX) program consists of three well-defined sections separated by %% symbols.
%{
Declarations (C code, headers, and variables)
%}
%%
Rules Section (patterns and actions)
%%
User Code Section (C functions like main(), display(), etc.)
The definitions section is enclosed between %{ and %}.
Used for C declarations like:
#include <stdio.h>
Function prototypes
Variable declarations
Anything written here is copied directly into the generated C file (lex.yy.c).
Example:
%{
#include <stdio.h>
void display(int);
int flag;
%}
The rules section defines pattern-action pairs.
Each pattern is a regular expression, and each action is a block of C code.
Syntax:
pattern { action }
Example:
[0-9]+ { printf("Digit detected: %s\n", yytext); }
[a-zA-Z]+ { printf("Word detected: %s\n", yytext); }
Here:
[0-9]+ matches one or more digits.
[a-zA-Z]+ matches one or more alphabetic characters.
yytext is a built-in Lex variable that holds the matched token.
The user code section contains custom C functions, such as main() or any helper functions (like display() in the examples below).
Example:
int main() {
printf("Enter text:\n");
yylex(); // Calls the lexer
return 0;
}
Write a Lex program in a file named program.l.
Run the Lex/Flex tool:
flex program.l
This generates lex.yy.c.
Compile the generated C file:
gcc lex.yy.c -o program
(On some systems you may also need to link the flex support library with -lfl, or add %option noyywrap to the .l file.)
Execute the program:
./program
When executed, the program reads input from standard input (stdin) and applies rules to match patterns and execute actions.
| Symbol | Meaning |
|---|---|
| yytext | Holds the current matched text (token) |
| yyleng | Length of yytext |
| yylex() | The main function generated by Lex to scan input |
| ECHO | Prints the matched token to standard output |
| yymore() | Appends the next matched token to yytext |
| yyless(n) | Pushes back all but the first n characters of the current token |
| unput(c) | Pushes a character c back into the input stream |
| yywrap() | Function called when end of input is reached |
%{
#include <stdio.h>
void display(int);
%}
%%
[aeiouAEIOU][a-zA-Z]+ { display(1); return 0; }
.+ { display(0); return 0; }
%%
void display(int flag) {
if(flag)
printf("\nThe given word starts with a vowel\n");
else
printf("\nThe given word does NOT start with a vowel\n");
}
int main() {
printf("Enter a word:\n");
yylex();
return 0;
}
Pattern 1: [aeiouAEIOU][a-zA-Z]+
Matches words starting with a vowel.
Calls display(1).
Pattern 2: .+
Matches all other inputs.
Calls display(0).
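The test performed by Pattern 1 can be mirrored in plain C to see exactly what the regular expression accepts. This is an illustrative sketch; the function name `starts_with_vowel` is invented for this example.

```c
#include <ctype.h>
#include <string.h>

/* Mirrors the Lex pattern [aeiouAEIOU][a-zA-Z]+ : the word must start
 * with a vowel and be followed by at least one more letter.
 * Hypothetical helper for illustration only. */
int starts_with_vowel(const char *word) {
    if (!word[0] || !strchr("aeiouAEIOU", word[0]))
        return 0;                                  /* first char not a vowel */
    if (!isalpha((unsigned char)word[1]))
        return 0;                                  /* the + needs one more letter */
    for (const char *p = word + 1; *p; p++)
        if (!isalpha((unsigned char)*p))
            return 0;                              /* non-letter breaks the match */
    return 1;
}
```

Note that because the pattern uses `+` rather than `*`, a single-letter word like "a" falls through to Pattern 2.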
%{
#include <stdio.h>
#include <stdlib.h>
void display(char string[], int flag);
%}
%%
[a-zA-Z]+[\n] { display(yytext, 1); return 0; }
[0-9]+[\n] { display(yytext, 0); return 0; }
.+ { display(yytext, -1); return 0; }
%%
void display(char string[], int flag) {
if(flag == 1)
printf("The string \"%s\" is a word\n", string);
else if(flag == 0)
printf("The string \"%s\" is a digit\n", string);
else
printf("The string \"%s\" is neither a word nor a digit\n", string);
}
int main() {
printf("Enter a string:\n");
yylex();
return 0;
}
yytext is passed to the display() function.
Flags differentiate between words, digits, or invalid strings.
yymore()
%{
#include <stdio.h>
%}
%%
[a-z]+ {
printf("\nIt's a lowercase word: %s", yytext);
printf("\nBeginning of yymore()");
yymore();
printf("\nEnd of yymore()\n");
}
[A-Z]+ {
printf("\nIt's an uppercase word: %s", yytext);
printf("\nBeginning of yymore()");
yymore();
printf("\nEnd of yymore()\n");
}
%%
int main() {
yylex();
return 0;
}
The yymore() function tells Lex to append the next token to yytext instead of overwriting it.
Used when you want to accumulate tokens into a single string.
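The append-versus-overwrite behaviour of yymore() can be sketched in plain C. This is a simulation of the semantics, not flex's actual implementation; the names `set_match` and `yymore_sim` are invented for this example.

```c
#include <string.h>

static char buf[64];   /* stands in for yytext            */
static int  more = 0;  /* stands in for the yymore() flag */

/* Simulate delivering a new match to the scanner's token buffer:
 * normally the buffer is overwritten; after yymore_sim() it is
 * appended to instead. Hypothetical helpers for illustration only. */
void set_match(const char *text) {
    if (more)
        strcat(buf, text);   /* yymore() was requested: append      */
    else
        strcpy(buf, text);   /* default behaviour: overwrite buffer */
    more = 0;                /* the flag applies to one match only  */
}

void yymore_sim(void) { more = 1; }  /* request appending next time */
```

After `set_match("hello")`, calling `yymore_sim()` and then `set_match("WORLD")` leaves the buffer holding "helloWORLD", which is exactly how yytext accumulates across matches in the real scanner.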
yyless(n)
%{
#include <stdio.h>
%}
%%
[a-z]+ { printf("\nThe word is: "); ECHO; }
[a-zA-Z]+ { yyless(2); printf("\nAfter yyless: "); ECHO; }
%%
int main() {
yylex();
return 0;
}
yyless(2) keeps only the first two characters of the current token.
The remaining characters are pushed back into the input stream.
Useful for controlling how much of a token is processed.
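The effect of yyless(n) can be modelled in a few lines of C: keep the first n characters of the current match and hand the rest back for rescanning. This is a simulation of the semantics only; `yyless_sim` is an invented name.

```c
#include <string.h>

/* Simulate yyless(n): truncate the current token to its first n
 * characters and report how many characters were pushed back for
 * rescanning. Hypothetical helper for illustration only. */
int yyless_sim(char *token, int n) {
    int len = (int)strlen(token);
    if (n >= len)
        return 0;              /* nothing to push back */
    int pushed = len - n;      /* these go back to the input stream */
    token[n] = '\0';           /* the token now holds only n chars  */
    return pushed;
}
```

For the match "abcdef", `yyless_sim(token, 2)` leaves the token as "ab" and reports 4 characters pushed back, matching the description above.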
unput(c)
%{
#include <stdio.h>
%}
%%
[a-zA-Z]+ {
printf("Original: %s\n", yytext);
unput('a'); // Push 'a' back to input
printf("After unput(a): ");
ECHO;
}
%%
int main() {
yylex();
return 0;
}
The call unput('a') pushes 'a' back into the input stream.
The next scan will re-read 'a' as if it were part of the input.
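Standard C has a close analogue of this pushback mechanism: ungetc() from stdio, which pushes one character back onto a stream so the next read returns it again. The sketch below uses it to demonstrate the idea; `demo_pushback` is an invented name, and flex's unput() works on its own internal buffer rather than a FILE stream.

```c
#include <stdio.h>

/* Plain-C analogue of unput(c): read a character, push it back with
 * ungetc, and read again -- the second read sees the same character.
 * Hypothetical helper for illustration only. */
int demo_pushback(FILE *in) {
    int c = fgetc(in);   /* read one character            */
    ungetc(c, in);       /* push it back onto the stream  */
    return fgetc(in);    /* read again: same character    */
}
```

Note that ungetc() only guarantees one character of pushback, while flex's unput() can push back several.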
When you run flex program.l, it generates lex.yy.c, which contains:
yylex() β The main scanning function.
A finite automaton built from your regular expressions.
C code blocks you wrote in the user section.
This generated C code uses tables and transition states internally to recognize tokens efficiently.
When compiled, it creates an executable that reads from standard input and performs your defined actions.
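The table-driven recognition inside lex.yy.c can be illustrated with a miniature hand-written version. The sketch below is a toy DFA for the single pattern [0-9]+, far smaller than what flex emits but built on the same principle; `matches_digits` is an invented name.

```c
/* A miniature table-driven DFA recognizing the pattern [0-9]+.
 * Column 0 = non-digit input, column 1 = digit input;
 * -1 means "no transition" (reject). Illustration only. */
static const int next_state[2][2] = {
    /* state 0 (start):     */ { -1, 1 },  /* digit -> state 1 */
    /* state 1 (accepting): */ { -1, 1 },  /* digit -> stay    */
};

int matches_digits(const char *s) {
    if (!*s) return 0;                 /* empty input: [0-9]+ needs one digit */
    int state = 0;
    for (; *s; s++) {
        int col = (*s >= '0' && *s <= '9');
        state = next_state[state][col];
        if (state < 0) return 0;       /* dead state: reject */
    }
    return state == 1;                 /* accept iff we end in an accepting state */
}
```

Flex builds one such table covering all of your patterns at once, which is why the generated scanner can decide on each character in constant time.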
Lex uses regular expressions (regex) to define tokens.
| Expression | Meaning |
|---|---|
| a | Match the character 'a' |
| . | Any character except newline |
| [abc] | Any one of 'a', 'b', or 'c' |
| [^abc] | Any character except 'a', 'b', or 'c' |
| [a-z] | Any lowercase letter |
| r* | Zero or more occurrences of r |
| r+ | One or more occurrences of r |
| r? | Zero or one occurrence of r |
| r1\|r2 | Either r1 or r2 (alternation) |
| (r) | Grouping expressions |
| ^ | Beginning of line |
| $ | End of line |
These patterns are combined with C code actions to define behavior for each token.
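You can experiment with these basic patterns from plain C using the POSIX regcomp/regexec API, which accepts compatible syntax for the constructs in the table (it is a different engine from flex's, and this sketch assumes a POSIX system; `full_match` is an invented name).

```c
#include <regex.h>
#include <stdio.h>

/* Compile a pattern, anchor it with ^(...)$ so the whole string must
 * match (as a Lex token would), and test it against some text.
 * Returns 1 on match, 0 on no match, -1 on a bad pattern.
 * Hypothetical helper for illustration only. */
int full_match(const char *pattern, const char *text) {
    regex_t re;
    char anchored[128];
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;                                   /* pattern failed to compile */
    int ok = (regexec(&re, text, 0, NULL, 0) == 0);  /* 0 means matched */
    regfree(&re);
    return ok;
}
```

For example, `full_match("[0-9]+", "123")` succeeds while `full_match("[a-z]+", "abc1")` fails, just as the corresponding Lex rules would (or would not) fire on those tokens.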
Lex allows several special directives to enhance control:
| Directive | Description |
|---|---|
| %{ ... %} | Encloses C declarations |
| %% | Separates program sections |
| %option | Sets Lex behavior (e.g., %option noyywrap) |
| %{ ... %} in rules | Allows inline C blocks |
| ECHO | Prints the matched text |
| yywrap() | Handles end-of-input |
Lex is used for tokenizing input.
Yacc (Yet Another Compiler Compiler) is used for parsing grammar.
Typically, yylex() from Lex provides tokens to the parser generated by Yacc.
Example:
%{
#include "y.tab.h"
%}
%%
[0-9]+ { yylval = atoi(yytext); return NUMBER; }
[+] { return PLUS; }
[-] { return MINUS; }
[\n] { return 0; }
. { return yytext[0]; }
%%
This Lex program works with a Yacc parser to perform arithmetic expression evaluation.
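The way yyparse() pulls tokens from yylex() can be sketched in plain C with a stub scanner reading from a string instead of stdin. The names `yylex_sim`, `yylval_sim`, and `eval_sum` are invented for this illustration, and the token codes are placeholders for what y.tab.h would define.

```c
#include <ctype.h>
#include <stdlib.h>

/* Placeholder token codes; in a real build these come from y.tab.h. */
enum { END = 0, NUMBER = 258, PLUS = 259 };

static const char *src;   /* input cursor (stands in for stdin)  */
static int yylval_sim;    /* stands in for Yacc's yylval         */

/* Stub standing in for the flex-generated yylex(), scanning a string. */
int yylex_sim(void) {
    while (*src == ' ') src++;
    if (!*src) return END;
    if (isdigit((unsigned char)*src)) {
        char *end;
        yylval_sim = (int)strtol(src, &end, 10);  /* like yylval = atoi(yytext) */
        src = end;
        return NUMBER;
    }
    if (*src == '+') { src++; return PLUS; }
    return (unsigned char)*src++;   /* fall-through rule: return the char itself */
}

/* A minimal "parser" driving the scanner: sums PLUS-separated numbers. */
int eval_sum(const char *expr) {
    src = expr;
    int total = 0, tok;
    while ((tok = yylex_sim()) != END)
        if (tok == NUMBER)
            total += yylval_sim;
    return total;
}
```

A real Yacc grammar would do the same thing declaratively, with yyparse() issuing the calls to yylex() and yylval carrying each NUMBER's value.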
Automatic Token Generation: Saves time and reduces errors.
Readable Syntax: Uses clear pattern-action pairs.
Efficient Scanning: Generates optimized finite automata.
Easy Integration with Yacc: Works seamlessly with parser tools.
Reusable Components: Functions and tokens can be easily reused across programs.
| Error | Cause | Solution |
|---|---|---|
| Unrecognized pattern | Typo or bad regex | Check regex syntax |
| Undefined function | Function missing in C section | Declare in definition section |
| Unterminated string | Missing quotes or braces | Verify string delimiters |
| yywrap undefined | Missing definition | Add int yywrap() { return 1; } |
| Segmentation fault | Uninitialized pointer or missing return | Ensure all rules return properly |
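As the "yywrap undefined" row notes, the generated scanner calls yywrap() when input runs out; if you neither link with -lfl nor use %option noyywrap, you must define it yourself. A minimal definition:

```c
/* Called by the scanner at end of input. Returning 1 means
 * "no more input; stop scanning". Returning 0 would tell the
 * scanner that more input has been arranged (e.g., yyin was
 * switched to another file). */
int yywrap(void) {
    return 1;
}
```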
Syntax Highlighting: Detect keywords, numbers, and strings.
Preprocessing: Filter comments or whitespace from source code.
Data Validation: Check format of identifiers, emails, or numbers.
Compilers: Tokenize high-level language constructs.
Log File Analysis: Extract patterns or keywords from large files.
Lex (and its variant FLEX) is an essential tool for compiler designers and developers who want to automate lexical analysis using regular expressions and pattern matching.
A Lex program follows a clear three-section structure:
%{
Declarations
%}
%%
Patterns and actions
%%
C code / user functions
Its integration with C makes it extremely flexible.
With practice, Lex becomes a gateway to understanding deeper compiler concepts like tokenization, finite automata, and parsing.