Parser Generation with Flex and Bison
Data Structures and Operations
// se_lib.h
#ifndef __SE_LIB__
#define __SE_LIB__
#include <cstddef>
typedef enum tag_e {
TAG_SYMBOL,
TAG_LIST
} tag_t;
typedef char const* symbol_t;
typedef struct list_s {
struct expr_s* hd;
struct list_s* tl;
} list_t;
typedef union body_u {
symbol_t symbol;
list_t* list;
} body_t;
typedef struct expr_s
{
tag_t tag;
body_t body;
} expr_t;
symbol_t make_symbol (char const* s, size_t len);
void free_symbol (symbol_t symbol);
expr_t* make_expr_symbol (symbol_t symbol);
expr_t* make_expr_list (list_t* list);
void free_expr (expr_t* expr);
list_t* make_list_nil (void);
list_t* make_list_cons (expr_t* hd, list_t* tl);
void free_list(list_t* list);
#endif
// se_lib.c
#include "se_lib.h"
symbol_t make_symbol (char const* s, size_t len)
{
char* result = malloc (sizeof (char)*(len+1));
strncpy (result, s, len);
result[len] = '\0';
return result;
}
void free_symbol (symbol_t symbol)
{
free (symbol);
}
expr_t* make_expr_symbol (symbol_t symbol)
{
expr_t* result = malloc (sizeof (expr_t));
*result = { TAG_SYMBOL, { .symbol = symbol } };
return result;
}
expr_t* make_expr_list (list_t* list)
{
expr_t* result = malloc (sizeof (expr_t));
*result = { TAG_LIST, { .list = list } };
return result;
}
void free_expr (expr_t* expr)
{
switch (expr->tag)
{
case TAG_SYMBOL :
free_symbol (expr->body.symbol);
break;
case TAG_LIST :
free_list (expr->body.list);
break;
};
free (expr);
}
list_t* make_list_nil (void)
{
return NULL;
}
list_t* make_list_cons (expr_t* hd, list_t* tl)
{
list_t* result = malloc (sizeof (list_t));
*result = { hd, tl };
return result;
}
void free_list(list_t* list)
{
if (NULL != list)
{
free_expr (list->hd);
free_list (list->tl);
free (list);
}
}
Bison Parser
A Bison parser follows a grammar outline which consists of
declarations
grammar rules
epilogue
These three parts are separated by a double-percent %%.
Pure Declaration
To define a pure (reentrant) parser we provide a pure declaration in the declaration part of the Bison parser.
%define api.pure full
Error Reporting
For Bison to perform error reporting we need to provide an error reporting function.
%code requires {
#include <cstdio>
void yyerror (char const* s);
}
void yyerror (char const* s)
{
fprintf (stderr, "%s\n", s);
abort ();
}
Node Types
The goal of a Bison parser is to generate an abstract syntax tree. Every node in that abstract syntax tree has a type. So, in the Bison declaration part of the Bison parser, we need to introduce these types. We do so using a union declaration. The union declaration consists of a list of semicolon ;-separated pairs. Left is the C name for the node type. Right is the Bison name for the node type.
%code requires {
#include "se_lib.h"
}
%union {
expr_t* expr;
list_t* list;
symbol_t symbol;
}
This union declaration introduces three node types: expr, list, and symbol. We use these node types to declare the types for both terminal and non-terminal symbols which should appear in the abstract syntax tree.
Token Declarations
%token LPAREN
%token RPAREN
%token <symbol> ID
Syntax Definition
The core of our parser definition is the syntax in BNF.
e : ID
| LPAREN e_list RPAREN
;
e_list :
| e e_list
;
The syntax definition consists of productions terminated by a semicolon ;. A production has a head and a body. Head and body are separated by a colon :. The head introduces a symbol. The body details how the symbol can be produced. The body itself consists of symbols. We distinguish between terminal and non-terminal symbols. We write terminal symbols in upper-case. Each terminal symbol represents a scanner token. Terminal symbols appear only in production bodies but never as a production head. We write non-terminal symbols in lower-case and terminal symbols in upper-case. Every non-terminal symbol (except the root non-terminal symbol) creates an obligation for the parser. We need a production for every non-terminal symbol except the root symbol.
Token Enumeration
In contrast, every terminal symbols creates an obligation for the scanner. In the parser we need a token declaration for every terminal symbol. Bison allows us to declare tokens via a token declaration.
Herein, we leave all tokens we intend to discard untagged, but tag all tokens which we want to keep with a data type.
%token LPAREN
%token RPAREN
%token <char const*> ID
Semantic Values
Next, we need to create a data structure for semantic values. The parser emits such a data structure whenever it encounters a match for a symbol. Bison allows us to introduce a data structure for semantic values via a union declaration.
%union semantic_value {
char const* id_value;
list const* list_value;
}
Symbol Types
Now that we have a data structure to hold a semantic value, we need to extend the syntax definition with actions.
e : ID { $$ = $1 }
| LPAREN e_list RPAREN { $$ = $1 }
;
e_list :
| e e_list { $$ = $1 }
;
Using Bison
// se.yy
%token LPAREN
%token RPAREN
%token ID
%union semantic_value {
char const* id_value;
list const* list_value;
}
%%
e : LPAREN id_list RPAREN
;
id_list :
| ID id_list
;
%%
bison se.yy
Scanner Definition
Using Flex
// se.ll
%%
%%
flex se.ll