------------------------------- Page    i -------------------------------

                   Lex - A Lexical Analyzer Generator




                                                 M. E. Lesk

                                                 E. Schmidt



                                                 Edited by UTS

                                                 February 16, 1981

------------------------------- Page   ii -------------------------------

                            TABLE OF CONTENTS


Abstract  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .   1

1.    Introduction  . . . . . . . . . . . . . . . . . . . . . . . . .   1

2.    Lex Source  . . . . . . . . . . . . . . . . . . . . . . . . . .   4

3.    Lex Regular Expressions . . . . . . . . . . . . . . . . . . . .   5

4.    Lex Actions . . . . . . . . . . . . . . . . . . . . . . . . . .   9

5.    Ambiguous Source Rules  . . . . . . . . . . . . . . . . . . . .  13

6.    Lex Source Definitions  . . . . . . . . . . . . . . . . . . . .  15

7.    Usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  17

8.    Lex and Yacc  . . . . . . . . . . . . . . . . . . . . . . . . .  17

9.    Examples  . . . . . . . . . . . . . . . . . . . . . . . . . . .  18

10.   Left Context Sensitivity  . . . . . . . . . . . . . . . . . . .  21

11.   Character Set . . . . . . . . . . . . . . . . . . . . . . . . .  23

12.   Summary of Source Format  . . . . . . . . . . . . . . . . . . .  24

13.   Caveats and Bugs  . . . . . . . . . . . . . . . . . . . . . . .  26

14.   Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . .  26

References  . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  26


                                                            Last Page  27

-------------------------------- Page  1 --------------------------------

ABSTRACT

Lex helps write programs whose control  flow is directed by instances  of
regular expressions in the input  stream.  It is well suited for  editor-
script type transformations and for segmenting input in preparation for a
parsing routine.

Lex source is a  table of regular  expressions and corresponding  program
fragments.  The table  is translated  to a program  which reads an  input
stream, copying it to an  output stream and  partitioning the input  into
strings which match the given expressions.  As each such string is recog-
nized the corresponding program fragment is executed.  The recognition of
the expressions  is performed  by a deterministic  finite automaton  gen-
erated by Lex.  The program fragments written by the user are executed in
the order in  which the  corresponding regular expressions  occur in  the
input stream.

The lexical analysis programs written with Lex accept ambiguous  specifi-
cations and choose the  longest match possible  at each input point.   If
necessary, substantial lookahead is performed on the input, but the input
stream will be backed up to the end of the current partition, so that the
user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be
translated automatically  to portable  Fortran.  It is  available on  the
PDP-11 UNIX-,  Honeywell  GCOS, Amdahl  UTS  and IBM  OS  systems.   This
manual, however, will only discuss  generating analyzers in C on the  UTS
system, which  is the  only supported  form  of Lex  under UTS.   Lex  is
designed to simplify interfacing with Yacc, for those with access to this
compiler-compiler system.




1.    INTRODUCTION

Lex is a program generator  designed for lexical processing of  character
input streams.  It accepts  a high-level, problem oriented  specification
for character string matching, and produces  a program in a general  pur-
pose language which recognizes regular expressions.  The regular  expres-
sions are specified by  the user  in the source  specifications given  to
Lex.  The  Lex written  code  recognizes these  expressions in  an  input
stream and partitions the input stream into strings matching the  expres-
sions.  At the  boundaries between strings  program sections provided  by
the user  are  executed.  The  Lex  source file  associates  the  regular
expressions and the program fragments.  As each expression appears in the
_______________
  -UNIX is a trademark of Bell Laboratories.

-------------------------------- Page  2 --------------------------------

input to the program written by  Lex, the corresponding fragment is  exe-
cuted.

The user supplies the additional  code beyond expression matching  needed
to complete his tasks, possibly  including code written by other  genera-
tors.  The program that  recognizes the expressions  is generated in  the
general purpose  programming  language employed  for the  user's  program
fragments.  Thus, a high level  expression language is provided to  write
the string expressions to  be matched while  the user's freedom to  write
actions is unimpaired.  This avoids forcing the user who wishes to use  a
string manipulation language for input analysis to write processing  pro-
grams in the same and often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new
language feature which can  be added to different programming  languages,
called 'host languages.' Just  as general purpose  languages can  produce
code to run on  different computer hardware,  Lex can write code in  dif-
ferent host languages.  The  host language  is used for  the output  code
generated by Lex and  also for the  program fragments added by the  user.
Compatible run-time libraries for the  different host languages are  also
provided.  This makes  Lex adaptable to  different environments and  dif-
ferent users.  Each  application may  be directed to  the combination  of
hardware and  host language  appropriate to  the task,  the user's  back-
ground, and the  properties of  local implementations.   At present,  the
only supported host language is C, although Fortran (in the  form of Rat-
for [2] has been available in the past.  Lex itself exists on UTS,  UNIX,
GCOS, and OS/370; but the code generated by Lex may be taken anywhere the
appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo)
into the host  general-purpose language; the  generated program is  named
yylex.  The yylex program will recognize expressions in a stream  (called
input in this memo) and perform the specified actions for each expression
as it is detected.  See Figure 1.

                                ________
                      Source ->|___Lex_|   -> yylex

                                ________
                      Input -> |__yylex|  -> Output

                           An overview of Lex

                                Figure 1

For a trivial example, consider  a program to delete  from the input  all
blanks or tabs at the ends of lines.

                               %%
                               [ \t]+$   ;

-------------------------------- Page  3 --------------------------------

is all that is required.  The program contains a %% delimiter to mark the
beginning of  the rules,  and  one rule.   This rule  contains a  regular
expression which matches one or more instances of the characters blank or
tab (written \t for visibility, in accordance with the C language conven-
tion) just prior to the end of a line.  The brackets indicate the charac-
ter class made of blank  and tab; the + indicates 'one or more ...';  and
the $ indicates 'end of line,' as in QED.  No action is specified, so the
program generated by  Lex (yylex) will  ignore these characters.   Every-
thing else will be copied.  To change  any remaining string of blanks  or
tabs to a single blank, add another rule:

                         %%
                         [ \t]+$   ;
                         [ \t]+    printf(" ");

The finite automaton generated for this  source will scan for both  rules
at once, observing  at the termination  of the string  of blanks or  tabs
whether or not there is  a newline character,  and executing the  desired
rule action.  The first rule matches all strings of blanks or tabs at the
end of lines,  and the  second rule all  remaining strings  of blanks  or
tabs.

Lex can be used  alone for  simple transformations, or  for analysis  and
statistics gathering on  a lexical level.   Lex can also  be used with  a
parser generator to perform  the lexical analysis  phase; it is  particu-
larly easy to interface  Lex and Yacc  [3].  Lex programs recognize  only
regular expressions; Yacc  writes parsers  that accept a  large class  of
context free grammars, but  require a lower  level analyzer to  recognize
input tokens.  Thus, a combination of Lex and Yacc is often  appropriate.
When used as a preprocessor for a later parser generator,  Lex is used to
partition the input stream, and the parser generator assigns structure to
the resulting pieces.  The flow of control in such a case (which might be
the first half of a compiler, for example)  is shown in Figure 2.   Addi-
tional programs, written  by other  generators or by  hand, can be  added
easily to programs written by Lex.

                      lexical        grammar
                       rules          rules
                         v              v
                    __________     __________
                   |____Lex__|    |___Yacc__|
                    _____v____     _____v____
           Input ->|___yylex_|  ->|__yyparse|  -> Parsed input

                          Lex with Yacc

                             Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lex-
ical analyzer to be named, so that the use of this name by Lex simplifies

-------------------------------- Page  4 --------------------------------

Lex generates a deterministic finite  automaton from the regular  expres-
sions in the source [4].  The automaton is interpreted, rather  than com-
piled, in order to save space.  The result is still a fast analyzer.   In
particular, the time taken by a Lex program to recognize and partition an
input stream is proportional to the length  of the input.  The number  of
Lex rules or the complexity of the rules is not  important in determining
speed, unless rules which include  forward context require a  significant
amount of rescanning.  What does increase with the number and  complexity
of rules is the size of the finite  automaton, and therefore the size  of
the program generated by Lex.

In the program  written by  Lex, the user's  fragments (representing  the
actions to be performed as each regular expression is found) are gathered
as cases  of a  switch.  The  automaton interpreter  directs the  control
flow.  Opportunity is provided for the user to insert either declarations
or additional statements in the routine containing the actions, or to add
subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one
character lookahead.  For example,  if there are  two rules, one  looking
for ab and another for abcdefg, and the input stream is abcdefh, Lex will
recognize ab  and leave  the input  pointer just  before "cd.  . ."  Such
backup is more costly than the processing of simpler languages.




2.    LEX SOURCE

The general format of Lex source is:

                           {definitions}
                           %%
                           {rules}
                           %%
                           {user subroutines}

where the definitions and  the user subroutines  are often omitted.   The
second %% is optional, but the first is required to mark the beginning of
the rules.  The absolute minimum Lex program is thus

                                   %%

(no definitions, no rules) which  translates into a program which  copies
the input to the output unchanged.

In the  outline of  Lex programs  shown above,  the rules  represent  the
user's control  decisions; they  are a  table, in which  the left  column

-------------------------------- Page  5 --------------------------------

contains regular expressions (see  section 3) and  the right column  con-
tains actions, program fragments to be executed when the expressions  are
recognized.  Thus an individual rule might appear

                 integer   printf("found keyword INT");

to look for the string integer in the input stream and print the  message
'found keyword INT' whenever it  appears.  In this example the host  pro-
cedural language is C and the C library function printf is used to  print
the string.  The end of the expression is indicated by the first blank or
tab character.  If the  action is merely  a single C  expression, it  can
just be given on the right side of the line;  if it is compound, or takes
more than a line, it  should be enclosed in braces.   As a slightly  more
useful example, suppose it  is desired to  change a number of words  from
British to American spelling.  Lex rules such as

                    colour      printf("color");
                    mechanise   printf("mechanize");
                    petrol      printf("gas");

would be  a start.   These rules  are not  quite enough,  since the  word
petroleum would  become  gaseum;  a way  of  dealing with  this  will  be
described later.




3.    LEX REGULAR EXPRESSIONS

A regular expression specifies a set of  strings to be matched.  It  con-
tains text characters  (which match the  corresponding characters in  the
strings being compared)  and operator characters  (which specify  repeti-
tions, choices, and other features).  The letters of the alphabet and the
digits are always text characters; thus the regular expression

                               integer

matches the string integer wherever it appears and the expression

                                  a57D

looks for the string a57D.

Operators.  The operator characters are

                 " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be  used.

-------------------------------- Page  6 --------------------------------

The quotation  mark operator  (")  indicates that  whatever is  contained
between a pair of quotes is to be taken as text characters.  Thus

                                 xyz"++"

matches the string xyz++ when it appears.  Note  that a part of a  string
may be quoted.  It is harmless but unnecessary to quote  an ordinary text
character; the expression

                                 "xyz++"

is the same as  the one  above.  Thus by  quoting every  non-alphanumeric
character being used as a text character, the user can  avoid remembering
the list above of current operator characters, and is safe should further
extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preced-
ing it with \ as in

                                 xyz\+\+

which is another,  less readable,  equivalent of  the above  expressions.
Another use of the  quoting mechanism is to  get a blank into an  expres-
sion; normally, as explained above, blanks or tabs end a rule.  Any blank
character not contained within  [] (see below)  must be quoted.   Several
normal C escapes with \ are recognized: \n is newline, \t is tab, and  \b
is backspace.  To enter \ itself, use \\.  Since newline is illegal in an
expression, \n must be used; it is not  required to escape tab and  back-
space.  Every character  but blank,  tab, newline and  the list above  is
always a text character.

Character classes.   Classes of  characters can  be specified  using  the
operator pair  [].  The  construction [abc] matches  a single  character,
which may be a, b, or c.  Within square brackets, most operator  meanings
are ignored.  Only  three characters are  special: these are  \ - and  ^.
The - character indicates ranges.  For example,

                               [a-z0-9<>-]

indicates the character class containing all the lower case letters,  the
digits, the angle brackets, and underline.  Ranges may be given in either
order.  Using - between any pair of  characters which are not both  upper
case letters, both lower case  letters, or both digits is  implementation
dependent and will get a warning message.  (E.g., [0-z] in ASCII is  many
more characters than it is  in EBCDIC).  If it is desired to include  the
character - in a character class, it should be first or last; thus

                                 [-+0-9]

matches all the digits and the two signs.

-------------------------------- Page  7 --------------------------------

In character classes, the ^ operator  must appear as the first  character
after the left bracket; it  indicates that the resulting string is to  be
complemented with respect to the computer character set.  Thus

                                 [^abc]

matches all characters except a, b, or  c, including all special or  con-
trol characters; or

                                [^a-zA-Z]

is any character which  is not a  letter.  The \  character provides  the
usual escapes within character class brackets.

Arbitrary character.  To match almost any character, the operator charac-
ter

                                    .

is the class of all  characters except newline.   Escaping into octal  is
possible although non-portable:

                               [\40-\176]

matches all printable characters in  the ASCII character set, from  octal
40 (blank) to octal 176 (tilde).

Optional expressions.  The operator ?   indicates an optional element  of
an expression.  Thus

                                  ab?c

matches either ac or abc.

Repeated expressions.  Repetitions of classes are indicated by the opera-
tors * and +.

                                   a*

is any number of consecutive a characters, including zero; while

                                   a+

is one or more instances of a.  For example,

                                 [a-z]+

is all strings of lower case letters.  And

                          [A-Za-z][A-Za-z0-9]*

-------------------------------- Page  8 --------------------------------

indicates all alphanumeric strings with  a leading alphabetic  character.
This is  a typical  expression  for recognizing  identifiers in  computer
languages.

Alternation and Grouping.  The operator | indicates alternation:

                                 (ab|cd)

matches either ab or cd.   Note that parentheses  are used for  grouping,
although they are not necessary on the outside level;

                                  ab|cd

would have sufficed.  Parentheses  can be used  for more complex  expres-
sions:

                             (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd,
or abcdef.

Context sensitivity.  Lex will  recognize a small  amount of  surrounding
context.  The two simplest operators for this are ^ and  $.  If the first
character of an expression is ^, the  expression will only be matched  at
the beginning of a line  (after a newline character, or at the  beginning
of the input stream).  This can never conflict with the other meaning  of
^, complementation of character  classes, since that only applies  within
the [] operators.  If the very last  character is $, the expression  will
only be matched at the  end of a line (when immediately followed by  new-
line).  The latter operator is a special  case of the / operator  charac-
ter, which indicates trailing context.  The expression

                                  ab/cd

matches the string ab, but only if followed by cd.  Thus

                                   ab$

is the same as

                                  ab/\n

Left context is handled in Lex by  start conditions as explained in  sec-
tion 10.  If a rule is only to be executed  when the Lex automaton inter-
preter is in start condition x, the rule should be prefixed by

                                   <x>

using the angle bracket operator characters.  If we considered 'being  at
the beginning of a line'  to be start condition ONE, then the ^  operator

-------------------------------- Page  9 --------------------------------

would be equivalent to

                                  <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions.  The operators {} specify either repetitions
(if they  enclose numbers)  or definition  expansion (if  they enclose  a
name).  For example

                                 {digit}

looks for a predefined string named digit and inserts it at that point in
the expression.  The definitions are  given in the first part of the  Lex
input, before the rules.  In contrast,

                                 a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special,  being the separator  for Lex source  seg-
ments.




4.    LEX ACTIONS

When an  expression  written  as  above  is  matched,  Lex  executes  the
corresponding action.  This section describes some features of Lex  which
aid in writing actions.  Note that there is a default action, which  con-
sists of  copying the  input to  the output.   This is  performed on  all
strings not otherwise matched.  Thus  the Lex user  who wishes to  absorb
the entire input,  without producing  any output, must  provide rules  to
match everything.  When Lex is being used  with Yacc, this is the  normal
situation.  One may  consider that  actions are what  is done instead  of
copying the input to the  output; thus, in general,  a rule which  merely
copies can be omitted.   Also, a character  combination which is  omitted
from the rules and which appears as input is likely to be printed on  the
output, thus calling attention to the gap in the rules.

One of the  simplest things  that can  be done  is to  ignore the  input.
Specifying a C null statement, ; as an action causes this result.  A fre-
quent rule is

                               [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be

-------------------------------- Page 10 --------------------------------

ignored.

Another easy way  to avoid  writing actions  is the  action character  |,
which indicates that the action for this rule is the  action for the next
rule.  The previous example could also have been written

                                 " "
                                 "\t"
                                 "\n"

with the same result, although in different style.  The quotes around  \n
and \t are not required.

In more complex actions, the user will often want to know the actual text
that matched some  expression like  [a-z]+.  Lex leaves  this text in  an
external character array named yytext.  Thus, to print the name found,  a
rule like

                     [a-z]+   printf("%s", yytext);

will print the string in yytext.  The C function printf accepts a  format
argument and  data to  be printed;  in this  case, the  format is  'print
string' (% indicating data conversion, and s indicating string type), and
the data are the characters  in yytext.  So this just places the  matched
string on the output.  This action is so common that it may be written as
ECHO:

                             [a-z]+   ECHO;

is the same as the above.  Since the default action is just to print  the
characters found, one  might ask why  give a rule,  like this one,  which
merely specifies the default  action?  Such rules  are often required  to
avoid matching some  other rule  which is not  desired.  For example,  if
there is a rule which matches read  it will normally match the  instances
of read contained in bread or readjust; to avoid this, a rule of the form
[a-z]+ is needed.  This is explained further below.

Sometimes it is more convenient to know the  end of what has been  found;
hence Lex  also  provides a  count  yyleng of  the number  of  characters
matched.  To count both the number of words and the number of  characters
in words in the input, the user might write

                 [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number  of characters in the words  recog-
nized.  The last character in the string matched can be accessed by

                            yytext[yyleng-1]

-------------------------------- Page 11 --------------------------------

Occasionally, a Lex action may decide that a rule has not recognized  the
correct span of characters.  Two  routines are provided to aid with  this
situation.  First, yymore() can be called to indicate that the next input
expression recognized is to be tacked on to the end  of this input.  Nor-
mally, the next input string would overwrite the current entry in yytext.
Second, yyless (n) may be called to indicate that not  all the characters
matched by the currently successful expression are wanted right now.  The
argument n indicates the number  of characters in yytext to be  retained.
Further characters previously matched  are returned to  the input.   This
provides the same sort of  lookahead offered by the / operator, but in  a
different form.

Example: Consider a language which defines a  string as a set of  charac-
ters between quotation (") marks,  and provides that to include a " in  a
string it must be preceded by a \.  The regular expression which  matches
that is somewhat confusing, so that it might be preferable to write

                \"[^"]*   {
                          if (yytext[yyleng-1] == '\\')
                               yymore();
                          else
                               ... normal user processing
                          }

which will, when faced with a string  such as "abc\"def" first match  the
five characters "abc\; then the call to yymore() will cause the next part
of the string, "def, to be tacked on the end.  Note that the final  quote
terminating the string should  be picked up  in the code labeled  'normal
processing'.

The function yyless() might  be used  to reprocess text  in various  cir-
cumstances.  Consider the  C problem of  distinguishing the ambiguity  of
'=-a'.  Suppose it is desired  to treat this as '=-  a' but print a  mes-
sage.  A rule might be

            =-[a-zA-Z]   {
                         printf("Operator (=-) ambiguous\n");
                         yyless(yyleng-1);
                         ... action for =- ...
                         }

which prints a  message, returns  the letter  after the  operator to  the
input stream, and treats the operator as '=-'.  Alternatively it might be
desired to treat this as '=  -a'.  To do this, just return the minus sign
as well as the letter to the input:

            =-[a-zA-Z]   {
                         printf("Operator (=-) ambiguous\n");
                         yyless(yyleng-2);
                         ... action for = ...

-------------------------------- Page 12 --------------------------------

                         }

will perform the other interpretation.  Note that the expressions for the
two cases might more easily be written

                              =-   [A-Za-z]

in the first case and
                                    _
                                    _
in the second; no backup would be required in the rule action.  It is not
necessary to  recognize the  whole identifier to  observe the  ambiguity.
The possibility of '=-3', however, makes

                                   =-

a still better rule.

In addition to these routines,  Lex also permits access  to the I/O  rou-
tines it uses.  They are:

1)   input() which returns the next input character;

2)   output(c) which writes the character c on the output; and

3)   unput(c) pushes the character  c back  onto the input  stream to  be
     read later by input().

By default these routines are provided as macro definitions, but the user
can override them and supply private versions.  These routines define the
relationship between external files and internal characters, and must all
be retained or modified  consistently.  They may  be redefined, to  cause
input or output to be  transmitted to or  from strange places,  including
other programs or  internal memory;  but the character  set used must  be
consistent in all routines; a value of  zero returned by input must  mean
end of  file;  and  the relationship  between  unput and  input  must  be
retained or the Lex lookahead will not work.  Lex does not look ahead  at
all if it does not have to, but every rule  ending in + * ?  or $ or con-
taining / implies  lookahead.  Lookahead  is also necessary  to match  an
expression that is a prefix of another expression.  See below  for a dis-
cussion of  the character  set used  by Lex.   The standard  Lex  library
imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine
is yywrap()  which is  called whenever  Lex reaches  an end-of-file.   If
yywrap returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more input  to arrive
from a new source.  In this case, the user should provide a yywrap  which
arranges for new  input and  returns 0.  This  instructs Lex to  continue
processing.  The default yywrap always returns 1.

-------------------------------- Page 13 --------------------------------

This routine is also a convenient place to print tables, summaries,  etc.
at the end of a program.  Note that it is  not possible to write a normal
rule which recognizes end-of-file; the  only access to this condition  is
through yywrap.  In fact, unless a private version of input() is supplied
a file containing nulls cannot be handled, since a value of 0 returned by
input is taken to be end-of-file.




5.    AMBIGUOUS SOURCE RULES

Lex can handle ambiguous specifications.   When more than one  expression
can match the current input, Lex chooses as follows:

1)   The longest match is preferred.

2)   Among rules which matched  the same number  of characters, the  rule
     given first is preferred.

Thus, suppose the rules

                    integer   keyword action ...;
                    [a-z]+    identifier action ...;

to be given in that order.  If the input is  integers, it is taken as  an
identifier, because  [a-z]+ matches  8 characters  while integer  matches
only 7.  If the input is integer, both rules match 7 characters, and  the
keyword rule is selected  because it was  given first.  Anything  shorter
(e.g. int) will not match  the expression integer  and so the  identifier
interpretation is used.

The principle  of preferring  the longest  match makes  rules  containing
expressions like .* dangerous.  For example,

                                  '.*'

might seem a good way of recognizing a  string in single quotes.  But  it
is an invitation for the program to read far ahead, looking for a distant
single quote.  Presented with the input

              'first' quoted string here, 'second' here

the above expression will match

                 'first' quoted string here, 'second'

which is probably not what was wanted.  A better rule is of the form

-------------------------------- Page 14 --------------------------------

                                '[^'\n]*'

which, on the above input, will stop after 'first'.  The consequences  of
errors like this are mitigated by the fact that the  .  operator will not
match newline.  Thus expressions like .* stop on the current line.  Don't
try to defeat this with  expressions like [.\n]+ or equivalents; the  Lex
generated program will try to read the entire input file, causing  inter-
nal buffer overflows.

Note that Lex is  normally partitioning the  input stream, not  searching
for all possible matches of each expression.  This means that  each char-
acter is accounted for once  and only once.  For  example, suppose it  is
desired to count occurrences of  both she and he in an input text.   Some
Lex rules to do this might be

                               she   s++;
                               he    h++;
                               \n    |
                               .     ;

where the last two rules ignore everything besides he and she.   Remember
that . does not include newline.  Since she includes he Lex will normally
not recognize the  instances of  he included in  she, since  once it  has
passed a she those characters are gone.

Sometimes the user would like to override this choice.  The action REJECT
means 'go do the  next alternative.' It  causes whatever rule was  second
choice after the current rule to be executed.  The position of the  input
pointer is adjusted accordingly.  Suppose the user really wants to  count
the included instances of he:

                          she   {s++; REJECT;}
                          he    {h++; REJECT;}
                          \n    |
                          .     ;

these rules are one way of changing the previous example to do just that.
After counting each expression, it is rejected; whenever appropriate, the
other expression will then be counted.   In this example, of course,  the
user could note  that she includes he  but not vice  versa, and omit  the
REJECT action on he; in other cases, however, it would not be possible  a
priori to tell which input characters were in both classes.

Consider the two rules

                        a[bc]+   { ... ; REJECT;}
                        a[cd]+   { ... ; REJECT;}

If the input  is ab,  only the first  rule matches,  and on  ad only  the
second matches.  The input  string accb matches  the first rule for  four

-------------------------------- Page 15 --------------------------------

characters and then the second  rule for three characters.  In  contrast,
the input accd agrees with  the second rule for four characters and  then
the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to parti-
tion the input  stream but to detect  all examples of  some items in  the
input, and  the instances  of these  items may  overlap or  include  each
other.  Suppose  a digram  table of  the input is  desired; normally  the
digrams overlap, that is the  word the is considered  to contain both  th
and he.  Assuming a two-dimensional array named digram to be incremented,
the appropriate source is

         %%
         [a-z][a-z]   {digram[yytext[0]][yytext[1]]++; REJECT;}
         \n           ;

where the REJECT is necessary to pick up a letter pair beginning at every
character, rather than at every other character.




6.    LEX SOURCE DEFINITIONS

Remember the format of the Lex source:

                             {definitions}
                             %%
                             {rules}
                             %%
                             {user routines}

So far only the  rules have  been described.  The  user needs  additional
options, though, to define variables  for use in his program and for  use
by Lex.  These can go either in the  definitions section or in the  rules
section.

Remember that Lex is turning  the rules into a  program.  Any source  not
intercepted by Lex is copied into the generated program.  There are three
classes of such things.

1)   Any line which is not part of a Lex rule or action which begins with
     a blank  or tab  is  copied into  the Lex  generated program.   Such
     source input prior to the first %% delimiter will be external to any
     function in the code; if it appears immediately after the  first %%,
     it appears in an appropriate place for declarations in the  function
     written by Lex which contains the actions.  This material must  look
     like program fragments, and should precede the first Lex rule.

-------------------------------- Page 16 --------------------------------

     As a side  effect of the above,  lines which begin  with a blank  or
     tab, and which  contain a  comment, are passed  through to the  gen-
     erated program.  This can be used to include comments in either  the
     Lex source or the  generated code.  The  comments should follow  the
     host language convention.

2)   Anything included between lines containing only %{ and %} is  copied
     out as above.   The delimiters are  discarded.  This format  permits
     entering text like preprocessor statements that must begin in column
     1, or copying lines that do not look like programs.

3)   Anything after the third %% delimiter, regardless of formats,  etc.,
     is copied out after the Lex output.

Definitions intended for  Lex are  given before the  first %%  delimiter.
Any line in this section  not contained between %{ and %}, and  beginning
in column 1, is assumed to  define Lex substitution strings.  The  format
of such lines is

                           name translation

and it causes the string given as a translation to be associated with the
name.  The name and translation  must be separated by at least one  blank
or tab, and the name must begin with a letter.  The translation can  then
be called out by the  {name} syntax in a rule.  Using {D} for the  digits
and {E} for an  exponent field,  for example, might  abbreviate rules  to
recognize numbers:

                 D                   [0-9]
                 E                   [DEde][-+]?{D}+
                 %%
                 {D}+                printf("integer");
                 {D}+"."{D}*({E})?   |
                 {D}*"."{D}+({E})?   |
                 {D}+{E}

Note the first two rules for real  numbers; both require a decimal  point
and contain an optional exponent  field, but the first requires at  least
one digit before the decimal point and  the second requires at least  one
digit after the decimal point.  To correctly handle the problem  posed by
a Fortran  expression such  as 35.EQ.I,  which does  not contain  a  real
number, a context-sensitive rule such as

                             [0-9]+   "."EQ

could be used in addition to the normal rule for integers.

The definitions section may  also contain other  commands, including  the
selection of a host language, a character set table, a list of start con-
ditions, or adjustments to the default  size of arrays within Lex  itself

-------------------------------- Page 17 --------------------------------

for larger  source  programs.  These  possibilities are  discussed  below
under 'Summary of Source Format,' section 12.




7.    USAGE

There are two steps in  compiling a Lex source  program.  First, the  Lex
source must be turned into  a generated program in the host general  pur-
pose language.  Then this  program must be  compiled and loaded,  usually
with a library of  Lex subroutines.  The  generated program is on a  file
named lex.yy.c.  The I/O library  is defined in terms  of the C  standard
library.

The C programs generated by Lex are slightly different on OS/370, because
the OS compiler is  less powerful than  the UTS, UNIX or GCOS  compilers,
and does less at  compile time.  C  programs generated on  UTS, GCOS  and
UNIX are the same.

The library is accessed by the loader flag -ll.  So an appropriate set of
commands is

     lex source
     cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later  execu-
tion.  To use Lex with Yacc see below.  Although the default Lex I/O rou-
tines use the C standard library, the  Lex automata themselves do not  do
so; if private versions of input, output and unput are given, the library
can be avoided.




8.    LEX AND YACC

If you want to use Lex with Yacc, note that what Lex writes is a  program
named yylex(), the name required by Yacc for its analyzer.  Normally, the
default main program on the Lex library  calls this routine, but if  Yacc
is loaded, and its main program is used, Yacc will call yylex().  In this
case each Lex rule should end with

                             return(token);

where the appropriate token value is returned.  An easy way to get access

-------------------------------- Page 18 --------------------------------

to Yacc's names for tokens  is to compile the Lex output file as part  of
the Yacc output file by placing the line

                          # include "lex.yy.c"

in the last section  of Yacc input.   Supposing the grammar  to be  named
'good' and  the  lexical rules  to  be named  'better' the  UNIX  command
sequence can just be:

                           yacc good
                           lex better
                           cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain
a main program which invokes the Yacc parser.  The generations of Lex and
Yacc programs can be done in either order.




9.    EXAMPLES

As a trivial problem, consider  copying an input file  while adding 3  to
every positive number divisible by 7.  Here is a suitable Lex source pro-
gram

                    %%
                             int k;
                    [0-9]+   {
                             k = atoi(yytext);
                             if (k%7 == 0)
                                  printf("%d", k+3);
                             else
                                  printf("%d",k);
                             }

to do just that.  The rule [0-9]+ recognizes strings of digits; atoi con-
verts the digits to  binary and stores the  result in k.  The operator  %
(remainder) is used to check whether k is divisible by 7; if it is, it is
incremented by 3 as it is written out.  It may be objected that this pro-
gram will alter such input items as 49.63 or X7.  Furthermore, it  incre-
ments the  absolute value  of all  negative numbers divisible  by 7.   To
avoid this, just add a few more rules after the active one, as here:

        %%
                               int k;
        -?[0-9]+               {
                               k = atoi(yytext);

-------------------------------- Page 19 --------------------------------

                               printf("%d", k%7 == 0 ? k+3 : k);
                               }
        -?[0-9.]+              ECHO;
        [A-Za-z][A-Za-z0-9]+   ECHO;

Numerical strings containing a '.' or preceded by a letter will be picked
up by one of the  last two rules, and not changed.  The if-else has  been
replaced by a  C conditional  expression to  save space;  the form  a?b:c
means 'if a then b else c'.

For an example of statistics  gathering, here is  a program which  histo-
grams the  lengths of  words,  where a  word is  defined as  a string  of
letters.

                         int lengs[100];
                %%
                [a-z]+   lengs[yyleng]++;
                .        |
                \n       ;
                %%
                yywrap()
                {
                int i;
                printf("Length  No. words\n");
                for(i=0; i<100; i++)
                     if (lengs[i] > 0)
                          printf("%5d%10d\n",i,lengs[i]);
                return(1);
                }

This program accumulates the  histogram, while producing  no output.   At
the end of the input it prints the table.  The final statement return(1);
indicates that Lex is to perform wrapup.  If yywrap returns zero  (false)
it implies that further input is available and the program is to continue
reading and  processing.  To  provide a  yywrap that  never returns  true
causes an infinite loop.

As a larger example, here  are some parts of a  program written by N.  L.
Schryer to convert double precision Fortran to single precision  Fortran.
Because Fortran does not distinguish  upper and lower case letters,  this
routine begins by defining a set of classes including both  cases of each
letter:

                               a     [aA]
                               b     [bB]
                               c     [cC]
                               ...
                               z     [zZ]

An additional class recognizes white space:

-------------------------------- Page 20 --------------------------------

                               W   [ \t]*

The first rule changes  'double precision' to  'real', or 'DOUBLE  PRECI-
SION' to 'REAL'.

           {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                printf(yytext[0]=='d'? "real" : "REAL");
                }

Care is taken  throughout this  program to  preserve the  case (upper  or
lower) of  the original  program.  The  conditional operator  is used  to
select the proper form of the keyword.  The next rule copies continuation
card indications to avoid confusing them with constants:

                          ^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks.  It is  inter-
preted as 'beginning of line,  then five blanks, then anything but  blank
or zero.' Note the two different meanings of ^.  There follow some  rules
to change double precision constants to ordinary floating constants.

                [0-9]+{W}{d}{W}[+-]?{W}[0-9]+     |
                [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
                "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {

                     for(p=yytext; *p != 0; p++)
                          {
                          if (*p == 'd' || *p == 'D')
                               *p=+ 'e'- 'd';
                          ECHO;
                          }

After the floating point constant is recognized, it is scanned by the for
loop to find  the letter d or  D.  The program  than adds 'e'-'d',  which
converts it to the next letter  of the alphabet.  The modified  constant,
now single-precision, is  written out  again.  There follow  a series  of
names which must be respelled  to remove their initial  d.  By using  the
array yytext the same action suffices for all the names (only a sample of
a rather long list is given here).

               {d}{s}{i}{n}         |
               {d}{c}{o}{s}         |
               {d}{s}{q}{r}{t}      |
               {d}{a}{t}{a}{n}      |
               ...
               {d}{f}{l}{o}{a}{t}   printf("%s",yytext+1);

Another list of names must have initial d changed to initial a:

                {d}{l}{o}{g}     |

-------------------------------- Page 21 --------------------------------

                {d}{l}{o}{g}10   |
                {d}{m}{i}{n}1    |
                {d}{m}{a}{x}1    {
                                 yytext[0] =+ 'a' - 'd';
                                 ECHO;
                                 }

And one routine must have initial d changed to initial r:

              {d}1{m}{a}{c}{h}   {yytext[0] =+ 'r'  - 'd';



To avoid such names as  dsinx being detected as  instances of dsin,  some
final rules pick up longer  words as identifiers and copy some  surviving
characters:

                      [A-Za-z][A-Za-z0-9]*   |
                      [0-9]+                 |
                      \n                     |
                      .                      ECHO;

Note that this program is not complete; it does not deal with the spacing
problems in Fortran or with the use of keywords as identifiers.




10.   LEFT CONTEXT SENSITIVITY

Sometimes it is desirable  to have several  sets of lexical  rules to  be
applied at different times in the input.  For example, a compiler prepro-
cessor might distinguish  preprocessor statements and  analyze them  dif-
ferently from ordinary  statements.  This  requires sensitivity to  prior
context, and there are  several ways  of handling such  problems.  The  ^
operator, for example, is  a prior context operator, recognizing  immedi-
ately preceding left context just  as $ recognizes immediately  following
right context.  Adjacent  left context  could be extended,  to produce  a
facility similar to that for adjacent  right context, but it is  unlikely
to be as useful, since often the relevant left context appeared some time
earlier, such as at the beginning of a line.

This section describes  three means  of dealing  with different  environ-
ments: a  simple use  of flags,  when only  a few rules  change from  one
environment to another, the  use of  start conditions on  rules, and  the
possibility of making  multiple lexical analyzers  all run together.   In
each case,  there  are rules  which  recognize the  need  to  change  the
environment in which the following  input text is analyzed, and set  some

-------------------------------- Page 22 --------------------------------

parameter to reflect the change.  This may be a flag explicitly tested by
the user's action code; such  a flag is the simplest way of dealing  with
the problem, since  Lex is  not involved  at all.   It may  be more  con-
venient, however, to have Lex remember the flags as initial conditions on
the rules.  Any rule may be associated  with a start condition.  It  will
only be  recognized when  Lex is  in that start  condition.  The  current
start condition may  be changed  at any time.   Finally, if  the sets  of
rules for the different environments are very dissimilar, clarity may  be
best achieved by writing several distinct lexical analyzers, and  switch-
ing from one to another as desired.

Consider the following problem:  copy the input  to the output,  changing
the word magic  to first  on every line  which began  with the letter  a,
changing magic to second on every line which began with the letter b, and
changing magic to third on every line which began with the letter c.  All
other words and all other lines are left unchanged.

These rules are so simple that the easiest way to  do this job is with  a
flag:

                       int flag;
               %%
               ^a      {flag = 'a'; ECHO;}
               ^b      {flag = 'b'; ECHO;}
               ^c      {flag = 'c'; ECHO;}
               \n      {flag =  0 ; ECHO;}
               magic   {
                       switch (flag)
                       {
                       case 'a': printf("first"); break;
                       case 'b': printf("second"); break;
                       case 'c': printf("third"); break;
                       default: ECHO; break;
                       }
                       }

should be adequate.

To handle the same  problem with start  conditions, each start  condition
must be introduced to Lex in the definitions section with a line reading

                        %Start   name1 name2 ...

where the conditions may be  named in any order.  The  word Start may  be
abbreviated to s or S.  The conditions may be referenced at the head of a
rule with the <> brackets:

                            <name1>expression

is a rule which  is only recognized  when Lex is  in the start  condition

-------------------------------- Page 23 --------------------------------

name1.  To enter a start condition, execute the action statement

                              BEGIN name1;

which changes the start condition to name1.  To resume the normal state,

                                BEGIN 0;

resets the initial condition  of the Lex  automaton interpreter.  A  rule
may be active in several start conditions:

                           <name1,name2,name3>

is a legal prefix.  Any rule not beginning with the <> prefix operator is
always active.

The same example as before can be written:

                   %START AA BB CC
                   %%
                   ^a                {ECHO; BEGIN AA;}
                   ^b                {ECHO; BEGIN BB;}
                   ^c                {ECHO; BEGIN CC;}
                   \n                {ECHO; BEGIN 0;}
                   <AA>magic         printf("first");
                   <BB>magic         printf("second");
                   <CC>magic         printf("third");

where the logic is exactly the same as in the previous method of handling
the problem, but Lex does the work rather than the user's code.




11.   CHARACTER SET

The programs generated by Lex handle character I/O only through the  rou-
tines input, output, and  unput.  Thus the character representation  pro-
vided in these routines is accepted by Lex and employed to return  values
in yytext.   For  internal use  a  character is  represented as  a  small
integer which, if the standard library is used, has a value equal to  the
integer value of the bit  pattern representing the character on the  host
computer.  Normally, the letter a is represented as the same form as  the
character constant 'a'.  If this interpretation is changed, by  providing
I/O routines which translate the characters,  Lex must be told about  it,
by giving a  translation table.   This table must  be in the  definitions
section, and must be bracketed by lines containing  only '%T'.  The table
contains lines of the form

-------------------------------- Page 24 --------------------------------

                      {integer} {character string}

which indicate the value associated  with each character.  Thus the  next
example

                                %T
                                 1    Aa
                                 2    Bb
                                ...
                                26    Zz
                                27    \n
                                28    +
                                29   ___
                                30    0
                                31    1
                                ...
                                39    9
                                %T


                         Sample character table.
maps the  lower and  upper  case letters  together into  the  integers  1
through 26, newline into 27, + and - into 28  and 29, and the digits into
30 through 39.  Note  the escape for  newline.  If a  table is  supplied,
every character that  is to appear either  in the rules  or in any  valid
input must be included in  the table.  No character  may be assigned  the
number 0, and no character may be assigned a bigger  number than the size
of the hardware character set.




12.   SUMMARY OF SOURCE FORMAT

The general form of a Lex source file is:

                           {definitions}
                           %%
                           {rules}
                           %%
                           {user subroutines}

The definitions section contains a combination of

1)   Definitions, in the form 'name space translation'.

2)   Included code, in the form 'space code'.

-------------------------------- Page 25 --------------------------------

3)   Included code, in the form

                                     %{
                                     code
                                     %}

4)   Start conditions, given in the form

                              %S name1 name2 ...

5)   Character set tables, in the form

                        %T
                        number space character-string
                        ...
                        %T

6)   Changes to internal array sizes, in the form

                                   %x  nnn

     where nnn is  a decimal  integer representing  an array  size and  x
     selects the parameter as follows:

                      Letter           Parameter
                         p     positions
                         n     states
                         e     tree nodes
                         a     transitions
                         k     packed character classes
                         o     output array size

Lines in the rules section have  the form 'expression  action' where  the
action may be continued  on succeeding lines  by using braces to  delimit
it.

Regular expressions in Lex use the following operators:

      x        the character "x"
      "x"      an "x", even if x is an operator.
      \x       an "x", even if x is an operator.
      [xy]     the character x or y.
      [x-z]    the characters x, y or z.
      [^x]     any character but x.
      .        any character but newline.
      ^x       an x at the beginning of a line.
      <y>x     an x when Lex is in start condition y.
      x$       an x at the end of a line.
      x?       an optional x.
      x*       0,1,2, ... instances of x.

-------------------------------- Page 26 --------------------------------

      x+       1,2,3, ... instances of x.
      x|y      an x or a y.
      (x)      an x.
      x        y
      {xx}     the translation of xx from the definitions section.
      x{m,n}   m through n occurrences of x




13.   CAVEATS AND BUGS

There are pathological  expressions which produce  exponential growth  of
the tables when  converted to  deterministic machines; fortunately,  they
are rare.

REJECT does not rescan the input; instead it remembers the results of the
previous scan.  This means that if a rule with trailing context is found,
and REJECT executed,  the user  must not have  used unput  to change  the
characters forthcoming from the input stream.  This is the only  restric-
tion on the user's ability to manipulate the not-yet-processed input.




14.   ACKNOWLEDGMENTS

As should be obvious from the above, the  outside of Lex is patterned  on
Yacc and the inside on  Aho's string matching routines.  Therefore,  both
S. C. Johnson and  A. V. Aho are  really originators of  much of Lex,  as
well as debuggers of it.  Many thanks are due to both.

The code  of  the current  version  of Lex  was  designed,  written,  and
debugged by Eric Schmidt.




REFERENCES

1.   B. W.  Kernighan and  D.  M. Ritchie,  The C  Programming  Language,
     Prentice-Hall, N. J. (1978).

2.   B. W.  Kernighan, Ratfor:  A Preprocessor  for a  Rational  Fortran,

-------------------------------- Page 27 --------------------------------

     Software - Practice and Experience, 5, pp. 395-496 (1975).

3.   Yacc: Yet Another Compiler Compiler.

4.   A. V. Aho and M. J. Corasick,  Efficient String Matching: An Aid  to
     Bibliographic Search, Comm. ACM 18, 333-340 (1975).

-------------------------------- The End --------------------------------
