A lexical analyzer converts a stream of input characters into a stream of tokens.
The different tokens that our lexical analyzer identifies are as follows:
KEYWORDS: int, char, float, double, if, for, while, else, switch, struct, printf,
scanf, case, break, return, typedef, void
IDENTIFIERS: main, fopen, getch, etc.
NUMBERS: positive and negative integers, positive and negative floating point numbers.
OPERATORS: +, ++, -, --, ||, *, ?, /, >, >=, <, <=, =, ==, &, &&.
BRACKETS: [ ], { }, ( ).
STRINGS: a sequence of characters enclosed within quotes.
COMMENT LINES: single-line and multi-line comments are ignored.
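The token classes listed above can be sketched as a small regular-expression scanner. This is only an illustration under stated assumptions, not the analyzer described in this report: the names `TOKEN_SPEC`, `MASTER`, and `tokenize` are hypothetical, a leading minus sign is attached to a number only when a digit follows it directly, and characters outside the listed classes (such as `;` or `,`) are reported under an assumed `OTHER` class.

```python
import re

# Keyword set copied from the KEYWORDS list; all other names are assumptions.
KEYWORDS = {
    "int", "char", "float", "double", "if", "for", "while", "else", "switch",
    "struct", "printf", "scanf", "case", "break", "return", "typedef", "void",
}

# Order matters: comments come before operators (both may start with '/'),
# numbers come before operators (so '-' can attach to a literal), and
# multi-character operators come before their one-character prefixes.
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),
    ("STRING",     r'"(?:\\.|[^"\\])*"'),
    ("NUMBER",     r"-?\d+(?:\.\d+)?"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR",   r"\+\+|--|\|\||&&|>=|<=|==|[-+*?/<>=&]"),
    ("BRACKET",    r"[\[\]{}()]"),
    ("SKIP",       r"\s+"),
    ("OTHER",      r"."),  # assumed catch-all for characters not in the list
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets '.' span newlines inside multi-line comments
)


def tokenize(source):
    """Return a list of (token_class, lexeme) pairs for the given source text."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind in ("COMMENT", "SKIP"):
            continue  # comments and white space produce no token
        if kind == "IDENTIFIER" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens
```

Attaching the sign to the literal means that `x -1` would yield an identifier followed by the number `-1`; a scanner that must treat this as a subtraction would instead leave the sign to the parser.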
To separate identifiers from keywords we use a symbol table that initially contains only
the predefined keywords. Tokens are read from an input file. When the encountered token is
an identifier or a keyword, the lexical analyzer looks it up in the symbol table. If an
entry already exists, we proceed to the next token. If not, the token is written into the
symbol table along with its token value. All other tokens are written directly to an
output file. The output file lists every token present in the input file along with its
token value.
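A minimal sketch of the lookup-or-insert behaviour described above follows. The class name `SymbolTable` and the method `lookup_or_insert` are hypothetical, and the "token value" is assumed here to be simply the token class name, since the text does not say how token values are encoded.

```python
class SymbolTable:
    """Maps each lexeme to its token value (assumed: the token class name)."""

    def __init__(self, keywords):
        # The table starts out holding only the predefined keywords.
        self.entries = {keyword: "KEYWORD" for keyword in keywords}

    def lookup_or_insert(self, lexeme):
        """Return (token_value, was_inserted) for an identifier or keyword."""
        if lexeme in self.entries:
            # Entry exists: nothing to record, the scanner moves on.
            return self.entries[lexeme], False
        # Unknown lexeme: record it as a new identifier.
        self.entries[lexeme] = "IDENTIFIER"
        return "IDENTIFIER", True
```

With a table built from `["int", "if"]`, looking up `int` reports a keyword without inserting anything, while the first lookup of `count` inserts it as an identifier and later lookups find the existing entry.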
INTRODUCTION
Lexical analysis involves scanning the program to be compiled and recognizing the tokens that make up the source statements. Scanners, or lexical analyzers, are usually designed to recognize keywords, operators, and identifiers, as well as integers, floating point numbers, character strings, and other similar items that are written as part of the source program. The exact set of tokens to be recognized depends, of course, upon the programming language being compiled and the grammar used to describe it.
A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens. The functions that lexical analyzers perform are listed below.
Removal of white space and comments
Many languages allow "white space" to appear between tokens. Comments can likewise be ignored by the parser and translator, so they may also be treated as white space. If white space is eliminated by the lexical analyzer, the parser will never have to consider it.
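The idea of treating comments as white space can be sketched as a helper that advances past everything the parser should never see. The function name `skip_ignorable` is hypothetical, and C-style `//` and `/* */` comments are assumed because the text does not fix a comment syntax. An unterminated comment is assumed to run to the end of the input.

```python
def skip_ignorable(source, pos=0):
    """Return the index of the next character that is not white space or comment."""
    n = len(source)
    while pos < n:
        if source[pos].isspace():
            pos += 1
        elif source.startswith("//", pos):
            # Single-line comment: skip through the end of the line.
            end = source.find("\n", pos)
            pos = n if end == -1 else end + 1
        elif source.startswith("/*", pos):
            # Multi-line comment: skip through the closing delimiter.
            end = source.find("*/", pos + 2)
            pos = n if end == -1 else end + 2
        else:
            break  # a significant character: the next token starts here
    return pos
```

A scanner would call this before reading each token, so the token-recognition code only ever starts on a significant character.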
Constants
An integer constant is a sequence of digits. Integer constants can be allowed either by adding productions to the grammar for expressions or by creating a token for such constants. The job of collecting digits into integers is generally given to a lexical analyzer because numbers can be treated as single units during translation. For each number, the lexical analyzer passes both the token and its attribute (the numeric value) to the parser.
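Collecting digits into a single token with an attribute can be sketched as follows. The function name `scan_number` and the token name `"num"` are assumptions made for illustration. Only unsigned decimal integers are handled.

```python
def scan_number(source, pos):
    """Collect the digits starting at pos.

    Returns ((token, attribute), next_pos), where the attribute is the
    integer value of the digit sequence.
    """
    start = pos
    value = 0
    while pos < len(source) and source[pos] in "0123456789":
        # Accumulate the value one decimal digit at a time.
        value = value * 10 + int(source[pos])
        pos += 1
    if pos == start:
        raise ValueError(f"no digit at offset {start}")
    return ("num", value), pos
```

For the input `31 + 28`, a call at offset 0 yields the pair `("num", 31)` and the next offset 2, so the parser sees one unit rather than the characters `3` and `1`.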
Recognizing identifiers and keywords
Languages use identifiers as names of variables, arrays, and functions. A grammar for a language often treats an identifier as a token. Many languages use fixed character strings such as begin, end, if, and so on, as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers, so the lexical analyzer needs a way to tell a keyword from an ordinary identifier.
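Because keywords satisfy the identifier rule, a common approach is to match the identifier pattern first and then consult a set of reserved words. The sketch below assumes a typical identifier rule (a letter or underscore followed by letters, digits, or underscores) and uses the example keywords from the text. The names `IDENTIFIER`, `RESERVED`, and `classify_word` are hypothetical.

```python
import re

# Assumed identifier rule: letter or underscore, then letters, digits, underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Example keywords taken from the text; a real language would have more.
RESERVED = {"begin", "end", "if"}


def classify_word(lexeme):
    """Return ("keyword", lexeme) or ("id", lexeme) for a word-like lexeme."""
    if IDENTIFIER.fullmatch(lexeme) is None:
        raise ValueError(f"{lexeme!r} does not follow the identifier rule")
    # Keywords match the identifier pattern too, so membership decides.
    return ("keyword" if lexeme in RESERVED else "id", lexeme)
```

Under this scheme `begin` is classified as a keyword while `begin1` is an ordinary identifier, even though both match the same pattern.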