Unit 1
Lexical Analysis
Introduction to Compiling
INTRODUCTION TO THE LANGUAGE PROCESSING SYSTEM
Preprocessor:-
· A preprocessor produces output that becomes the input to the compiler. It may perform the following functions:
1) Macro processing: A preprocessor may allow a user to define macros, which are shorthands for longer constructs.
2) File inclusion: A preprocessor may include header files into the program text.
3) Rational preprocessors: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
4) Language extensions: These preprocessors attempt to add capabilities to the language by means of built-in macros.
Compiler:-
· A compiler is a translator program that takes a program written in a high-level language (HLL), the source program, and translates it into an equivalent program in machine-level language (MLL), the target program.
· An important part of a compiler is reporting errors to the programmer.
· Executing a program written in an HLL basically has two parts: the source program must first be compiled (translated) into an object program.
· Then the resulting object program is loaded into memory and executed.
ASSEMBLER
· A mnemonic machine language is now called an assembly language. Programs known as assemblers were written to automate the translation of assembly language into machine language.
· The input to an assembler is called the source program; the output is a machine-language translation (object program).
INTERPRETER
· An interpreter is a program that appears to execute a source program as if it were machine language.
· Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters. JAVA also uses an interpreter. The process of interpretation can be carried out in the following phases:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct execution
Advantages:
· Modifications to the user program can easily be made and implemented as execution proceeds.
· The type of object that a variable denotes may change dynamically.
· Debugging a program and finding errors is a simpler task for an interpreted program.
· The interpreter for the language makes it machine independent.
Disadvantages:
· Execution of the program is slower, and memory consumption is higher.
LOADER AND LINK-EDITOR:
· Once the assembler produces an object program, that program must be placed into memory and executed.
· The assembler could place the object program directly in memory and transfer control to it.
· System programmers instead developed another component called a loader: "A loader is a program that places programs into memory and prepares them for execution."
· If subroutines could be translated into object form, the loader could "relocate" them directly behind the user's program.
· The task of adjusting programs so they may be placed in arbitrary core locations is called relocation. Relocating loaders perform four functions.
TRANSLATOR
· A translator is a program that takes as input a program written in one language and produces as output a program in another language.
· The translator performs another very important role: error detection. Any violation of the HLL specification is detected and reported to the programmer.
· The important roles of a translator are:
1) Translating the HLL program input into an equivalent ML program.
2) Providing diagnostic messages wherever the programmer violates the specification of the HLL.
STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler:
A compiler operates in phases. A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation. The phases of a compiler are shown below.
There are two parts of compilation:
a. Analysis (Machine Independent/Language Dependent)
b. Synthesis (Machine Dependent/Language Independent)
The compilation process is partitioned into a number of sub-processes called 'phases'.
Lexical Analysis:-
· The LA, or scanner, reads the source program one character at a time, carving the source program into a sequence of atomic units called tokens.
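The carving step above can be sketched with a small scanner. This is a minimal illustration, not a real compiler front end; the token categories (NUMBER, ID, OP) and patterns are assumptions chosen for the example.

```python
import re

# Hypothetical token categories for a tiny language; a real scanner would
# also handle keywords, comments, string literals, and error recovery.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Carve the source string into a sequence of (kind, lexeme) tokens."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind != "SKIP":            # whitespace is discarded, not passed on
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("count = count + 1"))
# → [('ID', 'count'), ('OP', '='), ('ID', 'count'), ('OP', '+'), ('NUMBER', '1')]
```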
Syntax Analysis:-
· The second stage of translation is called syntax analysis or parsing. In this phase expressions, statements, declarations, etc. are identified by using the results of lexical analysis.
· Syntax analysis is aided by techniques based on the formal grammar of the programming language.
Intermediate Code Generation:-
· An intermediate representation of the final machine-language code is produced. This phase bridges the analysis and synthesis phases of translation.
Code Optimization:-
· This is an optional phase intended to improve the intermediate code so that the output runs faster and takes less space.
Code Generation:-
· The last phase of translation is code generation. A number of optimizations to reduce the length of the machine-language program are carried out during this phase.
· The output of the code generator is the machine-language program for the specified computer.
Table Management (or) Book-keeping:-
· This is the portion that keeps the names used by the program and records essential information about each. The data structure used to record this information is called a 'Symbol Table'.
Error Handlers:-
· The error handler is invoked when a flaw (error) in the source program is detected.
· The output of the LA is a stream of tokens, which is passed to the next phase, the syntax analyzer or parser.
· The SA groups the tokens together into syntactic structures such as expressions. Expressions may further be combined to form statements.
· A syntactic structure can be regarded as a tree whose leaves are tokens; such trees are called parse trees.
Regular languages:
· A regular language is a language that can be expressed with a regular expression, or with a deterministic or non-deterministic finite automaton (state machine).
· A language is a set of strings, which are made up of characters from a specified alphabet, or set of symbols. Regular languages are a subset of the set of all languages.
· Regular languages are used in parsing and designing programming languages.
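As a quick illustration of membership in a regular language, the sketch below uses Python's `re` module; the language chosen (strings over {a, b} ending in a, i.e. (a|b)*a) is an assumption made for the example.

```python
import re

# The regular language denoted by (a|b)*a: strings over {a, b} ending in a.
pattern = re.compile(r"(a|b)*a")

def in_language(s):
    # fullmatch requires the whole string to match, not just a prefix.
    return pattern.fullmatch(s) is not None

print(in_language("abba"))  # True
print(in_language("abb"))   # False
```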
Operations on Regular Languages
· A regular language can be represented by a string of symbols and operations.
Concatenation
· Concatenation is an operation that combines two symbols, sets, or languages.
· There are two main ways to write a concatenation: X∘Y and XY both mean "X concatenated with Y."
· Languages can also be concatenated: L1∘L2 means strings from L1 are written first, and then strings from L2 come after.
Formally, concatenation is defined as follows: A∘B = {xy ∣ x∈A and y∈B}.
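For finite languages, the definition A∘B = {xy ∣ x∈A and y∈B} translates directly into a set comprehension; the two example languages below are made up for illustration.

```python
# Concatenation of two finite languages, following A∘B = {xy | x∈A, y∈B}.
def concat(A, B):
    return {x + y for x in A for y in B}

A = {"ab", "c"}
B = {"d", ""}   # the empty string is a legal member of a language
print(sorted(concat(A, B)))  # → ['ab', 'abd', 'c', 'cd']
```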
Union
· Union is an operation where a finite state machine can make one choice or another at any step (this can also be thought of as an "or" operation). For example, there could be a finite state machine that can choose between concatenating X with Y or concatenating X with Z. "Or" is written with a vertical bar, and the expression from the previous sentence looks like this: (X∘Y ∣ X∘Z).
Formally, union is described as follows: A∪B = {x ∣ x∈A or x∈B}. Note that here the vertical bar means "such that."
Kleene Star
· The Kleene star is an operation, denoted by an asterisk *, that represents repeated self-concatenation.
· For example, if a language L represents all strings of the form X*, then this language includes X, XX, XXX, etc.
· The Kleene star can repeat the symbol it operates on any number of times.
· The regular language (X∘Y∣Z)* represents strings of the form "X concatenated with Y, or Z, repeated any number of times."
· Formally, the Kleene star is defined as follows: A* = {x1x2…xk ∣ k≥0 and each xi∈A}. The Kleene star is a unary operation, since it operates on only one operand (unlike union and concatenation, which are binary operations).
Because k can be 0, x can be repeated 0 times. This means that the empty string is always a member of A*.
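Since A* is infinite whenever A contains a non-empty string, a program can only enumerate it up to some repetition bound. The sketch below does exactly that; the bound `max_k` is an artifact of the example, not part of the definition.

```python
# Kleene star of a finite language, truncated at repetition count max_k.
def kleene_star(A, max_k):
    result = {""}            # k = 0 contributes the empty string
    layer = {""}
    for _ in range(max_k):
        layer = {x + y for x in layer for y in A}   # one more repetition
        result |= layer
    return result

print(sorted(kleene_star({"X"}, 3)))  # → ['', 'X', 'XX', 'XXX']
```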
The Empty String
Another important symbol is ε, which represents the empty string.
FINITE AUTOMATON
· A recognizer for a language is a program that takes a string x and answers "yes" if x is a sentence of that language, and "no" otherwise.
· We call the recognizer of the tokens a finite automaton.
· A finite automaton can be deterministic (DFA) or non-deterministic (NFA).
· This means that we may use a deterministic or non-deterministic automaton as a lexical analyzer.
· Both deterministic and non-deterministic finite automata recognize regular sets.
· Which one?
– deterministic: faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used as lexical analyzers.
· First, we define regular expressions for the tokens; then we convert them into a DFA to get a lexical analyzer for our tokens.
Non-Deterministic Finite Automaton (NFA)
· A non-deterministic finite automaton (NFA) is a mathematical model that consists of:
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function that maps state-symbol pairs to sets of states
o s0 - a start (initial) state
o F - a set of accepting states (final states)
· ε-transitions are allowed in NFAs. In other words, we can move from one state to another without consuming any symbol.
· An NFA accepts a string x if and only if there is a path from the starting state to one of the accepting states such that the edge labels along this path spell out x.
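The acceptance condition above can be simulated by tracking the set of states the NFA could be in. This is a sketch: states are plain integers, `""` stands for ε, and the a*b machine at the bottom is an example of my own choosing.

```python
# A small NFA simulator. transitions maps (state, symbol) to a set of
# states; the symbol "" marks an ε-move.
def eps_closure(states, transitions):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in transitions.get((s, ""), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(x, transitions, start, accepting):
    current = eps_closure({start}, transitions)
    for ch in x:
        moved = set()
        for s in current:                       # follow every labeled edge
            moved |= transitions.get((s, ch), set())
        current = eps_closure(moved, transitions)
    return bool(current & accepting)            # any accepting state reached?

# Illustrative NFA for a*b: loop on 'a' in state 0, then 'b' to state 1.
trans = {(0, "a"): {0}, (0, "b"): {1}}
print(nfa_accepts("aab", trans, 0, {1}))  # True
print(nfa_accepts("aba", trans, 0, {1}))  # False
```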
Deterministic Finite Automaton (DFA)
· A deterministic finite automaton (DFA) is a special form of an NFA:
· No state has an ε-transition.
· For each symbol a and state s, there is at most one edge labeled a leaving s; i.e., the transition function maps a state-symbol pair to a state (not to a set of states).
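Because each (state, symbol) pair has at most one outgoing edge, simulating a DFA needs no set bookkeeping at all. The two-state machine below is a hand-built DFA for (a|b)*a, included only as an illustration.

```python
# A DFA's transition function maps each (state, symbol) pair to at most
# one state, so simulation just follows single edges.
def dfa_accepts(x, delta, start, accepting):
    state = start
    for ch in x:
        if (state, ch) not in delta:
            return False              # missing edge: implicit dead state
        state = delta[(state, ch)]
    return state in accepting

# DFA for (a|b)*a: state 1 means "the last symbol read was a".
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0}
print(dfa_accepts("abba", delta, 0, {1}))  # True
print(dfa_accepts("abb",  delta, 0, {1}))  # False
```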
Converting RE to NFA
· This is one way to convert a regular expression into an NFA.
· There can be other (more efficient) ways to do the conversion.
· Thompson's Construction is a simple and systematic method.
· It guarantees that the resulting NFA will have exactly one final state and one start state.
· The construction starts from the simplest parts (the alphabet symbols).
· To create an NFA for a complex regular expression, the NFAs of its sub-expressions are combined.
· To recognize the empty string ε:
· To recognize a symbol a in the alphabet Σ:
· For the regular expression r1 | r2:
N(r1) and N(r2) are the NFAs for the regular expressions r1 and r2.
· For the regular expression r1 r2:
Here, the final state of N(r2) becomes the final state of N(r1r2).
· For the regular expression r*:
· Example: For the RE (a|b)*a, the NFA construction is shown below.
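The construction for (a|b)*a can be written down as a transition table and checked with a small simulator. This is a sketch: the state numbering (0 start, 8 final) is my own and does not come from a particular figure, but the shape follows Thompson's rules above: states 1–6 form a|b, the ε-edges from 0 and 6 add the star, and 7 → 8 on a is the trailing concatenation.

```python
# Thompson-style ε-NFA for (a|b)*a. "" marks ε-transitions.
trans = {
    (0, ""):  {1, 7},   # star: enter a|b, or skip it entirely
    (1, ""):  {2, 4},   # union: choose the a-branch or the b-branch
    (2, "a"): {3},
    (4, "b"): {5},
    (3, ""):  {6},
    (5, ""):  {6},
    (6, ""):  {1, 7},   # star: loop back, or exit
    (7, "a"): {8},      # trailing a; state 8 is the single final state
}

def closure(states):
    out, stack = set(states), list(states)
    while stack:
        for t in trans.get((stack.pop(), ""), set()):
            if t not in out:
                out.add(t)
                stack.append(t)
    return out

def accepts(x):
    cur = closure({0})
    for ch in x:
        cur = closure({t for s in cur for t in trans.get((s, ch), set())})
    return 8 in cur

print(accepts("bba"))  # True
print(accepts("ab"))   # False
```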
Converting NFA to DFA (Subset Construction)
We merge together NFA states by looking at them from the point of view of the input characters:
· From the point of view of the input, any two states that are connected by an ε-transition may as well be the same, since we can move from one to the other without consuming any character. Thus, states which are connected by an ε-transition will be represented by the same state in the DFA.
· If it is possible to have multiple transitions on the same symbol, then we can regard a transition on a symbol as moving from a state to a set of states (i.e., the union of all those states reachable by a transition on the current symbol). Thus these states will be combined into a single DFA state.
To perform this operation, let us define two functions:
· The ε-closure function takes a state and returns the set of states reachable from it by (zero or more) ε-transitions. Note that this will always include the state itself. We should be able to get from a state to any state in its ε-closure without consuming any input.
· The function move takes a state and a character, and returns the set of states reachable by one transition on this character.
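The two functions above drive the subset construction: each DFA state is a set of NFA states. The sketch below is a minimal version of the algorithm (the tiny three-state NFA at the bottom is an assumption made for the demonstration, and unreachable-edge targets collapse into the empty "dead" set).

```python
# Sketch of subset construction: DFA states are frozensets of NFA states.
# nfa maps (state, symbol) to a set of states; "" is an ε-transition.
def eps_closure(states, nfa):
    out, stack = set(states), list(states)
    while stack:
        for t in nfa.get((stack.pop(), ""), set()):
            if t not in out:
                out.add(t)
                stack.append(t)
    return frozenset(out)

def move(states, ch, nfa):
    return {t for s in states for t in nfa.get((s, ch), set())}

def subset_construction(nfa, start, alphabet):
    d0 = eps_closure({start}, nfa)
    dfa, worklist = {}, [d0]
    while worklist:
        S = worklist.pop()
        for ch in alphabet:
            T = eps_closure(move(S, ch, nfa), nfa)
            dfa[(S, ch)] = T
            if T not in {s for s, _ in dfa}:   # T not yet expanded as a source
                worklist.append(T)
    return d0, dfa

# Tiny example NFA: 0 -ε-> 1, 0 -a-> 0, 1 -b-> 2
nfa = {(0, ""): {1}, (0, "a"): {0}, (1, "b"): {2}}
start, dfa = subset_construction(nfa, 0, "ab")
print(start)  # → frozenset({0, 1})
```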
Scanner Generator
Lex specifications:
A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants (a manifest constant is an identifier declared to represent a constant, e.g. #define PIE 3.14), and regular definitions.
2. The translation rules of a Lex program are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
… …
where each p is a regular expression and each action is a program fragment describing what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex, the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyzer.
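The pattern/action structure of the translation rules can be mimicked in a few lines of Python. This is only a sketch of the idea, not real Lex: the rules and token names are invented for the example, and unlike Lex (which prefers the longest match) this sketch simply takes the first rule that matches.

```python
import re

# A Lex-style scanner sketch: each (pattern, action) pair mirrors a
# translation rule "p { action }". Patterns and token names are illustrative.
rules = [
    (r"\d+",          lambda lex: ("NUM", int(lex))),
    (r"[A-Za-z_]\w*", lambda lex: ("ID", lex)),
    (r"\s+",          lambda lex: None),          # action: skip whitespace
]

def scan(source):
    tokens, pos = [], 0
    while pos < len(source):
        for pattern, action in rules:             # first matching rule wins
            m = re.compile(pattern).match(source, pos)
            if m:
                result = action(m.group())
                if result is not None:
                    tokens.append(result)
                pos = m.end()
                break
        else:
            raise SyntaxError(f"bad character at position {pos}")
    return tokens

print(scan("x1 42"))  # → [('ID', 'x1'), ('NUM', 42)]
```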