Compiler: Difference between revisions

From Citizendium
imported>Pat Palmer
mNo edit summary
imported>Peter Schmitt
 
(36 intermediate revisions by 10 users not shown)
Line 1: Line 1:
In [[computer science]], a '''compiler''' is a [[translation system]] for a [[computer]] [[programming language]].  For example, a compiler might translate a human readable program, called [[source code]], into [[machine code]].  The theory behind compilers is sufficient for translation between any two [[formal language|formal languages]], which are fully specified so there can be no ambiguity, but not for translating between [[natural language|natural languages]], which are much more complex. 
{{subpages}}


=Input and Output=
A '''compiler''' is a program that translates a human-readable, plain text [[computer]] program, called [[source code]], into less human-readable [[machine code]].  In [[computer science]] terms, a compiler is a [[translation system]] for a [[computer]] that automatically translates from one [[formal language]] to another (formal [[languages]] are fully specified so that there can be no ambiguity, as opposed to the [[natural language]]s spoken by people).  A formal language specification, together with a compiler which creates machine code for a [[computer]], constitutes a [[programming language]].
 
The very first electronic [[computer]]s had to be programmed without the benefit of a compiler.  The first implementation of a compiler, as well as the very idea of a compiled language, is generally credited to [[Dr. Grace Murray Hopper]], a mathematician who had taught at Vassar College and was an early programmer of the Harvard [[History_of_computing#Harvard_Mark_I_.281943.29|Mark I computer]].  Hopper, a pioneer along with several other women working on [[History_of_computing#The_first_electronic_computers_.281940.27s.29|early computers]], arguably can be credited with launching the field of [[programming languages]]<ref name="Hopper1">{{cite book|url=http://www.amazon.com/Portraits-Silicon-Robert-Slater/dp/0262691310|title="Portraits in Silicon" by Robert Slater, ch. 20, p. 219|publisher=The MIT Press|year=1987}}</ref>.  The [[history of compilers]] deserves its own article.
 
==Input and output==
The input to a compiler is a file (or files) containing a program in a source language.  The source file is likely to be a human-readable [[programming language]], though it could be any unambiguous representation of an [[algorithm]], such as a [[flow chart]] or other representation of a [[finite state machine]].  The output of a compiler is a different file containing code in a target language, often a low-level [[machine language]], though it could just as well be another high-level language.
The input to a compiler is a file (or files) containing a program in a source language.  The source file is likely to be a human-readable [[programming language]], though it could be any unambiguous representation of an [[algorithm]], such as a [[flow chart]] or other representation of a [[finite state machine]].  The output of a compiler is a different file containing code in a target language, often a low-level [[machine language]], though it could just as well be another high-level language.


=Two levels of compilation=
==Two levels of compilation==
Most modern programming languages perform compilation in two stages, first from the source language to an intermediate language (typically assembly language or a bytecode), and second from the intermediate language to machine code.  In so-called ''managed'' programming languages such as [[Java]] and [[C sharp|C#]], the second compilation is postponed until right before the program needs to execute, in which case it is called "just-in-time" compilation.
Most modern programming languages perform compilation in two stages, first from the source language to an intermediate language (typically assembly language or a bytecode), and second from the intermediate language to machine code.  In so-called ''managed'' programming languages such as [[Java]] and [[C sharp|C#]], the second compilation is postponed until right before the program needs to execute, in which case it is called "just-in-time" compilation.


=How a compiler translates=
==How a compiler translates==
The actions which a compiler must perform usually include the following:
The tasks which a compiler must accomplish include the following:


# [[lexical analysis|Lexical Analysis or Scanning]], in which the input characters are recognized by a set of [[regular expressions]] and output as a sequence of [[token|tokens]].
# [[lexical analysis|Lexical Analysis or Scanning]], in which the input characters are ''recognized'' ([[parsed]]), usually by a set of [[regular expressions]], and output as a sequence of [[token|tokens]].
# [[syntactic analysis|Syntactical Analysis or Parsing]], in which the input tokens are recognized by a set of [[pushdown automatons]] and output a sequence of semantic actions.  
# [[syntactic analysis|Syntactical Analysis or Parsing]], in which the input tokens are recognized by a set of [[pushdown automatons]] and output a sequence of semantic actions.  
# [[semantic analysis|Semantic Analysis]], in which each semantic action builds an internal or intermediate representation of the source program, and [[context sensitive]] errors (any error that cannot be discriminated by a [[context-free language]]) are detected.
# [[semantic analysis|Semantic Analysis]], in which each semantic action builds an internal or intermediate representation of the source program, and [[context sensitive]] errors (any error that cannot be discriminated by a [[context-free language]]) are detected.
# [[optimization (computer science)|Optimization]], in which the compiler attempts to replace computationally expensive portions of the program with less expensive versions, provided that no substitution affects the operation of the program.
# [[code generation|Code Generation]], in which the intermediate language is translated a piece at a time to the target language.
# [[code generation|Code Generation]], in which the intermediate language is translated piece at a time to the target language.
# [[peephole optimizations|Peephole Optimization]], a final optimization pass in which analyses the output code over a small region (the peephole), searching for very localized optimizations.


In actuality, there may be multiple optimization stages scattered throughout this process.  Additionally, most modern compilers repeatedly translate the language from an intermediate representation to a simpler intermediate representation in order to accomodate a wide swath of optimizations that operate on different levels of detail.
In actuality, there may be multiple optimization stages scattered throughout this process.  Additionally, most modern compilers repeatedly translate the language from an intermediate representation to a simpler intermediate representation in order to accommodate a wide swath of optimizations that operate on different levels of detail.


==== Lexical Analysis ====
=== Lexical analysis ===


During lexical analysis, a set of regular expressions translate the input sequence (generally characters) into an output sequence (called tokens).  One popular tool to simplify the creation of lexical analyzers is a software package called [[lex]].
During lexical analysis, a set of regular expressions translate the input sequence (generally characters) into an output sequence (called tokens).  One popular tool to simplify the creation of lexical analyzers is a software package called [[lex]].
Line 25: Line 27:
Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A lexical analyzer could detect errors in a single token, for instance a number that has the letter 'y' in it, or a string with a missing end quote.
Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A lexical analyzer could detect errors in a single token, for instance a number that has the letter 'y' in it, or a string with a missing end quote.


==== Syntactic Analysis ====
=== Syntactic analysis ===


During syntactic analysis, an input sequence of tokens is matched against a set of gramatical constructs called [[productions]].  As each production is matched, a semantic action routine is called.  The role of each semantic action is to build an intermediate representation of the input program, such as a list of variables and functions, and a sequence of instructions comprising each function.
During syntactic analysis, an input sequence of tokens is matched against a set of grammatical constructs called [[productions]].  As each production is matched, a semantic action routine is called.  The role of each semantic action is to build an intermediate representation of the input program, such as a list of variables and functions, and a sequence of instructions comprising each function.


Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A Syntactic analyzer could detect a syntactic error, such as a missing semicolon or curly brace.  A syntactic analyzer cannot detect the use of an undeclared variable.  This is because the declaration of a variable before its use is a [[context sensitive]] langauge requirement, though syntactic analyzers are generally [[context-free]] language recognizers.
Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A Syntactic analyzer could detect a syntactic error, such as a missing semicolon or curly brace.  A syntactic analyzer cannot detect the use of an undeclared variable.  This is because the declaration of a variable before its use is a [[context sensitive]] language requirement, though syntactic analyzers are generally [[context-free]] language recognizers.


==== Semantic Analysis ====
=== Semantic analysis ===


During semantic analysis, a compiler builds and examines an intermediate representation of the source program and checks it for consistency.
During semantic analysis, a compiler builds and examines an intermediate representation of the source program and checks it for consistency.
Line 37: Line 39:
Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A semantic analyzer could detect errors, such as undeclared variables or functions.
Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase.  A semantic analyzer could detect errors, such as undeclared variables or functions.


==== Code Generation ====
=== Code generation ===
* [[address mode]]
* [[address mode]]
* [[application binary interface|application binary interface (ABI)]]
* [[application binary interface|application binary interface (ABI)]]
Line 50: Line 52:
* [[stack frame]]
* [[stack frame]]


= Optimization =
== Optimizations ==
 
Optimizations are ''optional'' strategies which a compiler may use when emitting output code.  Optimizations may be used to improve code execution speed or memory usage, but only if the performance can be improved without sacrificing the correctness of the translation.
During optimization, a compiler attempts to alter its internal representation of the input program as to improve code speed, size, or many other code characteristics.


* [[alias analysis]]
* [[alias analysis]]
Line 58: Line 59:
* [[constant folding]]
* [[constant folding]]
* [[copy propagation]]
* [[copy propagation]]
* [[common subexpression elimination]]
* [[dead assignment elimination]]
* [[dead code elimination]]
* [[dead code elimination]]
* [[function inlining]]
* [[function inlining]]
Line 63: Line 66:
* [[function inlining|inlining]]
* [[function inlining|inlining]]
* [[loop optimization]]
* [[loop optimization]]
* [[loop peeling]]
** [[loop peeling]]
* [[loop unrolling]]
** [[loop unrolling]]
* [[peephole optimization]]
** [[code motion]]
* [[reduction in strength]]
** [[induction-variable elimination]]  
** [[reduction in strength]]
* [[peephole optimization]] - analyses the output over a small region (the peephole), searching for localized improvements
* [[tail call optimization]]
* [[tail call optimization]]
= See Also =
* [[List_of_code_generation_topics|List of code generation topics]]
* [[List of Compiler Optimizations]]
[[Category:Computers Workgroup]]
[[Category:Mathematics Workgroup]]
[[Category:CZ_Live]]

Latest revision as of 16:53, 16 February 2010

This article is developing and not approved.

A compiler is a program that translates a human-readable, plain text computer program, called source code, into less human-readable machine code. In computer science terms, a compiler is a translation system for a computer that automatically translates from one formal language to another (formal languages are fully specified so that there can be no ambiguity, as opposed to the natural languages spoken by people). A formal language specification, together with a compiler which creates machine code for a computer, constitutes a programming language.

The very first electronic computers had to be programmed without the benefit of a compiler. The first implementation of a compiler, as well as the very idea of a compiled language, is generally credited to Dr. Grace Murray Hopper, a mathematician who had taught at Vassar College and was an early programmer of the Harvard Mark I computer. Hopper, a pioneer along with several other women working on early computers, arguably can be credited with launching the field of programming languages[1]. The history of compilers deserves its own article.

Input and output

The input to a compiler is a file (or files) containing a program in a source language. The source is likely to be written in a human-readable programming language, though it could be any unambiguous representation of an algorithm, such as a flow chart or other representation of a finite state machine. The output of a compiler is a different file containing code in a target language, often a low-level machine language, though it could just as well be another high-level language.

Two levels of compilation

Most modern programming languages perform compilation in two stages, first from the source language to an intermediate language (typically assembly language or a bytecode), and second from the intermediate language to machine code. In so-called managed programming languages such as Java and C#, the second compilation is postponed until right before the program needs to execute, in which case it is called "just-in-time" compilation.

How a compiler translates

The tasks which a compiler must accomplish include the following:

  1. Lexical Analysis or Scanning, in which the input characters are recognized, usually by a set of regular expressions, and output as a sequence of tokens.
  2. Syntactical Analysis or Parsing, in which the input tokens are recognized by a pushdown automaton, which triggers a sequence of semantic actions.
  3. Semantic Analysis, in which each semantic action builds an internal or intermediate representation of the source program, and context-sensitive errors (any error that cannot be expressed as a violation of a context-free grammar) are detected.
  4. Code Generation, in which the intermediate language is translated a piece at a time to the target language.

In practice, there may be multiple optimization stages scattered throughout this process. Additionally, most modern compilers repeatedly translate the program from one intermediate representation to a simpler one in order to accommodate a wide range of optimizations that operate at different levels of detail.

Lexical analysis

During lexical analysis, a set of regular expressions translates the input sequence (generally characters) into an output sequence of tokens. One popular tool to simplify the creation of lexical analyzers is a software package called lex.

Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase. A lexical analyzer can detect errors within a single token, for instance a number that contains the letter 'y', or a string with a missing end quote.
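The two errors just mentioned can be illustrated with a minimal sketch of a regular-expression scanner. The token names, the tiny token set and the LexError class are assumptions made for this illustration; they do not describe lex or any particular compiler.

```python
import re

# Token patterns are tried in order. The names and the small token set are
# assumptions for this sketch, not part of any real language definition.
TOKEN_SPEC = [
    ("BAD_NUMBER", r"\d+[A-Za-z_]\w*"),  # digits running into letters, e.g. 12y3
    ("NUMBER",     r"\d+"),
    ("STRING",     r'"[^"\n]*"'),
    ("BAD_STRING", r'"[^"\n]*'),         # opening quote with no closing quote
    ("IDENT",      r"[A-Za-z_]\w*"),
    ("OP",         r"[=+\-*/;(){}]"),
    ("SKIP",       r"\s+"),
    ("UNKNOWN",    r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


class LexError(Exception):
    """Raised when a single token is malformed (hypothetical helper)."""


def tokenize(text):
    """Return a list of (kind, lexeme) pairs, or raise LexError."""
    tokens = []
    for match in MASTER.finditer(text):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token itself
        if kind == "BAD_NUMBER":
            raise LexError(f"malformed number {lexeme!r}")
        if kind == "BAD_STRING":
            raise LexError(f"string is missing its end quote: {lexeme!r}")
        if kind == "UNKNOWN":
            raise LexError(f"unexpected character {lexeme!r}")
        tokens.append((kind, lexeme))
    return tokens
```

Under these assumptions, tokenize('x = 12;') yields an identifier, an operator, a number and an operator, while tokenize('x = 12y3;') and tokenize('s = "abc') both raise LexError without any knowledge of the surrounding statement.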

Syntactic analysis

During syntactic analysis, an input sequence of tokens is matched against a set of grammatical constructs called productions. As each production is matched, a semantic action routine is called. The role of each semantic action is to build an intermediate representation of the input program, such as a list of variables and functions, and a sequence of instructions comprising each function.

Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase. A syntactic analyzer can detect a syntactic error, such as a missing semicolon or curly brace. A syntactic analyzer cannot detect the use of an undeclared variable. This is because the declaration of a variable before its use is a context-sensitive language requirement, whereas syntactic analyzers are generally context-free language recognizers.
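As a rough sketch of the ideas above, the following hand-written recursive-descent parser accepts a hypothetical toy grammar (statement: IDENT "=" expr ";" and expr: term ("+" term)*). The grammar, the (kind, lexeme) token format and the tuple-shaped tree nodes are all assumptions chosen for illustration; a production compiler would more likely use a generated, table-driven parser.

```python
class ParseError(Exception):
    """Raised when the token sequence matches no production (hypothetical helper)."""


def parse_program(tokens):
    """Parse a list of (kind, lexeme) tokens into ("assign", name, expr) tuples."""
    pos = 0
    statements = []

    def peek():
        return tokens[pos] if pos < len(tokens) else ("EOF", "")

    def expect(kind, lexeme=None):
        nonlocal pos
        found_kind, found_lexeme = peek()
        if found_kind != kind or (lexeme is not None and found_lexeme != lexeme):
            raise ParseError(f"expected {lexeme or kind}, found {found_lexeme or found_kind}")
        pos += 1
        return found_lexeme

    def term():
        # term: NUMBER | IDENT
        if peek()[0] == "NUMBER":
            return ("num", int(expect("NUMBER")))
        return ("var", expect("IDENT"))

    def expr():
        # expr: term ("+" term)*
        node = term()
        while peek() == ("OP", "+"):
            expect("OP", "+")
            node = ("add", node, term())  # semantic action: build a tree node
        return node

    while peek()[0] != "EOF":
        # statement: IDENT "=" expr ";"
        name = expect("IDENT")
        expect("OP", "=")
        value = expr()
        expect("OP", ";")
        statements.append(("assign", name, value))  # semantic action
    return statements
```

With this sketch, a token list for x = y + 1; parses successfully even though y was never declared, whereas the same list without the final semicolon raises ParseError.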

Semantic analysis

During semantic analysis, a compiler builds and examines an intermediate representation of the source program and checks it for consistency.

Readers accustomed to programming may benefit from a few examples of errors that can be detected during this phase. A semantic analyzer can detect errors such as the use of undeclared variables or functions.
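A consistency check of this kind might look like the sketch below. It walks a hypothetical intermediate representation made of ("assign", name, expr) tuples and assumes, purely for illustration, that assigning to a variable also declares it; real languages have their own declaration and scoping rules.

```python
def undeclared_uses(statements):
    """Return one error message per use of a variable before its declaration."""
    declared = set()
    errors = []

    def check(expr):
        kind = expr[0]
        if kind == "var":
            if expr[1] not in declared:
                errors.append(f"use of undeclared variable {expr[1]!r}")
        elif kind == "add":
            check(expr[1])
            check(expr[2])
        # ("num", value) nodes need no checking

    for _, name, value in statements:
        check(value)        # the right-hand side is examined first
        declared.add(name)  # assumption: assignment declares the variable
    return errors
```

For example, a program that assigns 1 to x and then assigns x + z to y produces a single error about z, because x has been declared by that point and z has not.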

Code generation

Optimizations

Optimizations are optional strategies which a compiler may use when emitting output code. Optimizations may be used to improve code execution speed or memory usage, but only if the performance can be improved without sacrificing the correctness of the translation.
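One simple optimization that preserves correctness is constant folding, in which sub-expressions made up entirely of constants are evaluated at compile time. The sketch below is illustrative only: the expression representation (integers for constants, strings for variable names, and (op, left, right) tuples) and the OPS table are assumptions, not the internals of any real compiler.

```python
import operator

# Hypothetical operator table for this sketch.
OPS = {"add": operator.add, "mul": operator.mul}


def fold(expr):
    """Return an equivalent expression with constant sub-expressions evaluated."""
    if not isinstance(expr, tuple):
        return expr  # a constant (int) or a variable name (str)
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)  # both operands known at compile time
    return (op, left, right)         # otherwise keep the operation for run time
```

Here fold(("add", "x", ("mul", 2, 3))) returns ("add", "x", 6): the multiplication is performed once by the compiler, while the addition involving the variable x is left for execution time, so the program's result is unchanged.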