Learning Objectives
By the end of this section, you will be able to:
- Discuss how to build and run programs written in various HLLs
- Describe the work of an HLL runtime management implementation
- List and explain various HLL optimization methods applicable to programs
To implement programs that you create, you must use a process to generate machine code from source code. As previously discussed, the major methods of implementing programming languages are compilation, pure interpretation, and hybrid implementation. These are complex processes best learned in stages. There are differences between a compiler and an interpreter, as shown in Table 7.9.
Compiler | Interpreter
---|---
Scans and translates the entire source code at once | Scans and translates the source code one line at a time
Takes a relatively long time to scan and translate | Takes relatively little time to scan and translate, but has a longer overall execution time
Generates intermediate object code that requires further linking, so it needs more memory | Does not generate intermediate code, so it is highly memory efficient
Reports error messages only after it has scanned the complete program | Translates and executes the program until an error is encountered, then stops
Preprocessing
Compilation is the process that transforms the source code the programmer creates into another language. In short, it is the rendering of one language into another; it involves deconstructing the input, and its product may be machine language. Modern compilation usually starts with preprocessing, a step in which the source code is manipulated to ready it for the compilation stage. Some of the actions the preprocessor can take are as follows:
- Removing any comments and white space
- Executing a preprocessor directive, which describes resources the program needs to compile, usually files to insert into the source code. The C/C++ preprocessor provides a fine example:
  - The #include directive of C/C++ deals with a variety of files, such as header or standard files containing the definitions of objects and functions from the API that are going to be used in the program. It can also take care of user-defined files or modules.
  - The #define directive of C/C++ names a piece of code for the preprocessor to replace every time it sees the name.
  - The #ifdef directive of C/C++ tells the compiler to include a specified piece of code in compilation depending upon some condition.
- Other tasks include pre-identifying high-level constructs such as loops and functions.
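The directives above can be sketched in a short C fragment; the names `BUFFER_SIZE`, `VERBOSE`, and the two functions are illustrative, not from any real program:

```c
#include <assert.h>

#define BUFFER_SIZE 128   /* #define: preprocessor replaces every BUFFER_SIZE with 128 */
#define VERBOSE 1

int buffer_capacity(void) {
    return BUFFER_SIZE;   /* after preprocessing this reads: return 128; */
}

#ifdef VERBOSE            /* #ifdef: this branch is compiled only because VERBOSE is defined */
int verbose_enabled(void) { return 1; }
#else
int verbose_enabled(void) { return 0; }
#endif
```

Because the substitutions happen before compilation proper, the compiler never sees the names `BUFFER_SIZE` or `VERBOSE`, only the text the preprocessor left behind.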
Compilation
The phases of compilation are grouped into three stages that are usually executed in three passes through the preprocessed source code:
- front end that is responsible for the analysis of source code
- middle end that is responsible for performing optimizations of the code
- back end (or code generator) that is responsible for the production of the target code
Front-End Compiler Structure
The front end is responsible for analyzing the source code for syntax, semantics, and other tasks. The first part of this process is lexical analysis, which converts the strings of characters in the source code into tokens. Tokens are atomic symbols the compiler recognizes, and working with them makes the subsequent phases much easier.
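A lexical analyzer can be sketched as a function that repeatedly scans the next token and classifies it. This is a deliberately minimal illustration, recognizing only identifiers, integers, and single-character operators:

```c
#include <ctype.h>

enum tok_kind { TOK_IDENT, TOK_NUMBER, TOK_OP };

/* Scan one token of src starting at *pos, copy its text into out,
   and return its kind; return -1 at end of input. */
int next_token(const char *src, int *pos, char *out) {
    while (src[*pos] == ' ') (*pos)++;           /* skip white space */
    if (src[*pos] == '\0') return -1;
    int n = 0;
    if (isalpha((unsigned char)src[*pos])) {     /* identifier: a run of letters/digits */
        while (isalnum((unsigned char)src[*pos])) out[n++] = src[(*pos)++];
        out[n] = '\0';
        return TOK_IDENT;
    }
    if (isdigit((unsigned char)src[*pos])) {     /* number: a run of digits */
        while (isdigit((unsigned char)src[*pos])) out[n++] = src[(*pos)++];
        out[n] = '\0';
        return TOK_NUMBER;
    }
    out[0] = src[(*pos)++];                      /* anything else: one-character operator */
    out[1] = '\0';
    return TOK_OP;
}
```

Given the input `x = 42`, repeated calls yield the token stream identifier `x`, operator `=`, number `42`, which is exactly the form the parser wants to consume.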
The process proceeds to syntax analysis, usually referred to as parsing: reading the code to make sure it conforms to the syntax rules of the language. Parsing tests this conformance logically by breaking the code into its component parts.
The next step is semantic analysis, which is the discovery of meaning in the code. It takes care of type checking in strongly typed languages. It also performs binding, which associates variable and function references with their declarations and definitions. The compiler runs a static semantic analysis which concerns itself only with the source code without testing inputs.
Semantic analysis cannot figure out all meaning in the code. Many languages also use dynamic binding (late binding), which defers the binding of certain elements, such as objects and methods, until runtime. This is particularly true of OOP languages that support inheritance and polymorphism. Some behavior, such as an array index going out of bounds, cannot be detected until run time and is part of the program's dynamic semantics.
Semantic analysis concludes with the creation and management of the symbol table, a data structure that is used to connect every symbol with necessary information such as data type, scope, and memory location.
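A symbol table can be sketched as a simple array of entries searched from the most recent declaration backward, so that an inner-scope declaration shadows an outer one. The field names and fixed sizes here are illustrative only:

```c
#include <string.h>

/* One entry per declared name: its type, scope depth, and storage offset. */
struct symbol {
    char name[32];
    char type[16];
    int  scope_level;
    int  offset;
};

#define MAX_SYMBOLS 64
static struct symbol table[MAX_SYMBOLS];
static int n_symbols = 0;

/* Record a declaration; returns its index in the table. */
int sym_add(const char *name, const char *type, int scope, int offset) {
    struct symbol *s = &table[n_symbols];
    strncpy(s->name, name, sizeof s->name - 1);
    strncpy(s->type, type, sizeof s->type - 1);
    s->scope_level = scope;
    s->offset = offset;
    return n_symbols++;
}

/* Look a name up, preferring the innermost (most recently added) scope. */
struct symbol *sym_lookup(const char *name) {
    for (int i = n_symbols - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return 0;
}
```

Real compilers use hash tables and per-scope structures rather than a linear scan, but the idea is the same: every reference to a name is resolved to one entry carrying its type, scope, and location.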
Middle-End Compiler Structure
The main job of the middle-end stage of compilation is optimization: producing what is needed to generate the fastest, most efficiently running code possible. The attributes usually optimized are execution time and space (memory usage). Figure 7.24 outlines this process.
Intermediate Form
The result created after semantic analysis, when the program passes all checks, is called intermediate form (IF). The nature of the IF is often chosen for machine independence and ease of optimization. IFs usually resemble the machine code for some imaginary idealized machine. An example is managed code, which is targeted to run under a particular runtime environment. This is true of C# and the other Microsoft .NET languages: they are designed to run under the Common Language Runtime (CLR), which manages the execution of Microsoft's .NET languages. In this case, the purpose is to allow code from the different languages to work well together at execution time.
Technology in Everyday Life
HLLs and Big Data
There is a growing need to process massive amounts of "big" data to gather insights in real time about the best way a business can serve its customers. This is called data analytics, the systematic computational analysis of data or statistics. Netflix uses this technique to plan its targeted advertising. Basically, Netflix monitors which types of movies its customers typically watch and collects related data that it uses to bring in new movies and advertise them to its customers. This has led to an increased use of concurrent/parallel processing frameworks that allow concurrent programming. This popularity has brought languages like Python a lot of attention for big data analysis. Do some research on Python and provide a well-documented opinion on the features of Python that support this.
Back-End Compiler Structure
The back-end process is known as code generation, the transformation of the code to the target or object language. It uses the intermediate form of the code in combination with the symbol table to carry this out. As previously stated, the final produced code is not necessarily machine language. In hybrid implementations, such as in Java and others, the code will still be a type of intermediate form and it is left for another process to turn it into machine code, usually at runtime.
In pure compilation, code generation, like all the compiler's phases, relies on the symbol table, the structure responsible for tracking the identifiers in a program throughout the compiler's work. Once compilation is complete, the symbol table may be retained for use by the debugger.
The back-end phase of the compiler may also perform machine-independent code generation. The purpose of this is to allow a piece of source code to generate instructions that will run on different platforms, such as Windows and Android, without change. Java is a great example of this, as referenced in Figure 7.25. The resulting code can then be run in any environment for which a JVM has been built.
Compilation of Interpreted Languages
In some languages, compilers are present but not pure: they compile only selected pieces and merely preprocess the remaining source. Occasionally the compiler generates code based on assumptions about decisions that won't be settled until runtime. If these assumptions hold, the code runs very fast; if not, a dynamic check reverts to the interpreter.
Dynamic and Just-in-Time Compilation
Sometimes a program delays compilation until the last minute. In Java, bytecode is a set of instructions for a virtual machine. In just-in-time (JIT) compilation, intermediate code, a level of code between the source code and the object code produced by the translation phase, receives its final compilation (or, often, interpretation) right at the start of runtime. Bytecode is the standard format for distributing Java programs to any runtime platform, such as Windows, macOS, and Linux. The bytecode is interpreted by a Java virtual machine (JVM) that has been implemented by the producer specifically for that platform. Since there is a JVM for macOS but not for iOS, Java is incompatible with iPhones.
Assembly
Most compilers generate assembly language that must subsequently be processed by an assembler to create an object file. The process of converting the low-level assembly language emitted by the compiler into the binary machine language the computer can execute is called assembly. Assembly language is a very low-level language that has a strong correspondence with machine language but is still humanly readable, as visible in Figure 4.7. Post-compilation assembly has some distinct advantages: it expedites debugging, because assembly language is easier to read, and it allows a single assembler to be shared among compilers. It also isolates the compiler from changes in the format of machine language files; for example, when computer chips change, only the assembler must be changed.
A computer's hardware does not implement the assembly-level instruction set directly; the instructions are run by an interpreter written in low-level instructions (microcode or firmware), which are housed in read-only memory and executed by the hardware.
Linking
Language implementations that are intended for the construction of large programs support separate compilation, where pieces of the program can be compiled and built independently. Once compilation is complete, these pieces, called compilation units, are rejoined by a linker.
An important function of the linker is to join separately compiled modules, such as precompiled libraries of subroutines from the language API or precompiled user-defined functions, into the final program. If the program draws from a library, it still needs to be linked. Additionally, a static linker completes its tasks before the program runs, producing an executable object file, while a dynamic linker defers linking of some modules until the program is loaded into memory for execution.
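The division of labor between compiler and linker can be sketched with two hypothetical C files (the file and symbol names are illustrative). One unit defines a global and a subroutine; another unit only declares them, and the linker resolves the references:

```c
/* module.c — a compilation unit that can be compiled on its own */
int shared_counter = 0;          /* definition: storage is allocated in this unit */

int bump(void) {                 /* subroutine exported to other units */
    return ++shared_counter;
}

/* A second unit, say main.c, would reference these names without
   defining them; the linker matches the references to the
   definitions above at link time:

       extern int shared_counter;   // declaration only, no storage
       int bump(void);              // prototype only, no body
*/
```

Compiling each `.c` file produces an object file with unresolved references; a static linker patches those references before the program runs, whereas a dynamic linker would leave library references to be resolved at load time.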
Runtime Program Management
The runtime system refers to the set of libraries that the language implementation depends upon for correct operation. The compiler generates program-specific metadata, data about data, that the runtime must inspect to do its job. It is recommended that the compiler and runtime system be developed together so that features such as the following are handled correctly:
- Garbage collection: the ability to dynamically destroy program objects when no longer needed and recover their memory
- Variable numbers of arguments: some languages allow functions to accept differing numbers and types of arguments that are only known at runtime; languages such as Java, C++, and C# support this through function overloading and variadic parameters
- Exception handling: recovery implemented when a program encounters a runtime error or exception
- Event handling: the ability to respond to runtime occurrences such as a button click or other unpredictable event
- Coroutine and thread implementation: support in languages whose implementations handle concurrency and parallelism
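The variable-arguments feature above can be illustrated with C's standard `stdarg.h` facility, where the callee must be told at runtime how many arguments follow. The function name is illustrative:

```c
#include <stdarg.h>

/* Sum a variable number of int arguments. The caller passes count
   first because the C calling convention does not record how many
   arguments were supplied. */
int sum_ints(int count, ...) {
    va_list ap;
    va_start(ap, count);          /* begin walking the argument list */
    int total = 0;
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int); /* fetch the next argument as an int */
    va_end(ap);
    return total;
}
```

A call such as `sum_ints(3, 1, 2, 3)` works only because the runtime convention and the compiler agree on how the extra arguments are laid out, which is exactly why compiler and runtime system are best developed together.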
Virtual Machines
A virtual machine (VM) provides a complete program execution environment that is the reproduction of a computer architecture. Many modern languages employ virtual machines as their runtime environment. In general computer science, VMs can refer to a system VM as supplied by such vendors as VMware, which provide all the hardware facilities needed to run a standard OS so that programs built for one OS can run on another.
However, with HLLs we make another distinct type of VM called a process VM which provides the environment needed by a single user-level process. Perhaps the first and best example is the Java virtual machine (JVM) because its purpose was to elegantly and efficiently meet a main design goal of the language, which was hardware and OS independence.
The JVM is a complete runtime manager and interpreter. As we have learned, the Java compiler generates as output an intermediate form of code known as bytecode, which is placed in object files with a .class extension. This bytecode is identical for every platform and OS on which Java may be run. The JVM is designed to take the last step of interpreting the bytecode into the machine language and OS instruction set of the target architecture. It is also the runtime manager for the process. A more recent invention is the Java JIT compiler, used when more speed is needed than a JVM interpreter can provide. Figure 7.26 illustrates these principles.
A JVM manages the runtime and provides storage management for class structures (metadata), heaps, register sets, thread stacks, and code storage.
The Microsoft CLR is similar but is directed at the Microsoft .NET platform and all its component languages. The CLR and JVM share several features: both are multithreaded, stack-based VMs; both have garbage collection; and both are platform independent, self-describing, and based on a bytecode notation.
Industry Spotlight
Using Proven HLLs
While runtime program management using virtual machines provides large flexibility gains, many industries that rely on large enterprise-level software to support demanding mission-critical business applications typically rely on more traditional runtime environments and older, proven HLLs. This is particularly true in the financial industry today. A good example is the use of Python as a core language for J.P. Morgan’s Athena program and Bank of America’s Quartz program. Why do you think this is the case?
Symbolic Debugging
While debuggers are present in most virtual machines and integrated development environments (IDEs) and are provided by programming language interpreters, they also come as standalone tools. Symbolic debuggers understand high-level syntax. In short, a debugger finds errors in programs. It can also set conditions for stopping an execution, such as when it reaches a certain point in the source code or when a particular variable is read.
Code Optimization and Improvement
The goal of code optimization is to rework parts of your code for efficiency. Some of the problem areas optimization addresses are redundant computations and inefficient use of hardware registers, multiple functional units (input, output), memory, and cache.
The results of optimization are faster execution and/or decreased memory requirements. Optimization is best carried out at multiple levels, one of which is the basic block level. A basic block is a maximal-length sequence of instructions that will always execute in its entirety if it executes at all. Code improvement at the level of basic blocks is known as local optimization; it consists of eliminating redundant operations (unnecessary loads, common subexpression calculations), providing effective instruction scheduling, and providing register allocation.
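Common subexpression elimination, one of the local optimizations named above, can be shown as a hand-written before/after pair (function names are illustrative; an optimizing compiler performs this transformation itself on the intermediate form):

```c
/* Before: the subexpression (a * b) is computed twice. */
int calc_naive(int a, int b, int c) {
    return (a * b) + (a * b) * c;
}

/* After common-subexpression elimination: the redundant computation
   is hoisted into a temporary and reused. */
int calc_optimized(int a, int b, int c) {
    int t = a * b;       /* computed once */
    return t + t * c;
}
```

Both versions return the same value for every input; the optimized form simply performs one multiplication fewer per call.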
At higher levels, compilers review and analyze all subroutines for further improvements. Code improvement at multiple layers, including loop performance improvement, register allocation, and instruction scheduling, is called global optimization.
Think It Through
Understanding Optimization
The medical industry views code optimization as a must-have. Most medical operations make use of patient portals for communication between patients and the medical practice. This requires both speed and platform portability such as Windows and Android. Because of this, Java is the language of choice for these applications. Why is it important for programmers to understand code optimization techniques? Should they only rely on the optimizing compiler provided for the HLL they use? Why or why not? Will you become a better programmer by understanding the way that lower-level code works in order to have a high-level view of optimization?
As with code optimization, code improvement also aims to increase execution speed. Code improvement focuses on eliminating redundancy within basic blocks, improving loops, and scheduling instructions. The biggest area of improvement is the review and enhancement of the behavior of loops. Reordering loops can be difficult but rewarding, as all data dependencies (loop-carried dependencies) must be respected.
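One common loop improvement is loop-invariant code motion: an expression whose operands do not change inside the loop is hoisted out so it is computed once instead of on every iteration. A hand-written before/after sketch (function names illustrative; compilers do this automatically when data dependencies allow):

```c
/* Before: scale * 2 is recomputed on every iteration even though
   neither operand changes inside the loop (it is loop-invariant). */
int sum_scaled_naive(const int *v, int n, int scale) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += v[i] * (scale * 2);
    return total;
}

/* After loop-invariant code motion: the invariant is hoisted out of
   the loop. This is legal because no value computed in the loop body
   feeds into the hoisted expression (no loop-carried dependency). */
int sum_scaled_hoisted(const int *v, int n, int scale) {
    int total = 0;
    int factor = scale * 2;     /* computed once, before the loop */
    for (int i = 0; i < n; i++)
        total += v[i] * factor;
    return total;
}
```

For a loop that runs a million times, the hoisted version saves a million redundant multiplications while producing identical results.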
The conclusion that one can form based on code optimization and code improvement is that the “need for speed” is paramount in today’s global software and internet environments.
Concepts In Practice
Optimizing Your Code
Did you know that compilers can often improve the performance of your code? This process is called optimization. Optimizers can make your code run faster by using techniques such as removing unnecessary calculations, improving memory usage, and reorganizing code for better efficiency.
By understanding optimization techniques, you can write smoother code and use fewer resources. As a side note, mainstream programming language compilers provided by Microsoft and Oracle are designed to optimize the execution of programs. Microsoft's .NET Framework uses a virtual machine, the CLR, to manage the execution of its various programming languages. The intermediate bytecode generated by compilers that target such virtual machines is optimized using some of the techniques described in this chapter.