Learning Objectives
By the end of this section, you will be able to:
- Write C code using fundamental elements of the language
- Summarize the steps to develop a C program
- Understand the process to compile and run a C program
- Describe how linking is used in a C program
- Understand how to apply version control management
As discussed, a programming language is a kind of computational model that is used to write programs. C is a popular middle-level language that is widely used to create systems software. This section is a crash course in the essentials of C.
Introduction to C
The C programming language was invented in 1972 by Dennis Ritchie of Bell Labs (Figure 4.15) and popularized by the book The C Programming Language by Brian Kernighan and Dennis Ritchie. C’s peculiar name—a single letter—was a pragmatic choice, since C replaced an earlier language named B. C is a procedural, middle-level language that gives programs low-level access to memory. It is a relatively simple language, which makes learning C, and creating a C compiler, easier than for more complex languages. This combination of features made C an instant hit, and it has maintained great popularity and import to this day. C has influenced other programming languages, too. C++ is a newer language that adds the object-oriented paradigm to C.
Why is C so popular? Mainly because its designers managed to strike a balance between low-level and high-level features that allows C code to execute at practically the same speed as assembly language, while allowing programmers to be productive enough to create large, dependable programs. C is the programming language behind much of the lower-level software that we depend on, including operating systems, language compilers, assemblers, text editors, print servers, network drivers, language interpreters, and command-line utilities. Here are some specific software products that are written in C:
- The Java virtual machine (ANSI C)
- Linux, an open-source operating system (C, and some assembly)
- Python (C)
- macOS kernel (C)
- Windows (C, C++)
- The Oracle database (C, C++)
- Cisco routers (C)
Industry Spotlight
Applications of C
C is used in a variety of industries. One example is astrophysics, where scientists write programs that simulate the motion of stellar bodies, and control instruments such as telescopes. Owing to the large size of the universe, these simulations involve performing calculations on very large arrays of numbers. C’s ability to execute fast, and control the layout of large arrays in memory, is advantageous for this application. As a relatively simple language, C is approachable to physicists who are not necessarily expert in computer science. Scientific experiments need to be reproducible, which means that code involved in science needs to work even decades in the future. The fact that C has been a stable, popular language for so long means that it is very likely to endure, which cannot be said of newer niche languages.
One notable feature of C is the way it handles memory. In a program, we have variables and values. For example, in x = 10, x is a variable and 10 is the value. Every value in a program is stored in memory. Memory is divided into four regions: stack, heap, static, and code. These regions store various parts of a running program. Running programs create and destroy values extremely rapidly (perhaps millions or billions per second), and memory is finite, so memory locations must be reused, or else the program would run out of memory quickly. When a value is created, memory is set aside as allocated memory to hold that value. Eventually, when the value is no longer needed, that memory becomes freed memory, meaning it is given back so that it can be reused. The process of allocating and freeing memory is called memory management. A memory leak happens when some memory is allocated but never freed. A memory leak is a bug that causes a program to waste resources; severe leaks can waste all the memory on the computer, causing it to become unresponsive or crash. As a middle-level programming language, C requires programmers to handle memory management manually. This type of flexibility must be used with caution, as careless use may result in programs that are not reliable and secure. In high-level languages, memory management is automated.
Here are some other notable features of C:
- Efficient execution: C is less expressive than some other languages such as C++, yet it is simple enough that compilers can generate machine code that is comparable in speed to hand-written assembly code. A lot of research and development has focused on creating performance-oriented C compilers.
- Portability: C can run in multiple computing environments, also known as having the property of portability. Unix was designed to work on various hardware architectures, so the C language is not hardware-dependent. The same C code can be compiled and executed on different hardware architectures and operating systems.
- Modularity: Modular programming refers to the process of dividing computer programs into separate sub-programs. A module is a separate software component, such as an error handler, that may be used by a variety of applications and functions within a system. C has language support for modularity.
- Procedural and structured programming support: C adheres to the procedural and structured paradigms.
- Data types and operators: Every variable in a C program has a data type. Data types dictate how much memory is used to store the variable, and which kinds of operators can be used with the variable.
- Recursion support: Recursion is the phenomenon of a system being defined in terms of itself. In code, this means a function may call itself again and again. C supports recursion. However, the C language does not guarantee a feature called “tail-call optimization” that makes recursion efficient (some compilers apply it, but programs cannot rely on it), so recursion is not used in C as much as in languages that guarantee it. A tail call is a function call performed as the final action of a function. If the target of a tail call is the same function, the function is said to be tail recursive, which is a special case of recursion. Tail recursion (also called “tail-end recursion”) is useful because a compiler can turn it into a loop.
- Pointers: A pointer is a variable that holds the memory address of another variable and points to that variable (Figure 4.16). Pointers play a crucial role in the C language. They are used to store and manage addresses of dynamically allocated blocks in memory in the underlying computer system. Managing hardware devices involves manipulating certain memory locations, and C’s support for pointers is one of the reasons that it is used to implement kernels and device drivers.
Technology in Everyday Life
C’s Application in the Early Stages of YouTube
C has a variety of integer types, such as short, int, and long. A C programmer needs to decide on the most appropriate type for each piece of information in their program. Smaller types use less memory but can only store a narrower range of values. On a typical computer, the maximum short value is about 32 thousand and the maximum int value is about 2 billion. A good practice is to think critically about how large a particular value might become and pick the smallest data type that accommodates that range.
The YouTube programmers faced this issue when they implemented the view counter for YouTube videos. They had to think through: what is the maximum number of views that a video is likely to garner? Two billion seemed like a safe choice, so they chose int.
This decision turned out to be misguided. In 2014, the viral hit music video “Gangnam Style” by the Korean artist Psy accumulated more than two billion views, and the view counter broke. The int variable storing the number of views of “Gangnam Style” overflowed and wrapped around to a negative number. This proved to be an embarrassment for YouTube, who had to quickly change their code to use long instead.
What is the most appropriate integer data type (short, int, or long) for the following quantities?
- The number of people on an airplane
- The number of people on Earth
- The number of people in a household
- The number of dollars in a bank account
High-level languages usually check array indices at runtime, which makes out-of-range bugs easy to identify and fix, but slows down array subscripts slightly. As a middle-level language, C does not check array indices. An array is a storage space where the elements are stored in contiguous memory cells. They are indexed from 0 (the first cell) to n−1 (last cell).
In C, an invalid array subscript will access memory outside of the array variable. If the subscript is only out of range by a little bit, this will access nearby variables, which is a subtle bug that may go unnoticed. A segmentation fault (“segfault” for short) occurs if the subscript is very far out of range. When this occurs, it will access a memory address that is off-limits to the program, and your operating system will forcibly shut down the program in response. This kind of runtime error can be notoriously difficult to remedy. Out-of-bounds array subscripts are a common source of segmentation fault errors.
Every value in a program is stored at a specific memory address. A pointer is a value that contains a memory address. Technically, a pointer should contain the location of a valid data value. However, many memory locations do not contain valid data values, so it is possible to have an invalid pointer that does not hold a valid location. The pointee is the value that a pointer points at. A pointer is analogous to a street address such as “123 Main Street,” because it refers to a specific location. In that analogy, each building is a pointee. Usually, an address is valid and refers to a place you can visit. However, it is possible to have an invalid address that is not a place that can be visited; for example, if the building at that location was demolished.
One of the differences between middle-level and high-level languages is that high-level languages either prohibit invalid pointers entirely, or provide mechanisms to handle them safely. As a middle-level language, C gives programmers the freedom to create null/invalid pointers, which can be helpful when writing code that interfaces with hardware devices. Since not all hardware devices support the same functionality, a device may indicate that it lacks a given feature with a null/uninitialized pointer, which is fine as long as the program checks for such pointers to determine whether the functionality is available. However, in general, the freedom of using null/invalid pointers comes with a responsibility to ensure that pointers are always used properly. This has proven to be difficult; invalid pointers are a common source of bugs in C programs.
In C, the programmer is responsible for making sure that character arrays are actually big enough to fit strings, and that strings include the null terminator character. A string in C is an array of characters terminated by a null character ('\0'). An example might be something like: char arr[] = "string"; here the compiler appends the null terminator automatically, so the array holds seven characters. Overlooking either responsibility results in bugs. This is a prime example of how middle-level languages such as C expect programmers to deal with more details than do high-level languages.
Link to Learning
The C standard library has dozens of header files and hundreds of functions. It is impractical to memorize all this information. Programmers do not memorize the prototypes (i.e., name and parameters) of library functions. Instead, they refer to reference documents, and develop the skill of finding information in these documents quickly. These C library reference documents are available in many places.
Developing C Programs
A programmer spends significant time working in their development environment; indeed, a professional developer might spend most of their workday using it. It pays to invest some up-front time and attention toward learning your environment and customizing it to your needs so that your ongoing experience will be frictionless and ergonomic. Chefs, mechanics, and other tradespeople focus much attention on cultivating safe and productive workspaces, and in the same way, experienced programmers attend to their development environment.
Programmers working with compiled languages, including C, generally work using the cycle shown in Figure 4.17.
Specifically, these steps are:
- Algorithm Development: The developer designs a high-level understanding of what the code will do and how it will do it. They may document the algorithm with pseudocode, a block diagram, or a sketch. In the case of extremely simple programs, the algorithm may be trivial enough that the programmer can keep it in their head. In some cases, a programmer is implementing an algorithm that someone else created and described in a reference work or research paper.
- Program Development: The programmer writes code that implements the steps of the algorithm.
- Program Translation: The programmer runs the compiler on the code. Often, the code has syntax errors, and the compiler provides error messages describing the errors. A syntax error is a violation of the rules for constructing valid statements in the language. For example, the user may have introduced a typo of some sort, like a missing semicolon, or using a keyword as a variable name. In this case, the programmer goes back to Step 2 (Program Development) to resolve the errors one by one.
- Program Execution: At this step, the code has no syntax errors, so it successfully compiled into a runnable program. The developer runs the program, and tests that it operates properly. An initial draft of code often has a semantic error, which is when code compiles and runs, but does not behave as it should. When a programmer finds a semantic error, they go back to Step 2 to debug the code and fix the semantic error. Eventually, after thorough testing, which requires a specific approach not described here, no more semantic errors can be found, and the code is considered finished.
Some C compilers include:
- GCC, an open-source C compiler developed by the GNU Project
- Clang, an open-source C compiler developed by the LLVM project
- Visual C++, a C and C++ compiler developed by Microsoft
Depending on which operating system you are using, there will be many viable alternative C development environments. An operating system is a complex software program that manages the computer's hardware and provides services to other applications. Examples include Windows 10 and 11, and Linux distributions such as Ubuntu, Fedora, and CentOS. You may choose to use an integrated development environment (IDE), which is a program with a graphical user interface that includes a text editor, compiler, and other tools, all in one application. For example, you can install and use Eclipse for C/C++, an open-source multi-language IDE originally created for Java programming. Eclipse is portable because it is built in Java and can be installed on any operating system that runs Java.
Compiling and Running C Programs
The compilation process involves several steps:
- compiler: high-level language converts to assembly
- assembler: assembly converts to machine code
- linker: a program that performs linking, a process of collecting and combining various pieces of object code into a single program file that can be loaded into memory and executed
In practice, compilers such as GCC bundle all these steps into one command. Usually, when you run the GCC command, GCC compiles, assembles, and links a program.
To write, compile, and run a simple C program:
- Write text of program (i.e., source code) using a text editor, and save it as a text file (e.g., “my_program.c”)
- Run the compiler, assembler, and linker to convert your program from source to an “executable” or “binary.” Compilation is necessary for every program to run and perform the desired operation.
gcc -Wall -g -o my_program my_program.c
GCC compiler options:
- -Wall tells the compiler to enable a broad set of commonly useful “warnings.” These warnings will often identify mistakes.
- -g tells the compiler to generate debugging information.
- If you don’t supply a -o option to set an output filename, it will create an executable called a.out.
- A .c file is called a “module.” Often programs are composed of multiple .c files and libraries that are linked together during the compilation process.
- If the compiler gives errors and warnings, edit the source file, fix it, and recompile. It is a good practice to work on just one error/warning at a time, namely the first one. This is because a syntax error can cause false-alarm errors later in the source code, so warnings/errors after the first one could be false alarms. We recommend that, when you get compile errors or warnings, you edit to fix just the first one, and recompile; do not try to fix warnings/errors after the first one.
Consider the following “Hello World” C program (see footnote 1):
#include <stdio.h> /* include printf prototype */
/* The simplest C Program */
int main(int argc, char **argv) /* main program entry point */
{
    printf("Hello World\n");
    return 0; /* return without error */
}
To run a program in the current directory (on Linux) use ./program ("." means the current directory). In the world of operating systems, everything is defined in terms of directories and files. Even the desktop is a directory, which is a collection of files. A directory can sometimes be empty too, and some directories have hidden files for security reasons. A subdirectory is a directory within a directory.
> ./my_program
Hello World
>
Linking Programs
Figure 4.18 illustrates the processing steps of C programs from source code to execution.
Linking refers to the process of collecting and combining various pieces of object code into a single program file that can be loaded into memory and executed. A linker is a program that performs linking. Understanding linkers will help you build large programs, avoid dangerous programming errors, understand how language scoping rules are implemented, understand other important systems concepts (such as virtual memory and paging), and use shared libraries (library files that can be shared by multiple executable files). Virtual memory is an operating system concept in which secondary storage is used as an extension of main memory to compensate for a memory shortage. Paging is a technique in which data is moved between secondary storage and main memory in small fixed-size regions called pages, which enables quick access to the data. If a requested page is found in main memory, it is called a “page hit”; otherwise, it is a “page miss.”
Programs are translated and linked using a compiler driver, a program that invokes the other components that help translate the high-level program to machine code, as in Figure 4.19 and using the following commands:
linux> gcc -Og -o prog main.c sum.c
linux> ./prog
Linkers are used to ensure:
- Modularity: A program can be written as a collection of smaller source files, rather than one monolithic mass. Using a linker facilitates building libraries of common functions (e.g., Math library, standard C library). A library is a file that contains object code (the code in the object file that is generated after compilation) for functions and global variables (variables that have global scope and can be used anywhere in the program) that are intended to be reused.
- Efficiency: Separate compilation saves time. After changing one source file, the programmer recompiles only that file and then relinks, since there is no need to recompile the other source files. Also, libraries save memory space because common functions can be aggregated into a single file, and yet executable files (the end product after compiling and linking) and running memory images (the program's contents in memory while it runs) contain only code for the functions they actually use.
Linking Steps
Programs define and reference symbols. A symbol is an identifier for a function or a global variable. The first linking step performs symbol resolution. During the symbol resolution step, the linker associates each symbol reference with exactly one symbol definition (Figure 4.20).
Symbol definitions are stored in the object file (by the assembler) in a structure called a symbol table. A symbol table is an array of structures in which each entry includes the name, size, and location of a symbol.
The second linking step performs code relocation (Figure 4.21). This step merges separate code and data sections into single sections (one for code and one for data). It relocates symbols from their relative locations in the .o files (the object files) to their final absolute memory locations in the executable. It updates all references to these symbols to reflect their new positions.
Executable and Linkable Module Format
There are three kinds of object files (modules) that relate to the linking process (Figure 4.22):
- Relocatable Object File (.o file): Contains code and data in a form that can be combined with other relocatable object files to form an executable object file. Each .o file is produced from exactly one source (.c) file.
- Executable Object File (a.out file): Contains code and data in a form that can be copied directly into memory and then executed.
- Shared Object File (.so file): Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or runtime. These object files are called Dynamic Link Libraries (DLLs) on Windows.
All three object files follow the executable and linkable format (ELF) which is a standard binary format for object files originally proposed by AT&T System V Unix, and later adopted by BSD Unix variants and Linux. Unix is an operating system that has been used widely, primarily in servers and software development since the 1970s, and Linux is Unix-compatible.
Symbol Types and Resolution
A linker classifies symbols in three categories as illustrated in Figure 4.23 and Figure 4.24. A symbol can be the name of a variable, or it can be the name of a function or procedure. The three categories are:
- Global symbols: Symbols defined by module m that can be referenced by other modules (e.g., non-static C functions and non-static global variables)
- External symbols: Global symbols that are referenced by module m but defined by some other module.
- Local symbols: Symbols that are defined and referenced exclusively by module m (e.g., C functions and global variables defined with the static attribute); local linker symbols are not local program variables (linker does not deal with the local variables of a function). Also note that local non-static C variables are stored on the stack while local static C variables are stored in either .bss, or .data.
Program symbols are either strong (e.g., procedures and initialized globals) or weak (e.g., uninitialized globals). A strong symbol has a unique memory location. Take the example of int array[2] = {1, 2}; which is a strong definition. If another file also defines the same symbol strongly, the linker cannot tell which definition to use, which creates an ambiguity during the linking process. On the other hand, a weak symbol allows multiple definitions of the same symbol without creating an ambiguity. This helps during the linking process when another file creates a strong definition of the same name. With compilers such as GCC and Clang, a symbol can be marked weak explicitly using the __attribute__((weak)) extension. Global variables should be avoided where possible; otherwise, use static whenever you can, initialize the global variable, or use extern if you reference an external global variable.
The linker applies the following rules:
- Rule 1: Multiple strong symbols are not allowed. Each item can be defined only once, otherwise the linker issues an error.
- Rule 2: Given a strong symbol and multiple weak symbols, choose the strong symbol (references to the weak symbol resolve to the strong symbol).
- Rule 3: If there are multiple weak symbols, pick an arbitrary one (can override this with gcc -fno-common). “-fno-common” helps in catching accidental common name collisions.
Static Libraries
Functions commonly used by programmers (e.g., math, I/O, memory management, string manipulation) can be packaged into a file called a library. A static library (or .a, an archive file) is a simple kind of library that copies the contents of object files into a single file called an archive. The linker tries to resolve unresolved external references by looking for the symbols in one or more archives. An external reference is a symbol that is used in a module, but not defined in that module, so it is expected to be defined in some other module. If an archive member file resolves a reference, the linker links it into the executable. The archiver allows incremental updates: to update a library, recompile the function that changed and replace the corresponding .o file in the archive (Figure 4.25).
Commonly used libraries include libc.a (the C standard library), which handles I/O, memory allocation, signal handling, string handling, date and time, random numbers, and integer math. Another common library is libm.a (the C math library), which handles floating point math (e.g., sin, cos, tan, log, exp, sqrt).
Figure 4.26 illustrates how to link programs with static libraries.
The linker uses the following algorithm to resolve external references:
- Scan .o files and .a files in the command line order.
- During the scan, keep a list of the current unresolved references.
- As each new .o or .a file, obj, is encountered, try to resolve each unresolved reference in the list against the symbols defined in obj.
- If any entries remain in the unresolved list at the end of the scan, issue an error.
Therefore, the command line order matters and libraries should be placed at the end of the command line to avoid linker errors as illustrated in Figure 4.27.
Concepts In Practice
APIs in C
The C language makes it possible to create a modular API (Application Programming Interface) as a library with publicly visible function prototypes but secret function definitions. This is accomplished by distributing the .h files with function declarations freely, while keeping the .c files secret and instead distributing only .a or .so compiled object code. A .h (header) file, used in C and C++, declares the functions that a library provides so that a program can call them instead of implementing the code itself. An example is math.h. This strategy is used in many industries, such as video games. DirectX is an API created by Microsoft for games that run on Windows PCs and Xbox. Microsoft provides a C library with many function calls for game-related operations such as drawing graphics, playing sounds, and reading inputs from the keyboard, mouse, or joystick. A game programmer writes their game as a C program that calls those functions. This arrangement is a good compromise: the convenience of the DirectX API makes game programmers’ work easier, and entices them to create games for Windows and Xbox. But keeping the .c files proprietary means that Microsoft does not have to give away the hard work that went into creating DirectX, Windows, or Xbox.
The same arrangement works on other platforms, too. OpenGL is a cross-platform API that works on almost every modern platform, and Sony PlayStation has a similar API. Both of these are distributed as C libraries with public .h files and proprietary implementations.
Loading Executable Object Files
An object file is a file that combines machine code with metadata, such as symbol information, derived from the source code.
Dynamic Load-Time Linking
Static libraries have the following disadvantages: duplication in the stored executables (nearly every program needs libc), duplication in the running executables, and the need for each application to explicitly relink when system libraries receive even minor bug fixes. A modern solution to this problem is to use shared libraries (also called dynamic link libraries, DLLs, or .so files). A shared library is a library file that can be shared by multiple programs at the same time (Figure 4.28).
When using shared libraries, object files that contain code and data may be loaded and linked into an application dynamically at load time, as illustrated in Figure 4.29. Load-time linking means that dynamic linking happens when a program executable is first loaded and run. This is the common case on Linux, where it is handled automatically by the dynamic linker (ld-linux.so). The standard C library (libc.so) is usually dynamically linked. The ldd tool may be used to identify dependencies/libraries needed at load time. In static linking, the code of the library routines becomes a part of the executable. In dynamic linking, the routines stay in the shared library, so they can be updated without relinking the executable. To dynamically link a library at load time on Linux, place it in the /lib/x86_64-linux-gnu/ directory and compile the source files with the -l flag (e.g., gcc main.c -lcso).
Dynamic Runtime Linking
An alternative to load-time linking is runtime linking, which means that linking occurs after a program has already started running. As illustrated in the sample code, the program source code needs to explicitly call functions to link additional libraries. In Linux, this is done by calls to the dlopen() interface and compiling the source with the -l flag (e.g., gcc main.c -ldl). This approach is useful for distributing software, supporting high-performance Web servers, and performing runtime library interpositioning.
#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>
int x[2] = {1, 2};
int y[2] = {3, 4};
int z[2];
int main()
{
    void *handle;
    void (*addvec)(int *, int *, int *, int);
    char *error;
    /* Dynamically load the shared library that contains addvec() */
    handle = dlopen("./libvector.so", RTLD_LAZY);
    if (!handle) {
        fprintf(stderr, "%s\n", dlerror());
        exit(1);
    }
    ...
    /* Get a pointer to the addvec() function we just loaded */
    addvec = dlsym(handle, "addvec");
    if ((error = dlerror()) != NULL) {
        fprintf(stderr, "%s\n", error);
        exit(1);
    }
    /* Now we can call addvec() just like any other function */
    addvec(x, y, z, 2);
    printf("z = [%d %d]\n", z[0], z[1]);
    /* Unload the shared library */
    if (dlclose(handle) < 0) {
        fprintf(stderr, "%s\n", dlerror());
        exit(1);
    }
    return 0;
}
Tools to Manipulate Object Files
An object file contains a lot of information such as metadata, machine code, and other information from symbols. To manipulate such files, Unix provides certain tools to use them effectively, such as:
- ar: Creates static libraries, and inserts, deletes, lists and extracts members.
- strings: Lists all the printable strings contained in an object file.
- strip: Deletes symbol information from an object file.
- nm: Lists the symbols defined in the symbol table of an object file.
- size: Lists the names and sizes of the sections in an object file.
- readelf: Displays the complete structure of an object file, including all of the information encoded in the ELF header; subsumes the functionality of “size” and “nm.”
- objdump: Displays all of the information in an object file; useful for disassembling binary instructions in the .text section.
- ldd (linux): Lists the shared libraries that an executable needs at runtime.
Version Control Management
The process and tools used to store and improve multiple versions of project files are called version control. Version control also helps support team collaboration and makes it possible to revert to an earlier version. Git is a widely used version control system. Creating and updating project files using Git requires the creation of a Git repository, also known as “repo” for short. A repository is a container for files and related information stored in a version control tool. GitHub is a website that allows free storage of public Git repositories.
Link to Learning
Learn more by installing Git on your local machine on any platform. You may run “brew install git” on macOS to install Git or “sudo apt install git” on Linux.
Useful Git commands are as follows:
- Configure: git config --global user.email "you@example.com" and git config --global user.name "Your Name" to set your identity
- Clone: git clone <repository URL> to download contents
- Pull: git pull origin master to pull latest changes
- Status: git status to see staged (shown in green) and un-staged (shown in red) files
- Staging: git add <filename> to add files to staged area (wildcards accepted)
- Commit: git commit -m "<your message here>" to commit the staged files
- Push: git push origin master to push all changes made locally to the origin
Link to Learning
Explore the Git/GitHub tutorial for more details on how to use Git.
Footnotes
- 1. In the declaration of main, char **argv is a pointer to an array of pointers to characters: each element points to one command-line argument string, and argc holds the number of arguments.