Can you explain how programming languages are designed?
This topic is important for architects because… architects should have a good theoretical grounding in their craft, and programming language design is foundational. Knowledge of programming language design quickly pays off in ways that are not easily foreseen.
Focusing on general purpose languages. In this post, we will define various types of programming languages. The focus, however, is on “general purpose” programming languages like C# or Java.
Lots of little concerns. Implementing a programming language requires considering a lot of details that are not obvious. We can’t cover all of these in detail, but we will define the main concerns here.
Graduate-level course. Programming language theory is typically not covered in depth in an undergraduate Computer Science program; it is usually taught as part of a Master’s program. At the undergrad level, students are taught how to use programming languages, but not how to create them.
History of programming languages. The history of programming languages is important to the serious scholar, but our site is focused on architects, so we are going to skip that.
Co-evolution of programming languages. Programming languages evolve depending on the context in which they are used. Faster machines, multiple CPUs on one machine, running multiple concurrent programs, and the emergence of the web are just a few of the factors that have influenced the evolution of programming languages.
Programming languages vs human languages? Programming languages and human languages are often compared, but that is a weak comparison at best, so we won’t address it here.
Types of Programming Languages
There are scores of important languages, and each survives because it has a niche – something that it excels at. The main categories of programming languages are described below; I have probably missed a few.
Overlapping categories. Most languages fit into two or more of the categories below.
General-purpose. A general-purpose language can be used to create almost anything, including other programming languages. C#, C++, Java, Visual Basic, Python, and Ruby are general-purpose languages.
Special-Purpose. Some languages are designed for special purposes like XML for system integration, SQL for database interaction, and R for statistics. You can’t create “anything”, but these languages excel at the important tasks they are designed for. Although some languages are specialized, they typically do qualify as languages because they share the common attributes of programming languages discussed below.
Ease of Use. Some languages like Visual Basic are designed for ease of use. Generally speaking, the easier a language is to use, the less capable it is as the tasks grow in complexity. That is, easy languages should generally only be considered for easy tasks.
Interpreted vs. Compiled. Some languages must be compiled to run on the native CPU of the computer. Other languages are interpreted. Compiled languages typically run faster than interpreted languages, but this should not be considered a significant issue given the speed of modern CPUs. This was a major concern two decades ago, but not anymore.
Functional vs. Imperative. Most of us program in “imperative” languages that allow us to hold the program’s current state in variables. Functional languages such as Erlang, Haskell, and F# are based on functions and avoid mutable variables. Such languages typically rely heavily on recursion. Experts argue – correctly – that functional languages allow purer and more precise algorithms. Having said that, functional languages tend to be much more difficult for an average programmer to understand than imperative languages. Because they are difficult to use, functional languages are mainly found in academic settings. They are especially prized there because they force students to focus on core CS principles to get the work done.
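To make the contrast concrete, here is a minimal sketch in C (function names are mine, chosen for illustration). The imperative version holds the program’s state in a mutable variable; the functional-style version avoids mutation entirely and lets recursion carry the state:

```c
#include <stddef.h>

/* Imperative style: the running total is mutable state. */
int sum_imperative(const int *xs, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += xs[i];
    return total;
}

/* Functional style: no variable is ever reassigned; the "state"
 * lives in the chain of recursive calls instead. */
int sum_functional(const int *xs, size_t n) {
    if (n == 0)
        return 0;                                  /* base case */
    return xs[0] + sum_functional(xs + 1, n - 1);  /* recursive case */
}
```

Both compute the same result; the functional version simply never mutates anything, which is why functional languages lean so heavily on recursion.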
Object-Oriented. Object-orientation can be simulated with design patterns in almost any language, but having built-in language support for classes that encapsulate methods and attributes is very helpful. That is especially true if the language has built-in support for polymorphism.
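To see why built-in support helps, consider what polymorphism looks like when simulated by hand in a non-object-oriented language like C (a rough sketch; the type names are mine). The function-pointer table below is essentially what an object-oriented compiler generates for you automatically:

```c
/* Hand-rolled polymorphism: a struct of function pointers plays the
 * role of a virtual method table. */
typedef struct Shape {
    double (*area)(const struct Shape *self);
} Shape;

typedef struct {
    Shape base;        /* "inherits" by embedding the base struct first */
    double side;
} Square;

typedef struct {
    Shape base;
    double radius;
} Circle;

static double square_area(const Shape *s) {
    const Square *sq = (const Square *)s;
    return sq->side * sq->side;
}

static double circle_area(const Shape *s) {
    const Circle *c = (const Circle *)s;
    return 3.14159265358979 * c->radius * c->radius;
}

/* The caller dispatches through the pointer without knowing the
 * concrete type -- polymorphism, but wired up manually. */
double total_area(const Shape **shapes, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += shapes[i]->area(shapes[i]);
    return sum;
}
```

In C#, Java, or C++ the dispatch table, the casts, and the embedding are all handled by the language, which is exactly the convenience the paragraph above describes.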
Scientific Computing. Some languages, such as Fortran and MATLAB, are designed mainly for scientific computing. Most programming work is business-oriented rather than scientific, but scientific languages are invaluable to those who need them.
Intermediate. Some languages are not intended to be used by humans, but typically represent an “intermediate” state between initial compilation and the final compilation for the runtime environment. Key examples include Java bytecode and .NET’s Common Intermediate Language (CIL, often just called IL), which runs on the Common Language Runtime (CLR).
KEY POINT: Intermediate languages are real programming languages, but are not intended for human consumption.
What Most Languages Need
Compiler. Interpreters are similar, so we will skip them here. A compiler proceeds through a series of well-defined steps as shown in the figure below. Each step gathers information that is useful to the subsequent step. The first few steps exist to extract the meaning from the incoming program. The last few steps exist to construct a running program from the meaning obtained in the first few steps.
Lexical analysis. The first step above is “lexical analysis”. This simply breaks the text of the program into “tokens”. The “character stream” – the individual text characters – is grouped into tokens when whitespace or special characters are encountered.
For instance, in the “C” language, consider the following line of code.
int value = 100;
The lexical scanner will break that line of code into the following tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
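The scanning step can be sketched in a few lines of C. This is a deliberately simplified lexer (it does not classify keywords vs. identifiers, handle strings, or handle comments, and the `Token` type is my own), but it produces exactly the five tokens listed above:

```c
#include <ctype.h>
#include <string.h>

/* A highly simplified scanner: groups characters into tokens at
 * whitespace and special-character boundaries. */
typedef struct { char text[32]; } Token;

int tokenize(const char *src, Token *out, int max) {
    int count = 0;
    while (*src && count < max) {
        while (isspace((unsigned char)*src))     /* skip whitespace */
            src++;
        if (!*src) break;
        int len = 0;
        if (isalnum((unsigned char)*src)) {      /* word or number */
            while (isalnum((unsigned char)src[len]) || src[len] == '_')
                len++;
        } else {
            len = 1;                             /* operator or symbol */
        }
        if (len > 31) len = 31;                  /* truncate oversized tokens */
        memcpy(out[count].text, src, len);
        out[count].text[len] = '\0';
        count++;
        src += len;
    }
    return count;
}
```

Running it on `"int value = 100;"` yields the token texts `int`, `value`, `=`, `100`, and `;` – the classification of each token (keyword, identifier, and so on) would be the next refinement in a real lexer.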
Parser (syntactic analysis). In this step, the token output from the lexical scanner is checked to ensure that it is syntactically correct. That is, it simply ensures that the tokens are assembled in an allowable and complete order. There is a “grammar” that is specific to the programming language in question that the token sequence is checked against. Programming language parsers are among the most sophisticated algorithms on earth. This makes sense when you consider the features and complexity of modern programming languages and the speed at which they have to run.
KEY POINT: Languages have to have grammars. This is a detailed topic that deserves another post.
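As a toy illustration of checking a token sequence against a grammar, here is a C sketch that validates one hypothetical rule, `declaration := "int" identifier "=" number ";"` (the rule and function names are mine). Real parsers handle nested rules, operator precedence, and error recovery; this only checks that the tokens arrive in an allowable order:

```c
#include <ctype.h>
#include <string.h>

static int is_identifier(const char *t) {
    if (!isalpha((unsigned char)t[0]) && t[0] != '_') return 0;
    for (int i = 1; t[i]; i++)
        if (!isalnum((unsigned char)t[i]) && t[i] != '_') return 0;
    return 1;
}

static int is_number(const char *t) {
    if (t[0] == '\0') return 0;
    for (int i = 0; t[i]; i++)
        if (!isdigit((unsigned char)t[i])) return 0;
    return 1;
}

/* Returns 1 if the tokens satisfy:
 *   declaration := "int" identifier "=" number ";"  */
int parse_declaration(const char *tokens[], int n) {
    return n == 5
        && strcmp(tokens[0], "int") == 0
        && is_identifier(tokens[1])
        && strcmp(tokens[2], "=") == 0
        && is_number(tokens[3])
        && strcmp(tokens[4], ";") == 0;
}
```

The same tokens in a different order fail the check – which is exactly the job of syntactic analysis.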
Semantic analysis. Thus far, the compiler has generated tokens from our text and ensured that they were in the correct order. In this step, the compiler does its best to ensure that the meaning of the expressions is clear and correct – it checks that the “rules” of the language are being respected. Anyone who has seen “compiler errors” is familiar with semantic analysis. Typical examples would include:
- Type checking. Many languages will not permit you to assign a string to an integer. That mistake would be caught here.
- Required declarations. Many languages require you to declare a variable before you assign a value to it. If you try to assign a value to a variable that has not been declared, it will be caught at this stage.
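The second check above can be sketched with a tiny symbol table in C (the names and the `SymbolTable` type are illustrative; real compilers also track types, scopes, and much more). The checker simply rejects assignments to names that were never declared:

```c
#include <string.h>

/* A toy semantic check: record declared names, then verify that
 * every assignment targets a declared variable. */
#define MAX_SYMBOLS 64

typedef struct {
    const char *names[MAX_SYMBOLS];
    int count;
} SymbolTable;

void declare(SymbolTable *st, const char *name) {
    if (st->count < MAX_SYMBOLS)
        st->names[st->count++] = name;
}

/* Returns 1 if the assignment target was declared; 0 signals the
 * familiar "use of undeclared variable" compiler error. */
int check_assignment(const SymbolTable *st, const char *target) {
    for (int i = 0; i < st->count; i++)
        if (strcmp(st->names[i], target) == 0)
            return 1;
    return 0;
}
```

A real semantic analyzer would also record each name’s type in the table, which is how the type-checking example above is enforced.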
Type system. A “type system” is the set of rules that assigns and constrains the types of a program’s constructs. It is a subset of the rules considered during the semantic analysis phase. When most developers think of a “type”, they are thinking about variable types and complex object types. What is not so obvious is that all computer program constructs can be said to have types. This includes expressions, functions, and modules. A programming language can broadly be considered a collection of types.
Modules and Linking. Computer programs are typically constructed with more than one code file. Academic sample programs are often written with a single file, but most practical computer programs can have hundreds or thousands of different source code files. Each of these files is compiled individually. The resultant compilation units are referred to as “modules”. The process of combining the modules into a single program is referred to as “linking”.
Names binding to objects and scoping. When a computer program is running, it loads both executable code and data into memory. When developers write code, they assign names to these sections of code and data. For instance, I might have data in the form of a string variable “firstName” and a section of code in the form of a method called “printName”. To actually use either firstName or printName(), both the variable and the method have to be resident in the computer’s memory.
Locations in the computer’s memory are just numerically sequenced addresses. “Name binding” associates the names with the memory locations. “Scoping” limits the portion of the program in which a bound name will be understood. Consider that there might be two functions with a variable “firstName” that have different meanings and values. A scoping rule for functions – and this is typical – would allow names bound within the function to be understood only within the function.
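A quick C sketch of that scoping rule (function and variable names are mine): both functions bind the name `rate`, but each binding refers to a different object and is only understood inside its own function.

```c
/* Two independent bindings of the name "rate": each is scoped to
 * its own function and refers to its own memory location. */
double shipping_total(double amount) {
    double rate = 0.05;   /* this "rate" means the shipping rate */
    return amount * (1.0 + rate);
}

double tax_total(double amount) {
    double rate = 0.20;   /* a different "rate": the tax rate */
    return amount * (1.0 + rate);
}
```

Neither function can see – or accidentally clobber – the other’s `rate`, which is exactly what scoping buys you.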
A programming language needs to set aside memory for the various objects it needs to manage. This can get quite complicated – most developers have heard of “garbage collection”, for instance, which allows memory cleanup of objects that no longer have a purpose.
There are three key mechanisms for memory management in most programming languages.
Static allocation. When the compiler knows at compile time how much memory will be needed, it simply sets that memory aside. This is typical for constants and global variables, and it is very efficient.
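In C, static allocation looks like this (the names are hypothetical): the compiler knows the exact size of both objects before the program runs, so no allocation work happens at runtime at all.

```c
/* Memory for both objects is reserved at compile/link time. */
static const double PI = 3.14159265358979;  /* a constant */
static int request_count = 0;               /* a global variable */

int handle_request(void) {
    /* No runtime allocation: the storage already exists and
     * persists for the life of the program. */
    return ++request_count;
}
```

Because the storage exists for the whole lifetime of the program, statically allocated objects also keep their values between function calls.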
Stack allocation. Static allocation is easy to understand, but developers are often confused about the difference between “stack” and “heap”. Most developers are, however, familiar with the concept of variables being passed by reference or passed by value. As a rule of thumb, passing by value goes with stack objects and passing by reference goes with heap objects.
KEY POINT: Stack objects are passed by value. Heap objects are passed by reference.
Programming languages don’t really “need” stack memory; some Lisp and Scheme implementations, for instance, manage activation records on the heap instead. Stack allocation does, however, make language implementations simpler and more efficient in ways that are not obvious.
“Stack” allocation is so called because the memory is allocated as in a stack data structure – that is, in last-in, first-out (LIFO) order. Stack allocation is typical for locally declared variables.
It is not obvious, but stack allocation is what makes recursive algorithms possible. Each function call gets its own frame on the stack, holding its own copy of the local data, so a function can safely call itself.
There are several significant limitations to using stack memory. The copying limitation below is well understood. The other two limitations are a little less intuitive, but are nonetheless important. These limitations include:
- Copying. When we pass a stack object to another method, a copy of the object is created. Any changes to the object will not be reflected in the stack object of the method that passed the variable. The stack object is “passed by value” rather than “passed by reference”.
- Return values. A stack object cannot be returned by reference from a method. Once the enclosing function goes out of scope, the memory for the stack object is deallocated – the stack object no longer exists, so a reference to it cannot be used as a return value.
- Function assignments. All developers are accustomed to using variables to store data, but variables can be used to store functions as well. Many advanced programming language features depend on this ability and are beyond the scope of this post, but the function values involved – particularly closures, which carry captured variables with them – generally require heap storage.
Heap allocation. Heap memory allocation is the most flexible, but it is also the most expensive in terms of memory and CPU consumption. With heap memory, you can allocate arbitrary amounts of memory at any time during the run of the program.
Declaring and using heap memory in most modern programming languages is typically as simple as using the keyword “new”.
var myObject = new Object();
Deallocating heap memory is a challenge. Once you have allocated the memory, you eventually need to dispose of it as well. This is not hard for a small computer program, but many programs have to keep track of tens of thousands of heap objects. Most modern languages like Java and C# handle the “garbage collection” of unused objects automatically. That is a monumental computer science achievement.
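In a language without garbage collection, both halves of the bargain are explicit. Here is a C sketch (the `Customer` type and function names are mine) – note that the heap object can legally be returned from the function that created it, unlike a stack object:

```c
#include <stdlib.h>
#include <string.h>

typedef struct { char name[32]; } Customer;

Customer *customer_create(const char *name) {
    Customer *c = malloc(sizeof *c);   /* allocate on the heap */
    if (c) {
        strncpy(c->name, name, sizeof c->name - 1);
        c->name[sizeof c->name - 1] = '\0';
    }
    return c;  /* legal: heap objects outlive the creating function */
}

void customer_destroy(Customer *c) {
    free(c);   /* forgetting this call is a memory leak */
}
```

Garbage-collected languages like Java and C# effectively perform the `customer_destroy` step for you once the object becomes unreachable – which is why "new" appears in those languages but "delete"/"free" does not.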
“Memory Leaks”. Most developers have heard the term “memory leak” and are justifiably concerned when they hear it. Memory leaks are simply heap memory allocations that we have lost track of. Memory leaks can shut down a computer program – and even the whole computer – pretty quickly.
KEY POINT: Memory leaks result from lost references to heap memory.
Other Common Features of Modern Languages
There are a number of advanced features typical of modern programming languages that we have not addressed in this post. These features include classes, subroutines, compiler code optimization, parallel processing, and polymorphism.