A Brief History Lesson

Machine Code

Executable programs are stored in memory as sequences of instructions and data in binary format. Below is a snippet from an executable program as it would look in the main memory of a computer. Each line shows one byte. You can think of a line's position in the listing as its memory address, since every byte in memory has its own distinct address.

01001000
11000111
11000000
00000001
00000000
00000000
00000000
01001000
11000111
11000001
00000001
00000000
00000000
00000000
01001000
00000001
11000001

This type of zeros-and-ones program representation is called machine language, as it is the only language ultimately usable by the hardware. Any program written in a higher-level language (such as C or Java, but even Assembly) first needs to be translated to machine language before it can be executed.

Machine language is specific to a CPU architecture. For instance, a program compiled for an x86-64 processor is essentially unreadable for an ARM processor. You can think of this like an English speaker trying to read French: while the symbols are the same (zeros and ones), the French words have no valid meaning in English, so the text makes no sense.

Binary notation is quite space-inefficient, since each character carries only a single bit. Such data is therefore commonly written in hexadecimal format, where each digit stands for four bits, so one byte takes two digits instead of eight. Here is the same data as above, only this time in the hexadecimal representation:

48 c7 c0 01 00 00 00
48 c7 c1 01 00 00 00
48 01 c1
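The conversion from the binary listing to the hexadecimal one can be checked mechanically. The following is a minimal Python sketch, not part of the original material. The names BINARY_LINES and to_hex are made up for this illustration, and the grouping into rows of 7, 7, and 3 bytes simply mirrors the hexadecimal listing above.

```python
# The 17 bytes from the binary listing, one string of eight bits per byte.
BINARY_LINES = [
    "01001000", "11000111", "11000000", "00000001", "00000000", "00000000", "00000000",
    "01001000", "11000111", "11000001", "00000001", "00000000", "00000000", "00000000",
    "01001000", "00000001", "11000001",
]


def to_hex(bit_strings):
    """Convert strings of eight bits into two-digit lowercase hex strings."""
    return [format(int(bits, 2), "02x") for bits in bit_strings]


hex_bytes = to_hex(BINARY_LINES)

# Print the bytes in the same three rows as the hexadecimal listing.
print(" ".join(hex_bytes[0:7]))    # 48 c7 c0 01 00 00 00
print(" ".join(hex_bytes[7:14]))   # 48 c7 c1 01 00 00 00
print(" ".join(hex_bytes[14:17]))  # 48 01 c1
```

Each group of four bits maps to exactly one hex digit, which is why the conversion never needs to look at neighboring bytes.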

Assembly Language

In ancient times (which in computing terms is roughly between 1920 and 1950), programmers had to enter programs into the computer's memory by means of punched cards: small pieces of cardboard that encoded zeros and ones through the absence or presence of holes in specific locations. Obviously, writing computer programs in zeros and ones, or even in hexadecimal numbers, was a very cumbersome and error-prone task. Modern computer programs easily contain around 10 MiB of machine code, which would amount to 83,886,080 zeros and ones, or 20,971,520 hex digits. Imagine typing that by hand or, even worse, trying to find bugs in such programs.
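The two figures above follow from simple arithmetic: one MiB is 1024 × 1024 bytes, one byte is eight bits, and one byte takes two hex digits. A short Python sketch of that calculation (the variable names are chosen only for this illustration):

```python
BYTES_PER_MIB = 1024 * 1024        # 1 MiB = 2**20 bytes

size_bytes = 10 * BYTES_PER_MIB    # 10 MiB of machine code
bits = size_bytes * 8              # eight bits per byte
hex_digits = size_bytes * 2        # two hex digits per byte

print(f"{bits:,} bits, {hex_digits:,} hex digits")
# 83,886,080 bits, 20,971,520 hex digits
```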

For these reasons, people switched to assemblers in the 1950s. Assemblers are special computer programs that translate program instructions from a more human-readable symbolic representation (an assembly language) to machine code. In an assembly language, each instruction has a short mnemonic or nickname associated with it, and each number can be written in decimal or hex instead of in bits. Just as each architecture has its own machine language, each machine language has its own assembly language, possibly in several flavors (x86-64, for example, has both AT&T and Intel syntax).

Below, you can see the same program snippet as before, only now written in the associated assembly language (and with explanatory comments). Each line corresponds to the line at the same position in the hexadecimal machine code snippet:

movq    $1, %rax    # move the value 1 into rax
movq    $1, %rcx    # move the value 1 into rcx
addq    %rax, %rcx  # add the contents of rax to rcx
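To make the assembler's job concrete, here is a toy assembler sketched in Python. It is an illustration under strong assumptions, not how a real assembler is built: it understands only the two instruction forms used in the snippet above (movq with a small non-negative immediate, and addq between registers) and only the registers rax and rcx. The names REGISTERS, REX_W, and assemble_line are invented for this example.

```python
import re

# Register numbers as they appear in the ModRM byte (only two are supported here).
REGISTERS = {"rax": 0, "rcx": 1}
REX_W = 0x48  # prefix byte selecting 64-bit operand size


def assemble_line(line):
    """Encode one instruction of the toy subset as a bytes object."""
    code = line.split("#", 1)[0].strip()  # drop the comment, if any

    # movq $imm, %reg  ->  REX.W, opcode C7, ModRM, 32-bit little-endian immediate
    mov = re.fullmatch(r"movq\s+\$(\d+),\s*%(\w+)", code)
    if mov:
        imm, dst = int(mov.group(1)), REGISTERS[mov.group(2)]
        return bytes([REX_W, 0xC7, 0xC0 | dst]) + imm.to_bytes(4, "little")

    # addq %src, %dst  ->  REX.W, opcode 01, ModRM with both register numbers
    add = re.fullmatch(r"addq\s+%(\w+),\s*%(\w+)", code)
    if add:
        src, dst = REGISTERS[add.group(1)], REGISTERS[add.group(2)]
        return bytes([REX_W, 0x01, 0xC0 | (src << 3) | dst])

    raise ValueError(f"unsupported instruction: {code!r}")


PROGRAM = """\
movq    $1, %rax    # move the value 1 into rax
movq    $1, %rcx    # move the value 1 into rcx
addq    %rax, %rcx  # add the contents of rax to rcx
"""

for line in PROGRAM.splitlines():
    print(assemble_line(line).hex(" "))
# 48 c7 c0 01 00 00 00
# 48 c7 c1 01 00 00 00
# 48 01 c1
```

The output matches the hexadecimal listing byte for byte. A real assembler does essentially the same lookup-and-encode work, but for the full instruction set, and it also handles labels, directives, and object file formats.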

3GLs

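Assembly languages were a big improvement over raw machine code, but programmers still wanted languages that were easier to read and write and that were not tied to a single architecture. This desire prompted the development of so-called Third Generation Languages (3GLs). C, C++, Java, Pascal, Python, Haskell, ... are examples of 3GLs. While assembly languages require writing instructions for a specific hardware architecture, 3GLs abstract the underlying concepts into an even more human-readable and writeable syntax.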

Now clearly, programs written in such languages cannot simply be assembled and executed. Another step is needed that translates the higher-level abstractions into specific machine instructions. The tools that do this are called compilers.

Unlike assemblers, compilers are among the most complex computer programs in existence. Compiler technology continues to evolve, as it has done for over sixty years since Admiral Grace Hopper wrote the first compiler in assembly language. It is often difficult to predict exactly what instructions a compiler will generate when given a particular snippet of 3GL code, and due to various optimizations the result will almost certainly look different from any human-written Assembly code (sorry to disappoint all those who were counting on compilers to "help" with their assignments).


Why Assembly (Still) Matters

Now, with plenty of 3GLs in existence and compiler technology ever advancing, why should you even learn how to program in Assembly? There are many answers to this question; let's look at some of them:

Who makes the compilers?

Contrary to popular lore, new compilers, virtual machines, and operating system kernels are not passed to us from the heavens. Instead, they have to be conceptualized, built, and refined by (future) engineers, like you. If you, the computer scientists of the future, do not understand the concepts of machine instructions and assembly languages then who will port our kernels to the latest 128-bit CPUs, or develop the next generation of embedded security systems, or the drivers for the newest RTX 42090 Max video card? In a few years, the industry will be looking at you to perform such feats, so better be prepared.

Your programs are weird, let's understand why.

There is another, perhaps even more important reason for you to study Assembly. In the words of Donald E. Knuth, one of the most respected minds in the field:

"Expressing basic methods like algorithms for sorting and searching in machine language makes it possible to carry out meaningful studies of the effects of cache and RAM size and other hardware characteristics (memory speed, pipelining, multiple issue, look aside buffers, the size of cache blocks, etc.) when comparing different schemes."

The point Knuth makes here is that you cannot ever expect to develop proper computer programs if you do not have a basic understanding of how computers work on the lowest level and of how programs are represented there. In fact, there is another priceless quote from Knuth that says it all:

"People who are more than casually interested in computers should have at least some idea of what the underlying hardware is like. Otherwise the programs they write will be pretty weird."

Even if you don't want to learn Assembly, attackers will!

Another interesting point to mention is the use of Assembly in computer exploits. Let's face it, the programs you write are not secure (as will become shockingly apparent to you in the Secure Programming course). However, many vulnerabilities are not apparent without knowledge of the underlying low-level concepts. Even if you don't understand these concepts, attackers will, and they will use them (e.g., by exploiting stack buffer overflows to overwrite the return address and hijack the program flow).
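The return-address overwrite mentioned above can be illustrated with a heavily simplified model. The Python sketch below is only a simulation under stated assumptions: it pretends a stack frame is an 8-byte buffer followed directly by an 8-byte saved return address, and it ignores the saved frame pointer, padding, and protections such as stack canaries. All names, sizes, and addresses are hypothetical.

```python
BUFFER_SIZE = 8    # bytes reserved for the local buffer (hypothetical)
ADDRESS_SIZE = 8   # size of a saved return address on x86-64


def make_frame(return_address):
    """Model a stack frame: the buffer followed directly by the return address."""
    saved = return_address.to_bytes(ADDRESS_SIZE, "little")
    return bytearray(BUFFER_SIZE) + bytearray(saved)


def unchecked_copy(frame, data):
    """Copy data to the start of the frame without checking the buffer size."""
    frame[0:len(data)] = data


def saved_return_address(frame):
    """Read back the return address stored after the buffer."""
    return int.from_bytes(frame[BUFFER_SIZE:BUFFER_SIZE + ADDRESS_SIZE], "little")


frame = make_frame(0x401000)

# Input that fits in the buffer leaves the return address untouched.
unchecked_copy(frame, b"hello")
print(hex(saved_return_address(frame)))  # 0x401000

# Input longer than the buffer spills into the saved return address.
payload = b"A" * BUFFER_SIZE + (0xDEADBEEF).to_bytes(ADDRESS_SIZE, "little")
unchecked_copy(frame, payload)
print(hex(saved_return_address(frame)))  # 0xdeadbeef
```

In a real program, the function would then "return" to the attacker-chosen address. Seeing why requires knowing how the stack and the call and return instructions work at the machine level.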

Whether you want to get into hacking yourself (a quite engaging and profitable part of computer science), write the next generation of protection mechanisms, or simply write secure programs, understanding assembly languages and the more fundamental underlying concepts will be essential. Even if you want to go in a completely different direction, the lessons you learn in this course will almost certainly increase your abilities in some way, especially when it comes to approaching complex problems.
