Compiled languages like C, C++, and Rust are converted into machine code that the CPU can execute. There are multiple binary file formats: raw machine code, PE on Windows, ELF on Linux.
Different CPU architectures have different instruction sets, registers, register sizes, etc.
This workshop will focus on x86 / ia32 / x64 / x86-64 / amd64 (not ia64).
The assembler converts assembly code into machine code. Flat Assembler (fasm) is an x86(64) assembler using Intel-style syntax. Its programmer's manual is a great reference for learning assembly: https://flatassembler.net/docs.php?article=manual
    mov eax, 100
infinite_loop:
    jmp infinite_loop
    B8           # mov eax
    64 00 00 00  # 0x00000064 == 100
    EB           # jmp
    FE           # -2
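To make the encoding concrete, here is a minimal C sketch that decodes those seven bytes by hand. The helper names (decode_imm32, jmp_target) are made up for illustration, the code is assumed to start at offset 0, and only the two opcodes shown above are handled.

```c
#include <stddef.h>
#include <stdint.h>

/* The seven bytes from the listing above: mov eax, 100 ; jmp infinite_loop */
static const uint8_t code[] = { 0xB8, 0x64, 0x00, 0x00, 0x00, 0xEB, 0xFE };

/* Read the 4-byte little-endian immediate that follows opcode B8. */
uint32_t decode_imm32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* EB is a short jump: the byte after it is a signed 8-bit offset,
   relative to the instruction that follows the jmp. */
size_t jmp_target(const uint8_t *base, size_t jmp_offset) {
    int8_t rel = (int8_t)base[jmp_offset + 1];
    size_t next = jmp_offset + 2;   /* this form of jmp is two bytes long */
    return (size_t)((ptrdiff_t)next + rel);
}
```

decode_imm32(code + 1) yields 100, and jmp_target(code, 5) yields 5, the offset of the jmp itself, which is why the loop never ends.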
    mov ebx, mydata
    add ebx, 2
    mov eax, [ebx]
    hlt

mydata:
    db 0xAA, 0xBB, 0xEF, 0xBE
    db 0xAD, 0xDE, 0xEF, 0xBE
    db 0xAD, 0xDE, 0xCC, 0xDD
Endianness defines the order in which the bytes of a larger number (e.g. a 4-byte integer) are stored in memory. For example, the number 0x11223344:
little endian: 44 33 22 11
big endian: 11 22 33 44
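The two byte orders can be sketched in C with explicit shifts, so the result does not depend on the machine running it. The function names below are illustrative, not part of any library.

```c
#include <stdint.h>

/* Store a 32-bit value least-significant byte first (little endian). */
void store_le32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v);
    out[1] = (uint8_t)(v >> 8);
    out[2] = (uint8_t)(v >> 16);
    out[3] = (uint8_t)(v >> 24);
}

/* Store a 32-bit value most-significant byte first (big endian). */
void store_be32(uint8_t out[4], uint32_t v) {
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)(v);
}

/* Load 4 bytes as a little-endian integer, the way x86 does
   for an instruction like mov eax, [ebx]. */
uint32_t load_le32(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Applied to the mydata bytes from the earlier listing, load_le32 at offset 2 reads EF BE AD DE and returns 0xDEADBEEF.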
main:
    mov ebx, mydata
.mylabel:
    mov al, [ebx]
    inc ebx
    cmp al, 0
    jne .mylabel
    dec ebx
    sub ebx, mydata
    ; what is the value of ebx here?

mydata:
    db 0x48, 0x65, 0x6c, 0x6c
    db 0x6f, 0x2c, 0x20, 0x57
    db 0x6f, 0x72, 0x6c, 0x64
    db 0x21, 0x00, 0x00, 0x00
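    #include <stdbool.h>
    #include <stddef.h>

    char mydata[] = "Hello, World!";

    int main(void) {
        char *ebx = mydata;
        while (true) {
            char al = *ebx;
            ebx++;
            if (al == 0) {
                break;
            }
        }
        ebx--;
        // in the assembly, ebx itself ends up holding this difference
        ptrdiff_t length = ebx - mydata;
        return (int)length;
    }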
myfunction:
    ; myfunction assumes ebx and ecx are pointers to data
    mov edx, 1
.loop:
    mov ah, [ebx]
    mov al, [ecx]
    cmp al, ah
    jne .break
    test al, al
    je .end
    inc ebx
    inc ecx
    jmp .loop
.break:
    xor edx, edx
.end:
    ret
    ; what is the value of edx here?
    #include <stdbool.h>

    bool myfunction(char *ebx, char *ecx) {
        while (true) {
            char ah = *ebx;
            char al = *ecx;
            if (ah != al) {
                return false;
            }
            if (ah == 0) {
                return true;
            }
            ebx++;
            ecx++;
        }
    }
The stack and heap are dynamically-sized sections of memory.
The stack holds function-local data, such as local variables.
    #include <string.h>

    int main(void) {
        // integer (100) is on the stack
        int x = 100;
        // pointer is on the stack, pointing to the string
        // in a data section
        const char *asdf = "hello, world";
        // copies the actual string onto the stack
        char buf[13];
        strcpy(buf, asdf);
        return 0;
    }
The stack holds control-flow information, such as return addresses.
    int add(int x, int y) {
        return x + y;
    }

    int main() {
        return add(100, 200);
    }
If add can be called from anywhere, how does it know where to return to? How do we pass 100 and 200 into add?
The stack pointer (rsp) and base pointer (rbp) are special registers that mark the two ends of the current stack frame: rsp points to the top of the stack, and rbp points to the base of the frame.
While you can directly manipulate rsp and rbp, you can also use push and pop to work with the stack.
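As a rough mental model, push and pop can be sketched in C with an array standing in for memory. This is a toy model, not real hardware: ToyStack and its functions are invented for illustration, each slot stands for 8 bytes, and there are no bounds checks. On x86-64, push decrements rsp by 8 and then stores to [rsp]; pop loads from [rsp] and then increments rsp by 8.

```c
#include <stdint.h>

#define STACK_SLOTS 16

/* Toy model: memory is an array of 8-byte slots and rsp is an index into it.
   The stack starts empty with rsp one past the end and grows downward. */
typedef struct {
    uint64_t mem[STACK_SLOTS];
    int rsp;
} ToyStack;

void toy_init(ToyStack *s) {
    s->rsp = STACK_SLOTS;
}

/* push: move rsp down one slot, then store the value at [rsp]. */
void toy_push(ToyStack *s, uint64_t value) {
    s->rsp -= 1;
    s->mem[s->rsp] = value;
}

/* pop: load the value at [rsp], then move rsp back up one slot. */
uint64_t toy_pop(ToyStack *s) {
    uint64_t value = s->mem[s->rsp];
    s->rsp += 1;
    return value;
}
```

Pushing 1 and then 2 and popping twice returns 2 first and then 1, leaving rsp where it started.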
Most functions begin with a function prologue, a series of instructions that initialize a stack frame. Its inverse, the function epilogue, restores the previous frame.
    push rbp        ; needed to restore the prev frame
    mov rbp, rsp    ; update rbp to a new frame
    sub rsp, 60     ; reserve space for local vars
    ...
    ; local vars are usually referenced relative to rbp
    ; for example [rbp-8] or [rbp-40]
    ...
    mov rsp, rbp    ; pop the whole stack frame
    pop rbp         ; restore the previous frame
This could be simplified with the x86 instructions enter and leave. In practice, leave is widely used but enter is not.
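The effect of the prologue and epilogue on rsp and rbp can be traced with a small C model. FrameModel, prologue, and epilogue are hypothetical names, registers are modeled as slot indices into an array (one slot standing for 8 bytes), and nothing is bounds-checked.

```c
#include <stdint.h>

#define FRAME_SLOTS 32

/* Toy model (not real hardware): rsp and rbp are indices into mem. */
typedef struct {
    uint64_t mem[FRAME_SLOTS];
    uint64_t rsp;
    uint64_t rbp;
} FrameModel;

/* Prologue: push rbp ; mov rbp, rsp ; sub rsp, <space for locals> */
void prologue(FrameModel *c, uint64_t local_slots) {
    c->rsp -= 1;
    c->mem[c->rsp] = c->rbp;   /* push rbp: save the caller's frame base */
    c->rbp = c->rsp;           /* mov rbp, rsp: start a new frame */
    c->rsp -= local_slots;     /* sub rsp, N: reserve space for locals */
}

/* Epilogue, which is also what leave does: mov rsp, rbp ; pop rbp */
void epilogue(FrameModel *c) {
    c->rsp = c->rbp;           /* discard the locals in one step */
    c->rbp = c->mem[c->rsp];   /* pop rbp: restore the caller's frame base */
    c->rsp += 1;
}
```

Whatever the function does with its locals in between, running prologue and then epilogue leaves rsp and rbp exactly as the caller had them.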
some_function:
    push rbp
    mov rbp, rsp
    sub rsp, 32

    mov rdi, 100
    mov rsi, 20
    call add_rdi_rsi
    mov [rbp-8], rax
    ...
    leave
    ret

add_rdi_rsi:
    add rdi, rsi
    mov rax, rdi
    ret
The call func instruction is conceptually equivalent to push rip; jmp func: it pushes the instruction pointer (i.e. the address of the next instruction) onto the stack, then jumps to func.
    call add_rdi_rsi    ; push rip / push <ptr to mov [rbp-8], rax>
    mov [rbp-8], rax
The ret instruction is equivalent to pop rip: it takes the most recent stack entry and puts it into the instruction pointer. It is essentially a jmp back to the stored return address.
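The pairing of call and ret can be sketched in C as two operations on a toy machine. CallModel, do_call, and do_ret are invented for this illustration, addresses are plain numbers, and the stack is an array of 8-byte slots with no bounds checks.

```c
#include <stdint.h>

#define CALL_SLOTS 16

/* Toy model (not real hardware). */
typedef struct {
    uint64_t mem[CALL_SLOTS];
    uint64_t rsp;   /* slot index of the top of the stack */
    uint64_t rip;   /* "address" of the next instruction to execute */
} CallModel;

/* call target: push the return address, then jump.
   While call executes, rip already points at the instruction after it. */
void do_call(CallModel *c, uint64_t target) {
    c->rsp -= 1;
    c->mem[c->rsp] = c->rip;   /* push rip */
    c->rip = target;           /* jmp target */
}

/* ret: pop the saved return address back into rip. */
void do_ret(CallModel *c) {
    c->rip = c->mem[c->rsp];   /* pop rip */
    c->rsp += 1;
}
```

After do_call followed by do_ret, rip is back at the instruction following the call and rsp is unchanged, provided the callee left the stack balanced.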
On x86-32 Linux, function arguments are pushed onto the stack in reverse order (last argument first). add(100, 200, 300) would compile to:
    push 300
    push 200
    push 100
    call add
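Pushing in reverse order means the first argument ends up closest to the top of the stack, just above the return address. The C sketch below models this with 4-byte slots; Stack32 and its helpers are hypothetical names, and the return address passed in is an arbitrary placeholder value.

```c
#include <stdint.h>

#define SLOTS32 16

/* Toy 32-bit stack: one slot stands for 4 bytes, esp is a slot index. */
typedef struct {
    uint32_t mem[SLOTS32];
    int esp;
} Stack32;

void push32(Stack32 *s, uint32_t v) {
    s->esp -= 1;
    s->mem[s->esp] = v;
}

/* Mimic the caller: push the arguments right to left, then the
   return address (which is what the call instruction pushes). */
void call_add_100_200_300(Stack32 *s, uint32_t return_address) {
    push32(s, 300);
    push32(s, 200);
    push32(s, 100);
    push32(s, return_address);
}

/* Seen from the callee right after the call, argument n (0-based)
   lives at [esp + 4 + 4*n], i.e. slot esp + 1 + n in this model. */
uint32_t arg(const Stack32 *s, int n) {
    return s->mem[s->esp + 1 + n];
}
```

With this layout arg(&s, 0) is 100, arg(&s, 1) is 200, and arg(&s, 2) is 300, so the callee finds its arguments in source order at increasing addresses.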
On x86-64 Linux, the first six (integer or pointer) function arguments are passed in the registers rdi, rsi, rdx, rcx, r8, and r9, in that order. Any further arguments are passed on the stack, as on 32-bit.
    mov rdi, 100
    mov rsi, 200
    mov rdx, 300
    call add
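The register assignment can be written down as a small lookup in C. int_arg_location is a made-up helper that only models integer and pointer arguments; floating-point arguments use other registers and are not covered here.

```c
#include <stddef.h>

/* Where the n-th (0-based) integer or pointer argument is passed
   under the x86-64 Linux convention described above. */
const char *int_arg_location(size_t n) {
    static const char *const regs[] = {
        "rdi", "rsi", "rdx", "rcx", "r8", "r9"
    };
    if (n < sizeof regs / sizeof regs[0]) {
        return regs[n];
    }
    return "stack";   /* the seventh argument onward */
}
```

For add(100, 200, 300), indices 0 to 2 map to rdi, rsi, and rdx, matching the listing above; index 6 and beyond map to the stack.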
The order and placement of args, which registers must be preserved (e.g. rbp), and how stack frames are cleaned up are together called a calling convention. These slides apply to most Linux systems, but calling conventions differ across platforms and even binaries.
https://en.wikipedia.org/wiki/X86_calling_conventions
Linux binaries are packaged into ELF files.
    $ file /bin/ls
    /bin/ls: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=2f15ad836be3339dec0e2e6a3c637e08e48aacbd, for GNU/Linux 3.2.0, stripped
These files contain all the data necessary to execute machine code as a Linux process.
You can examine all the ELF metadata with readelf -a /bin/ls.
You can view the disassembled code with objdump -D -M intel /bin/ls, although better disassemblers exist.
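The "ELF 64-bit LSB" part of the file output comes from the first few bytes of the file. A minimal C sketch of that check is shown below; is_elf64_lsb is a hypothetical helper that inspects only the first six bytes of a buffer and ignores the rest of the header.

```c
#include <stddef.h>
#include <stdint.h>

/* Return 1 if buf starts like a 64-bit little-endian ELF file, else 0. */
int is_elf64_lsb(const uint8_t *buf, size_t len) {
    if (len < 6) {
        return 0;
    }
    /* Magic number: 0x7F followed by the letters "ELF". */
    if (buf[0] != 0x7F || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F') {
        return 0;
    }
    if (buf[4] != 2) {   /* class byte: 1 = 32-bit, 2 = 64-bit */
        return 0;
    }
    if (buf[5] != 1) {   /* data byte: 1 = little endian (LSB), 2 = big endian */
        return 0;
    }
    return 1;
}
```

To try it on a real binary, read the first bytes of a file such as /bin/ls into a buffer and pass them in.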
When code calls functions from another library, that library can be statically or dynamically linked.
On Linux, dynamically linked libraries are stored in shared objects, with the extension .so. These are also ELF files.
The executable ELF defines which libraries and functions it needs, and they are linked at runtime. The details of this linking will be important for exploits later, but not yet.