Background

A few weeks ago, I saw Rui Ueyama, author of the mold linker, had published a repo for a university operating systems project called Pintos, along with a course description.

These repos describe a project lineage through Stanford, Johns Hopkins, and other universities. Curious about this history, I looked at similar github repos until I encountered a clone with a more detailed commit history, which included changes by the original course creator Ben Pfaff and another well-known professor at Stanford, John Ousterhout.

This latter clone refers to an upstream that no longer seems available on the web, though the Internet Archive does retain a stub that suggests this clone matches its original upstream as of the last Wayback Machine snapshot in October 2022.

Although I never took the operating systems class at my alma mater, the repos above and associated Stanford course materials bore a resemblance to what I remember hearing of that class in the early aughts. Also, it seemed the Pintos source code could build and run on a personal macOS or Linux system, with no dependencies on university-internal servers or infrastructure. This apparently included the automated tests and even the grading scripts!

Given how convenient it was to run the system, plus copious free time and a preexisting curiosity about this material, I decided to give Pintos a try.

Experience

My implementation of projects 1 through 4:

The individual commits for each project:

  1. threads
  2. user programs
  3. virtual memory
  4. file systems

I suspect these projects would’ve been outside of my individual ability range during college; alternatively, rising to meet these challenges would’ve meant accelerated growth as a systems programmer. This experience reminds me of the wide range of skills and knowledge I noticed at age 20, when the strongest of my peers already demonstrated admirable strength. That a college student could encounter these systems concepts for the first time, then help guide other students through that same work only a year later as a TA, is impressive indeed.

That said, ~20 years of practical on-the-job learning (though little focus on OSes in my case) made these projects more fun than frustrating, especially with the project build system, test suite, and debugging workflow (including gdb scripts) making for a generally smooth, well-polished experience.

One interesting wrinkle: reading software blogs and publications like LWN over the years made some of the underlying concepts familiar, even without having studied operating systems in depth. This meant the Pintos project requirements made sense on first read, without having to consult the lecture notes (except for details like what specific page replacement algorithm was expected, what specific on-disk format to use for file system indirect blocks, etc). This ambiently absorbed familiarity included, at least for me:

  • interrupts and preemptive scheduling
  • busy-waiting versus using primitives like condition variables
  • communicating between userspace and kernel with syscalls
  • mapping user address space to kernel address space (with MMU/TLB help)
  • populating memory via page faults and swap

Applying these concepts after gradually learning them over many years seems easier than applying them immediately after first hearing about them. I think especially in systems programming, there can be a lack of conceptual purity, such that even the most important facts to keep in mind feel like a grab bag, rather than a unified, coherent theory or perspective. In some sense, much of what seems most challenging about a project like Pintos is a result of practical, a posteriori facts about the world of computing. There probably are some core physical properties that dictate why everything is so, but it is not always clear exactly how and why in the midst of debugging.

Of course, therein likely lies the charm of an Operating Systems project in C, in contrast to (for example) a Programming Languages project in Scheme. Perhaps the unique fun of the former, not greater than yet distinct from the latter, arises from the muck of realistic constraints and confusions.

What I Learned

Project 1: Threads

Prior to project 1 (threads), I had never really thought about how a thread is represented within an operating system. Maybe more philosophically, I did not know what a thread “physically” “is”; instead, threads (or processes) existed to me as an abstraction around what runs when and with what resources.

Based on project 1 and related course reading (sections A.1 through A.5), my understanding is that a thread can be viewed at a slightly lower level in the following ways:

A thread maps one-to-one with a page of memory.

Creating a new thread involves allocating a page of memory and writing a series of initial values to that page (like we would initialize any other struct). These initial values include a special series of stack frames that can be imagined as what the CPU would have pushed onto the stack (i.e. written to memory) if it had executed the initial CPU scheduler routines and then yielded, only we synthesize these stack frames directly instead of having the CPU create them as a side effect of running call and ret instructions.

Each thread’s page contains the kernel stack of that thread. This means that while a thread is running, the stack register %esp of the CPU will have a value that lies within the address range of that thread’s page (assuming a kernel stack size limit of 4 kB). In other words, we can map from the currently running thread (abstractly) to its physical location in memory by rounding the current value of %esp down to a page boundary.

While a thread runs, the CPU instruction call pushes the value of the instruction pointer %eip (and other information like function arguments) onto the stack, and the instruction ret pops values off the stack. Calling functions and returning from functions thus writes values continuously to memory, and in this way, the CPU leaves breadcrumbs in the current thread’s page of the specific series of function calls that have led to the current logical state. Conceptually, there is an interleaving of computation and data; the data in memory (pushed onto the stack) contain %eip values that tell us what computations led to the present and what computations come next for a given thread.

To switch from thread T1 to thread T2 means to save the current value of %esp to a fixed offset within T1’s page (treating the page like a struct) and to load a similar value from T2’s page into %esp.

To me, this feels a bit magical. How does thread T2 resume in the right place, just based on having its stack pointer restored? Remember that the CPU, while calling and returning from functions, has already been writing stack frames to memory all the while, including right up until the point that T2 last yielded. This means that restoring T2’s stack pointer brings into view all of these previously stored stack frames, containing instruction pointers, function arguments, and similar.

One subtlety: as part of yielding the CPU, all kernel threads call into a common entrypoint, called switch_threads() in Pintos. As a result, all threads (except the currently running thread) have this same function at the top of their callstack. When the previously running thread calls into switch_threads() to yield the CPU, the newly running thread returns from switch_threads() into the function that last called it, when that thread most recently yielded.

In a sense, it is simple for T2 to resume in the right place because all threads suspend in the same place, switch_threads(). The trick is that every thread can have a unique set of values on the stack (in memory) leading up to that switch_threads() invocation, reflecting that thread’s specific execution history in the form of instruction pointers, function arguments, and similar, saved to memory by call and ret instructions running in the context of that thread.

One neat side effect of each thread “being” a page of memory is that we can see each kernel thread’s present callstack by inspecting its corresponding page. The Pintos gdb scripts in ./src/misc/gdb-macros contain an implementation of this concept, called btthread and btthreadall, corresponding to the bt and bt thread apply all commands of gdb.

Overall, I would say that project 1, moreso the reading material and existing Pintos implementation (and less what I added to Pintos as part of the first project), taught me that threads can be viewed as “just” data in-memory that gets updated automatically (so to speak) by CPU instructions like call and ret, the content of which influences subsequent control flow when loaded back into certain registers (e.g. control flow via %eip, page table mappings via %cr3).

Pretty trippy!

Projects 2-4

Strangely, although project 1 took the least amount of time, it contained the most profound-feeling information. Projects 2 through 4 felt more practical in nature.

Another potentially idiosyncratic experience: I felt project 3 (virtual memory) raised more conceptual gotchas than project 4 (file systems), even though I assume most students find the former less demanding than the latter.

For example, it felt strange to me dealing with the consequences of a contiguous user-space address range, like a user-space buffer pointed-to by an argument to the read() syscall, not necessarily mapping to a contiguous set of pages in kernel space. Maybe this is obvious given the existence of page tables and MMU/TLB mappings, but it still felt weird to handle syscalls like read() with code that ended up looking a bit like scatter-gather logic in kernel-space.

In contrast, the most time-consuming parts of project 4 centered less on anything OS-specific and more on concurrency in general, like how to drain outstanding reads on a buffer in the cache before reusing it to serve a different on-disk sector, or how to dispatch block I/O in response to a cache miss for one thread while still responding to in-memory cache hits for other threads. This general class of problem could seemingly apply to project 3 equally, if one took a sophisticated approach to the page cache, like sharing frames when multiple threads mmap() the same file or integrating the page cache tightly with the buffer cache (both of which I personally eschewed).

Maybe real-world software work leads to more day-to-day familiarization with file system complexities, compared to virtual memory (as distinct from memory usage and memory access patterns in general). I can recall learning about what fsck puts in lost+found after an unclean shutdown, why frequent fsync() usage stressed earlier file systems on spinning rust drives so much, how to use strace to investigate slow I/O, and so on. This might all have had the cumulative effect of making project 4 seem more straightforward than it would have without the intervening years of experience.

I did also have the benefit of learning about the WAFL file system at NetApp. Although I never developed a deep understanding, I did at least encounter how inodes and directories can be represented on-disk, how a file system and a block layer can coordinate, and other higher-level concepts. From an outside view, this seems like it would make a university file systems project easier, but given how superficial my knowledge of WAFL and the ONTAP operating system seemed at the time, I’m not so sure. Also, projects 3 and 4 still took me roughly similar amounts of time to complete; I just spent more time in project 4 on synchronization issues that could affect any concurrent programming task, rather than addressing bugs that arose more clearly from an incomplete understanding of VM-specific or filesystem-specific concepts.

Closing Thoughts

I enjoyed Pintos and benefitted from the exercise. The reading materials, projects, and source repo feel well-designed and high-quality, so much so that someone like me can learn from them even in the absence of a classroom, project teammates, and advising TAs. In my opinion, the professors and students who built it should be proud.