Pintos // Eric Woo

Background

A few weeks ago, I saw Rui Ueyama, author of the mold linker, had published a repo for a university operating systems project called Pintos, along with a course description.

Although I never took an operating systems class in college, the repos above and related Stanford course materials bore a resemblance to what I remember hearing of my alma mater’s OS class in the early aughts.

Also, the Pintos source code could build and run on a personal macOS or Linux system, with no dependencies on university-internal servers or infrastructure. This apparently included the automated tests and even the grading scripts! Given all of the above, plus copious free time, I decided to give the Pintos projects a try.

Experience

My implementation of projects 1 through 4:

Also mirrored here.

I suspect I would have struggled with these projects during college; alternatively, rising to meet these challenges would’ve meant accelerated growth as a systems programmer.

One interesting wrinkle: reading software blogs and publications like LWN over the years made some of the material familiar. Applying concepts like preemptive scheduling (with interrupts) and mapping between address spaces (with MMU/TLB help) after learning them gradually over many years was likely much easier than putting those ideas into practice immediately after learning them, as college students must. I suspect this might be particularly so due to the lack of conceptual purity in systems topics, such that even the most important facts to keep in mind often feel quite disparate. From this perspective, the core challenges of a project like Pintos arise from not entirely intuitive, a posteriori facts about the world of computing, rather than a unified, abstract theory.

Of course, therein likely lies the charm of an Operating Systems project in C, in contrast to (for example) a Programming Languages project in Scheme. Perhaps the unique fun of the former, not greater than yet distinct from the latter, arises from the muck of realistic constraints and confusions.

Closing Thoughts

I enjoyed Pintos. The reading materials, projects, and source repo feel well-designed and high-quality, down to the details of the test suites and debugging workflows. I am grateful these educational materials are available so that someone like me can learn from them outside of the originating university setting.

Appendix

Project 1: Threads

Prior to project 1 (threads), I had never really thought about how a thread is represented within an operating system. Maybe more philosophically, I did not know what a thread “physically” “is”; instead, threads (or processes) existed to me as an abstraction around what runs when and with what resources.

Based on project 1 and related course reading (sections A.1 through A.5), my understanding is that a thread can be viewed at a slightly lower level in the following ways:

A thread maps one-to-one with a page of memory.

Creating a new thread involves allocating a page of memory and writing a series of initial values to that page (like we would initialize any other struct). These initial values include a special series of stack frames that can be imagined as what the CPU would have pushed onto the stack (i.e. written to memory) if it had executed the initial CPU scheduler routines and then yielded, only we synthesize these stack frames directly instead of having the CPU create them as a side effect of running call and ret instructions.

Each thread’s page contains the kernel stack of that thread. This means that while a thread is running, the stack register %esp of the CPU will have a value that lies within the address range of that thread’s page (assuming a kernel stack size limit of 4 kB). In other words, we can map from the currently running thread (abstractly) to its physical location in memory by rounding the current value of %esp down to a page boundary.

While a thread runs, the CPU instruction call pushes the value of the instruction pointer %eip (and other information like function arguments) onto the stack, and the instruction ret pops values off the stack. Calling functions and returning from functions thus writes values continuously to memory, and in this way, the CPU leaves breadcrumbs in the current thread’s page of the specific series of function calls that have led to the current logical state. Conceptually, there is an interleaving of computation and data; the data in memory (pushed onto the stack) contain %eip values that tell us what computations led to the present and what computations come next for a given thread.

To switch from thread T1 to thread T2 means to save the current value of %esp to a fixed offset within T1’s page (treating the page like a struct) and to load a similar value from T2’s page into %esp.

To me, this feels a bit magical. How does thread T2 resume in the right place, just based on having its stack pointer restored? Remember that the CPU, while calling and returning from functions, has already been writing stack frames to memory all the while, including right up until the point that T2 last yielded. This means that restoring T2’s stack pointer brings into view all of these previously stored stack frames, containing instruction pointers, function arguments, and similar.

One subtlety: as part of yielding the CPU, all kernel threads call into a common entrypoint, called switch_threads() in Pintos. As a result, all threads (except the currently running thread) have this same function at the top of their callstack. When the previously running thread calls into switch_threads() to yield the CPU, the newly running thread returns from switch_threads() into the function that last called it, when that thread most recently yielded.

In a sense, it is simple for T2 to resume in the right place because all threads suspend in the same place, switch_threads(). The trick is that every thread can have a unique set of values on the stack (in memory) leading up to that switch_threads() invocation, reflecting that thread’s specific execution history in the form of instruction pointers, function arguments, and similar, saved to memory by call and ret instructions running in the context of that thread.

One neat side effect of each thread “being” a page of memory is that we can see each kernel thread’s present callstack by inspecting its corresponding page. The Pintos gdb scripts in ./src/misc/gdb-macros contain an implementation of this concept, called btthread and btthreadall, corresponding to the bt and bt thread apply all commands of gdb.

Overall, I would say that project 1, more so the reading material and existing Pintos implementation (and less what I added to Pintos as part of the first project), taught me that threads can be viewed as “just” data in-memory that gets updated automatically (so to speak) by CPU instructions like call and ret, the content of which influences subsequent control flow when loaded back into certain registers (e.g. control flow via %eip, page table mappings via %cr3).

Pretty trippy!

Projects 2-4

Strangely, although project 1 took the least amount of time, it contained the most profound-feeling information. Projects 2 through 4 felt more practical in nature.

Another potentially idiosyncratic experience: I felt project 3 (virtual memory) raised more conceptual gotchas than project 4 (file systems), even though I assume most students find the former less demanding than the latter.

For example, it felt strange to me dealing with the consequences of a contiguous user-space address range, like a user-space buffer pointed-to by an argument to the read() syscall, not necessarily mapping to a contiguous set of pages in kernel space. Maybe this is obvious given the existence of page tables and MMU/TLB mappings, but it still felt weird to handle syscalls like read() with code that ended up looking a bit like scatter-gather logic in kernel-space.

In contrast, the most time-consuming parts of project 4 centered less on anything OS-specific and more on concurrency in general, like how to drain outstanding reads on a buffer in the cache before reusing it to serve a different on-disk sector, or how to dispatch block I/O in response to a cache miss for one thread while still responding to in-memory cache hits for other threads. This general class of problem could seemingly apply to project 3 equally, if one took a sophisticated approach to the page cache, like sharing frames when multiple threads mmap() the same file or integrating the page cache tightly with the buffer cache (both of which I personally eschewed).

Maybe real-world software work leads to more day-to-day familiarization with file system complexities, compared to virtual memory (as distinct from memory usage and memory access patterns in general). I can recall learning about what fsck puts in lost+found after an unclean shutdown, why frequent fsync() usage stressed earlier file systems on spinning rust drives so much, how to use strace to investigate slow I/O, and so on. This might all have had the cumulative effect of making project 4 seem more straightforward than it would have without the intervening years of experience.

I did also have the benefit of learning about the WAFL file system at NetApp. Although I never developed a deep understanding, I did at least encounter how inodes and directories can be represented on-disk, how a file system and a block layer can coordinate, and other higher-level concepts. From an outside view, this seems like it would make a university file systems project easier, but given how superficial my knowledge of WAFL and the ONTAP operating system seemed at the time, I’m not so sure. Also, projects 3 and 4 still took me roughly similar amounts of time to complete; I just spent more time in project 4 on synchronization issues that could affect any concurrent software system, rather than addressing bugs that arose more clearly from an incomplete understanding of VM-specific or filesystem-specific concepts.