The Unix History Repository: Its creation, contents, and use.

MSR’s keynote, given by Diomidis Spinellis, is about the Unix History Repository. Unix is a very large system that originated at Bell Laboratories. Many ideas and concepts of modern operating systems were first introduced in Unix. The (open source) development process behind Unix has also been an incubator for many ideas that influence today’s software engineering. Unix is therefore a uniquely interesting software (eco-)system as a subject for research.

Diomidis wanted to study the evolution of programming style, which led him to start creating a repository that captures the entire history of the Unix project and to make it publicly available. When the project started, the first edition of Unix (Nov 1971), now also included in the repository, had survived only in print (as a combination of code and hand-written notes). Since then, it has been scanned, assembled, and successfully run on an emulator, which shows that it was already a fully operational operating system.

The repository contains many of the early versions, for which artificial commits were created. The authorship details, including timestamps, names, and email addresses, have been manually reconstructed. This allows the commits in the history repository to be linked to the actual GitHub accounts of their authors. The oldest lines of code that have survived in today’s version of Unix are from the early 1970s.

For the newer versions, the repository mirrors the commits as they are found in the respective version control systems. Branches capture the evolution and interdependence of the different Unix versions and distributions, although the repository currently follows mainly BSD.

The entire repository and its metadata are published under an MIT-style license, for anybody to use and contribute to.

An obvious use of the repository is research on software evolution. Diomidis observes that when he first got his hands on the code of early Unix versions, he realized that his idols had been writing code in a way that violated many of the practices he himself had been taught (e.g., expressive names or documentation through comments). More generally, the repository offers unique opportunities to study how the usage of language features, conventions (such as formatting, ifdefs, and macros), and architectural patterns has evolved over the years.
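To give a feel for what such a style study could measure, here is a minimal Python sketch that computes two simple indicators for a piece of C source text: the share of characters that sit inside comments, and the mean identifier length. It is not taken from the talk or from the repository's tooling. The function name `style_metrics`, the keyword list, and the two metrics are assumptions chosen for illustration, and the tokenizer is deliberately naive (it only knows `/* ... */` comments and does not skip string literals).

```python
import re

# Assumed, simplified list of C keywords to exclude from identifier statistics.
C_KEYWORDS = {
    "auto", "break", "case", "char", "continue", "default", "do", "double",
    "else", "extern", "float", "for", "goto", "if", "int", "long", "register",
    "return", "short", "sizeof", "static", "struct", "switch", "typedef",
    "union", "unsigned", "void", "while",
}

# Block comments only; early C had no // comments.
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
# Whole words that start with a letter or underscore.
IDENT_RE = re.compile(r"\b[A-Za-z_]\w*\b")


def style_metrics(source: str) -> dict:
    """Return naive style indicators for one C source text (illustrative only)."""
    comment_chars = sum(len(c) for c in COMMENT_RE.findall(source))
    # Blank out comments so words inside them are not counted as identifiers.
    code_only = COMMENT_RE.sub(" ", source)
    identifiers = [t for t in IDENT_RE.findall(code_only) if t not in C_KEYWORDS]
    mean_len = sum(map(len, identifiers)) / len(identifiers) if identifiers else 0.0
    return {
        "comment_ratio": comment_chars / len(source) if source else 0.0,
        "mean_identifier_length": mean_len,
        "identifier_count": len(identifiers),
    }


if __name__ == "__main__":
    sample = "/* copy */\nint cp(a, b)\nint a, b;\n{\n\treturn a + b;\n}\n"
    print(style_metrics(sample))
```

Running a function like this over the same file at different points in the history would yield a time series of each indicator, which is the kind of trend a study of naming or commenting practices would look at.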

The project is far from complete. Various data collection challenges remain to be solved to make the repository even more comprehensive. Diomidis closes with a call to action.