The Silent Llama of Doom

Thoughts from a quiet Llama.



Putting A Lid On IT: Linux Containers

Tue 26 January 2021 by A Scented Llama


I’m a little late to the container party which has been raging, I suppose, since about 2013 when Docker hit it big or alternatively in 2005 when Solaris Zones were a thing or maybe even 2000 when FreeBSD jails were introduced. It’s been going for a long time is what I’m trying to say. We’re in that weird after period when all the snacks are gone and the small talk has been exhausted and people are checking their watches. So, hey, this is probably a good time to arrive right?

There are a ton of blogs about containers but almost all of them are Docker centric (with some LXC and podman thrown in here and there). In this post I’m going to go over the lower level details.

What are containers?

This is a surprisingly difficult question to answer. On Linux containers are less a ‘thing’ and more the intersection of a whole lot of things. Explanations like “They’re Virtual Machines without, you know, the virtual machine” don’t help all that much. In order to really understand what containers are we need to understand a little bit about virtualisation.

Virtualisation happens when software lies. Yep - it’s an entire specialisation based on dishonesty. Consider a VM - it’s not a real machine, it’s just some process pretending to be a machine. Dishonest. Our OS even virtualises our processes - we write our programs as if we have a cpu all to ourselves with access to all the memory on the machine. In reality we’re sharing cpu with all the other programs and the OS lies about which memory address we’re currently accessing.

Imagine you’re a process running on some Linux machine. Let’s say you want to access a file, how would you do it? Well - you would call open() which is to say you’d ask the operating system nicely if it wouldn’t mind opening the file for you. The OS could return an error - File doesn’t exist. From your perspective the file doesn’t exist but the OS could have lied about that. Maybe it’s really there. There’s no way you could know [1].
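To make that concrete, here is a minimal sketch of a process asking the kernel for a file and simply having to believe the answer. The helper name tryOpen and the path /no/such/file are made up for illustration; the calls themselves come from Nim's posix module.

```nim
import posix

# Ask the kernel to open a path read-only and report what it told us.
# Returns the errno value on failure, or 0 if the open succeeded.
proc tryOpen(path: string): cint =
  let fd = posix.open(path.cstring, O_RDONLY)
  if fd < 0:
    return errno
  discard posix.close(fd)
  return 0

when isMainModule:
  # "/no/such/file" is a made-up path used only for illustration.
  let err = tryOpen("/no/such/file")
  if err != 0:
    echo "open failed: ", posix.strerror(err)
```

All the process ever sees is the return value and errno. Whether the file is truly missing or just hidden from us is the kernel's business.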

Or maybe you want to access the network. The OS could lie about that as well. Users? Other processes? It’s all the same thing.

Lying sounds negative but in this case it’s really useful. We can create an entire self contained network just by lying to a process and tricking it into thinking the network exists at all. It’s a virtual network. We can make a process think that a specific folder is the root folder - all that exists in the world. That process will happily run within that subfolder as if it was an entirely separate linux (this by the way was the very first step to containers and is called a chroot. It was first introduced in 1979).

So what is a container? A container is a linux process that is being virtualised (lied to). We can choose which aspects of our system to virtualise - the filesystem, the process tree, the network, users, interprocess communication etc…

In Linux each of these virtualisation types is called a namespace. Personally I don’t like the term because we don’t always have ‘names’ associated with anything but that’s the term so I’ll deal. The word ‘Zones’ from Solaris explains the concept better but Solaris Zones are implemented in a very different way.

A Linux namespace is just a ‘view’ of current system resources. If you imagine the process table as an actual SQL table then a process namespace is just the section of that table (a ‘view’) that a process is allowed to see. Linux has 8 namespaces currently: Mount, Process, Network, User, IPC, UTS, Cgroup and Time. Each of these virtualises some aspect of the system.
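If you want to see which namespaces a process currently lives in, the kernel exposes them as symlinks under /proc/self/ns, each pointing at something like pid:[4026531836]. The sketch below lists them. The helper parseNsLink is my own name for illustration, and the exact set of entries depends on your kernel version.

```nim
import os, strutils

# Split a namespace link such as "pid:[4026531836]" into its kind and id.
proc parseNsLink(link: string): tuple[kind: string, id: string] =
  let parts = link.split(":[", maxsplit = 1)
  if parts.len != 2 or not parts[1].endsWith("]"):
    raise newException(ValueError, "unexpected namespace link: " & link)
  # Drop the trailing ']' from the id part.
  result = (kind: parts[0], id: parts[1][0 .. ^2])

when isMainModule:
  # Each entry in /proc/self/ns is a symlink naming one namespace we belong to.
  for kind, path in os.walkDir("/proc/self/ns"):
    let ns = parseNsLink(os.expandSymlink(path))
    echo ns.kind, " -> ", ns.id
```

Two processes that print the same id for a given kind are sharing that namespace.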

Enough Talk - Code

The best way to explain containers is to demonstrate how they work. I’ll be using Nim here because I like the language and think it needs more exposure. Installation instructions are here. I’ll have a gitlab link that contains implementations in Rust and Python as well.

Let’s get started. Create a new project with a funky name related to ‘containing’ stuff. Realise that all the good ones are taken and fall back to something lame like ‘capsule’:

$ nimble init capsule

Select the option for creating a ‘binary’ project. This creates our project structure which will basically be one project file and one source file called, unsurprisingly, capsule.nim. The nim file will helpfully come with a pregenerated “Hello world” so we can make sure it works with:

$ nimble run

One thing to note is how wonderfully fast nim compiles. Okay - delete all the “hello world” stuff. We’ll need the following imports:

import os
import posix

These are pretty basic - we need some os utilities for accessing files and some posix only system calls. These are all included in the nim base so you don’t need any dependencies.

We’ll need a root filesystem so let’s use Alpine Linux, a tiny distribution built for containers:

$ wget https://dl-cdn.alpinelinux.org/alpine/v3.13/releases/x86_64/alpine-minirootfs-3.13.0-x86_64.tar.gz
$ mkdir rootfs
$ tar -C rootfs -xf alpine-minirootfs-3.13.0-x86_64.tar.gz

Make sure this rootfs is in the same folder as our src/ for nim. We’ll just hardcode this going forward.

The first thing we’ll need to do is setup the linux system calls that we’ll need for creating the container. Nim allows us to import the definitions directly from C headers so we can just declare them as follows:

proc c_unshare(flag: int): int {.header: "<sched.h>", importc: "unshare"}
proc c_mount(src: cstring,
             target: cstring,
             fs_type: cstring,
             flags: clong,
             data: cstring): int {.header: "<sys/mount.h>", importc: "mount"}
proc c_umount2(target: cstring,
               flags: clong): int {.header: "<sys/mount.h>", importc: "umount2"}
proc c_syscall(number: cint): int {.header: "<syscall.h>", importc: "syscall", varargs.}

We’re also going to need some constants that have been defined by the kernel:

# From sched.h:
const CLONE_NEWNS = 0x00020000      # New mount namespace group
const CLONE_NEWCGROUP = 0x02000000  # New cgroup namespace (4.6+ kernels only)
const CLONE_NEWUTS = 0x04000000     # New utsname namespace
const CLONE_NEWIPC = 0x08000000     # New ipc namespace
const CLONE_NEWUSER = 0x10000000    # New user namespace
const CLONE_NEWPID = 0x20000000     # New pid namespace
const CLONE_NEWNET = 0x40000000     # New network namespace

And some more extra magic numbers:

# Magic numbers
const SYS_PIVOT_ROOT = 155
const MNT_DETACH = 2
# from sys/mount.h
const MS_BIND = 4096
const MS_REC = 16384

SYS_PIVOT_ROOT is the system call number for the pivot_root system call (155 is its number on x86_64; other architectures use different numbers). This is the call our chroot will be built on. For some reason it’s not wrapped by libc so we have to call it directly with the syscall function.

One more small thing:

template fail_if(cond: typed, reason: string) =
  if cond:
    echo reason
    if errno != 0:
      echo posix.strerror(errno)
    system.quit(1)

Templates are a Nim feature. They’re a bit like C’s #define macros except they’re much safer. Since we’ll be calling a lot of system calls this template will check if the system call failed, will output a user specified failure reason and will additionally output the libc error if there was one. It will then kill the process because we can’t do anything if any of our system calls fail. (Why not just make a fail_if function? You could but then you wouldn’t be able to demonstrate a cool Nim feature!)
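One practical difference between a template and a procedure is that a template’s arguments are pasted into its body rather than evaluated up front. Here is a small sketch of that behaviour. The names reportIf and describe are hypothetical and exist only for this illustration.

```nim
var evaluations = 0

# A helper with a visible side effect so we can see when it runs.
proc describe(): string =
  inc evaluations
  "something went wrong"

# Hypothetical template for illustration: `reason` is substituted into the
# body, so it is only evaluated if the branch that uses it actually runs.
template reportIf(cond: bool, reason: string) =
  if cond:
    echo reason

reportIf(false, describe())   # describe() is never called here
reportIf(true, describe())    # describe() is called once here
echo "describe() ran ", evaluations, " time(s)"
```

With a plain procedure both calls would have evaluated describe() before the condition was even looked at.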

We can now create an unshare procedure (Nim still uses the pascal nomenclature ‘procedure’. Nim also has functions but functions are different in that they are restricted from having side effects):

proc unshare(flag: int) =
  fail_if(c_unshare(flag) < 0):
    "Unable to unshare namespace"

This simply calls the libc unshare system call and will fail if anything goes wrong (i.e. the system call returns -1). The template just makes things slightly neater - the actual generated code is:

proc unshare(flag: int) =
  if c_unshare(flag) < 0:
    echo "Unable to unshare namespace"
    if errno != 0:
      echo posix.strerror(errno)
    system.quit(1)

How to start a container

We’re now in a position to start a new container. How do we do that? There are actually two methods: we can use the clone system call to create a new process or we can fork a new process. If we use clone we can create a new process with new namespaces directly. When we fork a process the child process will share all the namespaces of its parent. To create new namespaces we need to call unshare to, well, unshare them. For this example I’ll be using the unshare method (which you may have guessed since we haven’t imported the clone system call).

Here we go:

unshare(CLONE_NEWUSER or
        CLONE_NEWUTS or
        CLONE_NEWNS or
        CLONE_NEWIPC or
        CLONE_NEWPID or
        CLONE_NEWNET)

This is just shorthand for unsharing several namespaces at once. unshare takes a set of bit flags so we can combine all the flags we want with or and pass them in one fell swoop.
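If bit flags are new to you, here is a tiny sketch of how they combine. The constants are copied from the definitions earlier in the post; hasFlag is a helper name I made up for illustration.

```nim
import strutils

const CLONE_NEWNS = 0x00020000    # New mount namespace
const CLONE_NEWUTS = 0x04000000   # New utsname namespace
const CLONE_NEWPID = 0x20000000   # New pid namespace

# `or` sets bits in the mask, `and` tests whether they are set.
proc hasFlag(mask, flag: int): bool =
  (mask and flag) == flag

let flags = CLONE_NEWNS or CLONE_NEWPID

echo "mask: 0x", toHex(flags, 8)
echo "mount namespace requested: ", hasFlag(flags, CLONE_NEWNS)
echo "uts namespace requested:   ", hasFlag(flags, CLONE_NEWUTS)
```

Each constant occupies its own bit, which is why any combination can travel in a single integer.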

N.B:

We’re using the user namespace functionality here. In general you need root privileges to mess around with namespaces because of various security concerns. However if we virtualise the available users we can be ‘root’ in our namespace which allows us to unshare the rest. On older Linux kernels, apparently, you needed to make sure that you were in the user namespace first like this:

    unshare(CLONE_NEWUSER)
    unshare(CLONE_NEWUTS or CLONE_NEWNS or CLONE_NEWIPC or CLONE_NEWPID or CLONE_NEWNET)

Newer kernels know to first unshare the user namespace so the ordering isn’t important anymore. Something to remember: even though you are now ‘root’ in your new namespace it’s a limited root. As far as Linux is concerned you still have the exact same privileges your normal user had except you’re now allowed to do a little more pretending.

Okay - so our process is now dissociated from its parent namespace. We can now run a command in the new namespace by forking:

let pid = posix.fork()

if pid == 0:
  let mypid = posix.getpid()
  # Note: the $ operator is nim's tostring operator, & is string concatenation:
  echo "I'm in a container! My PID is: " & $mypid
else:
  # Parent just waits for the child:
  var status: cint = 0
  fail_if(posix.waitpid(pid, status, 0) < 0):
    "Failed to wait for child"

The output will be:

I'm in a container! My PID is: 1

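What happened here is the following: the parent unshared its namespaces and forked, and the child came up as PID 1, the first process in the new pid namespace.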

Wait! Why wasn’t our parent process the first process in the new namespace? We unshared the pid namespace didn’t we?

This is a good question. The new pid namespace only comes into effect for child processes. I suspect this is because pulling the rug out from under a running process could cause too many problems. The exception is documented in the kernel source under: include/linux/nsproxy.h where it says:
 * The pid namespace is an exception -- it's accessed using
 * task_active_pid_ns.  The pid namespace here is the
 * namespace that children will use

What happens if we try to access the filesystem?

if pid == 0:
  let mypid = posix.getpid()
  echo "I'm in a container! My PID is: " & $mypid

  for kind, path in os.walkDir("/"):
    echo path

This works fine. What gives? Didn’t we start a new mount namespace? Well - we did. The mount namespace is only for mount points, not the filesystem. We need to switch to a new root filesystem to make a proper container. This is where the pivot_root system call comes in. Let’s create our own chroot function which will use the pivot_root system call:

proc chroot(new_root: string) =
  fail_if(bind_mount(new_root, new_root) < 0):
    "Failed to bind to new filesystem"
  fail_if(posix.chdir(new_root) < 0):
    "Failed to change to new folder"

  if not os.dirExists("oldroot"):
    os.createDir("oldroot")

  # We're ready to pivot:
  fail_if(c_syscall(SYS_PIVOT_ROOT,
                    new_root,
                    "oldroot") < 0):
    "Unable to pivot_root"

  fail_if(c_umount2("oldroot", MNT_DETACH) < 0):
    "Unable to detach old root"

  os.removeDir("oldroot")
  fail_if(posix.chdir("/") < 0):
    "Failed to switch to new root filesystem"

Whoa, that’s a lot. First up we need a bind_mount call so let’s quickly write that:

proc bind_mount(src: string, dst: string): int =
   return c_mount(src, dst, nil, MS_BIND or MS_REC, nil)

What is a bind mount? A bind mount lets us take one part of our directory tree and ‘bind’ it to another part. We use the mount system call with the MS_BIND flag. We also use MS_REC to make sure we bind all submounts too. It’s really useful and we can use it to allow a container to access folders in the host.

You may ask: what’s the point of binding a folder to itself? The first step here is bind_mount(new_root, new_root). Why? The best answer I can give is that it’s a restriction on the pivot_root call. The man page states that new_root and put_old must not be on the same filesystem as the current root, and that new_root must be a mount point. Generally we are on the same filesystem so to fake things we bind the root folder to itself. Once it’s bound Linux will see it as a separate mount and will allow pivot_root to continue. I really have no idea why the restriction exists in the first place.

After the bind we immediately change to the new root directory. Again this is best practice according to the pivot_root man page. Now we get to the pivot part of pivot_root. We create a folder called oldroot then run the pivot_root system call (note: we have to run it directly with c_syscall(SYS_PIVOT_ROOT) because there is no wrapper in libc. Don’t ask me why not. Shrugs).

This switches our current root filesystem to oldroot and makes the newroot directory (hardcoded to “rootfs”) our new root filesystem. We can now unmount oldroot which will unbind it effectively removing our access to the old filesystem entirely and we can then delete the now empty oldroot folder. For good measure we make sure everything works by switching to the new root.

User mappings

So, we’re done? We’ve got a rootfs, we have a chroot and we’ve unshared our namespaces: from here on things should work… right?

Not exactly. There’s one last wrinkle: user mappings. We’re running in a user namespace which means that none of the user information in the parent namespace applies anymore. So how can our process know what to do about file permissions and things? We have to tell it.

Linux provides a simple mechanism to let our process know what’s going on. All processes have an entry in the linux /proc virtual folder. So we can access properties for our process through the filesystem (a virtual filesystem - more lying - you’re not really accessing files you’re accessing kernel data structures. Lying is really useful).

We can find our entry by getting our current pid (note: again this is why the pid namespace can’t change our pid retroactively). We can do this:

... snip ...
let parent_pid = posix.getpid()
unshare(CLONE_NEWUSER or
        CLONE_NEWUTS or
        CLONE_NEWNS or
        CLONE_NEWIPC or
        CLONE_NEWPID or
        CLONE_NEWNET)
... snip ...

Now we can find where our process will be represented: it’ll be under the /proc/$parent_pid folder.

Btw: this is how the 'ps' command line program works. It scans over /proc. Now you know.
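As a rough sketch of that idea (not how ps is actually implemented, just the same trick in a few lines), we can collect every entry in /proc whose name is a number. The names isPidDir and visiblePids are made up for this example.

```nim
import os, strutils

# A /proc entry belongs to a process when its name is entirely digits.
proc isPidDir(name: string): bool =
  name.len > 0 and name.allCharsInSet(Digits)

# Collect the PIDs visible to us by scanning the top level of /proc.
proc visiblePids(): seq[int] =
  for kind, path in os.walkDir("/proc"):
    let name = path.extractFilename
    if kind == pcDir and isPidDir(name):
      result.add parseInt(name)

when isMainModule:
  echo visiblePids()
```

Which PIDs show up depends entirely on which /proc is mounted, which is exactly why the pid namespace matters.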

To set the mappings we simply need to write to three files: /proc/$parent_pid/uid_map, /proc/$parent_pid/gid_map and /proc/$parent_pid/setgroups.

For the last file we have to write the string “deny”. This is a security measure I’ll explain in later posts. The user mapping is straightforward and consists of three numbers separated by spaces. The first number is the starting id in the new namespace, the second is the starting id in the parent namespace and the third is how many ids to map. We are only allowed to map a single user since we’re not running as root. So, assuming our current user has id 1000, our map will be 0 1000 1 which means we’ll map root to our current user and have no other mappings. Again I’ll go into more detail in a subsequent post.
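Here is a small sketch of building and reading back one of those mapping lines, just to pin the format down. The helper names idMapLine and parseIdMapLine are my own, and the id 1000 is only an example value.

```nim
import strutils

# Build one line for uid_map or gid_map:
# <first id inside the namespace> <first id outside> <number of ids>
proc idMapLine(insideId, outsideId, count: int): string =
  $insideId & " " & $outsideId & " " & $count & "\n"

# Parse such a line back into its three numbers.
proc parseIdMapLine(line: string): tuple[inside, outside, count: int] =
  let fields = line.splitWhitespace()
  if fields.len != 3:
    raise newException(ValueError, "expected three fields: " & line)
  (inside: parseInt(fields[0]),
   outside: parseInt(fields[1]),
   count: parseInt(fields[2]))

when isMainModule:
  # Map id 0 inside the namespace to id 1000 outside, one id only.
  stdout.write idMapLine(0, 1000, 1)
```

The same line format is used for both uid_map and gid_map.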

We can implement this like so:

let parent_pid = posix.getpid()
let uid = posix.getuid()
let gid = posix.getgid()
unshare(CLONE_NEWUSER or
        CLONE_NEWUTS or
        CLONE_NEWNS or
        CLONE_NEWIPC or
        CLONE_NEWPID or
        CLONE_NEWNET)

writeFile("/proc/" & $(parent_pid) & "/setgroups", "deny")
# Writes: 0 1000 1 to gid_map
writeFile("/proc/" & $(parent_pid) & "/gid_map", "0 " & $(gid) & " 1\n")
# Writes: 0 1000 1 to uid_map
writeFile("/proc/" & $(parent_pid) & "/uid_map", "0 " & $(uid) & " 1\n")

And now we can do the chroot:

let pid = posix.fork()

if pid == 0:
  let mypid = posix.getpid()
  echo "I'm in a container! My PID is: " & $mypid

  fail_if(posix.setgid(0) < 0):
    "failed to become root"
  fail_if(posix.setuid(0) < 0):
    "failed to become root"

  chroot(absolutePath("rootfs"))
  for kind, path in os.walkDir("/"):
    echo path
else:
  var status: cint = 0
  fail_if(posix.waitpid(pid, status, 0) < 0):
    "Failed to wait for child"

We added calls to setuid and setgid to make our mappings kick in. By setting our uid and gid we switch to being the root user in the namespace. Since we’re now a pretend root user we’re allowed to access system calls like mount and pivot_root so we can now call chroot and it will work.

Of course we don’t just want to list files so let’s run a real proper container by executing a command:

... snip ...
chroot(absolutePath("rootfs"))
var args_array = allocCStringArray(["/bin/sh"])
var env_array = allocCStringArray(["PATH=/bin:/sbin:/usr/bin"])
fail_if(posix.execve("/bin/sh", args_array, env_array) < 0): "Unable to exec in container"
... snip ...

We now have a shell that’s, basically, running in Alpine Linux!

Let’s run a command:

/ # ps
PID   USER     TIME  COMMAND

Huh. Shouldn’t there be a process in there? Well - we need to mount /proc otherwise ps can’t scan it. This is easy to fix by mounting /proc in the chroot call:

... snip ...
fail_if(c_mount("proc", "proc", "proc", 0, nil) < 0): "Unable to mount /proc"
fail_if(c_umount2("oldroot", MNT_DETACH) < 0): "Unable to detach old root"
... snip ...

Depending on what you need you may have to mount a few other necessary folders. I’m not going to get into it here. With this you have a basic container running Alpine Linux.

I suspect you might need a stiff drink after all that.

Code (in Nim, Rust and Python) is on Gitlab.


  1. Well, this isn’t exactly true. Nothing’s perfect and there’s a surprising amount you can do to get around the OS. Read up about Spectre to see some of the crazy things that are possible.