How are Unix pipes implemented?

By <ams@toroid.org>

, updated

This article is about how pipes are implemented the Unix kernel. I was a little disappointed that a recent article titled “How do Unix pipes work?” was not about the internals, and curious enough to go digging in some old sources to try to answer the question.

What are we talking about?

Pipes are “perhaps the single most striking invention in Unix” — a defining characteristic of the Unix philosophy of composing small programs together, and a familiar sight in the Unix shell:

$ echo hello | wc -c
6

This functionality depends on the kernel providing a system call named pipe, which is described by the pipe(7) and pipe(2) manual pages:

Pipes provide a unidirectional interprocess communication channel. A pipe has a read end and a write end. Data written to the write end of a pipe can be read from the read end of the pipe.

A pipe is created using pipe(2), which returns two file descriptors, one referring to the read end of the pipe, the other referring to the write end.

Tracing the execution of the command above shows the creation of the pipe and the flow of data through it from one process to another:

$ strace -qf -e execve,pipe,dup2,read,write \
    sh -c 'echo hello | wc -c'
execve("/bin/sh", ["sh", "-c", "echo hello | wc -c"], …) pipe([3, 4]) = 0
[pid 2604795] dup2(4, 1) = 1 [pid 2604795] write(1, "hello\n", 6) = 6
[pid 2604796] dup2(3, 0) = 0 [pid 2604796] execve("/usr/bin/wc", ["wc", "-c"], …) [pid 2604796] read(0, "hello\n", 16384) = 6 [pid 2604796] write(1, "6\n", 2) = 2

The parent process calls pipe() to obtain connected fds, one child writes to one fd and another reads the same data from the other fd. (The shell uses dup2 to "rename" fds 3 and 4 to match stdin and stdout.)

Without pipes, the shell would have to write the output from one process to a file and arrange for the next process to read from it, at the expense of some overhead and some disk space — but pipes provide more than just the convenience of avoiding temporary files:

If a process attempts to read from an empty pipe, then read(2) will block until data is available. If a process attempts to write to a full pipe (see below), then write(2) blocks until sufficient data has been read from the pipe to allow the write to complete.

This important property, along with the POSIX requirement that writing up to PIPE_BUF (at least 512) bytes into a pipe must be atomic, allows processes to communicate with each other over a pipe in a way that regular files (which have no such guarantees) could not support.

With a regular file, a process could write all its output and hand it over to the next process, or they could operate in lock-step by using an external signalling mechanism (like a semaphore) to let each other know when they were done reading or writing. Pipes eliminate all this complexity.

What are we looking for?

A little hand-waving helps to imagine how pipes might work. You would need to allocate a memory buffer and some state, have functions to add and remove data from the buffer and some glue to invoke them through read and write operations on file descriptors, and sprinkle some locking on top to implement the special behaviour described above.

We are now equipped to interrogate the kernel source under bright lights to confirm or deny this vague mental model, but always prepared to run into any surprises along the way.

Where do we look?

I'm not sure where my (reprinted) copy of the famous “Lions book” with the source code of Unix 6ed is, but thanks to The Unix Heritage Society, we can start by looking at even older Unix sources online.

(Browsing the TUHS archives is like being in a museum — not only does it provide a rare glimpse at our shared history, it also makes me appreciate the years of effort that went into recovering all of these materials bit by bit from old tapes and printed documentation, and acutely conscious of the parts that are still lost.)

Once we've satisfied our curiosity about the ancient history of pipes, we can take a look at some modern kernels just for the contrast.

Incidentally, pipe is system call number 42 in the sysent[] table in Unix. Coincidence?

Traditional Unix kernels (1970–1974)

I can't find any trace of pipe(2) in PDP-7 Unix from January 1970, nor in First edition Unix from November 1971, nor in the incomplete sources of Second edition Unix from June 1972.

TUHS confirms that Third edition Unix from February 1973 was the first version to include pipes:

The third edition of Unix was the last version with a kernel still written in assembly code, but is the first version to include pipes. For much of 1973, the existing Third Edition was maintained and improved, while the kernel was rewritten in C to become the Fourth Edition of Unix.

Update: A reader on HN was kind enough to dig up a scanned copy of the 1964 memo in which Doug McIlroy proposed the idea of “coupling programs like garden hose”.

McIlroy pipe memo

Brian Kernighan's book, “Unix: A History and a Memoir”, also talks about the history of pipes and mentions this memo “…that hung on the wall in my office at Bell Labs for 30 years”. Here's an interview with McIlroy, and some more history from a 2014 paper by McIlroy:

When Unix came to be, my fascination with coroutines led me to badger its author, Ken Thompson, to allow writes in one process to go not only to devices but also to matching reads in another process. Ken saw that was possible. As a minimalist, though, he wanted every system feature to carry significant weight. Did direct writes between processes offer a really major advantage over writing to a temporary file in one process and then reading it in the other? Not until I made a specific proposal with a catchyname, ‘‘pipe’’, and shell syntax to connect processes via pipes, did Ken finally exclaim, ‘‘I’ll do it!’’

And he did do it. In one feverish evening Ken modified both kernel and shell, and fixed several standard programs so they could take input from standard input (potentially piped) as well as named files. The next day brought a heady explosion of applications. By the end of the week, department secretaries were using pipes to send text-processor output to the printer spooler. Not long after, Ken replaced the original API and shell syntax for pipes with cleaner conventions that have been used ever since.

Unfortunately, the source code for the 3E kernel is no longer available, and although we do have the C source code for a version of the Fourth edition Unix kernel from November 1973, it predates the official release by a few months, and is missing the pipe implementation. It is sad that the origin of such an iconic Unix feature is lost to us, perhaps forever.

We do have the pipe(2) manpages from both releases, so we can start by looking at the documentation from 3E (with certain words underlined “by hand” with a string of literal ^H backspaces followed by underscores!). This proto-pipe(2) written in assembly returned only one file descriptor, but provided the basic functionality that we expect:

The pipe system call creates an I/O mechanism called a pipe. The file descriptor returned can be used in both read and write operations. When the pipe is written, the data is buffered up to 504 bytes at which time the writing process is suspended. A read on the pipe will pick up the buffered data.

By next year, along with a rewrite of the kernel into C, the 4E pipe(2) had assumed its modern form, with a prototype of "pipe(fildes)":

The pipe system call creates an I/O mechanism called a pipe. The file descriptors returned can be used in read and write operations. When the pipe is written using the descriptor returned in r1 (resp. fildes[1]), up to 4096 bytes of data are buffered before the writing process is suspended. A read using the descriptor returned in r0 (resp. fildes[0]) will pick up the data.

It is assumed that after the pipe has been set up, two (or more) cooperating processes (created by subsequent fork calls) will pass data through the pipe with read and write calls.

The shell has a syntax to set up a linear array of processes connected by pipes.

Read calls on an empty pipe (no buffered data) with only one end (all write file descriptors closed) return an end-of-file. Write calls under similar conditions are ignored.

So the earliest surviving pipe implementation is from Fifth Edition Unix in June 1974, but it's nearly identical to what was in the next release, which also added some comments, so we might as well skip 5E too.

Sixth Edition Unix (1975)

As so many others have done in the past, Sixth Edition Unix from May 1975 is where we start reading the Unix code. Sixth Edition sources are much more widely available than earlier versions, thanks largely to the Lions book:

For many years, the Lions book was the only Unix kernel documentation available outside Bell Labs. Although the license of 6th Edition allowed classroom use of the source code, the license of 7th Edition specifically excluded such use, so the book spread through illegal copy machine reproductions.

You can still buy a reprint of the book, featuring cover art with shifty-looking students at a photocopier, but thanks to Warren Toomey (who went on to start TUHS), you can also download a PDF of the 6E source. I can't resist a quick glimpse at the effort that went into making this PDF available:

Over 15 years ago I had produced a replica of the Lions' source code listing since I was unhappy with the quality of my n-th generation copy. There was no TUHS and I had no access to any old source code. But in 1988 I discovered an old 9-track tape being discarded of a PDP11 backup. It was hard to determine what it was running, but it did have an intact /usr/src/ tree of which most of the files were timesamped 1979, even at that time it seemed ancient. So it was either 7th edition or a derivative like PWB, which I believe it was.

I used this as a basis and hand edited the source back into 6th edition form. Some code was completely the same, some required the easy edit of changing the modern += token into the archaic =+. Others needed to just remove casts, while some had to be completely retyped, but not that much.

Now, decades later, we can just read the Sixth Edition source code online at TUHS, from an archive contributed by Dennis Ritchie.

Aside: at first glance, what is most striking about this pre-K&R C code is how narrow it is. It's not often that I can include code listings without extensive editing to fit the relatively narrow body of my web site.

There's an instructive comment near the top of /usr/sys/ken/pipe.c (and yes, there's also a /usr/sys/dmr):

/*
 * Max allowable buffering per pipe.
 * This is also the max size of the
 * file created to implement the pipe.
 * If this size is bigger than 4096,
 * pipes will be implemented in LARG
 * files, which is probably not good.
 */
#define	PIPSIZ	4096

The buffer size has not changed from 4E days, but here we can already see beyond the documented public interface — ancient pipes used a file to provide the backing storage!

As for LARG files, they correspond to the ILARG inode flag used by the "large addressing algorithm" to handle indirect blocks to support larger filesystems. If Ken says it would probably not be good to use them, I'm happy to take his word for it.

Here's the actual pipe system call:

/*
 * The sys-pipe entry.
 * Allocate an inode on the root device.
 * Allocate 2 file structures.
 * Put it all together with flags.
 */
pipe()
{
    register *ip, *rf, *wf;
    int r;

    ip = ialloc(rootdev);
    if(ip == NULL)
        return;
    rf = falloc();
    if(rf == NULL) {
        iput(ip);
        return;
    }
    r = u.u_ar0[R0];
    wf = falloc();
    if(wf == NULL) {
        rf->f_count = 0;
        u.u_ofile[r] = NULL;
        iput(ip);
        return;
    }
    u.u_ar0[R1] = u.u_ar0[R0]; /* wf's fd */
    u.u_ar0[R0] = r;           /* rf's fd */
    wf->f_flag = FWRITE|FPIPE;
    wf->f_inode = ip;
    rf->f_flag = FREAD|FPIPE;
    rf->f_inode = ip;
    ip->i_count = 2;
    ip->i_flag = IACC|IUPD;
    ip->i_mode = IALLOC;
}

The comment is quite clear about what's going on, but the code takes a bit of effort to follow, partly because of how the system call parameters and return values are passed in and out through the per-process “struct user u” and the registers R0 and R1.

We try to allocate an inode on disk using ialloc() and two files in memory using falloc(). If all goes well, we set flags to identify the files as two ends of a pipe, point them at the same inode (whose reference count becomes 2), and mark the inode as modified and in-use. (Note the calls to iput() in the error paths, to decrement the reference count of the new inode.)

pipe() must return the read and write file descriptor numbers via R0/R1. falloc() returns a pointer to a file structure, but also "returns" the fd via u.u_ar0[R0]. So the code saves the read fd in r and assigns the write fd directly from u.u_ar0[R0] after the second call to falloc().

The FPIPE flag that we set during pipe creation controls the behaviour of the rdwr() function in sys2.c, which invokes the custom pipe I/O routines for pipes:

/*
 * common code for read and write calls:
 * check permissions, set base, count, and offset,
 * and switch out to readi, writei, or pipe code.
 */
rdwr(mode)
{
	register *fp, m;

	m = mode;
	fp = getf(u.u_ar0[R0]);
        /* … */

	if(fp->f_flag&FPIPE) {
		if(m==FREAD)
			readp(fp); else
			writep(fp);
	}
        /* … */
}

Next in pipe.c is the readp() function to read from a pipe, but it's easier to follow the implementation by looking at writep() first. Once again, the oddities of the argument passing convention add some complexity, but we can gloss over some of the details.

writep(fp)
{
    register *rp, *ip, c;

    rp = fp;
    ip = rp->f_inode;
    c = u.u_count;

loop:
    /* If all done, return. */

    plock(ip);
    if(c == 0) {
        prele(ip);
        u.u_count = 0;
        return;
    }

    /*
     * If there are not both read and write sides of the
     * pipe active, return error and signal too.
     */

    if(ip->i_count < 2) {
        prele(ip);
        u.u_error = EPIPE;
        psignal(u.u_procp, SIGPIPE);
        return;
    }

    /*
     * If the pipe is full, wait for reads to deplete
     * and truncate it.
     */

    if(ip->i_size1 == PIPSIZ) {
        ip->i_mode =| IWRITE;
        prele(ip);
        sleep(ip+1, PPIPE);
        goto loop;
    }

    /* Write what is possible and loop back. */

    u.u_offset[0] = 0;
    u.u_offset[1] = ip->i_size1;
    u.u_count = min(c, PIPSIZ-u.u_offset[1]);
    c =- u.u_count;
    writei(ip);
    prele(ip);
    if(ip->i_mode&IREAD) {
        ip->i_mode =& ~IREAD;
        wakeup(ip+2);
    }
    goto loop;
}

On entry, we want to write u.u_count bytes into the pipe. First we acquire a lock on the inode (see below for more about plock/prele).

Next, we check the inode's reference count. So long as the two ends of the pipe remain open, the reference count must remain 2. We hold one reference (from rp->f_inode), so if the count is less than 2, it must mean that the reader closed their end. In other words, we're trying to write into a closed pipe, which is an error. 6E Unix is when the EPIPE error code and SIGPIPE signal were first introduced.

Even if the pipe is still open, it may be full. In that case, we release the lock and go to sleep hoping that another process will create space in the pipe by reading from it. When we are woken up, we loop back to the start and try to acquire the lock and start another write cycle.

If the pipe does have free space, we use writei() to write out some data. The inode's i_size1 (which may be 0 for an empty pipe) points to the end of whatever data it already contains. If we have enough data to write, we can fill the pipe from i_size1 up to PIPESIZ. Then we release the lock and try to wake up any process that is waiting to read from the pipe. Finally, we loop back to the start to see if we managed to write out as many bytes as we wanted. If not, we start another write cycle.

The inode's i_mode is normally used to hold r/w/x permissions, but for pipes the IREAD and IWRITE bits are used to indicate that some process is waiting to read from or write to the pipe, respectively. The process sets the flag and calls sleep(), expecting a matching wakeup() to be called by some other process in future.

The real magic happens in sleep() and wakeup(), which are implemented in slp.c, the source of the infamous “You are not expected to understand this” comment. Fortunately, we don't have to understand the code, just take a quick look at some other comments:

/*
 * Give up the processor till a wakeup occurs
 * on chan, at which time the process
 * enters the scheduling queue at priority pri.
 * The most important effect of pri is that when
 * pri<0 a signal cannot disturb the sleep;
 * if pri>=0 signals will be processed.
 * Callers of this routine must be prepared for
 * premature return, and check that the reason for
 * sleeping has gone away.
 */
sleep(chan, pri) /* … */

/*
 * Wake up all processes sleeping on chan.
 */
wakeup(chan) /* … */

A process that calls sleep() for a particular channel can be woken up later by another process calling wakeup() for the same channel. writep() and readp() coordinate their operations through matching pairs of these calls. (Note that pipe.c always uses PPIPE as the priority when calling sleep, so the sleeps are interruptible by signals.)

Now we have everything we need to understand the readp() function:

readp(fp)
int *fp;
{
    register *rp, *ip;

    rp = fp;
    ip = rp->f_inode;

loop:
    /* Very conservative locking. */

    plock(ip);

    /*
     * If the head (read) has caught up with
     * the tail (write), reset both to 0.
     */

    if(rp->f_offset[1] == ip->i_size1) {
        if(rp->f_offset[1] != 0) {
            rp->f_offset[1] = 0;
            ip->i_size1 = 0;
            if(ip->i_mode&IWRITE) {
                ip->i_mode =& ~IWRITE;
                wakeup(ip+1);
            }
        }

        /*
         * If there are not both reader and
         * writer active, return without
         * satisfying read.
         */

        prele(ip);
        if(ip->i_count < 2)
            return;
        ip->i_mode =| IREAD;
        sleep(ip+2, PPIPE);
        goto loop;
    }

    /* Read and return */

    u.u_offset[0] = 0;
    u.u_offset[1] = rp->f_offset[1];
    readi(ip);
    rp->f_offset[1] = u.u_offset[1];
    prele(ip);
}

It may help to read this function from the bottom upwards. The "read and return" branch is the common case, when there is some data in the pipe. If so, we use readi() to read as much data as available from the current read f_offset onwards, and update the offset accordingly.

On a subsequent read, the pipe is empty if the read offset has reached the inode's i_size1. We reset the positions to 0 and try to wake up any process that wants to write into the pipe. We know that writep() goes to sleep on on ip+1 when the pipe is full. Now that the pipe is empty, we can wake it up and allow it to resume its write loop.

If there's nothing to read, readp() can set the IREAD flag and go to sleep on ip+2. We know that writep() will wake it up when it manages to write some data into the pipe.

It's worth a quick look at the comments for readi() and writei() to realise that, other than having to pass in the parameters through “u”, we can treat them as ordinary IO functions that take a file, a position, a memory buffer, and a count of bytes to read or write.

/*
 * Read the file corresponding to
 * the inode pointed at by the argument.
 * The actual read arguments are found
 * in the variables:
 *	u_base		core address for destination
 *	u_offset	byte offset in file
 *	u_count		number of bytes to read
 *	u_segflg	read to kernel/user
 */
readi(aip)
struct inode *aip;
/* … */

/*
 * Write the file corresponding to
 * the inode pointed at by the argument.
 * The actual write arguments are found
 * in the variables:
 *	u_base		core address for source
 *	u_offset	byte offset in file
 *	u_count		number of bytes to write
 *	u_segflg	write to kernel/user
 */
writei(aip)
struct inode *aip;
/* … */

As for the "conservative" locking, readp() and writep() hold a lock on the inode until they finish or yield (i.e., call wakeup). plock() and prele() are straightforward, with another matched set of sleep and wakeup calls to enable us to wakeup any process that wants a lock that we have just released:

/*
 * Lock a pipe.
 * If its already locked, set the WANT bit and sleep.
 */
plock(ip)
int *ip;
{
    register *rp;

    rp = ip;
    while(rp->i_flag&ILOCK) {
        rp->i_flag =| IWANT;
        sleep(rp, PPIPE);
    }
    rp->i_flag =| ILOCK;
}

/*
 * Unlock a pipe.
 * If WANT bit is on, wakeup.
 * This routine is also used to unlock inodes in general.
 */
prele(ip)
int *ip;
{
    register *rp;

    rp = ip;
    rp->i_flag =& ~ILOCK;
    if(rp->i_flag&IWANT) {
        rp->i_flag =& ~IWANT;
        wakeup(rp);
    }
}

I could not understand at first why readp() did not call prele(ip) before calling wakeup(ip+1). The first thing writep() does in its loop is plock(ip), which would deadlock if readp() still held its lock, so the code must have the right effect somehow. A quick glance at wakeup() showed that it only marks sleeping processes as runnable, and leaves it to a future sched() to actually start the process. So readp() calls wakeup(), releases its lock, sets IREAD, and calls sleep(ip+2) all before writep() resumes its loop.

That's all there is to pipes in 6E. Simple code, far-reaching consequences.

Seventh Edition Unix from January 1979 was a major new release (after a gap of four years) with many new applications and kernel features. It also featured widespread changes due to the use of K&R C with typecasts and unions and typed pointers to structs, but the pipe code hardly changed at all. We can skip it without missing anything exciting.

Xv6, a simple Unix-like kernel

Xv6 is a kernel inspired by Sixth Edition Unix, but written in modern C to run on x86 processors. It's simple to read and understand but, unlike the Unix sources from TUHS, you can also compile and modify and run it on something other than a PDP 11/70. For this reason, it is widely used by universities in Operation Systems courses. The source code is available on Github.

It has a clear and concise pipe.c implementation backed by a buffer in memory instead of an inode on disk. I reproduce only the definition of “struct pipe” and the pipealloc() function here:

#define PIPESIZE 512

struct pipe {
  struct spinlock lock;
  char data[PIPESIZE];
  uint nread;     // number of bytes read
  uint nwrite;    // number of bytes written
  int readopen;   // read fd is still open
  int writeopen;  // write fd is still open
};

int
pipealloc(struct file **f0, struct file **f1)
{
  struct pipe *p;

  p = 0;
  *f0 = *f1 = 0;
  if((*f0 = filealloc()) == 0 || (*f1 = filealloc()) == 0)
    goto bad;
  if((p = (struct pipe*)kalloc()) == 0)
    goto bad;
  p->readopen = 1;
  p->writeopen = 1;
  p->nwrite = 0;
  p->nread = 0;
  initlock(&p->lock, "pipe");
  (*f0)->type = FD_PIPE;
  (*f0)->readable = 1;
  (*f0)->writable = 0;
  (*f0)->pipe = p;
  (*f1)->type = FD_PIPE;
  (*f1)->readable = 0;
  (*f1)->writable = 1;
  (*f1)->pipe = p;
  return 0;

 bad:
  if(p)
    kfree((char*)p);
  if(*f0)
    fileclose(*f0);
  if(*f1)
    fileclose(*f1);
  return -1;
}

pipealloc() sets the stage for the rest of the implementation, comprising piperead(), pipewrite(), and pipeclose() functions. The actual system call, sys_pipe, is a wrapper implemented in sysfile.c. I recommend reading the code in its entirety. It's about the same level of complexity as the 6E Unix source code, but much easier to go through, therefore more satisfying.

Linux 0.01

It's possible to find the Linux 0.01 source code, and it's instructive to look at its pipe implementation in fs/pipe.c too. It uses an inode to represent the pipe, but it's written in modern C. Having worked our way through the 6E code, it's not difficult to follow. Here's the write_pipe() function:

int write_pipe(struct m_inode * inode, char * buf, int count)
{
    char * b=buf;

    wake_up(&inode->i_wait);
    if (inode->i_count != 2) { /* no readers */
        current->signal |= (1<<(SIGPIPE-1));
        return -1;
    }
    while (count-->0) {
        while (PIPE_FULL(*inode)) {
            wake_up(&inode->i_wait);
            if (inode->i_count != 2) {
                current->signal |= (1<<(SIGPIPE-1));
                return b-buf;
            }
            sleep_on(&inode->i_wait);
        }
        ((char *)inode->i_size)[PIPE_HEAD(*inode)] =
            get_fs_byte(b++);
        INC_PIPE( PIPE_HEAD(*inode) );
        wake_up(&inode->i_wait);
    }
    wake_up(&inode->i_wait);
    return b-buf;
}

Even without looking up the structure definitions, we can understand the use of the inode reference count to check if a write should raise SIGPIPE. Other than operating byte-by-byte, it's easy to map this function to the ideas we saw earlier. Even the sleep_on/wake_up logic doesn't seem entirely foreign.

Modern kernels: Linux, FreeBSD, NetBSD, OpenBSD

I had a quick look at some modern kernels, and none of them uses a disk-backed implementation any more (not surprisingly). Linux has its own implementation, of course, but although the three modern BSD kernels all have implementations derived from code written by John S. Dyson, they too have diverged from each other over the years.

Reading through fs/pipe.c (on Linux) or sys/kern/sys_pipe.c (on *BSD) now takes real commitment. The code now needs to be concerned with performance and support for features like vectored and asynchronous IO, and the details of memory allocation, locking, and kernel configuration all vary considerably. Not at all the sort of thing a university would want for an introductory course.

Nevertheless, I took some satisfaction in being able to recognise a few ancestral patterns (raising SIGPIPE and returning EPIPE when writing to a closed pipe, for example) in each of these very different modern kernels. I may never see a PDP-11 computer, but there are still useful lessons to learn from code that predates my own existence by a few years.

The Linux Kernel Implementation of Pipes and FIFOs” by Divye Kapoor in 2011 is an overview of how pipefs (still) works in Linux. This recent Linux commit illustrates how pipes support a model of interaction that goes beyond what temporary files allow, as well as how much further pipes have evolved from the "very conservative locking" in the 6E Unix kernel.