The Unix File System

Only One Kind of File: Byte Streams

A file is a stream (a linear array) of 8-bit bytes.

Any byte or contiguous sequence of bytes may be read or written, regardless of any underlying block, track or cylinder structure in the hardware. A single read or write can access any number of bytes: there is no maximum size, and it needn't correspond to the underlying disk organization.

A file opened by a process is identified by a file descriptor which is private to that process. Associated with each file descriptor is a file pointer which records the current location in the file. A read will read bytes starting at the file pointer, and will increment the file pointer by the number of bytes read. A write will write bytes to the location specified by the file pointer and likewise increment it. The file pointer can be adjusted to any location (even beyond the end of the file) by a seek operation.
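As a minimal sketch of the file pointer at work, assuming Python's os module as a thin wrapper over the underlying system calls (the scratch file is hypothetical and created only for the illustration):

```python
import os
import tempfile

# Hypothetical scratch file; any writable path would do.
fd, path = tempfile.mkstemp()

os.write(fd, b"hello, world")          # file pointer advances to offset 12
os.lseek(fd, 7, os.SEEK_SET)           # seek: move the file pointer to offset 7
chunk = os.read(fd, 3)                 # read 3 bytes starting at the pointer
offset = os.lseek(fd, 0, os.SEEK_CUR)  # ask where the pointer is now

os.close(fd)
os.remove(path)
```

Here chunk holds the three bytes starting at offset 7, and offset reports that the read advanced the pointer by the number of bytes read.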

Text files and binary files are the same. Whether or not a program can handle binary files just depends on that particular program. All files are stored the same way (as a stream of bytes): text files, database files, executable programs, source code, etc.

No Record Orientation That You Don't Impose Yourself

Unix files have no record orientation. A program can impose any record orientation it likes: to work with 80-column records, simply read and write 80-byte chunks. However, very few Unix programs use fixed-length records, because they're so inflexible, and it's nearly as easy to deal with variable length lines in text files.
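A program that wants fixed-length records might look something like the following sketch. The function name read_records and the in-memory stream are assumptions made for the example; the point is only that the record size lives in the program, not in the file system.

```python
import io

RECORD_SIZE = 80  # chosen by the program, unknown to the file system


def read_records(stream, size=RECORD_SIZE):
    """Yield successive chunks of at most `size` bytes from a byte stream."""
    while True:
        record = stream.read(size)
        if not record:      # empty result: end of file
            break
        yield record


# Stand-in for a file holding two 80-byte records.
data = io.BytesIO(b"A" * 80 + b"B" * 80)
records = list(read_records(data))
```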

Text Files

A text file is simply a byte stream in which a particular ASCII value, the newline or linefeed character (octal 012; decimal 10), is used to separate lines. Note that, conventionally, Unix does not use carriage return / linefeed pairs to separate lines. Just a single newline is used, which makes programming much simpler. Devices that require the concept of a carriage return (terminals, printers) simply map newline to whatever is required.

End of File

There is no end of file character in a Unix file. The file system records the length of every file and simply reports an end of file state when end of file is reached.
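A small sketch of the end of file state, again assuming Python's os module and a hypothetical scratch file: the program sees end of file as a read that returns no bytes, and the length comes from the file system rather than from any marker in the data.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()          # hypothetical scratch file
os.write(fd, b"abc")
os.lseek(fd, 0, os.SEEK_SET)           # rewind to the start

first = os.read(fd, 100)               # only 3 bytes exist, so 3 come back
second = os.read(fd, 100)              # pointer is at the end: empty result
size = os.fstat(fd).st_size            # length as recorded by the file system

os.close(fd)
os.remove(path)
```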

Directories and Filenames

Files don't have names, they have I-numbers. I-numbers index an array of inodes stored in each file system. Inodes contain useful information about files:

Type of file
- Plain file
- Directory
- Character-special
- Block-special
- Named pipe or fifo
- Symbolic link
- Socket
Link count
UID
GID
Owner permissions
Group permissions
Other permissions
Size in bytes
Last access time
Last modification time
Last inode change time
Data block pointers
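Most of these inode fields can be inspected from a program. The following is a sketch using Python's os.stat, which reports them for a hypothetical scratch file; the dictionary keys are names chosen for this example only.

```python
import os
import stat
import tempfile

fd, path = tempfile.mkstemp()          # hypothetical scratch file
os.write(fd, b"12345")
os.close(fd)

info = os.stat(path)
is_plain = stat.S_ISREG(info.st_mode)  # file type is encoded in the mode
fields = {
    "inumber": info.st_ino,
    "links": info.st_nlink,
    "uid": info.st_uid,
    "gid": info.st_gid,
    "permissions": stat.S_IMODE(info.st_mode),
    "size": info.st_size,
    "accessed": info.st_atime,
    "modified": info.st_mtime,
    "inode_changed": info.st_ctime,
}

os.remove(path)
```

Note that no filename appears among the fields: the name lives in the directory, not in the inode.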

A directory is a mapping from filenames to I-numbers. Directories are stored in ordinary byte-stream files, but they are marked as directories in the inode so that only the kernel can write to them (any process can read a directory).

Since directories are ordinary files, they have inodes and hence I-numbers. This means a directory can contain the I-number of another directory. This provides a graph-structured file system.

Since general cycles in the file system would make file system traversal difficult, the kernel disallows them. However, the kernel actually maintains two restricted types of cycles in each directory:

.: A link to the current directory
..: A link to the parent directory

The kernel ensures that every directory contains two entries, one mapping the filename . to the directory's own I-number, and one mapping the filename .. to the directory's parent's I-number. This means that every directory can be reached by traversing other directories in the file system; in other words, the graph of the file system is connected.
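The two entries can be checked by comparing I-numbers. This sketch assumes Python's os module and builds a hypothetical parent and child directory just for the comparison.

```python
import os
import tempfile

parent = os.path.realpath(tempfile.mkdtemp())   # hypothetical scratch directory
child = os.path.join(parent, "child")
os.mkdir(child)

dot_ino = os.stat(os.path.join(child, ".")).st_ino
dotdot_ino = os.stat(os.path.join(child, "..")).st_ino

same_self = dot_ino == os.stat(child).st_ino        # . names the directory itself
same_parent = dotdot_ino == os.stat(parent).st_ino  # .. names its parent

os.rmdir(child)
os.rmdir(parent)
```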

There is a distinguished directory called the root directory; all other directories are descendants of this directory. The name of the root directory is /.

Pathnames

Users usually identify files not by I-number, but rather by pathname. A pathname is conceptually a sequence of one or more directory names, always starting with the root directory, and ending (optionally) with the name of a non-directory. (If the pathname doesn't end with a non-directory, it names a directory rather than a plain file.) These directory names and the final optional file name are components of the pathname. The pathname specifies a path to the file of interest from the root, naming explicitly every directory that must be traversed to reach it.

Pathnames are spelled with slashes (/) separating the path components. Since the name of the root directory is /, a leading slash is taken to name the root directory, and all the other slashes are separators.

To simplify the use of pathnames, there is an alternate form called relative pathnames (the form already described is called absolute pathnames). A relative pathname is any pathname which does not start with a slash. Since it doesn't start with a slash, it doesn't (necessarily) start at the root: how do we decide where it starts? Simple: a relative pathname starts at the current working directory.

Current Working Directory

Every process has a current working directory (CWD), which is used to resolve relative pathnames. The CWD of a process is inherited by its child processes, but any process can change its CWD at will. The login program sets the CWD of the shell to the user's home directory. Other than this behavior, and the fact that a user's home directory is arranged to be owned by and writable by that user, there is nothing special about the home directory, though certain programs may assume that they can find certain files (typically configuration files) there.
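A sketch of relative pathname resolution, assuming Python's os module; the directory and the filename notes.txt are hypothetical. The same relative name refers to a different file whenever the process changes its CWD.

```python
import os
import tempfile

workdir = os.path.realpath(tempfile.mkdtemp())  # hypothetical scratch directory
saved = os.getcwd()

os.chdir(workdir)                       # change this process's CWD
with open("notes.txt", "w") as f:       # relative pathname, resolved via the CWD
    f.write("hi")

resolved = os.path.abspath("notes.txt")
expected = os.path.join(workdir, "notes.txt")

os.remove("notes.txt")
os.chdir(saved)                         # restore the original CWD
os.rmdir(workdir)
```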

Hard Links

If the presence of an I-number in a directory names that file, what's to prevent an I-number from appearing in that directory multiple times, or, from appearing in other directories? Nothing: Unix files can have many names. Each directory entry is called a link, and so a file can have many links.
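To illustrate, here is a sketch that gives one file two names, assuming Python's os.link as a wrapper over the link system call; the names first and second are hypothetical.

```python
import os
import tempfile

d = tempfile.mkdtemp()                  # hypothetical scratch directory
first = os.path.join(d, "first")
second = os.path.join(d, "second")

with open(first, "w") as f:
    f.write("data")

os.link(first, second)                  # add a second directory entry

same_inode = os.stat(first).st_ino == os.stat(second).st_ino
link_count = os.stat(first).st_nlink    # both names refer to one inode

os.remove(first)
os.remove(second)
os.rmdir(d)
```

Neither name is the "real" one; both are simply links to the same I-number.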

Special File Descriptors

Every process already has three files open when it is created. Hence, it has three file descriptors in use. These open files are called standard input (stdin), standard output (stdout), and standard error (stderr). This allows a style of program which functions as a simple filter, reading bytes from standard input, processing them, and writing the result to standard output. Standard error is used to report errors, so that they aren't mixed up with the output.

No program is required to be written in accordance with this model: a process is free to ignore these open files and process named files instead. But many Unix programs are written as filters and this allows them to be connected in a pipeline, standard output to standard input, to build complex applications out of a simple chain of filters.
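A filter in this style might be sketched as follows. The function upcase_filter is a hypothetical example, and in-memory streams stand in for the three standard files so the sketch can be run without a terminal; a real filter would pass sys.stdin, sys.stdout and sys.stderr instead.

```python
import io
import sys


def upcase_filter(infile, outfile, errfile=sys.stderr):
    """Copy infile to outfile, upper-casing each line."""
    count = 0
    for line in infile:
        outfile.write(line.upper())
        count += 1
    # Diagnostics go to the error stream so they never mix with the output.
    print(f"{count} lines processed", file=errfile)


# Stand-ins for standard input, standard output and standard error.
source = io.StringIO("one\ntwo\n")
sink = io.StringIO()
errors = io.StringIO()

upcase_filter(source, sink, errors)
result = sink.getvalue()
diagnostics = errors.getvalue()
```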

Files as Abstractions

A file is an abstraction through which we can access data on disk. The kernel requires us to use a small number of system calls to access files; these system calls define the file abstraction. The basic operations on files are:

Create
Remove
Open
Close
Read
Write

Creating A File

A file is created by creating its inode. This can only be done via the kernel (through a system call). The kernel insists that a pathname (absolute or relative) be specified when creating an inode, so creating an inode also creates a link in a directory. This means that every file has at least one name.

Removing A File

There is no way to directly remove an inode. The kernel provides a system call to remove a link from a given directory. This does not necessarily remove the inode or free the data blocks associated with the file, because the file may have more than one name and thus there may still be other links to the file in other directories.

For this reason the kernel maintains a link count in the inode, which serves as a reference count for the inode and data blocks. Whenever a file is unlinked, its link count is decremented. When the link count goes to zero, the inode and data blocks are freed.
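The reference counting can be observed directly. In this sketch (Python's os module; the names a and b are hypothetical) removing one link leaves the data reachable through the other.

```python
import os
import tempfile

d = tempfile.mkdtemp()                  # hypothetical scratch directory
a = os.path.join(d, "a")
b = os.path.join(d, "b")

with open(a, "w") as f:
    f.write("still here")
os.link(a, b)

before = os.stat(a).st_nlink            # two names, one inode
os.unlink(a)                            # remove one link
after = os.stat(b).st_nlink             # link count was decremented

with open(b) as f:                      # data is still reachable via the other name
    contents = f.read()

os.unlink(b)                            # last link removed
os.rmdir(d)
```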

This is another reason why general cycles are not allowed in the file system: a cycle would throw off the link count and result in unused structures that couldn't be freed.

Since directories contain the all-important links, they can't simply be removed (unlinked) like plain files. A special system call must be used to remove a directory, and it will succeed only if the only links in the directory are . and .. (that is, if the directory is otherwise empty).
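A sketch of this restriction, assuming Python's os.rmdir as a wrapper over the directory-removal system call; the directory and file are hypothetical.

```python
import os
import tempfile

d = tempfile.mkdtemp()                  # hypothetical scratch directory
inner = os.path.join(d, "file")
open(inner, "w").close()                # the directory now holds a third link

try:
    os.rmdir(d)                         # refused: directory is not empty
    refused = False
except OSError:
    refused = True

os.remove(inner)                        # only . and .. remain
os.rmdir(d)                             # now the removal succeeds
gone = not os.path.exists(d)
```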

Opening a File

A file is opened by specifying a pathname and a mode. The mode can be any of:

read only
write only
read / write

An optional parameter allows the file to be created if it doesn't exist. In addition, one can specify truncation on open, synchronous writes, etc. The open system call returns a file descriptor.
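The following sketch shows the mode and the optional flags, assuming Python's os.open, which passes them straight through to the open system call. The paths and the 0o644 permission value are choices made for this example.

```python
import os
import tempfile

d = tempfile.mkdtemp()                  # hypothetical scratch directory
path = os.path.join(d, "log")

# Write only, creating the file if it doesn't exist.
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
os.write(fd, b"old contents")
os.close(fd)

# Read / write, truncating the existing contents on open.
fd = os.open(path, os.O_RDWR | os.O_TRUNC)
size_after_trunc = os.fstat(fd).st_size
os.close(fd)

# Without the create flag, opening a missing file fails.
try:
    os.open(os.path.join(d, "missing"), os.O_RDONLY)
    failed = False
except FileNotFoundError:
    failed = True

os.remove(path)
os.rmdir(d)
```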

Closing a File

All file descriptors are automatically closed when a process exits, but the close system call can also be called explicitly. It takes a file descriptor as a parameter. Closing a file frees locks, shuts down network connections, etc., as appropriate.

Reading a File

The read system call takes a file descriptor, a pointer to a buffer, and a count of the number of bytes to read. Any number of bytes can be requested with one read. If you are reading from a pipe, fifo, or socket, the read will (by default) block if no bytes are available yet; once some bytes are available, it may return fewer than you requested. The return value is the number of bytes actually read.
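A minimal sketch of read on a pipe, assuming Python's os.pipe and os.read. Three bytes are written and 100 are requested; on POSIX systems the read returns the 3 available bytes rather than waiting for more, and once every writer has closed the pipe a further read reports end of file.

```python
import os

r, w = os.pipe()                        # r: read end, w: write end

os.write(w, b"abc")
data = os.read(r, 100)                  # returns what is available right now

os.close(w)                             # no writers remain
eof = os.read(r, 100)                   # end of file: empty result
os.close(r)
```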

Writing a File

The write system call takes the same arguments as the read system call, but assumes the buffer contains data to be written and attempts to write as many bytes as requested, returning the number of bytes actually written.

Access Permissions

Every inode has a set of access permissions associated with it. Unix access permissions are very simple; the kernel doesn't support anything like access control lists (ACLs). This simplicity makes it easy to program and understand file permissions, and easy to predict the ramifications of combining different levels of access with different owners -- something which is difficult on more complex systems. ACLs, when needed, are implemented with user mode code.

There are three levels of access permissions:

user
group
other

These are combined with three possible actions:

read
write
execute

These combine to make 512 possible combinations, which is already a lot: imagine the complexity of systems that use more elaborate access controls.

User permissions specify what the owner of the inode can do to the file, group permissions specify what anyone in the group of the inode can do, and other permissions cover everybody else.

Read access allows you to read the contents of the file's data blocks, write access allows you to modify the contents of those data blocks, and execute permission allows you to execute the contents of a file as a program. (It's possible to be able to execute a file without being able to read it.)

Some operating systems also have the notion of file create and file delete access: in Unix, this is simply controlled by permissions on the containing directory. To do either operation requires write access to the containing directory (because creating or removing a link really does require write access to the data blocks of the directory file).

Similarly, Unix doesn't need list directory access, because this is controlled by read permission on the directory itself. However, execute permission doesn't make sense for a directory (which can never contain executable code), so this bit is overloaded with another meaning: it controls the ability to use a directory as a component of a pathname.

These Unix file permissions (sometimes known as the file mode) are stored in the inode as nine bits:

UUUGGGOOO
rwxrwxrwx

a read, a write and an execute bit for each of user, group and other. The ls program (when invoked with the -l option) actually displays the permissions as rwxrwxrwx with hyphens replacing the bits that are off. The permission bits are also often expressed as three octal digits, since it's so easy to compute the bits that way. (Honest!)
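The octal arithmetic can be sketched with Python's stat module. The mode 0o754 is an arbitrary example: each octal digit holds one rwx triple, with read worth 4, write worth 2 and execute worth 1.

```python
import stat

mode = 0o754                            # user rwx (7), group r-x (5), other r-- (4)

# stat.filemode renders the bits the way ls -l does; drop the file-type character.
text = stat.filemode(stat.S_IFREG | mode)[1:]

owner_can_write = bool(mode & stat.S_IWUSR)
other_can_write = bool(mode & stat.S_IWOTH)

combinations = 2 ** 9                   # nine independent on/off bits
```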

File System as Name Space

A name space is a mapping from names (strings) to objects of some kind. A file system is a name space, where the names map to files on the disk. But operating systems need name spaces for other things as well: processes, network connections, terminals, tape drives, printers, etc. All these things need names or identifiers. One of the most innovative and successful aspects of the design of Unix is the use of the file system as a unifying name space for things other than disk files.

In Unix, all devices exist in the file system, traditionally located under /dev. In addition, some interprocess communication mechanisms (fifos and Unix domain sockets) are located in the file system. This means that the file abstraction can be used to manipulate these devices, simplifying programming. In Unix, programs that manipulate disk files, tape drives, or network connections are written in as similar a way as possible.
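As a small illustration, the null device can be opened and written with exactly the calls used for a disk file. This sketch assumes Python's os module and a system that provides /dev/null as a character-special file, which is the usual arrangement.

```python
import os
import stat

fd = os.open("/dev/null", os.O_WRONLY)      # a device, named in the file system
written = os.write(fd, b"discarded")        # same write call as for a disk file
is_char_device = stat.S_ISCHR(os.fstat(fd).st_mode)
os.close(fd)
```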

Special Files

Earlier I said that Unix only has one type of file. But the inode has bits for the file type. Now we see how this apparent contradiction is resolved. The other types of files are actually other objects in the file system's name space, which can be manipulated with the file abstraction. So even though these things are not disk files, but tape drives, terminals and network connections, they can be manipulated as files by opening and closing them, and reading and writing them.

Character Special Files

Character special files represent devices that can be accessed a byte at a time, like a terminal, disk partition, tape drive, etc. These are also known as raw devices.

Block Special Files

Block special files represent devices that can be accessed by blocks, which are buffered by the kernel. These are also known as cooked devices (because they're not raw). These are typically disk partitions.

Note that disk partitions are represented in three ways under Unix: as a collection of plain files in the file system, as character special devices, and as block special devices. You can use whatever method of access is appropriate. Usually, we access the disk partition through the file system, exploiting the kernel-provided name space, caching, access control, etc. But disk backups are usually performed by opening the raw device for each disk partition and dumping it to tapes, to gain extra speed by bypassing the overhead of going through the file system. In addition, special purpose file systems (say for a high performance DBMS) can be built on a disk partition and accessed (portably) via character or block special files.

Note that the permission to access the raw disk partition directly is controlled, simply and elegantly, by the usual Unix access permissions on the special files.

Named Pipes (FIFOs)

A named pipe or fifo is an extremely simple and elegant interprocess communication (IPC) mechanism. When opened for writing by a process, it causes that process to block until another process opens the same fifo for reading. Then the two processes have successfully rendezvoused and whatever bytes are written to the fifo by the writer can be read by the reader. The bytes aren't stored on the disk, but rather in kernel buffers. This makes Unix IPC extremely simple.
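The rendezvous can be sketched with Python's os.mkfifo. To keep the example in one file, a thread stands in for the second process; that substitution, the fifo's name, and the message are all assumptions of this sketch.

```python
import os
import tempfile
import threading

d = tempfile.mkdtemp()                  # hypothetical scratch directory
fifo = os.path.join(d, "rendezvous")
os.mkfifo(fifo)                         # create the named pipe


def writer():
    fd = os.open(fifo, os.O_WRONLY)     # blocks until a reader opens the fifo
    os.write(fd, b"ping")
    os.close(fd)


t = threading.Thread(target=writer)     # stands in for a second process
t.start()

fd = os.open(fifo, os.O_RDONLY)         # blocks until a writer opens the fifo
received = b""
while True:
    chunk = os.read(fd, 100)
    if not chunk:                       # writer closed: end of file
        break
    received += chunk
os.close(fd)
t.join()

os.remove(fifo)
os.rmdir(d)
```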

Keith Waclena
The University of Chicago Library
This page last updated: Mon Aug 8 17:46:07 CDT 1994