The Unix File System
Only One Kind of File: Byte Streams
A file is a stream (a linear array) of 8-bit bytes.
Any byte or contiguous sequence of bytes may be read or written,
regardless of any underlying block, track or cylinder structure in the
hardware. A single read or write can access any number of bytes:
there is no maximum size, and it needn't correspond to the underlying
disk organization.
A file opened by a process is identified by a file
descriptor which is private to that process. Associated with
each file descriptor is a file pointer which records the
current location in the file. A read will read bytes
starting at the file pointer, and will increment the file pointer by
the number of bytes read. A write will write bytes to the
location specified by the file pointer and likewise increment it. The
file pointer can be adjusted to any location (even beyond the end of
the file) by a seek operation.
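The behavior of the file pointer can be sketched with Python's os module, whose open, read, write and lseek functions are thin wrappers over the system calls of the same names. This is only an illustration under that assumption; the function name file_pointer_demo and the sample data are mine.

```python
import os
import tempfile


def file_pointer_demo():
    """Show the file pointer advancing on read and write, and moving on seek."""
    fd, path = tempfile.mkstemp()      # fd is a small integer private to this process
    try:
        os.write(fd, b"hello world")   # pointer advances from 0 to 11
        after_write = os.lseek(fd, 0, os.SEEK_CUR)   # report the pointer without moving it
        os.lseek(fd, 6, os.SEEK_SET)   # move to absolute offset 6
        data = os.read(fd, 5)          # reads the 5 bytes at offsets 6..10
        after_read = os.lseek(fd, 0, os.SEEK_CUR)
        beyond = os.lseek(fd, 100, os.SEEK_SET)      # seeking past the end is allowed
        return after_write, data, after_read, beyond
    finally:
        os.close(fd)
        os.remove(path)
```

Seeking with an offset of zero relative to the current position is a common way to ask where the file pointer is.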
Text files and binary files are the same. Whether or not a program
can handle binary files just depends on that particular program. All
files are stored the same way (as a stream of bytes): text files,
database files, executable programs, source code, etc.
No Record Orientation That You Don't Impose Yourself
Unix files have no record orientation. A program can impose any
record orientation it likes: to work with 80-column records, simply
read and write 80-byte chunks. However, very few Unix programs use
fixed-length records, because they're so inflexible, and it's nearly
as easy to deal with variable length lines in text files.
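For illustration, here is one way a program might impose 80-byte records on a plain byte stream. The names RECORD_SIZE, write_records and read_record are hypothetical, and padding with spaces is just one possible convention; nothing in the file itself knows about records.

```python
import os
import tempfile

RECORD_SIZE = 80  # hypothetical record length, matching the 80-column example


def write_records(path, records, size=RECORD_SIZE):
    """Pad (or cut) each record to exactly `size` bytes and write it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for rec in records:
            os.write(fd, rec[:size].ljust(size, b" "))
    finally:
        os.close(fd)


def read_record(path, index, size=RECORD_SIZE):
    """Fetch record number `index` by seeking to byte offset index * size."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, index * size, os.SEEK_SET)
        return os.read(fd, size).rstrip(b" ")
    finally:
        os.close(fd)
```

The record structure lives entirely in the arithmetic index * size; the kernel sees only offsets and byte counts.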
Text Files
Text files are simply a byte stream where a particular ASCII value,
the newline or linefeed character (octal 012; decimal 10) is used to
separate lines. Note that, conventionally, Unix does not use carriage
return / linefeed pairs to separate lines. Just a single newline is
used, which makes programming much simpler. Devices that require the
concept of a carriage return (terminals, printers) simply map newline
to whatever is required.
End of File
There is no end of file character in a Unix file. The file system
records the length of every file and simply reports an end of file
state when end of file is reached.
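A small sketch of what this looks like to a program, assuming Python's os.read as a stand-in for the read system call: end of file shows up as a read that returns zero bytes, not as a special byte in the data. The function name read_until_eof is mine.

```python
import os
import tempfile


def read_until_eof(path, chunk_size=4096):
    """Read a whole file; a zero-length result from read signals end of file."""
    fd = os.open(path, os.O_RDONLY)
    chunks = []
    try:
        while True:
            chunk = os.read(fd, chunk_size)
            if chunk == b"":      # an end of file state, not an end of file character
                break
            chunks.append(chunk)
    finally:
        os.close(fd)
    return b"".join(chunks)
```

The number of bytes returned matches the length recorded by the file system, which is how the kernel knows when to stop.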
Directories and Filenames
Files don't have names; they have I-numbers. I-numbers
index an array of inodes stored in each file system.
Inodes contain useful information about files:
- Type of file
  - Plain file
  - Directory
  - Character-special
  - Block-special
  - Named pipe or fifo
  - Symbolic link
  - Socket
- Link count
- UID
- GID
- Owner permissions
- Group permissions
- Other permissions
- Size in bytes
- Last access time
- Last modification time
- Last inode change time
- Data block pointers
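Most of this inode information can be inspected from a program. The sketch below assumes Python's os.lstat, which wraps the lstat system call; the function name describe_inode and the dictionary keys are my own. The data block pointers are not exposed through this interface, so they are absent.

```python
import os
import stat
import tempfile

# Predicates from the stat module paired with the type names used above.
_TYPES = [
    (stat.S_ISREG, "plain file"),
    (stat.S_ISDIR, "directory"),
    (stat.S_ISCHR, "character-special"),
    (stat.S_ISBLK, "block-special"),
    (stat.S_ISFIFO, "named pipe"),
    (stat.S_ISLNK, "symbolic link"),
    (stat.S_ISSOCK, "socket"),
]


def describe_inode(path):
    """Return the inode fields that the stat interface exposes for `path`."""
    st = os.lstat(path)   # lstat, so a symbolic link reports on itself
    kind = next((name for test, name in _TYPES if test(st.st_mode)), "unknown")
    return {
        "inumber": st.st_ino,
        "type": kind,
        "links": st.st_nlink,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "permissions": oct(stat.S_IMODE(st.st_mode)),
        "size": st.st_size,
        "atime": st.st_atime,   # last access
        "mtime": st.st_mtime,   # last modification
        "ctime": st.st_ctime,   # last inode change
    }
```

Note that the filename is not among the fields: it is the argument used to find the inode, not something stored in it.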
A directory is a mapping from filenames to I-numbers.
Directories are stored in ordinary byte-stream files, but they are
marked as directories in the inode so that only the kernel can write
to them (any process can read a directory).
Since directories are ordinary files, they have inodes and hence
I-numbers. This means a directory can contain the I-number of another
directory. This provides a graph-structured file system.
Since general cycles in the file system would make file system
traversal difficult, the kernel disallows them. However, the kernel
actually maintains two restricted types of cycles in each directory:
- . is a link to the current directory
- .. is a link to the parent directory
The kernel ensures that every directory contains two entries, one
mapping the filename . to the directory's own I-number, and one
mapping the filename .. to the directory's parent's I-number.
This means that every directory can be reached by traversing other
directories in the file system; in other words, the graph of the file
system is connected.
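The two special entries can be checked by comparing I-numbers. This is a minimal sketch assuming Python's os.stat; the function name dot_inumbers is hypothetical, and a careful version would also compare device numbers, since I-numbers are only unique within one file system.

```python
import os
import tempfile


def dot_inumbers(directory):
    """Return the I-numbers of a directory, its . entry and its .. entry."""
    own = os.stat(directory).st_ino
    dot = os.stat(os.path.join(directory, ".")).st_ino
    dotdot = os.stat(os.path.join(directory, "..")).st_ino
    return own, dot, dotdot
```

For any directory, the first two values should be equal, and the third should equal the I-number of the parent directory.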
There is a distinguished directory called the root
directory; all other directories are descendants of this directory.
The name of the root directory is /.
Pathnames
Users usually identify files not by I-number, but rather by
pathname. A pathname is conceptually a sequence of one or
more directory names, always starting with the root directory, and
ending (optionally) with the name of a non-directory. (If the
pathname doesn't end with a non-directory, it names a directory rather
than a plain file.) These directory names and the final optional file
name are components of the pathname. The pathname
specifies a path to the file of interest from the root, naming
explicitly every directory that must be traversed to reach it.
Pathnames are spelled with slashes (/) separating
the path components. Since the name of the root directory is
/, a leading slash is taken to name the root
directory, and all the other slashes are separators.
To simplify the use of pathnames, there is an alternate form called
relative pathnames (the form already described is called
absolute pathnames). A relative pathname is any pathname
which does not start with a slash. Since it doesn't start
with a slash, it doesn't (necessarily) start at the root: how do we
decide where it starts? Simple: a relative pathname starts at the
current working directory.
Current Working Directory
Every process has a current working directory (CWD), which is used to
resolve relative pathnames. The CWD of a process is inherited by its
child processes, but any process can change its CWD at will. The
login program sets the CWD of the shell to the user's home
directory. Other than this behavior, and the fact that a user's
home directory is arranged to be owned by and writable by that user,
there is nothing special about the home directory (though certain
programs may assume that they can find certain files (typically
configuration files) there).
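Here is a sketch of relative pathname resolution and CWD inheritance, assuming Python's os.chdir and os.getcwd as wrappers for the corresponding system calls. The function names read_relative and child_cwd are mine, and the child process is simply another Python interpreter started for the purpose.

```python
import os
import subprocess
import sys
import tempfile


def read_relative(directory, name):
    """Change the CWD to `directory`, then open `name` as a relative pathname."""
    saved = os.getcwd()
    os.chdir(directory)
    try:
        fd = os.open(name, os.O_RDONLY)   # no leading slash: resolved against the CWD
        try:
            return os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.chdir(saved)                   # put the CWD back the way we found it


def child_cwd():
    """Ask a child process for its CWD; it inherits the parent's."""
    result = subprocess.run(
        [sys.executable, "-c", "import os; print(os.getcwd())"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Because the CWD belongs to the whole process, read_relative restores it before returning so that other code is not surprised.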
Hard Links
If the presence of an I-number in a directory names that file, what's
to prevent an I-number from appearing in that directory multiple
times, or from appearing in other directories? Nothing: Unix files
can have many names. Each directory entry is called a
link, and so a file can have many links.
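A second name for an existing file can be made with the link system call. The sketch below assumes Python's os.link wrapper; the function name add_name is hypothetical.

```python
import os
import tempfile


def add_name(existing, new_name):
    """Give an existing file a second name and report what stat sees."""
    os.link(existing, new_name)            # a new directory entry, not a copy
    a = os.stat(existing)
    b = os.stat(new_name)
    return a.st_ino == b.st_ino, b.st_nlink
```

Both names lead to the same I-number, so neither one is the "real" name and data written through one is visible through the other.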
Special File Descriptors
Every process already has three files open when it is created. Hence,
it has three file descriptors in use. These open files are called
standard input (stdin), standard output (stdout), and
standard error (stderr).
This allows a style of program which functions as a simple
filter, reading bytes from standard input, processing them,
and writing the result to standard output. Standard error is used to
report errors, so that they aren't mixed up with the output.
No program is required to be written in accordance with this model: a
process is free to ignore these open files and process named files
instead. But many Unix programs are written as filters and this
allows them to be connected in a pipeline, standard output
to standard input, to build complex applications out of a simple chain
of filters.
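A filter can be very small. The sketch below upper-cases whatever it reads; it relies on the usual convention that descriptors 0, 1 and 2 are standard input, standard output and standard error. The function names upcase_filter and complain are mine.

```python
import os


def upcase_filter(in_fd=0, out_fd=1):
    """Copy bytes from in_fd to out_fd, upper-casing them on the way."""
    while True:
        chunk = os.read(in_fd, 4096)
        if not chunk:                  # zero bytes means end of input
            break
        data = chunk.upper()
        while data:                    # keep writing until the chunk is all out
            written = os.write(out_fd, data)
            data = data[written:]


def complain(message, err_fd=2):
    """Report a problem on standard error so it stays out of the output."""
    os.write(err_fd, message.encode() + b"\n")
```

Calling upcase_filter() with no arguments makes the program a filter in the sense above; passing other descriptors lets the same code be driven through pipes, which is what a pipeline does for you.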
Files as Abstractions
A file is an abstraction through which we can access data on disk.
The kernel requires us to use a small number of system calls to access
files; these system calls define the file abstraction. The basic
operations on files are:
- Create
- Remove
- Open
- Close
- Read
- Write
Creating A File
A file is created by creating its inode. This can only be done via
the kernel (through a system call). The kernel insists that a
pathname (absolute or relative) be specified when creating an inode,
so creating an inode also creates a link in a directory. This means
that every file has at least one name.
Removing A File
There is no way to directly remove an inode. The kernel provides a
system call to remove a link from a given directory. This does not
necessarily remove the inode or free the data blocks associated with
the file, because the file may have more than one name and thus there
may still be other links to the file in other directories.
For this reason the kernel maintains a link count in the
inode, which serves as a reference count for the inode and data
blocks. Whenever a file is unlinked, its link count is decremented.
When the link count goes to zero, the inode and data blocks are freed.
This is another reason why general cycles are not allowed in the file
system: a cycle would throw off the link count and result in unused
structures that couldn't be freed.
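The link count can be watched as names are removed. This sketch assumes Python's os.link and os.unlink wrappers; the function name unlink_demo and the file names first and second are hypothetical.

```python
import os
import tempfile


def unlink_demo(directory):
    """Create a file with two names, then remove the names one at a time."""
    first = os.path.join(directory, "first")
    second = os.path.join(directory, "second")

    fd = os.open(first, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    os.write(fd, b"data")
    os.close(fd)

    os.link(first, second)
    counts = [os.stat(second).st_nlink]      # two names, so the count is 2

    os.unlink(first)                         # removes a link, not the inode
    counts.append(os.stat(second).st_nlink)  # one name left, so the count is 1

    fd = os.open(second, os.O_RDONLY)
    survived = os.read(fd, 100)              # the data is still reachable
    os.close(fd)

    os.unlink(second)                        # last link gone: inode and blocks are freed
    gone = not os.path.exists(second)
    return counts, survived, gone
```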
Since directories contain the all-important links, they can't be
simply removed (unlinked) like plain files. A special system call
must be used to remove a directory, and it will only succeed if the
only links left in the directory are the . and .. entries.
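The special call is rmdir. A sketch of its behavior, assuming Python's os.rmdir wrapper; the function name try_rmdir is mine, and the two error codes checked are the ones I would expect a system to report for a directory that still has entries.

```python
import errno
import os
import tempfile


def try_rmdir(path):
    """Try to remove a directory; return False if it still contains links."""
    try:
        os.rmdir(path)
        return True
    except OSError as exc:
        # Systems report a non-empty directory as ENOTEMPTY or EEXIST.
        if exc.errno in (errno.ENOTEMPTY, errno.EEXIST):
            return False
        raise
```

Removing the directory's contents first, one unlink at a time, is what tools that delete whole trees have to do.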
Opening a File
A file is opened by specifying a pathname and a mode. The mode can be
any of:
- read only
- write only
- read / write
An optional parameter allows the file to be created if it doesn't
exist. In addition, one can specify truncation on open, synchronous
writes, etc. The open system call returns a file descriptor.
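The mode and the optional behaviors are passed to open as flags. The sketch below assumes Python's os.open, which takes the same flag constants as the system call; the function names and the sample data are hypothetical.

```python
import os
import tempfile


def opens_without_create(path):
    """Try a plain read-only open; it fails if the file does not exist."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except FileNotFoundError:
        return False
    os.close(fd)
    return True


def create_and_truncate(path):
    """Create a file with O_CREAT, then reopen it with O_TRUNC."""
    # Write only, create if missing; 0o644 is the permission mode for a new file.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    os.write(fd, b"first draft")
    os.close(fd)
    size_before = os.stat(path).st_size

    # Read / write, and throw away the old contents on open.
    fd = os.open(path, os.O_RDWR | os.O_TRUNC)
    size_after = os.fstat(fd).st_size
    os.write(fd, b"v2")
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 100)
    os.close(fd)
    return size_before, size_after, data
```

In each case the value that comes back from open is the file descriptor used by every later call.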
Closing a File
All file descriptors are automatically closed when a process exits,
but the close system call can also be called explicitly. It takes a
file descriptor as a parameter. Closing a file frees locks, shuts
down network connections, etc, as appropriate.
Reading a File
The read system call takes a file descriptor, a pointer to a buffer,
and a count of the number of bytes to read. Any number of bytes can
be read with one read. If you are reading from a pipe, fifo, or
socket, the read will (by default) block until at least some data is
available; it may then return fewer bytes than you requested.
Writing a File
The write system call takes the same arguments as the read system
call, but assumes the buffer contains data to be written. It returns
the number of bytes actually written, which is normally the number
requested.
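Both calls report how many bytes they actually transferred, so careful programs loop on the result. The helpers below are a sketch under that assumption, using Python's os.read and os.write; the names write_all and read_exactly are mine.

```python
import os


def write_all(fd, data):
    """Call write until every byte of `data` has been accepted."""
    total = 0
    while total < len(data):
        total += os.write(fd, data[total:])   # write returns the count it took
    return total


def read_exactly(fd, count):
    """Call read until `count` bytes have arrived or end of file is reached."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:                          # end of file: return what we have
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

With plain disk files the loops usually run once; with pipes, fifos and sockets they matter.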
Access Permissions
Every inode has a set of access permissions associated with it. Unix
access permissions are very simple; the traditional kernel doesn't support
anything like access control lists (ACLs). This simplicity makes it
easy to program and understand file permissions, and easy to predict
the ramifications of combining different levels of access with
different owners -- something which is difficult on more complex
systems. ACLs, when needed, are implemented with user mode code.
There are three levels of access permissions:
- User
- Group
- Other
These are combined with three possible actions:
- Read
- Write
- Execute
These combine into nine independent permission bits, which makes
2^9 = 512 possible combinations, which is already a
lot: imagine the complexity of systems that use more elaborate access
controls.
User permissions specify what the owner of the inode can do
to the file, group permissions specify what anyone in the
group of the inode can do, and other permissions cover
everybody else.
Read access allows you to read the contents of the file's
data blocks, write access allows you to modify the contents
of those data blocks, and execute permission allows you to
execute the contents of a file as a program. (It's possible to be
able to execute a file without being able to read it.)
Some operating systems also have the notion of file create
and file delete access: in Unix, this is simply controlled
by permissions on the containing directory. To do either operation
requires write access to the containing directory (because creating or
removing a link really does require write access to the data blocks of
the directory file).
Similarly, Unix doesn't need list directory access, because
this is controlled by read permission on the directory itself.
However, execute permission doesn't make sense for a directory (which
can never contain executable code), so this bit is overloaded with
another meaning: it controls the ability to use a directory as a
component of a pathname.
These Unix file permissions (sometimes known as the file
mode) are stored in the inode as nine bits:
UUUGGGOOO
rwxrwxrwx
a read, a write and an execute bit for each of user, group and other.
The ls program (when invoked with the -l option) actually displays
the permissions as rwxrwxrwx
with hyphens replacing the bits that are off. The permission bits are
also often expressed as three octal digits, since it's so easy to
compute the bits that way. (Honest!)
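The octal trick works because each octal digit covers exactly three bits: read is 4, write is 2 and execute is 1, so 7 is rwx, 5 is r-x and 4 is r--. The sketch below renders nine permission bits the way ls -l shows them; the function names mode_string and permission_bits are hypothetical.

```python
import os
import stat
import tempfile


def mode_string(mode):
    """Render the low nine bits of `mode` as rwxrwxrwx with hyphens for off bits."""
    out = ""
    for shift in (6, 3, 0):            # user, then group, then other
        bits = (mode >> shift) & 0o7
        out += "r" if bits & 4 else "-"
        out += "w" if bits & 2 else "-"
        out += "x" if bits & 1 else "-"
    return out


def permission_bits(path):
    """Return just the permission bits stored in the inode for `path`."""
    return stat.S_IMODE(os.stat(path).st_mode)
```

For example, mode_string(0o754) gives rwxr-xr--, one group of three letters per octal digit.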
File System as Name Space
A name space is a mapping from names (strings) to objects
of some kind. A file system is a name space, where the names map to
files on the disk. But operating systems need name spaces for other
things as well: processes, network connections, terminals, tape
drives, printers, etc. All these things need names or identifiers.
One of the most innovative and successful aspects of the design of
Unix is the use of the file system as a unifying name space for things
other than disk files.
In Unix, all devices exist in the file system, traditionally located
under /dev. In addition, some interprocess communication
mechanisms (fifos and Unix domain sockets) are located in the file
system. This means that the file abstraction can be used to
manipulate these devices, simplifying programming. In Unix, programs
that manipulate disk files, tape drives, or network connections are
written in as similar a way as possible.
Special Files
Earlier I said that Unix only has one type of file. But the inode has
bits for the file type. Now we see how this apparent contradiction is
resolved. The other types of files are actually other objects in the
file system's name space, which can be manipulated with the file
abstraction. So even though these things are not disk files, but tape
drives, terminals and network connections, they can be manipulated as
files by opening and closing them, and reading and writing them.
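As a small illustration, the sketch below looks at /dev/null, which is my choice of example and is not mentioned above; on the Unix systems I know of it is a character special file that discards whatever is written to it. The function names are hypothetical. The point is that the same open, write and close calls work on it as on a disk file.

```python
import os
import stat


def is_character_special(path):
    """Check the file type bits in the inode for a character special file."""
    return stat.S_ISCHR(os.stat(path).st_mode)


def discard(data, path="/dev/null"):
    """Write bytes to a device using the ordinary file calls."""
    fd = os.open(path, os.O_WRONLY)
    try:
        return os.write(fd, data)      # the device accepts and drops the bytes
    finally:
        os.close(fd)
```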
Character Special Files
Character special files represent devices that can be
accessed a byte at a time, like a terminal, disk partition, tape
drive, etc. These are also known as raw devices.
Block Special Files
Block special files represent devices that can be accessed
by blocks, which are buffered by the kernel. These are also known as
cooked devices (because they're not raw). These are
typically disk partitions.
Note that disk partitions are represented in three ways under Unix: as
a collection of plain files in the file system, as character special
devices, and as block special devices. You can use whatever method
of access is appropriate. Usually, we access the disk partition
through the file system, exploiting the kernel-provided name space,
caching, access control, etc. But disk backups are usually performed
by opening the raw device for each disk partition and dumping it to
tapes, to gain extra speed by bypassing the overhead of going through
the file system and the kernel's buffer cache. In addition, special purpose file systems (say for a high
performance DBMS) can be built on a disk partition and accessed
(portably) via character or block special files.
Note that the permission to access the raw disk partition directly is
controlled, simply and elegantly, by the usual Unix access permissions
on the special files.
Named Pipes (FIFOs)
A named pipe or fifo is an extremely simple and
elegant interprocess communication (IPC) mechanism. When opened for
writing by a process, it causes that process to block until another
process opens the same fifo for reading. Then the two processes have
successfully rendezvoused and whatever bytes are written to the fifo
by the writer can be read by the reader. The bytes aren't stored on
the disk, but rather in kernel buffers. This makes Unix IPC extremely
simple.
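The rendezvous can be sketched with Python's os.mkfifo. To keep the example in one piece I use a second thread in place of a second process, which is a simplification of the description above; the blocking behavior of open is the same. The function name fifo_roundtrip and the fifo name channel are hypothetical.

```python
import os
import tempfile
import threading


def fifo_roundtrip(message):
    """Send `message` through a fifo from a writer thread to a reader."""
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "channel")
    os.mkfifo(path)                        # the fifo gets a name in the file system

    def writer():
        fd = os.open(path, os.O_WRONLY)    # blocks until a reader opens the fifo
        try:
            os.write(fd, message)
        finally:
            os.close(fd)                   # closing gives the reader end of file

    thread = threading.Thread(target=writer)
    thread.start()

    fd = os.open(path, os.O_RDONLY)        # blocks until a writer opens the fifo
    chunks = []
    try:
        while True:
            chunk = os.read(fd, 4096)
            if not chunk:
                break
            chunks.append(chunk)
    finally:
        os.close(fd)

    thread.join()
    os.remove(path)
    os.rmdir(directory)
    return b"".join(chunks)
```

Apart from mkfifo, every call here is one already used for plain files, which is the simplicity being claimed.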
Keith Waclena
The University of Chicago Library
This page last updated: Mon Aug 8 17:46:07 CDT 1994