This section is divided into three parts. The first describes the metadata standards for digital objects which librarians have developed during the past decade, some of which they are still developing. The second considers which of these standards might be relevant to the five types of materials described in the charge. Librarians have so far focussed on materials of a different type (for example, so-called cultural materials), so some analysis is necessary before trying to apply standards originating in one domain to materials originating in another. Finally, we present a framework for establishing the core metadata elements needed for archiving University materials. These last two parts reflect the understanding that, while some metadata[1] are essential, there is a cost to metadata creation or capture which must be contained.
[1] "Data" is a plural noun in standard English. In computer circles it is often used as a singular noun. Here we adopt the standard English usage for "data," and by extension, for "metadata." [Group, we need a footnote like this, but you get to decide whether you want to use it as a singular or as a plural, and we alter the note accordingly. But we should be consistent.]
Metadata is data about data. A good example is a library catalog card, which contains data about the nature and location of a book: it is data about the data in the book referred to by the card.
The content combined with its metadata is often called a content package.
http://en.wikipedia.org/wiki/Metadata
Metadata have progressed a good deal from the days of the library catalog card. Those working with digital libraries today commonly recognize the following kinds of metadata:
Descriptive metadata most closely resemble the bibliographic data found on a library catalog card.
Preservation, rights, and technical metadata (defined below) are often grouped together under the overarching term "administrative metadata," which can also include other kinds of metadata, such as digital provenance, which describes the history of a digital document, including its migration to new formats.
Structural metadata describe how a set of digital objects should be combined to form a compound digital object, for example, how individual page images should be combined to form a digital book (if the book were scanned page by page), or how audio tracks should be combined to make a recording. The definition of metadata above added "The content combined with its metadata is often called a content package." In practice, standards for structural metadata also include content packaging information.
The traditional standard for representing machine-readable bibliographic data is MARC (MAchine-Readable Cataloging), which describes both an exchange format (a syntax) and a markup specification (a semantics). The modern digital library replaces the MARC syntax with XML (Extensible Markup Language), and has introduced new descriptive metadata standards for digital materials. A brief introduction to some of the more important of these follows.
For some projects the Library finds it necessary to create customized descriptive metadata element sets. However, in these cases it, too, creates mappings from these custom elements to unqualified Dublin Core, to facilitate the exchange of metadata, for example, via OAI-PMH.
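A mapping of this kind can be sketched in a few lines of Python. The custom element names below are hypothetical and stand in for whatever a given project defines; only the target names (creator, title, date, identifier) are actual unqualified Dublin Core elements. This is an illustration under those assumptions, not a description of the Library's own crosswalks.

```python
# Hypothetical custom element names mapped to unqualified Dublin Core elements.
CUSTOM_TO_DC = {
    "photographer": "creator",
    "caption": "title",
    "date_taken": "date",
    "negative_number": "identifier",
}


def to_dublin_core(record):
    """Map a custom record (dict) to unqualified Dublin Core.

    Custom elements with no mapping are dropped. Values are collected in
    lists because every Dublin Core element is repeatable.
    """
    dc = {}
    for element, value in record.items():
        target = CUSTOM_TO_DC.get(element)
        if target is not None:
            dc.setdefault(target, []).append(value)
    return dc
```

Elements that have no Dublin Core equivalent are lost in the exchange, which is the usual price of mapping to a simpler element set.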
The Library's Non-MARC Metadata Working Group maintains information on other descriptive metadata standards and proposals which are also of importance or interest to the digital library community. Of these, one is of especial interest to us. One of the creators of Dublin Core, John Kunze, simplified that standard even further, as follows:
Though it is not a standard, but rather a methodically articulated proposal, this Electronic Resource Citation format, or ERC, which Kunze describes in more detail in A Metadata Kernel for Electronic Permanence, is worthy of our attention as presenting a cost-effective, core descriptive metadata element set which may prove serviceable enough for archival description.
Preservation metadata record information required for the preservation of digital objects. Core preservation metadata record information not recorded by another applicable standard (i.e., descriptive, rights, structural, or technical). A core preservation metadata set is currently being defined by the PREMIS (PREservation Metadata: Implementation Strategies) working group, sponsored by OCLC and the Research Libraries Group (RLG). The final draft standard is expected at the end of 2004.
PREMIS is also considering what kinds of rights information might need to be included in a core preservation metadata element set. The digital library community is several years away (at least) from a rights expression language suitable for its purposes. (ODRL and XrML are too narrowly focussed on digital media and commercial publishing interests.) In the absence of any applicable standard, simple local rights expression languages may be developed to address local needs.
Three standards for structural metadata currently have "mindshare" in the Library community.
Technical metadata answer the question, What kind of digital object is this? Possible answers might be, TIFF, ASCII, etc. Technical metadata should also give more precise information about the kinds of formats a digital object contains. For example, there is more than one kind of TIFF, PDF, etc. PREMIS is defining a core set of technical metadata for preservation purposes. The Library is also working to define its technical metadata requirements, which should be available soon. These will be compared to the PREMIS core set when that is available.
Because identifying and recording technical metadata can be expensive, automatic extraction is attractive as a way to keep costs in hand. JHOVE (JSTOR/Harvard Object Validation Environment), a tool for the automatic extraction of technical metadata from digital objects, is being developed to address this need.
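To give a sense of what automatic extraction involves at its simplest, the following Python sketch identifies a few formats from the first bytes of a file. It is not JHOVE and does not use JHOVE's interfaces; the function name and the returned labels are our own assumptions. It checks only well-known file signatures (the two TIFF byte orders and the PDF version header), whereas a real tool also validates the object and extracts far more detail.

```python
def sniff_format(data: bytes) -> str:
    """Guess a format label from the leading bytes of a digital object."""
    # TIFF files begin with a byte-order mark followed by the number 42.
    if data[:4] == b"II*\x00":
        return "TIFF (little-endian)"
    if data[:4] == b"MM\x00*":
        return "TIFF (big-endian)"
    # PDF files begin with "%PDF-" followed by a version such as "1.4".
    if data.startswith(b"%PDF-"):
        version = data[5:8].decode("ascii", errors="replace")
        return f"PDF {version}"
    # Treat non-empty content made up only of 7-bit bytes as ASCII text.
    if data and all(byte < 0x80 for byte in data):
        return "ASCII"
    return "unknown"
```

Even this crude check distinguishes between kinds of TIFF and versions of PDF, which is the sort of precision the technical metadata need to capture.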
The charge asks us to consider five types of materials:
We will apply the five types of metadata to these five types of materials, to see which are applicable.
Archives whose purpose is the preservation of digital materials differ with respect to how much descriptive metadata to record about deposited materials (hereafter, "deposits"). A minimalist approach is taken by the oldest of these digital archives, that of Harvard University. It asserts that full descriptive metadata should reside elsewhere, for example, in the university's online catalog. The archive keeps only minimal descriptive metadata, sufficient to help match up deposits to a canonical description in an external catalog, to help answer the question, from the archive's perspective, "What object is this?" should that question arise for some reason. We agree with this approach. We assert that good bibliographic description is important, but we also assert that what constitutes good bibliographic description may go beyond what an archive needs. A reasonable, core descriptive metadata element set is provided by Kunze's Electronic Resource Citation (ERC), which maps easily to unqualified Dublin Core for interoperability (e.g., the export of metadata records into another system), but which at the same time identifies a "core Dublin Core" for the purposes of archiving University materials. In looking at the five types of materials, it would seem that ERC's "who," "what," "when," and "where," or (in Dublin Core terms), "creator," "title," "date," and "identifier," apply to all of them. However, having defined, or required, core descriptive metadata elements for deposits does not thereby disallow fuller descriptive metadata elements from being provided if they exist; for example, one can easily imagine future deposits of instructional materials being already provided with full LOM descriptions. What the archive can do in these cases is extract the core elements it needs for its purposes (for example, using the LOM to Dublin Core mappings which the LOM standard defines), and store the full descriptive metadata together with the deposit as part of a content package.
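The extraction step described above might look like the following Python sketch. The pairing of ERC and Dublin Core elements is taken from the text; the function names, the package layout, and the rule that a deposit is refused when a core element is missing are assumptions made for illustration only.

```python
# ERC element -> unqualified Dublin Core element, as paired in the text above.
ERC_TO_DC = {
    "who": "creator",
    "what": "title",
    "when": "date",
    "where": "identifier",
}


def extract_core(dc_record):
    """Return (core, missing): the ERC elements found in a Dublin Core
    record, and the names of the ERC elements that have no value."""
    core, missing = {}, []
    for erc_name, dc_name in ERC_TO_DC.items():
        value = dc_record.get(dc_name)
        if value:
            core[erc_name] = value
        else:
            missing.append(erc_name)
    return core, missing


def build_package(content_path, dc_record, full_metadata=None):
    """Assemble a content package: the deposit, its core description, and
    any fuller description (for example, a LOM record) stored alongside."""
    core, missing = extract_core(dc_record)
    if missing:
        raise ValueError("missing core elements: " + ", ".join(missing))
    return {"content": content_path, "core": core, "full_metadata": full_metadata}
```

The fuller description is carried along untouched, so nothing supplied by the depositor is lost even though the archive itself relies only on the four core elements.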
The PREMIS core preservation metadata element set is currently under construction and due to be released as a final draft by the end of 2004. The group's membership is international in scope, it is being sponsored by both of the big bibliographic utilities (RLG and OCLC) in the United States, and it has reviewed all earlier work done in this area with a view to bringing it to conclusion. We must therefore look to PREMIS when defining our core preservation metadata elements, because PREMIS will set the standard in this area.
Because no applicable rights metadata element sets exist, we will have to construct one. We discuss this in the next part. We regard a definition of rights as fundamental to archiving all five types of materials.
With the exception of instructional materials, we do not think that structural metadata standards are important for the types of materials identified in the charge at this time, though they are important for many traditional digital library materials. In addition, they might become important in the future, as digital archives mature and as one of these standards is required for the dissemination of content packages (data and metadata); they are certainly unavoidable today for archives whose purpose it is to serve the digital library community. For instructional materials, IMS may become more immediately important. [Chad, please weigh in on this with a yay or nay.] Structural metadata might become more immediately important if archiving University materials becomes so successful so quickly that a demand is created for this activity to expand beyond its original scope, for example, if there is pressure for it to serve the purposes of an institutional repository to facilitate scholarly communication (something which today is served by systems such as DSpace from MIT, or GNU EPrints from the University of Southampton). Though simple documents, such as PDF files, which constitute the bulk of what institutional repositories contain today, do not need content packaging for their dissemination, compound or complex documents do.
The PREMIS group is defining core technical metadata in addition to core preservation metadata. This work needs to be tracked, because technical metadata are crucial for the survival of deposits over the long term. Our position is that, given the considerable cost of manually inputting technical metadata, and the insurmountable cost of requiring that these be provided up front for all deposits, the archive should rely on automatic extraction of technical metadata for its deposits, using tools such as JHOVE.
The following is informed (though not exclusively) by discussions in the PREMIS group, and also by discussions in a subgroup of the Library's Archiving Group charged to specify the functional requirements for a Library archive, which are also informed by PREMIS. Because PREMIS itself is informed by prior work, and because the following is a synthesis which includes our own thinking, we do not credit the origin of any idea in this part.
The archive is viewed as involving the following entities:
Agents and objects participate in events. Events occur at definite times. Events are enabled by system functions or toolsets, and are governed by policies. Events may be likened to verbs; we identify two event modalities (or adverbs), "may" and "must" (i.e., optional and required). Policies determine what events may, must or must not occur. An archive should record both policies and occurrences (e.g., this event involved this agent and this object at this time).
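The entities just described can be sketched as simple data structures. The following Python is a minimal illustration, not a design: the class and field names are our own, "must not" is modeled as a third value alongside the two modalities, and an event with no recorded policy is simply refused here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum


class Modality(Enum):
    MAY = "may"            # optional
    MUST = "must"          # required
    MUST_NOT = "must not"  # prohibited (negation, added for illustration)


@dataclass(frozen=True)
class Policy:
    event_type: str        # e.g. "deposit" (hypothetical label)
    modality: Modality


@dataclass(frozen=True)
class Occurrence:
    event_type: str
    agent: str
    object_id: str
    time: datetime


class Archive:
    """Holds both kinds of record: policies and occurrences."""

    def __init__(self, policies):
        self.policies = {policy.event_type: policy for policy in policies}
        self.occurrences = []

    def record(self, event_type, agent, object_id, time=None):
        """Record that an event took place, if policy permits it."""
        policy = self.policies.get(event_type)
        if policy is None or policy.modality is Modality.MUST_NOT:
            raise PermissionError(f"policy does not permit event: {event_type}")
        occurrence = Occurrence(event_type, agent, object_id,
                                time or datetime.now(timezone.utc))
        self.occurrences.append(occurrence)
        return occurrence
```

Keeping policies and occurrences as separate records allows the archive to answer two different questions: what was allowed to happen, and what did happen.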
The four core events involving agents outside the archive are:
Upon deposit, the following events may occur:
After deposit, the following events either may or must periodically occur, as indicated:
Events may occur as follows, as determined by policy:
Agents should be recorded as structured elements consisting of at least these components: name (forename; surname); title (i.e., something that designates a function within an organizational unit); organizational unit (or affiliation). At the time of deposit, these values should correspond to (and be able to be validated against) a University directory, which is also archived with sufficient periodicity to allow tracking changes to these values for any agent, to allow future events to take place as expected.
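As an illustration, an agent record and its validation against periodic directory snapshots might be sketched as follows in Python. The field names follow the components listed above; the snapshot structure (a mapping from the date a snapshot was taken to the set of agents it contains) and the function names are assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Agent:
    forename: str
    surname: str
    title: str                 # function within an organizational unit
    organizational_unit: str   # or affiliation


def snapshot_for(snapshots, when):
    """Return the most recent directory snapshot taken on or before `when`,
    or None if no snapshot is that old."""
    eligible = [taken for taken in snapshots if taken <= when]
    return snapshots[max(eligible)] if eligible else None


def validate(agent, snapshots, when):
    """True if the agent appears, with identical values, in the directory
    snapshot in force on the given date."""
    snapshot = snapshot_for(snapshots, when)
    return snapshot is not None and agent in snapshot
```

Because each snapshot is kept, a change of title or unit can be traced later by comparing the snapshots on either side of the change.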
From the perspective of the archive, the following types of agent exist:
Groups (affiliations) may include faculty, instructor, researcher, staff, student, organizational unit, etc. An agent's type determines its rights to participate in specified archival events involving specified archival objects.
There are three kinds of object in the archive: data, metadata, and meta-metadata. Data are the objects themselves. Metadata describe properties of objects; they may include records of events involving those objects in the archive. Meta-metadata record who created the metadata describing the objects (as distinct from the objects themselves), when those metadata were created, and whose intellectual property they are. Meta-metadata are important not only for answering questions about metadata, but also for preventing the unauthorized use of metadata should an archive begin exchanging metadata with others, for example, using OAI-PMH.
Data objects participate in events. They need persistent (i.e., system-independent) identifiers to allow them to be unambiguously referred to by descriptive metadata. This is especially important if the primary finding aid (i.e., the catalog) for digital objects is not necessarily the archive itself, as we are recommending. Several well-recognized schemes for persistent identification exist, e.g., CNRI's Handle System. However, locally created unique identifiers are also acceptable using simple mechanisms such as HTTP server-side redirects, which one report suggests scale well.
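The redirect approach amounts to keeping a table from each persistent identifier to the object's current location. The Python sketch below shows only that table; the identifier syntax and the example.edu addresses are hypothetical, and a web server would deliver the resolved address to the client in the Location header of a redirect response.

```python
class Resolver:
    """Maps persistent identifiers to current storage locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, url):
        """Assign an identifier once; it is never reassigned to another object."""
        if identifier in self._locations:
            raise ValueError(f"identifier already assigned: {identifier}")
        self._locations[identifier] = url

    def move(self, identifier, new_url):
        """Record a new location; the identifier itself does not change."""
        if identifier not in self._locations:
            raise KeyError(identifier)
        self._locations[identifier] = new_url

    def resolve(self, identifier):
        """Return the current location, or None if the identifier is unknown."""
        return self._locations.get(identifier)
```

Descriptive metadata in an external catalog cite only the identifier, so moving an object to new storage requires a change in this table and nowhere else.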
Rights policies specify:
To the extent that policies exist and are recorded programmatically by the archive, for example, as actionable metadata, archival events can occur automatically. Otherwise, intervention by an archive administrator is required to determine whether an event may, must or must not occur.
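The distinction between automatic action and administrator intervention can be made concrete with a small Python sketch. The policy table, the event names, and the three outcomes are hypothetical; the agent groups are among those mentioned earlier.

```python
# Hypothetical actionable policies: (event type, agent group) -> modality.
POLICIES = {
    ("deposit", "faculty"): "may",
    ("deposit", "student"): "must not",
    ("fixity_check", "staff"): "must",
}


def decide(event_type, agent_group, policies=POLICIES):
    """Decide how the archive should handle a requested event.

    Returns "proceed" or "refuse" when a policy is on record, and
    "refer to administrator" when none is.
    """
    modality = policies.get((event_type, agent_group))
    if modality is None:
        return "refer to administrator"
    if modality == "must not":
        return "refuse"
    return "proceed"
```

Every combination missing from the table costs an administrator's time, which is the practical argument for recording policies as metadata wherever possible.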
The archive should be able to refer to a record of the reasons behind policies, to allow questions to be answered about them, and to inform any policy-review process.
Some relationships are implicitly recorded by an archive. For example, a record that an agent and an object participated in an event implies a relationship between the agent and the object. However, some relationships need to be explicitly recorded.
For example, if by policy successive editions of a document (such as the University statutes) are to be archived, then it is important to record which is the latest version. Alternatively, if an identical document exists in two formats, e.g., PDF and ASCII text, then it is important to record that fact. This implies a relationship element in metadata.
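One way to picture such a relationship element is as a triple naming two objects and the relation between them. In the Python sketch below, the relation names ("is_superseded_by", "has_format") and the object identifiers are hypothetical; no existing vocabulary is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relationship:
    subject: str     # identifier of the first object
    predicate: str   # hypothetical relation name
    target: str      # identifier of the related object


def latest_version(object_id, relationships):
    """Follow "is_superseded_by" links from an object to its latest version."""
    successor = {r.subject: r.target
                 for r in relationships if r.predicate == "is_superseded_by"}
    seen = set()
    current = object_id
    # Stop when no successor is recorded, or if the links loop back.
    while current in successor and current not in seen:
        seen.add(current)
        current = successor[current]
    return current


def other_formats(object_id, relationships):
    """List objects recorded as alternative formats of the same document."""
    return [r.target for r in relationships
            if r.subject == object_id and r.predicate == "has_format"]
```

With these two relations recorded, the archive can answer both questions raised above: which edition is current, and which files are the same document in another format.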
Turning this framework into a core metadata element set is a next step in the process of archiving University materials.