Cataloging Overview

Reed Hepler; David Horalek

Cataloging is perhaps the most exciting subject you could read about. You probably howled with delight when you found you could register for this course, right? Just kidding. I know that even the mention of cataloging and the tools used in the process is probably coma-inducing to most people. However, the specifics of cataloging are necessary for librarians, educators, and others to learn. If we know the details of how to catalog, we will know how to help others find the specific items they are looking for. If they are looking for groups of resources, we can help them find several options that fulfill certain criteria.

The goal of this textbook is to explain the details of cataloging in a concise, thorough manner. Interspersed with these explanations will be interactive assessments and experiences in which you can test and demonstrate your knowledge. The average textbook is around 50,000 to 100,000 words. This textbook is around 25,000 words. Even when I update it or add sections, I want to remain in that range. At the most, I will change it to be 30,000 words. There is nothing more pointless than including excessive detail. This is an introductory textbook, and it provides links to more detailed sources of information if you would like to browse those.

Describing, categorizing, and classifying information resources is a detailed process that can involve up to three or four schemata at once. Essential to the best practices of cataloging is the use of controlled entries when creating records. This concept is called controlled vocabulary, and it influences virtually every decision of the cataloger.

Use of Controlled Vocabularies

One way to encourage good names for a given resource domain is to establish a controlled vocabulary. This is like a fixed or closed dictionary that includes the terms that can be used in a particular domain. A controlled vocabulary shrinks the number of words used and leaves behind a set of words with precisely defined meanings and rules governing their use. The results of using a set of specified terms include the elimination of undesirable associations and the removal of synonyms and homonyms. In other words, the possible entries in a field are streamlined to be as precise and thorough as possible while also avoiding any confusion.

A controlled vocabulary is not simply a set of allowed words; it also includes their definitions and often specifies rules by which the vocabulary terms can be used and combined. Different domains can create specific controlled vocabularies for their own purposes, but the important thing is that the vocabulary be used consistently throughout that domain.
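As a minimal sketch of this idea, the closed dictionary can be modeled as a set of allowed terms with definitions plus a map from synonyms to the preferred term. The terms, definitions, and function names below are hypothetical and are not drawn from any real vocabulary.

```python
# Hypothetical controlled vocabulary: each allowed term carries a definition.
VOCABULARY = {
    "Cooking": "Preparation of food for eating.",
    "Gardening": "Cultivation of plants in a garden.",
}

# Hypothetical synonym map: uncontrolled words point to one preferred term.
SYNONYMS = {
    "cookery": "Cooking",
    "cuisine": "Cooking",
    "horticulture": "Gardening",
}


def control_term(raw_term):
    """Return the preferred term for raw_term, or None if it is not allowed."""
    cleaned = raw_term.strip().lower()
    for term in VOCABULARY:
        if term.lower() == cleaned:
            return term
    return SYNONYMS.get(cleaned)


def control_terms(raw_terms):
    """Split raw entries into accepted preferred terms and rejected entries."""
    accepted, rejected = [], []
    for raw in raw_terms:
        preferred = control_term(raw)
        if preferred is None:
            rejected.append(raw)
        elif preferred not in accepted:
            accepted.append(preferred)
    return accepted, rejected
```

In this sketch, entering "cookery" and "Cuisine" yields the single term "Cooking", while an entry outside the vocabulary is rejected rather than stored.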

For bibliographic resources, important aspects of vocabulary control include determining the authoritative forms for author names, uniform titles of works, and the set of terms by which a particular subject will be known. In library science, the process of creating and maintaining these standard names and terms is known as authority control.

When evaluating what name to use for an author, librarians typically look for the name form that is used most commonly across that author’s body of work while conforming to rules for handling prefixes, suffixes, and other name parts that often cause name variations. For example, a name like that of Johann Wolfgang von Goethe might be alphabetized under either “G” or “V,” but filing under “G” is the authoritative form. “See” and “see also” references then map the variations to the authoritative name. The Library of Congress has specific names and formats used for the majority of authors. These are called Name Authorities and will be discussed later.
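The way a “see” reference points a variant form to the authoritative one can be sketched as a simple lookup. The heading strings and the resolve_heading function are illustrative assumptions; they are not copied from the Library of Congress Name Authorities.

```python
# Hypothetical authority file: one authorized heading per person,
# plus "see" references that map variant forms to that heading.
AUTHORIZED = {"Goethe, Johann Wolfgang von"}

SEE_REFERENCES = {
    "Von Goethe, Johann Wolfgang": "Goethe, Johann Wolfgang von",
    "Goethe, J. W. von": "Goethe, Johann Wolfgang von",
}


def resolve_heading(name):
    """Return the authorized heading for name, following a "see" reference."""
    if name in AUTHORIZED:
        return name
    if name in SEE_REFERENCES:
        return SEE_REFERENCES[name]
    raise KeyError(f"No authority record for {name!r}")
```

A search under the “V” form is redirected to the “G” heading, so every record for the author files in one place.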

Controlled Vocabularies and Content Rules

Content rules are similar to controlled vocabularies because they also limit the possible values that can be used in descriptions. Instead of specifying a fixed set of values, content rules typically restrict descriptions by requiring them to be of a particular data type (integer, Boolean, Date, and so on). Possible values are constrained by logical expressions (e.g., a value must be between 0 and 99) or regular expressions (e.g., must be a string of length 5 that must begin with a number). Content rules like these are used to ensure valid descriptions when people enter them in web forms or other applications.
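The two kinds of constraint mentioned above can be sketched as small validation functions. The function names are hypothetical; the rules themselves (an integer between 0 and 99, and a five-character string that begins with a number) come from the examples in the paragraph.

```python
import re

# Regular-expression rule: exactly 5 characters, the first one a digit.
FIVE_CHARS_LEADING_DIGIT = re.compile(r"[0-9].{4}")


def valid_count(value):
    """Data-type rule plus logical rule: an integer between 0 and 99."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        return False
    return 0 <= value <= 99


def valid_code(value):
    """Return True if value is a 5-character string that starts with a digit."""
    if not isinstance(value, str):
        return False
    return FIVE_CHARS_LEADING_DIGIT.fullmatch(value) is not None
```

A web form could call checks like these before saving an entry, so that a value such as "A8340" or 100 is refused at the point of entry.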

Vocabulary Control as Dimensionality Reduction

In most cases a controlled vocabulary is a subset of the natural or “uncontrolled” vocabulary, but sometimes it is a new set of invented terms. The goal of a controlled vocabulary is to reduce the number of descriptive terms assignable to a resource. Thus, controlled vocabulary and content rules bring about a phenomenon called “dimensionality reduction.” In other words, they help catalogers reduce the number of components in a description. These facets of catalog records are driven by a plethora of pretentious-sounding processes, theories, and paradigms.

These terms, and the processes of using them, might sound imposing. Indeed, they are computationally complex, but they all have the same simple concept at their core: the features or properties that describe a resource are often highly correlated. For example, a document that contains the word “religion” is more likely to contain the words “faith” and “ritual” than a document that does not. Similar correlations exist among the visual features used to describe images and the acoustic features that describe music. Dimensionality reduction techniques analyze the correlations between resource descriptions to transform a large set of descriptions into a much smaller set of uncorrelated ones.

Here is an oversimplified example that illustrates the idea. Suppose we have a collection of resources, and every resource described as “big” is also described as “red,” and every “small” resource is also “green.” This perfect correlation between color and size means that either of these properties is sufficient to distinguish “big red” things from “small green” ones, and we do not need clever algorithms to figure that out. But if we have thousands of properties and the correlations are only partial, we need sophisticated statistical approaches to choose the optimal set of description properties and terms. In some techniques, the dimensions that remain are called “latent” or “synthetic” ones because they are statistically optimal but do not map directly to resource properties.
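The oversimplified case can be checked with a few lines of code. This sketch only detects perfect correlation between two properties; it is not one of the statistical techniques used for partial correlations, and the sample data and function names are made up for illustration.

```python
# Hypothetical collection: size and color always go together, shape does not.
RESOURCES = [
    {"size": "big", "color": "red", "shape": "round"},
    {"size": "small", "color": "green", "shape": "round"},
    {"size": "big", "color": "red", "shape": "square"},
    {"size": "small", "color": "green", "shape": "square"},
]


def determines(resources, prop_a, prop_b):
    """True if each value of prop_a always occurs with one value of prop_b."""
    seen = {}
    for resource in resources:
        a, b = resource[prop_a], resource[prop_b]
        # Remember the first prop_b value seen with a; any other value breaks the rule.
        if seen.setdefault(a, b) != b:
            return False
    return True


def perfectly_correlated(resources, prop_a, prop_b):
    """True if either property can be predicted exactly from the other."""
    return determines(resources, prop_a, prop_b) and determines(resources, prop_b, prop_a)
```

Here size and color are perfectly correlated, so one of them could be dropped from the description without losing any distinctions, while shape still carries independent information.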

Creating Resource Descriptions

Resource descriptions can be created by professionals, by the authors or creators of resources, by users, or by computational or automated means.

From the traditional perspective of library and information science with its emphasis on bibliographic description, these modes of creation imply different levels of description complexity and sophistication; Taylor and Joudrey, in their classic book The Organization of Information, suggest that professionals create rich descriptions, untrained users at best create structured ones, and automated processes create simple ones.^[1]

This classification reflects a disciplinary and historical bias more than reality. “Simple” resource descriptions are “no more than data extracted from the resource itself… the search engine approach to organizing the web through automated indexing techniques.”^[2]

A better notion of levels of resource description is one based on the amount of interpretation imposed by the description, an approach that focuses on the descriptions themselves rather than on their methods of creation.

Professionally-created resource descriptions, author- or user-created descriptions, and computational or automated descriptions each have strengths and limitations that impose tradeoffs. A natural solution is to try to combine desirable aspects from each in hybrid approaches. For example, the vocabulary for a new resource domain may arise from tagging by end users but then be refined by professionals, lay classifiers may create descriptions with help from software tools that suggest possible terms, or software that creates descriptions can be improved by training it with human-generated descriptions.

Resource Description by Professionals

Before the web made it possible for almost anyone to create, publish, and describe their own resources and to describe those created and published by others, resource description was generally done by professionals in institutional contexts. Professional indexers and catalogers described bibliographic and museum resources after having been trained in the concepts, the controlled descriptive vocabularies, and the relevant standards. In information systems domains, professional data and process analysts, technical writers, and others created similarly rigorous descriptions after receiving analogous training. We have called these types of resource descriptions institutional ones to highlight the contrast between those created according to standards and those created informally in ad hoc ways, especially by untrained or undisciplined individuals.

Resource Description by Authors or Creators

The author or creator of a resource can be presumed to understand the reasons why and the purposes for which the resource can be used. And, presumably, most authors want to be read, so they will describe their resources in ways that will appeal to and be useful to their intended users. However, these descriptions are unlikely to use the controlled vocabularies and standards that professional catalogers would use.

Resource Description by Users

Today’s web contains a staggering number of resources, most of which are primary information resources published as web content, but many others are resources that stand for “in the world” physical resources. Most of these resources are being described by their users rather than by professionals or by their authors. These “at large” users are most often creating descriptions for their own benefit when they assign tags or ratings to web resources, and they are unlikely to use standard or controlled descriptors when they do so. The resulting variability can be a problem if creating the description requires judgment on the tagger’s part. Most people can agree on the length of a particular music file, but they may differ wildly when it comes to determining to which musical genre that file belongs. Fortunately, most web users implicitly recognize that the potential value in these “Web 2.0” or “user-generated content” applications will be greater if they avoid egocentric descriptions. In addition, the statistics of large sample sizes inevitably lead to some agreement in descriptions on the most popular applications because idiosyncratic descriptions are dominated in the frequency distribution by the more conventional ones.
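The frequency effect described above can be illustrated by counting tags. The tag list and the dominant_tags function are invented for this sketch; real tagging applications are likely to normalize and weight tags in more elaborate ways.

```python
from collections import Counter


def dominant_tags(tag_assignments, top_n=1):
    """Count tags after trimming spaces and lowercasing; return the most common."""
    counts = Counter(tag.strip().lower() for tag in tag_assignments)
    return counts.most_common(top_n)


# Hypothetical tags that different users assigned to the same music file.
TAGS = ["jazz", "Jazz", "jazz ", "my-sunday-mix", "bebop", "jazz"]
```

With these sample tags, the conventional tag "jazz" is counted four times and each idiosyncratic tag only once, so the conventional description dominates.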

We are not suggesting that professional descriptions are always of high quality and utility, and socially produced ones are always of low quality and utility. Rather, it is important to understand the limitations and qualifications of descriptions produced in each way. Tagging lowers the barrier to entry for description, making organizing more accessible and creating descriptions that reflect a variety of viewpoints. However, when many tags are associated with a resource, recall increases while precision decreases. This causes immense problems when catalogers and other professionals seek to categorize and classify resources. Often, socially produced tags are simply ignored and replaced with standardized, controlled metadata terms.
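Precision is the share of retrieved resources that are relevant, and recall is the share of relevant resources that are retrieved. The sketch below computes both for a search; the resource identifiers and the two result sets are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved resources."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four resources are truly relevant to the query.
RELEVANT = {"r1", "r2", "r3", "r4"}

# Sparse tagging retrieves few resources; heavy tagging retrieves many.
SPARSE_RESULTS = {"r1", "r2"}
HEAVY_RESULTS = {"r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8"}
```

In this made-up case the sparse results have perfect precision but miss half of the relevant resources, while the heavily tagged results find all of them at the cost of as many irrelevant ones.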

Evaluating Resource Descriptions

When professionals create resource descriptions in a centralized manner, which has long been the standard practice for many resources in libraries, there is a natural focus on quality at the point of creation to ensure that the appropriate controlled vocabularies and standards have been used. However, the need for resource description generalizes to resource domains outside of the traditional bibliographic one, and other quality considerations emerge in those contexts.

Resource descriptions in private sector firms are essential to running the business and to interacting efficiently with suppliers, partners, and customers. Compared to the public sector, there is much greater emphasis on the economics and strategy of resource description. What is the value of resource descriptions? Who will bear the costs of producing them? Which of the competing industry standards will be followed? Some of these decisions are not free choices as much as they are constraints imposed as a condition of doing business with a dominant economic partner, which is sometimes a governmental entity.

For example, a firm like Wal-Mart with enormous market power can dictate terms and standards to its suppliers because the long-term benefits of a Wal-Mart contract usually make the initial accommodation worthwhile. Likewise, governments often require their suppliers to conform to open standards to avoid lock-in to proprietary technologies.

In both the public and private sectors there is increased use of computational techniques for creating resource descriptions because the number of resources to be described is simply too great to allow for professional description.

A great deal of work in text data mining, web page classification, semantic enrichment, and other similar research areas is already under way and is significantly lowering the cost of producing useful resource descriptions. Some museums have embraced approaches that automatically create user-oriented resource descriptions and new user interfaces for searching and browsing by transforming the professional descriptions in their internal collections management systems. Google’s ambitious project to digitize millions of books has been criticized for the quality of its algorithmically extracted resource descriptions, but we can expect that computer scientists will put the Google book corpus to good use as a research test bed to improve the techniques.

Web 2.0 applications that derive their value from the aggregation and interpretation of user-generated content can be viewed as voluntarily ceding their authority to describe and organize resources to their users, who then tag or rate them as they see fit. In this context the consistency of resource description, or the lack of it, becomes an important issue, and many sites are using technology or incentives to guide users to create better descriptions.

Copy Cataloging

Evaluating and improving upon existing descriptions and catalog records may seem to be a waste of time. However, it is a large part of modern cataloging operations. Copy cataloging is the practice of creating a new catalog record based on a preexisting catalog record. This could be a record created by someone at your own institution or a record obtained from someone else. This is contrasted with original cataloging, which occurs when a cataloger creates a completely new record for an item. Copy cataloging has become increasingly mainstream as libraries develop networks and consortia. The proliferation of services by OCLC has also affected this trend. OCLC provides access to preexisting catalog records, from which catalogers can choose the one that best meets their needs. Copy cataloging is a major boon to library workers, as they do not have to fill out every MARC record manually. Often, only a few tweaks are needed to make the preexisting record compatible with the item in one’s own library holdings. For example, the format, edition number, and call number may have to be changed in a MARC record, but those may be the only changes necessary.
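The copy-then-tweak workflow can be sketched with a record held as a plain dictionary. This is a deliberate simplification: a real MARC record has indicators and subfields, and the field values and the copy_catalog function here are placeholders. The tags used are 100 (main entry, personal name), 245 (title statement), 250 (edition statement), and 050 (Library of Congress call number).

```python
import copy

# Hypothetical preexisting record: MARC tag -> simplified field text.
EXISTING_RECORD = {
    "100": "Doe, Jane",
    "245": "An example title",
    "250": "1st ed.",
    "050": "Z693 .D64 1995",
}


def copy_catalog(source_record, local_changes):
    """Return a new record based on source_record with local changes applied."""
    new_record = copy.deepcopy(source_record)  # leave the source record untouched
    new_record.update(local_changes)
    return new_record


# Only the edition statement and call number differ for the local copy.
LOCAL_RECORD = copy_catalog(
    EXISTING_RECORD,
    {"250": "2nd ed.", "050": "Z693 .D64 2003"},
)
```

The author and title fields carry over unchanged, which is where the time savings over original cataloging come from.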

Types of Catalogs and Databases

Before we explore the intricacies of cataloging schemata and tools, we must first understand why we need specialized controlled vocabulary and authority records. After all, couldn’t we just put all this information on a website so that a search on a search engine or database will turn it up? Well, yes and no. Search engines occasionally pick up library catalog listings, especially if those have been created based on standardized practices. However, for use in the library system, a record must be placed into a specific type of database.

Another major difference between search engines on the one hand and databases and catalogs on the other is that the latter two offer a wide range of limiters, whereas a search engine may only have limiters for the format of a result, time of publication, or author metadata. Boolean operators (AND, OR, NOT, etc.) are even more important when using search engines, although they are also occasionally useful in databases and catalogs.
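Boolean operators correspond to set operations on the records that match each search term. The tiny index below is hypothetical and only sketches the logic; actual search engines and catalogs implement it at a much larger scale.

```python
# Hypothetical index: search term -> set of matching record numbers.
INDEX = {
    "cataloging": {1, 2, 3},
    "marc": {2, 3, 4},
    "history": {3, 5},
}


def records_for(term):
    """Return the set of record numbers indexed under term."""
    return INDEX.get(term.lower(), set())


# cataloging AND marc: records that match both terms.
both = records_for("cataloging") & records_for("marc")

# cataloging OR marc: records that match at least one term.
either = records_for("cataloging") | records_for("marc")

# cataloging NOT history: records that match the first term but not the second.
without = records_for("cataloging") - records_for("history")
```

AND narrows a result set, OR widens it, and NOT removes unwanted matches.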

The major types of catalogs, databases and related services include:

*Integrated Library System: An ILS, also called a Library Management System (LMS), is a service that combines an Online Public Access Catalog (OPAC) with circulation and patron information as well as financial information when necessary. There are multiple facets of this system, and these services may be provided by a single provider or by multiple vendors. An ILS can also provide connections to other services, such as databases and links to consortia. While card catalogs only had three search options, the ILS adds a fourth option: specific metadata such as whole or partial call numbers can be used to search for books.

*Union Catalog: This type of catalog is exemplified by OCLC’s WorldCat. A union catalog is simply an OPAC that is used by multiple libraries that are functioning as a consortium and have fluid collections that can be loaned to each other. This type of catalog facilitates loaning items from one library to another, for example through OCLC’s Inter-Library Loan service.

*Discovery System: This system is a combination of a catalog and a search engine. CSI Library uses a Discovery System that interfaces with our catalog for multiple resources. For our library holdings, we have a Discovery System hosted by EBSCOhost. For our Open Access materials, we have a system provided by SirsiDynix.

*Standard Database: A regular database holds items with copious amounts of data, including creation, title, author, subject, and other metadata. This database may or may not be open. It does not include information about anything other than the data it holds, although site traffic data may be recorded by third parties. Often, the collections in these databases have not been curated. This means they have not been specifically selected, standardized and/or evaluated.

*Card Catalog: This was the initial iteration of a database. It was completely physical with no digital facets. There were only three types of searches one could perform using a card catalog. A user could search for works by Title, by Author, and by Subject. For a time, this was all information that was necessary for a thorough and accurate library search. However, in the increasingly modernizing world, a new iteration of catalog was necessary. This type of catalog was often accompanied by a shelf list, which listed all books in a library by order of their call number.
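The three card catalog searches and the additional call number search of an ILS can be sketched as one function over a list of records. The holdings, field names, and call numbers below are invented placeholders, and the matching is far simpler than what an actual ILS provides.

```python
# Hypothetical holdings with the four access points discussed above.
HOLDINGS = [
    {
        "title": "An Example Cataloging Manual",
        "author": "Doe, Jane",
        "subject": "Cataloging",
        "call_number": "Z693 .D64",
    },
    {
        "title": "An Example Programming Guide",
        "author": "Roe, Richard",
        "subject": "Computer programming",
        "call_number": "QA76 .R64",
    },
]


def search(holdings, field, query):
    """Case-insensitive search; call numbers match on their leading portion."""
    q = query.lower()
    if field == "call_number":
        # Whole or partial call number: match from the start of the number.
        return [h for h in holdings if h["call_number"].lower().startswith(q)]
    # Title, author, and subject: match anywhere in the field.
    return [h for h in holdings if q in h[field].lower()]
```

A partial call number such as "Z6" retrieves the cataloging manual, which is a search that the title, author, and subject drawers of a card catalog could not support directly.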

This work was adapted from the sections “Naming Resources” and “The Process of Describing Resources” from The Discipline of Organizing by Robert J. Glushko. This work has a Creative Commons Attribution-NonCommercial 4.0 International License.

[1] Joudrey, D. N., Taylor, A. G., & Wisser, K. M. (2018). The Organization of Information (4th ed.). Libraries Unlimited, 184-186.
[2] Joudrey, D. N., Taylor, A. G., & Wisser, K. M. (2018). The Organization of Information (4th ed.). Libraries Unlimited, 184.

License

Cataloging with MARC, RDA, and Classification Systems Copyright © 2023 by College of Southern Idaho is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License, except where otherwise noted.