ltr: Vol. 46 Issue 1: p. 5
Chapter 1: Library Data in a Modern Context
Karen Coyle

Abstract

This chapter of “Understanding the Semantic Web: Bibliographic Data and Metadata” explores the history of library data and where it stands in a modern context. The rise of a new information environment—the World Wide Web—has revealed the downside of the long history that libraries have with metadata. The question that we must face, and that we must face sooner rather than later, is how we can best transform our data so that it can become part of the dominant information environment that is the Web.


The larger the library is, the more you must distinguish the books from each other, and consequently the more fully and more accurately you must catalogue them… When I come to a great and national library, where I have the editions or works of “Abelard,” I have a right to find those editions and works so well distinguished from each other that I may get exactly the particular one which I want.

—Sir Anthony Panizzi1

We can trace the origins of modern library cataloging practice back to the 1830s and Anthony Panizzi's 91 rules. Panizzi's singular insight was that a large catalog needed consistency in its entries if it was to serve the user. The years that followed brought waves of change that transformed the world socially, technologically, and intellectually. These changes were matched by a related evolution of libraries and library catalogs. The card catalog came about at the time of the industrial revolution, which was marked by a great increase in the production of printed materials. The true mechanization of the catalog was not possible until much more recent times, when advanced computer technology allowed the creation of the Online Public Access Catalog (OPAC) in the 1980s. Some might say that the term OPAC already sounds quaint to the ears of twenty-first-century librarians.

With each era, conceptual changes to the catalog have come in response to related changes in the catalog's context. Some changes in cataloging rules have addressed the new types of material that libraries must catalog, for instance, the changes that came with the emergence of recorded sound and films. Changes in the workflow of cataloging have been necessary to respond to the increased production of information resources. Technology itself has offered opportunities for change.

If there is one constant, it is that throughout these nearly two centuries, the modern library has continually transformed itself in an effort to respond to the needs of its contemporary user.

Today, we face another significant time of change that is being prompted by today's library user. This user no longer visits the physical library as his primary source of information, but seeks and creates information while connected to the global computer network. The change that libraries will need to make in response must include the transformation of the library's public catalog from a stand-alone database of bibliographic records to a highly hyperlinked data set that can interact with information resources on the World Wide Web. The library data can then be integrated into the virtual working spaces of the users served by the library.

If all of this sounds otherworldly and vague, it is because there is no specific vision of where these changes will lead us. The crystal ball is unfortunately shortsighted, in no small part because this is a time of rapid change in many aspects of the information ecology. The few things that are certain, however, point to the Web, and its eventual successors, as the place to be. For libraries, this means yet another evolutionary step in the history of our catalog: from metadata to metaDATA.


Defining Metadata

The most common definition of metadata is “data about data.” This short, catchy definition is worthy of a successful advertising campaign. Unfortunately, it doesn't really help us understand metadata, and is actually somewhat incorrect. A more useful definition is decidedly less snappy, but can help us understand the helpful role that metadata can play in facilitating information access. In fact, a functional definition gives us a viable roadmap for our own studies of metadata utility and quality.

So here it goes—metadata is constructed, constructive, and actionable:

  • Constructed: Metadata is not found in nature. It is entirely an invention; it is an artificiality.
  • Constructive: Metadata is constructed for some purpose, some activity, to solve some problem. The proliferation of metadata formats that seem similar on the surface is often evidence of different definitions of needs or of different contexts. We may dream of a universal set of metadata for some set of things, like biological entities, printed books, or a calendar of events, but are likely to be disappointed in practice.
  • Actionable: The point of metadata is to be useful in some way. This means that it is important that one can act on the metadata in a way that satisfies some needs.

From this rather lengthy definition, it is undoubtedly evident that the creation of good, functional metadata depends greatly on an understanding of the potential uses of the metadata and of the needs that the metadata must be designed to satisfy. It's not uncommon for people to approach the creation of metadata as a philosophical activity, attempting to define some kind of perfect universe for the things to be described. Metadata developed on theoretical, religious, or philosophical principles may be intellectually pleasing, but is unlikely to get the job done. Instead, the metadata that we find ourselves using every day is the metadata that we can use to accomplish some task. For example, figure 1 shows the earth.

Figure 2 is how we see the earth with the metadata of longitude and latitude.

The use of longitude and latitude is so familiar to us that it's almost easy to forget that the earth does not really have lines running along its axes. There are no lines marking points on the earth. Longitude and latitude were invented because these measurements were essential for the navigation of a vast ocean that provided no visual points of reference that humans could use. Longitude and latitude are a good example of constructed and constructive data. This metadata is also actionable; initially you had to have a clear sky and a sextant. Today we are fortunate to have sophisticated global positioning systems to tell us, with considerable accuracy, where on the planet we are currently located, yet these systems still use the planetary metadata that was developed over two thousand years ago.
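To make "actionable" concrete, here is a minimal sketch in Python of one thing a machine can do with two latitude and longitude pairs: estimate the distance between them. It assumes a perfectly spherical earth with a mean radius of 6,371 km, which is an approximation, and the function name is chosen here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the real earth is not a perfect sphere


def great_circle_km(lat1, lon1, lat2, lon2):
    """Great-circle distance (haversine formula) between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the way around the equator under the spherical assumption.
quarter_equator = great_circle_km(0.0, 0.0, 0.0, 90.0)
```

Nothing in the calculation refers to a physical feature of the planet; it works only because the coordinates are a constructed, agreed-upon grid.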

There are other navigation systems, however, that aren't based on longitude and latitude. As a matter of fact, in terms of earthly location they are fairly inaccurate. Yet, they serve their users.

Figure 3 is a typical subway map. If you were to superimpose this map over the city it represents, you'd find that the subway map isn't “true,” in the sense that it is neither to scale nor are the stations located where they would be on a map based on longitude and latitude. This, however, isn't a defect of the subway map, because that isn't the purpose or function of the map. The map is intended to help us navigate the subway lines, often underground. We need to know where to change from one line to another, and in which direction to take the train. These maps leave out a great number of details that a geographer would consider essential in a map of the area. And yet they perform their job incredibly well, to the point that one can arrive in a city for the first time, perhaps even with only a limited understanding of the local language, and find one's way. These maps are a good example of functionality in metadata.

Metadata can also serve the function of substituting for something we cannot otherwise work with. The examples in figures 4 and 5 (baseball statistics and a visualization of human DNA) show how metadata can represent an otherwise intangible thing or concept. In the case of the baseball statistics, this metadata makes it possible to characterize a game, a player, or even an entire season and to make comparisons from one such representation to another. If you've ever spent time with enthusiasts of the game, you know that this seemingly abstract reduction of the game to fractions and percentages can be every bit as real to those fans as the very game itself. This metadata, as opposed to the experience of the game itself, provides concrete measurements that can answer burning questions like who the best player on the team might be. As for the DNA example, although we can be sure that our genetic material is not composed of differently shaded ovals, the microscopic size of the genome makes any communication about it impossible without a contrived representation.

MetaDATA

While longitude and latitude were useful even in ancient times, today's metadata must be in a form that can be processed by computers, and the sense that it is “actionable” really needs to be interpreted as being “actionable by electronic machines.” Even when the final goal is to display the data to humans in an understandable form, the data will undergo some machine processing on the way to its destination on a screen or in printed form. This need to be manipulated by a computer puts constraints on how the metadata is constructed. Machine-actionable metadata, however, provides possibilities that cannot be achieved with pre–computer era metadata that was designed to be read and interpreted by humans. Take a look at the two maps in figures 6 and 7.

Although they cover the same area and have approximately the same features, the functionality a user can get from them differs greatly. The map in figure 6 is a printed road map. I can use it to find my way from one city to another by reading the map image. Beyond that, though, this map is essentially inert. The map in figure 7 looks much like the map in figure 6, but what we see here is only one possible display. The map in figure 7 has machine-actionable metadata behind it. That allows the addition of features and gives users the ability to reuse it in ways that cannot be done with the paper map. The paper map always looks the same, with the same information. The machine-actionable map, however, can be used to create any number of different images, such as one displaying all of the hotels in the downtown area (see figure 8) or one showing bicycle paths or walking tours. These features can be presented because they all make use of the underlying layers of metadata.

The details that make this map so useful are generally hidden from human users. Figure 9 is an example of those details from an open source map service.
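The idea of many displays drawn from one underlying data set can be sketched in a few lines of Python. The records below are invented sample data, and the field names (name, kind, lat, lon) are assumptions made for this illustration; they are not the schema of any real map service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    name: str
    kind: str   # hypothetical category label, e.g. "hotel" or "bike_path"
    lat: float
    lon: float


# Invented sample features; names and coordinates are placeholders.
FEATURES = [
    Feature("Harbor Hotel", "hotel", 42.36, -71.05),
    Feature("River Path", "bike_path", 42.37, -71.07),
    Feature("Station Inn", "hotel", 42.35, -71.06),
    Feature("Airport Lodge", "hotel", 42.45, -71.00),
]


def layer(features, kind, south, west, north, east):
    """Return the names of features of one kind that fall inside a bounding box."""
    return [
        f.name
        for f in features
        if f.kind == kind and south <= f.lat <= north and west <= f.lon <= east
    ]


downtown = (42.34, -71.08, 42.38, -71.04)  # south, west, north, east
hotels = layer(FEATURES, "hotel", *downtown)
bike_paths = layer(FEATURES, "bike_path", *downtown)
```

The same list of features yields a hotel display or a bicycle display depending only on the question asked, which is what the paper map cannot do.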

Not only can we create different displays when our metadata is in a machine-actionable form, but we are beginning to explore new possibilities in the ways that we can deliver the necessary information to the user. Since driving with a map on your lap and reading it while navigating the roads is far from ideal, new map services have developed that know where you are by using global positioning. Some even speak the directions to the driver, who then can follow them without taking her eyes off the road. This is an excellent example of basing functionality on the needs of the user and the context in which the data will be used.

Libraries and Metadata

It is fortunate for those who have the use of a library if their number is so small and their character so high that they can be admitted to the shelves and select their books on actual examination. As that is often not the case, a catalogue becomes necessary, and, even when it is the case, if the books are so numerous there must be some sort of guide to insure the quick finding of any particular book. The librarian can furnish some assistance, but his memory, upon which he can rely for books in general use, is of no avail for those which are sometimes wanted very much, although not wanted often.

—Charles Ammi Cutter2

Although the examples here are mainly about maps and navigating, the principles are the same when applied to other kinds of data, including bibliographic data. There is no question that libraries were among the earliest of social institutions to understand the function and value of metadata. There is evidence that even in the days of scroll-based libraries, some metadata was affixed to the end of each scroll on a tag that helped mark the location of the item when it was sought.3

Library bibliographic metadata has a number of functions: it acts as an inventory of the library's holdings; it aids in the discovery of those holdings in libraries large enough that the collection is not entirely known to the user; it acts as a surrogate for the item itself, which is often stored on a shelf with only its spine visible or in closed stacks. In addition, library cataloging practices over the years have developed methods for the identification of named persons, places, and topics.

Library metadata began as the library catalog, a finding aid for librarians and users. In the middle of the nineteenth century, a library catalog thinker like Charles Jewett had relatively limited requirements for the library catalog:

A catalog of a library is, strictly speaking, but a list of the titles of the books, which it contains.4

Later in that century, Charles Ammi Cutter saw the catalog not only as a list, but as a tool for answering information questions. Cutter had a lengthy set of questions that he wished the catalog to answer:

  • 1st. Has the library such a book by a certain author?
  • 2nd. What books by a certain author has it?
  • 3rd. Has it a book with a given title?
  • 4th. Has it a certain book on a given subject?
  • 5th. What books has it on a given subject?
  • 6th. What books has it in a certain class of literature?
  • 7th. What books have you in certain languages?
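Read with modern eyes, Cutter's questions look like queries over structured records. The following Python sketch is only an illustration under stated assumptions: the record fields (author, title, subjects, genre, language) are hypothetical names chosen here, not a library standard such as MARC, and the four records are a toy catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    author: str
    title: str
    subjects: tuple
    genre: str      # stands in for Cutter's "class of literature"
    language: str


# A toy catalog for illustration only.
CATALOG = [
    Record("Melville, Herman", "Moby Dick", ("Whaling",), "Fiction", "eng"),
    Record("Melville, Herman", "Typee", ("Marquesas Islands",), "Fiction", "eng"),
    Record("Cutter, Charles Ammi", "Rules for a Printed Dictionary Catalogue",
           ("Cataloging",), "Manual", "eng"),
    Record("Verne, Jules", "Vingt mille lieues sous les mers",
           ("Submarines",), "Fiction", "fre"),
]


def find(catalog, **criteria):
    """Return titles of records that match every criterion.

    The special key 'subject' matches if the value is among a record's subjects;
    every other key is compared for equality with the field of the same name.
    """
    hits = []
    for rec in catalog:
        matched = True
        for field, wanted in criteria.items():
            if field == "subject":
                matched = wanted in rec.subjects
            else:
                matched = getattr(rec, field) == wanted
            if not matched:
                break
        if matched:
            hits.append(rec.title)
    return hits


by_author = find(CATALOG, author="Melville, Herman")   # 2nd question
on_subject = find(CATALOG, subject="Cataloging")       # 5th question
in_french = find(CATALOG, language="fre")              # 7th question
```

Each question becomes a different set of criteria against the same records, where the card catalog needed a separately filed set of cards for each kind of access.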

These are especially impressive because there was not a technology, beyond the book or card catalog, to help libraries provide these services. The one advantage that the developers of the library catalogs had in that day was that these were the only functions that the catalog would address, and the interface (the card) would have only human readers as its users. Our requirements became more detailed in the twentieth century, both because of the growth of libraries and the need for new technologies to serve our users and also because of the increased complexity of library management and the need to automate many library tasks.

In 1876, when Cutter wished for a catalog that would answer the question “What books have you in certain languages?” he could not have anticipated the need to filter one's retrieved set by language in order to reduce the number of items retrieved from thousands to “only” three or four hundred. It is clearly no longer sufficient to limit searches to author, title, and subject only, and successful searching is definitely not achieved solely through an alphabetical list of headings. Narrowing down a search today is as important as retrieving catalog records representing the holdings of the library.

The phenomenon of “information overload” was a fact of life before computer systems became inexpensive enough to be used in institutions like libraries. Had it not been for the computer, it is unlikely that libraries could have even begun to handle the explosion of information resources that occurred in the second half of the twentieth century. To be sure, by that time the contents of libraries had long outgrown the memory capacity of librarians.

To help users navigate this much more populous and fluid information landscape, library catalogs have been adding functionality that Cutter would not have even dreamed of. Selecting “a few good books” out of a catalog of millions of items is something no user would have the time to do. To help users get to the right resources, libraries are adding facets to narrow searches; ranking results to show users the most likely items first; adding book covers, tables of contents, and reviews that will give the user more information about the item than the facts in the catalog record; and using other techniques. Libraries have also tried to find ways to integrate into the catalog information resources that have traditionally been treated as separate, such as abstracting and indexing services. All of these have put pressure on the catalog record, pushing it to perform functions it was not consciously designed to do.
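Faceting, one of the techniques just mentioned, can be pictured as counting the values of a field across a result set and letting the user choose one value to narrow the set. This is a minimal sketch in Python with invented placeholder results; the field names are assumptions, and real catalog systems do this inside a search index rather than in a list.

```python
from collections import Counter

# Placeholder search results; titles and field names are invented for illustration.
RESULTS = [
    {"title": "Title A", "language": "eng", "format": "book"},
    {"title": "Title B", "language": "eng", "format": "ebook"},
    {"title": "Title C", "language": "fre", "format": "book"},
    {"title": "Title D", "language": "eng", "format": "book"},
    {"title": "Title E", "language": "spa", "format": "book"},
]


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(r[field] for r in results)


def narrow(results, field, value):
    """Keep only the results whose field equals the chosen facet value."""
    return [r for r in results if r[field] == value]


language_facet = facet_counts(RESULTS, "language")
english_only = narrow(RESULTS, "language", "eng")
format_facet = facet_counts(english_only, "format")
```

The counts are computed from the records themselves, so they are only as good as the consistency of the underlying field values.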

Although the public catalog was designed to serve the user of the library, other information has always been used by librarians to manage the business of the library, including catalog production. Separate shelf lists and authority catalogs, rarely if ever seen by users, were an essential part of the management of libraries—especially of large libraries. Another type of catalog was used to track the receipt of serial issues, and yet another was needed for delivery and receipt of other materials. As these functions became automated, the catalog record ceased having a separate existence in the public catalog and became a part of library management systems. By the end of the twentieth century, the library record had to satisfy the needs of users, and in addition it had to provide support for a number of systems functions.

The integration of a variety of automated systems into a single library system has placed new demands on the record that represents the item held by the library, some of which are unrelated to satisfying user needs. The end result has been that the catalog record has taken on some system functions at the same time that it has had to respond to more complex user services. In addition to the purposes outlined by Cutter, library metadata has to interoperate with the library management data elements and systems functions, such as acquisitions and fund accounting, serials control and check in, and circulation systems.

Design for Sharing

The rise of a new information environment—the World Wide Web—has revealed the downside of the long history that libraries have with metadata. Library metadata methods were developed long before the advent of computer processing of metadata, and therefore library metadata, like the printed map in figure 6, was designed to be read and interpreted by human beings without any intervention by machines. It also was designed to basically stay the same throughout its existence, not to be recombined with other data.

In spite of this legacy of pre-computer practice, the question that we must face, and that we must face sooner rather than later, is how we can best transform our data so that it can become part of the dominant information environment that is the Web. This is a radical change in the context for library metadata, yet it is a logical extension of the design for sharing that has been a principle of library cataloging.

An important function of modern cataloging has been the sharing of catalogs and cataloging between libraries. The cataloging rules of the nineteenth and twentieth centuries evolved from institution-specific rules to a modern concept of a widely used standard for sharable data that would facilitate the exchange of catalog information between libraries. In the nineteenth century, libraries printed book catalogs that could be given or sold to other libraries. Users could consult these catalogs to discover works held by other libraries. This was perhaps the first phase of remote access to library catalogs. The book catalog was portable and could be issued in multiple copies. Unfortunately, it was expensive to produce, since printing in those times meant setting type, and it quickly fell out of date. Adding new entries meant either issuing supplements outside of the order of the main catalog or reprinting the entire catalog with the new content inserted in its proper order.

The card catalog solved the update problem that the book catalog had suffered: new entries could be added anywhere in the catalog in their correct place by interfiling cards. However, the card catalog was not reproducible, so it was no longer possible to distribute copies of catalogs to other libraries.

This isolation of the library card catalog remained a problem for about one hundred years. It was only when the physical cards became electronic records in a database, and that database was connected to a global network, that libraries were able to achieve both goals: flexibility of update and remote access. Groups of libraries using the same data standards and cataloging rules were able to create union catalogs representing the holdings of multiple libraries. One such union catalog, WorldCat, has achieved distinction as the world's largest database of library bibliographic data and holdings information.

Sharing of data among libraries has created great efficiencies in catalog production, and it has also expanded the available universe of resources for library users. Libraries remain, however, as an information environment separate from the Web. This makes a difference because the Web is where the majority of information seekers live, work, and play. It is also increasingly the environment where new information is created. Many information resources developed today will never be published in the traditional print-on-paper sense of that term. Users have less and less incentive to leave the Web and enter the library, either physically or by visiting a library catalog online.

The important question now is: how can the library catalog move from being “on the Web” to being “of the Web”? The linked data technology that has developed out of the semantic Web provides an interesting path to follow. It is specifically designed to facilitate the sharing of information on the Web, much in the way that the Web itself was developed to allow the sharing of documents. The library must become intertwined with that rich, shared, linked information space that is the Web. Rather than creating data that can be entered only into the library catalog, we need to develop a way to create data that can also be shared on the Web. This requires that we expand the context for the metadata that we create.
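The core idea of linked data can be sketched without any special software. Statements are written as subject, predicate, object triples whose parts are identified by URIs, so data from different sources can be combined simply by pooling the statements. In the Python sketch below, the Dublin Core and FOAF property URIs are real vocabulary terms, but everything under example.org is a placeholder invented for this illustration, and a plain set of tuples stands in for a real triple store.

```python
DCT = "http://purl.org/dc/terms/"     # Dublin Core terms namespace
FOAF = "http://xmlns.com/foaf/0.1/"   # FOAF namespace
EX = "http://example.org/"            # placeholder namespace (hypothetical identifiers)

# Statements a library might publish.
library_data = {
    (EX + "work/moby-dick", DCT + "title", "Moby Dick"),
    (EX + "work/moby-dick", DCT + "creator", EX + "person/melville"),
}

# Statements from elsewhere on the Web that use the same identifier for the person.
web_data = {
    (EX + "person/melville", FOAF + "name", "Herman Melville"),
    (EX + "essay/american-lit", DCT + "subject", EX + "person/melville"),
}

# Merging is just set union, because both sources share identifiers.
graph = library_data | web_data


def objects(graph, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def subjects(graph, predicate, obj):
    """All subjects of statements with the given predicate and object."""
    return {s for s, p, o in graph if p == predicate and o == obj}


# Follow the links: essay -> person it discusses -> works by that person -> titles.
people = objects(graph, EX + "essay/american-lit", DCT + "subject")
works = {w for person in people for w in subjects(graph, DCT + "creator", person)}
titles = {t for w in works for t in objects(graph, w, DCT + "title")}
```

The walk from an essay to a library's works succeeds only because both data sets name the author with the same URI; that shared identifier is what separates being "of the Web" from merely being "on the Web."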

We are fortunate in the sense that we are in a position of having a large body of data that has been developed with sharing in mind, and also that the early developers of library cataloging codes, such as Anthony Panizzi, understood the value of consistency and the application of rules. Because of this situation, we are better positioned than some professions to redefine our data to be used in a complex and rich data environment such as the World Wide Web.

The Web as Context

The library catalog has been the sole context for library data since its inception. It is not a coincidence that we call the creation of library bibliographic data “cataloging,” that is, the creation of the catalog. The result has been a uniform set of metadata designed for the catalog's purposes: identifying the library's holdings, supporting management of those holdings, and providing entry and discovery points for librarians and nonlibrarian users.

There is an unmistakable need for libraries to know what they own as well as the current whereabouts of each item in their inventory, and the catalog is the basis for these functions. The use of the library catalog by information seekers, however, is diminishing, by all accounts. When journal article information became available online as a library service, users jumped at the chance to have easy access to this data, and soon more searches were being done in these databases than in the library's traditional catalog.6 It's not that one resource replaces the other, but that users have a finite amount of time and attention; new information sources that gain favor take up a certain quantity of the users’ information-seeking energies. Regardless of the inherent value of library-owned materials, there are only twenty-four hours in a day, and the time for study, research, and recreation does not expand as more information becomes available. The famed “information overload” is a time problem.

For a variety of reasons, users favor the Web as an information platform over the library. Studies show that users like the simple search options, and in particular they are pleased by the instant gratification that moves them directly from search to resources without having to even move their fingers from the keyboard. They also find great value in the social aspects of the Web, not so much for finding dates for Friday night, but in getting an idea of which resources might be best for them. One can question the quality of the ranking that users are presented with, but rather than face many screens of undifferentiated results, users are grateful that Internet search engines give them ranked results. The ranking is based on algorithms that are trade secrets, but the user knows that the first page is what nearly “everyone” would consider to be the key resources for their keyword query. When looking for a “good read,” a search on Amazon will turn up the best sellers out of the retrieved set. Services like Facebook, YouTube, and Flickr all allow users to create and view popularity ratings for resources and to write comments and reviews. All of these help users select from among a large number of retrieved items.

There are a number of social networking sites organized around books, such as LibraryThing, Goodreads, and BookMooch, each a kind of MySpace for the bookish set. In some cases the data has been derived from library bibliographic records, but it is just as likely to have come from nonlibrary sources such as Amazon.com. Amazon gets its data from publishers and booksellers, not from libraries. Some sites, such as Google Book Search, combine data from a variety of different sources, merging some descriptive data from libraries with the marketing data received from publishers (blurbs, author biographies).

Across these sites and many others, the Web is virtually awash in bibliographic data, and users who frequent certain Web sites are accustomed to seeing bibliographic data in contexts far from the library catalog. The New York Times bestseller list is online, as are the Web sites of publishers and authors. Libraries may have the greatest number of titles and the rare materials, but there is plenty of overlap in content between the library and Web, and between the library catalog and information on the Web. In addition, there is nonbibliographic data that could be related to bibliographic data. For example, the name “Herman Melville” and the fact that he wrote Moby Dick are facts that are not limited to the data in library catalogs; they are also found in encyclopedias, online discussions of American literature, and the course reading lists of classes of colleges and universities that can be found online.

Although there is an overlap of data, there is very little direct connection between the library catalog and the Web. Bibliographic citations online, such as those in the reference sections of Wikipedia entries, may link to a library's holdings. For example, if you retrieve bibliographic data, perhaps on Google, Open Library, or Goodreads, that represents a book, you can use that as a launch point to find the book in a library by using WorldCat. You can't, however, move easily from a statement in an essay about Abraham Lincoln to a list of books about Lincoln, much less a list of relevant books in your local library (let alone a list of resources that are on the shelf and currently available). Imagine if an online search on J. K. Rowling or Harry Potter could become an entry point into the library, and the visibility that could provide for libraries.

In return, library data could enrich bibliographic entries on the Web. Libraries are the only community with control over names, distinguishing between authors with the same or similar names and bringing together variant name forms. The addition of birth and death dates, once needed only to disambiguate similar names, is now essential information for an analysis of copyright status. Library data also facilitates the gathering of different editions around the concept of a work through the use of uniform titles. All told, the data that exists today in library catalogs could enhance the Web.
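Name control of this kind can be pictured as a lookup from every known variant of a name to one preferred heading. The Python sketch below is an illustration only: the headings are written in the style of library name headings but are not copied from any authority file, and the normalization step is deliberately simplistic.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


# Preferred heading -> variant forms (illustrative entries, not a real authority file).
AUTHORITY = {
    "Twain, Mark, 1835-1910": {
        "Clemens, Samuel Langhorne, 1835-1910",
        "Mark Twain",
        "Samuel Clemens",
    },
    "Melville, Herman, 1819-1891": {
        "Herman Melville",
    },
}


def build_index(authority):
    """Map every normalized preferred or variant form to its preferred heading."""
    index = {}
    for preferred, variants in authority.items():
        index[normalize(preferred)] = preferred
        for variant in variants:
            index[normalize(variant)] = preferred
    return index


INDEX = build_index(AUTHORITY)


def resolve(name):
    """Return the preferred heading for a name, or None if the name is unknown."""
    return INDEX.get(normalize(name))
```

A Web site that stored only the string "Samuel Clemens" could use such a table to bring its entries together with those filed under the pen name.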

Change Happens

The need to change does not mean that what you are doing is wrong. Instead, it often means that something in your environment has changed, something that you cannot control. The change addressed by library cataloging pioneers like Panizzi and Cutter was that as the rate of publishing was greatly increasing, scholars and readers could no longer know everything that was available. The catalog was needed to help these users. At one time, the idea of a search by topic was unheard of, but it became necessary for catalogs to address so that users could find unknown items without the help of a librarian (“Give me a good book on …”). The change that we must address is that the Web is increasingly the source of information for searchers and researchers, and that the library needs to be interconnected with that web of data.


Notes
1. Great Britain, Parliament, Parliamentary Papers (Commons), 31 January–15 August 1850, vol. 33 (Accounts and Papers, vol. 1), No. 425, 1850, “Communications Addressed to the Treasury by the Trustees of the British Museum, With Reference to the Report of the Commissioners Appointed to Inquire Into the Constitution and Management of the British Museum,” 247, quoted in Jon R. Hufford, “The Pragmatic Basis of Catalog Codes: Has the User Been Ignored?” Texas Tech University, Libraries Faculty Research, 2007, http://dspace.lib.ttu.edu/bitstream/handle/2346/510/fulltext.pdf?sequence=1 (accessed Nov. 28, 2009).
2. Charles Ammi Cutter, “Library Catalogues,” in Public Libraries in the United States of America. Their History, Condition, and Management: Special Report, Department of the Interior, Bureau of Education, Part I, 526–622 (Washington, DC: Government Printing Office, 1876), 526.
3. Charles Jewett, quoted in Henry Petroski, The Book on the Bookshelf, 1st ed. (New York: Alfred A. Knopf, 1999), 28.
4. Charles Coffin Jewett, On the Construction of Catalogues of Libraries, 2nd ed. (Washington, DC: Smithsonian Institution, 1853), 10.
5. Cutter, “Library Catalogues,” 527.
6. Rosalie Lack and John Ober, California Digital Library: Key Indicators of Collections and Use, July 1, 2001–June 30, 2002. (Oakland, CA: California Digital Library, 2002). Available at www.cdlib.org/about/publications/fy01-02cdl_statsprofile.pdf (accessed Nov. 28, 2009).

Figures

[Figure ID: fig1]
Figure 1 

Map of the earth with no metadata.



[Figure ID: fig2]
Figure 2 

Map of the earth with metadata: latitude and longitude.



[Figure ID: fig3]
Figure 3 

Boston subway map.



[Figure ID: fig4]
Figure 4 

Baseball as metadata [source: www.baseball-reference.com].



[Figure ID: fig5]
Figure 5 

DNA as metadata [source: http://genomics.energy.gov].



[Figure ID: fig6]
Figure 6 

Printed map.



[Figure ID: fig7]
Figure 7 

Online map.



[Figure ID: fig8]
Figure 8 

Google Map with hotels.



[Figure ID: fig9]
Figure 9 

OpenStreetMap.org display of details.



Article Categories:
  • Information Science
  • Library Science


Published by ALA TechSource, an imprint of the American Library Association.