LRTS: Vol. 57 Issue 4: p. 213
Quality Issues in Vendor-Provided E-Monograph Records
Stacie Traill

Stacie Traill is Cartographic and Electronic Resources Cataloger, University of Minnesota Libraries, Minneapolis, Minnesota; trail001@umn.edu
Early versions of the study results reported in this paper were presented at the ALCTS Cataloging and Metadata Management Section Cataloging and Classification Interest Group meeting at the 2012 American Library Association Midwinter Meeting and at the ALCTS Publisher-Vendor-Library Relations Interest Group meeting at the 2012 American Library Association Annual Conference.

Abstract

As e-book batchloading workloads have increased, the quality of vendor-provided MARC records has emerged as a major concern for libraries. This paper discusses a study of record quality in e-monograph record sets undertaken at the University of Minnesota with the goal of improving and increasing the efficiency of preload editing processes. Through the systematic analysis of eighty-nine record sets from nineteen different providers, librarians identified the most common errors and their likely effect on access. They found that while some error types were very common, specific errors are often unique and complex, making it difficult to devise a set of broadly applicable strategies to correct them. Based on these results, the author identifies future challenges for maintaining quality in batchloaded record sets and suggests several possible directions for improving record quality.


As libraries expand their electronic collections, many find that the most effective and practical means of providing catalog access to these collections is batchloading vendor- or publisher-provided MARC records into local catalogs. As batchloading becomes more common and libraries share their experiences, certain themes have emerged in the discussion. One is the challenge of incorporating batchloading into existing technical services and systems workflows. Another is navigating the mechanics of record editing and loading processes. A third strand running throughout the batchloading literature is the issue of record quality. General discussions of the topic usually include at least a brief treatment of concerns about record quality, and most case studies of batchloading projects identify multiple quality issues found and addressed as part of the project.

At the University of Minnesota Libraries (UML), experiences have been much the same as those at other institutions. Librarians learned how to manipulate MARC records in batch and determined how to train staff and design workflows to accommodate batchloading. However, poor record quality continued to trouble catalog and authority control librarians, and years of providing feedback on record quality to vendors had yielded mostly discouraging results. Although librarians had largely mastered the processes for correcting certain kinds of critical problems, dramatic increases in batchloading work indicated a strong need for more efficient and systematic batch editing processes. To that end, technical services managers charged a small group, two catalogers and one systems librarian, with identifying the most common issues, their prevalence, and their effect on access, with the goal of creating a streamlined set of local guidelines for batch editing MARC records for e-resources. Managers wished to understand which areas of the record required careful checking and which data could safely be assumed acceptable most of the time. Catalogers also wished to identify and track problems that were uncorrectable at the point of initial editing and loading but that were candidates for later maintenance, update, or enhancement. To address those questions, catalogers initiated a systematic study of record quality in vendor-provided e-resource records. This paper describes how catalogers analyzed record sets, outlines the results of their analysis with detailed descriptions of many of the errors discovered, and discusses how the study’s findings affected batchloading workflows at UML. The author also enumerates several challenges to maintaining quality in batchloaded records and anticipates future challenges and opportunities arising from evolving cataloging standards and library discovery tools.


Literature Review

Record quality is a frequent topic in the literature on e-books and the batchloading of e-book record sets. Wu and Mitchell provided a detailed overview of issues surrounding mass management of e-book records.1 One major quality issue they discussed at length is the inconsistent use of identifiers, particularly in the context of the provider-neutral record. Wu and Mitchell also noted that cataloging standards varied widely between record providers and that the adoption of the provider-neutral record standard by record providers had been slow.

Luther’s overview of the universe of book metadata (including e-books) discussed the myriad purposes served by book metadata and serves as a useful reminder that library standards do not meet the needs of all communities.2 This is important context for her discussion of metadata quality, which alluded to how varying purposes can explain differing quality standards on the parts of publishers, vendors, and libraries. Luther emphasized the difficulty of measuring metadata quality: “In the current discovery environment, it is difficult to measure what is not found and extremely difficult to quantify the impact and cost of poor, incomplete, or missing metadata on business and collection analysis decisions that ultimately affect consumers.”3

Minčić-Obradović summarized the state of bibliographic control for e-books.4 Her chapter in a 2011 monograph includes a brief discussion of two frequently observed quality issues in vendor-provided records: misleading identifiers and invalid MARC coding. Offering an example of quality improvements in records from a specific publisher, Minčić-Obradović discussed the positive effects on the quality of Springer’s MARC records after the publisher contracted with OCLC to replace them.

In a 2007 article reporting the results of a survey of how academic libraries provided web access to e-book collections, Dinkelman and Stacy-Bates discussed the importance of providing catalog access to e-books, emphasizing the importance of making a simple, format-based search limit available for e-books.5 Although the authors found that 94 percent of libraries surveyed provided this type of limit, they cited record quality issues as a barrier to creating consistent, reliable format limits in catalogs.

Rossman, Foster, and Babbitt offered a broad overview of MARC record and catalog access issues for e-books.6 In their list of questions librarians should routinely ask vendors about MARC records, the authors identified many quality concerns: use of authority control, presence of Library of Congress Subject Headings and call numbers, specificity of subject terms, presence of table of contents notes, and availability of corrected and updated records.

In a pair of papers on the topic of batchloading issues and practices in academic libraries, Mugridge and Edmunds addressed record quality from two slightly different angles. In their 2009 overview of batchloading advantages, challenges, and workflows, the authors noted the difficulties inherent in balancing record quality and timely improvement to access.7 They observed that few record sets are perfect and that some errors are difficult or impossible to correct during preload editing. In their 2012 survey of batchloading practices in large research libraries, Mugridge and Edmunds reported on the effects of batchloading work on staffing, workflows, and quality.8 They found that 76.5 percent of survey respondents had rejected record sets because of quality issues. Some of the reasons respondents gave for rejecting record sets included lack of authority control or subject access, bad data that would have been difficult or impossible to resolve through automation, incomplete title fields, character encoding errors, right-to-left text orientation errors, records lacking unique identifiers, nonstandard cataloging practices, and invalid URLs.

Two of the themes of Mugridge and Edmunds’ work recur in several case studies that discuss specific record quality issues libraries found in preparing and loading records from a particular provider or collection: serious concerns about poor or nonexistent authority control in vendor-provided records and the sentiment that minimal-level access is preferable to having no access at all. Martin and Mundle described the process of editing and loading e-book records for a collection of Springer e-book titles at the University of Illinois at Chicago.9 They outlined strategies for record review and the types of problems they found, noting that many record-quality issues were “enduring and difficult to solve.”10 In addition to the presence of name and subject headings in unauthorized forms, major quality issues they found included bad and nonfunctional URLs and the presence of print version identifiers.

Beall described a similar project in which 100,000 low-quality records for freely available e-books were loaded into the University of Colorado Denver’s local catalog.11 He noted several issues with the initial quality of the records, particularly in the realm of authority control, many of which arose because the records had been derived from non-MARC metadata: qualifiers and dates were missing from name headings and all subfields other than subfield $a were missing from subject headings. Beall discussed the effect of missing or bad data on the catalog, including split heading files and problems with diacritics, but concluded that some catalog access was better than no access.

Sanchez, Fatout, and Howser described the analysis and cleanup of NetLibrary records in preparation for loading into the catalog at Texas State University-San Marcos.12 The authors observed numerous quality issues based on deviation from established in-house cataloging standards. Although the authors were able to resolve many problems before loading, they noted some ongoing authority control issues.

Authority control in batchloaded records is the central concern of Finn’s article, in which she described how the Newman Library at Virginia Tech conducts authority control processing before batchloads are completed.13 Finn noted that the quality of record sets varies widely and that authority control problems are very common.

Preston wrote about the OhioLINK Database Management & Standards Committee’s (DMSC) cooperative e-book cataloging projects.14 While this was a case study of a manual e-book cataloging project rather than a batchloading project, Preston noted that “concerns about bibliographic record accuracy, retrievability, and adherence to cataloging standards”15 were among the reasons that DMSC opted not to use vendor-supplied records. These concerns included a lack of Library of Congress Subject Headings (LCSH) and Medical Subject Headings (MeSH), name headings not in authorized forms, the presence of print version ISBNs, serials cataloged as monographs, and the cataloging of reproductions (before 2009) as if they were born-digital.

Record quality is a concern for libraries well beyond the realm of e-books and batch processing. Several papers that discussed quality standards for catalog records and metadata more generally are helpful in providing a broader context for the present study. Studies of quality in traditional cataloging offer an interesting point of comparison. In a 2005 survey of academic libraries, Lam found that the vast majority of respondents viewed the quality of outsourced cataloging as generally good in terms of accuracy, consistency, adequacy of access points, and timeliness.16

El-Sherbini evaluated the quality of Program for Cooperative Cataloging (PCC) BIBCO records in the Ohio State University (OSU) catalog. Like many libraries, OSU uses the services of an authority control vendor (in this case, Backstage Library Works) to verify and correct headings. El-Sherbini analyzed the changes made by the authority control vendor during post–cataloging authority processing.17 She found that the majority of corrections could be viewed as minor and did not affect catalog retrieval, including changes to punctuation, diacritics, and spaces. El-Sherbini also identified corrections that might affect access, including indicators, subfields and delimiters, tags, spelling errors, and forms of subject headings. She found that a very small number of records were affected by these issues and concluded that the overall quality of PCC records was high.

Discussions of metadata quality outside the realm of traditional cataloging also have some relevance for quality evaluations of MARC metadata. Bruce and Hillmann proposed a set of broadly relevant metadata quality measurements and metrics: completeness, accuracy, provenance, conformance to expectations, logical consistency and coherence, timeliness, and accessibility.18

In a 2008 paper, Hillmann compared quality evaluation for non-MARC metadata to that for MARC metadata.19 She noted that most problems identified in quality studies of MARC records were either typographical errors or outdated headings. Hillmann argued that quality criteria for non-MARC metadata should not be assumed to mirror those for MARC metadata but should “instead be based on criteria more closely tied to the functionality sought for applications using metadata,” meaning that there is “no one answer to the quality question.”20

Finally, some recent literature inquires more broadly into the concepts of record quality and quality measurement. In a 2008 article, Bade discussed the concept of a “perfect bibliographic record,” observing that it is hard to define record quality in any absolute sense.21 The author suggested that libraries should consider the following in developing quality criteria: “1. What data elements are useful for the kind of library research performed here in this particular institution? 2. How much, and which elements of that necessary information can this institution afford to support?”22

Hider and Tan examined how catalog record quality might be assessed through research into catalog use.23 The authors proposed that quality can be assessed either “impressionistically” or “systematically,” or through a combination of both approaches.24 Impressionistic assessment relies on catalog users’ self-reported behaviors and preferences while systematic assessment relies on algorithmic or expert evaluation of user behavior and errors in bibliographic records. The authors noted that standardization is a key element in catalog effectiveness. Through survey results, Hider and Tan found that both libraries and library patrons believed that most elements of catalog records were useful for identification and selection. They concluded with a call for “evidence-based cataloging,” in which localized and detailed evidence provide the means to measure the effectiveness of cataloging practices.25


Method

The project group devised evaluation rubrics based on two widely adopted current standards for e-book records: the Program for Cooperative Cataloging’s (PCC) MARC Record Guide for Monograph Aggregator Vendors,26 and the PCC’s Provider-Neutral E-Monograph MARC Record Guide.27 Based on these documents, two checklists were created: one that included specific fields with PCC and local expectations for content in each field (appendix A), and one that listed generic issues in conflict with PCC and local standards that staff had identified while working with record sets before the formal beginning of the study (appendix B). During the analysis, catalogers also maintained a list of specific problems found in individual records. Finally, original, unedited files for each record set were archived for later reference.

Catalogers evaluated record sets using MarcEdit and Excel as part of normal preload editing processes. Some problems were identified whenever they were present: problems that affected all records in a particular set, and certain critical errors affecting a subset of records, such as missing URLs. The fields and values that received this level of analysis are marked in the “Full check” column of the specific field checklist. Other problems typically did not affect all records in a set, such as errors in authorized forms of name and subject headings or simple typographical errors; catalogers identified these through selective spot checks of individual records, and those fields and values are marked in the “Spot check” column. “Full check” fields and values were those that could be checked programmatically with relative ease, while “spot check” fields and values required the cataloger to review individual records. Between July 2011 and August 2012, catalogers analyzed eighty-nine record sets from nineteen different providers, with the number of records per set ranging from a handful to several thousand. Most sets had between 100 and 1,000 records. Most record sets were for e-books, but some sets for monographic electronic items in other formats were included, such as scores, sound recordings, and video recordings.
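
A “full check” of this kind lends itself to simple scripting. The following is a minimal sketch, assuming the open-source pymarc library and a hypothetical file name for the set; the specific tallies shown are illustrative rather than a record of UML’s actual tooling.

    # Sketch: tally set-wide problems in a vendor record set (assumes pymarc).
    from collections import Counter

    from pymarc import MARCReader

    tallies = Counter()
    with open('vendor_set.mrc', 'rb') as fh:  # hypothetical file name
        for record in MARCReader(fh):
            tallies['records in set'] += 1
            if not record.get_fields('856'):
                tallies['missing 856 (URL)'] += 1
            if not record.get_fields('001') and not record.get_fields('035'):
                tallies['missing unique identifier'] += 1
            if not record.get_fields('050', '090'):
                tallies['missing LC class number'] += 1
    for issue, count in tallies.items():
        print(f'{issue}: {count}')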

Catalogers divided the problems they discovered into three categories: errors or omissions that could affect access (e.g., missing or incorrect access points, identifiers, or linking entry fields); errors that were unlikely to affect access (e.g., erroneous physical description, misleading 5XX notes); and critical errors, those errors which required resolution before records could be loaded (e.g., MARC encoding problems, missing or bad URLs/URIs). Catalogers also noted usage of obsolete coding and field tags. Table 1 shows how librarians categorized the various types of errors.

Some error types within each category are more serious than other types. The seriousness of the error does not necessarily correlate with the level of effort necessary to correct it, as the discussion of findings will demonstrate.

Findings

All of the eighty-nine record sets exhibited at least one error. About one-fifth displayed critical errors, and the vast majority displayed at least one access error. Only a few sets exhibited nothing but “other errors,” those deemed unlikely to affect access.

Given the large number of sets exhibiting both access errors and other errors, most sets clearly contained more than one type of error. Thirteen sets showed all three types of errors.

Discussion of each error category follows, along with some of the more notable and interesting specific errors and the steps catalogers took to correct them.

Critical Errors

This category contained errors that were “showstoppers,” problems that meant the records could not be loaded without correction. Many of these were MARC coding errors that would affect indexing. In one set, no indicators were present in any MARC field of any record. The set was large enough that making the corrections locally was not feasible, and the library did not load the set until the vendor corrected the errors. In another set, most indicators had been replaced by punctuation marks, a problem that again appeared in every record in the set. Catalogers and systems staff could not determine exactly what might have caused this issue, so correcting it was challenging. A third set contained a large number of seemingly random invalid MARC field tags, indicators, and subfield values, present in about 30 percent of the records in the set. The only way to correct these problems was to fix each individual record. Since this was a relatively small set (fewer than 200 records), doing so was possible, but in a much larger set, making such corrections would involve an inordinate amount of time and effort.
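
Detecting structural damage of this kind is also straightforward to script. The sketch below, again assuming pymarc and a hypothetical file name, flags nonnumeric tags and indicators that are neither digits nor blanks; it only reports problems, since the appropriate repair varies from set to set.

    # Sketch: flag invalid tags and indicators (assumes pymarc).
    import re

    from pymarc import MARCReader

    VALID_INDICATOR = re.compile(r'^[0-9 ]$')
    with open('vendor_set.mrc', 'rb') as fh:  # hypothetical file name
        for position, record in enumerate(MARCReader(fh), start=1):
            for field in record.get_fields():
                if not re.fullmatch(r'[0-9]{3}', field.tag):
                    print(f'record {position}: invalid tag {field.tag!r}')
                elif not field.is_control_field():
                    for ind in field.indicators:
                        if not VALID_INDICATOR.match(ind):
                            print(f'record {position}: field {field.tag} '
                                  f'has invalid indicator {ind!r}')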

Other critical errors affected only a small number of records in each set. In one set, 8 out of more than 700 total records were missing any kind of system number or unique identifier. Although supplying locally devised identifiers was a simple solution to the immediate problem, this is a less-than-ideal choice in view of longer-term maintenance, which often requires using the unique identifier as a match point when records need to be selectively deleted or overlaid with updated versions. In a handful of other sets, the length of one or more records exceeded 22,000 bytes, the record size limit of the library’s ILS. These sets all consisted of records for either online sound recordings or video recordings, formats for which longer records are common. In each of these cases, however, the excessive record lengths were the product of poor cataloging choices: a number of loosely related titles had been combined in a single bibliographic description. These records had hundreds of 7XX fields and URLs, making them unusable in most library catalogs. Librarians had no choice but to remove the problem records from the sets before loading and to report the issue to the record providers. Catalogers decided that the only real option for providing meaningful access to these titles was to manually catalog each separate work included in the problem records.
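
Oversized records can be caught programmatically before a load fails. A short sketch under the same pymarc assumption, with the 22,000-byte ceiling taken from the ILS limit described above:

    # Sketch: find records exceeding the ILS record-size limit (assumes pymarc).
    from pymarc import MARCReader

    MAX_BYTES = 22000  # local ILS limit, per the discussion above
    with open('vendor_set.mrc', 'rb') as fh:  # hypothetical file name
        for position, record in enumerate(MARCReader(fh), start=1):
            size = len(record.as_marc())  # serialized length in bytes
            if size > MAX_BYTES:
                title = record['245']
                print(f'record {position}: {size} bytes;',
                      title.value() if title else '[no 245]')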

Finally, missing, broken, or misleading URLs also qualified as critical errors. Some URL problems affected every record in a given set, while others were specific to individual records. In two sets, all of the URLs were badly formatted and nonfunctional. A brief investigation into the structure of title-level “permalinks” given on the provider’s website yielded an easily implemented fix for the problem. In the URLs for two other sets, the presence of unencoded non-ASCII characters caused link failure in local systems. Properly encoding the URLs solved the problem. While correcting the problems in these cases was not difficult, the corrections were only successful because the existing URLs were “mostly correct,” and their errors fell into recognizable patterns. In another set, all URLs were bad, but catalogers could not identify a pattern of errors common to all of the records that would have made batch correction possible. The only solution in this case was to correct the URLs one at a time. Finally, one set lacked URLs entirely. The records in this set had clearly been derived from records for print versions of the books, but the provider had neglected to add links before distributing the records. This set was reported to the provider for repair and reissue.
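
Where bad URLs follow a recognizable pattern, the repair itself can often be expressed in a few lines. For the unencoded non-ASCII case described above, a minimal sketch using only the Python standard library (the example URL is invented):

    # Sketch: percent-encode non-ASCII characters in otherwise valid URLs.
    from urllib.parse import quote

    def encode_url(url: str) -> str:
        # The safe list preserves characters legal in URLs, so only the
        # problematic bytes are escaped (UTF-8 percent-encoding by default).
        return quote(url, safe=":/?#[]@!$&'()*+,;=%~-._")

    print(encode_url('http://example.com/título/123'))
    # prints: http://example.com/t%C3%ADtulo/123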

In two sets, most URLs were present and accurate, while a very small number of records lacked URLs entirely. In two other sets of provider-neutral records—which included multiple URLs for various providers—a small number of records in each set had no URL remaining once catalogers had removed links for providers to which the library did not have access. In all of these cases, identifying those records and supplying URLs manually before loading was a relatively simple, though time-consuming, matter.

Errors that May Affect Access

Errors that had the potential to affect discovery and retrieval made up this category. While the presence of such errors would not prevent records from being loaded, catalogers felt that these errors should be corrected before loading if possible, or noted for possible post–load correction if not. Although many of these errors were simple typographical errors in access points, a number of other subcategories emerged over the course of the analysis: these included problems with identifiers, crosswalking and record derivation errors, misapplication of cataloging rules, MARC coding errors, and omissions or inconsistencies likely to result in misleading or incomplete catalog retrieval.

One of the most common types of access errors was incorrect use of identifiers. A number of other studies on batchloading and e-book records have addressed the difficulties in ensuring that each record in a set has at least one accurate identifier correctly coded, and the problems that can arise when records contain bad identifiers. In Wu and Mitchell’s 2010 article on batch management of e-book records, they noted that “lack of a reliable identifier to collocate equivalent manifestations on an automated basis” is “a significant obstacle to full adoption of the provider neutral standard at the local level.”28 Martin and Mundle cited confusion between print version identifiers and e-version identifiers as a substantial problem that blocked loading and caused overlay hazards in batchloading at their institution.29

In a large number of sets, ISBNs for both print and e-book versions were coded in the MARC 020 subfield $a. In cases where ISBN qualifiers were routinely supplied, this could be corrected in batch with a high degree of confidence; when no qualifiers were present, correcting the problem was very difficult. Many sets also included OCLC numbers for print version records when e-version records had been derived from those records. This is obviously problematic for reporting holdings to OCLC and for any kind of batch maintenance that relies on accurate OCLC numbers. Omissions of various identifiers also occurred frequently. In a few sets, linking entry fields (MARC 776) were present on at least some records, but they did not include an identifying number, or they included identifiers for multiple discrete bibliographic entities (e.g., ISBNs for both print and e-books in the same 776 field).
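
When qualifiers are routinely supplied, the recoding can be approximated in batch. The sketch below operates on MarcEdit’s mnemonic text (.mrk) export of a set; the print-qualifier strings are assumptions that would need to be matched against what a given vendor actually supplies.

    # Sketch: move print-qualified ISBNs from 020 $a to 020 $z (.mrk lines).
    import re

    PRINT_QUALIFIERS = ('print', 'hbk', 'pbk', 'hardback', 'paperback', 'hardcover')

    def recode_print_isbns(line: str) -> str:
        if not line.startswith('=020'):
            return line
        def fix(match):
            data = match.group(1)  # the ISBN plus any qualifier
            code = 'z' if any(q in data.lower() for q in PRINT_QUALIFIERS) else 'a'
            return '$' + code + data
        # each $a subfield runs from the delimiter to the next $ or end of line
        return re.sub(r'\$a([^$]*)', fix, line)

    print(recode_print_isbns(r'=020  \\$a9780123456789 (hbk.)'))
    # prints: =020  \\$z9780123456789 (hbk.)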

Another type of access error was present in a handful of sets where vendors had generated MARC records by crosswalking, or converting, metadata used in their internal systems into MARC. These are not cataloging errors per se since the original non-MARC records presumably conformed to the vendor’s own standards, but rather issues resulting from the imperfect translation of the original metadata to MARC that could inhibit access in a MARC-based catalog. Catalogers saw several examples of this. In one set, geographic data that was present in the provider’s internal metadata (which they had also made available) was not present in the MARC records based on that metadata, even though it could have been mapped to a MARC 043 field (or perhaps to a geographic subdivision of a subject heading). In another set, all subject and descriptor terms were from unspecified, presumably internal, controlled vocabularies. To complicate matters, each term was preceded by an alphanumeric code that was meaningless outside the provider’s internal repository. Subject terms in this set were also both too specialized and not descriptive enough for a general catalog, including very specialized discipline-specific terms and lacking more general relevant terms from LCSH or MeSH. Finally, in all of the record sets that fell into this category, name headings did not appear in authorized forms. Although automated or outsourced authority processing could be expected to correct many of these, a large number of headings would either be changed in error or would remain unmatched and uncorrected by these methods.

The derivation of e-version records from older print version MARC records, a process that can produce similar (if less severe) problems to crosswalking from other metadata schemes, resulted in a related type of error. In several sets consisting primarily of materials published and cataloged in the pre–Anglo-American Cataloguing Rules, 2nd ed. (AACR2) era, catalogers found a number of obsolete subject headings and subdivisions that had apparently been carried over from print version records for those titles. These errors fell into the category of those that could reasonably be corrected only in post–load authority processing. In the same sets, some records also used obsolete MARC coding.

A type of access error seen mostly in sets for non-book materials appeared to arise from misunderstanding or misapplication of cataloging rules. In some sets for streaming video, many records incorrectly gave the director or producer as main entry, when title main entry would have resulted from proper application of AACR2. In a set of records for music scores, uniform titles, if they were present at all, appeared in MARC 7XX fields rather than in the MARC 240, required for proper name/title indexing under personal name main entry. Another set of records for scores was missing form/genre subdivisions to indicate whether the resource was a score, a score and parts, or parts only. These missing subdivisions would have caused collocation issues in the traditional library catalog, and would have caused incorrect format faceting in the library’s discovery layer, Ex Libris’s Primo.

Incorrect or missing MARC coding in fields 006/007/008 was another type of access error that catalogers found frequently. Like the missing form/genre subdivisions discussed above, missing or incorrect values in certain positions of the fixed fields cause system-specific issues for format limiting and faceting. In several sets, at least one record was missing the 007 field for electronic resources. In some sets, the 006 field was missing from all records, while in other sets, the 006 field supplied was for textual materials rather than for electronic resources. Finally, in one set of records for streaming video, the 008/33 value necessary to indicate that video recordings were the type of visual material represented was absent, causing the library catalog and discovery system to interpret the format of the included titles as books rather than videos.
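
Fixed-field checks of this kind are among the easiest to automate. A simplified sketch, assuming pymarc, applying the e-book values cited in this study’s editing guidelines (008/23 = o, 007 beginning “cr,” 006 beginning “m”):

    # Sketch: verify e-resource fixed-field coding (assumes pymarc).
    from pymarc import MARCReader

    with open('vendor_set.mrc', 'rb') as fh:  # hypothetical file name
        for position, record in enumerate(MARCReader(fh), start=1):
            f008 = record['008']
            if f008 is None or len(f008.data) < 24 or f008.data[23] != 'o':
                print(f'record {position}: 008/23 is not "o" (online)')
            if not any(f.data.startswith('cr') for f in record.get_fields('007')):
                print(f'record {position}: no 007 beginning "cr" (electronic)')
            if not any(f.data.startswith('m') for f in record.get_fields('006')):
                print(f'record {position}: no 006 for electronic resources')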

One other type of omission was counted as an access error: the lack of a Library of Congress classification number in the MARC 050 or 090. In many sets, this information was present on some records but not on all. Although e-books do not require a call number for shelf placement, many discovery systems rely on Library of Congress call number information for search faceting. The absence of this data means that a user who narrows search results via facets could inadvertently exclude relevant results because their records lack the requisite data to populate that facet.

Catalogers placed one issue in the category of access errors that is not strictly an error, but rather an inconsistency: in a number of sets, entries for the same series title were traced on some records but untraced on others. According to standards, either choice is acceptable, depending on local preference, but a mix of traced and untraced entries for the same series heading within a single set is problematic in library catalogs and discovery systems that index series titles, because mixed practices produce inconsistent and incomplete search results.

Both the scope and the potential effect of access errors varied widely. Within a set of several thousand records, the effect of a set-wide omission is much greater than that of a few missing fields or values. On the other hand, consistency makes such problems easier to identify, and often, to fix. In many cases, these errors were actually omissions of data that catalogers considered necessary to full-level cataloging records, such as subject headings or format-specific coding. Omissions of data that could be expected to be different for each title, such as Library of Congress call numbers or ISBNs, were generally not difficult to identify, but were among the most difficult errors to correct. Finally, some errors fell into a gray area: they might affect access or not depending on local preferences, system implementations, and user needs. In these cases, catalogers chose a category based on local circumstances but recognized that other libraries might differ.

Errors Unlikely to Affect Access

In this category, catalogers placed all other identified errors that did not clearly fall into either of the other two categories. One type of identifier problem was not categorized as an access error, though a case could be made for doing so: inconsistent treatment of digital object identifiers (DOIs). In some sets, DOIs were given as URLs. This is a commendable practice, since DOIs are permanent and can be expected to provide greater stability than typical URLs. However, in a small number of sets, although many or most records had DOIs appearing in MARC field 024, those DOIs were not given as URLs. Instead, the URLs supplied in MARC field 856 were typical URLs presumed not to have the same level of stability as the DOIs for the same titles. Ideally, when DOIs exist, they should be given both in the MARC 024 and in URL form in the MARC 856. URL maintenance is a substantial ongoing workload in most libraries, and making use of all available tools to reduce that workload is highly desirable.
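
Where a DOI is present in the 024, deriving a stable URL for the 856 is mechanical. A small sketch follows; the resolver prefix reflects standard DOI practice, while any local proxying or display text is left out as a local matter.

    # Sketch: turn a bare DOI, e.g. from an 024 field, into a resolvable URL.
    def doi_to_url(doi: str) -> str:
        doi = doi.strip()
        if doi.lower().startswith('doi:'):  # strip a leading "doi:" label
            doi = doi[4:].strip()
        return 'https://doi.org/' + doi

    print(doi_to_url('doi:10.1000/xyz123'))
    # prints: https://doi.org/10.1000/xyz123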

Another problem that fell into the gray area between access errors and other errors is the absence of linking entry fields. The most useful and relevant of these fields for e-books is MARC 776, which provides a link to a bibliographic record for the print version of a title, ideally via a record identifier such as an OCLC number or an LCCN. Almost all of the record sets evaluated in this study were missing this element, either in whole or in part. Although a lack of linking entry fields has a negligible effect on access in many discovery systems at present (including those currently in use at the University of Minnesota), the gradual move toward relationship-entity models means that linking entry fields will likely become more important soon. Linking entry fields as they are commonly used in e-book cataloging offer one way to collocate related manifestations of the same work. Including them in current bibliographic records is one small way of preparing records for a future beyond MARC, since linked data models that may succeed MARC rely on record identifiers to pull together information from various sources to offer more comprehensive and interlinked descriptions of works, authors, and other entities.

A number of errors deemed unlikely to affect access arose as a result of partial or imperfect implementation of the provider-neutral record. A large number of sets that were otherwise compliant with provider-neutral standards included entries for provider names or series. While many libraries (including the University of Minnesota) still opt to include this information in e-book records, full adherence to the provider-neutral guidelines would exclude it. Similarly, some record sets included publisher and date information in the MARC 260 for that provider’s specific version of an e-book, rather than the original publisher and date as required by the provider-neutral standard. Although this study did not count this as an access error, Wu and Mitchell noted that they had observed “a user preference for seeing the original publisher and date information in the publication area.”30 Most of the sets that had provider-specific information in the MARC 260 included publisher and date of the original publication in the MARC 533, according to the practice of cataloging electronic reproductions that dominated e-book cataloging before the implementation of the provider-neutral record for monographs.31 The record sets analyzed exhibited a mixture of former, current, and ad hoc practices in the MARC 300. Although the provider-neutral standard’s recommended phrase “1 online resource” was frequently seen in 300 subfield $a, it was often not used consistently throughout a set, and was missing entirely from many other sets, usually in favor of the older recommended usage “1 electronic resource.” Another physical description error observed was the direct transcription of the MARC 300 field from the print version record, often including even subfield $c (dimensions), which is inappropriate for e-books.

The presence of obsolete MARC 5XX note fields was another error type deemed not likely to affect access. A large number of sets exhibited this error. Not surprisingly, these were usually sets that failed to follow provider-neutral guidelines (or that followed them imperfectly). Finally, a large number of sets also included the obsolete MARC 440 field tag for series headings. Since most systems still index 440, catalogers did not consider this to be an access error, though it was generally corrected to valid coding as a 490/830 field pair.
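
The 440 correction lends itself to batch treatment because the retagging is deterministic. A rough sketch over MarcEdit mnemonic (.mrk) lines follows; real series fields may need cleanup (terminal punctuation, ISSN placement) beyond what is shown.

    # Sketch: retag an obsolete 440 as a 490/830 pair (.mrk lines).
    def retag_440(line: str) -> list:
        if not line.startswith('=440'):
            return [line]
        indicators = line[6:8]      # e.g. '\0': blank plus nonfiling count
        body = line[8:]             # the subfield data
        traced = '=490  1\\' + body                   # first indicator 1 = traced
        heading = '=830  \\' + indicators[1] + body   # keep the nonfiling count
        return [traced, heading]

    for out in retag_440(r'=440  \0$aCambridge studies in linguistics ;$v102'):
        print(out)
    # prints: =490  1\$aCambridge studies in linguistics ;$v102
    #         =830  \0$aCambridge studies in linguistics ;$v102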


Discussion

Catalogers made a number of general observations about their findings as they conducted analysis and editing of record sets. Over the course of the study, it became clear that e-book vendors were slowly adopting the provider-neutral record. With some exceptions, record sets evaluated later in the study were more likely to make at least some attempt to adhere to the standard. Although many types of errors appeared whether records were provider-neutral or not, gradually expanding use of the standard meant that the variety of errors narrowed and became more predictable, enabling more efficient preload editing. The provider-neutral standard has clearly had a positive effect on the quality of vendor-created records as well as those created by library catalogers.

Catalogers were surprised by the relatively small number of truly critical errors they found. Based on prior experience and informal conversations with colleagues at other institutions, there was a perception that many more record sets were critically flawed than turned out to be the case. Even for sets with critical errors, catalogers found that most could be fixed without excessive effort. Only four of the sets evaluated during the study were rejected entirely for loading. In these four cases, other means were explored to provide title-level access for the record sets in question.

If the rarity of critical errors was a pleasant surprise, the variety and frequency of access errors came as an unpleasant one. In particular, access errors that were usually identifiable only through spot checks, such as unauthorized forms of names and subject headings and typographical errors in titles and names, were troubling, since these errors were typically the most difficult both to find and to correct. Catalogers had little confidence that spot-checking found all or even most of these errors, especially in larger record sets. Moreover, although the prevalence of identifier errors had been anticipated, the difficulty of accurately identifying and recoding print version identifiers in batch was a particularly vexing problem. Because accurate identifiers are critical for long-term catalog and collection management, this problem demands a substantial amount of cataloger time and attention. On the positive side, the variety of access errors encountered in fixed field coding helped to refine and expand local checklists and editing procedures, increasing catalogers’ confidence that coding errors for various formats would always be discovered and corrected before loading.

Inconsistencies in record sets from the same providers, though not errors in and of themselves, represent another significant problem. Catalogers confirmed what they had casually observed before the study: successive record sets from the same vendor, even for the same collection, do not exhibit consistent errors. Consistency permits efficient, accurate processing and flexible workload distribution; when records from the same provider do not display the same problems from set to set, libraries are forced to reevaluate each new set. Some inconsistencies are the result of the gradual adoption of the provider-neutral standard, an unquestionably positive development, but others are not related to provider-neutral changes. The unpredictable nature of problems found even in record sets from the same vendor supported catalogers’ assertion that every new set needed evaluation before loading.

Ultimately, catalogers concluded that there is no meaningful way to generalize about the most common errors across the full range of record sets. The wide variety of errors and the inconsistencies of practice, though both are somewhat reduced by wider adoption of the provider-neutral standard, mean that it is very hard to predict what errors one will find in any given record set. This is not to say that the records of many individual providers do not exhibit identifiable characteristics and typical errors, but very little applies across the board. Despite these challenges, catalogers at UML were still able to improve and refine local processes for record set editing based on the results of the study. Although catalogers and systems librarians had long worked from a preload set editing checklist, the results of this study provided ample data to inform a thorough revision and expansion of that checklist (appendix C). The data also supported continuing the time-consuming practice of spot-checking some records in each set. Having an inventory of previously observed issues allowed catalogers to document strategies for identifying and fixing the most egregious problems. Additionally, catalogers have documented errors common to particular vendors, which helps focus analysis and editing efforts for new sets from the same vendor on the most likely problems. Finally, less critical problems affecting access that catalogers could not easily fix before loading are now routinely documented for potential retrospective correction or record upgrades, if and when they are possible.


Conclusion

This study offers a “worm’s-eye view” of the quality issues in e-book record sets, focusing on detailed evaluation of discrete elements in individual records. Viewing the results from a broader vantage point suggests a number of strategies that libraries might pursue to address these issues. One lesson learned is that more and better-coordinated communication with record providers could help improve their offerings. Unfortunately, experience has shown that not all vendors and providers are interested in making the kinds of improvements to their record sets that libraries want, nor do all libraries convey a consistent set of needs to record providers. The typical current flow of communication, in which vendors create and distribute records, libraries locally edit and load those records, and libraries sometimes give the vendor feedback about problems in the records, has not proven especially effective at producing large-scale improvements to record quality. Martin and Mundle observed that “vendors are attempting to automate record creation as much as possible, and changes at the title-level are improbable. The key for efficiency for both libraries and vendors will be to create a high-quality description of each e-book that can be reused and repurposed by any number of libraries to create quality catalog records.”32 The kind of collaborative effort Martin and Mundle hint at is a promising way forward that libraries and vendors should pursue. Libraries understand their specific needs better than vendors do, and it is perhaps unrealistic to expect vendors to meet exacting library standards when they generally offer record sets for no additional charge beyond the price of the content. This is not to say that vendors should not meet a minimum standard; the PCC’s “MARC Record Guide for Monograph Aggregator Vendors” provides an excellent starting point, though even that standard could prove too difficult for some vendors to meet. When vendors are unable or unwilling to meet a minimum standard for their records, libraries should consider organizing a formalized, wide-scale repository or clearinghouse for sharing record sets that have been edited to meet a baseline standard. As a starting point, record sets could be shared within consortial or regional groups of libraries. Eventually, a national effort along the lines of the PCC could manage such a clearinghouse. Another possible path for OCLC subscribers is the WorldShare Metadata service, a relatively new service that automatically provides locally tailored sets of records, and a shared environment for their maintenance, for the collections a library has activated in the WorldShare knowledge base. Although records are not yet available for all collections, and the service is too new for its long-term effectiveness to be known, it has the potential to help individual libraries maintain the desired level of quality in their e-book records.

The growing level of adoption by vendors of the provider-neutral standard is encouraging. However, major changes in cataloging standards are coming soon. The implementation of Resource Description and Access (RDA) is already a reality for many libraries, and will be so in many more within the next year. But, because RDA training and implementation is a resource-intensive activity, and because OCLC will not require libraries to contribute RDA records, it is possible that some libraries will choose to continue cataloging under AACR2 rules. It is not obvious how the provider-neutral model will align with RDA, whose basic principles seem to disallow provider-neutral cataloging. Fortunately, the PCC has already done much work toward reconciling the provider-neutral standard with RDA.33 Nevertheless, as libraries saw with the original provider-neutral standard, widespread implementation is likely to take years. In the meantime, catalogers are likely to see a mix of AACR2 and RDA practices in vendor-provided e-book records. An issue not directly related to RDA, but to standards in general, is the proliferation of identifier systems, many of which libraries, publishers, vendors, and retailers may come to rely on as they move toward an environment in which linked data plays a central role. Luther addressed this in her overview of the book metadata landscape, proposing exploration of expanded use for the International Standard Text Code (ISTC) and International Standard Name Identifier (ISNI) standards.34 If these standards come into common usage, libraries must strongly consider including them in bibliographic and authority records.

Other nascent trends indicate that batchloading may soon become a less important activity for libraries; for some, it already has. Two such trends are the generation and extraction of bibliographic records from ERM knowledge bases and the presence of title-level metadata for e-monograph collections in web-scale discovery systems such as Serials Solutions’ Summon service and Ex Libris’s Primo Central, which offer unified indexing across metadata for many types of library resources from a variety of repositories and sources. Wu and Mitchell noted that the use of records derived from their library’s ERM knowledge base had streamlined the University of Houston’s batchloading workflows, but they also noted that many of those records contained very minimal bibliographic information.35 Preexisting records in web-scale discovery systems might also contain minimal information, though in some cases these systems may have better, more complete metadata than that available in the MARC records provided by some vendors. The balancing act between providing minimal access and full cataloging is one with which libraries are very familiar. The questions that librarians must answer when implementing these solutions are the same for batchloading as they have always been for traditional cataloging: What is gained in terms of efficiency and cataloger time? What is lost in terms of access and standardization? How important to user discovery needs is the additional access provided by full-level cataloging?

It is hard to overstate the value of the library community’s work on standards for e-monograph records. But the growing complexity and variety of locally implemented systems, from back-end ILSs, ERMs, and link resolvers to front-end OPACs and discovery systems, mean that those standards can serve only as a starting point. Each library must determine what it needs for its own discovery tools. The plethora of options in catalog and discovery systems means that functionality and dependencies even for something as simple as a format limit can vary widely. General studies on metadata and record quality point to the importance of context and local application in any evaluation of quality. Although standards are an excellent and necessary starting point, there is no one-size-fits-all definition of record quality. Libraries must weigh widely accepted standards in tandem with the needs of their own users and discovery systems as they make choices about record set evaluation and local record enhancement.


References
1. Annie Wu and Anne M. Mitchell, “Mass Management of E-Book Catalog Records: Approaches, Challenges, and Solutions,” Library Resources & Technical Services 54, no. 3 (2010): 164–74.
2. Judy Luther, “Streamlining Book Metadata Workflow: A White Paper Prepared for the National Information Standards Organization (NISO) and OCLC Online Computer Library Center, Inc.” (Baltimore: NISO, 2009), accessed October 4, 2012, www.niso.org/publications/white_papers/StreamlineBookMetadataWorkflowWhitePaper.pdf.
3. Ibid., 1.
4. Ksenija Minčić-Obradović, E-books in Academic Libraries (Oxford: Chandos Publishing, 2011).
5. Andrea Dinkelman and Kristine Stacy-Bates, “Accessing E-books through Academic Library Web Sites,” College & Research Libraries 68, no. 1 (2007): 45–58.
6. Doralyn Rossman, Amy Foster, and Elizabeth P. Babbitt, “E-book MARC Records: Do They Make the Mark?” Serials: The Journal for the Serials Community 22, no. 3 (2009): S46–S50, doi:10.1629/22S46.
7. Rebecca L. Mugridge and Jeff Edmunds, “Using Batchloading to Improve Access to Electronic and Microform Collections,” Library Resources & Technical Services 53, no. 1 (2009): 53–61.
8. Rebecca L. Mugridge and Jeff Edmunds, “Batchloading MARC Bibliographic Records,” Library Resources & Technical Services 56, no. 3 (2012): 155–70.
9. Kristin E. Martin and Kavita Mundle, “Cataloging E-Books and Vendor Records: A Case Study at the University of Illinois at Chicago,” Library Resources & Technical Services 54, no. 4 (2010): 227–37.
10. Ibid., 232.
11. Jeffrey Beall, “Free Books: Loading Brief MARC Records for Open-Access Books in an Academic Library Catalog,” Cataloging & Classification Quarterly 47, no. 5 (2009): 452–63, doi:10.1080/01639370902870215.
12. Elaine Sanchez, Leslie Fatout, and Aleene Howser, “Cleanup of NetLibrary Cataloging Records: A Methodical Front-End Process,” Technical Services Quarterly 23, no. 4 (2006): 51–71.
13. Mary Finn, “Batch Load Authority Control Clean-Up Using MarcEdit and LTI,” Technical Services Quarterly 26, no. 1 (2009): 44–50.
14. Carrie A. Preston, “Cooperative E-Book Cataloging in the OhioLINK Library Consortium,” Cataloging & Classification Quarterly 49, no. 4 (2011): 257–76.
15. Ibid., 261.
16. Vinh-The Lam, “Quality Control Issues in Outsourcing Cataloging in United States and Canadian Academic Libraries,” Cataloging & Classification Quarterly 40, no. 1 (2005): 101–22, doi:10.1300/J104v40n01_07.
17. Magda El-Sherbini, “Program for Cooperative Cataloging: BIBCO Records: Analysis of Quality,” Cataloging & Classification Quarterly 48, no. 2–3 (2010): 221–36, doi:10.1080/01639370903535726.
18. Thomas Bruce and Diane Hillmann, “The Continuum of Metadata Quality: Defining, Expressing, Exploiting,” in Metadata in Practice, ed. Diane I. Hillmann and Elaine L. Westbrooks (Chicago: ALA Editions, 2004), 238–56.
19. Diane I. Hillmann, “Metadata Quality: From Evaluation to Augmentation,” Cataloging & Classification Quarterly 46, no. 1 (2008): 65–80, doi:10.1080/01639370802183008.
20. Ibid., 69.
21. David Bade, “The Perfect Bibliographic Record: Platonic Ideal, Rhetorical Strategy or Nonsense?” Cataloging & Classification Quarterly 46, no. 1 (2008): 109–33, doi:10.1080/01639370802183081.
22. Ibid., 129.
23. Philip Hider and Kah-Ching Tan, “Constructing Record Quality Measures Based on Catalog Use,” Cataloging & Classification Quarterly 46, no. 4 (2008): 338–61, doi:10.1080/01639370802322515.
24. Ibid., 339.
25. Ibid., 360.
26. Program for Cooperative Cataloging, MARC Record Guide for Monograph Aggregator Vendors, 2nd ed., prepared by Becky Culbertson et al. (Washington, D.C.: Program for Cooperative Cataloging, 2009), accessed January 9, 2013, www.loc.gov/aba/pcc/sca/documents/FinalVendorGuide.pdf.
27. Program for Cooperative Cataloging, Provider-Neutral E-Monograph MARC Record Guide, prepared by Becky Culbertson, Yael Mandelstam, and George Prager (Washington, D.C.: Program for Cooperative Cataloging, 2009), accessed January 9, 2013, www.loc.gov/aba/pcc/bibco/documents/PN-Guide.pdf.
28. Wu and Mitchell, “Mass Management of E-Book Catalog Records,” 168.
29. Martin and Mundle, “Cataloging E-Books and Vendor Records,” 233.
30. Wu and Mitchell, “Mass Management of E-Book Catalog Records,” 168.
31. See Library of Congress Rule Interpretation (LCRI) 1.11a.
32. Martin and Mundle, “Cataloging E-Books and Vendor Records,” 235.
33. Program for Cooperative Cataloging, Provider-Neutral E-Resource MARC Record Guide: P-N/RDA Version, January 1, 2013 revision (Washington, D.C.: Program for Cooperative Cataloging, 2012), accessed January 10, 2013, www.loc.gov/aba/pcc/scs/documents/PN-RDA-Combined.docx.
34. Luther, “Streamlining Book Metadata Workflow,” 16.
35. Wu and Mitchell, “Mass Management of E-Book Catalog Records,” 173.
Appendix A. Specific Field Checklist for Record Evaluation

Appendix B. General Issues Checklist for Record Evaluation
  • Are there miscellaneous character errors?
  • Are there errors in vernacular characters, diacritics, special characters?
  • Does record correctly identify the same work? Does it correctly identify the same expression (edition)?
  • Are identifiers (e.g. DOIs, ISBNs, OCLC numbers) present? Do they correctly identify the work and edition? Do they identify the electronic version?
  • Is a data element provided that serves as a record identifier? Is it unique within the set? Is it consistent between iterations of the same record?
  • Are name headings authorized?
  • Are subject headings provided? Are they authorized? Which thesauri? Does the coding correctly represent the source vocabularies?
  • Are series entries provided? Are they traced? Are they correctly authorized? Are correct ISSNs provided? Are correct volume numbers provided?
  • Is coding for source vocabulary, etc., correct?
  • Is MARC coding valid?
  • How were records derived? (e.g., crosswalked from vendor database, cloned from existing copy, etc.) Does mapping to MARC include all relevant data?
  • Are records consistent with other sets from the same provider?

Appendix C. University of Minnesota Editing Guidelines for E-book Record Sets

The following fields and values should generally be present on bibliographic records for electronic book collections (and other collections of monographic electronic resources) batch loaded into Aleph. This list is not necessarily exhaustive; specific collections may require additional fields and/or coding changes.

Note: During the editing process, save altered but unfinished files to L:\IT\ET\Records\RecordsPending. Do not overwrite the original files in RecordsIn. Original files will be archived.

Note: Before beginning, determine whether there are any serial records in the file by examining LDR fields. If any are present, extract them into a separate file using Tools/Select MARC Records/Extract Selected Records and edit them separately.

Note: Before finishing, spot check access points on several records for authorized forms of names/headings and typos. Use your judgment to determine how many records to spot check; if the set is generally high-quality and from a more trusted provider, spot check fewer records. If the set has many problems, or is from a new provider or a provider with many known issues, spot check more records. Note any severe or widespread problems you can’t easily fix with MarcEdit in the spreadsheet for post-load correction.


Figures

Figure 1

Overlap between Error Types



Tables
Table 1

Categorization of Errors Found in Record Sets


Critical Errors
MARC Field(s) Error Description
N/A Record length exceeding 22,000 bytes
All Invalid MARC coding/tagging
001, 035 Missing control number or other unique identifier
856 Missing or bad URL/URI
Access Errors
MARC Field(s) Error Description
LDR, 008 Missing or incorrect values in LDR or 008
006, 007 Missing or incorrect values in 006/00 and 09 or 007/00-01
010, 020, 035 Identifiers for print versions coded in 010, 020, or 035 $a
050, 090 $a Missing LC class number
1XX, 240 Missing main entry (name or uniform title)
7XX Missing or inappropriate name heading
1XX, 7XX Unauthorized form of name(s)
1XX, 24X, 6XX, 7XX, 8XX Typographical error(s) in access points
245 $h Missing general material designation (GMD)
6XX Missing subject heading(s)
6XX Unauthorized form of heading(s)
6XX $v $x $y $z Missing subdivision(s)
Other Errors
MARC Field(s) Error Description
260 Missing or incorrect place, publisher, or date of publication
300 Missing or incorrect physical description
4XX, 7XX, 8XX Presence of vendor-specific series or names
440, etc. Presence of obsolete MARC tags
506, 516, 530, 533, 534, 538 Presence of obsolete note fields
776 (or other 77X/78X) Missing, incomplete, or incorrect linking entry field

Table 2

Number of Sets with Errors in Each Category


Category of Error Number of Sets
Critical errors 17
Access errors 85
Other errors 65

Table 3

Combinations of Error Types


Error Types Present No. of Sets
Critical, Access, and Other 13
Critical and Access 2
Critical and Other 1
Access and Other 49
Critical only 1
Access only 21
Other only 2

MARC Field PNR MAV UMN Details Full Check (F) or Spot Check (S)
LDR/06-07 M M S
LDR/17 M Check for misleading values. (MAV recommends Elvl 3 unless “constructed according to AACR2”). S
001/003 N/A M Confirm presence of unique control numbers. Where applicable, confirm that number is retained for subsequent iterations of the same record. F
006 M M M 006/00 = m, 006/09 = d for books. Optional additional 006s if reproduction. F
007 M/A M M 007/00 = c, 007/01 = r. F
008/06-14 M Check date(s) against 260 $c. S
008/15-17 M Check place of publication against 260 $a. S
008/23 M M M 008/23 = o F
008/28 M Evaluate only for known government publication sets. F
008/35-37 M Check for correct language of content. S
010 A A Do not use for print LCCN; put in 776. S
020 A A Electronic ISBN in 020 $a; others in 020 $z; if in doubt 020 $z; also copy print ISBNs to 776. Check for qualifiers. S
024 A Check for presence and type of identifier. Note inclusion of DOIs. Do not verify individual numbers unless there is evidence of a problem. F
035 O Check for OCLC number. If present, verify it correctly identifies electronic version. F
040 M M Do not put code for agency that did the print record here. F
050/060/082/086/[090] O O Check for presence of field only. Do not evaluate for correctness. F
1XX/7XX N/A A Check for presence of name headings; check if appropriate; check if authorized; check if any 710s identify vendor. S
245 $h M M M Check for presence of correct GMD. F
246 A A Check if provider-specific titles are given here. S
250 A A Check for provider-specific edition statements. F
256 X X Verify that this field is not used. F
260 M M 1st named publisher should apply to all known online versions. If reproduction, then 260 should be for print publisher. S
300 M M 1 online resource (pagination optional) F
490/830 A A Present if applicable; traced; authorized. Should not include provider-specific series. S
506 X X Verify that this field is not used. F
516 X X Verify that this field is not used. F
530 X X Verify that this field is not used. F
533 X X Verify that this field is not used. F
534 X X Verify that this field is not used. F
538 X X Verify that this field is not used. F
583 X X Check if preservation information is applicable to these records F
500/588 A A If “Description based on” note is used, 776 should also be present. S
6XX N/A O Check for presence of subject headings; check for source vocabulary (note if vendor’s vocabulary); check if authorized. Describe specific issues in spreadsheet. S
655 N/A O Check if genre/form headings given; check if any indicate electronic format. S
776 A A Use if print original is known. Check for presence and correct use of $z, $w. S
856 A M M Check if URL is non-institution specific; check for $3, $z; check if it actually points somewhere, and to the right resource. Check for presence of multiple URLs (if supplied directly by vendor, do they ensure that their URLs are the only ones present?) Check for additional URLs for related content, e.g. LC TOC URLs. S

Legend

PNR Requirement of “Provider-Neutral E-Monograph MARC Record Guide.” (PCC document)

MAV Requirement of “MARC Record Guide for Monograph Aggregator Vendors.” (PCC document)

UMN University of Minnesota requirement

 M Field is mandatory

 A Field is mandatory if applicable

 O Field is optional

 X Field is obsolete


MARC field Required coding/elements
LDR/09 If records are not UTF-8, convert the file to UTF-8 encoding.
001/003 Verify presence of a control number in 001 and a qualifying code in 003. If field 035 exists on all records and accurately references e-version records, delete the 001 and 003.
007 (electronic resources) Code as follows: 00 = c; 01 = r; 03 = usually c (use fill character if adding 007); 04 = n; 05 = blank; 06-13 = fill characters
006 (electronic resources) Code as follows: 00 = m; 05 = blank (if adding 006, use fill character); 06 = o; 09 = d; 11 = blank (if adding 006, use fill character)
007 For non-textual resources (except music scores), 007 fields should be present and accurately coded for the specific type of content.
008 For all resources: 008/23 = o. For non-textual resources, check format-specific positions in 008 for accurate coding (note especially 008/33 for videorecordings).
020 Verify that any ISBN in 020 $a is for e-version; move print ISBNs to 020 $z and 776 $z
245 Verify presence of GMD $h [electronic resource] (follows $a, n, and p; precedes $b and c).
300 In $a, use 1 online resource. Pagination may optionally follow in parentheses, as well as $b indicating the presence of illustrations, etc. If $c is present, delete it.
440/490 If 440 fields are present, copy them to 830 fields, then retag all 440 fields as 490 first indicator 1.
506/533/540/583 Delete these fields if they contain provider-specific information.
516/530/538 Delete if present. These fields are obsolete.
710 or 830 Add the established form of the provider name, or the established series heading for the collection. Note: This field is included to facilitate easy retrieval of all records belonging to a particular set for ongoing maintenance. Choose one or the other based on the model for subscription and record provision: use the provider name if there is a single subscription to all of the publisher’s e-book content (e.g. Brill); use the collection/series title if a publisher offers multiple collections with distinct titles and content (e.g. North American Theatre Online, one collection of many offered by Alexander Street Press). For sets containing records that are additions to previously loaded sets, make sure that the form of name or series is the same as that used for previous loads.
856 Verify that only one URL per volume represented on the record is present, for the correct provider. Delete URLs for other providers. Add the proxy prefix to URLs. Add $y click-on text.
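
The proxy-prefix step in the 856 instructions above is another mechanical edit. A hedged sketch over MarcEdit mnemonic (.mrk) lines, with an EZproxy-style placeholder standing in for the library’s actual prefix:

    # Sketch: prepend a proxy prefix to each 856 $u (.mrk lines).
    PROXY_PREFIX = 'https://login.proxy.example.edu/login?url='  # placeholder

    def add_proxy(line: str) -> str:
        if line.startswith('=856') and '$u' in line and PROXY_PREFIX not in line:
            return line.replace('$u', '$u' + PROXY_PREFIX, 1)
        return line

    print(add_proxy(r'=856  40$uhttp://ebooks.example.com/id/12345'))
    # prints: =856  40$uhttps://login.proxy.example.edu/login?url=http://ebooks.example.com/id/12345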


