Kat Hagedorn, OAIster/Metadata Harvesting Librarian,
University of Michigan Libraries
Kat Hagedorn is the Metadata Harvesting Librarian responsible for the OAIster project (http://www.oaister.org), a search gateway for records harvested via the Open Archives Initiative (OAI) (http://www.openarchives.org/), which lead to digital objects on the Web. The OAIster project, originally funded by the Mellon Foundation in 2001-2002, is now in its second generation of funding support, thanks to the Institute of Museum and Library Services. OAIster was originally envisioned as something like an "academic HotBot," but it is now better thought of as a union catalog of digital objects. Its growth has followed the "punctuated equilibrium" model, occurring in jumps and stages rather than along a smooth curve. OAIster's reach is broad, with the intent of harvesting "everything available" except for materials in test repositories, which tend not to be ready to harvest or stable enough to be reliable. There are, in addition, three criteria for inclusion of resources in OAIster. First, there must be a digital object link, allowing the user access to the object in a single click. Second, there must be "decent metadata" associated with the object, meaning metadata that actually supports discovery: extremely minimal metadata tends not to allow for search and discovery. Third, the resource must be scholarly or informational in nature, the latter indicating resources which may not be primary source material but are still of academic value. As examples, Ms. Hagedorn connected to a humorous 1903 short film by Thomas Edison, via the Library of Congress, and to a web site devoted to nanotechnology. The latter is considered to be informational: although not a discrete digital object itself, it provides links to open access journal articles about nanotechnology, which are discrete digital objects. She noted that an online finding aid could be a digital object, if sufficiently rich in associated metadata.
OAIster is a large resource, with over seven million records as of April 2006, for materials of all types, from images of artwork to audio recordings, finding aids, articles, manuscripts and more. Although a lot of automation comes into play in the process of adding resources to OAIster, many aspects have not been automated and perhaps never will be. When a new repository is discovered, a description of it has to be written for OAIster. Additionally, the metadata itself has to be tested for validity. As part of a policy of making its metadata as widely available as possible, OAIster now provides metadata files on a monthly basis to Yahoo. As an example of the results of a Yahoo search of OAIster, Ms. Hagedorn pulled up "Frontier Flirtation", the first silent movie shown. OAIster provides records to Google as well, but unfortunately Google does not use the metadata itself; it only crawls the associated URLs. As a result, it is more difficult to find these resources through Google than through Yahoo. The project has also built an interface to SRU: Search / Retrieve via URL (http://www.loc.gov/standards/sru/), considered to be the next generation of Z39.50, to support federated searching.
At this point, Ms. Hagedorn spoke about OAI itself, what it is and is not. The Open Archives Initiative, which "develops and promotes interoperability standards that aim to facilitate the efficient dissemination of content," includes a Protocol for Metadata Harvesting (PMH), which is what the OAIster project uses for resource discovery. OAI does not equal open access, however, so some materials included in OAIster may have restrictions on access. Restricted resources are included in the project if they allow for multicampus or multi-institutional subscriber access. Resources which allow single-account access only are not included. Fortunately, the great majority of records point to materials which are freely accessible to all.
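Because SRU, mentioned above, expresses a search as an ordinary URL, a federated search client only needs to assemble a query string. The following is a minimal sketch of an SRU 1.1 searchRetrieve request, not OAIster's actual interface: the endpoint address and the `build_sru_url` helper are hypothetical, and the `dc.title` index is assumed to be supported by the server.

```python
from urllib.parse import urlencode


def build_sru_url(base_url, cql_query, start_record=1, max_records=10):
    """Build an SRU 1.1 searchRetrieve URL for a CQL query (hypothetical helper)."""
    params = {
        "operation": "searchRetrieve",   # standard SRU operation name
        "version": "1.1",
        "query": cql_query,              # CQL query string
        "startRecord": start_record,     # 1-based position of the first record
        "maximumRecords": max_records,   # page size requested from the server
    }
    return f"{base_url}?{urlencode(params)}"


# The endpoint below is a placeholder, not a real OAIster address.
url = build_sru_url("https://sru.example.org/oaister", 'dc.title = "nanotechnology"')
print(url)
```

The server would answer with an XML response containing the matching records, which the client can merge with results from other SRU targets.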
OAIster also harvests many open access repositories, for both self-published and peer-reviewed materials, including DSpace, OJS (Open Journal Systems), EPrints, PLOS (the Public Library of Science), and arXiv, a physics preprint and postprint archive. In the OAI-PMH model, data providers supply metadata records as UTF-8 encoded XML. Dublin Core is the required metadata format, but records in additional formats are accepted. Service providers discover and harvest the metadata, transform it to suit the characteristics of the service, and index it. The OAIster transformation tool removes records which do not point to digital objects, adds normalized fields for limiting searches to resource types such as image or audio, and maps simple Dublin Core records to the DLXS Bibliographic Class for indexing (http://www.dlxs.org/).
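The harvest-and-transform steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not OAIster's actual tool: the sample XML, the `TYPE_MAP` vocabulary, the repository address, and the function names are all hypothetical. Only the `ListRecords` verb, the `oai_dc` metadata prefix, the optional `from` argument, and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical mapping from provider dc:type values to normalized search limits.
TYPE_MAP = {"movingimage": "video", "stillimage": "image",
            "image": "image", "sound": "audio", "text": "text"}


def list_records_url(base_url, from_date=None):
    """Build an OAI-PMH ListRecords request for simple Dublin Core."""
    params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    if from_date:
        params["from"] = from_date  # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


def transform(xml_text):
    """Drop records without a digital object link; add normalized types."""
    kept = []
    for record in ET.fromstring(xml_text).iter(f"{OAI}record"):
        urls = [e.text.strip() for e in record.iter(f"{DC}identifier")
                if e.text and e.text.strip().startswith(("http://", "https://"))]
        if not urls:
            continue  # no link to a digital object, so the record is removed
        types = [e.text.strip().lower() for e in record.iter(f"{DC}type") if e.text]
        kept.append({
            "title": record.findtext(f".//{DC}title", default=""),
            "url": urls[0],
            "resource_types": sorted({TYPE_MAP[t] for t in types if t in TYPE_MAP}),
        })
    return kept


# Made-up response fragment standing in for a real harvest.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
  xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
  xmlns:dc="http://purl.org/dc/elements/1.1/"><ListRecords>
 <record><header><identifier>oai:example.org:1</identifier></header>
  <metadata><oai_dc:dc><dc:title>Sample film</dc:title>
   <dc:type>MovingImage</dc:type>
   <dc:identifier>http://example.org/films/1</dc:identifier>
  </oai_dc:dc></metadata></record>
 <record><header><identifier>oai:example.org:2</identifier></header>
  <metadata><oai_dc:dc><dc:title>Catalog card only</dc:title>
   <dc:type>Text</dc:type>
   <dc:identifier>call-number-123</dc:identifier>
  </oai_dc:dc></metadata></record>
</ListRecords></OAI-PMH>"""

print(list_records_url("http://example.org/oai"))
for item in transform(SAMPLE):
    print(item)
```

Only the first sample record survives, since the second carries a call number rather than a URL. A real harvester would also need to follow resumption tokens across large result sets and map the kept records into the target indexing format, both of which this sketch omits.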
Thanks to a grant from the Institute of Museum and Library Services (IMLS), the newest development in OAIster is the harvesting of Metadata Object Description Schema (MODS) (http://www.loc.gov/standards/mods/) format records from the Digital Library Federation's Aquifer service (http://www.diglib.org/aquifer/). Aquifer focuses on institutional repository data which makes use of MODS, the currently favored standard among richer metadata formats. MODS is based on MARC, but is "human-readable" in presentation and offers possibilities for search, retrieval and display at greater levels of depth than does Dublin Core. Other enhanced features include the addition of thumbnails in results lists and the availability of users' "bookbags". As the project develops, there is also the possibility of more detailed searching capabilities on, for example, different types of subject fields. The future also holds the possibility of improved "metadata remediation," the process of clustering and classifying metadata from disparate providers so that it is more easily searchable and browsable. Finally, in response to the rhetorical question, "Who will win: Google or OAIster?", Kat Hagedorn concluded that the question is not really pertinent. Search engines and services such as OAIster have different purposes, provide access to different sets of materials, and are not actually in competition.
The audience's questions included one about repositories which use MODS but are not part of the Digital Library Federation. Will records from these repositories be available via OAIster? At present, the DLF grant sets a limit to the repositories harvested, but it is hoped that this scope will be expanded. Another question had to do with the integrity of the data harvested. For example, what happens if URLs change? Metadata can be reharvested, and therefore updated, as often as weekly, provided the date stamps in the records change.
Reported by