Friday, October 24, 2008

Week 10 Reading Notes

Digital Library Design for Usability

This article outlines five models of computer systems design. The authors find several of these models lacking. The most successful design elements gleaned from these five models include:
  • Learnability: the user can start using a digital library quickly without picking up a lot of new skills.
  • Memorability: the user can remember how to use the library after a significant length of time.
  • The user should be able to recover from errors.
  • The user should be able to save search results or search paths for later use.
  • Users within an organization should be able to get training and guidance on using the library.
  • Library prototypes should be tested on end users and revised before the final product is released.
  • Proactive "agents" that know a user's preferences can alert her to new items of interest.

Evaluation of Digital Libraries: An Overview

Saracevic's typo-riddled article points out that evaluation of digital libraries, especially commercial libraries, is rare. When digital libraries are evaluated, a systems-centered approach is most common. Human- and user-centered approaches are less common. To me this is problematic; if digital libraries are used by humans, their needs should be evaluated first. Perhaps this is why Saracevic notes that "users have many difficulties with digital libraries" such as ignorance of the library's capabilities and scope.

I strongly disagree with his assertion that "it may be too early in the evolution of digital libraries for evaluation." Even when he wrote his article in 2004, many digital libraries were in existence. Now there are even more, and the number is growing all the time. Institutions spend hundreds of thousands of dollars on commercial digital libraries alone, so they should have some evaluation results on which to base their funding allocations.

Arms Chapter 8: User Interfaces and Usability

Arms details some reasons for the disconnect between end users and digital libraries. First, the interfaces, collections, and services in digital libraries change constantly, but the user adapts slowly. This can cause much frustration. Second, digital libraries were initially used primarily by experts who understood what they were using. Now that the Internet is nearing ubiquity, fewer digital library end users are experts. They "do not want to spend their own time learning techniques that may be of transitory value." Thus digital libraries must be accessible to both skilled and unskilled end users.

Arms lists four parts of a digital library's conceptual model: interface design, functional design, data and metadata, and computer systems and networks.

Several points stood out to me:

  • To increase space on the screen for content, remove on-screen manipulation buttons and have the user navigate with keystrokes.
  • Structural metadata is required to relate page sequence with actual page numbers. The page number in the original document rarely matches the sequence of the digital version, since prefaces and tables of contents are seldom numbered.
  • To reduce page-loading time, data can be sent to the user's computer before she requests it. If she is viewing page 6, for instance, the computer can "pre-fetch" page 7 in the meantime (a toy sketch of this idea follows the list).
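
To make the pre-fetching idea concrete, here is a toy Python sketch; fetch_page() is my own placeholder for whatever actually retrieves a page, not anything from Arms:

    import threading

    cache = {}

    def fetch_page(n):
        # placeholder for a real network request
        return "<contents of page %d>" % n

    def view_page(n):
        page = cache.pop(n, None) or fetch_page(n)
        # while the user reads page n, quietly fetch page n + 1
        threading.Thread(
            target=lambda: cache.setdefault(n + 1, fetch_page(n + 1))
        ).start()
        return page

    print(view_page(6))  # page 7 is pre-fetched in the background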

Some of Arms' suggestions for digital libraries:

  • They should be accessible from anywhere on the Internet.
  • The interface should be extensible.
  • Content should be accessible indefinitely. (This tenet seems under threat by copyright laws and DRM.)
  • Interfaces should be customizable.
  • Spatial representations of library content can aid the user's memory and increase access.
  • Interfaces should have consistent appearance, controls, and function.
  • The interfaces should provide feedback to users about what is happening.
  • Users should be able to stop an action or return.
  • There should be several ways for the user to complete the same task; some routes can be simple for the novice user while some routes can be faster for experts.
  • Interfaces should be accessible regardless of a user's computer display preferences, Internet speed, or operating system.
  • Caching and mirroring should be used to reduce delays in information transfer over the Net. Through mirroring, the user accesses the copy of the content closest to her, though it may be stored on several servers around the globe (a rough sketch of mirror selection follows this list).
  • The interface should summarize the user's choices.
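
Mirroring made me think about how a client might pick a replica. A rough Python sketch, with made-up mirror names and latencies (a real client would measure round-trip times):

    def ping_ms(mirror):
        # stand-in for a real latency measurement
        return {"us-east": 40, "eu-west": 120, "asia-east": 210}[mirror]

    def nearest_mirror(mirrors):
        return min(mirrors, key=ping_ms)

    print(nearest_mirror(["us-east", "eu-west", "asia-east"]))  # -> us-east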

Thursday, October 23, 2008

Week 10 Muddiest Point and Question

I have no muddiest point for this week. My question is about the final demonstration of my group's digital library. Can we demonstrate the library to Dr. He any time during the final week of the semester? When can we start scheduling the demonstration?

Friday, October 17, 2008

Week 8 Reading Notes

The Truth About Federated Searching

I do not completely trust the veracity of this article because it comes from a private corporation that sells federated search technology.

  • Federated searching is a Web-based search of multiple databases.
  • User authentication can be problematic. Federated searches should be available to patrons on-site as well as off-site.
  • True de-duping, which eliminates all duplicate search results, is a myth: for every identical record to be dismissed, the search engine would have to spend hours processing. Still, it seems wise to de-dupe at least the initial results set (a rough sketch follows this list).
  • Relevancy ranking is based only on words in the citation, not in the abstract, index, or full text. Therefore, relevancy ranking ignores many crucial keywords. This suggests it is important to skim several pages of results instead of assuming the first 5 or 10 will be the most useful.
  • "You can't get better results with a federated search engine than you can with the native database search." So why use a federated search? It is still faster to search several databases at once than to seek out the portals of individual databases, unless you know of one or two individual databases that consistently provide results relevant to your topic.
  • One of the myths Hane debunks, that federated searching is software, actually points to a benefit: because federated searching is a service, not software, libraries don't need to update databases using translators on a daily basis.
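
Here is a rough Python sketch of de-duping an initial results set; the record fields are my assumption, since real records vary by vendor:

    def dedupe(records):
        seen, unique = set(), []
        for rec in records:
            key = (rec["title"].strip().lower(), rec["author"].strip().lower())
            if key not in seen:
                seen.add(key)
                unique.append(rec)
        return unique

    results = [
        {"title": "Digital Libraries", "author": "Arms"},
        {"title": "digital libraries ", "author": "ARMS"},  # duplicate
        {"title": "Understanding Digital Libraries", "author": "Lesk"},
    ]
    print(len(dedupe(results)))  # -> 2

Even this naive version shows why true de-duping is hard: records that differ by a subtitle or a typo slip right through.
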
Federated Searching: Put It in Its Place

Miller's article supports using federated searching in conjunction with library catalogs. In many ways, his 2004 article is woefully outdated. The University of Pittsburgh's digital libraries are already searchable through a Google-like search box; this engine is called Zoom! and is much slower and less reliable than Google. On the other hand, the article is still relevant and still being ignored. The University of Pittsburgh's actual catalog (as well as the Carnegie Library's catalog, and many others) does not have a Google-like interface. Users type a search term but must also choose limits such as title, author, or location. Personally, I find this useful and easy, but for a generation raised on Google, library catalogs probably need to evolve.

I agree with Miller that "Amazon ... has become a de facto catalog of the masses." When I was a reference assistant, finding a specific title for a patron was much easier in Amazon than on the library's own catalog. Amazon's visual aspects, ease of searching, patron reviews and reader's advisory were usually preferable. For research, I often consult WorldCat before searching a local catalog.

Search Engine Technology and Digital Libraries

Lossau confronts the problem of including non-indexed Web content in library searches. First, much of the Web's content is not appropriate for the academic world because it lacks authenticity and authoritativeness. Second, much of this information changes by the second. Also, the content is not guaranteed to persist, or to remain at the same location.

Much of the deep web, however, contains highly authoritative information, especially regarding the sciences. To access this, Lossau suggests a search index with a customizable interface that can mesh with existing local portals. Users should be able to display their result sets in a variety of ways. Automatically extracted metadata can improve access to useful but difficult-to-find materials that have not been indexed by a human.

Week 8 Muddiest Point and Questions

I have no muddiest point or question for this week.

Wednesday, October 8, 2008

Week 7 Reading Notes

Lesk's Understanding Digital Libraries Chapter 4

Audio
  • two types of formatting: high quality for music, low quality for voice
  • music recordings are more difficult to store than voice recordings
  • searching databases of sound recordings is difficult
  • accelerating voice recordings saves space and listener's time, but eventually degrades quality by altering pitch
  • one way of organizing voice recordings is for computers to detect sentences and paragraphs by analyzing pauses (a toy sketch follows this list)
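
A toy Python sketch of pause-based segmentation, assuming the audio has already been reduced to per-frame loudness values (the threshold numbers are invented):

    def split_on_pauses(energies, silence=0.1, min_pause=3):
        segments, current, quiet = [], [], 0
        for e in energies:
            if e < silence:
                quiet += 1
                if quiet >= min_pause and current:  # a long pause ends a segment
                    segments.append(current)
                    current = []
            else:
                quiet = 0
                current.append(e)
        if current:
            segments.append(current)
        return segments

    frames = [0.5, 0.6, 0.05, 0.05, 0.05, 0.7, 0.8]
    print(split_on_pauses(frames))  # -> [[0.5, 0.6], [0.7, 0.8]]
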
Images
  • GIF: lossless on bitonal images, lossy on color, good for limited-color images such as CAD drawings, not good for photos
  • you can improve GIF compression performance by reducing the number of colors to 64 (see the Pillow sketch after this list)
  • JPEG: has "generally stronger algorithm than GIF", not as good for computer-generated images
  • dithering approximates gray with variations in black and white dots; dithering can improve image quality but increases size
  • images are difficult to classify and index; image thesauri help somewhat; images often labeled by artist, title, size; experimental image searching now done by color, shape and texture
  • accessibility tags aid in online image searching
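
Here is a sketch of the color-reduction and dithering ideas using the Pillow library (assuming it is installed; "photo.png" is a placeholder filename):

    from PIL import Image

    img = Image.open("photo.png").convert("RGB")

    # quantize to 64 colors before saving as GIF, as Lesk suggests
    img.quantize(colors=64).save("photo-64.gif")

    # converting to mode "1" (bitonal) dithers by default, approximating
    # grays with patterns of black and white dots
    img.convert("1").save("photo-dithered.png")
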
Video
  • storage is a challenge; TV footage should be compressed
  • video footage can be accessed more easily if it is divided into sections and labeled by sample images
  • searching is difficult; closed-captioning tracks aid in searches of television footage (a toy sketch follows this list)
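
A toy Python sketch of caption-based search over labeled video segments; the segment names and caption text are invented:

    captions = {
        "seg-01": "the mayor opens the new library branch",
        "seg-02": "interview about digital preservation",
    }

    query = "library"
    hits = [seg for seg, text in captions.items() if query in text]
    print(hits)  # -> ['seg-01']
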
Hawking's "Web Search Engines"

Because search engines do not have the time or power to search every page on the Web, they use several techniques to cull the most relevant pages.
  • crawling algorithms: "seed" URLs point to groups of reputable sites; duplicate content is identified and ignored; priority queues give preference to sites that are changed often, have many incoming links and are clicked often
  • spam: webmasters' insertion of false keywords is no longer effective; artificial linking and cloaking (showing content to the crawler that isn't shown to the user) are very effective; crawlers create blacklists of spamming URLs
  • indexing: search engines create dictionaries of query terms; multiple machines work to index pages; precomputed lists of common search queries increase processing speed; anchor text is generally reliable and assists in indexing (a toy inverted index appears after this list)
  • ranking: attributes such as URL brevity improve a page's score
  • caching: search engine stores results of popular queries
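
To see what the indexing step builds, here is a toy inverted index in Python; mapping each term to the documents containing it is the core structure, though real engines add positions, weights, and compression:

    from collections import defaultdict

    docs = {
        1: "digital library design",
        2: "library search engines",
        3: "web search",
    }

    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    # documents containing both "library" and "search"
    print(index["library"] & index["search"])  # -> {2}
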
I was surprised that this article mentions "the" as a valid search term for Web search engines. This usually is an invalid or ignored term in library catalog queries.

Henzinger's "Challenges in Web Search Engines"

This article suggests several ways to improve search engine performance:
  • avoiding spam, such as link farms (groups of links to all pages on a site), doorway pages (pages of links intended to attract search engine attention) and cloaking
  • identifying page quality by tracking user clicks
  • detecting when web conventions (such as anchor text, hyperlinks and META tags) are being violated
  • avoiding duplicate content and duplicate hosts by paying attention to DNS semantics and collapsing generally equivalent hostnames (such as .co.uk and .com; a toy sketch follows this list)
  • being wary of vaguely structured data: examining HTML tags to give preference to certain page sections, examining layout to assess page worthiness, and dismissing pages with egregious markup mistakes
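
A toy Python sketch of collapsing duplicate hosts; the equivalence rules are invented examples, not Henzinger's actual method:

    def normalize_host(host):
        host = host.lower()
        if host.startswith("www."):
            host = host[4:]
        # treat generally equivalent suffixes alike
        if host.endswith(".co.uk"):
            host = host[:-len(".co.uk")] + ".com"
        return host

    hosts = ["www.Example.co.uk", "example.com", "example.co.uk"]
    print({normalize_host(h) for h in hosts})  # -> {'example.com'}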

Week 7 Muddiest Point

I have no muddiest point. My question is about META tags, introduced in the Henzinger article from 2002. At that time, META tags were "the primary way to include metadata within HTML." Are META tags still used now that XML is well known? How do they compare to XML?

Wednesday, October 1, 2008

Week 6 Reading Notes

Research Challenges in Digital Archiving and Long-term Preservation

Hedstrom's paper was disheartening. She maintains that there is tremendous work yet to be done in ensuring preservation through digital libraries. There is no date on her paper; perhaps the situation has improved in the years since she wrote it.

She outlines five concerns:
  1. We cannot manage and preserve digital libraries fast enough; we need either more manpower, more money, or more automated tasks.
  2. Uninterrupted preservation of digital objects into the unforeseeable future is a great challenge, since software and hardware require frequent migration.
  3. Lack of precedents and models means more knowledge is needed on legal issues, cost-benefit analysis, etc.
  4. Better technology is needed, especially to automatically write, extract, restructure and manage metadata.
  5. Networks and templates are needed to encourage standardization and interoperability among digital libraries.
Actualized Preservation Threats

Littman's article was not much more encouraging than Hedstrom's. However, it is useful that Littman is publicizing potential pitfalls in implementing a digital repository. The major problems were media failure (such as portable hard drive issues), hardware failure (data loss and service disruption due to hard drive failure), software failure (including METS and XML issues), and operator errors (mistakes caused by humans).

Littman mentioned that metadata was encoded as METS, MODS, MARCXML, PREMIS, and MIX. Why were so many standards used? Does this increase interoperability with other libraries?

"Ingest of digitized newspapers into the repository began while the repository was still under development," wrote Littman. "This is probably fairly common." Why is this common? Why were developers so pressed for time that they had to ingest before testing the repository? Is this a result of poor project planning and lack of deadline adherence?

The Open Archival Information System Reference Model

A forum of national space agencies formed this model from a desire to establish shared concepts and definitions for digital preservation and archiving. The open forum developed standards for a repository that would preserve and provide access to information.

They delineated several responsibilities of an Open Archival Information System:
  1. Define collection scope and motivate information owners to pass items to the archive.
  2. Obtain sufficient custody and intellectual property rights of the items.
  3. Define scope of primary user community.
  4. Be sure users can independently understand the items.
  5. Create preservation policies and procedures.
  6. Make the archive available to the intended community.
The archive has three parts - environment, functional components, and information objects.

Environment: management + producer + consumer
  • Management: creates and enforces policy, defines collection scope, conducts strategic planning
  • Producer: ingests information and metadata, guided by a submission agreement
  • Consumer: information users, including designated community (smaller group of primary users who independently understand archived items)
Functional components: ingest, archival storage, data management, preservation planning, access, administration
  • Ingest: receive information, validate completeness of information, extract/create metadata
  • Archival storage: ensure information is stored correctly, refresh media, migrate between formats, check for errors, create disaster recovery plans
  • Data management: maintains metadata and databases
  • Preservation planning: creates preservation plan, keeps abreast of new storage and access technologies
  • Access: the user interface, helps users find and access information
  • Administration: coordinates operations of five previous services, monitors performance
Information objects:
  • Submission information package (SIP): original ingested item
  • Dissemination information package (DIP): what user accesses
  • Archival information package (AIP): descriptive information and packaging information (content information + preservation description information)
  1. Content information: content data object (the information itself) + representation information (what is needed to render the bit sequences)
  2. Preservation description information: reference (unique identifier such as ISBN) + provenance (history of item's creation, owners, etc.) + context (relationship to other documents) + fixity (authenticity validation such as digital signature or watermark)
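
To keep these packages straight, I sketched the AIP as Python dataclasses; the field names are my own shorthand for the concepts above, not part of the OAIS standard:

    from dataclasses import dataclass

    @dataclass
    class ContentInformation:
        data_object: bytes      # the information itself, as stored bits
        representation: str     # what is needed to render the bit sequences

    @dataclass
    class PreservationDescription:
        reference: str          # unique identifier, such as an ISBN
        provenance: str         # history of the item's creation, owners, etc.
        context: str            # relationship to other documents
        fixity: str             # e.g. a digital signature or checksum

    @dataclass
    class ArchivalInformationPackage:
        content: ContentInformation
        preservation: PreservationDescription
        packaging: str          # how the pieces are bound together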

Week 6 Muddiest Point and Question

I have no muddiest point.

My question regards the exam on November 3. Will we have a review or discussion prior to the exam? Is it true that there are examples of past exams for us to look at?