Tuesday, November 25, 2008

Week 13 Reading Notes

Lynch: Where Do We Go from Here?

Lynch reminds readers that digital libraries didn't suddenly appear in the 90s alongside the Web, but instead trace their advent to the 60s. Digital library research began receiving major governmental funding in the mid-90s, which "legitimized digital libraries as a field of research." Lynch is very appreciative of the NSF funding that encouraged collaboration and community building, especially among a diverse group of sectors.

The bulk of programmatic government funding for digital libraries has tapered, except for research involving defense, intelligence and homeland security. Money is being directed at digital asset management, institutional repositories, and creating new collections.

In the future, the largest issues for digital libraries will be preservation and ethical stewardship. Lynch would like to see more studies on privacy/personal information management, user behaviors, interactive library environments, and libraries' role in learning and human development.

Lynch writes, "The lion's share of the NSF funding went to computer science groups, with libraries often being only peripherally involved, if at all." I wonder if digital libraries would be more user-friendly today if librarians had gotten a bigger slice of the government pie back then.

Stiglitz: Intellectual Property Rights and Wrongs

This Nobel laureate wants to empower the developing world through less stringent intellectual property rights and more open access/open source content. He criticizes biased intellectual property rights that benefit developed nations and their monopolistic corporations. These rights, he says, can hinder innovation, which relies on liberal dissemination of ideas.

Stiglitz writes, "Monopolists may have much less incentive to innovate than they would if they had to compete." He uses Microsoft's squelching of the Netscape browser as an example. I agree that Microsoft needs more competition, especially since Vista is far from innovative. However, despite Microsoft's dominance, innovation has not died: Firefox is free, open source, and flourishing.

Knowledge Lost in Information

This report asks the NSF to continue funding digital libraries at a level of $60 million a year. Lapsed funding could allow international competitors to overtake the U.S. in research accomplishments. It could also lead to more chaos in libraries as users lose control over the world's growing masses of information.

Some of the research needs include:
  • transferring research models to broader contexts, thus increasing user populations
  • better strategies for accessing information in various formats
  • cognitive completion, or prompts that help a user pinpoint her desired subject
  • proactive storage systems that automatically cull new information
  • user-centered design, customizable user interfaces
  • automatic production of metadata
  • creating a universal architecture
  • active replication to ensure preservation
  • interoperability

Some reasons for NSF to fund digital libraries:

  • national security will be stronger
  • students will be better equipped to compete in a global economy
  • research in health and environmental sciences will gain momentum

I enjoyed this quote from Herbert Simon: "A wealth of information creates a poverty of attention." The authors of this report maintain that digital library research should provide ways to manage the information glut. One way is to "reduce available information to actionable information." This is a tricky task: who decides what is actionable and what should be tossed? For instance, 20 years from now, hurricane warning information for Katrina will not be actionable, but it will still have historical relevance.

Week 13 Muddiest Point

I have no muddiest point or question.

Friday, November 21, 2008

Week 12 Reading Notes

Arms Chapter 6: Economic and Legal Issues

Arms contends that challenges in digital libraries cannot be solved through new laws or technology alone. Instead, social customs around information use should be established first.

There are two economic models for digital libraries: open access and payment systems. Open access libraries are usually funded by grants, advertising, the government, or private individuals. Payment systems are funded by annual or monthly subscriptions, hourly rates, rates based on the number of workstations providing access, rates based on the number of simultaneous users, or rates based on the number of digital objects accessed. Arms says subscription is the most popular payment form because it is predictable. Libraries prefer subscriptions because they allow for the widest, simplest access.

Arms discusses the frustrating cycle of scholarly publishing in which academic libraries must pay exorbitant prices for their own faculty's articles by going through middlemen like Elsevier. He blames universities for using the quantity of published articles as a basis for faculty promotion and awards. "Because prestige comes from writing many papers," writes Arms, "they have an incentive to write several papers that report slightly different aspects of a single piece of research."

Though most academic libraries use electronic databases for journal access, there are three main flaws in this system:
  • If the publisher goes out of business or changes its database coverage, the library is left with neither current articles nor archives.
  • In order to maintain profits, publishers sell database subscriptions in bundles, which forces libraries to subscribe to less-favored databases in order to secure a reasonable price on the essential ones.
  • Patrons' rights involving the use of articles are murky. Some restrictions would prevent users from printing or saving copies of articles for individual use.

Arms Chapter 7: Access Management and Security

Access management is the set of policies that authorize users to access digital libraries, whether open or fee-based. The two parties in access management are the information managers, who create and implement policies, and the users, who must authenticate their roles.

Users may authenticate their roles by what they know (such as a username and password), what they own (such as a smart-card), where they are (i.e., at a particular IP address) and who they are (via biometrics, for instance). Common user roles are determined by group association (such as a Pitt student), location (someone at a Carnegie Library of Pittsburgh computer), subscription (someone paying a monthly fee to an online journal), robotic use (a spider crawling the Web), or payment (a user is paying per view). Users may also be restricted in which actions they may perform and to what extent they may use a digital object.
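A minimal sketch of how such role assignment might look in code (my own illustration, not from Arms; the network range, user list, and policy table are all invented):

    import ipaddress

    # Hypothetical campus network and subscriber list (invented for illustration).
    CAMPUS_NET = ipaddress.ip_network("136.142.0.0/16")
    SUBSCRIBERS = {"jdoe"}

    def roles_for(user, client_ip):
        """Assign roles by 'where they are' (IP) and 'what they know' (account)."""
        roles = set()
        if ipaddress.ip_address(client_ip) in CAMPUS_NET:
            roles.add("on-site")      # location-based role
        if user in SUBSCRIBERS:
            roles.add("subscriber")   # subscription-based role
        return roles

    # A policy restricting what each role may do with a digital object.
    POLICY = {"on-site": {"view", "print"}, "subscriber": {"view", "print", "save"}}

    def permitted_actions(user, client_ip):
        actions = set()
        for role in roles_for(user, client_ip):
            actions |= POLICY.get(role, set())
        return actions

    print(permitted_actions("jdoe", "136.142.1.5"))  # {'view', 'print', 'save'} (order may vary)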

Enforcing access management policies is tricky: too many controls can turn off users, but too few controls may invite abuse. Some digital library operators choose fewer controls, knowing that happy users will be repeat customers and therefore compensate for profits lost to illegal users. Arms advocates displaying an access statement to users; in some cases such a statement carries the force of law.

Digital library security and authenticity can be ensured via firewalls, encryption, watermarking and digital signatures.
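To make the digital-signature piece concrete, here is a minimal sketch using the third-party Python cryptography package (my own example; Arms does not prescribe any particular algorithm):

    from cryptography.hazmat.primitives.asymmetric import ed25519
    from cryptography.exceptions import InvalidSignature

    document = b"Contents of a digital object"

    # The library signs the object with its private key...
    private_key = ed25519.Ed25519PrivateKey.generate()
    signature = private_key.sign(document)

    # ...and anyone holding the public key can verify authenticity later.
    public_key = private_key.public_key()
    try:
        public_key.verify(signature, document)          # passes: object is intact
        public_key.verify(signature, document + b"!")   # raises: object was altered
    except InvalidSignature:
        print("Document failed authenticity check")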

Lesk on Economics

Lesk links library economies to publishing industry economies and says that digital libraries will need new funding strategies in an era of skyrocketing information prices. Libraries have difficulty with funding, he says, because users do not appreciate the value of library holdings and operations.

Lesk mentions the idea of a library as a "buying club" where users are motivated to participate because sharing resources is cheaper than buying personal copies for permanent use. This is an interesting notion, but it paints the library as only an on-demand information source, not an institution upholding preservation and literacy. Publishers do not always win under the buying-club model, but it does seem best for the average user.

Another economic model that Lesk considers is transclusion, in which users must pay to see quotations in an article. In exchange for a fee, the user will be directed from the original work to the cited work. Such a model, at least for the average undergraduate or graduate researcher, is prohibitive. I doubt a freshman would pay to see such information; the fees would likely drive her to cheaper, less reputable sources.

Lesk discusses copyright law's adverse effects on users. Since the Berne Convention, works do not need to bear copyright notices, dates or authors. Thus, finding the work's author or owner may be impossible.

This article is outdated, so I wonder if what Lesk says about authors avoiding online publishing is still true. Are faculty today loath to publish in open access journals because their deans may not consider it tenure-worthy work?

One item in this chapter surprised me. I have seen lots of suggestions on how libraries can save money, but I had never seen advertising suggested as a way to get funding. How might this work? Instead of making a patron pay to see an online article, might he view a 5-second ad first? Or would library shelves have sponsor posters tacked to each end, just as corporate names line a football stadium? This suggestion is far more disturbing to me than resorting to pay-per-view research. Libraries are the last public places without advertising, and we should try to keep them that way.

Wednesday, November 19, 2008

Week 12 Muddiest Point

I have no muddiest point. I have a few questions:
  • Have the XML assignments been graded and returned? I have not received mine.
  • Regarding Greenstone: How do we divide our library into three collections? As of yet all of our content is in one big collection.
  • How can we edit the text on the front page, and can we edit the graphical user interface?
  • How can we add captions to image thumbnails?
  • Why do our title searches (and all other searches) give us 0 results?
  • How does one delimit Dublin Core elements? We tried commas and semi-colons unsuccessfully.

Wednesday, November 12, 2008

Week 11 Reading Notes

A Viewpoint Analysis of the Digital Library

Arms poses this question: "Should digital libraries be self-sufficient islands or should we strive for a single global digital library?" He lists three viewpoints from which to investigate it: the organizational (such as the Library of Congress), the technical, and the user's.

The first viewpoint emphasizes the individual library as the source of all knowledge. It does not focus on collaboration or interoperability with other libraries and their interfaces. This viewpoint is often ineffective for users.

The second viewpoint concerns the technical systems of digital libraries. Interoperability between structures and metadata is key, but users are not a focus. This viewpoint has produced many successes, such as XML, Z39.50 and MARC. Unfortunately, these advances are not used to their full potential.

The third viewpoint, the user's, is indifferent to the technological and organizational viewpoints. Interfaces are not adequately uniform from institution to institution, and organizations such as the Library of Congress may not even register in the user's awareness.

Arms advocates more study of the user's viewpoint. He suggests holistic evaluations in which the user accesses multiple libraries.

Social Aspects of Digital Libraries

This paper, the result of a workshop between UCLA and the National Science Foundation, asserts that digital library creators should be more concerned with their social context. The article states an obvious but often ignored goal: "digital libraries should be constructed in a way that accommodates the actual tasks and activities that people engage in when they create, seek, and use information resources."

Users, if unimpressed by institutional digital libraries, "will construct digital libraries on their own behalf." Thus, digital library creators should follow one of the basic tenets of Website portal design: allow the interface and contents to be organized according to individual user preferences.

The Information Life Cycle has three stages: creation (when the digital object is active - being created, modified, and indexed), searching (when the digital object is semi-active - being stored and distributed), and utilization (when the digital object is inactive - being discarded or mined).

Some of the issues that stand out to me are:
  • how to facilitate information sharing across multiple user communities
  • how to describe and organize content in flux (such as Web sites)
  • when to use human versus automated indexing, given human indexing's cost and time
  • whether to create a single interface for a library, or different interfaces that are more useful for different groups (such as a simpler interface for children and a highly manipulable, complex interface for academicians)

The conclusion of this report is similar to Arms': user interaction with digital libraries needs to be studied further, especially among different cultural groups.

The Infinite Library

This article evaluates Google Book Search rather objectively.

Some concerns include:
  • entrusting global literary heritage to a corporation
  • libraries devoid of physical content; libraries as lonely shells for preservationists
  • libraries' inability to share their digital copies of scanned books with anyone but Google
  • handwritten texts that are unsearchable via OCR
  • digitizing books in the less stringent Internet Archive instead of Book Search

Some benefits include:
  • increased need for librarians to help guide patrons through the morass of online text
  • global access for books previously available only in noncirculating or restricted libraries

My personal concern is the sunny belief that Google is a "good citizen," as librarian John Wilkin puts it. Google is relatively good now, but that guarantees nothing for the future. Like civilizations, companies rise and fall, go bankrupt, get bought out by conglomerates, and so on. Google's collaboration with the Chinese government to censor public Internet access does not qualify as good citizenship.

Week 11 Muddiest Point and Question

I have no muddiest point, but I have a question about Greenstone. How can I upload files into Greenstone? My understanding is that under the Gather tab, I can choose Local Filespace, choose a drive and file. When I try to drag and drop this file to the right, under Collection, nothing happens. Some students have expressed interest in dedicating some class time to a brief Greenstone tutorial about, for instance, uploading files. Is this possible?

Friday, October 24, 2008

Week 10 Reading Notes

Digital Library Design for Usability

This article outlines five models of computer systems design. The authors find several of these models lacking. The most successful design elements gleaned from these five models include:
  • Learnability: the user can start using a digital library quickly without picking up a lot of new skills.
  • Memorability: the user can remember how to use the library after a significant length of time.
  • The user should be able to recover from errors.
  • The user should be able to save search results or search paths for later use.
  • Users within an organization should be able to get training and guidance on using the library.
  • Library prototypes should be tested on end users and revised before the final product is released.
  • Proactive "agents" that know a user's preferences can alert her to new items of interest.

Evaluation of Digital Libraries: An Overview

Saracevic's typo-riddled article points out that evaluation of digital libraries, especially commercial libraries, is rare. When digital libraries are evaluated, a systems-centered approach is most common. Human- and user-centered approaches are less common. To me this is problematic; if digital libraries are used by humans, their needs should be evaluated first. Perhaps this is why Saracevic notes that "users have many difficulties with digital libraries" such as ignorance of the library's capabilities and scope.

I strongly disagree with his assertion that "it may be too early in the evolution of digital libraries for evaluation." Even when he wrote his article in 2004, many digital libraries were in existence. Now there are even more, and the number is growing all the time. Institutions spend hundreds of thousands of dollars on commercial digital libraries alone, so they should have some evaluation results on which to base their funding allocations.

Arms Chapter 8: User Interfaces and Usability

Arms details some reasons for the disconnect between end users and digital libraries. First, the interfaces, collections, and services in digital libraries change constantly, but the user adapts slowly. This can cause much frustration. Second, digital libraries were initially used primarily by experts who understood what they were using. Now that the Internet is nearing ubiquity, fewer digital library end users are experts. They "do not want to spend their own time learning techniques that may be of transitory value." Thus digital libraries must be accessible to both skilled and unskilled end users.

Arms lists four parts of a digital library's conceptual model: interface design, functional design, data and metadata, and computer systems and networks.

Several points stood out to me:

  • To increase space on the screen for content, remove on-screen manipulation buttons and have the user navigate with keystrokes.
  • Structural metadata is required to relate page sequence with actual page numbers. The page number in the original document rarely matches the sequence of the digital version, since prefaces and tables of contents are seldom numbered.
  • To reduce the time of page loading, data can be sent to the user's computer before she requests it. If she is viewing page 6, for instance, the computer can "pre-fetch" page 7 in the meantime.
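A toy version of that pre-fetching idea, as I understand it (my own sketch; fetch_page is a hypothetical stand-in for an expensive network request):

    import threading

    def fetch_page(n):
        # Hypothetical stand-in for retrieving one page over the network.
        return f"contents of page {n}"

    cache = {}

    def show_page(n):
        page = cache.pop(n, None) or fetch_page(n)
        # Pre-fetch the next page in the background while the user reads this one.
        def prefetch():
            cache[n + 1] = fetch_page(n + 1)
        threading.Thread(target=prefetch, daemon=True).start()
        return page

    print(show_page(6))   # fetches page 6 now, warms the cache with page 7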

Some of Arms' suggestions for digital libraries:

  • They should be accessible from anywhere on the Internet.
  • The interface should be extensible.
  • Content should be accessible indefinitely. (This tenet seems under threat by copyright laws and DRM.)
  • Interfaces should be customizable.
  • Spatial representations of library content can aid the user's memory and increase access.
  • Interfaces should have consistent appearance, controls, and function.
  • The interfaces should provide feedback to users about what is happening.
  • Users should be able to stop an action or return to a previous state.
  • There should be several ways for the user to complete the same task; some routes can be simple for the novice user while some routes can be faster for experts.
  • Interfaces should be accessible regardless of a user's computer display preferences, Internet speed, or operating system.
  • Caching and mirroring should be used to reduce delays in information transfer over the Net. Through mirroring, the user accesses the content closest to her, though it may be stored on several servers around the globe.
  • Summarize the user's choices.

Thursday, October 23, 2008

Week 10 Muddiest Point and Question

I have no muddiest point for this week. My question is about the final demonstration of my group's digital library. Can we demonstrate the library to Dr. He any time during the final week of the semester? When can we start scheduling the demonstration?

Friday, October 17, 2008

Week 8 Reading Notes

The Truth About Federated Searching

I do not completely trust the veracity of this article because it comes from a private corporation that sells federated search technology.

  • Federated searching is a Web-based search of multiple databases.
  • User authentication can be problematic. Federated searches should be available to patrons on-site as well as off-site.
  • True de-duping, which eliminates all duplicate search results, is a myth: for all identical records to be dismissed, the search engine would have to spend hours processing. However, it seems wise to de-dupe if only to eliminate duplicates in the initial result sets (a naive version is sketched after this list).
  • Relevancy ranking is based only on words in the citation, not in the abstract, index, or full text, so it ignores many crucial keywords. This suggests it is important to skim several pages of results instead of assuming the first 5 or 10 will be most useful.
  • "You can't get better results with a federated search engine than you can with the native database search." So why use a federated search? It still saves time to search several databases at once than to seek out the portals of individual databases. That is, unless you know of one or two individual databases that consistently provide you with results relevant to your topic.
  • One of the myths Hane debunks, that federated searching is software, is actually good news: because federated searching is a service, not software, libraries don't need to update database translators themselves on a daily basis.
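As a concrete (if naive) illustration of the de-duping bullet above, here is my own sketch; real engines use fuzzier matching, and none of this is drawn from the article:

    import re

    def dedupe_key(record):
        """Normalize title + year so trivially different records collide."""
        title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower()).strip()
        return (title, record["year"])

    def dedupe(results):
        seen, unique = set(), []
        for record in results:
            key = dedupe_key(record)
            if key not in seen:
                seen.add(key)
                unique.append(record)
        return unique

    hits = [{"title": "Digital Libraries.", "year": 2004},
            {"title": "DIGITAL LIBRARIES", "year": 2004}]
    print(len(dedupe(hits)))  # 1: the two variants collapse to one record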

Federated Searching: Put It in Its Place

Miller's article supports using federated searching in conjunction with library catalogs. In many ways, his 2004 article is woefully outdated. The University of Pittsburgh's digital libraries are already searchable through a Google-like search box; this engine is called Zoom! and is much slower and less reliable than Google. On the other hand, the article is still relevant and still being ignored. The University of Pittsburgh's actual catalog (as well as the Carnegie Library's catalog, and many others) does not have a Google-like interface. Users type a search term but also choose limitations such as title, author or location. I find this useful and easy, but for a generation raised on Google, library catalogs probably need to evolve.

I agree with Miller that "Amazon ... has become a de facto catalog of the masses." When I was a reference assistant, finding a specific title for a patron was much easier in Amazon than on the library's own catalog. Amazon's visual aspects, ease of searching, patron reviews and reader's advisory were usually preferable. For research, I often consult WorldCat before searching a local catalog.

Search Engine Technology and Digital Libraries

Lossau confronts the problem of including non-indexed Web content in library searches. First, much of the Web's content is not appropriate for the academic world because it lacks authenticity and authoritativeness. Second, much of this information changes constantly. Also, the content is not guaranteed to persist, nor to remain at the same location.

Much of the deep web, however, contains highly authoritative information, especially regarding the sciences. To access this, Lossau suggests a search index with a customizable interface that can mesh with existing local portals. Users should be able to display their result sets in a variety of ways. Automatically extracted metadata can improve access to useful but difficult-to-find materials that have not been indexed by a human.

Week 8 Muddiest Point and Questions

I have no muddiest point or question for this week.

Wednesday, October 8, 2008

Week 7 Reading Notes

Lesk's Understanding Digital Libraries Chapter 4

Audio
  • two types of formatting: high quality for music, low quality for voice
  • music recordings are more difficult to store than voice recordings
  • searching databases of sound recordings is difficult
  • accelerating voice recordings saves space and listener's time, but eventually degrades quality by altering pitch
  • one way of organizing voice recordings is for computers to detect sentences and paragraphs by analyzing pauses

Images
  • GIF: lossless on bitonal images, lossy on color, good for limited-color images such as CAD drawings, not good for photos
  • you can improve GIF compression performance by reducing the number of colors to 64
  • JPEG: has "generally stronger algorithm than GIF", not as good for computer-generated images
  • dithering approximates gray with variations in black and white dots; dithering can improve image quality but increases size
  • images are difficult to classify and index; image thesauri help somewhat; images are often labeled by artist, title, and size; experimental image searching is now done by color, shape and texture (a color-based sketch follows this list)
  • accessibility tags aid in online image searching
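To get a crude feel for color-based searching, here is my own sketch using the third-party Pillow library (the file names are hypothetical):

    from PIL import Image

    def color_histogram(path):
        img = Image.open(path).convert("RGB").resize((64, 64))
        hist = img.histogram()                 # 256 bins per R, G, B channel
        total = sum(hist)
        return [count / total for count in hist]

    def similarity(h1, h2):
        """Histogram intersection: 1.0 means identical color distributions."""
        return sum(min(a, b) for a, b in zip(h1, h2))

    query = color_histogram("query.jpg")       # hypothetical image files
    best = max(["a.jpg", "b.jpg"],
               key=lambda p: similarity(query, color_histogram(p)))
    print(best)                                # the most similarly colored image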

Video
  • storage is a challenge; TV footage should be compressed
  • video footage can be accessed more easily if it is divided into sections and labeled by sample images
  • searching is difficult; closed captioning tracks aid in television footage searches

Hawking's "Web Search Engines"

Because search engines do not have the time or power to search every page on the Web, they use several techniques to cull the most relevant pages.
  • crawling algorithms: "seed" URLs point to groups of reputable sites; duplicate content is identified and ignored; priority queues give preference to sites that are changed often, have many incoming links and are clicked often
  • spam: webmasters' insertion of false keywords is no longer effective; artificial linking and cloaking (showing content to the crawler that isn't shown to the user) are very effective; crawlers create blacklists of spamming URLs
  • indexing: search engines create dictionaries of query terms; multiple machines work to index pages; precomputed lists of common search queries increase processing speed; anchor text is generally reliable and assists in indexing
  • ranking: attributes such as URL brevity improve page's score
  • caching: search engine stores results of popular queries

I was surprised that this article mentions "the" as a valid search term for Web search engines. This usually is an invalid or ignored term in library catalog queries.
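A toy inverted index makes the "dictionaries of query terms" idea concrete; note that nothing stops it from indexing a stopword like "the" (my own sketch, not from the article):

    from collections import defaultdict

    docs = {1: "the cat sat", 2: "the dog barked", 3: "cats and dogs"}

    # Build a dictionary mapping each term to the documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    print(index["the"])                  # {1, 2}: "the" is a valid term here
    print(index["cat"] & index["the"])   # simple AND query: {1}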

Henzinger's "Challenges in Web Search Engines"

This article suggests several ways to improve search engine performance:
  • avoiding spam, such as link farms (groups of links to all pages on a site), doorway pages (pages of links intended to attract search engine attention) and cloaking
  • identifying page quality by tracking user clicks
  • finding out when web conventions (such as anchor text, hyperlinks and META tags) are being violated
  • avoiding duplicate content and duplicate hosts by paying attention to DNS semantics and recognizing generally equivalent host names (such as the .co.uk and .com versions of a site); a standard near-duplicate test is sketched after this list
  • being wary of vaguely structured data: examining HTML tags to give preference to certain page sections, examining layout to judge a page's worthiness, and dismissing pages with egregious markup mistakes
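One standard near-duplicate test is shingling: slide a window of k consecutive words over each page and compare the resulting sets. This is my own sketch of the general technique, not necessarily the variant Henzinger describes:

    def shingles(text, k=3):
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def jaccard(a, b):
        """Set overlap in [0, 1]; near-duplicate pages score high."""
        return len(a & b) / len(a | b)

    page1 = "the quick brown fox jumps over the lazy dog"
    page2 = "the quick brown fox leaps over the lazy dog"
    print(jaccard(shingles(page1), shingles(page2)))  # 0.4: substantial overlap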

Week 7 Muddiest Point

I have no muddiest point. My question is about META tags, introduced in the Henzinger article from 2002. At that time, META tags were "the primary way to include metadata within HTML." Are META tags still used now that XML is well known? How do they compare to XML?

Wednesday, October 1, 2008

Week 6 Reading Notes

Researching Challenges in Digital Archiving and Long-term Preservation

Hedstrom's paper was disheartening. She maintains that there is tremendous work yet to be done in ensuring preservation through digital libraries. There is no date on her paper; perhaps the situation has improved in the years since she wrote it.

She outlines five concerns:
  1. We cannot manage and preserve digital libraries fast enough; we need either more manpower, more money, or more automated tasks.
  2. Uninterrupted preservation of digital objects into the indefinite future is a great challenge, since software and hardware require frequent migration.
  3. Lack of precedents and models means more knowledge is needed on legal issues, cost-benefit analysis, etc.
  4. Better technology is needed, especially to automatically write, extract, restructure and manage metadata.
  5. Networks and templates are needed to encourage standardization and interoperability among digital libraries.

Actualized Preservation Threats

Littman's article was not much more encouraging than Hedstrom's. However, it is useful that Littman is publicizing potential pitfalls in executing a digital repository. The major problems were media failure (such as portable hard drive issues), hardware failure (data loss and service disruption due to hard drive failure), software failure (including METS and XML issues), and operator errors (mistakes caused by humans).

Littman mentioned that metadata was encoded as METS, MODS, MARCXML, PREMIS, and MIX. Why were so many schemas used? Does this increase interoperability with other libraries?

"Ingest of digitized newspapers into the repository began while the repository was still under development," wrote Littman. "This is probably fairly common." Why is this common? Why were developers so pressed for time that they had to ingest before testing the repository? Is this a result of poor project planning and lack of deadline adherence?

The Open Archival Information System Reference Model

A forum of national space agencies formed this model from a desire to establish shared concepts and definitions for digital preservation and archiving. The open forum developed standards for a repository that would preserve and provide access to information.

They delineated several responsibilities of an Open Archival Information System:
  1. Define collection scope and motivate information owners to pass items to the archive.
  2. Obtain sufficient custody and intellectual property rights of the items.
  3. Define scope of primary user community.
  4. Be sure users can independently understand the items.
  5. Create preservation policies and procedures.
  6. Make the archive available to the intended community.
The archive has three parts - environment, functional components, and information objects.

Environment: management + producer + consumer
  • Management: creates and enforces policy, defines collection scope, conducts strategic planning
  • Producer: submits information and metadata, guided by a submission agreement
  • Consumer: information users, including designated community (smaller group of primary users who independently understand archived items)
Functional components: ingest, archival storage, data management, preservation planning, access, administration
  • Ingest: receives information, validates its completeness, extracts/creates metadata
  • Archival storage: ensures information is stored correctly, refreshes media, migrates between formats, checks for errors, creates disaster recovery plans
  • Data management: maintains metadata and databases
  • Preservation planning: creates preservation plan, keeps abreast of new storage and access technologies
  • Access: the user interface, helps users find and access information
  • Administration: coordinates operations of five previous services, monitors performance
Information objects:
  • Submission information package (SIP): the original ingested item
  • Dissemination information package (DIP): what the user accesses
  • Archival information package (AIP): descriptive information and packaging information (content information + preservation description information); this nesting is modeled in the sketch after this list
  1. Content information: content data object (the information itself) + representation information (what is needed to render the bit sequences intelligible)
  2. Preservation description information: reference (a unique identifier such as an ISBN) + provenance (history of the item's creation, owners, etc.) + context (relationship to other documents) + fixity (authenticity validation such as a digital signature or watermark)
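The AIP breakdown nests neatly, so here is how I might model it with Python dataclasses (my own sketch of the concepts above, not an official OAIS schema; the identifier values are placeholders):

    from dataclasses import dataclass

    @dataclass
    class ContentInformation:
        data_object: bytes      # the content itself (bit sequences)
        representation: str     # what is needed to render the bits, e.g. "PDF 1.4"

    @dataclass
    class PreservationDescription:
        reference: str          # unique identifier, e.g. an ISBN
        provenance: str         # history of creation and ownership
        context: str            # relationship to other documents
        fixity: str             # authenticity check, e.g. a checksum or signature

    @dataclass
    class ArchivalInformationPackage:
        content: ContentInformation
        pdi: PreservationDescription

    aip = ArchivalInformationPackage(
        ContentInformation(b"%PDF-1.4 ...", "PDF 1.4"),
        PreservationDescription("isbn:0000000000", "scanned 2008 by the library",
                                "supersedes first edition", "sha1:placeholder"),
    )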

Week 6 Muddiest Point and Question

I have no muddiest point.

My question regards the exam on November 3. Will we have a review or discussion prior to the exam? Is it true that there are examples of past exams for us to look at?

Friday, September 26, 2008

Week 5 Readings

Bryan's Introduction to XML
  • XML helps join documents, adds editorial comments, places images within text files
  • XML is not itself a predefined, standardized set of tags; it lets you define your own
  • Documents are made up of entities, which contain elements, which in turn may carry attributes
  • Unique identifiers provide cross-references between two points in a document
  • Text entities are shorthand for a full name; this makes for more efficient document editing (I assume it saves typing time, too). The entity and cross-reference ideas are illustrated in the sketch after this list.
  • XML documents are best stored in databases
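A small example helped me here: the internal DTD below declares a text entity, and the id attribute on one element with a matching ref on another illustrates the cross-referencing idea (my own illustration, parsed with the third-party lxml package):

    from lxml import etree

    doc = b"""<?xml version="1.0"?>
    <!DOCTYPE note [
      <!ENTITY pitt "University of Pittsburgh">
    ]>
    <note>
      <school id="s1">&pitt;</school>
      <dept ref="s1">School of Information Sciences</dept>
    </note>"""

    root = etree.fromstring(doc)          # lxml expands the text entity by default
    print(root.findtext("school"))        # University of Pittsburgh
    print(root.find("dept").get("ref"))   # s1, pointing back at the school element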

Some information in this article about tag sets is confusing.

Extending Your Markup: An XML Tutorial
  • XML: tells about content, "a semantic language that lets you meaningfully annotate text"
  • An ideal XML document starts with a prolog and has exactly one root element
  • Prolog: XML version + standalone (yes or no) + encoding + DTD declaration
  • Element = root of the document, can be nonterminal or terminal
  • DTDs: define document structure, specify tag sets, specify tag order, specify tag attributes, can be in XML document or separate
  • Element attributes are not mandatory; a declared attribute can be optional, required or fixed
  • Namespace: avoids confusion between names
  • XML schema and DTDs are still being perfected

After reading this article, information about DTD attributes, XML schema, and extending capabilities is still unclear to me.
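For my own reference, here is a minimal document showing the pieces named above: a prolog (version, encoding, standalone), an internal DTD defining the structure and a required attribute, and a single root element (my sketch, validated with the third-party lxml package):

    from lxml import etree

    doc = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <!DOCTYPE library [
      <!ELEMENT library (book+)>
      <!ELEMENT book (#PCDATA)>
      <!ATTLIST book year CDATA #REQUIRED>
    ]>
    <library>
      <book year="1998">Extending Your Markup</book>
    </library>"""

    parser = etree.XMLParser(dtd_validation=True)   # validate against the internal DTD
    root = etree.fromstring(doc.encode("utf-8"), parser)
    print(root[0].get("year"))                      # 1998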

W3Schools XML Schema Tutorial
  • XML Schema (XSD) can be used instead of DTDs and describes XML document structure
  • XSD defines elements, child elements, and attributes
  • Why is XSD preferable to DTDs? XSDs are extensible to future additions, support data types and support namespaces.
  • XSD supports cross-cultural communication because it ensures standard data types (e.g., date formats of YYYY-MM-DD)
  • When elements or attributes have defined data types, invalid types will not be accepted
  • Facets = restrictions on XML elements (i.e.: initials field can contain only 3 uppercase letters)
  • Seven indicators define order, occurrence and group

This tutorial mentions several data types. I understand date, time and decimal types, but I would like more clarification on string types. Does string just refer to basic text (not numbers, etc.)?
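Writing out the facets example suggests the answer: xs:string is indeed just text, and a pattern facet can restrict it, here to exactly three uppercase letters (my own sketch, validated with the third-party lxml package):

    from lxml import etree

    schema_doc = etree.fromstring(b"""
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="initials">
        <xs:simpleType>
          <xs:restriction base="xs:string">
            <xs:pattern value="[A-Z]{3}"/>
          </xs:restriction>
        </xs:simpleType>
      </xs:element>
    </xs:schema>""")

    schema = etree.XMLSchema(schema_doc)
    print(schema.validate(etree.fromstring(b"<initials>ABC</initials>")))   # True
    print(schema.validate(etree.fromstring(b"<initials>abcd</initials>")))  # False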

Week 5 Muddiest Point

I have no muddiest point for this week.

Friday, September 19, 2008

Week 4 Readings

Introduction to Metadata: Setting the Stage

Gilliland included many definitions and uses of metadata, expanding my previous understanding of the term. I previously defined metadata as information about a document that assisted in its organization and access. Her inclusion of preservation as a type of metadata conflicted with my definition because preservation notes don't really aid in accessing an item.

She writes that the variety of metadata schemas is "potentially bewildering." I agree. Is this myriad of choices desirable or would one standard expand access across languages, nations and formats? Will the best standard eventually evolve out of the slough of options? I have heard that the Semantic Web, through the use of XML, will provide a way of standardizing metadata for digital objects.

I found this very interesting: "One information object's metadata can simultaneously be another information object's data." I tried to think of an example and came up with a citation at the bottom of an article: the citation (author, title, format, etc.) is part of the citing article's data, yet it serves as metadata, a finding aid, for the work it cites.

Witten

I was puzzled by this sentence in section 2.2: "Most library users locate information by finding one relevant book on the shelves and then looking around for others in the same area." I do this myself, and it still seems relevant, especially for children or those not savvy with OPACs, but this idea of browsing is becoming outdated: more and more patrons are using OPACs to find materials.

Witten continued: "Most readers ... remain blithely unaware that there is any way of finding related books other than browsing library shelves." This is an important issue for librarians. If patrons don't know OPACs exist, it is our responsibility to educate them about all types of finding aids and access points. Especially in an age of full-text searching online, patrons may not realize that lists of subject headings (both in a digital library and in an OPAC) can be very useful.

Border Crossings: Reflection on a Decade of Metadata Consensus Building

I am interested in Weibel's comments on user-created metadata and his assertion that "almost nobody will spend the time." Yet aren't many people today creating metadata of their own volition through sites like del.icio.us? Also, if it is true that people won't tag for free, then why do Wikipedia users spend millions of hours editing articles for free?

I like Weibel's call for collaboration and consensus in creating and standardizing digital metadata. He notes that during an OCLC workshop, librarians and computer scientists were not on the same page. I encounter this idea often. What can be done to bring these two groups of professionals closer in understanding?

Weibel discusses the option of using indexed metadata terms versus full-text indexing, pointing out that neither has triumphed over the other. What are the advantages and disadvantages of each?

Week 4 Muddiest Point

I have a question about Gilliland's article. She wrote, "in any instance where it is crucial that metadata and content coexist, then it is recommended that the metadata become an integral part of the information object and not be stored elsewhere." What does this mean? When is it crucial for metadata and content to coexist?

Wednesday, September 10, 2008

Week 3 Readings

Identifiers and Their Role in Networked Information Applications

The information on URNs was new for me, but I have encountered PURLs many times when accessing articles from digital libraries. PURLs are very useful for ensuring citations remain valid. I wonder if OCLC is solely responsible for hosting PURLs or if digital libraries are hosting their own PURLs. For instance, when I see a PURL for an EBSCOHost article, is it related to OCLC or to EBSCOHost?
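Mechanically, a PURL is just an HTTP redirect: the persistent address answers with the resource's current location. A sketch using Python's standard library (the PURL shown is hypothetical):

    import urllib.request

    # A hypothetical persistent URL; the resolver answers with an HTTP redirect.
    purl = "http://purl.example.org/net/some-article"

    with urllib.request.urlopen(purl) as response:
        # urllib follows redirects, so .url is the current "real" location.
        print("resolves to:", response.url)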

I was also unaware that new Internet protocols will supersede HTTP. Is this happening now?

DOIs present some troubling issues for academic scholarship. Lynch wrote that the advent of DOIs is "likely to mean that the author of the citing work will need to obtain the DOI of the work that he or she wishes to cite either from the owner of the cited work or from some third party, and accessing a citation would then involve interaction with the DOI resolution service, raising privacy and control issues." I imagine this could discourage citations because of the time and effort required on the part of the citing individual. What if the cited author cannot be reached or found? Consider what this would mean for students making routine citations of authors' works. I agree with Lynch that "the act of reference should not rely upon proprietary databases or services."

Digital Object Identifier System

The structure of DOIs seems very well organized. The ingenious lack of rules about DOI length should allow its use far into the future, whereas the finite number of ISBNs and ISSNs is problematic.

This article does not mention any of Lynch's concerns. Paskin wrote much more recently; perhaps Lynch's concerns have since been resolved.

Arms, Chapter 9: Text

Arms wrote in 1999, "Optical character recognition remains an inexact process." This is still troubling digital library users today. I read several journal articles a week for my classes in the MLIS program and it is common to find a handful of mistakes in each article due to poor OCR. Often the context allows me to correct the mistake, but sometimes mistakes lead to confusion and lack of understanding. I am surprised that articles that must undergo such stringent peer review and editorial scrutiny can then be posted with flaws in expensive subscription databases. It is very interesting that outsourced manual typing has been cheaper than OCR combined with proofreading. I have heard that non-native English speakers are sometimes more accurate at English data entry than native English speakers, because they must pay close attention to each unknown character.

Arms describes three approaches to page description: TeX, PostScript and PDF. PDF seems to have cornered the market now; most digital libraries offer articles in PDF format. Have things changed drastically since 1999, or are TeX and PostScript still being used?

Week 3 Muddiest Point and Question

I have no muddy points.

After reading the Lynch article, I would like to see an example of a URN. Are they widely used now? What is their relation to digital libraries?

Sunday, September 7, 2008

Week 2 Readings

A Framework for Building Open Digital Libraries
Some of the key challenges of digital libraries are the lack of software toolkits to build them, the lack of interoperability between them, and the lack of planning in implementing them. I was interested in this quote: "Most DLs are intended to be quick solutions to urgent community needs - so not much thought goes into planning for future redeployment of the systems." I wonder if the community need is often determined by librarians and administrators while the digital library is built by IT professionals, which may lead to a disconnect between goals and results. The open digital library system appears to be a viable solution. It seems a national or international organization (as opposed to many fragmented organizations) could set simple guidelines for building digital libraries, including expectations for interoperability, so that patrons can find information easily and archives can be updated and combined with less effort.

Several terms in this paper, such as "overloaded semantics" and "purposeful orthogonality," were unfamiliar to me.

The Internet and the World Wide Web
Much of the information in this article was not new for me, but I was not aware of the Internet Engineering Task Force and the RFC series. The lack of hierarchies and bureaucracy is truly democratic and encouraging in an age where governments and corporations seek autonomous control of networks and resources. I am thankful for these talented volunteers who continually improve the Internet.

The Los Alamos E-Print Archives is another democratic success. Arms writes, "The user interfaces have been designed to minimize the effort required to maintain the archive." Is this ideal? I believe the user should always come first, and that those developing and maintaining the archives should do everything possible to encourage smooth access to articles. Perhaps the grants cannot fund the cost of heavy maintenance.

An Architecture for Information in Digital Libraries
This article is very readable. I agree with the definition of a good user interface: "it can provide unsophisticated users with flexible access to rich and complicated information." The three basic principles for information architecture are also very user supportive, while understanding that those maintaining the collection should not be burdened with routine tasks. Since I have never been responsible for a DL, I am curious to see what the typical archive maintenance entails, and how some DLs use automatic maintenance to reduce human updating.

What is a "legacy database"?

Saturday, September 6, 2008

Week 2 Muddiest Points and Question

I had several muddy points from the first class. The first concerns blog publishing. The PowerPoint for Week 1 says blogs should be posted by Friday night each week. In the CourseWeb syllabus it says "email responses for each week's readings should be submitted by midnight the Sunday before the class." I know we aren't supposed to email our comments, but when exactly should they be posted?

My second muddy point is about the meanings of digital libraries in the CS community (slide 18). What is an example of a networked multimedia information system, and how does it differ from the LIS version of the digital library as an online repository?

Third, are we supposed to have posted reading notes for Week 1's readings or begin with Week 2's readings?

My question for our next class is broad. "A Framework for Building Open Digital Libraries" and "Interoperability for Digital Objects and Repositories" went beyond my technological understanding in some parts. If my goal is to become a reference librarian or a general public librarian, how much of this will I need to comprehend to be successful? What aspects can I leave to an IT staff member and what aspects should I be able to handle myself?