Friday, September 26, 2008

Week 5 Readings

Bryan's Introduction to XML
  • XML helps join documents, add editorial comments, and place images within text files
  • XML is not itself a standardized, predefined set of tags (authors define their own)
  • Documents made up of entities, made up of elements, made up of attributes
  • Unique identifiers provide cross references between two document points
  • Text entities are shorthand for a full name; this makes for more efficient document editing (I assume it saves typing time, too)
  • XML documents are best stored in databases
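
To make the bullets above concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The tag names, the id/ref attributes and the &dl; entity are all invented for illustration; none of them come from Bryan's article.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names, ids and the entity are invented examples.
DOC = """<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ENTITY dl "digital library">
]>
<notes>
  <section id="s1"><title>Identifiers</title></section>
  <para ref="s1">A &dl; needs stable identifiers.</para>
</notes>"""

root = ET.fromstring(DOC)
para = root.find("para")

# The text entity &dl; is shorthand; the parser expands it to the full name.
print(para.text)  # A digital library needs stable identifiers.

# The ref attribute is a cross reference to the element with the matching id.
ref = para.get("ref")
target = root.find(f".//*[@id='{ref}']")
print(target.tag, target.find("title").text)
```

The entity is declared once at the top, so changing "digital library" there would change every place &dl; is used.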

Some information in this article about tag sets is confusing.

Extending Your Markup: An XML Tutorial
  • XML: tells about content, "a semantic language that lets you meaningfully annotate text"
  • Ideal XML document starts with a prolog and has one root element
  • Prolog: XML version + standalone (yes or no) + encoding + DTD declaration
  • Element = root of the document, can be nonterminal or terminal
  • DTDs: define document structure, specify tag sets, specify tag order, specify tag attributes, can be in XML document or separate
  • Element attributes: not required; can be optional, required or fixed
  • Namespace: avoids confusion between names
  • XML schema and DTDs are still being perfected
After reading this article, information about DTD attributes, XML schema, and extending capabilities is still unclear to me.
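
A rough sketch of the pieces listed above (prolog, DTD, attributes, namespaces), again with Python's standard library. Every tag name, attribute name and namespace URL below is a made-up example. Note that ElementTree only checks that the document is well formed; it does not enforce the DTD, so this shows what the declarations look like rather than how validation works.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; bytes are used so the encoding declaration is honored.
DOC = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!DOCTYPE reading [
  <!ELEMENT reading (title, note+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT note (#PCDATA)>
  <!ATTLIST reading week CDATA #REQUIRED>
  <!ATTLIST note author CDATA #IMPLIED>
  <!ATTLIST note lang CDATA #FIXED "en">
]>
<reading week="5">
  <title>Extending Your Markup</title>
  <note>DTDs define structure.</note>
  <note author="me">Namespaces avoid name clashes.</note>
</reading>"""

root = ET.fromstring(DOC)  # well-formedness check only, no DTD validation
print(root.tag, root.get("week"))

# "reading" is nonterminal (it contains elements); "title" is terminal.
print(len(root), len(root.find("title")))

# Two different "title" tags kept apart by (invented) namespaces.
NS_DOC = """<shelf xmlns:bk="http://example.org/book"
       xmlns:job="http://example.org/job">
  <bk:title>Digital Libraries</bk:title>
  <job:title>Reference Librarian</job:title>
</shelf>"""

ns = {"bk": "http://example.org/book", "job": "http://example.org/job"}
shelf = ET.fromstring(NS_DOC)
print(shelf.find("bk:title", ns).text)
print(shelf.find("job:title", ns).text)
```

In the DTD, #REQUIRED, #IMPLIED and #FIXED correspond to the required, optional and fixed attributes mentioned in the notes.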

W3Schools XML Schema Tutorial
  • XML Schema (XSD) can be used instead of DTDs and describes XML document structure
  • XSD defines elements, child elements, and attributes
  • Why is XSD preferable to DTDs? It is extensible to future additions, supports data types and supports namespaces.
  • XSD supports cross-cultural communication because it ensures standard data types (e.g., date formats of YYYY-MM-DD)
  • When elements or attributes have defined data types, invalid types will not be accepted
  • Facets = restrictions on XML elements (e.g., an initials field can contain only 3 uppercase letters)
  • Seven indicators define order, occurrence and group
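
A small sketch of the data type and facet ideas above. Python's standard library cannot validate against an XSD, so this only reads the pattern facet out of a hypothetical schema fragment and imitates the checks by hand; a real validator would come from a third-party library, which I am not assuming here. The element names "initials" and "posted" are invented.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment, modeled on the "initials" facet in the notes.
XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="initials">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:pattern value="[A-Z][A-Z][A-Z]"/>
      </xs:restriction>
    </xs:simpleType>
  </xs:element>
  <xs:element name="posted" type="xs:date"/>
</xs:schema>"""

# A schema is itself an XML document, so it can be parsed like any other.
schema = ET.fromstring(XSD)
pattern = schema.find(f".//{XS}pattern").get("value")


def initials_ok(value: str) -> bool:
    # An XSD pattern must match the whole value, hence fullmatch.
    return re.fullmatch(pattern, value) is not None


def date_ok(value: str) -> bool:
    # Rough stand-in for xs:date: YYYY-MM-DD shape and a real calendar date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


print(initials_ok("ABC"), initials_ok("abc"), initials_ok("ABCD"))
print(date_ok("2008-09-26"), date_ok("09/26/2008"))
```

The point is the one in the notes: once a type or facet is declared, values of the wrong shape are rejected instead of silently accepted.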

This tutorial mentions several data types. I understand date, time and decimal types, but I would like more clarification on string types. Does string just refer to basic text (not numbers, etc.)?
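
One way to think about the string question, as far as I understand the XSD built-in types: a string is any sequence of characters, digits included, so "2008" is a perfectly good string. What changes with a type like decimal is which values are accepted. The sketch below is a hand-written approximation in Python, not a real schema validator, and the function name valid_as is my own.

```python
import re

# Lexical form of xs:decimal: optional sign, digits, optional fractional part.
DECIMAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def valid_as(xsd_type: str, value: str) -> bool:
    """Very rough check of a value against two XSD built-in types."""
    if xsd_type == "string":
        return True  # any sequence of characters is acceptable
    if xsd_type == "decimal":
        return DECIMAL.fullmatch(value) is not None
    raise ValueError(f"type not covered by this sketch: {xsd_type}")


for value in ["Week 5", "2008", "3.14"]:
    print(value, valid_as("string", value), valid_as("decimal", value))
```

All three values pass as strings, but only "2008" and "3.14" pass as decimals.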




Week 5 Muddiest Point

I have no muddiest point for this week.

Friday, September 19, 2008

Week 4 Readings

Introduction to Metadata: Setting the Stage

Gilliland included many definitions and uses of metadata, expanding my previous understanding of the term. I previously defined metadata as information about a document that assisted in its organization and access. Her inclusion of preservation as a type of metadata conflicted with my definition because preservation notes don't really aid in accessing an item.

She writes that the variety of metadata schemas is "potentially bewildering." I agree. Is this myriad of choices desirable or would one standard expand access across languages, nations and formats? Will the best standard eventually evolve out of the slew of options? I have heard that the Semantic Web, through the use of XML, will provide a way of standardizing metadata for digital objects.


I found this very interesting: "One information object's metadata can simultaneously be another information object's data." I tried to think of an example of this. I came up with a citation at the bottom of an article. A citation (author, title, format, etc.) is metadata used as a finding aid.


Witten

I was puzzled by this sentence in section 2.2: "Most library users locate information by finding one relevant book on the shelves and then looking around for others in the same area." I do this myself, and it still seems relevant, especially for children or those not savvy with OPACs, but I think this idea of browsing is outdated. I think more and more patrons are using OPACs to find materials.

Witten continued: "Most readers ... remain blithely unaware that there is any way of finding related books other than browsing library shelves." This is an important issue for librarians. If patrons don't know OPACs exist, it is our responsibility to educate them about all types of finding aids and access points. Especially in an age of full-text searching online, patrons may not realize that lists of subject headings (both in a digital library and in an OPAC) can be very useful.

Border Crossings: Reflection on a Decade of Metadata Consensus Building

I am interested in Weibel's comments on user-created metadata and his assertion that "almost nobody will spend the time." Yet aren't many people today creating metadata of their own volition through sites like del.icio.us? Also, if it is true that people won't tag for free, then why do Wikipedia users spend millions of hours editing articles for free?

I like Weibel's call for collaboration and consensus in creating and standardizing digital metadata. He notes that during an OCLC workshop, librarians and computer scientists were not on the same page. I encounter this idea often. What can be done to bring these two groups of professionals closer in understanding?

Weibel discusses the option of using indexed metadata terms versus full-text indexing, pointing out that neither has triumphed over the other. What are the advantages and disadvantages of each?

Week 4 Muddiest Point

I have a question about Gilliland's article. She wrote, "in any instance where it is crucial that metadata and content coexist, then it is recommended that the metadata become an integral part of the information object and not be stored elsewhere." What does this mean? When is it crucial for metadata and content to coexist?

Wednesday, September 10, 2008

Week 3 Readings

Identifiers and Their Role in Networked Information Applications

The information on URNs was new for me, but I have encountered PURLs many times when accessing articles from digital libraries. PURLs are very useful for ensuring citations remain valid. I wonder if OCLC is solely responsible for hosting PURLs or if digital libraries are hosting their own PURLs. For instance, when I see a PURL for an EBSCOHost article, is it related to OCLC or to EBSCOHost?


I was also unaware that new Internet protocols will supersede HTTP. Is this happening now?


DOIs present some troubling issues for academic scholarship. Lynch wrote that the advent of DOIs is "likely to mean that the author of the citing work will need to obtain the DOI of the work that he or she wishes to cite either from the owner of the cited work or from some third party, and accessing a citation would then involve interaction with the DOI resolution service, raising privacy and control issues." I imagine this could discourage citations because of the time and effort required on the part of the citing individual. What if the cited author cannot be reached or found? Consider what this would mean for students making routine citations of authors' works. I agree with Lynch that "the act of reference should not rely upon proprietary databases or services."


Digital Object Identifier System

The structure of DOIs seems very well organized. The ingenious lack of rules about DOI length should allow its use far into the future, whereas the finite number of ISBNs and ISSNs is problematic.
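
To illustrate the structure: a DOI name is a prefix beginning with "10." and a suffix of any length, separated by a slash. The sketch below splits the two apart. The function name split_doi is my own, and the second example DOI is invented just to show a long suffix.

```python
from typing import Tuple


def split_doi(doi: str) -> Tuple[str, str]:
    """Split a DOI name into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    if not (prefix.startswith("10.") and sep and suffix):
        raise ValueError(f"not a DOI name: {doi!r}")
    return prefix, suffix


# The suffix has no fixed length and may itself contain slashes.
print(split_doi("10.1000/182"))
print(split_doi("10.1000/an-invented.suffix/of-any-length-2008"))
```

Because only the first slash is treated as the separator, the registrant is free to structure the suffix however it likes.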

This article does not mention any of Lynch's concerns. Paskin wrote very recently; perhaps Lynch's issues have since been resolved.


Arms, Chapter 9: Text

Arms wrote in 1999, "Optical character recognition remains an inexact process." This is still troubling digital library users today. I read several journal articles a week for my classes in the MLIS program and it is common to find a handful of mistakes in each article due to poor OCR. Often the context allows me to correct the mistake, but sometimes mistakes lead to confusion and lack of understanding. I am surprised that articles that must undergo such stringent peer review and editorial scrutiny can then be posted with flaws in expensive subscription databases. It is very interesting that outsourced manual typing has been cheaper than OCR combined with proofreading. I have heard that non-native English speakers are sometimes more accurate at English data entry than native English speakers, because they must pay close attention to each unknown character.

Arms speaks about three approaches to page description: TeX, PostScript and PDF. PDF seems to have cornered the market now; most digital libraries offer articles in PDF format. Have things changed drastically since 1999, or are TeX and PostScript still being used?

Week 3 Muddiest Point and Question

I have no muddy points.

After reading the Lynch article, I would like to see an example of a URN. Are they widely used now? What is their relation to digital libraries?
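
For reference, the general shape of a URN, as I understand it, is "urn:" followed by a namespace identifier and then a namespace-specific string, for example urn:isbn:0451450523 or urn:ietf:rfc:2141. A minimal Python sketch that takes one apart (parse_urn is my own name, and this does not check the finer points of the syntax):

```python
from typing import Tuple


def parse_urn(urn: str) -> Tuple[str, str]:
    """Split a URN into (namespace identifier, namespace-specific string)."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a URN: {urn!r}")
    # The namespace identifier is case-insensitive, so normalize it.
    return parts[1].lower(), parts[2]


print(parse_urn("urn:isbn:0451450523"))
print(parse_urn("urn:ietf:rfc:2141"))
```

Unlike a URL, nothing in the name says where the item lives; a separate resolution service has to map the name to a location.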

Sunday, September 7, 2008

Week 2 Readings

A Framework for Building Open Digital Libraries
Some of the key challenges of digital libraries are the lack of software toolkits to build them, the lack of interoperability between them, and lack of planning in implementing them. I was interested in this quote: "Most DLs are intended to be quick solutions to urgent community needs - so not much thought goes into planning for future redeployment of the systems." I wonder if the community need is often determined by librarians and administrators while the digital library is built by IT professionals, which may lead to a disconnect between goals and results. The open digital library system appears to be a viable solution. It seems a national or international organization (as opposed to many fragmented organizations) could set simple guidelines for building digital libraries, including expectations for interoperability, so that patrons can find information easily and archives can be updated and combined with less effort.

Several terms in this paper, such as "overloaded semantics" and "purposeful orthogonality," were unfamiliar to me.

The Internet and the World Wide Web
Much of the information in this article was not new for me, but I was not aware of the Internet Engineering Task Force and the RFC series. The lack of hierarchies and bureaucracy is truly democratic and encouraging in an age where governments and corporations seek autonomous control of networks and resources. I am thankful for these talented volunteers who continually improve the Internet.

The Los Alamos E-Print Archives is another democratic success. Arms writes, "The user interfaces have been designed to minimize the effort required to maintain the archive." Is this ideal? I believe the user should always come first, and that those developing and maintaining the archives should do everything possible to encourage smooth access to articles. Perhaps the grants cannot fund the cost of heavy maintenance.

An Architecture for Information in Digital Libraries
This article is very readable. I agree with the definition of a good user interface: "it can provide unsophisticated users with flexible access to rich and complicated information." The three basic principles for information architecture are also very user supportive, while understanding that those maintaining the collection should not be burdened with routine tasks. Since I have never been responsible for a DL, I am curious to see what the typical archive maintenance entails, and how some DLs use automatic maintenance to reduce human updating.

What is a "legacy database"?

Saturday, September 6, 2008

Week 2 Muddiest Points and Question

I had several muddy points from the first class. The first concerns blog publishing. The PowerPoint for Week 1 says blogs should be posted by Friday night each week. In the CourseWeb syllabus it says "email responses for each week's readings should be submitted by midnight the Sunday before the class." I know we aren't supposed to email our comments, but when exactly should they be posted?

My second muddy point is about the meanings of digital libraries in the CS community (slide 18). What is an example of a networked multimedia information system, and how does it differ from the LIS version of the digital library as an online repository?

Third, are we supposed to have posted reading notes for Week 1's readings or begin with Week 2's readings?

My question for our next class is broad. "A Framework for Building Open Digital Libraries" and "Interoperability for Digital Objects and Repositories" went beyond my technological understanding in some parts. If my goal is to become a reference librarian or a general public librarian, how much of this will I need to comprehend to be successful? What aspects can I leave to an IT staff member and what aspects should I be able to handle myself?