Wednesday, October 8, 2008

Week 7 Reading Notes

Lesk's Understanding Digital Libraries Chapter 4

Audio
  • two types of formats: high quality for music, lower quality for voice
  • music recordings are more difficult to store than voice recordings
  • searching databases of sound recordings is difficult
  • accelerating voice recordings saves space and the listener's time, but naive speed-up raises the pitch and eventually degrades intelligibility (the first sketch after this list shows why)
  • one way of organizing voice recordings is for computers to detect sentence and paragraph boundaries by analyzing pauses (a silence-detection sketch follows)
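
To see why simple acceleration shifts pitch, here is a rough sketch of my own (not Lesk's): playing samples back faster compresses the waveform in time, which raises every frequency by the same factor. Real systems use pitch-preserving time-scale modification instead; the function below is the naive approach.

    import numpy as np

    def naive_speedup(samples: np.ndarray, factor: float) -> np.ndarray:
        """Speed up audio by naive resampling (assumes float samples).
        A 1.5x speedup raises every frequency, and so the pitch,
        by 1.5x -- roughly seven semitones."""
        positions = np.arange(0, len(samples), factor)
        return np.interp(positions, np.arange(len(samples)), samples)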
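
And a hedged sketch of the pause-analysis idea: frames whose loudness stays below a threshold for long enough are treated as sentence or paragraph boundaries. The frame length, silence threshold, and minimum pause length are my guesses, not values from the chapter.

    import numpy as np

    def split_on_pauses(samples, rate, frame_ms=30, silence_rms=0.01, min_pause_s=0.6):
        """Return (start, end) sample indices of speech segments, splitting
        wherever the signal stays below silence_rms for min_pause_s.
        Assumes float samples in [-1, 1]."""
        frame = int(rate * frame_ms / 1000)
        n_frames = len(samples) // frame
        # Per-frame loudness (root mean square).
        rms = np.array([np.sqrt(np.mean(samples[i * frame:(i + 1) * frame] ** 2))
                        for i in range(n_frames)])
        quiet = rms < silence_rms
        segments, start, quiet_run = [], None, 0
        for i, q in enumerate(quiet):
            if not q and start is None:
                start = i * frame                        # speech begins
            quiet_run = quiet_run + 1 if q else 0
            if start is not None and quiet_run * frame >= rate * min_pause_s:
                end = (i + 1 - quiet_run) * frame        # where the pause began
                segments.append((start, end))            # long pause: close segment
                start, quiet_run = None, 0
        if start is not None:
            segments.append((start, n_frames * frame))
        return segments
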
Images
  • GIF: lossless on bitonal and limited-palette images, effectively lossy on full color because the format is limited to 256 colors; good for limited-color images such as CAD drawings, not good for photos
  • you can improve GIF compression performance by reducing the number of colors, e.g. to 64 (the first sketch after this list does exactly this)
  • JPEG: has a "generally stronger algorithm than GIF", but is not as good for computer-generated images
  • dithering approximates grays and intermediate colors with patterns of black-and-white (or palette-color) dots; dithering can improve perceived image quality but increases file size
  • images are difficult to classify and index; image thesauri help somewhat; images are often labeled by artist, title, and size; experimental image search is now done by color, shape, and texture (the second sketch after this list tries this with color histograms)
  • accessibility tags (such as alt text) aid in online image searching
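
The color-reduction and dithering bullets can be demonstrated together with Pillow (my sketch, not a procedure from the chapter): quantizing to a 64-color palette shrinks the file, and Pillow applies Floyd-Steinberg dithering by default, scattering palette-color dots to approximate the lost tones. The filenames are placeholders.

    from PIL import Image

    img = Image.open("photo.png").convert("RGB")
    # Quantize to a 64-color adaptive palette; Floyd-Steinberg
    # dithering (Pillow's default) scatters the remaining colors
    # to mimic the tones the smaller palette can no longer represent.
    small_palette = img.quantize(colors=64)
    small_palette.save("photo_64.gif")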
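
Here is a toy version of search-by-color, again my own illustration rather than any system Lesk describes: each image is summarized by a normalized RGB histogram, and candidates are ranked by histogram intersection with the query image.

    import numpy as np
    from PIL import Image

    def color_histogram(path, bins=8):
        """3-D RGB histogram, normalized so images of any size compare fairly."""
        rgb = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
        hist, _ = np.histogramdd(rgb, bins=(bins, bins, bins), range=[(0, 256)] * 3)
        return hist.ravel() / hist.sum()

    def search_by_color(query_path, collection_paths, top_k=5):
        """Rank a collection by histogram intersection with the query image."""
        q = color_histogram(query_path)
        scored = [(np.minimum(q, color_histogram(p)).sum(), p)
                  for p in collection_paths]
        return sorted(scored, reverse=True)[:top_k]
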
Video
  • storage is a challenge; TV footage should be compressed
  • video footage can be accessed more easily if it is divided into sections and labeled with sample keyframe images (see the sketch after this list)
  • searching is difficult; closed-caption tracks aid searches of television footage
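
A rough sketch of the sectioning idea (mine, not the chapter's method): frames whose color histograms differ sharply from the previous frame are treated as the start of a new section, and that frame can serve as the section's sample image. The threshold is an assumption.

    import numpy as np

    def shot_boundaries(frames, threshold=0.4, bins=16):
        """Given frames as (H, W, 3) uint8 arrays, return the indices where
        a new section appears to begin, judged by histogram change."""
        def hist(frame):
            h, _ = np.histogram(frame, bins=bins, range=(0, 256))
            return h / h.sum()
        cuts, prev = [0], hist(frames[0])
        for i, frame in enumerate(frames[1:], start=1):
            cur = hist(frame)
            # L1 distance between normalized histograms lies in [0, 2].
            if np.abs(cur - prev).sum() > threshold:
                cuts.append(i)        # frames[i] can label this section
            prev = cur
        return cuts
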
Hawking's "Web Search Engines"

Because search engines do not have the time or computing power to examine every page on the Web for every query, they use several techniques to surface the most relevant pages.
  • crawling algorithms: "seed" URLs point to groups of reputable sites; duplicate content is identified and ignored; priority queues give preference to sites that change often, have many incoming links, and are clicked often (a toy frontier follows this list)
  • spam: webmasters' insertion of false keywords is no longer effective, but artificial linking and cloaking (showing the crawler content that is never shown to the user) still are; crawlers keep blacklists of spamming URLs
  • indexing: search engines build term dictionaries, the vocabulary of an inverted index; indexing work is spread across many machines; precomputed lists for common queries speed processing; anchor text is generally reliable and assists in indexing (a toy inverted index follows)
  • ranking: attributes such as URL brevity improve a page's score
  • caching: the search engine stores the results of popular queries (a small LRU cache sketch follows)
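
A toy crawl frontier in the spirit of the priority-queue bullet; the scoring weights are invented for illustration and are not from Hawking's article.

    import heapq

    class CrawlFrontier:
        """Priority queue of URLs; lower score = crawl sooner."""
        def __init__(self):
            self._heap, self._counter = [], 0

        def add(self, url, change_rate, inlinks, clicks):
            # Negate so frequently changing, well-linked, often-clicked
            # pages float to the top. Weights are illustrative only.
            score = -(3.0 * change_rate + 2.0 * inlinks + 1.0 * clicks)
            self._counter += 1        # tie-breaker keeps ordering well-defined
            heapq.heappush(self._heap, (score, self._counter, url))

        def next_url(self):
            return heapq.heappop(self._heap)[2]

    frontier = CrawlFrontier()
    for seed in ("https://example.org", "https://example.com"):
        frontier.add(seed, change_rate=1.0, inlinks=10, clicks=5)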
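
A toy inverted index for the indexing bullet: a dictionary maps each term to the documents containing it, and anchor text pointing at a page is indexed as though it appeared on that page. This is my simplification, not Hawking's architecture.

    from collections import defaultdict

    index = defaultdict(set)          # term -> set of document ids

    def add_document(doc_id, body, anchor_texts=()):
        """Index a page's own words plus anchor text that points to it."""
        for term in body.lower().split():
            index[term].add(doc_id)
        for anchor in anchor_texts:   # anchor text is usually reliable
            for term in anchor.lower().split():
                index[term].add(doc_id)

    def search(query):
        """Conjunctive query: documents containing every term."""
        terms = query.lower().split()
        return set.intersection(*(index[t] for t in terms)) if terms else set()

    add_document("d1", "digital libraries store audio and video")
    add_document("d2", "web search engines crawl and index pages",
                 anchor_texts=["search engine survey"])
    print(search("search engines"))   # {'d2'}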
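
And a small sketch of result caching: an LRU cache serves popular queries from memory and recomputes only on a miss. run_query stands in for the engine's real evaluation path.

    from collections import OrderedDict

    class QueryCache:
        """Keep the results of the most recently used queries in memory."""
        def __init__(self, capacity=1000):
            self.capacity, self._cache = capacity, OrderedDict()

        def get(self, query, run_query):
            if query in self._cache:
                self._cache.move_to_end(query)   # mark as recently used
                return self._cache[query]
            results = run_query(query)           # miss: evaluate for real
            self._cache[query] = results
            if len(self._cache) > self.capacity:
                self._cache.popitem(last=False)  # evict least recently used
            return results
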
I was surprised that this article mentions "the" as a valid search term for Web search engines; it is usually an invalid or ignored term in library catalog queries.

Henzinger's "Challenges in Web Search Engines"

This article suggests several ways to improve search engine performance:
  • avoiding spam, such as link farms (densely interlinked groups of pages created solely to inflate link-based rankings), doorway pages (pages designed to rank well for particular queries and funnel visitors onward) and cloaking
  • identifying page quality by tracking user clicks
  • detecting when web conventions (such as descriptive anchor text, relevant hyperlinks and accurate META tags) are being violated
  • avoiding duplicate content and duplicate hosts by paying attention to DNS semantics and recognizing generally equivalent host names (such as a site's .co.uk and .com variants); a host-normalization sketch follows this list
  • being wary of vaguely structured data: examining HTML tags to give preference to certain page sections (a tag-weighting sketch follows as well), examining layout to estimate a page's worthiness, and dismissing pages with egregious markup mistakes
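
A sketch of the duplicate-host idea (my illustration, not Henzinger's algorithm): host names that differ only by a leading "www." or an equivalent suffix are folded to one canonical key before crawling. The equivalence table is invented.

    from urllib.parse import urlparse

    # Suffixes treated as equivalent for duplicate-host detection (illustrative).
    EQUIVALENT_SUFFIXES = [".co.uk", ".com", ".org"]

    def host_key(url):
        """Reduce a URL to a canonical host so mirror hosts collide."""
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        for suffix in EQUIVALENT_SUFFIXES:
            if host.endswith(suffix):
                return host[:-len(suffix)]
        return host

    # Both reduce to "example", so only one host need be crawled.
    assert host_key("http://www.example.co.uk/page") == host_key("https://example.com/")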
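
And a sketch of tag-based section weighting for the last bullet: terms in a page's title and headings count more toward its score than body text. The weights are assumptions, and a real system would use an HTML parser rather than a regex.

    import re

    # Illustrative weights: terms in prominent tags count more.
    TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "p": 1.0}

    def weighted_term_counts(html):
        """Very rough scorer: weight each term by the tag it appears in."""
        counts = {}
        for tag, weight in TAG_WEIGHTS.items():
            for text in re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html,
                                   re.IGNORECASE | re.DOTALL):
                for term in text.lower().split():
                    counts[term] = counts.get(term, 0.0) + weight
        return counts

    page = "<title>Digital Libraries</title><p>Notes on digital libraries.</p>"
    print(weighted_term_counts(page))
    # {'digital': 6.0, 'libraries': 5.0, 'notes': 1.0, 'on': 1.0, 'libraries.': 1.0}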
