13 most recent entries:
Great technical article on Indexing for Search (Aug 26 2008 19:37 GMT)
I am doing a talk about going inside the black box of the search index for the Enterprise Search Summit in September in San Jose (more on that later). While I have a lot to say about indexes, I used the opportunity to check around and look for current research on the topic, and pretty much struck gold. Although this paper is from 2006, it is exhaustive and detailed, with both practical and theoretical information, including the finding that inverted indexes are both significantly faster to search and easier to maintain than relational database management systems, signature files and suffix arrays. It also has a thorough annotated bibliography. Best of all, Zobel and Moffat agree with me on lowercasing all words in the index and including stopwords, which they say "have an important role in phrase queries".
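To make the point about lowercasing and stopwords concrete, here is a minimal sketch of a positional inverted index. It is not taken from the Zobel and Moffat paper; the function names `build_index` and `phrase_search`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased term to {doc_id: [positions]}.

    Stopwords are deliberately kept, so phrases such as "to be" stay searchable.
    Tokenizing on whitespace is a simplification for this sketch.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return sorted doc ids in which the terms of `phrase` appear consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents that contain every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        starts = index[terms[0]][doc_id]
        # The phrase matches if term i sits at position start + i for every i.
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits

docs = {
    1: "To be or not to be",
    2: "The index should be lowercased",
    3: "not to worry",
}
index = build_index(docs)
print(phrase_search(index, "to be"))
print(phrase_search(index, "Not To"))
```

If the stopwords "to" and "be" had been dropped at indexing time, the first query would have nothing left to match, which is the practical reason for keeping them.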
Sphinx (open source free search engine): New SearchTools Report (Jul 11 2008 18:36 GMT)
Sphinx is an open source search engine, written in C++, using both SQL and custom index files to provide a very fast text search. The architecture scales to over a billion records by distributing the index and querying among multiple virtual and real processors. While it does a full text search, Sphinx is designed to work with structured content (music lyrics, products), and semi-structured content (RSS feeds, blog posts, magazine articles). Sphinx is much faster and more flexible than the internal SQL functions such as where, order by, and group by. This structure allows it to display results with faceted metadata, for example in the widepress.
x-robots-tag (Jul 10 2008 20:00 GMT)
In the Robots Exclusion Protocol June 08 Agreement, the leading webwide search engines announced that they would recognize a new element in the HTTP header, the X-Robots-Tag. Google started using it first, then Yahoo, and now Microsoft Live Search is supporting it. When a browser or robot sends a request to the web server for a URL, part of the response is the invisible HTTP header, including information about the file type, encoding, and date modified. This information is generated by the web server. The new X-Robots-Tag, within the HTTP response header, can contain the same values as the Robots META tags:
Webwide search robots now indexing Flash (and filling in forms) (Jul 03 2008 00:04 GMT)
The SWF (Flash) file format has been open for a while, and a lot of search engines have used the format to get at some of the static text in the Flash files. However, Flash is now an interactive web site application builder, and there is a lot of text that just does not exist until someone comes along and clicks. This has meant that people who wanted their sites properly indexed by webwide search engines could not use Flash, or would have to go to extra lengths to provide static text for search engine robots to find. What Adobe and Google have just announced is that Adobe is making a special version of the Flash code that can approximate a human interacting with the Flash application in the SWF file, triggering as many application states as it can. As far as I can tell, the Flash client within the indexing robot will be clicking every possible button and entering text in text fields.
Search usability research findings (Jun 27 2008 17:55 GMT)
Whitney Quesenbery and her colleagues convey the findings of a long study about how search is used at the UK's Open University. She gave a talk at the Enterprise Search Summit, and presented more formally at the Usability Professionals' Association conference, in June 2008. The study included search log analysis, heuristic reviews, remote and local usability testing on the search user experience, over the course of several years, and they are linked from Whitney's valuable Search Usability page. Designing for Search: Making Information Easy (PDF) covers both search and content. It recommends focusing improvements first on the most frequent terms, the short head of search popularity.
The Short Head and Long Tail of Search (Jun 26 2008 23:07 GMT)
I've just posted an article on the Long Tail, Short Head and Search. Every site, intranet and enterprise search log I've analyzed fits the model of the Long Tail, with a very few very popular search terms, then tailing off very quickly to unique queries (the Long Tail), creating a Zipf curve. The Short Head -- the few most frequently used search terms -- is the best place to start in analyzing search engine usage. My article also gives some suggestions for taking the information and using it to improve a search engine.
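As a rough illustration of finding the Short Head in a search log, the sketch below counts normalized query strings and reports the cumulative share of traffic covered by the top few terms. The function name `short_head`, the normalization (trim and lowercase), and the sample queries are assumptions for this example, not the method from the article.

```python
from collections import Counter

def short_head(queries, top_n=3):
    """Return (term, count, cumulative share of all queries) for the top_n terms.

    Queries are trimmed and lowercased before counting; empty queries are skipped.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    total = sum(counts.values())
    running = 0
    rows = []
    for term, n in counts.most_common(top_n):
        running += n
        rows.append((term, n, running / total))
    return rows

# Hypothetical log entries, invented for illustration.
log = ["library", "Library", "fees", "library ", "fees",
       "exam dates", "parking", ""]
for term, n, share in short_head(log, top_n=2):
    print(f"{term}: {n} queries, {share:.0%} of traffic so far")
```

On a real log, the cumulative share column shows how few terms need attention before most searches are covered.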
HCI/IR workshop (Jun 24 2008 00:03 GMT)
HCIR 2008: Workshop on Human-Computer Interaction and Information Retrieval. Making the connection between interface and search, this workshop is focused this year on complex search tasks. The 2007 Workshop presentations ranged from visual text analysis to online consumer choice. This year's workshop will be 23 October, 2008, in Redmond, Washington, USA.
article on the new Robots Exclusion Protocol (Jun 15 2008 22:22 GMT)
My article is up on InfoToday: New Robots Exclusion Protocol Agreement Among Yahoo!, Google, and Microsoft Live Search. Nothing earthshaking, just a summary from a library point of view, and a quote from Danny Sullivan saying that it's an important first step.
More Information on the new Robots Exclusion Protocol (Jun 13 2008 21:18 GMT)
Search indexing robot writers and web publishers should definitely look at the new extensions to the REP, as there are useful additions to both robots.txt directives and Robots META tags. Most of these features have been supported by the big three search engines (Google, Yahoo, MSN Live), but it's nice to have that formalized, and other search robots can take advantage of the new functionality. The new X-Robots-Tag (added to the HTTP header for non-HTML files) is a good way to send the meta information, but requires automated extensions to the servers. For example, if content is available in both HTML and PDF formats, it's easy to send NOINDEX values for all PDFs, directing search engines away from the printable format and towards the browser-readable format.
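The PDF example could be sketched on the server side roughly as follows. This is a minimal illustration, not any particular web server's configuration; the function name `robots_headers` and the `noindex_extensions` parameter are hypothetical, and a real deployment would more likely set the header in the server's own configuration.

```python
import posixpath

def robots_headers(url_path, noindex_extensions=(".pdf",)):
    """Return extra HTTP response headers for the given URL path.

    Files whose extension is listed get an X-Robots-Tag of "noindex";
    everything else gets no extra header.
    """
    ext = posixpath.splitext(url_path)[1].lower()
    if ext in noindex_extensions:
        return {"X-Robots-Tag": "noindex"}
    return {}

print(robots_headers("/reports/annual.pdf"))
print(robots_headers("/reports/annual.html"))
```

The returned dictionary would be merged into whatever headers the application already sends for the response.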
New Robot Exclusion Protocol! (Jun 04 2008 00:47 GMT)
Supported by webwide search engines Yahoo, Google and Microsoft, this adds directives to robots.txt: "Allow" directives, wildcards in URLs, and Sitemap location. There are also HTML meta tags and document properties directives for NOSNIPPET, NOARCHIVE, and NOODP (don't use ODP information for this page). Yahoo has a nice long blog entry on this, as does Google. Great news for web developers, who've been waiting for this for a very long time.
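For robot writers, here is a minimal sketch of how "Allow" directives and URL wildcards might be evaluated. It assumes that `*` matches any run of characters, that a trailing `$` anchors the end of the URL, and that the longest matching pattern wins; that precedence rule is an assumption of this sketch, since engines may differ. The function names and the sample rules are hypothetical.

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional trailing '$')."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and join the pieces with ".*" where "*" appeared.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(rules, path):
    """Decide whether `path` may be fetched under (directive, pattern) rules.

    Assumption: the longest matching pattern wins, and Allow wins a tie.
    A path that matches no rule is allowed.
    """
    best = None  # (pattern length, allowed?)
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            candidate = (len(pattern), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

# Hypothetical rules for one user-agent group.
rules = [
    ("Disallow", "/private/"),
    ("Allow", "/private/public-*.html"),
    ("Disallow", "/*.pdf$"),
]
for path in ["/private/notes.html", "/private/public-faq.html", "/docs/guide.pdf"]:
    print(path, is_allowed(rules, path))
```

Python's standard `urllib.robotparser` handles plain prefix rules; the wildcard handling here is written by hand for the sake of the example.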
Yahoo vs. Google interactive geosearch (Jun 03 2008 22:00 GMT)
I wanted to find the trendy-but-good little shop I stopped by yesterday: it's not really a cafe or a coffeehouse or a restaurant. They sell sandwiches and savory chicken pie and strawberry shortcake, with cartons of strawberries stacked high in the front. I knew where it was, so I compared Yahoo Maps and Google Maps to learn a little about geosearch. Yahoo Maps knew where I live, so it started there, and then I just scrolled and zoomed until I found the corner of 51st and Telegraph.
off to ESS (May 15 2008 22:16 GMT)
I will be leaving for the Enterprise Search Summit tomorrow (taking my family to New York for a little adventure). I'll be teaching a workshop, Enterprise Search 101, on Monday the 19th, and starting the Search Analytics track on Wednesday the 21st. If you see me, please introduce yourself; I'd love to meet people who read this blog.
A First Taxonomy for "Search Log Junk" (May 06 2008 17:57 GMT)
Search logs contain a lot of weird things, and some of them can have a significant effect on search log analysis. Having looked at tens of thousands of lines of search log entries, I offer this first attempt at defining some of the weirdest and least useful kinds of log entry, which I call "Search Log Junk". Here are the types of junk that I've seen most frequently: Empty Queries: Queries without any query text or usable parameters. These can appear when people think the " |