Thursday 1 February 2007, 3:34 PM
The problem with PageRank (and one alternative)
A search double whammy yesterday: the launch of the latest Google Mini search appliance in the morning, followed by a meeting with Ultraknowledge, a start-up you won't have heard of, who provide the search technology behind ZDNet UK.
I have still to get to the bottom of the algorithm that powers the Google Mini. Obviously PageRank (Google's famous algorithm for ranking results according to reputation) doesn't work for corporate documents. Given that these documents - we're talking the likes of Word, Excel and PowerPoint files - don't link to each other, no reputation can be inferred for a document. Nor is PageRank the right solution for HTML-based internal corporate content such as intranets, simply because when you search your intranet for information, all your data should have a high reputation. Context is likely to be a much better indicator of what you're looking for.
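To make the point about unlinked documents concrete, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not Google's production code, and the file names, page names and damping value are assumptions made up for illustration. When no document links to any other, every document ends up with exactly the same score, so the ranking tells you nothing.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Pages with no outgoing links share their rank evenly across all pages.
        dangling = sum(rank[p] for p in pages if not links[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            for target in links[p]:
                new_rank[target] += damping * rank[p] / len(links[p])
        rank = new_rank
    return rank

# Hypothetical file share: office documents that never link to each other.
file_share = {"budget.xls": [], "pitch.ppt": [], "policy.doc": []}
# Hypothetical small web: two pages link to "home", one links to "about".
small_web = {"home": ["about"], "about": ["home"], "news": ["home"]}

flat = pagerank(file_share)  # every document gets the same score
web = pagerank(small_web)    # "home" outranks "about", which outranks "news"
```

On the linked example the scores separate as you would expect; on the file share they stay flat, which is the problem described above.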
So how does the Google Mini work? The explanation I got from Google European enterprise director Roberto Solimene was too high level to be convincing - that the appliance searches by keywords and gives weight to those it finds in the title, or in bold or italics for instance. We have reviewed a Google Mini search appliance in the past and found it perfectly serviceable, but we didn't delve into the intricacies of the search algorithm. I am promised a more in-depth explanation, and I'm sure that with the army of PhDs Google has at its disposal, there is something much, much smarter at the heart of its enterprise search than what I have managed to convey here.
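For what it's worth, the high-level description I was given could be sketched roughly as below. This is only my reading of that explanation, not how the Google Mini actually works: the weights, field names and sample documents are all invented for illustration.

```python
# Hypothetical weights: all I was told is that matches in the title,
# or in bold or italics, count for more than matches in plain text.
FIELD_WEIGHTS = {"title": 3.0, "bold": 2.0, "italic": 1.5, "body": 1.0}

def score(document, query):
    """Add up the field weight for every occurrence of each query word."""
    terms = query.lower().split()
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = document.get(field, "").lower().split()
        total += weight * sum(words.count(term) for term in terms)
    return total

# Made-up documents, already split into the fields a crawler might extract.
docs = {
    "expenses.doc": {"title": "Expenses policy", "body": "How to claim travel costs"},
    "travel.doc": {"title": "Booking guide", "bold": "travel", "body": "Book travel early"},
}

ranked = sorted(docs, key=lambda name: score(docs[name], "travel"), reverse=True)
```

Here travel.doc comes first because it mentions the keyword in bold as well as in the body text.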
While I wait to discover just how much smarter Google's enterprise search algorithm is, here's a thought.
Ultraknowledge is a small tech startup run by Ken Taylor, a man with a deep interest in all things cognitive. His take on Google - and established knowledge management firms such as Autonomy - is that they try to anticipate what people want based on the average of what most people like. The obvious folly in this approach is that when you are searching you don't necessarily want what other people think is good; you want what you want.
Furthermore, people often search because they want to know more about a subject, which implies that they don't know enough about it at the point of entering their search term. But if you don't know enough about a subject, how can you define your search criteria for it?
The approach that we're taking with Ultraknowledge relies on a statistical analysis of the words used in every article, whether HTML (a web page) or PDF (one of the 70,000+ technology white papers in our Resources section). From this the Ultraknowledge search engine has a known universe of words and phrases; it knows where they occur, how often and, perhaps most importantly, in what context.
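I can't show you Ultraknowledge's internals, but a generic positional index gives a feel for what "where, how often and in what context" might mean in practice. Everything in this sketch (the function names, the sample articles, the idea of treating neighbouring words as context) is my own assumption, not a description of their engine.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into simple alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(articles):
    """Map each word to {article_id: [positions]}: where it occurs and how often."""
    index = defaultdict(lambda: defaultdict(list))
    for article_id, text in articles.items():
        for position, word in enumerate(tokenize(text)):
            index[word][article_id].append(position)
    return index

def neighbours(articles, index, word, window=2):
    """Collect the words within `window` positions of each occurrence of `word`."""
    found = []
    for article_id, positions in index.get(word, {}).items():
        tokens = tokenize(articles[article_id])
        for p in positions:
            found.append(tokens[max(0, p - window):p] + tokens[p + 1:p + 1 + window])
    return found

# Made-up articles for illustration.
articles = {
    "a1": "Google launches the Google Mini",
    "a2": "Enterprise search without PageRank",
}
index = build_index(articles)
google_context = neighbours(articles, index, "google", window=1)
```

The length of each position list gives the frequency, and the neighbouring words are a crude stand-in for context.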
You can get an idea of these relationships if you enter 'Google' into the search bar on ZDNet. Alongside the search results you'll get two extra features: related articles, and related tags. Let's look at the tag cloud first, and here's a snapshot that I took earlier (note that this will change over time as more articles are added to the site):
None of these tags were added manually, yet from its analysis the search engine knows that, for instance, "Kai Fu Lee", "Adsense", "PageRank" and "sponsored links" are terms that are closely associated with Google. "Google Mini" did not make it into the top 30 terms, so is not included in the box. Rather than saying here are words that other people associate with "Google", we're saying here are words and terms that are statistically important to what you searched for. This is about turning search into navigation: giving people more clues about what they are searching for to help them in their quest. Does it work? You tell me.
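One standard way to measure "statistically important to what you searched for" is pointwise mutual information (PMI) between terms across documents. I'm using it here purely as an illustration; I don't know which measure Ultraknowledge actually uses, and the sample articles are invented.

```python
import math
from collections import Counter

def association_scores(documents, query):
    """Score each term by its document-level PMI with `query`."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    n = len(doc_sets)
    # In how many documents does each term appear at all?
    doc_freq = Counter(term for terms in doc_sets for term in terms)
    # In how many documents does each term appear alongside the query?
    together = Counter(
        term for terms in doc_sets if query in terms for term in terms if term != query
    )
    return {
        term: math.log((count / n) / ((doc_freq[term] / n) * (doc_freq[query] / n)))
        for term, count in together.items()
    }

# Made-up articles for illustration.
documents = [
    "google pagerank ranks the web",
    "google adsense pays the site",
    "google pagerank update the index",
    "the enterprise search market",
    "the intranet search tools",
]
scores = association_scores(documents, "google")
```

A word like "the", which turns up everywhere, scores zero, while "pagerank" and "adsense", which only turn up alongside "google", score above zero. Plain PMI is known to favour rare terms, so a real system would need something more careful than this.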
Of course some people will believe that people are better than computers at creating these associations. After working with Ken on ZDNet's search, I don't think that is the case, and I'll explain why in a separate blog post. I also suspect that this will in time prove a better basis for enterprise search than reputation-based search.
The obvious disclaimer to all this is, of course, that we have a business relationship with Ultraknowledge as we pay to use their technology. We also incidentally have a business relationship with Google - AdSense ads appear on ZDNet - so they in effect pay us.
Comments on this post
i have seen changes in the new PR updates and can say that by the end of the month ishall have updated full ( i hope) i have talked about this here http://www.googlebugs.com. i have seen my results go up and sum gone down so imagine when the next PR will be if this one took 6 months to take place????
thanks
aks

