Yahoo! is launching its search developer platform, SearchMonkey (I don’t know why they named it monkey. Maybe because monkeys have some intelligence, but only enough to handle trivial tasks?!). Early responses from the blogosphere seem quite positive, so I signed up for an account and played around with it a bit. A couple of thoughts came to mind:

First, it’s a really cool concept! This will for sure significantly improve search results. Actually, search engines are already showing this type of metadata for queries about maps, stock tickers, celebrities, etc. with Google OneBox and Yahoo! Shortcuts. Now SearchMonkey is taking this concept one step further to include broader subjects and more web sites. Just imagine that some day you’ll get Google OneBox-style search results for a lot more queries.

On the technical side, however, SearchMonkey uses a DOM-based approach (XPath) to do the data extraction (a.k.a. scraping). That means SearchMonkey inherits the disadvantages that all DOM-based approaches are born with. For example, it requires a person (or monkey) to program new XPath expressions for each new site added. Even for sites you have dealt with before, you still have to keep coming back to re-program them whenever their HTML layouts change. With SearchMonkey, Yahoo! appears to be enlisting and organizing an army of people (or monkeys, against Google’s army of robots, I guess) experienced in scraping, and making their work shareable among the community. It is basically an open-source, teamwork approach.
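To make the maintenance problem concrete, here is a minimal sketch of DOM-based extraction in Python. Everything in it is an assumption made up for illustration: the two page snippets, the class names, and the XPath expression are hypothetical and do not come from SearchMonkey or any real shopping site. It only uses the limited XPath subset supported by the standard library's ElementTree, and the markup is kept well-formed so that parser accepts it.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page (well-formed so the stdlib XML parser accepts it).
OLD_LAYOUT = """
<html><body>
  <div id="product">
    <h1>Blue Widget</h1>
    <div class="price"><span>$19.99</span></div>
  </div>
</body></html>
"""

# The same hypothetical page after a redesign: the price moved into a table cell.
NEW_LAYOUT = """
<html><body>
  <div id="product">
    <h1>Blue Widget</h1>
    <table><tr><td class="cost">$19.99</td></tr></table>
  </div>
</body></html>
"""

# Site-specific XPath, written by hand against the old layout.
PRICE_XPATH = ".//div[@class='price']/span"


def extract_price(page, xpath=PRICE_XPATH):
    """Return the text of the first node matching `xpath`, or None if nothing matches."""
    node = ET.fromstring(page).find(xpath)
    return node.text if node is not None else None


print(extract_price(OLD_LAYOUT))  # the hand-written path finds the price
print(extract_price(NEW_LAYOUT))  # after the redesign the same path matches nothing
```

The price is still on the redesigned page, but the expression no longer points at it, so someone has to write a new one (here, something like `.//td[@class='cost']`) for that site.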

DOM-based approaches have been around for years, and have become people’s choice when they want some smartness in their scraping. But from our experience with the shopping vertical, DOM-based approaches simply don’t work so elegantly.

We use a fundamentally different approach. We look not only at the DOM structure of an HTML page but also at the semantics and many other things inside it. The result is that, given any product page from any shopping site, our intelligent software is able to extract its product price and image. The process is fully automated, without any involvement of people (or monkeys). That is, no XPath expressions, templates, scripts or whatever need to be programmed for any particular site.
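To give a feel for what "no per-site XPath" means, here is a deliberately naive sketch in Python. It is not our actual algorithm, which looks at far more than this; it is a toy heuristic assumed purely for illustration. It ignores where the price sits in the DOM and simply scans the page text for something that looks like a price. The function name, the regular expression, and both sample pages are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Toy assumption: a price is a currency symbol followed by digits and optional cents.
PRICE_PATTERN = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")


def guess_price(page):
    """Return the first price-looking string found anywhere in the page text, or None."""
    root = ET.fromstring(page)
    for text in root.itertext():  # walk all text, regardless of tag or nesting
        match = PRICE_PATTERN.search(text)
        if match:
            return match.group(0)
    return None


# Two hypothetical pages with completely different layouts.
page_a = "<html><body><div class='price'><span>$19.99</span></div></body></html>"
page_b = "<html><body><table><tr><td>Now only $19.99!</td></tr></table></body></html>"

print(guess_price(page_a))
print(guess_price(page_b))
```

The same function handles both layouts without any site-specific path. A real page would of course defeat a heuristic this simple (shipping costs, crossed-out list prices, related products), which is exactly why looking at the DOM or at text patterns alone is not enough.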

Sounds impossible?! That’s most people’s response when they first hear about this. In fact, even Wikipedia currently says this is undoable (maybe I should try to get that page revised). Well, maybe not any more. Try it now and see it working live for yourself! Simply enter any product page URL from any shopping site and your email address, and we’ll scrape its product image and price in real time for you. And there are no monkeys working behind the scenes as you send in your request.

With Yahoo! SearchMonkey, on the other hand, you’ll need to deal with XML, XSLT, XPath, etc., which may rule out many would-be SearchMonkeys who don’t understand these things (including me).

So we hope to be the monkey for you in the shopping vertical so that you don’t have to. We do think that our technology will be an interesting complement to the SearchMonkey platform. In fact, we’d be happy to wrap our product scraping function as a SearchMonkey Data Service. What do you guys think?! Anyway, I’m going to the SearchMonkey Launch Party on May 15. I’d be happy to chat about this. Any ideas on how our technology can be used are welcome.

P.S.: Our scraping algorithm is already working with most shopping sites, although we are still fine-tuning it. We know it’s not perfect yet, but we are confident that we are heading in the right direction, and that we will get there soon.

Update: I had a chance to meet Amit Kumar, Yahoo! Director and product manager of SearchMonkey, at the Launch Party. He let me know that SearchMonkey indeed has another Web Service interface besides the DOM-based approach I previously mentioned. This makes wrapping our product scraping function as a SearchMonkey Data Service possible (and not difficult). So we’ll get started right away. Thanks, Amit!