Monday, March 10, 2008

Open Source Web Crawlers

Lucene is a very popular open source information retrieval (IR) library for building and searching indexes. I was about to post a simple example showing the classes and general architecture you would need to build a spider that crawls the web using Lucene, but then it hit me that there are already dozens of open source web crawlers written on top of dozens of open source information retrieval engines.

Sure, I could throw out some example code, but why bother implementing your own when there are so many packages out there that can do it for you? Even if they don't do exactly what you want, it would still be easier to modify the source of an existing project than to build your own on a Lucene foundation. Tragically, I didn't realize this before I made my own for PolVox. Still, if anyone wants a tutorial/example, just leave me a comment about it and I might throw one out there...
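
In the meantime, the general architecture is small enough to sketch: a frontier queue of URLs to visit, a parser that pulls links and text out of each page, and an index that maps terms to pages. What follows is only a rough illustration under loud assumptions. It is Python rather than Java, `TinyIndex` is a hypothetical in-memory stand-in for Lucene (not the Lucene API), and `crawl`, `LinkAndTextParser`, and the `example.test` pages are all made up so the sketch runs without touching the network.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
import re


class LinkAndTextParser(HTMLParser):
    """Collects href targets and text content from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.text_parts.append(data)


class TinyIndex:
    """Hypothetical in-memory inverted index: term -> set of URLs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, text):
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(url)

    def search(self, term):
        return sorted(self.postings.get(term.lower(), set()))


def crawl(seed, fetch, index, max_pages=100):
    """Breadth-first crawl from seed; fetch(url) returns HTML or None."""
    frontier = deque([seed])
    seen = {seed}
    crawled = []
    while frontier and len(crawled) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        index.add(url, " ".join(parser.text_parts))
        crawled.append(url)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link = urldefrag(urljoin(url, href)).url
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled


# Made-up two-page "web" so the sketch runs without network access.
PAGES = {
    "http://example.test/": '<a href="/about">About</a> Lucene crawler notes',
    "http://example.test/about": '<a href="/">Home</a> About this spider',
}

index = TinyIndex()
crawled = crawl("http://example.test/", PAGES.get, index)
```

Passing `fetch` in as a function keeps the crawl loop separate from the network code, so a real HTTP fetcher could be swapped in later. A real spider would also need to honor robots.txt, rate-limit its requests, and handle errors and non-HTML content, which is a big part of why reusing an existing crawler is the easier path.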

2 comments:

Harry Chen said...

Aperture is a content extraction framework that you might want to check out. It's designed to work with Semantic Web technologies.

Here is a tutorial

burtonator said...

Why even run one at all?

You could use Spinn3r:

http://spinn3r.com

We have research/educational licensing as well. We're about 1/5th the cost of running your own crawler.

I'm just about to blog about the fact that we're up to about a dozen PhDs using Spinn3r :)

Onward!