Tuesday, July 03, 2007

Google Desktop for Linux vs. Beagle

Google recently released Google Desktop for Linux. I have been using Beagle ever since it was added to Fedora Core, and I am currently running Fedora Core 6. So I decided to try the Google Desktop beta and compare search results between the two, to see whether one was any better than the other.

So I installed Google Desktop from their RPM for Fedora and set up my indexing preferences the same way I had for Beagle, so the comparison would be fair to both sides. You can see the settings in the following image:

Most of the settings are the defaults, but I added /var, /opt, /etc and /tmp to the indexed locations, because I like to be able to search log files written by syslog, configuration files, and so on. I am also indexing all file types and web history, with the only exception being https content.

This pretty much mirrors my Beagle preferences, as you can see below:


So, after setting the preferences, I watched Google Desktop go to work indexing my file systems. What was interesting is that it took a very long time: over two days for the first indexing pass. Granted, I have a lot of files on my laptop, so this is understandable. Beagle seemed to index my files a lot faster, but I don't have a specific time to compare against, because there is no way to monitor Beagle's indexing progress (at least not that I know of). That brings us to comparing search results.

With Beagle, I have been frustrated at times that it couldn't find files that I knew were there, but couldn't remember where I had saved them. Isn't that what desktop search is all about? In fact, as a result of trying to find a Portable Document Format (PDF) document that I had saved from the web, I opened a Bugzilla case thinking that Beagle was not indexing PDFs. It turned out that Beagle was indexing the PDFs, but Beagle only indexes based on a file's metadata, not its entire contents. That explains why it couldn't find the file I was looking for: the search phrase I was using didn't match the file's metadata, only part of its content. So, I had the perfect test case to see whether Google Desktop could find what Beagle couldn't.

I searched for the term "Small is Beautiful", which is part of the subtitle of a document produced by Familiar Metric Management about software development productivity as it relates to team size. As you can see from the image below, this search phrase returns nothing in Beagle.


So I did the same search with Google Desktop, and you can see the results below. Unfortunately, I couldn't find a way to capture a screenshot of the interface without losing the results at the bottom, so I did the search from the browser interface instead.

As you can see from my cursor highlight, Google Desktop found the file I was looking for without any problem. This illustrates the major difference between Google Desktop and Beagle. Beagle gains indexing speed by indexing only the metadata of documents, while Google Desktop builds a full index of the content, which takes much longer but gives much better results. I prefer the better results. There is one other difference between the two that I would like to point out.
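To make the metadata-versus-content distinction concrete, here is a toy sketch in Python. It is not how Beagle or Google Desktop is actually implemented; the two documents, their metadata strings, and the function names are all made up for illustration. It only shows why a phrase that appears in a document's body cannot be found by an index built from metadata alone.

```python
from collections import defaultdict

# Hypothetical documents: a little metadata plus some body text.
DOCS = {
    "team_size.pdf": {
        "metadata": "Familiar Metric Management team size productivity",
        "content": "Small is beautiful when it comes to software teams",
    },
    "notes.txt": {
        "metadata": "notes meeting",
        "content": "Agenda for the weekly status meeting",
    },
}


def build_index(docs, fields):
    """Map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for name, doc in docs.items():
        for field in fields:
            for word in doc[field].lower().split():
                index[word].add(name)
    return index


def search(index, phrase):
    """Return the documents that contain every word of the phrase."""
    words = phrase.lower().split()
    if not words:
        return set()
    hits = [index.get(word, set()) for word in words]
    return set.intersection(*hits)


# One index sees only metadata, the other sees metadata and content.
metadata_index = build_index(DOCS, ["metadata"])
full_index = build_index(DOCS, ["metadata", "content"])

print(search(metadata_index, "Small is Beautiful"))  # no matches
print(search(full_index, "Small is Beautiful"))      # finds team_size.pdf
```

The full index has to read and tokenize every document body, which is roughly why a full-content pass would be expected to take longer than a metadata-only one.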

While backing up my laptop, I noticed that the backup of my home directory was taking longer and longer and getting very large. When I looked into it, it turned out that a large percentage of my home directory was the Beagle index. That led me to check how large the Google Desktop index was in comparison. Well, there is no comparison. The Google Desktop index is much, much smaller (see below).


In fact, it's 94% smaller than the Beagle index! This is a huge difference, and it certainly pays off in disk usage.
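If you want to repeat this kind of size comparison yourself, a rough Python sketch follows. The function names are my own, and the index directories are passed in as arguments rather than hard-coded, since the locations depend on your setup. The "percent smaller" figure is simply (large - small) / large.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path, skipping symlinks."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue
            try:
                total += os.path.getsize(full)
            except OSError:
                # File vanished or is unreadable; ignore it.
                pass
    return total


def percent_smaller(small, large):
    """How much smaller `small` is than `large`, as a percentage."""
    if large <= 0:
        raise ValueError("large must be positive")
    return 100.0 * (large - small) / large


def compare_indexes(small_index_dir, large_index_dir):
    """Print how much smaller the first index directory is than the second."""
    small = dir_size_bytes(os.path.expanduser(small_index_dir))
    large = dir_size_bytes(os.path.expanduser(large_index_dir))
    print(f"{small_index_dir}: {small} bytes")
    print(f"{large_index_dir}: {large} bytes")
    print(f"{percent_smaller(small, large):.0f}% smaller")
```

For example, calling `compare_indexes(google_index_dir, "~/.beagle")` with `google_index_dir` set to wherever Google Desktop keeps its index on your machine would print both sizes and the percentage.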

In conclusion, I really liked Beagle, but Google Desktop offers better search results, with considerably less disk usage for the index. At this point, I'm ready to turn off Beagle (maybe even uninstall it), and rely on Google Desktop instead.


2 comments:

joeshaw said...

With Beagle, I have been frustrated at times that it couldn't find files that I knew were there, but couldn't remember where I had saved them. Isn't that what desktop search is all about? In fact, as a result of trying to find a Portable Document Format (PDF) document that I had saved from the web, I opened a Bugzilla case thinking that Beagle was not indexing PDFs. It turned out that Beagle was indexing the PDFs, but Beagle only indexes based on a file's metadata, not its entire contents. That explains why it couldn't find the file I was looking for: the search phrase I was using didn't match the file's metadata, only part of its content. So, I had the perfect test case to see whether Google Desktop could find what Beagle couldn't.

Can you reference the Beagle bug for this? Because it's not quite right; both Beagle and GDL use the same tools (pdftotext) to extract text from PDF files and index them.

If you're having issues indexing with Beagle, please email me and I'll try to help you track it down.

Joe Shaw
joe@ximian.com

joeshaw said...

Also, looking at the file sizes, I greatly suspect that you are hitting a bug in Beagle. For a 20 gig home directory, 11 gigs in ~/.beagle is too big. The index should be around 10% of the size of the indexed data, so I would expect something in the 1-2 gig region.

Again, feel free to email me and we can try to work it out if you're interested. :)

Joe