Whenever I speak at conferences, someone always asks about the technology that drives blekko. Here’s a peek into the scale and the technology behind our search engine.
A whole-web search engine needs to crawl and index billions of webpages. It’s hard to say exactly how many webpages exist, because most of the pages on the web are spam, generated automatically by web spammers. Our current cluster of 1,500 servers can crawl and index 4 billion webpages. We use over 10,000 hard disk drives, each similar to the largest drive you might buy for a desktop computer, and 3,000 flash (solid state) drives, similar to the drive you might find in a new laptop. The total amount of text in these 4 billion webpages is one petabyte, which is about 500 billion times the size of this blog posting.
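If you divide those round numbers through, you get a feel for what each server in the cluster handles. This back-of-envelope sketch uses only the figures quoted above (all approximate):

```python
# Back-of-envelope figures from the post. All inputs are the post's
# own round numbers; nothing here is a measured value.
PAGES = 4_000_000_000          # webpages crawled and indexed
NODES = 1_500                  # servers in the cluster
HDDS = 10_000                  # spinning hard disk drives
SSDS = 3_000                   # flash (solid state) drives
TOTAL_TEXT_BYTES = 10**15      # one petabyte of page text

bytes_per_page = TOTAL_TEXT_BYTES / PAGES   # ~250 KB of text per page
pages_per_node = PAGES / NODES              # ~2.7 million pages per server
hdds_per_node = HDDS / NODES                # ~6-7 spinning disks per server
ssds_per_node = SSDS / NODES                # ~2 SSDs per server

print(f"{bytes_per_page/1000:.0f} KB text/page, "
      f"{pages_per_node/1e6:.1f}M pages/node, "
      f"{hdds_per_node:.1f} HDDs/node, {ssds_per_node:.0f} SSDs/node")
```

So each server is responsible for a few million pages spread across a handful of disks, which is the scale the rest of this post is describing.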
We store all of the data in a special database system of a type known as NoSQL. It has some unusual features that make developing a search engine easier than usual. If you’d like to learn more about our database and how it makes writing a search engine easier, there are two blog postings over at highscalability.com which describe it in detail:
One unusual aspect of our system is that we don’t use RAID to make our disks more reliable. You can read about how we get reliability, with more consistent performance and faster rebuilds, in this blog posting:
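The general idea behind RAID-free reliability is to keep whole copies of each chunk of data on several different disks, so that when one disk dies, every surviving disk can help rebuild it in parallel rather than a single replacement disk absorbing the whole rebuild. Here is a toy sketch of that technique using rendezvous (highest-random-weight) hashing for placement. The replica count and hashing scheme are illustrative assumptions, not blekko’s actual design:

```python
import hashlib

# Toy sketch of RAID-free reliability via whole-chunk replication.
# REPLICAS and the placement scheme are illustrative assumptions.
DISKS = [f"disk{i}" for i in range(20)]
REPLICAS = 3

def replica_disks(chunk_id, disks=DISKS, r=REPLICAS):
    """Pick r distinct disks for a chunk by rendezvous hashing:
    rank every disk by hash(chunk_id, disk) and take the top r."""
    ranked = sorted(
        disks,
        key=lambda d: hashlib.sha256(f"{chunk_id}:{d}".encode()).hexdigest(),
    )
    return ranked[:r]

def rebuild_sources(chunk_id, dead_disk):
    """After a disk dies, the chunk's surviving replicas are the
    rebuild sources -- spread across many disks, not just one."""
    return [d for d in replica_disks(chunk_id) if d != dead_disk]
```

Because placement is deterministic, any node can compute where a chunk’s replicas live without consulting a central table, and losing one disk only shifts that disk’s chunks, not the whole layout.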
We hope that you’ve enjoyed this peek into the technologies that underlie a modern search engine!