Blekko has integrated the up-to-date crawl information that powers our search engine index directly into our SEO product offerings. Not only is the data more comprehensive, but the real-time updates to our SEO pages are significantly improved.
When it comes to pages crawled, the sweet spot for blekko is a little more than 4 billion pages. To keep our crawl fresh, we update at least 100 million pages each day. As soon as our crawler, Scoutjet, crawls a webpage, users have access to information about it through blekko’s SEO product. We want to enable people to see the Internet the way a search engine sees it, especially what the rest of the Internet is saying about a URL.
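As a rough sanity check on those figures (our own back-of-envelope arithmetic, not an official blekko metric):

```python
# Back-of-envelope freshness check using the figures above. This assumes
# a uniform refresh rate purely for illustration; in practice recrawl
# priority is rank-based, so popular pages are far fresher than average.
index_size = 4_000_000_000    # a little more than 4 billion pages crawled
daily_updates = 100_000_000   # at least 100 million pages updated per day

avg_cycle_days = index_size / daily_updates
print(avg_cycle_days)  # 40.0 -- every page revisited within ~40 days on average
```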
Scoutjet updates the top-ranked starting pages on the Internet about every hour, while other high-quality pages are checked at least every week. The continuous updates to blekko’s SEO data include page content, metadata, duplicate text, and inbound link counts. Staying up to date is as much about forgetting the old as finding the new, so we eliminate inbound links that are no longer live and duplicate content that is no longer available.
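A tiered recrawl policy like the one described above can be sketched as follows. The tier names, intervals, and function are hypothetical illustrations of the idea, not blekko internals:

```python
from datetime import datetime, timedelta

# Hypothetical revisit intervals per page tier, following the policy
# described in the post: top-ranked starting pages roughly hourly,
# other high-quality pages at least weekly.
REVISIT = {
    "top": timedelta(hours=1),
    "high_quality": timedelta(days=7),
}

def is_due(tier, last_crawled, now):
    """Return True if a page in `tier` is due for a recrawl."""
    return now - last_crawled >= REVISIT[tier]

now = datetime(2012, 1, 1, 12, 0)
# A top page last crawled two hours ago is overdue:
print(is_due("top", datetime(2012, 1, 1, 10, 0), now))            # True
# A high-quality page crawled two days ago is still fresh:
print(is_due("high_quality", datetime(2011, 12, 30, 12, 0), now)) # False
```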
Since our traffic continues to grow rapidly, we are bringing more machines into service to keep our site humming. While we were upgrading our site to handle more traffic, we decided to leverage our highly customizable NoSQL database to make real-time access to our crawl publicly available. Our “combinator” abstraction proved critical in quickly making the right tradeoffs between crawl throughput and user request latency.
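The core idea behind a combinator is that writers emit small deltas which the store merges associatively, instead of doing a read-modify-write per update; pending deltas can be coalesced in any order before hitting disk, which is how throughput and latency can be traded off. Here is a minimal sketch of that merge pattern for tallying inbound link counts; the class and method names are ours, not blekko's API:

```python
class CountCombinator:
    """Hypothetical count combinator: a value that knows how to merge
    with another of its kind, so writes never require a read first."""

    def __init__(self, count=0):
        self.count = count

    def combine(self, other):
        # The merge is associative and commutative, so deltas queued by
        # different crawler shards can be coalesced in any order.
        return CountCombinator(self.count + other.count)

# e.g. inbound-link increments for one URL arriving from three shards:
pending = [CountCombinator(3), CountCombinator(1), CountCombinator(5)]
total = CountCombinator()
for delta in pending:
    total = total.combine(delta)
print(total.count)  # 9
```

Because no write depends on the current stored value, the store is free to batch merges aggressively when crawl throughput matters, or apply them promptly when user request latency matters.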
We hope you will be pleased with the new and improved performance for web search and SEO data!