We aggregate data from a variety of sources (crawling, data dumps, RSS feeds, and in some cases manual curation) and then integrate it into our data pipeline. Update frequency follows a power-law distribution: the top 1% of best-selling products (based on our internal ranking system) are updated hourly, the next 3% every two hours, and so on. The whole index is refreshed at the end of each month.
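
As a rough illustration of the tiered schedule described above, the following sketch maps a product's rank to a refresh interval. It is not our production code: the names `TIERS`, `FALLBACK`, and `refresh_interval` are hypothetical, only the first two tiers (top 1% hourly, next 3% every two hours) come from the text, and products below those tiers are simply assumed here to wait for the monthly full refresh, since the intermediate tiers are not listed.

```python
from datetime import timedelta

# (cumulative percent of the ranking, refresh interval).
# Only these two tiers are stated in the text; any further tiers
# would be appended here and are deliberately left out.
TIERS = [
    (1, timedelta(hours=1)),  # top 1% of best-selling products
    (4, timedelta(hours=2)),  # next 3% (cumulative 4%)
]

# Assumption: anything below the listed tiers waits for the
# end-of-month full refresh, approximated as 30 days.
FALLBACK = timedelta(days=30)


def refresh_interval(rank: int, total: int) -> timedelta:
    """Return the refresh interval for the product at 1-based `rank`
    out of `total` ranked products."""
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    for cumulative_percent, interval in TIERS:
        # Integer comparison avoids floating-point edge cases at tier boundaries.
        if rank * 100 <= cumulative_percent * total:
            return interval
    return FALLBACK
```

For example, with 1,000 ranked products, `refresh_interval(5, 1000)` falls in the hourly tier and `refresh_interval(25, 1000)` in the two-hour tier. Keeping the tiers in a plain list makes it easy to add the remaining steps of the schedule without touching the lookup logic.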