2. Google limits the number of pages that you can request within a certain time. Running this example once with these settings will be fine, but if you run it several times it will eventually stop working. I believe Google allows each person to make 100 requests per hour before showing a CAPTCHA, but I'm not sure about that number. If this happens, press the 'log in' button at the bottom of the scan tab. You should see the CAPTCHA, and once you complete it you should be able to continue. If this keeps being a problem, adapt this tutorial for another search engine that doesn't have this limit.
We're using WebScraper for Mac, which has some limits in the unregistered version.
1. The crawl
Your starting URL looks like this: http://www.google.com/search?q=ten+best+cakes. Set 'Crawl maximum' to '1 click from home', because in this example we can reach the first ten pages of search results within one click of the starting URL (see Pagination below).
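If you want to see how a starting URL like this is put together, here is a minimal Python sketch. It is not part of WebScraper. The function name search_url is made up for illustration, and the step of 10 results per page for the start parameter is an assumption about how the search engine paginates.

```python
from urllib.parse import urlencode

BASE = "http://www.google.com/search"

def search_url(query, start=0):
    """Build a search URL; 'start' is the offset of the first result shown."""
    params = {"q": query}
    if start:
        # Assumption: later result pages are addressed by a start= offset.
        params["start"] = start
    return BASE + "?" + urlencode(params)

# The starting URL used in this tutorial.
home = search_url("ten best cakes")

# The next two result pages, assuming ten results per page.
next_pages = [search_url("ten best cakes", start=n) for n in (10, 20)]
```

Here urlencode turns the spaces in the query into + signs, which is why the URL above reads ten+best+cakes.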
We want to follow the links at the bottom of the page that lead to further pages of search results, and nothing else. (These settings are about the crawl, not about collecting data.) So we set up a rule that says "ignore urls that don't contain &start=" (see above).
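The effect of that rule can be sketched in a few lines of Python. This only illustrates the logic and is not how WebScraper implements it. The function name should_follow and the sample links are invented for the example.

```python
def should_follow(url, required="&start="):
    """Mimic the rule: ignore any URL that doesn't contain the required text."""
    return required in url

# Hypothetical links found on the first page of results.
links = [
    "http://www.google.com/search?q=ten+best+cakes&start=10",
    "http://www.google.com/search?q=ten+best+cakes&start=20",
    "http://www.example.com/cakes",
]

# Only the pagination links survive, so the crawl stays on the results pages.
followed = [u for u in links if should_follow(u)]
```

The link to example.com is dropped, which is what we want: the crawl visits only the results pages, and the result links themselves are collected later by the output column.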
3. Output file
Add a column and choose Regex. The expression is <div class="r"><a href="(.*?)"
The part in brackets captures the url of each result.
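To check what this expression picks up, you can try it outside WebScraper. Below is a small Python sketch using the same pattern. The sample HTML is invented for the example and is far simpler than a real results page, whose markup may differ or change.

```python
import re

# Same expression as in the Regex column; (.*?) lazily captures the href value.
LINK_PATTERN = re.compile(r'<div class="r"><a href="(.*?)"')

# Made-up fragment standing in for a page of search results.
sample_html = (
    '<div class="r"><a href="http://example.com/cakes-one" ping="x">One</a></div>'
    '<div class="r"><a href="http://example.org/cakes-two">Two</a></div>'
)

# findall returns the captured group for every match on the page.
urls = LINK_PATTERN.findall(sample_html)
```

The ? after .* matters: without it the match would run on to the last quote mark on the page instead of stopping at the end of the first url.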
4. Separating our results
WebScraper is designed to crawl a site and extract a piece of data from each page, so each new row in the output file represents one page from the crawl. Here we want to collect several pieces of data from each page, which means the scraped urls from each search results page will appear in a single cell (separated by a special character). So we ask WebScraper to split these onto separate rows, which it does when it exports.
Press the >Go button, and you'll see the results fill up in the Results tab. As mentioned, each row is one page of the crawl, so you'll need to export to split the results onto separate rows.
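The split that happens on export amounts to something like the following Python sketch. The separator character used here (a pipe) is an assumption, since the actual special character isn't named above, and split_rows and the sample data are made up for illustration.

```python
SEPARATOR = "|"  # assumption: stand-in for WebScraper's special character

def split_rows(rows, separator=SEPARATOR):
    """Turn one row per crawled page into one row per scraped url."""
    out = []
    for page_url, cell in rows:
        for value in cell.split(separator):
            value = value.strip()
            if value:  # skip empty pieces, e.g. from a trailing separator
                out.append((page_url, value))
    return out

# Hypothetical crawl results: (page crawled, cell holding all urls found on it).
crawl_rows = [
    ("http://www.google.com/search?q=ten+best+cakes",
     "http://a.example/1|http://b.example/2"),
    ("http://www.google.com/search?q=ten+best+cakes&start=10",
     "http://c.example/3"),
]

flat = split_rows(crawl_rows)
```

Two crawled pages become three rows, one for each scraped url, with the page it came from kept alongside it.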