Friday 18 January 2019

Scraping Yelp for phone numbers of all plumbers in California (or whatever in wherever)

I've written tutorials similar to this one before, but I made the screenshots and wrote this one this morning to help someone, and wanted to preserve the information here.

We're using the WebScraper app. The procedure below works in demo mode, but the number of results will be limited.

First we need a starting url. Perform your search on Yelp, then click through to the next page of results and note how the url changes as you go. In this case it goes:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=40
etc.

(I added &start=0 to the first one. That part isn't there when you first arrive at the results, but including it avoids some duplication, because the default page and the &start=0 page are the same page.)

So our starting url should be:
https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=0
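
To make the pagination pattern concrete, here is a minimal Python sketch that builds the same urls. It is separate from WebScraper, which does the crawling for you. The page size of 20 is inferred from the start=0, 20, 40 sequence above, and the function name results_page_url is made up for this illustration.

```python
from urllib.parse import urlencode

BASE = "https://www.yelp.com/search"
PAGE_SIZE = 20  # assumption: inferred from the start=0, 20, 40 sequence


def results_page_url(description, location, page):
    """Build the url for a zero-based results page number."""
    query = urlencode({
        "find_desc": description,
        "find_loc": location,
        "ns": 1,
        "start": page * PAGE_SIZE,
    })
    return f"{BASE}?{query}"


# Print the first three results pages for the search used in this post.
for page in range(3):
    print(results_page_url("Plumbers", "California", page))
```

Swapping "Plumbers" and "California" for other values gives the starting url for the "whatever in wherever" case.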

To crawl through the results pages, we limit our crawl to urls that match the pattern we've observed above. We can do that by telling the crawler to ignore any urls that don't contain:
?find_desc=Plumbers&find_loc=California&ns=1&start=
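
The rule is a plain substring test. The sketch below shows the idea in Python; it is not how WebScraper is implemented internally, and the function name and the list of example links are invented for illustration.

```python
# The substring taken from the results-page urls above.
REQUIRED = "?find_desc=Plumbers&find_loc=California&ns=1&start="


def should_crawl(url):
    """Return True if the url looks like one of our paginated results pages."""
    return REQUIRED in url


# Hypothetical links a crawler might find on a results page.
found_links = [
    "https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20",
    "https://www.yelp.com/search?find_desc=Electricians&find_loc=California&ns=1&start=0",
    "https://www.yelp.com/login",
]

to_crawl = [url for url in found_links if should_crawl(url)]
print(to_crawl)
```

Only the first link survives: the second is a different search and the third isn't a results page at all.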

In this case we're also going to follow the links in the results, so that we can scrape the information we want from the businesses' pages. This is set up on the 'Output filter' tab. Check 'Filter output' and enter these rules:
URL contains /biz/
and URL contains ?osq=Plumbers
(The phone numbers and business names are right there on the results pages, so we could grab them from there, but for this exercise we're clicking through to each business page and grabbing the info there. It has advantages.)
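
Because the two output rules are joined with 'and', a page only goes into the output if its url contains both strings. Here is a small Python sketch of that logic; the business urls are invented examples, and the function name is hypothetical rather than anything from WebScraper.

```python
# Both rules must match for a page to be written to the output.
OUTPUT_RULES = ["/biz/", "?osq=Plumbers"]


def passes_output_filter(url, rules=OUTPUT_RULES):
    """Return True only if every rule string appears in the url."""
    return all(rule in url for rule in rules)


# Hypothetical urls: a business page reached from our search, the same
# page reached some other way, and a results page.
examples = [
    "https://www.yelp.com/biz/example-plumbing-los-angeles?osq=Plumbers",
    "https://www.yelp.com/biz/example-plumbing-los-angeles",
    "https://www.yelp.com/search?find_desc=Plumbers&find_loc=California&ns=1&start=20",
]

for url in examples:
    print(passes_output_filter(url), url)
```

Only the first url passes, so the results pages themselves are crawled but never end up as rows in the output file.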

Finally we need to set up the columns in our output file and specify what information we want to grab. On the business page, the name of the business is in the h1 tag, so we can simply select that. The phone number is helpfully in a div called 'biz-phone', so that's easy to set up too.
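
For anyone curious what those two selections amount to, here is a rough Python sketch using only the standard library. It assumes 'biz-phone' is the element's class attribute, and the sample HTML is invented for illustration; Yelp's real markup may differ and may have changed since this was written.

```python
from html.parser import HTMLParser


class BusinessPageParser(HTMLParser):
    """Collect the text of the first h1 and the first 'biz-phone' element."""

    def __init__(self):
        super().__init__()
        self.name = ""
        self.phone = ""
        self._field = None  # attribute currently being filled, if any
        self._tag = None    # tag name that opened the capture
        self._depth = 0     # nesting depth of that tag name

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1  # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1" and not self.name:
            self._field, self._tag, self._depth = "name", tag, 1
        elif "biz-phone" in classes and not self.phone:  # assumed class name
            self._field, self._tag, self._depth = "phone", tag, 1

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Capture finished: trim surrounding whitespace.
                setattr(self, self._field, getattr(self, self._field).strip())
                self._field = None

    def handle_data(self, data):
        if self._field is not None:
            setattr(self, self._field, getattr(self, self._field) + data)


# Invented sample markup standing in for a business page.
sample_html = """
<html><body>
<h1>Example Plumbing Co.</h1>
<div class="biz-phone">
    (555) 010-0199
</div>
</body></html>
"""

parser = BusinessPageParser()
parser.feed(sample_html)
print(parser.name, "|", parser.phone)
```

In WebScraper itself none of this is needed; you just pick the h1 and the 'biz-phone' element when setting up the columns.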

Then we run the crawl by pressing the Go button. In an unlicensed copy of the WebScraper app, you should see 10 results. Once licensed, the app should crawl through the pagination and collect all (in this case) 200+ results.

Limitations

I was able to get all of the results (matching those available while using the browser) for this particular category. For some others I noticed that Yelp didn't seem to want to serve more than 25 pages of results, even when the page said that there were more pages. Skipping straight to the 25th page and then clicking 'Next' resulted in a page with hints about searching.

This isn't the same as being blacklisted, which will happen when you have made too many requests in a given time. Blacklisting is obvious, because you then can't access Yelp in your browser without changing your IP address. One measure to avoid this problem is ProxyCrawl, a service you can use by getting yourself an account (free initially), switching on 'Use ProxyCrawl' in WebScraper and entering your token in Preferences.
