
Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages


That's fine if the table is used for layout, or if there's just one piece of information that you want to grab from each page, identifiable using a class or id.

But taking the whole table raises some questions: how do you map the web table to your output file? It can work if a similar table (with matching columns) can be identified on every page, so that each one can be appended and the columns line up, and if the first row is always a heading row (or marked up as th) that can be skipped on every page except perhaps the first.

It's a scenario with a lot of ifs and buts, which means that it may be one of those problems best dealt with on a case-by-case basis rather than trying to make a configurable app handle it. (If you do have this requirement, please do get in touch.)

Table from a single page


But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source from the web page and paste it into an online tool, or to copy the table itself and paste it into a spreadsheet app like Numbers or Excel, and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a Cocoa class for parsing the html and extracting the table (or tables, if there's more than one on the page). With a view to possibly building this into WebScraper to handle the 'tables on multiple pages' task, or to have it ready should the need arise to use it in a custom app for a one-off job, I've now written that parser and built a small free app around it.

That app is the imaginatively titled HTMLTabletoCSV, which is now available here.
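For anyone curious about the mechanics, here's a minimal Swift sketch of the general idea, using Foundation's XMLDocument with the .documentTidyHTML option to cope with real-world markup. This isn't the app's actual parser, and the sample table is invented:

```swift
import Foundation

// Invented sample markup; stands in for a fetched page.
let html = """
<table>
  <tr><th>Name</th><th>Price</th></tr>
  <tr><td>Victoria sponge</td><td>4.50</td></tr>
</table>
"""

// .documentTidyHTML asks Foundation to tidy the html into a
// well-formed tree before we query it with XPath.
let doc = try XMLDocument(data: Data(html.utf8), options: .documentTidyHTML)

var csvLines: [String] = []
for row in try doc.nodes(forXPath: "//table//tr") {
    let cells = try row.nodes(forXPath: "th|td").map { cell -> String in
        let text = cell.stringValue ?? ""
        // Quote any cell containing a comma or a quote, doubling quotes.
        return text.contains(",") || text.contains("\"")
            ? "\"" + text.replacingOccurrences(of: "\"", with: "\"\"") + "\""
            : text
    }
    csvLines.append(cells.joined(separator: ","))
}
print(csvLines.joined(separator: "\n"))
// Name,Price
// Victoria sponge,4.50
```

Real pages complicate this with nested tables, colspans and layout tables, which is why a sketch like this is only a starting point.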


Monday, 1 April 2019

Scraping links from Google search results

This tutorial assumes that you want to crawl the first few search results pages and collect the links from the results.

Remember that:

1. The search results that you get from WebScraper will be slightly different from the ones that you see in your browser. Google's search results are personalised. If someone else in a different place does the same search, they will get a different set of results from yours, based on their browsing history. WebScraper doesn't use cookies and so it doesn't have a personality or browsing history.

2. Google limits the number of pages that you can see within a certain time. If you use this once with these settings, that'll be fine, but if you use it several times, it'll eventually fail. I believe Google allows each person around 100 requests per hour before showing a CAPTCHA, but I'm not sure about that number. If you run this example a few times and it stops working, press the 'log in' button at the bottom of the scan tab; you should see the CAPTCHA, and if you complete it you should be able to continue. If this is a problem, adapt this tutorial for another search engine which doesn't have this limit.

We're using WebScraper for Mac which has some limits in the unregistered version.

1. The crawl
Your starting url looks like this: http://www.google.com/search?q=ten+best+cakes. Set Crawl maximum to '1 click from home' because in this example we can reach the first ten pages of search results within one click of the starting url (see Pagination below).
One important point, not ringed in the screenshot: choose 'IE11 / Windows' for your user-agent string. Google serves different code depending on the browser, and the regex below was written with the user-agent set to IE11 / Windows.

2. Pagination 
We want to follow the links at the bottom of the page for the next page of search results, and nothing else. (These settings are about the crawl, not about collecting data.) So we set up a rule that says "ignore urls that don't contain &start=" (see above).

3. Output file
Add a column, choose Regex. The expression is <div class="r"><a href="(.*?)"
You can bin the other two default columns if you want to. (I didn't bother and you'll see them in the final results at the bottom of this article.)
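As an aside, here's a minimal Swift sketch of what that expression captures, run against an invented scrap of results markup. WebScraper does all of this for you; this is just to show the mechanics:

```swift
import Foundation

// Invented stand-in for a Google results page fetched with the
// IE11 / Windows user-agent; the real markup will be messier.
let html = #"<div class="r"><a href="https://example.com/ten-best-cakes">Ten best cakes</a></div>"#

let pattern = #"<div class="r"><a href="(.*?)""#
let regex = try NSRegularExpression(pattern: pattern)
let wholeRange = NSRange(html.startIndex..., in: html)

// Capture group 1 of every match is one result link.
let links = regex.matches(in: html, range: wholeRange).compactMap { match in
    Range(match.range(at: 1), in: html).map { String(html[$0]) }
}
print(links) // ["https://example.com/ten-best-cakes"]
```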

4. Separating our results
WebScraper is designed to crawl a site and extract a piece of data from each page, so each new row in the output file represents a page from the crawl. Here we want to collect multiple pieces of data from each page, so the scraped urls from each search results page will appear in a single cell (separated by a special character). We therefore ask WebScraper to split these onto separate rows, which it does when it exports.
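A rough Swift sketch of that export-time split, with '|' standing in for the special separator character (an assumption for illustration; the app uses its own):

```swift
import Foundation

// One in-memory row: the crawled page's url, plus one cell holding
// every scraped link, joined with '|' (a stand-in separator).
let pageURL = "http://www.google.com/search?q=ten+best+cakes&start=10"
let cell = "https://site-a.example/cakes|https://site-b.example/bakes"

// At export time, split the cell so each scraped link gets its own row.
for link in cell.split(separator: "|") {
    print("\(pageURL),\(link)")
}
// http://www.google.com/search?q=ten+best+cakes&start=10,https://site-a.example/cakes
// http://www.google.com/search?q=ten+best+cakes&start=10,https://site-b.example/bakes
```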
5. Run
Press the Go button, and you'll see the results fill up in the Results tab. As mentioned, each row is one page of the crawl; you'll need to export to split the results onto separate rows.
Here's the output in Finder / Quick Look. The csv should open in your favourite spreadsheet app.

Monday, 5 November 2018

WebScraper and pagination, with case studies

If you're searching for help using WebScraper for macOS, then the chances are that the job involves pagination, because this situation presents some challenges.

Right off, I'll say that there is another approach to extracting data in cases like this from certain sites. It uses a different tool which we haven't made publicly available, but contact me if you're interested.

Here's the problem: the search results are paginated (page 1, 2, 3 etc). In this case, all of the information we want is right there on the search results pages, but it may be that you want WebScraper to follow the pagination, and then follow the links through to the actual product pages (let's call them 'detail pages') and extract the data from those.


1. We obviously want to start WebScraper at the first page of search results. It's easy to grab that url and give it to WebScraper:

2. We aren't interested in WebScraper following any links other than those pagination links (we'll come to detail pages later). In this case it's easy to 'whitelist' those pagination pages.

3. The pagination may stop after a certain number of pages. But in this case it seems to go on for ever. One way to limit our crawl is to use these options:

A more precise way to stop the crawl at a certain point in the pagination is to set up more rules:

4. At this point, running the scan proves that WebScraper will follow the search results pages we're interested in, and stop when we want.

5. In this particular case, all of the information we want is right there in the search results lists. So we can use WebScraper's class and regex helpers to set up the output columns.



Detail pages

In the example above, all of the information we want is there on the search result pages, so the job is done. But what if we have to follow the 'read more' link and then scrape the information from the detail page?

There are a few approaches to this, and a different approach that I alluded to at the start. The best way will depend on the site.

1. Two-step process

This method involves using the technique above to crawl the pagination and collect *only* the urls of the detail pages, in a single column of the output file. Then, as a separate project, use that list as your starting point (File > Open list of links) so that WebScraper scrapes data from the pages at those urls, ie your detail pages. This is a good clean method, but it does involve a little more work to run. With the two projects set up properly and saved as project files, you can open the first project, run it and export the results, then open the second project, run it and export your final results.

2. Set up the rules necessary to crawl through to the detail pages and scrape the information from only those.

Here are the rules for a recent successful project:

"?cat=259&sort=price_asc&set_page_size=12&page=" is the rule which allows us to crawl the paginated pages.
"?productid="  is the one which identifies our product page.

Notice here that the two rules appear to contradict each other. But when using 'Only follow', the rules are OR'd together. The 'ignore' rules that we used in the first case study are AND'ed, which would give no results at all if you had more than one 'ignore urls that don't contain' rule.

So here we're following pages which are search results pages, or product detail pages.  

The third rule is necessary because the product page (in this case) contains links to 'related products' which aren't part of our search but do fit our other rules. We need to ignore those; otherwise we'll end up crawling all products on the entire site.
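To make that concrete, here's a toy Swift model of the combination logic. The function and the 'related' term are illustrative stand-ins, not WebScraper's actual code:

```swift
import Foundation

// The two 'Only follow' terms from above, OR'd: a url qualifies
// if it contains at least one of them.
let onlyFollow = ["?cat=259&sort=price_asc&set_page_size=12&page=",
                  "?productid="]

// Stand-in for the third rule's term; the real rule ignores the
// 'related products' links found on the product pages.
let ignore = ["related"]

func shouldCrawl(_ url: String) -> Bool {
    let whitelisted = onlyFollow.contains { url.contains($0) }
    let blacklisted = ignore.contains { url.contains($0) }
    return whitelisted && !blacklisted
}

print(shouldCrawl("https://shop.example/list?cat=259&sort=price_asc&set_page_size=12&page=2"))
// true - a paginated search results page
print(shouldCrawl("https://shop.example/item?productid=42"))
// true - a product detail page
print(shouldCrawl("https://shop.example/item?productid=42&related=99"))
// false - a 'related products' link, caught by the third rule
print(shouldCrawl("https://shop.example/about"))
// false - matches neither 'Only follow' rule
```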

That would probably work fine, but we'd get irrelevant lines in our output because WebScraper will try to scrape data from the search results pages as well as the detail pages. This is where the Output filter comes into play.

The important one is "scrape data from pages where... URL does contain ?productid".  The other rule probably isn't needed (because we're ignoring those pages during the crawl) but I added it to be doubly sure that we don't get any data from 'related product' pages.


Whichever of those methods you try, the next thing is to set up the columns in the output file (ie what data you want to scrape). That's beyond the scope of this article, and the 'helpers' are much improved in recent WebScraper versions. There's a separate article about using regex to extract the information you want here.

Saturday, 28 April 2018

Case study - using WebScraper to compile a list of information in a useful format from a website

Here's a frustrating problem that lent itself really well to using WebScraper for Mac.

This is the exhibitor list for an event I was to attend. It's a very long list, and unless you recognise the name, the only way to see a little more information about each exhibitor is to click through and then back again.




I wanted to cast my eye down a list of names and brief summaries, to see who I might be interested in visiting. Obviously this information will be in the printed programme, but I don't get that until the day.

(NB: this walkthrough uses version 4.2.0, which is now available. The column setup table is more intuitive in 4.2, and the ability to extract h1s by class is a new feature in 4.2.)

1. Setting up the scan. It's as easy as entering the starting url (the url of the exhibitor list). There's also this very useful scan setting (under Scan > Advanced) to say that I only want to travel one click away from my starting url (there's no pagination here, it's just one long list).

There's also a "new and improved" filter for the output. Think of this as 'select where' or just 'only include data in the output table if this is true'. In this case it's easy, we only want data in the output if the page is an exhibitor detail page. Helpfully, those all contain "/exhibitors-list/exhibitor-details/" in the url, so we can set up this rule:
2. Setting up the columns for the output table. The Helper tool shows me that the name of each business is on the information page within a heading that has a class. That's handy, because I can simply choose this using the helper tool and add a column which selects that class. 

3. The summary is a little more tricky, because there's no class or id to identify it. But helpfully it is always the first paragraph (<p>) after the first heading. So we can use the regex helper to write a regular expression to extract this.

The easy way to write the expression is simply to copy a chunk of the html source containing the info you want, plus anything that identifies the beginning and end of it, and then replace the part you want with (.+?) (which means 'capture any number of any characters'). I've also replaced the heading text itself with .+? (the same, but without capturing) because that will obviously change on each page. That's all simpler than I've made it sound. I'm no regex expert (regexpert?) - you may well know more than me on the subject, and there may be better expressions for this particular job, but this works, as we can see by hitting enter to see the result.
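If you'd like to see the same mechanics outside the app, here's a minimal Swift sketch of that kind of expression, run against invented markup standing in for an exhibitor detail page (the class name and text are made up):

```swift
import Foundation

// Invented markup: a heading (whose text varies per page) followed
// by the summary paragraph we want to capture.
let html = """
<h1 class="exhibitor-name">Acme Cakes</h1>
<p>We bake award-winning sponges for every occasion.</p>
"""

// .+? skips over the variable heading text; (.+?) captures the paragraph.
let pattern = #"<h1.+?</h1>\s*<p>(.+?)</p>"#
let regex = try NSRegularExpression(pattern: pattern,
                                    options: .dotMatchesLineSeparators)

let wholeRange = NSRange(html.startIndex..., in: html)
if let match = regex.firstMatch(in: html, range: wholeRange),
   let captured = Range(match.range(at: 1), in: html) {
    print(html[captured]) // We bake award-winning sponges for every occasion.
}
```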

Here's what the column setup looks like now:

(Note that I edited the column headings by double-clicking the heading itself. That heading is mirrored in the exported output file, and acts as a field name in both csv and json.)

4. Press Go, watch the progress bar and then enjoy the results. Export to csv or another format if you like.


Friday, 30 December 2016

Full release of WebScraper

WebScraper, our utility for crawling a site and extracting data or archiving content, is now out of beta.

There have been some serious enhancements over recent months, such as the ability to 'whitelist' (only crawl) pages containing a search term, the ability to extract multiple classes / ids (as separate fields in the output file), and a class/id helper which allows you to visually choose the divs or spans for extraction.

Now that the earlier beta is about to expire, it's time to make all of this a full release. The price is a mere 5 US dollars for a licence which doesn't expire. The trial period is 30 days, and the only limitation is that the output file has a limited number of rows, so that you can still evaluate its output.

Find out more and download the app here, and if you try it and have any questions or requests, there's a support form here.

Saturday, 30 April 2016

New Project - WebScraper for OSX

[WebScraper application icon: an earth scraper]

I love starting new things. This project uses the Integrity V6 Engine for the crawling, which means that I could get right on and build the output functionality.

I noticed that this is something people have been trying to use Scrutiny's search functionality to achieve. Scrutiny will report which pages contain (or don't contain) your term in the text or the entire code. And you can export results to csv and choose columns.

But Scrutiny (currently) can't extract data from particular css classes or ids.

This is where WebScraper comes in. It quickly scans a website, and can output the data (currently) as csv or json. (Anyone want xml?) The output can include various metadata (more choices to be added) and the entire content of each page (as text, html or markdown), and it can extract parts of the pages (currently a named class or id of divs or spans).

WebScraper is new and in beta. Please use it for free, and please get in touch with any requests, bug reports or observations.

http://peacockmedia.software/mac/webscraper/




There's a short demo video here.