Saturday 28 April 2018

Case study - using WebScraper to compile a list of information in a useful format from a website

Here's a frustrating problem that lent itself really well to using WebScraper for Mac.

This is the exhibitor list for an event I was to attend. It's a very long list, and unless you recognise the name, the only way to see a little more information about each exhibitor is to click through and then back again.

I wanted to cast my eye down a list of names and brief summaries, to see who I might be interested in visiting. Obviously this information will be in the printed programme, but I don't get that until the day.

(NB: this walkthrough uses version 4.2.0, which is now available. The column setup table is more intuitive in 4.2, and the ability to extract h1s by class is new in that version.)

1. Setting up the scan. It's as easy as entering the starting url (the url of the exhibitor list). There's also this very useful scan setting (under Scan > Advanced) to say that I only want to travel one click away from my starting url (there's no pagination here, it's just one long list).

There's also a "new and improved" filter for the output. Think of this as 'select where' or just 'only include data in the output table if this is true'. In this case it's easy: we only want data in the output if the page is an exhibitor detail page. Helpfully, those all contain "/exhibitors-list/exhibitor-details/" in the url, so we can set up this rule:
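In code terms, a rule like this is just a substring test on each page's url. Here's a minimal Python sketch of the idea. The function name and the example.com urls are made up for illustration, and this is not how WebScraper implements its filter internally.

```python
# Hypothetical helper that mimics an "only include if the url contains ..." rule.
DETAIL_MARKER = "/exhibitors-list/exhibitor-details/"


def is_exhibitor_detail(url: str) -> bool:
    """Return True when the url looks like an exhibitor detail page."""
    return DETAIL_MARKER in url


# Made-up urls standing in for the pages found during a scan.
crawled = [
    "https://example.com/exhibitors-list/",
    "https://example.com/exhibitors-list/exhibitor-details/acme/",
    "https://example.com/contact/",
]

# Only pages that pass the rule would contribute a row to the output table.
kept = [url for url in crawled if is_exhibitor_detail(url)]
print(kept)
```

The list page itself is still crawled (that's where the links come from), it just doesn't produce a row in the output.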

2. Setting up the columns for the output table. The Helper tool shows me that the name of each business is on the information page within a heading that has a class. That's handy, because I can simply choose this using the Helper tool and add a column which selects that class.

3. The summary is a little more tricky, because there's no class or id to identify it. But helpfully it is always the first paragraph (<p>) after the first heading. So we can use the regex helper to write a regular expression to extract this.

The easy way to write the expression is simply to copy a chunk of the html source containing the info you want, plus anything that identifies the beginning and end of it, and then replace the part you want with (.+?) (which means 'collect one or more of any character, stopping as early as possible'). I've also replaced the heading itself with .+? (the same, but don't collect) because that will obviously change on each page. That is all simpler than I've made it sound. I'm no regex expert (regexpert?) - you may well know more than me on the subject and there may be better expressions to achieve this particular job, but this works, as we can see by hitting enter to see the result.
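For anyone who'd like to try the same idea outside the app, here's a rough Python equivalent. The html snippet, the class name and the exact expression are assumptions for illustration (the real page source isn't reproduced here), so treat it as a sketch of the technique rather than the expression used in the walkthrough.

```python
import re

# Made-up page source: a heading followed by the summary paragraph.
html = """
<h1 class="exhibitor-name">Acme Widgets</h1>
<p>Makers of fine widgets since 1999.</p>
<p>Stand B12</p>
"""

# .+? inside the heading matches the name without collecting it;
# (.+?) collects the text of the first paragraph that follows.
pattern = re.compile(r"<h1[^>]*>.+?</h1>\s*<p>(.+?)</p>", re.DOTALL)

match = pattern.search(html)
summary = match.group(1) if match else ""
print(summary)
```

Because both parts are 'lazy' (the trailing question mark), the match stops at the first closing tag it finds, so only the first paragraph is collected.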

Here's what the column setup looks like now:

(Note that I edited the column headings by double-clicking the heading itself. That heading is mirrored in the actual exported output file, and acts as a field name in both csv and json)
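To show what 'acts as a field name' means in practice, here's a small Python sketch that writes the same two rows as csv and as json. The headings "Exhibitor" and "Summary" and the row data are invented, and the exact layout of WebScraper's own export files may differ.

```python
import csv
import io
import json

# Invented column headings and rows, for illustration only.
fieldnames = ["Exhibitor", "Summary"]
rows = [
    {"Exhibitor": "Acme Widgets", "Summary": "Makers of fine widgets."},
    {"Exhibitor": "Bolt & Co", "Summary": "Fasteners of every size."},
]

# In csv the headings become the first row of the file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# In json the same headings become the key of each value.
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```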

4. Press Go, watch the progress bar and then enjoy the results. Export to csv or other format if you like.

Friday 27 April 2018

HTMLtoMD and WebScraper - a comparison converting & archiving a site as Markdown

I noticed a few people visiting the website as a result of searching for "webpage to markdown" and similar.

Our app HTMLtoMD was designed to do exactly that, and is free. But it was experimental and hasn't received very much development in a long time.

It still does its job, but our app WebScraper can also do that job, with many more options. It's a much more active product: it's selling and is under regular development now.

I thought it was worth writing a quick article to compare the output and features of the two apps.

First HTMLtoMD. It's designed for one job and so is simple to use. Just type or paste the address of your page or homepage into the address bar and press Go. If you choose the single page option, you'll immediately see the page as markdown, and can copy that markdown to paste somewhere else.
If you don't choose the single page option, the app will scan your site, and then offer to save the pages as markdown.

Here's the big difference with WebScraper. HTMLtoMD saves each page as its own markdown file:

So here's how to do the same thing with WebScraper. It has many more options (which I won't go into here). The advanced ones tend not to be visible until you need them. So here's WebScraper when it opens.
For this task, simply leave everything at defaults, enter your website name and make sure you choose "collect content as markdown" from the 'simple setup' as shown above. Then press Go.

For this first example I left the output file format set to csv. When the scan has run, the Results table shows how the csv file will look. In this example we have three field names: url, title and content. Note that the 'complex setup' allows you to choose more fields; for example, you might like to collect the meta description too.

You may have noticed in one of the screenshots above that the options for output include csv, json and txt. Unlike HTMLtoMD, WebScraper always saves everything as a single file. The csv output is shown above. The text file is also worth looking at here. Although it is a single file, it is structured. I've highlighted the field headings in the screenshot below. As I mentioned above, you can set things up so that more information (eg meta description) is captured.


It occurred to me after writing the article above that collecting images is a closely-related task. WebScraper can download and save all images while performing its scan. Here are the options; you'll need to switch to the 'complex setup' to access them:

If you're unsure about any of this, or if you can't immediately see how to do what you need to do, please do get in touch. It's really useful to know what jobs people are using our software for (confidential of course). Even if it's not possible, we may have suggestions. We have other apps that can do similar but different things, and have even written custom apps for specific jobs.

Thursday 19 April 2018

Option to download images in WebScraper

WebScraper for Mac now has the option to download and save images in a folder, as well as, or instead of, extracting data from the pages.

WebScraper crawls a whole website and it can extract information based on a class or id, or dump the content as html, plain text or markdown.

By just checking that box and choosing a download location, it can also dump all images to a local folder. That includes those found linked (the target of a regular hyperlink), those in the src of an <img> tag, and those found in a srcset.
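If you're curious what those three sources look like in a page, here's a rough Python sketch that collects them using the standard library's html parser. The class name, the list of file extensions and the sample html are all assumptions for illustration, and the srcset handling is deliberately naive (it splits on commas). It is not WebScraper's own code.

```python
from html.parser import HTMLParser

# Assumed list of extensions used to decide whether a hyperlink targets an image.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg")


class ImageCollector(HTMLParser):
    """Collect image urls from hyperlinks, src attributes and srcset attributes."""

    def __init__(self):
        super().__init__()
        self.images = []

    def _add(self, url):
        # Keep the first occurrence of each url, in document order.
        if url and url not in self.images:
            self.images.append(url)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            href = attrs.get("href") or ""
            if href.lower().endswith(IMAGE_EXTENSIONS):
                self._add(href)
        elif tag in ("img", "source"):
            if tag == "img":
                self._add(attrs.get("src"))
            # Each srcset candidate is "url descriptor"; keep only the url.
            for candidate in (attrs.get("srcset") or "").split(","):
                parts = candidate.split()
                if parts:
                    self._add(parts[0])


sample = """
<a href="/photos/full.jpg">Full size</a>
<a href="/about.html">About</a>
<img src="/photos/thumb.jpg" srcset="/photos/small.jpg 480w, /photos/large.jpg 1080w">
"""

collector = ImageCollector()
collector.feed(sample)
print(collector.images)
```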

If you only want particular images, you can select only those which match a partial url or regex.
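The two kinds of rule behave a little differently, so here's a short Python sketch of each. The function name, its parameters and the urls are hypothetical; it only illustrates the difference between a partial-url match and a regex match.

```python
import re


def select_images(urls, partial=None, pattern=None):
    """Keep urls containing `partial` and/or matching the regex `pattern`."""
    selected = []
    for url in urls:
        if partial is not None and partial not in url:
            continue
        if pattern is not None and not re.search(pattern, url):
            continue
        selected.append(url)
    return selected


# Made-up image urls.
urls = [
    "/photos/2018/stand-01.jpg",
    "/photos/2018/stand-02.png",
    "/icons/logo.png",
]

by_partial = select_images(urls, partial="/photos/")
by_regex = select_images(urls, pattern=r"stand-\d+\.jpg$")
print(by_partial)
print(by_regex)
```

A partial url is the simpler choice when the images you want share a folder; a regex is useful when the rule depends on the file name or extension.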

This functionality is in version 4.1.0 which is available now.

Monday 16 April 2018

WebScraper version 4 released

WebScraper from PeacockMedia makes it quick and easy to extract data from a website.

It has some 'simple setup' options if you simply want to archive content or collect certain meta data. For more complex tasks you can add columns to your output file which extract data based on certain class or id, or even by regular expression (regex). (A tutorial is here).

Version 4 is now released. It has been updated to use the Integrity v8 engine, which makes some efficiency improvements and allows better control if you need to limit the requests that WebScraper makes to your server:

The app has a very modest price tag of $10 and you can use it for free to test whether it works for you.

Scrutiny version 8 released

Some serious 'under the hood' improvements have been made in Scrutiny version 8. Benefits include more efficient scanning of your website and more data collected.

We've invested a lot of time in something that isn't a killer new feature, but will make things run more smoothly, efficiently and reliably, and give you more information.

There have been a number of requests to add more information about the properties of a link, eg hreflang and more of the 'rel' attributes. HTML5 allows many values in 'rel', and if the one you want doesn't have its own yes/no column in the new apps then you can view the entire rel attribute. You can switch these columns on in any of the link tables as well as the one in the link inspector.

The information displayed in Scrutiny's SEO table will include more data about the pages, notably og: and twitter card meta data which is important to a lot of users.
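For readers unfamiliar with these tags, here's a small Python sketch that pulls og: and twitter: meta data out of a page head. The class name and sample html are invented for illustration, and this says nothing about how Scrutiny itself collects the data.

```python
from html.parser import HTMLParser


class SocialMetaParser(HTMLParser):
    """Collect <meta> tags whose property or name starts with og: or twitter:."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph tags usually use property=, twitter cards usually use name=.
        key = attrs.get("property") or attrs.get("name") or ""
        if key.startswith(("og:", "twitter:")):
            self.meta[key] = attrs.get("content") or ""


sample = """<head>
<meta property="og:title" content="Spring Fair 2018">
<meta name="twitter:card" content="summary">
<meta name="description" content="Not a social tag">
</head>"""

parser = SocialMetaParser()
parser.feed(sample)
print(parser.meta)
```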

Previously, redirect information wasn't stored in full: only the start and end of a redirect chain were kept. The majority of redirects are a single step, but if there is more than one, you'll want to know all the in-between details.
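To make the 'in-between details' concrete, here's a Python sketch that follows a chain of redirects and keeps every hop. The redirects are simulated with a plain dictionary (no network requests), and the function name, the urls and the max_hops value are all assumptions for illustration.

```python
def redirect_chain(start, redirects, max_hops=10):
    """Return every url visited from `start`, following `redirects` hop by hop."""
    chain = [start]
    current = start
    while current in redirects and len(chain) <= max_hops:
        current = redirects[current]
        looped = current in chain
        chain.append(current)
        if looped:
            # Stop as soon as a url repeats, so a redirect loop can't run forever.
            break
    return chain


# Simulated redirects: each key redirects to its value.
redirects = {
    "http://example.com": "https://example.com",
    "https://example.com": "https://www.example.com/",
}

chain = redirect_chain("http://example.com", redirects)
print(" -> ".join(chain))
```

Keeping only the first and last entries of that list would hide the middle hop, which is exactly the detail you'd want when tidying up a chain.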

It's becoming increasingly necessary to apply some limits on the rate of the requests that Integrity / Scrutiny make.

Your server may keep responding no matter how hard you hit it. That's great; turn up those threads and watch Integrity / Scrutiny go through your site like a dose of salts.

If you do need to limit the scan, you'll no longer have to turn down the threads and use trial and error with the delay field. You'll be able to simply enter a maximum number of requests per minute, while still using multiple threads for efficiency.
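Here's one way that kind of limit can be combined with multiple threads, sketched in Python. This is an assumption about the general technique (evenly spaced time slots shared between threads), not a description of how Integrity or Scrutiny implement it, and the class and function names are made up.

```python
import threading
import time


class RateLimiter:
    """Allow at most `per_minute` calls per minute, shared across threads."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = clock()

    def wait(self):
        """Block until the caller's time slot arrives; return the delay used."""
        with self._lock:
            now = self._clock()
            slot = max(self._next_slot, now)
            self._next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay


def worker(limiter, urls, done):
    for url in urls:
        limiter.wait()    # every thread queues for the same shared limit
        done.append(url)  # stand-in for making the real request


# A high limit keeps this demonstration quick; a real scan might use far fewer.
limiter = RateLimiter(per_minute=6000)
done = []
threads = [
    threading.Thread(
        target=worker,
        args=(limiter, [f"/page-{t}-{i}" for i in range(5)], done),
    )
    for t in range(4)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(done), "requests made")
```

The threads still work in parallel on whatever happens after each request; only the moment at which a request may start is rationed.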

Saturday 14 April 2018

Scraping details from a web database using WebScraper

In the example below, you can see that the important data are not identified by a *unique* class or id.  This is a common scenario, so I thought I'd write this post to explain how to crawl such a site and collect this information. This tutorial uses WebScraper from PeacockMedia.
1. As there's no unique class or id identifying the data we want*, we must use a regular expression (regex).

Toggle to 'Complex setup' if you're not set to that already.

Add a column to the output file and choose Regular expression. Then use the helper button (unless you're a regex aficionado and have already written the expression). You can search the source for the part you want and test your expression as you write it. (see below). Actually writing the expression is outside the scope of this article, but one important thing to say is that if there is no capturing in your expression (round brackets) then the result of the whole expression will be included in your output. If there is capturing (as in my example below) then only the captured data is output.
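The point about capturing is easy to demonstrate in a few lines of Python. The html fragment, the expressions and the function name here are invented for illustration (they are not the ones from this tutorial), and the function only imitates the behaviour described above.

```python
import re


def regex_column(pattern, source):
    """Return the first capture if the pattern has one, else the whole match."""
    match = re.search(pattern, source, re.DOTALL)
    if match is None:
        return ""
    if match.re.groups:
        return match.group(1) or ""
    return match.group(0)


# Made-up fragment of a detail page.
source = "<td>Director:</td><td>Jane Doe</td>"

whole = regex_column(r"Director:</td><td>[^<]+", source)
captured = regex_column(r"Director:</td><td>([^<]+)", source)
print(whole)
print(captured)
```

Without the round brackets the output includes the label and tags that were only there to locate the data; with them, only the name is output.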

2. Paste the expression into the 'add column' dialog and click OK. Here's what the setup looks like after adding that first column.

3. This is a quick test; I've limited the scan to 100 links simply to test it. You can see this first field correctly going into the column. At this point I've also added another regex column for the film title, because the page title is useless here (it's the same for every detail page).

4. Add further columns for the remaining information as per step 1.

5. If you limited the links while you were testing as I did, remember to change that back to a suitably high limit (the default is 200,000 I believe) and set your starting url to the home page of the site or something suitable for crawling all of the pages you're interested in.

As always, please contact support if you need more help or would like to commission us to do the job.

* As I write this, WebScraper checks for class, id and itemprop inside <div>, <span>, <dd> and <p> tags. This is likely to be expanded in future.

Wednesday 11 April 2018

Migrating from Integrity Plus to Integrity Pro

I'm gathering all of this information together so that it's all in one place.


If you're an Integrity Plus user, the newer Integrity Pro offers:

Upgrade fee

If you're interested in upgrading from Plus to Pro then you'll only need to pay the difference between the full current price of the two apps. The Integrity Pro upgrade form is here.

Migrating your website configurations

This is very new. By the time you read this, the necessary feature should exist within the current versions of Integrity Plus and Pro.

Make sure you're on at least 8.0.8 (8.0.86) of the source application (Integrity Plus). You can then export all of your website configurations in one batch, or selected ones individually. Save the resulting file somewhere you can easily find it.

Again, make sure you're on at least 8.0.8 (8.0.86) of the destination application (Integrity Pro), then use File > Open to open the file that you just saved. Integrity Pro should do the rest and your websites should appear in the list on the left. If there are any problems, contact support.

Here are some screenshots showing Integrity Pro in action.
