Thursday, 19 April 2018

Option to download images in WebScraper

WebScraper for Mac now has the option to download and save images in a folder, as well as, or instead of, extracting data from the pages.



WebScraper crawls a whole website and it can extract information based on a class or id, or dump the content as html, plain text or markdown.

By just checking that box and choosing a download location, it can also dump all images to a local folder. That includes those found linked (the target of a regular hyperlink), those in a <img src , and those found in an srcset.

If you only want particular images, you can select only those which match a partial url or regex.

This functionality is in version 4.1.0 which is available now.

Monday, 16 April 2018

Webscraper version 4 released

Webscraper from PeacockMedia makes it quick and easy to extract data from a website.

It has some 'simple setup' options if you simply want to archive content or collect certain meta data. For more complex tasks you can add columns to your output file which extract data based on certain class or id, or even by regular expression (regex). (A tutorial is here).

version 4 is now released which has been updated to use the Integrity v8 engine. That makes some efficiency improvements, and allows better control if you need to limit the requests that WS makes to your server:

The app has a very modest price tag of $10 and you can use it for free to test whether it works for you.

Scrutiny version 8 released

Some serious 'under the hood' improvements have been made in Scrutiny version 8. Benefits include more efficient scanning of your website and more data collected.

we've invested a lot of time in something that isn't a killer new feature, but will make things run more smoothly, efficiently and reliably, and give you more information.

There have been a number of requests to add more information about the properties of a link , eg hreflang and more of the 'rel' attributes. HTML5 allows loads of allowable values in the 'rel' and if the one you want doesn't have its own yes/no column in the new apps then you can view the entire rel attribute. You can switch these columns on in any of the link tables as well as the one in the link inspector.

The information displayed in Scrutiny's SEO table will include more data about the pages, notably og: and twitter card meta data which is important to a lot of users.




Redirect information wasn't stored permanently, only the start and end of a redirect chain (the majority are a single step. But if there is more than one redirect, you'll want to know all the in-between details)

It's becoming increasingly necessary to apply some limits on the rate of the requests that Integrity / Scrutiny make.

Your server may keep responding no matter how hard you hit it. That's great; turn up those threads and watch Integrity / Scrutiny go through your site like a dose of salts.

If you do need to limit the scan, you'll no longer have to turn down the threads and use trial and error with the delay field. You'll be able to simply enter a maximum number of requests per minute, while still using multiple threads for efficiency.

Saturday, 14 April 2018

Scraping details from a web database using WebScraper

In the example below, you can see that the important data are not identified by a *unique* class or id.  This is a common scenario, so I thought I'd write this post to explain how to crawl such a site and collect this information. This tutorial uses WebScraper from PeacockMedia.
1. As there's no unique class or id identifying the data we want*  we must use a reguar expression (regex).

Add a column to the output file and choose Regular expression. Then use the helper button (unless you're a regex aficionado and have already written the expression). You can search the source for the part you want and test your expression as you write it. (see below). Actually writing the expression is outside the scope of this article, but one important thing to say is that if there is no capturing in your expression (round brackets) then the result of the whole expression will be included in your output. If there is capturing (as in my example below) then only the captured data is output.

 2. paste the expression into the 'add column' dialog and OK.   Here's what the setup looks like after adding that first column.
 3. This is a quick test, I've limited the scan to 100 links simply to test it. You can see this first field correctly going into the column. At this point I've also added another regex column for the film title, because the page title is useless here (it's the same for every detail page)
4. Add further columns for the remaining information as per step 1.

5. If you limited the links while you were testing as I did, remember to change that back to a suitably high limit (the default is 200,000 I believe) and set your starting url to the home page of the site or something suitable for crawling all of the pages you're interested in.


As always, please contact support if you need more help or would like to commission us to do the job.



*  as I write this, webscraper checks for class, id and itemprop  inside <div>, <span>, <dd> and <p> tags, this is likely to be expanded in future.

Wednesday, 11 April 2018

Migrating from Integrity Plus to Integrity Pro

I'm gathering all of this information together so that it's all in one place.

Features

If you're an Integrity Plus user, the newer Integrity Pro offers:

Upgrade fee

If you're interested in upgrading from Plus to Pro then you'll only need to pay the difference between the full current price of the two apps. The Integrity Pro upgrade form is here.

Migrating your website configurations

This is very new, by the time you read this, the necessary feature should exist within the current versions of Integrity Plus and Pro.

Making sure you're on at least 8.0.8 (8.0.86) of the source application (Integrity Plus) you can export all of your website configurations in one batch, or selected ones individually. Save the resulting file somewhere where you can easily find it.


Again making sure you're on at least 8.0.8 (8.0.86) of the destination application (Integrity Pro) use File > Open to open the file that you just saved. Integrity Pro should do the rest and your websites should appear in the list on the left.  If there are any problems, contact support.





Here are some screenshots showing Integrity Pro in action.




SaveSave

Saturday, 24 March 2018

Earlier in the year I made the business decision to sell our apps via the app store once again. It's working out OK, the store seem to have become much more user-friendly from the back end and we seem to be reaching more people with the apps.

Integrity, Integrity Plus and now the new Integrity Pro are all available at the store for those who prefer to obtain their apps that way, and for those who might not have discovered them otherwise.

Here are the links:





You'll be wondering which one suits you best. There's a handy chart here, which shows a broad outline of features and prices.



Friday, 23 March 2018

Bag a bargain! Scrutiny included in Creatable bundle

The Creatable Pick A Bundle 2018 contains 28 varied utilities, including our own Scrutiny website scrutineer. You can choose ten for $39 or all of them for $99.

If you're a user of Philips Hue or LIFX bulbs, our own Mac apps for these lighting systems are also included.

So that's Scrutiny plus nine other useful apps for less than half the regular price of Scrutiny.



Check out the bundle now
SaveSave