Thursday, 2 February 2017

Scrutiny 7 launched! 50% deal via MacUpdate

After many months in development and more in testing, Scrutiny v7 is now available.


Scrutiny builds on the link tester Integrity. As well as the crawling and link-checking functionality, it also handles:

  • SEO - loads of data about each page
  • Sitemap - generate and ftp your XML sitemap (broken into parts with a sitemap index for larger sites)
  • Spelling and grammar check
  • Site search with many parameters, including multiple search terms and 'pages that don't contain'
  • Many advanced features such as authentication, cookies and JavaScript


The main new features of version 7 are:

  • Multiple windows - have as many windows open as you like to run concurrent scans, view data, configure sites, all at once
  • New UI, including a breadcrumb widget that shows where you are and makes it easy to switch to other screens
  • Organise your sites into folders if you choose
  • Autosave saves data for every scan, giving you easy access to results for any site you've scanned
  • Better reporting - the summary report looks nicer, and the full report consists of the summary report plus all the data as CSVs
  • Many more new features and enhancements


Note that there's an upgrade path for users of v5 and v6 for a small fee ($20), and the upgrade is free for those who bought or upgraded to v6 within the last year. You can use this form for the upgrade.

Monday, 16 January 2017

How to use Webscraper to compile a list of names and numbers from a directory

First we find a web search which produces the results we're after. This screenshot shows how that's done, and the url we want to grab as the starting url for our crawl.


That url goes into the starting url box in Webscraper.

We obviously don't want the crawling engine to scan the entire site, but we do want it to follow those 'More info' links, because that's where the detail is. We notice that those links go through to a url which contains /mip/, so we can use that term to limit the scan (this is called 'whitelisting').


We also notice the pagination here. It'll be useful for Webscraper to follow those links to find further results for our search, and then follow the 'More info' links on those pages. We notice that the pagination uses "&page=" in its urls, so we can whitelist that term too, to allow the crawler access to page 2, page 3 and so on of our search results.


The whitelist field in Webscraper allows for multiple expressions, so we can add both of these terms, separated with a comma. Webscraper will then only follow urls which contain at least one of these terms.
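If it helps to see that rule written out, here's a rough sketch of the same logic in Python (the should_follow helper and the example urls are mine, purely for illustration - this isn't Webscraper's code):

```python
# Sketch of the whitelist rule described above: a url is only
# followed if it contains at least one of the whitelisted terms.
WHITELIST = ["/mip/", "&page="]

def should_follow(url: str) -> bool:
    """Return True if the crawler may follow this url."""
    return any(term in url for term in WHITELIST)

# Invented urls to show the effect:
print(should_follow("https://www.example-directory.com/mip/acme-widgets"))         # True
print(should_follow("https://www.example-directory.com/search?q=plumber&page=2"))  # True
print(should_follow("https://www.example-directory.com/about-us"))                 # False
```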

That's the setting up. If the site in question requires you to be logged in, you'll need to check 'attempt authentication' and use the button to visit the site and log in. That's dealt with in another article.

Kick off the scan with the Go button. At present, the way Webscraper works is that you perform the scan, then when it completes, you build your output file and finally save it.


If the scan appears to be covering many more pages than you expected, you can use the Live View button to see what's happening and, if necessary, stop and adjust your settings. You're very welcome to contact support if you need help.

When the scan finishes, we're presented with the output file builder. I'm after a csv containing a few columns. I always start with the page url as a reference. That allows you to view the page in question if you need to check anything. Select URL and press Add.

Here's the fun part. We need to find the information we want, and add column(s) to our output file using a class or id if possible, or maybe a regular expression. First we try the class helper.

This is the class / id helper. It displays the page, it shows a list of classes found on the page, and even highlights them as you hover over the list. Because we want to scrape information off the 'more info' pages, I've clicked through to one of those pages. (You can just click links within the browser of the class helper.)


Rather helpfully, the information I want here (the phone number) has a class "phone". I can double-click that in the table on the left to enter it into the field in the output file builder, and then press the Add button to add it to my output file. I do exactly the same to add the name of the business (class - "sales-info").
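If you prefer to think of it in code, extracting by class boils down to something like this (a Python/BeautifulSoup sketch; the html snippet is invented, and only the class names "phone" and "sales-info" come from the example above):

```python
# Illustration only: pull text out of elements by their CSS class,
# which is what the output file builder does when given a class name.
from bs4 import BeautifulSoup

html = """
<div class="sales-info">Acme Widgets Ltd</div>
<span class="phone">01234 567890</span>
"""  # invented snippet standing in for one of the 'more info' pages

soup = BeautifulSoup(html, "html.parser")
name = soup.find(class_="sales-info").get_text(strip=True)
phone = soup.find(class_="phone").get_text(strip=True)
print(name, phone)  # Acme Widgets Ltd 01234 567890
```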

For good measure I've also added the weblink to my output file. (I'm going to go into detail re web links in a different article because there are some very useful things you can do.)


So I press save and here's the output file. (I've not bothered to blur any of the data here; it's publicly available on the web.)


So that's it. How about getting a nice html file with all those weblinks as clickable links? That'll be in the next article.

Once again, you're very welcome to contact support if you need help with any of this.

Tuesday, 3 January 2017

Crawling a website that requires authentication

This is a big subject, and it gets bigger and more complicated as websites become increasingly clever at preventing non-human visitors from being able to log in.

My post How to use Scrutiny to test a website which requires authentication has been updated a number of times in its history and I've just updated it again to include a relatively recent Scrutiny feature. It's a simple trick involving a browser window within Scrutiny which allows you to log into your site. If there's a tracking cookie, that's then retained for Scrutiny's scan.
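The principle is much the same as keeping a session cookie alive between requests. As a very rough sketch of the idea (not Scrutiny's internals - the login url and form field names here are invented):

```python
# Sketch of the general idea: log in once, then let the session
# cookie ride along with every subsequent request, just as the
# cookie set in Scrutiny's browser window is retained for the scan.
import requests

session = requests.Session()

# Hypothetical form-based login; the url and field names depend on the site.
session.post("https://www.example.com/login",
             data={"username": "me@example.com", "password": "secret"})

# The session/tracking cookie set above is sent automatically here.
page = session.get("https://www.example.com/members-only/profile")
print(page.status_code)
```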

It used to be possible to simply log in using Safari - Safari's cookies seemed to be system-wide - but since Yosemite, a browser's cookies appear to be specific to that browser.

The reason for this all being on my mind today is that I've just worked the same technique into WebScraper. I wanted to compile a list of some website addresses from pages within a social networking site which is only visible to authenticated users.



Webscraper doesn't have the full authentication features of Scrutiny but I think this method will work with the majority of websites which require authentication.

(This feature, and others, will be in Webscraper 1.3, which will be available very shortly.)

Sunday, 1 January 2017

17% off Integrity Plus

We'd like to wish you a happy and prosperous New Year.



Of course, that means having the best tools, and if you're a user of website link checker Integrity, or have trialled Integrity Plus, you'll enjoy the extra features of Integrity Plus for Mac. As well as the fast and accurate link check, you can filter and search your results, manage settings for multiple sites and generate an xml sitemap.

So we're offering a 17% discount to kick off 2017 (see what we did there?). The offer expires 14 January 2017.

There's no coupon, simply buy from within the app or use this secure link:
https://pay.paddle.com/checkout/496583

Friday, 30 December 2016

Full release of Webscraper

WebScraper, our utility for crawling a site and extracting data or archiving content, is now out of beta.

There have been some serious enhancements over recent months, such as the ability to 'whitelist' (only crawl) urls containing a search term, the ability to extract multiple classes / ids (as separate fields in the output file), and a class/id helper which allows you to visually choose the divs or spans for extraction.

Now that the earlier beta is about to expire, it's time to make all of this a full release. The price is a mere 5 US dollars for a licence which doesn't expire. The trial period is 30 days, and the only limitation is that the output file is limited to a certain number of rows, so you can still evaluate the output.

Find out more and download the app here, and if you try it and have any questions or requests, there's a support form here.

Monday, 21 November 2016

Scrutiny for Mac Black Friday discount


50% off Scrutiny

For Black Friday


If you're a user of Integrity or Integrity Plus, or have trialled Scrutiny, we hope that this discount will help you to make the decision to add Scrutiny for Mac to your armoury. The offer expires 28 November 2016.

Simply use the code:  5E861DD0

Look for the 'add coupon' link during the purchase process, either in-app, or using this secure link:
https://pay.paddle.com/checkout/494001

The code will work for the next week or so. Please feel free to share the discount code, or use it to buy more licences if you have multiple users.

Monday, 31 October 2016

Webscraper from PeacockMedia - usage

I've had one or two questions about using WebScraper. There's a short demo video here, but if, like me, you prefer to cast your eye over some text and images rather than sit through a video, then here you go:

1. Type your website address (or starting url for the scan). As with Integrity and Scrutiny (Webscraper uses the same engine), the crawl will be limited to any 'directory' implied in the url.

2. Hit Go. The way this works (currently) is that the app crawls your site, and when the crawl is complete, you choose what you want to export and how.

3. When the scan is complete, the export options will open. Choose the format you want to export (currently csv or json) and which information you want to include. This can include various metadata or information extracted from the pages by span or div, class or id.

4. If the output file isn't as you expected, then you can tinker with the output options without needing to crawl again. Just use the Export button on the Main (crawl) window.
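For the curious, the two export formats boil down to the same rows laid out differently. Here's a quick sketch (plain Python with made-up rows - Webscraper builds the file for you, so this is only to show the shape of the output):

```python
# The same scraped rows written as csv and as json. The data is invented.
import csv
import json

rows = [
    {"url": "https://www.example.com/page1", "title": "Page one"},
    {"url": "https://www.example.com/page2", "title": "Page two"},
]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title"])
    writer.writeheader()
    writer.writerows(rows)

with open("output.json", "w") as f:
    json.dump(rows, f, indent=2)
```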