Saturday, 18 May 2019

Scrutiny version 9, a preview and run-down of new features

Like version 8, version 9 doesn't have any dramatic changes in the interface, so it'll be free for existing v7 or v8 licence holders. It'll remain a one-off purchase, but there may be a small increase in price for new customers or for upgraders from v6 or earlier.

But that's not to say that there aren't some important changes going on, which I'll outline here.

All of this applies to Integrity too, although the beta of Scrutiny 9 will come first.

Inspectors and warnings

The biggest UI change is in the link inspector. It puts the list of redirects (if there are any) on a tab rather than a sheet, so the information is more obvious. There is also a new 'Warnings' tab.
Traditionally in Integrity and Scrutiny, a link coloured orange means a warning, and in the past this meant only one thing: a redirect. (Some users don't want to see redirect warnings, which is fine; there's an option to switch them off.)

Now the orange warning could mean one or more of a number of things. While scanning, the engine may encounter things which aren't showstoppers but which the user might be grateful to know about, and until now there hasn't been a place for that information. In version 9 it's displayed in the Warnings tab, and the link appears orange if there's anything to warn you about (including redirects, unless you have those switched off).

Examples of things you may be warned about include more than one canonical tag on the target page, and unterminated or improperly-terminated script tags or comments (i.e. a <!-- with no matching -->), which could be interpreted as commenting out the rest of the page, though browsers usually seem to ignore the opening tag if there's no closing one and display the page. Redirect chains also appear in the warnings, and the threshold for a chain can now be set in Preferences. (Redirect chains have always been visible in the SEO results with a hard-coded threshold of 3.)
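As a rough illustration of the sort of check involved, here's a minimal sketch (not Scrutiny's actual code) of detecting an unterminated comment while parsing a page. A real engine would also need to handle comments inside scripts, CDATA and so on.

```python
def unterminated_comment_positions(html):
    """Return offsets of '<!--' openers that have no matching '-->'."""
    positions = []
    pos = 0
    while True:
        start = html.find('<!--', pos)
        if start == -1:
            break
        end = html.find('-->', start + 4)
        if end == -1:
            # Opener with no closer: this could be read as commenting
            # out the rest of the page, though browsers usually ignore it.
            positions.append(start)
            break
        pos = end + 3
    return positions
```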

The number of things Scrutiny can alert you about in the warnings tab will increase in the future.

Strictly speaking, there's an important distinction between link properties and page properties. A link has link text, a target url and other attributes such as rel="nofollow". A page is the file that your browser loads; it contains links to other pages.

A 'page inspector' has long been available in Scrutiny. It shows a large amount of information: metadata, headings, word count, the number of links on the page, the number of links *to* that page and more. A lot of this is of course visible in the SEO table.

Whether you see the link inspector or the page inspector depends on the context (for example, the SEO table is concerned with pages rather than links, so a double-click opens the page inspector). But when viewing the properties of a link, you may want to see the page properties. (You would usually be thinking about the target page here rather than the link's 'parent' page(s).) So it's now possible to see some basic information, and easily open the full page inspector, via the 'target' tab of the link inspector.

[update 26 May] The page inspector has always had fields for the number of inbound links and outbound links. Now it has sortable tables showing the inbound and outbound links for the page being inspected:


Re-checking

This is an important function that Integrity / Scrutiny have never done very well.

You run a large scan, you fix some stuff, and want to re-check. But you only want to re-check the things that were wrong during the first scan, not spend hours running another full scan.

Re-checking functionality has been limited. The interface was changed quite recently to make it clear that you'll "Recheck this url", i.e. the url that the app found during the first scan.

Half of the time, possibly most of the time, you'll have 'fixed' the link by editing its target, or removing it from the page entirely.

The only way to handle this is to re-check the page(s) that your fixed link appears on. This apparently simple-sounding thing is by far the most complex and difficult of the v9 changes.

Version 9 still has the simple 're-check this url' available from the link inspector and various context menus (and it can be used after a multiple selection). It also now has 'recheck the page this link appears on' (which can also be used with a multiple selection).
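The idea behind 'recheck the page this link appears on' can be sketched like this. Assuming a hypothetical mapping from each link url to the set of pages it was found on (which a crawler would build during the scan), re-crawling just those parent pages picks up links that have been edited or removed entirely:

```python
def pages_to_recheck(parents_of_link, fixed_links):
    # For each link url that's been 'fixed', collect the pages it
    # appeared on; only those pages need to be re-crawled.
    pages = set()
    for url in fixed_links:
        pages.update(parents_of_link.get(url, ()))
    return pages
```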


Reporting

This is another important area that has had less than its fair share of development in the past. Starting to offer services ourselves has prompted some improvements here.

Over time, simple functionality is superseded by better options. "On finish save bad links / SEO as csv" are no longer needed and have now gone, because "On finish save report" does those things and more. Having them all together led to confusion, particularly for those users who like to switch everything on with abandon: checking all the boxes could lead to the same csvs being saved multiple times, and sometimes multiple save dialogs at the end of the scan.

The 'On finish' section of the settings now looks like this. If you want things saved automatically after a scheduled scan or manual 'scan with actions' then switch on 'Save report' and then choose exactly what you want included.

Reduced false positives

The remaining v9 changes are invisible. One of them is a fundamental change to the program flow. Previously, link urls were 'unencoded' for storage and display, and encoded for testing. In theory this should be fine, but I've seen examples via the support desk where it isn't. In one case a redirect was in place which redirected a version of the url containing a percent-encoding, but not the identical url without the percent-encoding, and the character wasn't one that you'd usually encode. This unusual example shows that, as a matter of principle, a crawler ought to store the url exactly as found on the page and use it exactly as found when making the http request, only 'unencoding' it for display purposes.
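The pitfall is easy to demonstrate with Python's urllib.parse (the url here is a made-up example): decoding and then re-encoding a url doesn't necessarily reproduce what was published on the page, because characters like '~' are unreserved and are never re-encoded.

```python
from urllib.parse import quote, unquote

# A url as found on the page, with a tilde percent-encoded.
found = 'http://example.com/%7Euser/page.html'

decoded = unquote(found)             # http://example.com/~user/page.html
resent = quote(decoded, safe=':/')   # '~' stays bare: not the original url

# If the server's redirect only matches the %7E form, requesting
# the round-tripped url gives a false positive.
print(found == resent)  # False
```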

When you read that, it'll seem obvious that a crawler should work that way. But it's the kind of decision you make early on in an app's development, and tend to work with as time goes on rather than spending a lot of time making fundamental changes to your app and risking breaking other things that work perfectly well.

Anyhow, that work is done. It'll affect very few people, but for those people it'll reduce those false positives (or 'true negatives', whichever way you want to look at a link reported bad that is actually fine).

Version 9 will be available for beta testing very shortly, and maybe as an option for new users.

[update 23 May 2019] I take it back: possibly the hardest thing to do was to add the sftp option (v9 will allow ftp / ftp with TLS (aka ftps) / sftp) for when automatically ftp'ing the sitemap or checking for orphaned pages.

Thursday, 25 April 2019

Apple's notarization service

A big change that's been happening quietly in macOS is Apple's notarization service.

Ever since the App Store opened and became the only place to obtain software for the iPhone ('jailbreaking' excepted), I've been waiting for the sun to set on being able to download and install Mac apps from the web, which is the core of my business. (Mac App Store sales amount to a small proportion of my income. That balance is fine, because Apple take a whopping 1/3 of the selling price.)

Notarization is a step in that direction, although it still leaves developers free to distribute outside the App Store. It means that Apple examine the app for malware. At this point they can't reject your app for any reason other than the malware check. They do require the 'hardened runtime', which is a tighter security constraint, but I've not found this to restrict functionality in the way the sandboxing requirement did when the App Store opened.

When the notarization service started last year, it was optional. Now Gatekeeper gives a more favourable message when an app is notarized, and it looks as if 10.15's Gatekeeper will refuse to open apps that haven't been notarized.

It's easy to feel threatened by this and imagine a future where Apple are vetting everything in the same way they do for the app store. For users that's a great thing, it guarantees them a certain standard of quality in any app they may be interested in. As a developer it feels like a constraint on my freedom to build and publish.

It genuinely seems geared towards reducing malware on the Mac. "This is a good thing" says John Martellaro in his column.

Wednesday, 17 April 2019

New tool for analysing http response

Your web page looks great in your browser but a tool such as a link checker or SEO checker reports something unexpected.

It's an age-old story. I'm sure we've all been there a thousand times.

OK, maybe not, but it does happen. The response to an http request can be very different depending on the client sending it. Maybe the server gives a different response depending on the user-agent string or another http request header field. Maybe the page is a 'soft 404'.

What's needed is a tool that allows you to make a custom http request and view exactly what's coming back; the status, the response header fields and the content. And maybe to experiment with different request settings.
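For a feel of what such a tool does, here's a minimal sketch using Python's standard library (the user-agent string is just an illustrative example, not anything the tool prescribes):

```python
import urllib.request

def build_request(url, user_agent='Mozilla/5.0 (Windows NT 10.0) like Gecko'):
    # Build a request with a custom user-agent; servers sometimes
    # vary their response on this header alone.
    return urllib.request.Request(url, headers={'User-Agent': user_agent})

def inspect(req):
    # Send the request and show exactly what comes back.
    with urllib.request.urlopen(req) as resp:
        print(resp.status)              # status code (a 'soft 404' may say 200)
        print(dict(resp.getheaders()))  # response header fields
        print(resp.read()[:200])        # first bytes of the content
```

Swapping the user-agent or other header fields in and out, then comparing responses, is exactly the kind of experiment described above.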

I've been meaning to build such a tool for a long time. Incorporated into Scrutiny, or maybe a standalone app.

I've also been thinking of making some online tools. The ultimate plan is an online Integrity. The request / response tool is the ideal candidate for an online tool.

For a name, one of my unused domains seemed strangely appropriate. A few minutes with the Wacom and it has a logo.

It's early days and there are more features to add. But feel free to try it. If you see any problems, please let me know.

Monday, 1 April 2019

Scraping links from Google search results

This tutorial assumes that you want to crawl the first few search results pages and collect the links from the results.

Remember that:

1. The search results that you get from WebScraper will be slightly different from the ones that you see in your browser. Google's search results are personalised. If someone else in a different place does the same search, they will get a different set of results from you, based on their browsing history. WebScraper doesn't use cookies and so it doesn't have a personality or browsing history.

2. Google limits the number of pages that you can see within a certain time. If you use this once with these settings, that'll be fine, but if you use it several times, it'll eventually fail. I believe Google allows each person around 100 requests per hour before showing a CAPTCHA, but I'm not sure about that number. If the example stops working, press the 'log in' button at the bottom of the scan tab; you should see the CAPTCHA, and if you complete it you should be able to continue. If this is a problem, adapt this tutorial for another search engine which doesn't have this limit.

We're using WebScraper for Mac, which has some limits in the unregistered version.

1. The crawl
Your starting url looks like this:   Set Crawl maximum to '1 click from home' because in this example we can reach the first ten pages of search results within one click of the starting url (see Pagination below).
One important point is not ringed in the screenshot: choose 'IE11 / Windows' for your user-agent string. Google serves different code depending on the browser, and the regex below was written with the user-agent set to IE11 / Windows.

2. Pagination 
We want to follow the links at the bottom of the page for the next page of search results, and nothing else. (These settings are about the crawl, not about collecting data.) So we set up a rule that says "ignore urls that don't contain &start=" (see above).

3. Output file
Add a column, choose Regex. The expression is <div class="r"><a href="(.*?)"
You can bin the other two default columns if you want to. (I didn't bother and you'll see them in the final results at the bottom of this article.)
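To see how that expression behaves, you can try it in Python's re module against a hypothetical fragment of the results markup (as served to an IE11 user-agent at the time; Google's markup changes, so the pattern may need updating):

```python
import re

# The expression from step 3.
pattern = r'<div class="r"><a href="(.*?)"'

sample = ('<div class="r"><a href="https://example.com/one">'
          '<div class="r"><a href="https://example.com/two">')

print(re.findall(pattern, sample))
# ['https://example.com/one', 'https://example.com/two']
```

The non-greedy `(.*?)` stops at the first closing quote, so each match captures just the url.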

4. Separating our results
WebScraper is designed to crawl a site and extract a piece of data from each page; each new row in the output file represents a page from the crawl. Here we want to collect multiple items of data from each page, so the scraped urls from each search results page will appear in a single cell (separated by a special character). We therefore ask WebScraper to split these onto separate rows, which it does when it exports.
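The splitting step amounts to something like this sketch (the separator character here is a stand-in; WebScraper uses its own special character internally):

```python
SEPARATOR = '|'  # stand-in for WebScraper's internal separator

def split_rows(rows, column):
    # Expand any cell holding several scraped values into one
    # output row per value, as the export step does.
    out = []
    for row in rows:
        for value in row[column].split(SEPARATOR):
            expanded = dict(row)
            expanded[column] = value
            out.append(expanded)
    return out
```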
5. Run
Press the 'Go' button and you'll see the results fill up in the Results tab. As mentioned, each row is one page of the crawl; you'll need to export to split the results onto separate rows.
Here's the output in Finder / QuickLook. The csv should open in your favourite spreadsheet app.

Wednesday, 27 March 2019

C64 / 6502 IDE for Mac

This post seems to fit equally well here on the PM blog as it does on the separate blog for my hobby project. To avoid duplication I'll post it there and link to it from here.

Version 9 of Integrity and Scrutiny underway

As with version 8, the next major release is likely to look very similar on the surface.

The biggest change will be more information of the 'warning' kind.

So far it's been easy to isolate or export 'bad links'. These tend to be 4xx or 5xx server response codes, so you can already see at a glance what's wrong. You may have seen links marked with orange, which usually denotes a redirect, and you can easily filter and export those if you want.
But there are many other warnings that Integrity and Scrutiny could report.

At this point, this won't mean a full validation of the html (that's another story) but there are certain things that the crawling engine encounters as it parses the page for links. One example is the presence of two canonical tags (it happens more often than you'd think). If they contain different urls, it's unclear what Google would do. There are a number of situations like this that the engine handles, and which the user might like to know about.
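The duplicate-canonical case can be spotted with a check along these lines (a sketch, not the engine's code; it assumes the rel attribute appears before href, where a proper parser would handle any attribute order):

```python
import re

CANONICAL = re.compile(
    r'<link\b[^>]*rel=["\']canonical["\'][^>]*>', re.IGNORECASE)

def canonical_tags(html):
    # Return all <link rel="canonical"> tags found in the page.
    # More than one is worth a warning, especially if they
    # point at different urls.
    return CANONICAL.findall(html)
```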

There are other types of problem, such as an old perennial: one or more "../" ('up a directory') at the start of a relative url which technically takes the url above the server's http root. (Some users take the line "if it works in my browser, then it's fine and your software is incorrectly reporting an error", and for that reason there's a preference to make the app as tolerant of these links as browsers tend to be.)
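Python's urljoin follows RFC 3986's path resolution, which (like browsers) silently discards the excess ".." segments, and that's effectively what the tolerant option does. The urls here are made-up examples:

```python
from urllib.parse import urljoin

# One '../' too many: the relative url climbs above the server root.
base = 'http://example.com/page.html'
link = '../../images/logo.png'

print(urljoin(base, link))  # http://example.com/images/logo.png
```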

In short, there are various warnings that Integrity and Scrutiny could list. Version 9 will make use of the orange highlighting. There will be tabs within the link inspector so that it can display more information, and the warnings relating to that link (if any) will be one of these tabs.

Reporting / exporting is important to users. So alongside the link results there will be another main table (with export option) displaying a list of each warning, by link / page.

This will be the main focus, but there will be other features; better re-checking is on the list. It's early days, so if there's anything else that you'd like to see, please do tell!

Saturday, 23 March 2019

Press release - Slight change of name and full launch for Website Watchman

Press release
Mon 25 Mar 2019


Watchman from PeacockMedia is now fully released, and it's now called Website Watchman. The original name clashed with another project, and Website Watchman better sums up the app.

It'll be useful to anyone with a Mac and a website, and it does a very impressive trick: as it scans a website it checks all of the files that make up each page (html, js, css, images, linked documents) and archives any changes. So the webarchive that it builds is four-dimensional. You can browse the pages, and then browse the versions of each page back through time. You view a 'living' version of the page, not just a snapshot.

As the name suggests, you can watch a single page, part of a website or a whole website, or even part of a page, using various filters including a regex, and be alerted to changes.

Thank you for your attention and any exposure that you can give Website Watchman.