PeacockMedia: crawl

Monday, 4 June 2018

Test HTML validation and accessibility checkpoints of whole website

Didn't Scrutiny use to do this?

Yes, but when the W3C validator's 'nu' engine came online, it broke Scrutiny's ability to test every page. The 'nu' engine no longer returned the number of errors and warnings in the response header, which Scrutiny had used as a fast way to get basic stats for each page. It also stopped responding after a limited number of requests (some Scrutiny users have large websites).

Alternative solutions

After exploring some other options (notably HTML Tidy, which is installed on every Mac), we found that the W3C now offers a web service which is responding well. We haven't seen it clam up after a large number of fast requests, even when using a large number of threads.

The work in progress is called Tidiness (obviously a reference to tidy, which we've been experimenting with).

It contains a newer version of tidy than the one installed on your Mac. Tidy's HTML validation results are useful, but not as definitive as the ones from the W3C service.

So Tidiness as it stands is a bit of a hybrid. It crawls your website, passing each page to the W3C service (as a web service). If you like, you can switch to tidy for the validation, which makes things much quicker as everything is then running locally. You can also make accessibility checks at level 1, 2 or 3 at the same time, with all of the results presented together.
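To give a feel for the 'pass each page to the W3C service' step, here is a minimal Python sketch. It is not the code inside Tidiness. It assumes the public Nu checker endpoint with JSON output, and that each message in the reply has a 'type' of 'error' or 'info', with warnings marked by a 'subType' of 'warning'. The function names and the User-Agent string are made up for illustration.

```python
import json
import urllib.request

# Assumption: the public W3C Nu checker, asked for JSON output.
NU_ENDPOINT = "https://validator.w3.org/nu/?out=json"


def validate_html(html):
    """POST one page's markup to the checker and return the parsed JSON reply."""
    request = urllib.request.Request(
        NU_ENDPOINT,
        data=html.encode("utf-8"),  # supplying data makes this a POST
        headers={
            "Content-Type": "text/html; charset=utf-8",
            "User-Agent": "validation-sketch/0.1",  # hypothetical identifier
        },
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarise(report):
    """Count errors and warnings in a checker reply (basic per-page stats)."""
    counts = {"errors": 0, "warnings": 0}
    for message in report.get("messages", []):
        if message.get("type") == "error":
            counts["errors"] += 1
        elif message.get("type") == "info" and message.get("subType") == "warning":
            counts["warnings"] += 1
    return counts
```

A crawler would call validate_html() once per page and keep the summarise() result alongside the url. The counts that used to arrive in a response header now have to be derived from the message list.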

Here are some shots.

[update 3 Jan 2021] Due to lack of interest, this project is mothballed. We have since built more HTML validation tests into Scrutiny / Integrity Pro and that functionality will be expanded through 2021. If you are interested in Tidiness, please tell us.

Friday, 19 January 2018

Limiting Integrity / Scrutiny's scan to a particular section of the site or directory

This is a frequently-asked question and a subject that's been on the workbench this week.

Integrity / Integrity Plus and Scrutiny allow you to set up rules which limit the scan, or prevent checking of urls matching a pattern:

The rules dialog* will shortly look as above. It will appear as a sheet attached to its main window and will offer 'urls which contain' / 'urls which don't contain'. The 'only follow' option is gone from the first menu, leaving 'ignore', 'do not check' and 'do not follow'.

('only follow' was confusing because two such rules together don't make logical sense. 'do not follow urls that don't contain' does the same job and still makes sense if you have more than one.)

It's important to say that it shouldn't be necessary to use a rule such as 'do not follow urls that don't contain /mac/scrutiny' if you are starting at peacockmedia.software/mac/scrutiny because Integrity and Scrutiny should only follow urls at or below the 'directory' that you start at.
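For the curious, the 'at or below the directory that you start at' test can be sketched in a few lines of Python. This is only an illustration, not the apps' own code. The function names are hypothetical, and the guess that a last path segment containing a dot is a file name (rather than a directory) is an assumption.

```python
from urllib.parse import urlsplit


def scope_of(start_url):
    """Return (host, directory path) for the url the scan starts at."""
    parts = urlsplit(start_url)
    path = parts.path or "/"
    last = path.rsplit("/", 1)[-1]
    if "." in last:
        # Assumption: a last segment with a dot is a file, so drop it.
        path = path[: len(path) - len(last)]
    if not path.endswith("/"):
        path += "/"
    return parts.netloc.lower(), path


def in_scope(url, start_url):
    """True if url is on the same host, at or below the starting directory."""
    host, directory = scope_of(start_url)
    parts = urlsplit(url)
    path = (parts.path or "/") + "/"  # trailing slash so /mac/scrutiny2 is not 'below' /mac/scrutiny
    return parts.netloc.lower() == host and path.startswith(directory)
```

Starting at https://peacockmedia.software/mac/scrutiny, a page such as https://peacockmedia.software/mac/scrutiny/manual.html is in scope, while https://peacockmedia.software/mac/integrity is not.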

It's also important to note that 'not following' isn't the same as 'ignoring'. If a link appears on a page then you will see it in the link check results regardless of whether it matches your 'follow' pattern. If you only want to see urls which follow a certain pattern, use an 'ignore' rule instead.

(That last statement applies to the link check results. The SEO table in Scrutiny and Sitemap table in Integrity Plus and Scrutiny should only show pages which meet your rules and appear at or below the directory you start in.)

Important note - if you're 'ignoring urls not containing' then you probably won't be able to find broken images or linked files (unless their urls happen to match your rule's pattern). So if you have the 'images and linked files' option checked then you'll need to use a 'follow' rule rather than 'ignore'.
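The difference between the three actions is easier to see side by side. The Python sketch below is an illustration only, not how Integrity or Scrutiny are written. The Rule type and decide() function are hypothetical, matching is plain 'contains' with no special characters, and it assumes that 'do not check' means the url is listed but never requested.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    action: str     # "ignore", "do not check" or "do not follow"
    contains: bool  # True: "urls that contain"; False: "urls that don't contain"
    term: str

    def applies_to(self, url):
        return (self.term in url) == self.contains


def decide(url, rules):
    """Say whether a discovered link is listed, checked and followed."""
    decision = {"listed": True, "checked": True, "followed": True}
    for rule in rules:
        if not rule.applies_to(url):
            continue
        if rule.action == "ignore":
            # Ignored urls never reach the results at all.
            return {"listed": False, "checked": False, "followed": False}
        if rule.action == "do not check":
            # Assumption: listed in the results, but not requested.
            decision["checked"] = False
            decision["followed"] = False
        elif rule.action == "do not follow":
            # Still checked, but not crawled for further links.
            decision["followed"] = False
    return decision
```

With a rule of 'do not follow urls that don't contain /mac/scrutiny', an image at https://peacockmedia.software/images/logo.png is still listed and checked, so a broken image shows up. Swap the action to 'ignore' and the same image drops out of the results entirely.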

Protip - this has been possible for a while but not very well documented: you can use certain special characters, namely an asterisk to mean 'any number of any character' and a dollar sign at the end to mean 'appears at the end'. For example, "don't check urls that contain .dmg$" will only match urls where .dmg appears at the end of the url. And peacockmedia.software/*/mac/scrutiny will match peacockmedia.software/en/mac/scrutiny and peacockmedia.software/fr/mac/scrutiny.

Regex is not supported (yet) in these rules (though it is available in Scrutiny's site search). It's not necessary to type whole urls or to use the asterisk at the start or end of your term. By default it's a partial match, so "urls not containing /fr/mac/scrutiny" will work.
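If it helps to see the matching behaviour spelled out, here is a small Python sketch of the asterisk, the trailing dollar sign and the default partial match. It happens to use Python's regex module under the hood, which is just a convenient way to write the sketch and says nothing about how the apps do it. The function names are hypothetical and case-sensitive matching is an assumption.

```python
import re


def rule_to_regex(term):
    """Translate a rule term into a compiled pattern."""
    anchored = term.endswith("$")  # trailing dollar sign: 'appears at the end'
    if anchored:
        term = term[:-1]
    # Escape everything literally, then let each asterisk stand for
    # 'any number of any character'.
    body = ".*".join(re.escape(piece) for piece in term.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def url_contains(url, term):
    """Partial match by default: the term may appear anywhere in the url."""
    return rule_to_regex(term).search(url) is not None
```

So url_contains("https://peacockmedia.software/fr/mac/scrutiny", "peacockmedia.software/*/mac/scrutiny") is true, while a url such as https://peacockmedia.software/scrutiny.dmg?x=1 does not match .dmg$ because .dmg is no longer at the end.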

The new "urls that don't contain" functionality is in Scrutiny from v7.6.8 and should shortly be available in Integrity and Integrity Plus from v6.11.14.

*Years ago, these rules were called 'Blacklist / Whitelist Rules'. I'm mentioning that in case anyone searches for those terms.