Friday 5 February 2021

HTML validation of an entire website

Version 10 of Scrutiny and Integrity Pro contains built-in HTML validation. This means that they can make some important checks on every page as they crawl.

It's enabled by default but can be switched off. With very large sites it can be useful to switch off features that you don't need at the time, to save time or resources.


Simply scan the site as normal. When it's finished, the task selection screen contains "Warnings: HTML validation and other warnings >"
(NB Integrity Pro differs here: it doesn't have the task selection screen, but has a 'Warnings' tab in its main tabbed view.)

Warnings can be filtered, sorted and exported. If there's a type of warning that you don't need to deal with right now, you can "hide warnings like this" temporarily or until the next scan. (Right-click or ctrl-click for context menus.)


The description of the warning contains a line number and/or URL where possible.

In addition, links are coloured orange (by default) in the link-check results tables if there are warnings. Traditionally, orange meant a redirection, and it still does, but other warnings now colour a link orange too. A double-click opens the link inspector, and the warnings tab shows the reason(s) for the orange colouring. Note that while the link inspector is concerned with the link URL, many of these warnings apply to the target page of that URL.



The full list of potential warnings (to date) is at the end of this post. We're unsure whether this list will ever be as comprehensive as the W3C validator, and unsure whether it should be. At present it concentrates on common and important mistakes: the ones that have consequences.

Should you wish to run a single page through the W3C validator, that option still exists in the context menu of the SEO table. (This is the one table that lists all of your pages; the sitemap table excludes certain pages for good reasons.)



Full list of possible html validation warnings (so far):

unclosed div, p, form
extra closing div, p, form
extra closing a
p within h1/h2...h6
h1/h2...h6 within p
more than one doctype / body
no doctype / html / body /
no closing body / html
unterminated / nested link tag 
script tag left unclosed
comment left unclosed
end p with open span
block level element XXX cannot be within inline element XXX (currently limited to div/footer/header/nav/p within a/script/span, but will be expanded to recognise more elements)
'=' within unquoted src or href url
link url has mismatched or missing end quotes
image without alt text. (This is an accessibility, html validation and SEO issue. The full list of images without alt text can also be found in Scrutiny's SEO results.)
more than one canonical
more than one opening html tag
Badly nested <form> and <div>
Form element can't be nested
hanging comma at end of src list (w3 validator reports this as "empty image-candidate string")
more than one meta description is found in the head
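
To give a feel for what a few of the checks listed above involve, here is a minimal sketch in Python using the standard library's html.parser. It is not how Scrutiny or Integrity Pro implement their validation; the class and function names (PageChecker, check_page) are hypothetical, and it makes two simplifying assumptions: an empty alt attribute is treated the same as a missing one, and meta descriptions are counted anywhere in the page rather than only in the head.

```python
from html.parser import HTMLParser


class PageChecker(HTMLParser):
    """Hypothetical helper: counts canonicals and meta descriptions,
    and records images that have no alt text."""

    def __init__(self):
        super().__init__()
        self.canonicals = 0
        self.meta_descriptions = 0
        self.images_without_alt = []  # (line number, src) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        attr = dict(attrs)
        if tag == "link" and (attr.get("rel") or "").lower() == "canonical":
            self.canonicals += 1
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            # Assumption: counted anywhere in the page, not only in the head.
            self.meta_descriptions += 1
        elif tag == "img" and not (attr.get("alt") or "").strip():
            # Assumption: an empty alt is treated like a missing one.
            line, _column = self.getpos()
            self.images_without_alt.append((line, attr.get("src")))


def check_page(html):
    """Return a list of warning strings for one page of HTML."""
    checker = PageChecker()
    checker.feed(html)
    checker.close()
    warnings = []
    if checker.canonicals > 1:
        warnings.append("more than one canonical")
    if checker.meta_descriptions > 1:
        warnings.append("more than one meta description")
    for line, src in checker.images_without_alt:
        warnings.append(f"image without alt text (line {line}, src {src})")
    return warnings


if __name__ == "__main__":
    sample = (
        "<html><head>\n"
        '<link rel="canonical" href="https://example.com/a">\n'
        '<link rel="canonical" href="https://example.com/b">\n'
        "</head><body>\n"
        '<img src="x.png">\n'
        '<img src="y.png" alt="A photo">\n'
        "</body></html>"
    )
    for warning in check_page(sample):
        print(warning)
```

A real crawler would run checks like these on every page it fetches and attach the resulting warnings to the page's URL, which is roughly what the warnings table described earlier presents.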

Warnings that are not HTML validation:

Type mismatch: Type attribute in html is xxx/yyy, content-type given in server response is aaa/bbb
The server has returned 429 and asked us to retry after a delay of x seconds
a link contains an anchor which hasn't been found on the target page
The page's canonical url is disallowed by robots.txt
link url is disallowed by robots.txt
The link url is a relative link with too many '../' which technically takes the url above the root domain.
(if 'flag blacklisted' option switched on; default is off) The link url is blacklisted by a blacklist / whitelist rule. With this option on, the link is coloured red in the link views, even if warnings are totally disabled.
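
The "too many '../'" warning above is easy to miss by eye, because URL resolvers that follow RFC 3986 quietly drop the surplus segments and the link still appears to work. As an illustration only (this is not the applications' own code, and the function name climbs_above_root is hypothetical), a check along these lines could be sketched in Python as:

```python
from urllib.parse import urlsplit


def climbs_above_root(page_url, link):
    """Return True if resolving `link` against `page_url` would step
    above the root of the site at any point."""
    link_path = urlsplit(link).path
    if link_path.startswith("/"):
        # Root-relative or absolute link: start counting from the root.
        depth = 0
    else:
        # Relative link: start from the directory that holds the page.
        # Drop the last path segment (the file name) and any empty segments.
        page_dirs = [s for s in urlsplit(page_url).path.split("/")[:-1] if s]
        depth = len(page_dirs)

    for segment in link_path.split("/"):
        if segment == "..":
            depth -= 1
            if depth < 0:
                return True  # one '../' too many
        elif segment not in ("", "."):
            depth += 1
    return False


if __name__ == "__main__":
    page = "https://example.com/blog/2021/post.html"
    print(climbs_above_root(page, "../../index.html"))
    print(climbs_above_root(page, "../../../index.html"))
```

In this example the page sits two directories below the root, so the first link (two '../') is fine and the second (three '../') would be flagged.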