Monday, 25 May 2020

Testing website accessibility (WCAG / ADA compliance) using Scrutiny

No software can test your website and declare it ADA, or more specifically WCAG, compliant because some of the checks need to be made by a human or are subjective.

For example, is a heading, title or link text meaningful? Only a human can judge. But software can tell you whether headings and a title of reasonable length are present, and thus report pages of possible concern.

Having said that, there are certain very important things that automated testing does do very well, such as checking for images without alt text.

With that in mind, here is a list of the ways that Scrutiny can help. The checkpoint numbers relate to the WCAG 2.0 requirements.

Alt text (1.1.1):  "non-text content that is presented to the user has a text alternative" (level A)

  • Scrutiny can report images without alt text
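To illustrate the kind of check involved (this is not Scrutiny's own code), here is a minimal Python sketch that lists images with no alt attribute. The class and function names are made up for the example. It treats an empty alt="" as present, since that is the usual way to mark a purely decorative image.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collects the src of every <img> that has no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "img":
            attr_map = dict(attrs)
            if "alt" not in attr_map:
                self.missing.append(attr_map.get("src", "(no src)"))


def images_without_alt(html):
    """Return the src values of images lacking an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


sample = '<p><img src="logo.png" alt="Company logo"><img src="chart.png"></p>'
print(images_without_alt(sample))  # ['chart.png']
```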

Adaptable website structure (1.3.1, 1.3.2): Properly marked up and well-organised headings  (level A) and
Section headings (2.4.10) "Section headings are used to organize the content." (level AAA)

  • Scrutiny can report pages with more than one h1 tag. For a specific page, it can show you the outline (ie just the headings, indented)

  • Scrutiny's Robotize feature can display a 'text-only' view of a web page and let you browse the site, with headings and links listed separately. This is a good way to test this checkpoint.
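As a rough idea of what a heading outline involves, the following Python sketch collects the headings of a page in order and counts the h1 tags. It is only an illustration under simple assumptions (headings are properly closed and not nested); the names HeadingOutline and outline are hypothetical.

```python
from html.parser import HTMLParser


class HeadingOutline(HTMLParser):
    """Records (level, text) for each h1..h6 in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in self.HEADINGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline(html):
    parser = HeadingOutline()
    parser.feed(html)
    parser.close()
    return parser.headings


sample = "<h1>Products</h1><p>Intro</p><h2>Widgets</h2><h2>Gadgets</h2><h1>Contact</h1>"
headings = outline(sample)
for level, text in headings:
    print("  " * (level - 1) + text)   # indented outline

h1_count = sum(1 for level, _ in headings if level == 1)
print("h1 tags on this page:", h1_count)
```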

Keyboard accessible (2.1): "Make all functionality available from a keyboard."

  • If Scrutiny can crawl a website fully (particularly with the 'run js' option switched off) then the navigation links are correctly-formed hyperlinks, so it should be possible to tab through them using a keyboard and therefore navigate the site. (NB Scrutiny does not currently test / report form fields or buttons.)

Page titles (2.4.2): "Web pages have titles that describe topic or purpose." "The title of each Web page should: Identify the subject of the Web page, Make sense when read out of context, Be short"  (level A)

  • Scrutiny can report pages which have a non-unique title, and pages which have a title which is too short / too long.
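The title checks mentioned above are easy to picture in code. This Python sketch takes a mapping of URL to page title and reports non-unique titles and titles that look too short or too long. The thresholds (10 and 60 characters) and the function name are assumptions chosen for the example, not Scrutiny's settings.

```python
from collections import defaultdict


def title_report(titles, min_len=10, max_len=60):
    """titles: dict of url -> page title. Returns (duplicates, too_short, too_long)."""
    by_title = defaultdict(list)
    for url, title in titles.items():
        # Compare titles case-insensitively and ignore surrounding whitespace.
        by_title[title.strip().lower()].append(url)

    duplicates = {t: sorted(urls) for t, urls in by_title.items() if len(urls) > 1}
    too_short = sorted(u for u, t in titles.items() if len(t.strip()) < min_len)
    too_long = sorted(u for u, t in titles.items() if len(t.strip()) > max_len)
    return duplicates, too_short, too_long


pages = {
    "/a": "Home",
    "/b": "home",
    "/c": "A perfectly reasonable page title",
}
duplicates, too_short, too_long = title_report(pages)
print(duplicates)
print(too_short)
print(too_long)
```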

Link text (2.4.4): "The purpose of each link can be determined from the link text alone or from the link text together with its programmatically determined link context" (level A)

  • Scrutiny can report empty links as bad links (this option defaults to off, needs to be switched on)
  • Scrutiny makes it easier to check manually for meaningful link text.  Cast your eye down the 'link text' column of the links / flat view (or sort the table by this column) and look for link text that doesn't explain the purpose of the link.

Descriptive headings and labels (2.4.6): "Headings and labels describe topic or purpose."  "Determine if the Web page contains headings. Check that each heading identifies its section of the content." (level AA)

  • Scrutiny can report pages with no h1
  • Scrutiny can include headings (h1's h2's etc in separate columns) in the SEO report to make it easier to scan them by eye and pick out non-descriptive ones

Parsing (4.1.1): Make sure HTML code is clean and free of errors, particularly missing bracket closes. Also, make sure all HTML elements are properly nested.

  • Scrutiny provides a way to validate the html code for a specific page. If your website is template-based (ie the code for the design of the page is identical throughout the site), then validating the home page (and certain other pages if the design varies for different types of page) is a good indication of validity throughout the site.

Monday, 6 April 2020

Checking your browser's bookmarks

I had not considered this until someone recently asked about using Integrity to do the job.

Yes, in principle you can export your bookmarks from Safari or Firefox as a .html file and ask Integrity, Integrity Plus, Integrity Pro or Scrutiny to check all of the links it contains.

The only issue is that the free Integrity, and App Store versions of Integrity Plus and Integrity Pro are 'sandboxed', meaning that for security reasons, they generally only have access to local files within their own 'container'. Apple insists on this measure for apps distributed via their App Store.

For this reason, those versions of those apps will not be able to fully crawl a website stored locally (some people like to do this, although there are some advantages if you crawl via a web server, even via the Apache server included with macOS).

However, here we're only talking about parsing a single html file for links, and testing those.

A sandboxed app can access any file that you have chosen via an open or save dialog.

So all you need to do is to use File > Open to choose your bookmarks.html file rather than typing its name or dragging it to the starting url field. (Remember to select 'check this page only' to ensure that you only check the links in the bookmarks file and the app doesn't try to follow all of them.)

I have bookmarks in Safari going back many years (nearly 2,000, apparently). There are so many pages there that I'd forgotten about, and some that clearly no longer exist or have moved.
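For the curious, 'parsing a single html file for links' amounts to something like the following Python sketch. It is an illustration only, not how Integrity does it. It assumes the exported file uses ordinary anchor tags (browser exports often write them in capitals, which the parser normalises), and the names LinkCollector and bookmark_links are made up.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects http(s) link targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so <A HREF=...> works too.
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def bookmark_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = """
<DL><p>
<DT><A HREF="https://example.com/">Example</A>
<DT><A HREF="javascript:void(0)">A bookmarklet</A>
<DT><A HREF="http://example.org/old-page">Old page</A>
</DL>
"""
print(bookmark_links(sample))
```

Each collected URL could then be requested (for example with urllib.request) to see whether it still responds, which is the part the apps do for you.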

Sunday, 22 March 2020

Improvements to free sitemap visualiser

SiteViz is, as its name suggests, a tool for visualising a sitemap.

Integrity Plus, Pro and Scrutiny can export their sitemap data in the form of a .dot file, which contains the information necessary to draw a chart.
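If you haven't seen one, a .dot file is plain text in the Graphviz DOT language, describing nodes and the connections between them. The Python sketch below writes a tiny directed graph from a list of (page, linked page) pairs. It only shows the general shape of such a file; the exact contents of the files exported by the apps may well differ, and the function name and sample links are made up.

```python
def sitemap_to_dot(links, graph_name="sitemap"):
    """links: iterable of (source_url, target_url) pairs. Returns DOT text."""

    def quote(value):
        # DOT identifiers containing slashes etc. must be double-quoted.
        return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph %s {" % graph_name]
    for source, target in links:
        lines.append("    %s -> %s;" % (quote(source), quote(target)))
    lines.append("}")
    return "\n".join(lines)


links = [("/", "/about"), ("/", "/products"), ("/products", "/products/widget")]
print(sitemap_to_dot(links))
```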

Early on, Omnigraffle did a great job of displaying these .dot files but it became expensive. There were other options but nothing free and easy to use. That's where SiteViz came in and its functionality has been built into Scrutiny.

Its default chart didn't look great though. The layout was flawed, making the charts less than ideal.

Version 3 contains some vast improvements to that 'bubble tree' theme, along with some improvements to the colouring. Nodes can be coloured according to links in or 'link juice'. (Think liquid flowing through your links). Connections can be coloured according to multiple criteria too. The charts now look much more professional and are much more useful, especially the bubble tree theme. The screenshots you see on this page were made using it.

It will remain a free viewer. You can export the chart as a png image or pdf, but it can't create the .dot sitemap. For that you'll need a website crawler like Integrity Plus or Pro.

Thursday, 19 March 2020

What happens in Objective-C if you compare two NSObjects with a simple 'greater than' or 'less than'?

This is one of those odd bugs that goes unnoticed because a line of code that shouldn't work,  strangely does, at least most of the time.

The problem exhibited itself under certain circumstances. That led me to investigate and discover this line (in pseudocode):

if ([object getProperty] > [anotherObject getProperty]){

Years ago, this line was correct, because this particular object's getProperty used to return a primitive such as an int or NSInteger (I can't remember which).

But at some point getProperty was changed so that it returned an NSNumber, which is an object rather than a simple value.

The line should have been updated to (and has been now):

if ([[object getProperty] integerValue] > [[anotherObject getProperty] integerValue]){

Of course I should have searched the source for 'getProperty' and updated accordingly but this particular line escaped. It went unnoticed. The compiler didn't complain, and everything still seemed to work.

If a tree falls in a forest......    If a bad line of code appears to work under testing and no-one reports a problem, is it still a bug?

It didn't always work though. Under certain circumstances that particular thing went screwy (no crash, no exception, just odd results sometimes). But not randomly. It worked or didn't work predictably, depending on the circumstances.

I can't find confirmation of this but it seems from what I've observed that

(NSObject > NSObject) 

returns true or false depending on the order that they were created. I'm assuming that the comparison is being made using the object's address (pointer) treated as a numeric value. This makes sense because declaring NSObject *myobject  declares myobject as a pointer, which is some kind of primitive, known to contain a memory address.

A simple experiment seems to bear this out.

    NSNumber *number1 = [NSNumber numberWithInt:2];
    NSNumber *number2 = [NSNumber numberWithInt:10];
    NSLog(@"number1: %p, number2: %p", number1, number2);
    // This compares the two pointer values, not the numbers they hold
    if(number1 > number2){
        NSLog(@"number1 is greater than number2");
    } else {
        NSLog(@"number2 is greater than number1");
    }


NumberTest[89819:23978762] number1: 0x7312659ea1d74f79, number2: 0x7312659ea1d74779
NumberTest[89819:23978762] number1 is greater than number2

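It's interesting that number1, which was created first, has the higher pointer value in this example. I wouldn't read too much into that, though. As far as I understand it, on 64-bit Apple platforms small NSNumber values are usually stored as 'tagged pointers', where the value is encoded in the pointer itself rather than in separately allocated memory, so these particular values may say nothing about the order of allocation. Either way, the result of the comparison does not reflect the numeric order of 2 and 10.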

I'm obviously pleased to have found and fixed a problem in this app. But more than anything this has amused and interested me.

If you have anything to contribute on this subject, leave it in the comments.

Monday, 16 March 2020

Scraping email addresses with WebScraper

This has been a popular question from WebScraper users.

It has long been possible to use WebScraper to scrape email addresses, with two caveats: a/ it required setting up with your favourite regular expression for matching email addresses (these are easy to find online if you're not a regex demon), and b/ it only works if the email addresses appear unobfuscated in the source or visible text. It has long been popular to use some method of hiding email addresses from bots.

To help with this particular task, you can now set up WebScraper more easily (as from version 4.11.0).  'Email Addresses' now appears in the drop-down buttons for the simple and complex setups.

Here's the simple setup. It couldn't be easier; 'Email Addresses' appears in the second drop-down if you choose 'Content' in the first. Skip to the Post-processing tab  >>

Output file columns

For the complex setup, you get the URL and Title columns by default. You may like to keep those so that you can see which page each email address appears on, or delete them if you simply want a list of email addresses. Then add a column. As with the simple setup, choose Content and then Email Addresses.

Results tab

At this point (after running the scan or a test) each page is presented on a row of this table. If there are multiple email addresses on a page, they'll be listed in a single cell separated by a pipe. We'll fix that later. There may also be big gaps where pages don't contain an email address. That's also something we can fix.

During the scan, at the point where email addresses are scraped from a page, the results are uniqued. So if the same address appears multiple times on the same page (which is likely if the address appears as a link - it may be in the link and in the visible text) then it'll only appear once in that row on the Results tab.

Post-processing tab

Here's where we can tidy up the output. The first checkbox will split multiple results onto separate rows. The third checkbox will skip rows where there are no results. The drop-down button will contain a list of your output columns; choose your email address column. As the label says, these things will be done when you export your output to csv.
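The two post-processing steps described above (splitting multiple results onto separate rows and skipping rows with no results) can be sketched roughly like this. It is only an illustration in Python, not WebScraper's own code; the function name split_rows, the column names and the sample data are all made up.

```python
def split_rows(rows, column, separator="|", skip_empty=True):
    """Expand rows so that each value in `column` gets a row of its own.

    rows: list of dicts, one per scanned page.
    column: the column holding separator-joined results.
    """
    output = []
    for row in rows:
        values = [v.strip() for v in row.get(column, "").split(separator) if v.strip()]
        if not values:
            # No results on this page: keep or drop the row as requested.
            if not skip_empty:
                output.append(dict(row))
            continue
        for value in values:
            new_row = dict(row)
            new_row[column] = value
            output.append(new_row)
    return output


results = [
    {"url": "/contact", "emails": "sales@example.com|support@example.com"},
    {"url": "/about", "emails": ""},
    {"url": "/team", "emails": "info@example.com"},
]
for row in split_rows(results, "emails"):
    print(row["url"], row["emails"])
```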


The default expression used for this task matches typical email address strings. I found that it can also match certain images that have an @ symbol in their filename. If you wish to improve the regular expression, then simply change it in this field in Preferences.
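To show the general idea, here is a Python sketch of matching email addresses with a regular expression, uniquing the results for a page, and discarding matches that are really image filenames. The pattern is a common simple one chosen for this example; it is not WebScraper's default expression, and the names EMAIL_RE, IMAGE_SUFFIXES and scrape_emails are hypothetical.

```python
import re

# A simple, commonly used pattern (an assumption, not the app's default).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Matches ending like this are probably filenames such as logo@2x.png.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def scrape_emails(text):
    """Return the unique email-like strings in text, in order of appearance."""
    found = []
    for match in EMAIL_RE.findall(text):
        candidate = match.lower()
        if candidate.endswith(IMAGE_SUFFIXES):
            continue
        if candidate not in found:
            found.append(candidate)
    return found


page = (
    '<a href="mailto:info@example.com">info@example.com</a> '
    '<img src="/images/logo@2x.png">'
)
print(scrape_emails(page))  # ['info@example.com']
```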


Here's my output for this example (I chose not to include the url and title columns). Note that the same address appears a lot. At time of writing WebScraper doesn't have a 'unique' option on the post-processing tab but that's under consideration. Also note that caveat b at the top of this article still applies.

Tuesday, 10 March 2020

Changes to nofollow links : sponsored and ugc attributes : how to check your links

Google announced changes last year to the way they'd like publishers to mark nofollow links.

The rel attribute can also contain 'sponsored' or 'ugc' to indicate paid links and user-generated content. A while ago, nofollow links were not used for indexing or ranking purposes. But this is changing. Google will no longer treat them as a strict instruction to not follow or index.

This article on lists the changes and how these affect you.

As from version 9.5.6 of Integrity (including Integrity Plus and Pro) and version 9.5.7 of Scrutiny, these apps allow you to see and sort your links according to these attributes.

There was already a column in the links views for 'rel' which displayed the content of the rel attribute, and a column for 'nofollow' which displayed 'yes' or 'no' as appropriate. Now there are new columns for 'sponsored' and 'ugc' (also displaying yes/no for easy sorting). Many of the views have a 'column' selector. If you make these columns visible, they will be sortable and they'll be included in csv exports.
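The yes/no columns boil down to checking which tokens appear in a link's rel attribute, which is a space-separated list. A small Python sketch of that idea follows; it is an illustration rather than the apps' code, and the function name rel_flags is made up.

```python
def rel_flags(rel):
    """Return yes/no flags for the nofollow, sponsored and ugc rel tokens."""
    # rel is space-separated and case-insensitive; None means no rel attribute.
    tokens = set((rel or "").lower().split())
    return {
        name: ("yes" if name in tokens else "no")
        for name in ("nofollow", "sponsored", "ugc")
    }


print(rel_flags("nofollow ugc"))
print(rel_flags("noopener sponsored"))
print(rel_flags(None))
```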

Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages

That's fine if the table is for layout, or if there's just one bit of info that you want to grab from each, identifiable using a class or id.

But to take the whole table raises some questions - how do you map the web table to your output file? It may work if you can identify a similar table on all pages (matching columns) so that each one can be appended and match up, and if the first row is always headings (or marked up as th) and can be ignored, except for maybe the first one.

It's a scenario with a lot of ifs and buts, which means that it may be one of those problems that's best dealt with on a case-by-case basis rather than trying to make a configurable app handle it. (If you do have this requirement, please do get in touch.)

Table from a single page

But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source from the web page, paste it into an online tool, or copy the table from the web page and paste into a spreadsheet app like Numbers or Excel and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a Cocoa class for parsing the html and extracting the table (or tables if there is more than one on the page). I've now written that parser and built a small free app around it, with a view to possibly building it into WebScraper to allow it to do the 'tables on multiple pages' task, or having it ready should the need arise to use it in a custom app for a one-off job.

That app is the imaginatively-titled HTMLTabletoCSV which is now available here.
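For anyone wondering what the parsing job looks like, here is a rough Python sketch of the same idea: walk the html, collect the cells of each table row by row, and write each table out as csv. It is not the Cocoa class used in the app. It assumes well-formed markup with closing tags on rows and cells, and it does not handle nested tables or cells that span rows or columns. The names TableExtractor and tables_to_csv are made up.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects every table on the page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None    # cells of the row currently open, if any
        self._cell = None   # text fragments of the cell currently open, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_to_csv(html):
    """Return one csv string per table found in html."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    output = []
    for table in parser.tables:
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(table)
        output.append(buffer.getvalue())
    return output


page = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple, red</td><td>3</td></tr></table>"
)
print(tables_to_csv(page)[0])
```

The csv writer quotes any cell that contains a comma, so 'Apple, red' stays in one column when the file is opened in a spreadsheet app.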