Wednesday, 5 August 2020

Checking hyperlinks within a Word document (.docx)

Scrutiny has long been able to check links within pdf documents encountered during a website scan. Scrutiny is a website crawling tool; it was never intended that you could point it at a local pdf and ask it to check the hyperlinks within it. But with a tweak, the current version can do this.

The option to check links within a Word document isn't a frequently-requested feature, but it has arisen a couple of times, and this week I've had a task where the ability to test / examine the hyperlinks within a .docx document would be valuable.


Getting to grips with the docx format has been an enjoyable (if sometimes bewildering) learning curve.

As with the pdf option, when the option is switched on, Scrutiny should now look inside Word documents discovered during the scan, report each link's url and link text, and test that link. This also works if the document is on the local drive and the hyperlink points to another local document. At present this will only work on the .docx format, not the older .doc format.
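For anyone curious about what 'looking inside' a .docx involves, here is a minimal Python sketch. It is not Scrutiny's code. It assumes the common layout in which the main part is word/document.xml and its hyperlink targets live in word/_rels/document.xml.rels; the function name docx_hyperlinks is made up for this illustration. It ignores headers, footers, footnotes and links stored as HYPERLINK field codes.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
HYPERLINK_TYPE = R_NS + "/hyperlink"


def docx_hyperlinks(docx_file):
    """Return (url, link_text) pairs for hyperlinks in the main document part.

    docx_file may be a path or a file-like object. A .docx is a zip archive,
    so the two XML parts can be read directly.
    """
    with zipfile.ZipFile(docx_file) as package:
        rels_xml = package.read("word/_rels/document.xml.rels")
        body_xml = package.read("word/document.xml")

    # Map relationship ids (rId1, rId2, ...) to their hyperlink targets.
    targets = {}
    for rel in ET.fromstring(rels_xml).iter("{%s}Relationship" % PKG_NS):
        if rel.get("Type") == HYPERLINK_TYPE:
            targets[rel.get("Id")] = rel.get("Target")

    links = []
    for link in ET.fromstring(body_xml).iter("{%s}hyperlink" % W_NS):
        rel_id = link.get("{%s}id" % R_NS)
        if rel_id in targets:
            # The visible link text may be split across several runs.
            text = "".join(t.text or "" for t in link.iter("{%s}t" % W_NS))
            links.append((targets[rel_id], text))
    return links
```

Each url returned could then be tested in the same way as any other link found during a scan.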


As I write this post (5 August 2020), this feature exists within the current development version of Scrutiny and is in testing. If you would like to try it, I'd be pleased to let you have a test version. (Contact me.) It's important to try this on as many different docx files as possible before release.

(Scrutiny offers a 30-day trial, so you'll still be able to try the feature if you're not a Scrutiny licence holder.)

Sunday, 19 July 2020

A sneaky peeky at the current development project

Years ago I wrote a simple development environment to help me hand code small websites. (Please don't judge my code.)


Development environment is overblowing it. But it did one very important trick.

But first, why would anyone want to hand-code a website? (Does anyone else hand-code any more?)

  • Having a server-side website CMS hacked is frustrating
  • Having plain html files on the server makes your site faster
  • If you're a control-freak like me, you want to write your own clean code, rather than allowing a system to generate unnecessary unreadable guff.

The first thing you notice when you hand-code a site, even a small one, is that if you make a change to code that appears on all pages, it's a huge PITA to go through all pages making the same change.

Hence that trick I mentioned. Those blue links in the screenshot are where an 'object' is inserted. An object is stored once, you edit it once and when you 'compile' the site, those placeholders on all pages are expanded. In the same operation, the app uploads the compiled pages to the server by ftp or sftp. Obviously, clicking that blue link in the editor loads the object for editing. The editor has forward and back navigation buttons.
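The 'compile' step can be sketched in a few lines of Python. This is only an illustration of the idea, not the app's code: the placeholder syntax [[object:name]] and the function name compile_page are invented here, and the upload step is left out.

```python
import re

# Hypothetical placeholder syntax: [[object:name]].
# The app's real markup is not shown in this post.
PLACEHOLDER = re.compile(r"\[\[object:([A-Za-z0-9_-]+)\]\]")


def compile_page(source, objects):
    """Expand every placeholder in source with the stored object of that name."""
    def expand(match):
        name = match.group(1)
        if name not in objects:
            raise KeyError("unknown object: " + name)
        # Objects may themselves contain placeholders, so expand recursively.
        # (No cycle detection: an object that includes itself would loop.)
        return compile_page(objects[name], objects)

    return PLACEHOLDER.sub(expand, source)
```

Because each object is stored once, editing objects["nav"] (say) and recompiling updates every page that includes it.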

That's a very brief overview. I've been using this myself for years. But as with tools you write for yourself, it's not very well-refined.

I've been thinking that I may not be alone in my wish to manage small sites this way. I guess most people who don't want a server-side CMS will be using a visual website creator on their computer.

I've decided to improve this to the point where it's a credible commercial (initially free) app and see what interest there is.

It's not ready yet. The whole object / compiling / uploading trick works; I've been using that for a long time. I now have basic syntax colouring working and, before any kind of release, I plan to build in these features:

  • A button to trigger 'Tidy', a unix html tidying utility (can also do some validation)
  • Preview of selected page, updated as you type
  • Compress (remove whitespace) before uploading

If you would like to read more, an early draft of the user guide is here.

Is this of interest to you? Are there other features you'd like in such an app? Let me know.

Wednesday, 8 July 2020

The browser padlock and why it might not appear



It's important to have an SSL certificate these days if your site is to have any credibility.

Even if you do have a valid certificate in place, you may still find that a browser refuses to display the padlock. Different browsers have their own criteria and display the information in different ways, but we've generally moved from 'a padlock when the site is secure' to a clear 'site insecure' warning.

The image above illustrates this. The site does have a valid certificate in place.  My two favourite browsers do both have developer tools which allow you to drill down and find the reason(s) for the warnings.

That's good for a single page that you know has a problem. But if you're a Scrutiny user, you want to be notified of any such problems on any page of your site.

Scrutiny has long had features to help you with migration to https://. It alerts you to old links to your http:// pages and to pages which have mixed content (images or linked files which are http://).

As mentioned above, browsers vary in their criteria for displaying the padlock. As from v9.8.0, Scrutiny makes additional checks / warnings:

The insecure content alert/report will now include:

  • insecure urls found in certain meta tags, such as open graph or Twitter cards.
  • insecure images, whether hosted internally or externally
  • insecure form action urls, even if the 'check form action' is switched off.
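As a rough illustration of this kind of check, here is a short Python sketch that lists http:// urls found in a few tag / attribute pairs on a page. The selection of attributes and the name find_insecure_urls are assumptions made for the example; it is not Scrutiny's actual list or logic.

```python
from html.parser import HTMLParser

# (tag, attribute) pairs to inspect. An illustrative selection only.
URL_ATTRIBUTES = {
    ("img", "src"),
    ("script", "src"),
    ("form", "action"),
    ("meta", "content"),  # e.g. og:image or twitter:image
}


class InsecureContentFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        for name, value in attrs:
            if (tag, name) in URL_ATTRIBUTES and value and value.lower().startswith("http://"):
                self.insecure.append((tag, name, value))


def find_insecure_urls(html):
    """Return (tag, attribute, url) for each http:// url in the watched attributes."""
    finder = InsecureContentFinder()
    finder.feed(html)
    return finder.insecure
```

Any result for a page served over https:// is a candidate reason for a missing padlock.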

Monday, 25 May 2020

Testing website accessibility (WCAG / ADA compliance) using Scrutiny

No software can test your website and declare it ADA, or more specifically WCAG, compliant because some of the checks need to be made by a human or are subjective.

For example, is a heading, title or link text meaningful? Only a human can judge. But software can tell you whether headings and title of a reasonable length are present and thus report pages of possible concern.

Having said that, there are certain very important things that automated testing does do very well, such as checking for images without alt text.

With that in mind, here is a list of the ways that Scrutiny can help. The checkpoint numbers relate to the WCAG 2.0 requirements.



Alt text (1.1.1):  "non-text content that is presented to the user has a text alternative" (level A)

  • Scrutiny can report images without alt text
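A check of this kind is simple to sketch in Python. The example below is an illustration rather than Scrutiny's implementation. It assumes that an empty alt="" (the usual way to mark a decorative image) is acceptable, so only a wholly absent attribute is reported; the name images_without_alt is invented for the example.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            # Assumption: alt="" is allowed, a missing alt attribute is not.
            if "alt" not in attributes:
                self.missing.append(attributes.get("src"))


def images_without_alt(html):
    """Return the src of every img element that has no alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html)
    return finder.missing
```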


Adaptable website structure (1.3.1, 1.3.2): Properly marked up and well-organised headings  (level A) and
Section headings (2.4.10) "Section headings are used to organize the content." (level AAA)

  • Scrutiny can report pages with more than one h1 tag. For a specific page, it can show you the outline (ie just the headings, indented)



  • Scrutiny's Robotize feature can display a 'text-only' view of a web page and let you browse the site, with headings and links listed separately. This is a good way to test this checkpoint.
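The heading checks above (counting h1 tags and showing an indented outline) can be sketched in Python as follows. This is a simplified illustration with invented names (outline, format_outline, h1_count), not the code Scrutiny uses.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeadingCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []  # list of (level, text)
        self._level = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = int(tag[1])
            self._text = []

    def handle_data(self, data):
        # Collect text only while inside a heading, including nested inline tags.
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            self.headings.append((self._level, "".join(self._text).strip()))
            self._level = None


def outline(html):
    """Return the page's headings as (level, text) pairs, in document order."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def format_outline(headings):
    """Indent each heading by two spaces per level below h1."""
    return "\n".join("  " * (level - 1) + text for level, text in headings)


def h1_count(headings):
    return sum(1 for level, _ in headings if level == 1)
```

A page where h1_count returns 0, or more than 1, is one to look at by eye.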



Keyboard accessible (2.1): "Make all functionality available from a keyboard."

  • If Scrutiny crawls a website fully (particularly with the 'run js' option switched off) then the navigation links are correctly-formed hyperlinks, so it should be possible to tab through them using a keyboard and therefore navigate the site. (NB Scrutiny does not currently test / report form fields or buttons)


Page titles (2.4.2): "Web pages have titles that describe topic or purpose." "The title of each Web page should: Identify the subject of the Web page, Make sense when read out of context, Be short"  (level A)

  • Scrutiny can report pages which have a non-unique title, and pages which have a title which is too short / too long.

Link text (2.4.4): "The purpose of each link can be determined from the link text alone or from the link text together with its programmatically determined link context" (level A)

  • Scrutiny can report empty links as bad links (this option defaults to off, needs to be switched on)
  • Scrutiny makes it easier to check manually for meaningful link text.  Cast your eye down the 'link text' column of the links / flat view (or sort the table by this column) and look for link text that doesn't explain the purpose of the link.


Descriptive headings and labels (2.4.6): "Headings and labels describe topic or purpose."  "Determine if the Web page contains headings. Check that each heading identifies its section of the content." (level AA)

  • Scrutiny can report pages with no h1
  • Scrutiny can include headings (h1's h2's etc in separate columns) in the SEO report to make it easier to scan them by eye and pick out non-descriptive ones


Parsing (4.1.1): Make sure HTML code is clean and free of errors, particularly missing bracket closes. Also, make sure all HTML elements are properly nested.

  • Scrutiny provides a way to validate the html code for a specific page. If your website is template-based, as is likely (ie the code for the design of the page is identical throughout the site), then validating the home page (and certain other pages if the design varies for different types of page) is a good indication of validation throughout the site.


Monday, 6 April 2020

Checking your browser's bookmarks

I had not considered this until someone recently asked about using Integrity to do the job.



Yes, in principle you can export your bookmarks from Safari or Firefox as a .html file and ask Integrity, Integrity Plus, Pro and Scrutiny to check all of the links it contains.

The only issue is that the free Integrity, and App Store versions of Integrity Plus and Integrity Pro are 'sandboxed', meaning that for security reasons, they generally only have access to local files within their own 'container'. Apple insists on this measure for apps distributed via their App Store.

For this reason, those versions of those apps will not be able to fully crawl a website stored locally (some people like to do this, although there are some advantages if you crawl via a web server, even via the apache server included with macOS).

However, here we're only talking about parsing a single html file for links, and testing those.

A sandboxed app can access any file that you have chosen via an open or save dialog.

So all you need to do is to use File > Open to choose your bookmarks.html file rather than typing its name or dragging it to the starting url field. (Remember 'check this page only' to ensure that you only check the links on the bookmarks file and the app doesn't try to follow all of them.)

I have bookmarks in Safari going back many years (nearly 2,000 apparently). There are so many pages there I'd forgotten about and some that clearly no longer exist or have moved.

Sunday, 22 March 2020

Improvements to free sitemap visualiser

SiteViz is, as its name suggests, a tool for visualising a sitemap.

Integrity Plus, Pro and Scrutiny can export their sitemap data in the form of a .dot file, which contains the information necessary to draw a chart.
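For readers who haven't met the format, a .dot file is plain text describing nodes and the connections between them. The Python sketch below writes a minimal one from a list of (from, to) link pairs. The graph name, the function name sitemap_to_dot and the absence of any node or edge attributes are simplifications for the example; the files exported by the apps will contain more than this.

```python
def sitemap_to_dot(links):
    """Build DOT source text from an iterable of (from_url, to_url) pairs."""
    def quote(url):
        # DOT identifiers containing punctuation must be double-quoted.
        return '"' + url.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph sitemap {"]
    for source, target in links:
        lines.append("    %s -> %s;" % (quote(source), quote(target)))
    lines.append("}")
    return "\n".join(lines)
```

Calling sitemap_to_dot([("/", "/about"), ("/", "/blog")]) gives a three-node graph with the home page linking to two children.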

Early on, Omnigraffle did a great job of displaying these .dot files but it became expensive. There were other options but nothing free and easy to use. That's where SiteViz came in and its functionality has been built into Scrutiny.

Its default chart didn't look great though. The layout was flawed making the charts less than ideal.

Version 3 contains some vast improvements to that 'bubble tree' theme, along with some improvements to the colouring. Nodes can be coloured according to links in or 'link juice'. (Think liquid flowing through your links). Connections can be coloured according to multiple criteria too. The charts now look much more professional and are much more useful, especially the bubble tree theme. The screenshots you see on this page were made using it.

It will remain a free viewer. You can export the chart as a png image or pdf, but it can't create the .dot sitemap. For that you'll need a website crawler like Integrity Plus or Pro.

Thursday, 19 March 2020

What happens in Objective-C if you compare two NSObjects with a simple 'greater than' or 'less than'?

This is one of those odd bugs that goes unnoticed because a line of code that shouldn't work,  strangely does, at least most of the time.

The problem exhibited itself under certain circumstances. That led me to investigate and discover this line (in pseudocode):

if ([object getProperty] > [anotherObject getProperty]){

Years ago, this line was correct, because this particular object's getProperty used to return a primitive such as an int or NSInteger (I can't remember which).

But at some point getProperty was changed so that it returned an NSNumber, which is an object rather than a simple value.

The line should have been updated to (and has been now):

if ([[object getProperty] integerValue] > [[anotherObject getProperty] integerValue]){

Of course I should have searched the source for 'getProperty' and updated accordingly but this particular line escaped. It went unnoticed. The compiler didn't complain, and everything still seemed to work.

If a tree falls in a forest......    If a bad line of code appears to work under testing and no-one reports a problem, is it still a bug?

It didn't always work though. Under certain circumstances that particular thing went screwy (no crash, no exception, just odd results sometimes). But not randomly. It worked or didn't work predictably, depending on the circumstances.

I can't find confirmation of this but it seems from what I've observed that

(NSObject > NSObject) 

returns true or false depending on the order that they were created. I'm assuming that the comparison is being made using the object's address (pointer) treated as a numeric value. This makes sense because declaring NSObject *myobject  declares myobject as a pointer, which is some kind of primitive, known to contain a memory address.

A simple experiment seems to bear this out.

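    // Two NSNumber objects; note that number1 holds the smaller value.
    NSNumber *number1 = [NSNumber numberWithInt:2];
    NSNumber *number2 = [NSNumber numberWithInt:10];

    // %p prints the pointer values themselves.
    NSLog(@"number1: %p, number2: %p", number1, number2);

    // This compares the two pointers, not the wrapped integers.
    if(number1 > number2){
        NSLog(@"number1 is greater than number2");
    }
    else{
        NSLog(@"number2 is greater than number1");
    }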

returns:

NumberTest[89819:23978762] number1: 0x7312659ea1d74f79, number2: 0x7312659ea1d74779
NumberTest[89819:23978762] number1 is greater than number2

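It's interesting that the objects seem to be allocated going backwards in memory in this example. I assume that allocation of memory is not predictable, but that there would be some general sequence to it, backwards or forwards. (One caveat: on 64-bit Apple platforms, small NSNumber values are usually stored as tagged pointers, with the value encoded in the pointer itself, so the values printed here may not be heap addresses at all.)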


I'm obviously pleased to have found and fixed a problem in this app. But more than anything this has amused and interested me.

If you have anything to contribute on this subject, leave it in the comments.

Monday, 16 March 2020

Scraping email addresses with WebScraper

This has been a popular question from WebScraper users.

It has already been possible to use WebScraper to scrape email addresses, with some caveats: a/ it required setting up, using your favourite regular expression for matching email addresses (these are easy to find online if you're not a regex demon), and b/ it'll only work if the email addresses appear unobfuscated in the source or visible text. It has long been popular to use some method of hiding email addresses from bots.

To help with this particular task, you can now set up WebScraper more easily (as from version 4.11.0).  'Email Addresses' now appears in the drop-down buttons for the simple and complex setups.

Here's the simple setup. It couldn't be easier; 'Email Addresses' appears in the second drop-down if you choose 'Content' in the first. Skip to the Post-processing tab  >>

Output file columns

For the complex setup, you get the URL and Title columns by default. You may like to keep those so that you can see which page each email address appears on. Or delete them if you simply want a list of email addresses. Then add a column. As with the simple setup, choose Content and then Email Addresses.

Results tab

At this point (after running the scan or a test) each page is presented on a row of this table. If there are multiple email addresses on a page, they'll be listed in a single cell separated by a pipe. We'll fix that later. There may also be big gaps where pages don't contain an email address. That's also something we can fix.

During the scan, at the point where email addresses are scraped from a page, the results are uniqued. So if the same address appears multiple times on the same page (which is likely if the address appears as a link - it may be in the link and in the visible text) then it'll only appear once in that row on the Results tab.

Post-processing tab

Here's where we can tidy up the output. The first checkbox will split multiple results onto separate rows. The third checkbox will skip rows where there are no results. The drop-down button will contain a list of your output columns; choose your email address column. As the label says, these things will be done when you export your output to csv.
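To make those two options concrete, here is a small Python sketch of the same idea: splitting a pipe-separated cell onto separate rows and skipping rows with no result. It only mimics the behaviour described above; the function names post_process and to_csv and the column names in the usage note are invented for the example.

```python
import csv
import io


def post_process(rows, column, separator="|", skip_empty=True):
    """Yield rows with the multi-value cell in `column` split onto separate rows.

    rows is a list of dicts, one per scanned page.
    """
    for row in rows:
        values = [v.strip() for v in row.get(column, "").split(separator) if v.strip()]
        if not values:
            # No result on this page: drop the row unless asked to keep it.
            if not skip_empty:
                yield dict(row)
            continue
        for value in values:
            new_row = dict(row)
            new_row[column] = value
            yield new_row


def to_csv(rows, fieldnames):
    """Serialise the rows to csv text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

For example, a row with URL "/a" and Email "x@example.com|y@example.com" becomes two rows, one per address, and a row with an empty Email cell disappears.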

Preferences

The default expression used for this task will match strings like xxxx@xxxx.xxx. I found that this can match certain images that have an @ symbol in their filename. If you wish to improve the regular expression, then simply change it in this field in Preferences.
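As an illustration of the image-filename problem, here is a Python sketch using a commonly seen (and deliberately simple) email pattern. The pattern is not WebScraper's default expression, which isn't reproduced in this post, and the list of image suffixes and the name scrape_emails are assumptions for the example.

```python
import re

# A simple, widely used pattern. Not WebScraper's default expression.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Matches ending in these are probably filenames such as logo@2x.png.
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".webp", ".svg")


def scrape_emails(text):
    """Return the unique email-like strings in text, in order of first appearance."""
    found = []
    for match in EMAIL.findall(text):
        if match.lower().endswith(IMAGE_SUFFIXES):
            continue
        if match not in found:
            found.append(match)
    return found
```

Filtering after matching keeps the expression itself readable; the alternative is a negative lookahead inside the pattern.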

Export

Here's my output for this example (I chose not to include the url and title columns). Note that the same address appears a lot. At time of writing WebScraper doesn't have a 'unique' option on the post-processing tab but that's under consideration. Also note that caveat b at the top of this article still applies.



Tuesday, 10 March 2020

Changes to nofollow links : sponsored and ugc attributes : how to check your links

Google announced changes last year to the way they'd like publishers to mark nofollow links.

The rel attribute can also contain 'sponsored' or 'ugc' to indicate paid links and user-generated content. Previously, nofollow links were not used for indexing or ranking purposes. But this is changing: Google will no longer treat nofollow as a strict instruction not to follow or index.

This article on moz.com lists the changes and how these affect you.

As from version 9.5.6 of Integrity (including Integrity Plus and Pro) and version 9.5.7 of Scrutiny, these apps allow you to see and sort your links according to these attributes.

There was already a column in the links views for 'rel' which displayed the content of the rel attribute, and a column for 'nofollow' which displayed 'yes' or 'no' as appropriate. Now there are new columns for 'sponsored' and 'ugc' (also displaying yes/no for easy sorting). Many of the views have a 'column' selector. If visible, these columns will be sortable and they'll be included in csv exports.
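The yes/no columns described above amount to splitting the rel attribute into tokens and looking for each keyword. Here is a Python sketch of that idea; it is an illustration with invented names (RelCollector, link_rel_report), not the apps' own code.

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "href" not in attributes:
            return
        # rel is a space-separated list of tokens; compare case-insensitively.
        tokens = (attributes.get("rel") or "").lower().split()
        self.links.append({
            "url": attributes["href"],
            "rel": " ".join(tokens),
            "nofollow": "yes" if "nofollow" in tokens else "no",
            "sponsored": "yes" if "sponsored" in tokens else "no",
            "ugc": "yes" if "ugc" in tokens else "no",
        })


def link_rel_report(html):
    """Return one dict per link with yes/no flags for nofollow, sponsored and ugc."""
    collector = RelCollector()
    collector.feed(html)
    return collector.links
```

Sorting the resulting rows on any of the three flag columns groups the links of that type together.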


Tuesday, 4 February 2020

How to extract a table from html and save to csv (web to spreadsheet)

WebScraper users have sometimes asked about extracting data contained in tables on multiple pages.

Tables on multiple pages


That's fine if the table is for layout, or if there's just one bit of info that you want to grab from each, identifiable using a class or id.

But to take the whole table raises some questions - how do you map the web table to your output file? It may work if you can identify a similar table on all pages (matching columns) so that each one can be appended and match up, and if the first row is always headings (or marked up as th) and can be ignored, except for maybe the first one.

It's a scenario with a lot of ifs and buts, which means that it may be one of those problems that's best dealt with on a case-by-case basis rather than trying to make a configurable app handle it. (If you do have this requirement, please do get in touch.)

Table from a single page


But this week someone asked about extracting a table from a single web page. It's pretty simple to copy the source from the web page, paste it into an online tool, or copy the table from the web page and paste into a spreadsheet app like Numbers or Excel and that was my answer.

But this set me thinking about the job of parsing html and extracting the table data ready for saving in whatever format.

At the core of this is a Cocoa class for parsing the html and extracting the table (or tables, if there is more than one on the page). I've now written that parser and built a small free app around it, with a view to possibly building it into WebScraper for the 'tables on multiple pages' task, or to having it ready should the need arise in a custom app for a one-off job.
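The app's parser is a Cocoa class and isn't shown here, but the general technique can be sketched in Python using the standard library. The names TableExtractor, extract_tables and table_to_csv are invented for this example, and it makes simplifying assumptions: nested tables are not handled and colspan / rowspan are ignored.

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each table on the page as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def _end_cell(self):
        if self._cell is not None:
            # Collapse runs of whitespace inside the cell.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._end_row()
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._end_row()  # copes with a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._end_cell()  # copes with a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()


def extract_tables(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables


def table_to_csv(rows):
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()
```

Using the csv module for output means that cells containing commas or quotes are escaped correctly without extra work.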

That app is the imaginatively-titled HTMLTabletoCSV which is now available here.


Wednesday, 29 January 2020

Upgrading the hard disk in a G4 iMac

I bought this iMac new in 2003 and used it as my main machine for some time. I was writing software at the time, had a number of products out, but it wasn't my full-time business then. Eventually it was packed away to make room for newer machines.

Isn't it beautiful? OSX 10.4 is a really great version of the operating system, and this is the time when iTunes was at its height. I'd long had it in mind that it would make a great jukebox for kitchen or bedroom, with its great Harman Kardon speakers with optional subwoofer.

To be useful again, mine needed:
  • Wireless card (I never did fit one)
  • Hard drive upgrade* (the standard 80GB one isn't big enough for half of my music collection)
  • Replacement CD drive** (mine had broken somewhere along the line)
  • New battery (small alkaline battery for keeping the time and date)
[Edit] since first writing this post, I've also added:
  • a set of JBL Creature speakers 
  • 1GB RAM

The parts I expected to be hard were easy, and the parts that should have been easy were hard. I'm documenting my experiences here because there are some things that may be useful to others.

I'd already collected some of these parts, intending to do the work one day. Last week something reminded me of Cro-Mag Rally. It's a great game and the Mac version is now free, thank you Pangea. The OSX version needs a PowerPC mac, and my clamshell iBook proved too slow, so out came the G4 iMac.

It wasn't remembering the time, which of course means that the battery is dead, and a battery inside an almost 20-year-old computer is bad news. So perhaps the time was right to carry out those other upgrades at the same time.

The wireless card and adding RAM are pretty much a trapdoor job (well, behind the stainless steel cover, which has 4 captive crosshead screws. Very simple.) No worries if that's all you have to do.

NB, 10.4 (Tiger) allows connection with WPA2, which is what I needed at home, and the Airport Extreme is working fine. 10.3 and earlier don't have the WPA options, only WEP. Whether those earlier systems can be made to work with a modern router, I can't say.

I had to go deeper. You need to go one layer further even to replace that battery. This is the new hard drive and replacement CD drive. The latter came from an eBay seller, used but working (allegedly). The HD and my new battery are from the BookYard. I recommend them (as a UK buyer); their website makes it easy to find the right parts for your particular machine. The service was efficient and swift.

I have read that delving more deeply into the machine is an advanced task and I was expecting it to be harder than it really was. It's actually very simple. I'm not going to give instructions; there are plenty elsewhere. That's the old CD/HD assembly coming out, note the horrible amount of dust in there. With those out, it's possible to vacuum out the fan and all of the nooks and crannies, taking anti-static precautions.

One point of uncertainty for me was about the jumper settings for the new HD and CD drives. The replacement CD came with a single ribbon cable with a 'tap' halfway for the HD. The existing ones each had their own cables (there are two sockets on my board). I suspect that the CD drive came from a different model (there are several variations and three with 17" screens - 800MHz, 1GHz and 1.25GHz.) Besides the difference with the cable, the little metal screen at the front was slightly different and I had to nab that part from the old drive.

Back to the jumper settings. The replacement CD drive came jumpered as a slave, which may be consistent with it being on the same ribbon cable as its HD. With my existing drives, each had their own cable and each was jumpered for 'cable select'. I decided to use my existing cables and jumper both drives for cable select as before. That worked.

NB - if you're going further than the trapdoor, there's some thermal paste that you must clean up and apply new. See the iFixit article for details.

There's the replacement battery in. With everything back together, it chimed and, as expected, there's the icon for 'no startup volume'.

This is where the games really started. The new CD drive worked and I expected it to be easy to make a fresh OS install on the new blank drive, but I had a very frustrating time.

I have numerous discs from various computers. However, not all were readable and some were upgrade discs rather than install. Those that I downloaded from the web and burned to CD were either not bootable or simply would not boot. (If you're reading this and know why, please leave the answer in the comments.)

I still had the drive that I'd taken out, and 'restoring' or cloning an existing volume to a new one should be very easy using Disk Utility, which comes on every system and every bootable install disc.

At first I opened up the computer again and plugged in the old drive in the CD drive's socket on the motherboard. This worked, I could boot from it and perform the 'restore'.

However, it didn't work first time - the clone on the new drive wouldn't boot. I had to try again and again. Because my first method was a little makeshift, I devised this - it's a firewire 'caddy' made from an old LaCie firewire hard drive. I opened it up, unplugged its own HD and plugged in my computer's HD and voila! It plugs into the reassembled computer and can boot the computer (you can boot from a firewire external drive but not a USB one).

The old drive had three different versions of OSX installed on different partitions. Back in the day this allowed me to boot into OSX 10.2, 10.3 or 10.4. Very useful for testing my software.

My first attempts at cloning the 10.4 partition failed a couple of times. The first time I thought it was because I hadn't checked 'erase' before doing the copy. But I don't know why the second attempt wasn't successful.

I'd read mixed opinions about the maximum size for the boot volume with this machine. I thought this might be the problem for a little while but I can tell you that I have a 500GB drive with a single partition / volume, and that it now boots.

For my final attempt I tried booting into 10.3 (Panther) and using its version of Disk Utility. (This, I think, being the system that came with the machine.) I don't know whether this made the difference, or whether it was because I reformatted the new drive at that point. (It had come apparently formatted, so I hadn't bothered before.) That new clone of the 10.3 volume did successfully boot. After that I successfully used an upgrade disc to take it to 10.4, and then a multipart update downloaded from the Apple developer site to take it to 10.4.11.

Now I have a mac with a system that really doesn't feel out of date, a version of iTunes (7) which is a joy to use, a 500GB drive, wireless (10.4 allows me to connect with WPA2). A useful and usable machine, and one which looks amazing.

[edit]

I've now added 1GB RAM. This is very cheap now, and keeps it responsive. Tiger is a newer system than the one the machine came with, and I was starting to notice a lack of responsiveness with very little running.

I've also added these 'Creature' speakers. They were sold as 'made for mac' when the G4 was sold new. I really wanted a white set, but this set was an absolute bargain. Replacing the drivers in the satellites is another story. The original speaker cones are destined to disintegrate, and these ones had. Now that I've replaced them I'm not entirely happy. I'll live with these for a while. I will probably get a broken but nice-looking white set in the future, and repair the little speakers with better replacement drivers.




* Why not an SSD? I had intended to do this for some time. But an ATA/IDE/PATA SSD isn't so easy to find, and I read that the limitations of the ATA system mean that the speed benefit is lost. A brand new spinning disc is cheaper and still quiet. So that's what I went for.
** I refer to the drive as a CD drive in this post. Actually it's a combined CD reader / writer and DVD reader, known as a 'superdrive'.