Tuesday 21 September 2021

First look at new app LinkDoc (for testing links within locally-stored pdf or docx documents)

One of the most frequently-asked questions on the Integrity support desk is how to test the links within a local document (.pdf or .doc). You might expect this to be possible with Scrutiny, which can parse a pdf or doc, but it only does so when it encounters the document as part of a website crawl.

Rather than shoehorn the functionality into the existing apps, this sounds more like a job for a 'single button' app built for this one purpose. Here it is.

It happens that I'm well into a ground-up rewrite of the Integrity / Scrutiny crawling engine. It's at the point where it runs. There's plenty to do, but for parsing a single page (document in this case) and checking the links, it should be fine. Of course, as the 'V12 engine' develops, any apps that use it, such as the new LinkDoc, will receive those updates.

If you'd like to try it, it's available for download now. It's free but in return, please contact us with any problems or suggestions.

Tuesday 14 September 2021

Frustrating loop with NSURLSession when implementing URLSession: task: willPerformHTTPRedirection: newRequest: completionHandler:

This isn't really a question, I think I've solved the problem. But I think I've seen this in the past and solved it then. So I'm posting this here in case anyone else - most likely future me - has this same problem and wants to save some time.

Context: I'm working on the next version of my crawling engine - it moves to NSURLSession rather than NSURLConnection**. Therefore (with the connection stuff at least) it's a ground-up rewrite. 

Here's the problem I hit this morning. A YouTube link is redirecting from a shortened link to a longer one. But some unexpected things are happening.

My URLSession: task: willPerformHTTPRedirection: newRequest: completionHandler: is being called, and at first I put in the minimum code needed to simply continue with the redirection:
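For illustration, a minimal pass-through version of that delegate method might look like the sketch below. The class name RedirectFollower is made up for this example; only the method signature comes from NSURLSessionTaskDelegate, and this is not the engine's actual code.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate class; the name is invented for this sketch.
@interface RedirectFollower : NSObject <NSURLSessionTaskDelegate>
@end

@implementation RedirectFollower

- (void)URLSession:(NSURLSession *)session
              task:(NSURLSessionTask *)task
willPerformHTTPRedirection:(NSHTTPURLResponse *)response
        newRequest:(NSURLRequest *)request
 completionHandler:(void (^)(NSURLRequest * _Nullable))completionHandler
{
    // Accept the proposed redirect exactly as the system built it.
    completionHandler(request);
}

@end
```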

According to example code found in various places, that should work. The documentation suggests that this is fine: "completionHandler: A block that your handler should call with either the value of the request parameter, a modified URL request object, or NULL to refuse the redirect and return the body of the redirect response."

As you can see from the screenshot above, our url is being redirected over and over, each time with some garbage appended to the querystring. 

Integrity and Scrutiny handle this fine. (Though at this point they are using the old NSURLConnection.) However, they have a lot more code in the redirect delegate method than my new one. Notably, they make a mutable copy of the new request and make some changes. Why do they need to do that?

I've a funny feeling that I've seen this problem before. Indeed, adding code to make a clone of the new request and modify it is what cures this problem.

It's not enough to simply clone and return the new request.  This is the line that makes the difference:

[clonedRequest  setValue: [[request URL] host] forHTTPHeaderField: @"Host"];

(There's more to it than that, but there are reasons why my code isn't open source! Also, I use Objective-C, apologies if you're a Swift person*).

In short, when we create the original request, we set the Host field. In the past I have found this to be necessary in certain cases. In fact, the documentation says that the Host field must be present with HTTP/1.1 and a server may return a 4xx status if it's not present and correct.

If we capture the header fields of the proposed new request in our delegate method, the original Host field appears unchanged. Therefore, it no longer matches the actual host of the updated request url. Here, the url is being redirected to https://www.youtube.com/watch?v=xyz, but the Host field remains "youtu.be".
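Putting the pieces together, here is a rough sketch of the fix, assuming (as described above) that the original request set its own Host field. The helper RequestWithHostMatchingURL and the class HostFixingDelegate are names invented for this example; the real code in Integrity and Scrutiny does more than this.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: clone the proposed redirect request and make its
// Host header agree with the host of its (new) URL.
static NSURLRequest *RequestWithHostMatchingURL(NSURLRequest *request)
{
    NSMutableURLRequest *clonedRequest = [request mutableCopy];
    // Without this line the stale Host value (e.g. "youtu.be") is carried over.
    [clonedRequest setValue:[[request URL] host] forHTTPHeaderField:@"Host"];
    return clonedRequest;
}

// Hypothetical delegate class; the name is invented for this sketch.
@interface HostFixingDelegate : NSObject <NSURLSessionTaskDelegate>
@end

@implementation HostFixingDelegate

- (void)URLSession:(NSURLSession *)session
              task:(NSURLSessionTask *)task
willPerformHTTPRedirection:(NSHTTPURLResponse *)response
        newRequest:(NSURLRequest *)request
 completionHandler:(void (^)(NSURLRequest * _Nullable))completionHandler
{
    // Follow the redirect, but with the corrected Host header.
    completionHandler(RequestWithHostMatchingURL(request));
}

@end
```

Keeping the header fix in a plain function makes it easy to check on its own, without needing a live session or a real redirect.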

As I mentioned, I've written the fix into Integrity and Scrutiny in the dim and distant past, presumably because I have spent time on this problem before. 

I'm guessing that this isn't a problem if you don't explicitly set the Host field yourself in your original request, but if you don't, then you may find other problems pop up occasionally.

If future me is reading this after starting another ground-up project and running into the same problem: you're welcome.

** The important delegate methods of NSURLSessionTask are very similar to the old connection ones. Because I'm seeing similar behaviour with both, I do believe that beneath the skin, NSURLSession and NSURLConnection are just the same. 

* I'm aware that many folks are using Swift today, and it's getting harder to find examples and problem fixes written in Objective-C. I'm expecting that (as with Java-Cocoa) Apple will eventually remove support. But I won't switch and I'm grateful for Objective-C support as long as it lasts.

Thursday 9 September 2021

429 status codes when crawling sites

I've had a few conversations with friends about Maplin recently. I have very good memories of the Maplin catalogue way back when they sold electronic components. The catalogue grew bigger each year and featured great spaceship artwork on the cover. They opened high street shops, started selling toys and then closed their shops.

The challenge with crawling the Maplin site is that the crawl would finish early after receiving a bunch of 429 status codes.

This code means "too many requests". So the server would respond normally for a while before deciding not to co-operate any more. When this happens, it's usually solved by throttling the crawler: limiting its number of threads, or imposing a limit on the number of requests per minute.

With Maplin I went back to a single thread and just 50 requests per minute (less than one per second) and even at this pedestrian speed, the behaviour was the same. So I guess that it's set to allow a certain number of requests from a given IP address within a certain time. It didn't block my IP, and after a break it would respond again.
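To make the arithmetic concrete: a limit of 50 requests per minute means leaving at least 60 / 50 = 1.2 seconds between requests. Here is a small sketch of how a single-threaded crawler could work out its wait time. Both function names are hypothetical and this is not how Integrity or Scrutiny implement their limiter.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: the smallest gap (in seconds) between two requests
// that keeps us at or under the given requests-per-minute limit.
static NSTimeInterval MinimumIntervalForRequestsPerMinute(NSUInteger requestsPerMinute)
{
    if (requestsPerMinute == 0) {
        return 0.0; // assumption for this sketch: 0 means "no limit"
    }
    return 60.0 / (NSTimeInterval)requestsPerMinute;
}

// Hypothetical helper: how long to wait before sending the next request,
// given when the previous one was sent.
static NSTimeInterval DelayBeforeNextRequest(NSDate *lastRequestDate,
                                             NSDate *now,
                                             NSUInteger requestsPerMinute)
{
    NSTimeInterval elapsed = [now timeIntervalSinceDate:lastRequestDate];
    NSTimeInterval remaining = MinimumIntervalForRequestsPerMinute(requestsPerMinute) - elapsed;
    return MAX(remaining, 0.0); // never a negative wait
}
```

As the Maplin crawl showed, a per-minute limit like this doesn't help if the server counts requests over a longer window.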

I managed to get through the site using a technique which is a bit of a hack but works. It's the "Pause and Continue" technique. When you start to receive errors, pausing and waiting for a while allows you to continue and make a fresh start with the server. A useful feature of Integrity and Scrutiny's engine is that on Continue, it doesn't just continue from where it left off. It will start at the top of its list, ignore the good statuses but re-check any bad links. This leads to the fun spectacle of the number of bad links counting backwards!
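The "start at the top, skip the good ones, re-check the bad ones" idea can be sketched roughly as follows. The function name and the data layout (an ordered list of urls plus a dictionary of last-seen statuses) are assumptions for this example, as is the choice to treat 2xx and 3xx as "good"; the real engine will differ.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: walk the link list from the top and collect every url
// that should be checked again on Continue. A url is skipped only if its last
// recorded status was "good" (assumed here to mean 200-399). Urls with a bad
// status (such as 429) or with no status yet are returned, in list order.
static NSArray<NSString *> *URLsToRecheck(NSArray<NSString *> *orderedURLs,
                                          NSDictionary<NSString *, NSNumber *> *statusByURL)
{
    NSMutableArray<NSString *> *recheck = [NSMutableArray array];
    for (NSString *url in orderedURLs) {
        NSNumber *status = statusByURL[url];
        NSInteger code = status.integerValue; // 0 when there is no status yet
        BOOL good = (status != nil && code >= 200 && code < 400);
        if (!good) {
            [recheck addObject:url];
        }
    }
    return recheck;
}
```

Each re-checked link that now comes back good drops out of the bad list, which is why the bad-link count runs backwards.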

On finish, there seem to be around 50 genuinely broken links. Easily fixed once found.

Saturday 4 September 2021

Crawling big-name websites. Some thoughts.

Over the last couple of weeks I've been crawling the websites of some less-popular* big names. 

I enjoy investigating websites. It gives me some interesting things to think about and comment on, and it allows me to test my software 'in the wild'.

Already I'm feeling disappointed with the general quality of these sites, and I'm noticing some common issues. 

The most common by far is the "image without alt text" warning. As someone with a history in website accessibility, this is disappointing, particularly as it's the easiest accessibility improvement and SEO opportunity. Above is a section of the warnings from the RBS site. Every page has a list of images without alt text, and I see this regularly on sites that I'm crawling.

Next are the issues which may be the result of blindly plugging plugins and modules into a CMS. Last week I saw the issue of multiple <head> tags, some of them nested, in the Shell UK website. This showed up a small issue with Scrutiny (fixed in 10.4.2 and above).

One of the sites I've crawled this week, Ryanair, showed a different problem which may also be the result of plugins that don't play nicely together. 

The content page has two meta descriptions. Only one of them is likely to be displayed on Google's search results page. Don't leave that to chance.
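A quick way to check a page for this yourself is to count the meta description tags in its source. The sketch below uses a regular expression, which is a crude stand-in for a real HTML parser (it will miscount if an attribute value contains a ">" character, for example). The function name is made up and this is not how Scrutiny detects the problem.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: count <meta name="description" ...> tags in a chunk
// of html. More than one suggests the 'double-description' problem.
static NSUInteger CountMetaDescriptions(NSString *html)
{
    // Matches a <meta ...> tag that has name="description" (or single quotes),
    // in any letter case, with any other attributes before or after it.
    NSString *pattern = @"<meta\\b[^>]*\\bname\\s*=\\s*[\"']description[\"'][^>]*>";
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:NULL];
    return [regex numberOfMatchesInString:html
                                  options:0
                                    range:NSMakeRange(0, html.length)];
}
```

The count is only as good as the source you give it: if a second description is added by javascript, it will only show up in the rendered page code.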

Before getting to that point, the first black mark to Ryanair is that the site can't be viewed without JavaScript rendering. It's all very well for js to make pretty effects on your page but if nothing is visible on the page without js doing its stuff in the browser, then that is bad accessibility and arguably could hinder search engines from being able to index the pages properly.**

This is what the page looks like in a browser without JS enabled, or on any other user agent that doesn't do rendering. This is what Integrity and Scrutiny would see by default. To crawl this site we need to enable the 'run js' feature. 

This aspect of the site helps to mask the 'double-description' problem from a human - if you 'view source' in a browser (depending on the browser) you may not even see the second meta description because you may see the 'pre-rendered' page code.

Scrutiny reported the problem and I had to look at the 'post-rendered' source to see the second one:

I hope you enjoy reading about this kind of thing. I enjoy doing the investigation. So far no-one from any of the companies I've mentioned on blog pages and tweets has made contact, but I'd welcome that.

*less-popular with me.

** It used to be the case that no search engine would be able to index such a page. Now Google (but not all search engines) does render pages. To some extent.