Wednesday 24 May 2017

Heads-up : Internationalised Domain Names (IDNs) supported in our web crawlers

We're a UK-based concern, our apps have been almost always available in a single language - UK English (or just English as we call it here in England!)  The vast majority of our users as I write this are from English-speaking countries.

Our alphabet entirely consists of characters available in ascii, and so there has been little call for Integrity, Integrity Plus and Scrutiny (and other tools based on the same engine) to support domain names - ie domain names which contain characters not found in the ascii character set.

But now we've started work on localisation of our apps and web pages, and have received the odd question concerning IDNs.

Let's not confuse this with unusual characters in the path and filename of the url. Our apps have long supported these. You may still see the non-ascii characters displayed, but behind the scenes, those characters are encoded before the http request is put together, usually using a percent-encoding system.

The method is similar with the domain name, but using a a more complex and clever system of character encoding. Browsers (and our web crawlers) often still display the user-friendly unicode version.

You can enter your starting url in the unicode form or the 'punycode' form and it'll be handled correctly. The same goes for unicode or punycode links found on your pages.

Personally, I'm not keen, this does allow for spoofing of legitimate domains using similar characters. There are rules excluding many characters for these reasons.

After lots of extra homework for us, Scrutiny is now handling IDNs, and is in testing.

[update 26 May 2017] Integrity and Integrity Plus also have this enhancement and are also in testing.

If this is useful for you, and you'd like to try the new version (remembering that there may still be the odd bug to iron out)  then you're very welcome to download and use it (with the condition that you let us know about any issues you spot.)

[edit - links taken out, out of date]

No comments:

Post a Comment