Tuesday, 3 January 2017

Crawling a website that requires authentication

This is a big subject, and it gets bigger and more complicated as websites become increasingly clever at preventing non-human visitors from logging in.

My post "How to use Scrutiny to test a website which requires authentication" has been updated a number of times, and I've just updated it again to cover a relatively recent Scrutiny feature. It's a simple trick: a browser window within Scrutiny lets you log into your site. If the site uses a tracking cookie, that cookie is then retained for Scrutiny's scan.
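To illustrate the general idea (this is not Scrutiny's actual code), here is a minimal Python sketch of a crawler that reuses a cookie obtained from an earlier login. The cookie name `sessionid`, its value and the domain `example.com` are all made up for the example; a real site will use its own names, and may need more than a single cookie.

```python
import http.cookiejar
import urllib.request


def make_session_cookie(name, value, domain, path="/"):
    """Build a cookie like one a site might set after a successful login.

    The name, value and domain are whatever the site issued; the ones used
    in the usage note are hypothetical.
    """
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def build_crawler(cookies):
    """Return (opener, jar); the opener sends matching cookies from the jar."""
    jar = http.cookiejar.CookieJar()
    for cookie in cookies:
        jar.set_cookie(cookie)
    # HTTPCookieProcessor attaches cookies from the jar to outgoing requests
    # and stores any new cookies the server sends back.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar
```

With this in place, something like `opener, jar = build_crawler([make_session_cookie("sessionid", "abc123", "example.com")])` followed by `opener.open("http://example.com/members")` would send the stored cookie along with the request, which is the same principle as logging in first and scanning afterwards.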

It used to be possible to simply log in using Safari, because Safari's cookies seem to have been systemwide. Since Yosemite (10.10), though, a browser's cookies seem to be specific to that browser.

The reason for this all being on my mind today is that I've just worked the same technique into WebScraper. I wanted to compile a list of some website addresses from pages within a social networking site which is only visible to authenticated users.
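As a rough sketch of that kind of job (not how WebScraper itself does it), the Python below collects the off-site addresses linked from a page's HTML. The class and function names are invented for the example, and treating "website addresses" as links pointing away from the host site is my assumption here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def external_links(html, base_url):
    """Return unique http(s) links that point away from base_url's host."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    found = []
    for link in collector.links:
        parsed = urlparse(link)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript: and similar
        if parsed.netloc != base_host and link not in found:
            found.append(link)
    return found
```

The HTML passed in would be whatever the authenticated crawler fetched for each page; the function only does the extraction step.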



WebScraper doesn't have the full authentication features of Scrutiny, but I think this method will work with the majority of websites which require authentication.

(This feature, and others, are in WebScraper 1.3, which will be available very shortly.)

2 comments:

  1. How does that work then? Does it allow you to pre-launch the browser session so that you can enter your login credentials?

  2. Yep, it has its own simple browser window (within the WebScraper app) which opens with the homepage of the site in question. You use that to log in to the site. Cookies are enabled within that browser window, and because cookies are now 'per app' rather than systemwide (since 10.10), when WebScraper starts scanning it'll use any cookies it already has for that site. This does depend on the site using cookies to track the authenticated user, but many do. I'm not sure, but I don't think the scan counts as the same browser session as the login. If it doesn't, this may not work for some sites. I guess that's where the 'keep me logged in' checkbox comes into its own.
    Scrutiny covers more bases, with many more options and methods of attempting authentication.
