PeacockMedia: December 2019

Tuesday, 31 December 2019

Finding mixed / insecure website content using Scrutiny

Some browsers have been warning that a page is insecure for a while now. I read recently that Google Chrome will start blocking HTTP resources in HTTPS pages.

If you've not yet migrated your website to https:// then you're probably thinking about doing it now.

Once the certificate is installed (which I won't go into) you must weed out links to your http:// pages, and find pages that have 'mixed' or 'insecure' content, i.e. references to images, CSS, JS and other files which are http://.

Scrutiny makes it easy to find these.

If you're not a Mac user or you'd simply like me to do it for you, I'm able to supply a mixed content report for a modest one-off price. It will list:

pages with links to internal http:// pages
pages which use resources (images, style sheets, etc) which are http://
https:// pages which have a canonical which is http://
https:// urls which redirect to a http:// url

If you're interested in using Scrutiny to do this yourself, read on.

1. Find links to http pages and pages with insecure content.

First you have to give Scrutiny your https:// address as your starting url, and make sure that these two boxes are ticked in your site-specific settings,

and these two as well,

After running a scan, Scrutiny will offer to show you these issues,

You'll have to fix-and-rescan until there's nothing reported. (Certain fixes may reveal new pages for Scrutiny to test.)
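For anyone curious about what this kind of check involves, here's a minimal Python sketch that looks through the HTML of a single page for http:// links, http:// resources and an http:// canonical. It is not how Scrutiny works internally; the function and class names and the list of tags are my own assumptions, and it only inspects one page's markup rather than crawling a site.

```python
from html.parser import HTMLParser

# Tag/attribute pairs that load a resource into the page.
# This list is an assumption and is not exhaustive.
RESOURCE_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("link", "href"),
    ("iframe", "src"),
    ("source", "src"),
    ("video", "src"),
    ("audio", "src"),
}


class InsecureContentFinder(HTMLParser):
    """Collects http:// references found in one page of HTML."""

    def __init__(self):
        super().__init__()
        self.found = {"links": [], "resources": [], "canonical": []}

    def handle_starttag(self, tag, attrs):
        rel = (dict(attrs).get("rel") or "").lower()
        for name, value in attrs:
            if not value or not value.lower().startswith("http://"):
                continue  # only http:// references are of interest
            if tag == "a" and name == "href":
                self.found["links"].append(value)
            elif tag == "link" and name == "href" and rel == "canonical":
                self.found["canonical"].append(value)
            elif (tag, name) in RESOURCE_ATTRS:
                self.found["resources"].append(value)


def find_insecure(html):
    """Return a dict of http:// links, resources and canonicals in html."""
    finder = InsecureContentFinder()
    finder.feed(html)
    return finder.found
```

Calling find_insecure() on a page's source returns three lists, which roughly correspond to the first three items in the report described above. Redirects can't be seen in the markup, so they would need a separate check.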

2. Fix broken links and images

Once those are fixed, there may be some broken links and broken images to fix too. (I was copying stuff onto a new server and trying to copy only what was needed; there are inevitably things that you miss.) Scrutiny will report these and make them easy to find.

3. Submit to Google.

Scrutiny can also generate the xml sitemap for you, listing your new pages (and images and pdf files too if you want).
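To show what ends up in such a file, here's a minimal Python sketch that builds a bare-bones xml sitemap from a list of page URLs. The build_sitemap name and the example.com addresses are hypothetical, and the sketch leaves out optional fields such as last-modified dates as well as the image and pdf entries that Scrutiny can include.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) listing only the https:// urls."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if not url.startswith("https://"):
            continue  # leave any remaining http:// addresses out
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "https://example.com/",
        "https://example.com/about.html",
    ]))
```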

Apparently Google treats the https:// version of your site as a separate 'property' in its Search Console (formerly Google Webmaster Tools). So you'll have to add the https:// site as a new property and upload the new sitemap.

4. Redirect

As part of the migration process, Google recommends that you then "Redirect your users and search engines to the HTTPS page or resource with server-side 301 HTTP redirects" (full article here)
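How you set up the redirect depends on your server, so I won't guess at that here. What can be sketched is the rule a correct redirect should follow: the response to an http:// request should have status 301 and point at the same address with https://. The function names below are hypothetical, and the sketch assumes you've already obtained the status code and Location header by whatever means you prefer.

```python
from urllib.parse import urlsplit


def https_equivalent(url):
    """Return the same url with its scheme switched to https."""
    return urlsplit(url)._replace(scheme="https").geturl()


def is_correct_redirect(original_url, status, location):
    """True if the response is a 301 to the https version of the same url."""
    return status == 301 and location == https_equivalent(original_url)
```

A 302 (temporary) redirect, or a redirect that sends every page to the home page, would fail this check.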

Wednesday, 11 December 2019

Using WebScraper to extract data from paginated search result pages, another real case study

First we need to give WebScraper our starting URL. This may be as simple as performing the search in a browser and grabbing the url of the first page of results. In this case things are even easier, because the landing page on the website has the first pageful of results. So that's our starting URL.

After that, we want WebScraper to 'crawl' the pagination (i.e. next/previous, or page 1, 2, 3 etc) and ignore any of the other links to other parts of the site. In this case, the pages have a nice clean querystring which only includes the page number:

So as far as the crawl goes, we can tell WebScraper to ignore any url which doesn't contain "?page=". We're in the Advanced section of the Scan tab for these next few steps.

In some cases, the pages we want the data from are a further click from the search results pages. Let's call those 'detail pages'. It's possible to tell WS to stick to the scan rule(s) we've just set up, but to also visit and only scrape data from detail pages. This is done by setting up rules on the 'Output filter' tab. We don't need to do that here but this article goes into more detail about that.

In our case we're still going to use the output filter to make sure that we only scrape data from pages that we're interested in. We're starting within the 'directory' /names. So the first part, "url does contain /names" shouldn't be necessary*. There's a silly quirk here which is that the first page won't stick at ?page=1, it redirects back to /names. Therefore we can't use /names?page= here if we're going to capture the data from that first page. I notice that there are detail pages in the form /names/somename so in order to filter those from the results, I've added a second rule here, "URL does not contain names/"
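The scan rule and the two output filter rules described above boil down to simple 'contains' tests. Here's a minimal Python sketch of that logic, just to make it concrete. The function names and the example.com addresses are hypothetical; this isn't WebScraper's own code.

```python
def should_crawl(url):
    """Scan rule: only follow links that contain "?page="."""
    return "?page=" in url


def should_scrape(url):
    """Output filter: url contains "/names" but does not contain "names/"."""
    return "/names" in url and "names/" not in url


if __name__ == "__main__":
    for url in (
        "https://example.com/names",           # first page (redirected from ?page=1)
        "https://example.com/names?page=2",    # later results page
        "https://example.com/names/somename",  # detail page, filtered out
    ):
        print(url, should_crawl(url), should_scrape(url))
```

Note that the first page of results passes the output filter even though its url has no "?page=" in it, which is why the filter can't be "/names?page=".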

If we're lucky, the site will be responsive and won't clam up on us after a limited number of responses, and our activities will go unnoticed. However, it's always a good policy to put in a little throttling (the lower the number, the longer the scan will take).

Don't worry too much about the number of threads here. The above setting takes that into account, and it's beneficial to use multiple threads even when throttling right down.

Finally, we set up our columns. If you don't see the Output file columns tab, click Complex setup (below the Go button). By default, you'll see two columns set up already, for page URL and page title. Page URL is often useful later, page title not so much. You can delete one or both of those if you don't want them in the output file.

In this example, we want to extract the price, which appears (if we 'view source') as
data-business-name-price="2795". This is very easy to turn into a regex: we just replace the part we want to collect with (.+?).

I had a little confusion over whether this appeared on the page with single-quotes or double-quotes, and so to cover the possibility that the site isn't consistent with these, I also replaced the quotes in the expression with ["'] (square brackets containing a double and single quote).
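To try an expression like this outside the app, here's a minimal Python sketch. It assumes the expression behaves the same way in Python's re module as it does in WebScraper, which is likely for something this simple but isn't guaranteed. The sample HTML and the extract_prices name are made up for illustration.

```python
import re

# The attribute from 'view source', with the value replaced by (.+?)
# and each quote replaced by ["'] to accept either kind.
PRICE_PATTERN = re.compile(r"""data-business-name-price=["'](.+?)["']""")


def extract_prices(html):
    """Return every captured price in the page, in order of appearance."""
    return PRICE_PATTERN.findall(html)


if __name__ == "__main__":
    sample = (
        '<div data-business-name-price="2795"></div>'
        "<div data-business-name-price='1200'></div>"
    )
    print(extract_prices(sample))
```

The (.+?) is non-greedy, so each match stops at the first closing quote rather than running on to the end of the line.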

(Setting up the other columns for this job was virtually identical to the above step).

If necessary, WebScraper has a Helper tool to help us create and test that Regex. (View>Helper for classes/Regex.)

The Test button on the Output columns tab will run a limited scan, which should take a few seconds and put a few rows in the Results tab. We can confirm that the data we want is being collected.

Note that, by default, WebScraper outputs to csv and puts the data from one page on one row. That's fine if we want one piece of data from each page (or want to extract a number of different pieces into different columns) but here we have multiple results on each page.

Choosing json as an output format gets around this because the format allows for arrays within arrays. If we need csv as the output then WS will initially put the data in a single cell, separated with a separator character (pipe by default). There's then an option on the Post-process tab to split those rows and put each item of data on a separate row. This is done when exporting, so won't be previewed on the Results tab.
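If you end up with a csv file that still has the pipe-separated cells, the same kind of split can be done afterwards. Here's a minimal Python sketch of the idea. It makes two assumptions of its own: that cells holding a single value (such as the page url) should be repeated on every new row, and that a short list should be padded with empty cells. WebScraper's own post-processing may behave differently.

```python
import csv
import io


def pick(parts, i):
    """Repeat single values; otherwise take item i, or '' if missing."""
    if len(parts) == 1:
        return parts[0]
    return parts[i] if i < len(parts) else ""


def split_rows(csv_text, separator="|"):
    """Expand each row with separator-joined cells into one row per item."""
    reader = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = next(reader, None)
    if header is not None:
        writer.writerow(header)
    for row in reader:
        if not row:
            continue  # skip blank lines
        cells = [cell.split(separator) for cell in row]
        longest = max(len(parts) for parts in cells)
        for i in range(longest):
            writer.writerow([pick(parts, i) for parts in cells])
    return out.getvalue()
```

For example, a row with the cells "Ann|Bob" and "10|20" becomes two rows, one for Ann with 10 and one for Bob with 20, each with the page url repeated.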

That's the setup. Go runs the scan, and Export exports the data.

*remembering to answer 'Directory' when asked whether the starting url is a page or directory

Sunday, 8 December 2019

Manual for WebScraper for Mac

WebScraper is becoming one of our more popular apps and it's growing in terms of its features and options too.

When I use it to do a job for myself or for someone else, I have trouble remembering the rules for certain features.

So I've written a manual in Plain English which should help everyone whether they are a first-time user or just need to check a detail.

https://peacockmedia.software/mac/webscraper/manual.html

There are links on the app's home page and support page. I'll put a link under the app's Help menu too.