Thursday 25 April 2019

Apple's notarization service

A big change that has been happening quietly in macOS is Apple's notarization service.

Ever since the App Store opened as the only place to obtain software for the iPhone ('jailbreaking' excepted), I've been waiting for the sun to set on being able to download and install Mac apps from the web, which is the core of my business. (Mac App Store sales amount to a small proportion of my income. That balance is fine, because Apple take a whopping 1/3 of the selling price.)

Notarization is a step in that direction, although it still leaves developers free to distribute outside the App Store. It means that Apple examine the app for malware. At this point they can't reject your app for any reason other than the malware check. They do require the 'hardened runtime', which is a tighter security constraint, but I've not found it to restrict functionality in the way that the sandboxing requirement did when the App Store opened.

When the notarization service started last year, it was optional. Now Gatekeeper gives a friendlier message when an app is notarized, and it looks as if 10.15's Gatekeeper will refuse to run apps that haven't been notarized.

It's easy to feel threatened by this and imagine a future where Apple are vetting everything in the same way they do for the App Store. For users that's a great thing: it guarantees them a certain standard of quality in any app they may be interested in. As a developer, it feels like a constraint on my freedom to build and publish.

It genuinely seems geared towards reducing malware on the Mac. "This is a good thing" says John Martellaro in his column.

https://www.macobserver.com/columns-opinions/editorial/notarization-apple-greatly-reduce-malware-on-macs/?utm_source=macobserver&utm_medium=rss&utm_campaign=rss_everything

Wednesday 17 April 2019

New tool for analysing http responses

Your web page looks great in your browser but a tool such as a link checker or SEO checker reports something unexpected.

It's an age-old story. I'm sure we've all been there a thousand times.

OK, maybe not, but it does happen. An http response can be very different depending on which client is making the request. Maybe the server gives a different response depending on the user-agent string or another http request header field. Maybe the page is a 'soft 404' (a 'not found' page served with a 200 status).

What's needed is a tool that allows you to make a custom http request and view exactly what's coming back; the status, the response header fields and the content. And maybe to experiment with different request settings.
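The idea can be sketched in a few lines of Python. Here a throwaway local server (invented for illustration, as is the 'LinkChecker' user-agent string) serves different content depending on the user-agent, and a custom request lets us see exactly what comes back: the status, the response header fields and the content.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# A throwaway local server standing in for a real site. It varies its
# response body by User-Agent, so we can see the problem first-hand.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if "LinkChecker" in self.headers.get("User-Agent", ""):
            # a 'soft 404': error content served with a 200 status
            body = b"Sorry, that page can't be found"
        else:
            body = b"<html>Looks great in a browser</html>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence request logging

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

def fetch(user_agent):
    # Make a custom http request and return exactly what comes back:
    # the status, the response header fields and the content.
    conn = http.client.HTTPConnection("127.0.0.1", server.server_port)
    conn.request("GET", "/page", headers={"User-Agent": user_agent})
    resp = conn.getresponse()
    status, headers, content = resp.status, dict(resp.getheaders()), resp.read().decode()
    conn.close()
    return status, headers, content

browser_status, _, browser_content = fetch("Mozilla/5.0")
checker_status, _, checker_content = fetch("LinkChecker/1.0")
server.shutdown()

print(browser_status, browser_content)
print(checker_status, checker_content)
```

Both requests get a 200, but the checker sees a 'not found' body: exactly the kind of discrepancy that only shows up when you can inspect the raw request and response.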

I've been meaning to build such a tool for a long time, either incorporated into Scrutiny or as a standalone app.

I've also been thinking of making some online tools. The ultimate plan is an online Integrity. The request / response tool is the ideal candidate for an online tool.

For a name, one of my unused domains seemed strangely appropriate. A few minutes with the Wacom and it has a logo.



It's early days and there are more features to add. But feel free to try it. If you see any problems, please let me know.

http://purpleyes.com

Monday 1 April 2019

Scraping links from Google search results

This tutorial assumes that you want to crawl the first few search results pages and collect the links from the results.

Remember that:

1. The search results that you get from WebScraper will be slightly different from the ones that you see in your browser. Google's search results are personalised: if someone else in a different place does the same search, they will get a different set of results from yours, based on their browsing history. WebScraper doesn't use cookies, and so it has no personality or browsing history.

2. Google limits the number of pages that you can view within a certain time. Running this example once with these settings will be fine, but run it several times and it'll eventually fail. I believe Google allows each person around 100 requests per hour before showing a CAPTCHA, but I'm not sure of the exact number. If the scan stops working, press the 'log in' button at the bottom of the scan tab; you should see the CAPTCHA, and once you complete it you should be able to continue. If this is a problem, adapt this tutorial for another search engine which doesn't have this limit.

We're using WebScraper for Mac which has some limits in the unregistered version.

1. The crawl
Your starting url looks like this: http://www.google.com/search?q=ten+best+cakes. Set 'Crawl maximum' to '1 click from home', because in this example we can reach the first ten pages of search results within one click of the starting url (see Pagination below).
One important point not ringed in the screenshot: choose 'IE11 / Windows' for your user-agent string. Google serves different code depending on the browser, and the regex below was written with the user-agent set to IE11 / Windows.
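For the curious, the shape of the starting url is simple enough to build by hand. A quick Python sketch (WebScraper itself just takes the finished url, so this is purely illustrative):

```python
from urllib.parse import urlencode

# Build a Google search url like the starting url above.
# urlencode escapes the query, turning spaces into '+'.
query = "ten best cakes"
url = "http://www.google.com/search?" + urlencode({"q": query})
print(url)  # http://www.google.com/search?q=ten+best+cakes
```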

2. Pagination 
We want to follow the links at the bottom of the page for the next page of search results, and nothing else. (These settings are about the crawl, not about collecting data.) So we set up a rule that says "ignore urls that don't contain &start=" (see above).
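The effect of that rule can be sketched as a simple filter. The urls here are made up, but the principle is the same: only the pagination links, which carry a &start= parameter, survive.

```python
# urls found on a results page (invented examples)
found = [
    "http://www.google.com/search?q=ten+best+cakes&start=10",
    "http://www.google.com/search?q=ten+best+cakes&start=20",
    "http://www.google.com/imghp",          # unrelated link we don't want to crawl
    "http://www.google.com/preferences",    # another unrelated link
]

# "ignore urls that don't contain &start="
to_crawl = [u for u in found if "&start=" in u]
print(to_crawl)  # only the two pagination urls remain
```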

3. Output file
Add a column, choose Regex. The expression is <div class="r"><a href="(.*?)"
You can bin the other two default columns if you want to. (I didn't bother and you'll see them in the final results at the bottom of this article.)
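To see what that expression does, here it is applied in Python to a fabricated fragment in the shape Google served to IE11 at the time (the class names and urls are assumptions; the real markup may differ and does change over time):

```python
import re

# A made-up fragment resembling Google's IE11-era results markup.
html = (
    '<div class="r"><a href="https://example.com/cakes">Ten best cakes</a></div>'
    '<div class="r"><a href="https://example.org/baking">Baking guide</a></div>'
)

# The same regex as in the output-file column; the lazy (.*?) group
# captures everything up to the closing quote of the href.
links = re.findall(r'<div class="r"><a href="(.*?)"', html)
print(links)  # ['https://example.com/cakes', 'https://example.org/baking']
```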

4. Separating our results
WebScraper is designed to crawl a site and extract a piece of data from each page; each new row in the output file represents a page from the crawl. Here we want to collect multiple pieces of data from each page, so the scraped urls from each search results page will appear in a single cell (separated by a special character). We therefore ask WebScraper to split these onto separate rows, which it does when it exports.
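The split-on-export step amounts to something like this. The separator WebScraper actually uses internally isn't documented here, so '|' is an assumption for the sketch:

```python
# One row per crawled results page; each cell holds several scraped urls
# joined by a separator character ('|' is assumed for illustration).
rows = [
    "https://example.com/a|https://example.com/b",   # page 1 of results
    "https://example.com/c|https://example.com/d",   # page 2 of results
]

# On export, each cell is split so every url gets its own row.
exported = [url for cell in rows for url in cell.split("|")]
print(exported)
```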
5. Run
Press the >Go button and you'll see the results fill up in the Results tab. As mentioned, each row is one page of the crawl; you'll need to export to split the results onto separate rows.
Here's the output in Finder / Quick Look. The csv should open in your favourite spreadsheet app.