Saturday, 19 May 2018

Review - Notebook for Mac from Zoho

One person's pro is another's con. I've seen Notebook lose stars because it doesn't allow you to set the colour of text. But I've long been looking for a sticky note alternative that *doesn't* do rich text, ie one that allows me to copy / paste in and out as plain text. The very long-standing Stickies app is, unbelievably, still here in High Sierra, unchanged for decades. But its notes have long worked using rich text.

(OK I can usually cmd-alt-shift-V for paste as plain text, but that doesn't work everywhere.)

Initially it appeared that Notebook's text cards are plain text, but after some use I found that they're not plain text, but not rich text either. More on that later.

Learning curve

There's some nomenclature and a few concepts to learn, as with any app. The thing I think of as a sticky note is a 'text card' and collections of various types of note are called 'Notebooks'. Nothing too taxing there. When you start up there is a quick tour (later accessible from the Help menu - "Onboarding" - what horrible business-speak!) but frankly I find that kind of thing annoying. A little like peas; I know they're good for me but I want to go straight for the meat. There's a bit of emphasis on gestures, which a. don't seem to work with my magic mouse, and b. is a swipe or a pinch-out really easier than a double-click? I found myself single-clicking things and expecting them to open, which they don't and I don't see why they couldn't. All-in-all, up and running in moments with minimal hold-ups and bafflement. 9.5/10

Look and feel

This is the top-billed selling point - "the most beautiful note-taking app across devices". It certainly looks polished, if very white. It has a non-standard 'modern web app' feel to it, with lots of sliding and whitespace. Non-standard in MacOS terms, that is. I guess the white background and weird white title bar (lacking a standard customisable toolbar) is for consistency with the iOS version of the app (losing a star here). If the text cards themselves don't behave exactly like sticky notes, they are very analogous. They have a random coloured background and have a sticky-note coloured-square look before you open them up. Notebooks (collections of cards) have customisable covers. These visual cues are very important as your brain gets older and less agile!   9/10

What does it offer over and above the tools that come with the OS?

Bizarrely I can almost create the look and functionality that I want using TextEdit - it allows me to create square plain text documents and set the background colour of each. Unfortunately it doesn't remember those background colours after quit and restart.

MacOS's Notes app scores highly for its syncing across devices (with some frequent annoying duplication). It's the app I have been using when I need to paste something or note down a thought when my phone is in my hand, and later retrieve on the computer. It allows you to organise things in a pretty basic way.

Reminders is as per Notes but for lists. Slightly annoying having to open both those system apps to have text and lists at my fingertips.

Notebook combines text / lists / images / voice and file attachments (why only certain file types?) It has good visual cues (custom covers on the notebooks and coloured background to the text cards).

It still doesn't give me plain text paste in/out, so a full mark lost here, because it's something I'm specifically looking for. But it's not rich text either.... It carries text size and weight, but not colour. It doesn't carry pasted-in font; each note does have its own font which is carried when copying and pasting out. If Zoho are reading, I'd LOVE the option for the text cards to look and work exactly as they do, but to have only plain text copied to the clipboard whenever I copy from a text card. 8/10

Notebook does the syncing across devices, but only by creating an account with Zoho. iCloud-enabling would make a 9/10 into full marks for this part.

Main third-party competition

Without a doubt Evernote. This is made clear by the File > Import from Evernote option. Personally I didn't get past EN's start screen because it didn't seem to want to let me continue without creating an account.

Unexpected neat features

Notebook's 'quicknotes' feature is neat, allowing a quick paste of something from the status bar.  The ability to combine what I've been doing in Notes and in Reminders within one app. The ability to take voice memos and paste in pictures (as a separate card, not just because the cards are rich text - an important distinction). 10/10

Cost / if free - how are you paying?

This is the most remarkable thing about Notebook. The website clearly says that the app is free, subsidised by their business apps. They don't sell your data, there are no ads, there is no premium version, there are no crippled features. If there are any hidden costs, I haven't found them. Solid 10/10  here

In conclusion

Not quite meeting my 'plain text pasting out' requirements, and syncing probably good if you're willing to create an account with the makers. Combining functionality that I've been using various other apps for, and with neat fast access via the status bar. Thanks to Zoho for an incredibly functional and incredibly free app.  It all adds up to:


Saturday, 28 April 2018

Case study - using WebScraper to compile a list of information in a useful format from a website

Here's a frustrating problem that lent itself really well to using WebScraper for Mac.

This is the exhibitor list for an event I was to attend. It's a very long list, and unless you recognise the name, the only way to see a little more information about each exhibitor is to click through and then back again.

I wanted to cast my eye down a list of names and brief summaries, to see who I might be interested in visiting. Obviously this information will be in the printed programme, but I don't get that until the day.

(NB this walkthrough uses version 4.2.0 which is now available. The column setup table is more intuitive in 4.2 and the ability to extract h1s by class is a new feature in 4.2)

1. Setting up the scan. It's as easy as entering the starting url (the url of the exhibitor list). There's also this very useful scan setting (under Scan > Advanced) to say that I only want to travel one click away from my starting url (there's no pagination here, it's just one long list).

There's also a "new and improved" filter for the output. Think of this as 'select where' or just 'only include data in the output table if this is true'. In this case it's easy, we only want data in the output if the page is an exhibitor detail page. Helpfully, those all contain "/exhibitors-list/exhibitor-details/" in the url, so we can set up this rule:

2. Setting up the columns for the output table. The Helper tool shows me that the name of each business is on the information page within a heading that has a class. That's handy, because I can simply choose this using the helper tool and add a column which selects that class.

3. The summary is a little more tricky, because there's no class or id to identify it. But helpfully it is always the first paragraph (<p>) after the first heading. So we can use the regex helper to write a regular expression to extract this.

The easy way to write the expression is simply to copy a chunk of the html source containing the info you want, plus anything that identifies the beginning and end of it, and then replace the part you want with (.+?) (which means 'collect one or more of any character, stopping at the first point where the rest of the expression matches'). I've also replaced the heading itself with ".+?" (the same, but don't collect) because that will obviously change on each page. That is all simpler than I've made it sound there. I'm no regex expert (regexpert?) - you may well know more than me on the subject and there may be better expressions to achieve this particular job, but this works, as we can see by hitting enter to see the result.

Here's what the column setup looks like now:

(Note that I edited the column headings by double-clicking the heading itself. That heading is mirrored in the actual exported output file, and acts as a field name in both csv and json)

4. Press Go, watch the progress bar and then enjoy the results. Export to csv or other format if you like.
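For anyone who prefers to see the logic written down, here is a rough Python sketch of what the filter and the two columns are doing. It is not WebScraper's own code. The class name "exhibitor-name", the function names and the sample markup in the usage note are all made up for illustration; only the url fragment comes from the site described above.

```python
import csv
import io
import re

# The url fragment that identifies an exhibitor detail page (from the walkthrough).
DETAIL_MARKER = "/exhibitors-list/exhibitor-details/"

# Assumption: the heading class is hypothetical; use whatever the real page has.
NAME_RE = re.compile(r'<h1 class="exhibitor-name">(.+?)</h1>', re.DOTALL)

# First <p> after the first heading. The heading text is matched but not
# collected (.+?), the paragraph text is collected ((.+?)).
SUMMARY_RE = re.compile(r"<h1[^>]*>.+?</h1>\s*<p>(.+?)</p>", re.DOTALL)


def extract_row(url, html):
    """Return a dict for an exhibitor detail page, or None for any other page."""
    if DETAIL_MARKER not in url:
        return None  # the 'only include if this is true' output filter
    name = NAME_RE.search(html)
    summary = SUMMARY_RE.search(html)
    return {
        "Name": name.group(1).strip() if name else "",
        "Summary": summary.group(1).strip() if summary else "",
    }


def to_csv(rows):
    """Write the rows as csv text, using the column headings as field names."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["Name", "Summary"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Calling extract_row() with the url of the list page itself returns None, so only detail pages end up in the table, which is the same effect as the output filter in step 1.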

Friday, 27 April 2018

HTMLtoMD and Webscraper - a comparison converting & archiving a site as MarkDown

I noticed a few people visiting the website as a result of searching for "webpage to markdown" and similar.

Our app HTMLtoMD was designed to do exactly that, and is free. But it was experimental and hasn't received very much development in a long time.

It still does its job, but our app Webscraper can also do that job, with many more options. Webscraper is a much more active product; it's selling and is under regular development now.

I thought it was worth writing a quick article to compare the output and features of the two apps.

First HTMLtoMD. It's designed for one job and so is simple to use. Just type or paste the address of your page or homepage into the address bar and press Go. If you choose the single page option, you'll immediately see the page as markdown, and can copy that markdown to paste somewhere else.
If you don't choose the single page option, the app will scan your site, and then offer to save the pages as markdown.

Here's the big difference with WebScraper. HTMLtoMD will save each page separately as a separate markdown file:

So here's how to do the same thing with Webscraper. It has many more options (which I won't go into here.) The advanced ones tend not to be visible until you need them. So here's WebScraper when it opens.
For this task, simply leave everything at defaults, enter your website name and make sure you choose "collect content as markdown" from the 'simple setup' as shown above. Then press Go.

For this first example I left the output file format set to csv. When the scan has run, the Results table shows how the csv file will look. In this example we have three field names: url, title and content. Note that the 'complex setup' allows you to choose more fields; for example, you might like to collect the meta description too.
You may have noticed in one of the screenshots above that the options for output include csv, json and txt. Unlike HTMLtoMD, Webscraper always saves everything as a single file. The csv output is shown above. The text file is also worth looking at here. Although it is a single file, it is structured.   I've highlighted the field headings in the screenshot below. As I mentioned above, you can set things up so that more information (eg meta description) is captured.
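If you have the single csv file but would really like one markdown file per page (as HTMLtoMD gives you), a few lines of script can split it. This is only a sketch. It assumes the three field names mentioned above (url, title, content), and the file naming scheme and function names are my own invention. It returns the file names and contents rather than writing to disk, so you can check the result first.

```python
import csv
import io
import re


def slug_from_url(url):
    """Turn a url into a safe file name stem (hypothetical naming scheme)."""
    stem = re.sub(r"^https?://", "", url)
    stem = re.sub(r"[^A-Za-z0-9]+", "-", stem).strip("-")
    return stem or "index"


def split_csv_to_pages(csv_text):
    """Map a file name to the markdown for each row of the csv export.

    Assumes the columns are named url, title and content.
    """
    pages = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = slug_from_url(row["url"]) + ".md"
        # Put the page title at the top as a heading, then the collected markdown.
        pages[name] = "# " + row["title"] + "\n\n" + row["content"] + "\n"
    return pages
```

Writing the files out is then a case of looping over the returned dictionary and saving each value under its key.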


It occurred to me after writing the article above that collecting images is a closely-related task. WebScraper can download and save all images while performing its scan. Here are the options; you'll need to switch to the 'complex setup' to access them:

If you're unsure about any of this, or if you can't immediately see how to do what you need to do, please do get in touch. It's really useful to know what jobs people are using our software for (confidential of course). Even if it's not possible, we may have suggestions. We have other apps that can do similar but different things, and have even written custom apps for specific jobs.

Thursday, 19 April 2018

Option to download images in WebScraper

WebScraper for Mac now has the option to download and save images in a folder, as well as, or instead of, extracting data from the pages.

WebScraper crawls a whole website and it can extract information based on a class or id, or dump the content as html, plain text or markdown.

By just checking that box and choosing a download location, it can also dump all images to a local folder. That includes those found linked (the target of a regular hyperlink), those in the src of an <img>, and those found in a srcset.

If you only want particular images, you can select only those which match a partial url or regex.
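To make those three sources concrete, here is a small Python sketch that finds image urls in a page in the same three places and then applies an optional partial-url or regex filter. It is an illustration using the standard library, not how WebScraper itself is written. The list of file extensions and all the names are assumptions of mine, and the srcset handling is deliberately naive (it splits on commas, which would trip over a url that contains one).

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: which link targets count as images is judged by file extension.
IMAGE_EXT = re.compile(r"\.(?:png|jpe?g|gif|webp|svg)(?:\?.*)?$", re.IGNORECASE)


class ImageCollector(HTMLParser):
    """Collect image urls from hyperlinks, src attributes and srcset attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def _add(self, candidate):
        url = urljoin(self.base_url, candidate.strip())
        if url not in self.urls:  # keep first-seen order, skip duplicates
            self.urls.append(url)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # 1. The target of a regular hyperlink, if it looks like an image.
        if tag == "a" and attrs.get("href") and IMAGE_EXT.search(attrs["href"]):
            self._add(attrs["href"])
        if tag in ("img", "source"):
            # 2. The src attribute.
            if attrs.get("src"):
                self._add(attrs["src"])
            # 3. Each candidate in a srcset: "url descriptor, url descriptor".
            for entry in (attrs.get("srcset") or "").split(","):
                parts = entry.split()
                if parts:
                    self._add(parts[0])


def collect_images(html, base_url, pattern=None):
    """Return absolute image urls, optionally only those matching a regex."""
    parser = ImageCollector(base_url)
    parser.feed(html)
    if pattern is None:
        return parser.urls
    return [u for u in parser.urls if re.search(pattern, u)]
```

Passing pattern="thumb", for example, keeps only urls containing "thumb". A partial url works as a pattern as long as it has no special regex characters in it.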

This functionality is in version 4.1.0 which is available now.

Monday, 16 April 2018

Webscraper version 4 released

Webscraper from PeacockMedia makes it quick and easy to extract data from a website.

It has some 'simple setup' options if you simply want to archive content or collect certain meta data. For more complex tasks you can add columns to your output file which extract data based on a class or id, or even by regular expression (regex). (A tutorial is here).

Version 4 is now released. It has been updated to use the Integrity v8 engine, which brings some efficiency improvements and allows better control if you need to limit the requests that WS makes to your server:

The app has a very modest price tag of $10 and you can use it for free to test whether it works for you.

Scrutiny version 8 released

Some serious 'under the hood' improvements have been made in Scrutiny version 8. Benefits include more efficient scanning of your website and more data collected.

We've invested a lot of time in something that isn't a killer new feature, but will make things run more smoothly, efficiently and reliably, and give you more information.

There have been a number of requests to add more information about the properties of a link, eg hreflang and more of the 'rel' attributes. HTML5 allows lots of values in 'rel', and if the one you want doesn't have its own yes/no column in the new apps then you can view the entire rel attribute. You can switch these columns on in any of the link tables as well as the one in the link inspector.

The information displayed in Scrutiny's SEO table will include more data about the pages, notably og: and twitter card meta data which is important to a lot of users.

Previously, redirect information wasn't stored in full, only the start and end of a redirect chain. The majority are a single step, but if there is more than one redirect, you'll want to know all the in-between details.

It's becoming increasingly necessary to apply some limits on the rate of the requests that Integrity / Scrutiny make.

Your server may keep responding no matter how hard you hit it. That's great; turn up those threads and watch Integrity / Scrutiny go through your site like a dose of salts.

If you do need to limit the scan, you'll no longer have to turn down the threads and use trial and error with the delay field. You'll be able to simply enter a maximum number of requests per minute, while still using multiple threads for efficiency.
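The idea of a per-minute cap combined with multiple threads can be sketched in a few lines of Python. This is not the engine's actual implementation, just one simple way to do it: every thread asks a shared limiter for the next start slot, and slots are handed out evenly spaced (60 / maximum seconds apart). The class and function names, and the fetch callable, are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hand out evenly spaced start slots, 60 / max_per_minute seconds apart."""

    def __init__(self, max_per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / max_per_minute
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def wait(self):
        """Block until this caller's slot arrives, then return the slot time."""
        with self.lock:  # only the bookkeeping is serialised, not the request
            now = self.clock()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)
        return slot


def crawl(urls, fetch, max_per_minute, threads=4):
    """Fetch every url with several threads, sharing one rate limit."""
    limiter = RateLimiter(max_per_minute)

    def worker(url):
        limiter.wait()
        return fetch(url)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(worker, urls))
```

Because the sleeping happens outside the lock, slow responses don't hold up the other threads. The limit only controls how often a new request may start.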

Saturday, 14 April 2018

Scraping details from a web database using WebScraper

In the example below, you can see that the important data are not identified by a *unique* class or id.  This is a common scenario, so I thought I'd write this post to explain how to crawl such a site and collect this information. This tutorial uses WebScraper from PeacockMedia.
1. As there's no unique class or id identifying the data we want*, we must use a regular expression (regex).

Toggle to 'Complex setup' if you're not set to that already.

Add a column to the output file and choose Regular expression. Then use the helper button (unless you're a regex aficionado and have already written the expression). You can search the source for the part you want and test your expression as you write it. (see below). Actually writing the expression is outside the scope of this article, but one important thing to say is that if there is no capturing in your expression (round brackets) then the result of the whole expression will be included in your output. If there is capturing (as in my example below) then only the captured data is output.

2. Paste the expression into the 'add column' dialog and OK. Here's what the setup looks like after adding that first column.
3. This is a quick test; I've limited the scan to 100 links simply to test it. You can see this first field correctly going into the column. At this point I've also added another regex column for the film title, because the page title is useless here (it's the same for every detail page).
4. Add further columns for the remaining information as per step 1.

5. If you limited the links while you were testing as I did, remember to change that back to a suitably high limit (the default is 200,000 I believe) and set your starting url to the home page of the site or something suitable for crawling all of the pages you're interested in.
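The point in step 1 about capturing (round brackets) is easier to see with a tiny example. Here it is in Python, whose regex behaviour is close enough for illustration. The function name and the snippet of markup in the usage note are invented, and the real app may differ in detail.

```python
import re


def column_value(pattern, html):
    """Return what a regex column would output for one page.

    If the expression has a capturing group, only the captured text is
    returned. If it has none, the whole matched text is returned.
    """
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return ""
    return match.group(1) if match.re.groups else match.group(0)
```

Given a page containing <td class="label">Director</td><td class="value">Jane Doe</td>, the expression Director</td><td class="value">.+?</td> has no brackets and so returns the whole matched chunk, tags and all. With (.+?) in place of .+? it returns just Jane Doe.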

As always, please contact support if you need more help or would like to commission us to do the job.

* As I write this, WebScraper checks for class, id and itemprop inside <div>, <span>, <dd> and <p> tags. This is likely to be expanded in future.