PeacockMedia: HTMLtoMD and Webscraper - a comparison converting & archiving a site as MarkDown

Friday, 27 April 2018

HTMLtoMD and Webscraper - a comparison converting & archiving a site as MarkDown

I noticed a few people visiting the website as a result of searching for "webpage to markdown" and similar.

Our app HTMLtoMD was designed to do exactly that, and is free. But it was experimental and hasn't received very much development in a long time.

It still does its job, but our app Webscraper can also do that job, with many more options, and is a much more active product, it's selling and is under regular development now.

I thought it was worth writing a quick article to compare the output and features of the two apps.

First HTMLtoMD. It's designed for one job and so is simple to use. Just type or paste the address of your page or homepage into the address bar and press Go. If you choose the single page option, you'll immediately see the page as markdown, and can copy that markdown to paste somewhere else.

If you don't choose the single page option, the app will scan your site, and then offer to save the pages as markdown.

Here's the big difference with WebScraper. HTMLtoMD will save each page separately as a separate markdown file:

So here's how to do the same thing with Webscraper. It has many more options (which I won't go into here.) The advanced ones tend not to be visible until you need them. So here's WebScraper when it opens.

For this task, simply leave everything at defaults, enter your website name and make sure you choose "collect content as markdown" from the 'simple setup' as shown above. Then press Go.

For this first example I left the output file format set to csv. When the scan has run, the Results table shows how the csv file will look. In this example we have three field names; url, title, content - note that the 'complex setup' allows you to choose more fields, for example you might like to collect the meta description too.

You may have noticed in one of the screenshots above that the options for output include csv, json and txt. Unlike HTMLtoMD, Webscraper always saves everything as a single file. The csv output is shown above. The text file is also worth looking at here. Although it is a single file, it is structured. I've highlighted the field headings in the screenshot below. As I mentioned above, you can set things up so that more information (eg meta description) is captured.

[Update]

It occurred to me after writing the article above, that collecting images is a closely-related task. WebScraper can download and save all images while performing its scan. Here are the options, you'll need to switch to the 'complex setup' to access these options:

If you're unsure about any of this, or if you can't immediately see how to do what you need to do, please do get in touch. It's really useful to know what jobs people are using our software for (confidential of course). Even if it's not possible, we may have suggestions. We have other apps that can do similar but different things, and have even written custom apps for specific jobs.

Friday, 27 April 2018

HTMLtoMD and Webscraper - a comparison converting & archiving a site as MarkDown

No comments:

Post a Comment