Monday 31 October 2016

WebScraper from PeacockMedia - usage

[updated 23 Apr 2018 for version 4]
[reviewed 29 Aug 2021]

I've had one or two questions about using WebScraper. There's a short demo video here, but if, like me, you prefer to cast your eye over some text and images rather than sit through a video, then here you go:

1. Type your website address (or the starting URL for the scan). As with Integrity and Scrutiny (WebScraper uses the same engine), the crawl will be limited to any 'directory' implied in the URL.
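To make the 'directory' idea a bit more concrete, here's a rough Python sketch of how a starting URL might imply a crawl scope. This is only my illustration, not the engine's actual code: the function names `implied_directory` and `in_scope` are made up, and the rule (treat the last path segment as a page unless the path ends in a slash) is an assumption.

```python
from urllib.parse import urlsplit
import posixpath


def implied_directory(start_url: str) -> str:
    """Return the URL prefix a crawl would be limited to (assumed rule)."""
    parts = urlsplit(start_url)
    path = parts.path or "/"
    if not path.endswith("/"):
        # Assumption: the last segment is a page, so scope to its parent directory.
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return f"{parts.scheme}://{parts.netloc}{path}"


def in_scope(start_url: str, candidate_url: str) -> bool:
    """True if candidate_url sits inside the directory implied by start_url."""
    return candidate_url.startswith(implied_directory(start_url))
```

Under these assumptions, starting at https://example.com/blog/index.html would keep https://example.com/blog/2016/post.html in the crawl and leave https://example.com/about.html out of it.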

2. Configure your output. If it's a single piece of information you want to extract from each page, you can use the Simple Setup. If you want to set up a number of columns, use the Complex Setup. Toggle between these two options below the address bar.



You must configure your output file before scanning, and then the app crawls your site, collecting the data as it goes. This is more efficient than the approach used by the first version of WebScraper, but it does mean that if you want to change the configuration of your output file, you'll need to re-scan.

If you choose 'Complex Setup', you'll need to configure your output file here. When you add a column, you can choose basic metadata (title, description, etc.), a class or id, a regular expression (regex), or content (as plain text, HTML, Markdown, or an outline).
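If it helps to picture what a two-column setup produces for each page, here's a small Python sketch using only the standard library. It is an illustration under my own assumptions, not how WebScraper is implemented: `RowExtractor` and `extract_row` are hypothetical names, one column is the page title and the other is the text of the first element carrying a given class.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class RowExtractor(HTMLParser):
    """Collect the <title> text and the text of the first element with a class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.title = ""
        self.class_text = ""
        self._in_title = False
        self._depth = 0      # greater than 0 while inside the matched element
        self._done = False   # only the first matching element is captured

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "title":
            self._in_title = True
        if self._depth:
            self._depth += 1
        elif not self._done:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self._depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag == "title":
            self._in_title = False
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._depth:
            self.class_text += data


def extract_row(html, css_class):
    """Return one output row: the page title plus the text for css_class."""
    parser = RowExtractor(css_class)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the cell reads as plain text.
    return {"title": parser.title.strip(),
            css_class: " ".join(parser.class_text.split())}
```

Calling `extract_row(page_html, "price")` on each crawled page would give one row per page, which is roughly the shape of the results table. The sketch assumes reasonably well-formed markup, so unclosed tags inside the matched element would throw off the depth count.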



3. Test or run. You'll be able to either begin the scan, or run a short test. The 'Run test' button will perform a very short scan of a few pages and present your output as a quickview. If all looks well, you can press Go, or if you need to make changes, you can head back to the Output file configuration.

4. When the scan is complete, the Results tab will open. You can export the results using the export button above the table. The export uses the options you set in the 'Output file format' tab.

Note that the Save Project option in the File menu will only save your setup, not the scan data.



A common scenario is that the data you want isn't defined by a unique class or id. In these cases you can use a regular expression instead. There's a detailed tutorial here.
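As a quick taste of the regex approach, here's a minimal Python sketch. The HTML snippet, the pattern and the helper name `extract_first` are all made up for illustration, and Python's `re` syntax may differ in places from the regex flavour the app uses, so treat it as a sketch and not something to paste in as-is. The idea is to anchor on a nearby label ("Price:") when the value itself has no class or id to hook onto.

```python
import re

# Hypothetical pattern: find the label, skip any tags that follow it,
# then capture the number after the dollar sign.
PRICE_PATTERN = re.compile(r"Price:\s*(?:<[^>]+>\s*)*\$(\d+\.\d{2})")


def extract_first(pattern, html):
    """Return the first captured group, or an empty string if nothing matches."""
    match = pattern.search(html)
    return match.group(1) if match else ""


sample = "<tr><td>Price:</td><td><span>$24.50</span></td></tr>"
print(extract_first(PRICE_PATTERN, sample))  # prints 24.50
```

The part in brackets is the bit that would end up in your output column. A page without the label simply gives an empty cell.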
