Thursday 9 September 2021

429 status codes when crawling sites


I've had a few conversations with friends about Maplin recently. I have very good memories of the Maplin catalogue way back when they sold electronic components. The catalogue grew bigger each year and featured great spaceship artwork on the cover. They opened high street shops, started selling toys and then closed their shops.

The challenge with crawling the Maplin site was that the crawl would finish early after receiving a batch of 429 status codes.

This code means "too many requests". The server would respond normally for a while before deciding not to co-operate any more. When this happens, it's usually solved by throttling the crawler: limiting its number of threads, or imposing a limit on the number of requests per minute.

With Maplin I went back to a single thread and just 50 requests per minute (fewer than one per second), and even at this pedestrian speed the behaviour was the same. So my guess is that the server is set to allow a certain number of requests from a given IP address within a certain time. It didn't block my IP, so after a break it would respond again.
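To make the "requests per minute" idea concrete, here is a minimal Python sketch of a single-threaded throttle. It is only an illustration, not how Integrity or Scrutiny implement their limit; the `Throttle` class and its `clock` and `sleep` parameters are names I've made up for the example (the last two exist so the timing can be tested without really waiting).

```python
import time


class Throttle:
    """Space calls to wait() at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute  # e.g. 50 per minute -> 1.2 s
        self.clock = clock                 # hypothetical hook: time source
        self.sleep = sleep                 # hypothetical hook: how to wait
        self.next_allowed = None           # earliest time of the next request

    def wait(self):
        """Block until the next request is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

A crawler loop would call `throttle.wait()` immediately before each request. As described above, a site that counts requests per IP over a longer window can still answer with 429 even when the requests are spaced out like this.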

I managed to get through the site using a technique which is a bit of a hack but works. It's the "Pause and Continue" technique. When you start to receive errors, pausing and waiting for a while lets you continue and make a fresh start with the server. A useful feature of Integrity and Scrutiny's engine is that on Continue, it doesn't just carry on from where it left off. It starts at the top of its list, ignores the good statuses and re-checks any bad links. This leads to the fun spectacle of the number of bad links counting backwards!
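For anyone wanting to try the same idea in their own script, the pause-and-continue behaviour can be sketched roughly as follows. This is my own simplified Python illustration, not the actual engine code: the function names, the `fetch` callable (anything that takes a URL and returns an integer status code), the ten-minute pause and the "give up this pass after a few 429s in a row" rule are all assumptions.

```python
import time


def is_good(status):
    """Treat 2xx and 3xx as good; None means not yet checked."""
    return status is not None and 200 <= status < 400


def run_pass(urls, statuses, fetch, max_429_in_a_row=5):
    """One pass from the top of the list, skipping links already known good."""
    streak = 0
    for url in urls:
        if is_good(statuses.get(url)):
            continue  # good status from an earlier pass: leave it alone
        status = fetch(url)
        statuses[url] = status
        if status == 429:
            streak += 1
            if streak >= max_429_in_a_row:
                return  # server has stopped co-operating: end this pass
        else:
            streak = 0


def crawl_with_pauses(urls, fetch, pause_seconds=600, max_passes=10,
                      max_429_in_a_row=5, sleep=time.sleep):
    """Repeat passes, pausing between them, until no 429s or gaps remain."""
    statuses = {}
    for _ in range(max_passes):
        run_pass(urls, statuses, fetch, max_429_in_a_row)
        pending = [u for u in urls if statuses.get(u) in (None, 429)]
        if not pending:
            break
        sleep(pause_seconds)  # the "pause" before the next "continue"
    return statuses
```

Because each pass re-checks only the links that are not yet good, the count of bad links can only fall from one pass to the next, and what is left at the end (404s and the like) is the set of genuinely broken links.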


On finish, there seem to be around 50 genuinely broken links. Easily fixed once found.

