
Get website screenshots with custom web crawler

Introduction

Clicking through an entire website is mind-numbing work and certainly not any developer's favourite task. Nevertheless, it's an important one, and there are many reasons to do it:

  • You want to familiarize yourself with a web system – you don't even know which pages are accessible to users
  • You want to refactor your system – you see dozens of views while users can access only some of them, and all the unused ones should be deleted
  • You want to write UI tests for your system
  • You want to see how different things look in different browsers
  • (in my case) You need to see all styling issues

Fortunately, every no-brainer can be replaced with some software or script. In this case it is a web crawler – a robot that clicks through the pages, can build a sitemap with screenshots of each page, and in general can do all the things a user can.

Unfortunately, while googling for such a robot I couldn't find any valuable freeware. Some programs could take screenshots of a given set of links but couldn't do anything more. On the other hand, I didn't want to set up a whole UI testing IDE. Because of my past experience with Selenium .NET, I tried to use it in an easy way (which means: PowerShell).

What is Selenium

Selenium is a powerful UI testing framework for web applications. It provides record/playback testing (accessible directly from the browser, for example as a Firefox plugin). It's also a library with bindings for many platforms such as Java, Ruby and .NET. While it is usually used to build full UI testing frameworks, it can also be consumed directly from a PowerShell script.
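
To give you a feel for how direct this is, here is a minimal smoke-test sketch (the DLL path and the URL are placeholders; depending on your Selenium and browser versions, a separate driver executable may also be required):

#Load the Selenium .NET bindings and drive a browser by hand
Add-Type -Path "C:\WebCrawler\WebDriver.dll"

$driver = New-Object OpenQA.Selenium.Firefox.FirefoxDriver
$driver.Navigate().GoToUrl("http://www.example.com/")
$driver.Title    #prints the page title, proving the bindings work
$driver.Quit()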

Let’s see the code – implementation

The whole algorithm of our web crawler is rather straightforward:

  • Load page
  • Find links (anchor tags) in the loaded page
  • Remember unvisited links so you can navigate to them later
  • Take screenshot
  • Go to the first unvisited link and repeat until all links have been visited

The algorithm above can be easily implemented with the following PowerShell snippet:

$RelativeImagePath = "C:\WebCrawler\Screencaptures\"
$SeleniumDriverPath = "C:\WebCrawler\WebDriver.dll"
$websiteUrl = "http://www.page-you-want-to-crawl.com/"

#... some helper methods...

#Initialize Selenium driver
Add-Type -path $SeleniumDriverPath
$linksToCrawl = @($websiteUrl)

#Put some script here if you want to prepare the page before taking a screenshot
$customJavascript = '$(".someclass").each(function(i, el){ $(el).expand();})'

#Here you can load the browser of your choice
$driver = New-Object OpenQA.Selenium.IE.InternetExplorerDriver

#Visit each link in turn; $linksToCrawl grows as new links are discovered
$linkid = 0
while($linkid -lt $linksToCrawl.Length)
{
    $link = $linksToCrawl[$linkid]
    $driver.Url = $link

    $urls = Find-Urls -driver $driver -element $driver
    $linksToCrawl = MergeLinks -src $linksToCrawl -dst $urls

    $driver.ExecuteScript($customJavascript, $null)

    Take-ScreenShotAndSaveUrl -driver $driver -name $link

    $linkid++
}
$driver.Quit()

The implementations of Find-Urls, MergeLinks and Take-ScreenShotAndSaveUrl have been omitted to keep the algorithm readable (the full crawler is available in the Download section of this post).
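
For the curious, here is a rough sketch of what those helpers could look like. The names and signatures match the snippet above, but the bodies are my own reconstruction and may differ from the ones in the downloadable zip:

#Needed for the ImageFormat enum used when saving screenshots
Add-Type -AssemblyName System.Drawing

#Collect the href of every anchor tag under the given element
function Find-Urls($driver, $element) {
    $urls = @()
    foreach ($anchor in $element.FindElements([OpenQA.Selenium.By]::TagName("a"))) {
        $href = $anchor.GetAttribute("href")
        if ($href) { $urls += $href }
    }
    return ,$urls
}

#Append only new links that stay within the crawled site
function MergeLinks($src, $dst) {
    $merged = @($src)
    foreach ($url in $dst) {
        if ($url.StartsWith($websiteUrl) -and ($merged -notcontains $url)) {
            $merged += $url
        }
    }
    return ,$merged
}

#Save a full-page screenshot under a file name derived from the URL
function Take-ScreenShotAndSaveUrl($driver, $name) {
    $fileName = ($name -replace '[\\/:*?"<>|]', '_') + ".png"
    $screenshot = ([OpenQA.Selenium.ITakesScreenshot]$driver).GetScreenshot()
    #Selenium 2.x expects System.Drawing.Imaging.ImageFormat here;
    #newer versions take OpenQA.Selenium.ScreenshotImageFormat instead
    $screenshot.SaveAsFile((Join-Path $RelativeImagePath $fileName), [System.Drawing.Imaging.ImageFormat]::Png)
}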

Why use Selenium

A smart reader might ask: “Why use Selenium at all? In PowerShell we can use the InternetExplorer.Application COM object.”


Of course, that's a valid argument. We don't even need any external library, because this feature is available out of the box on Windows. Unfortunately, it's still quite low-level. You can take control of your Internet Explorer, but:

  • The implementation for other browsers may be different
  • You have to implement your own system of “sleeps” and “waits” to check whether IE has finished downloading/rendering the page (see the sketch below)
  • With Selenium you get better access to elements – you can query them by XPath or CSS selector (like in jQuery)
  • Selenium can easily take a screenshot of the whole page – not just the visible part of the screen or the active window, which cannot be larger than the screen

It was this last argument that made me settle on Selenium .NET as the final solution.
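
For comparison, here is roughly what the COM approach looks like, including the hand-rolled wait loop (the URL is a placeholder; ReadyState 4 means READYSTATE_COMPLETE):

$ie = New-Object -ComObject InternetExplorer.Application
$ie.Visible = $true
$ie.Navigate("http://www.page-you-want-to-crawl.com/")

#No built-in waiting – you have to poll until IE reports it is done
while ($ie.Busy -or $ie.ReadyState -ne 4) {
    Start-Sleep -Milliseconds 200
}

$ie.Document.title
$ie.Quit()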

I was inspired to use Selenium for this purpose by a great post by Jaykul: http://huddledmasses.org/did-you-know-powershell-can-use-selenium/
You can also find some tips there if you run into problems with IE protected mode.

Download

You can download the full code snippet here: WebCrawler.zip
