Clicking through a whole website is a mindless chore and certainly nobody's favourite task. Nevertheless, it is an important one, and I can name many situations where it is needed:
- You want to familiarize yourself with a web system – you do not even know which pages are accessible to users
- You want to refactor your system – you see dozens of views while users can access only some of them, and everything unused should be deleted
- You want to write UI tests for your system
- You want to see how different things look in different browsers
- (in my case) You need to see all styling issues
Fortunately, every mindless chore can be automated with software or a script. In this case it is a web crawler – a robot that clicks through the pages, can build a sitemap with screenshots, and in general can do everything a user can.
Unfortunately, while googling for such a robot I couldn't find any valuable freeware. Some programs could take screenshots of a given set of links, but nothing more. On the other hand, I didn't want to set up a whole UI testing IDE. Because of my past experience with Selenium .NET, I decided to use it in a simple way – which means PowerShell.
What is Selenium
Selenium is a powerful UI testing framework for web applications. It provides record/playback testing (accessible directly from the browser, for example via the Firefox plugin). It is also a library with bindings for many platforms, such as Java, Ruby, or .NET. While it is usually used to build full UI testing frameworks, it can also be consumed directly from a PowerShell script.
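To give a taste of that, here is a minimal sketch of driving Selenium .NET from PowerShell. The DLL path is an assumption – point it at wherever you unpacked the Selenium .NET client package:

```powershell
# Load the Selenium .NET client assembly (the path is a placeholder --
# adjust it to your local copy of the selenium-dotnet package)
Add-Type -Path "C:\Selenium\WebDriver.dll"

# Start a browser session -- Firefox here, but any supported driver works
$driver = New-Object OpenQA.Selenium.Firefox.FirefoxDriver

# Navigate and read the page title
$driver.Navigate().GoToUrl("http://example.com/")
$driver.Title

# Always dispose of the browser when done
$driver.Quit()
```

The same object model as in the Java or C# bindings is available – PowerShell just consumes the .NET assembly directly.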
Let’s see the code – implementation
The whole algorithm of our web crawler is rather straightforward:
- Load page
- Find links (anchor tags) in the loaded page
- Remember the unchecked links so you can navigate to them later
- Take screenshot
- Go to the first unchecked link and repeat until all links are checked
The algorithm above can be easily implemented with a short PowerShell script. The implementations of Find-Urls and Take-ScreenShotAndSaveUrl are omitted here to keep the algorithm readable (the full crawler is available in the Download section of this post).
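A sketch of that main loop might look like the following. Treat the signatures of Find-Urls and Take-ScreenShotAndSaveUrl as assumptions rather than the exact code from the download, and adjust the assembly path to your machine:

```powershell
Add-Type -Path "C:\Selenium\WebDriver.dll"   # placeholder path
$driver = New-Object OpenQA.Selenium.Firefox.FirefoxDriver

$startUrl = "http://example.com/"
$toVisit  = New-Object System.Collections.Queue
$visited  = New-Object System.Collections.Generic.HashSet[string]
$toVisit.Enqueue($startUrl)

while ($toVisit.Count -gt 0) {
    $url = $toVisit.Dequeue()
    if (-not $visited.Add($url)) { continue }   # already checked

    # 1. Load the page
    $driver.Navigate().GoToUrl($url)

    # 2. Take a screenshot of the whole page (helper assumed, see download)
    Take-ScreenShotAndSaveUrl $driver $url

    # 3. Find anchor tags and remember the links we have not seen yet
    foreach ($link in (Find-Urls $driver)) {
        if (-not $visited.Contains($link)) { $toVisit.Enqueue($link) }
    }
}

$driver.Quit()
```

The queue plus visited-set combination guarantees every discovered link is loaded exactly once, so the crawl terminates once all reachable pages have been screenshotted.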
Why use Selenium
A clever reader might ask: "Why use Selenium at all? In PowerShell we can use the InternetExplorer.Application COM object."
Of course, that is a valid point. It does not even require an external library, because the COM object is available out of the box on Windows. Unfortunately, it is still low-level. You can take control of your Internet Explorer, but:
- The implementation for other browsers may be different
- You have to implement your own system of "sleeps" and "waits" to check whether IE has finished downloading/rendering the page
- Selenium gives you better access to elements – it allows you to query them by XPath or CSS selector (like in jQuery)
- Selenium can easily take a screenshot of the whole page – not just the visible part of the screen or the active window, which cannot be larger than the screen
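The last two points fit in a few lines of PowerShell. Again, the assembly path is a placeholder, and note that in the Selenium .NET versions of that era Screenshot.SaveAsFile took a System.Drawing ImageFormat value:

```powershell
Add-Type -Path "C:\Selenium\WebDriver.dll"   # placeholder path
$driver = New-Object OpenQA.Selenium.Firefox.FirefoxDriver
$driver.Navigate().GoToUrl("http://example.com/")

# Query elements jQuery-style, by CSS selector or by XPath
$links    = $driver.FindElements([OpenQA.Selenium.By]::CssSelector("a[href]"))
$headings = $driver.FindElements([OpenQA.Selenium.By]::XPath("//h1"))

# Capture the whole rendered page, not just the visible viewport
$screenshot = ([OpenQA.Selenium.ITakesScreenshot]$driver).GetScreenshot()
$screenshot.SaveAsFile("C:\screens\example.png",
    [System.Drawing.Imaging.ImageFormat]::Png)

$driver.Quit()
```

Try doing either of these through the raw COM object and the difference in effort becomes obvious.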
And it was this last argument that made me pick Selenium .NET as the final solution.
I was inspired to use Selenium for this purpose after reading great post of Jaykul: http://huddledmasses.org/did-you-know-powershell-can-use-selenium/
There you can also find some tips in case you run into problems with IE protected mode.
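For the record, one workaround for the protected-mode error (besides the recommended fix of setting every security zone to the same protected-mode value) is the aptly named driver option below. This is a sketch with a placeholder assembly path, and the option trades the error for possible instability, as its name warns:

```powershell
Add-Type -Path "C:\Selenium\WebDriver.dll"   # placeholder path

$options = New-Object OpenQA.Selenium.IE.InternetExplorerOptions
# Suppresses the protected-mode settings check, at the cost of stability
$options.IntroduceInstabilityByIgnoringProtectedModeSettings = $true

$driver = New-Object OpenQA.Selenium.IE.InternetExplorerDriver($options)
```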
You can download full code snippet from here: WebCrawler.zip