Programming (back-end)

Get website screenshots with a custom web crawler

Introduction

Clicking through a whole website is a no-brainer and certainly not any developer’s favourite task. Nevertheless, it is an important one, and I can name many uses for it:

  • You want to familiarize yourself with a web system: you do not even know which pages are accessible to users
  • You want to refactor your system: you see dozens of views while users can access only some of them, and all the unused ones should be deleted
  • You want to write UI tests for your system
  • You want to see how things look in different browsers
  • (in my case) You need to see all styling issues

Fortunately, every no-brainer can be replaced with some software or script. In this example it is a web crawler: a robot that clicks through the site, can build a sitemap with screenshots of the pages and, in general, can do everything a user can.

Unfortunately, while googling for such a robot I couldn’t find any worthwhile freeware. Some programs could take screenshots of a given set of links but couldn’t do anything more. On the other hand, I didn’t want to set up a whole UI testing IDE. Because of my past experience with Selenium .NET, I tried to use it in an easy way (which means PowerShell).

What is Selenium

Selenium is a powerful UI testing framework for web applications. It provides record/playback testing (accessible directly from the browser, for example through the Firefox plugin). It’s also a library that can be used from many platforms such as Java, Ruby or .NET. While it is typically used to build powerful UI test suites, it can also be consumed directly from a PowerShell script.

Let’s see the code – implementation

The whole algorithm of our web crawler is rather straightforward:

  • Load page
  • Find links (anchor tags) in the loaded page
  • Remember unchecked links so they can be visited later
  • Take screenshot
  • Go to the first unchecked link and repeat until all the links are checked

The algorithm above can be easily implemented with the following PowerShell snippet:

$RelativeImagePath = "C:\WebCrawler\Screencaptures\"
$SeleniumDriverPath = "C:\WebCrawler\WebDriver.dll"
$websiteUrl = "http://www.page-you-want-to-crawl.com/"

#... some helper methods...

#Initialize Selenium driver
Add-Type -path $SeleniumDriverPath
$linksToCrawl = @($websiteUrl)

#put some script if you want to prepare the page before taking screenshot
$customJavascript = '$(".someclass").each(function(i, el){ $(el).expand();})'

#Here you can load your browser
$driver = New-Object OpenQA.Selenium.IE.InternetExplorerDriver 

$linkid = 0
while($linkid -lt $linksToCrawl.Length)
{
    $link = $linksToCrawl[$linkid]
    $driver.Url = $link

    $urls = Find-Urls -driver $driver -element $driver
    $linksToCrawl = MergeLinks -src $linksToCrawl -dst $urls

    $driver.ExecuteScript($customJavascript, $null)

    Take-ScreenShotAndSaveUrl -driver $driver -name $link

    $linkid++
}
$driver.Quit()

The implementations of Find-Urls and Take-ScreenShotAndSaveUrl are omitted to keep the algorithm readable (the full crawler is available in the Download section of this post).
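The loop relies on a MergeLinks helper that is not shown in the snippet. As a rough illustration of what it has to do, here is a minimal Python sketch of the same queue logic. It is not the code from the download: `merge_links`, `crawl` and the in-memory `SITE` dictionary are hypothetical names, and the `find_urls` and `visit` callbacks stand in for the browser calls.

```python
def merge_links(known, found):
    """Return `known` plus every link from `found` not seen before, in order."""
    seen = set(known)
    merged = list(known)
    for link in found:
        if link not in seen:
            seen.add(link)
            merged.append(link)
    return merged


def crawl(start_url, find_urls, visit):
    """Visit every page reachable from start_url exactly once."""
    links_to_crawl = [start_url]
    link_id = 0
    while link_id < len(links_to_crawl):
        link = links_to_crawl[link_id]
        # Newly discovered links are appended, so the loop picks them up later.
        links_to_crawl = merge_links(links_to_crawl, find_urls(link))
        visit(link)  # in the real crawler: take the screenshot
        link_id += 1
    return links_to_crawl


# Hypothetical in-memory "site" standing in for a real browser session.
SITE = {
    "/": ["/about", "/products"],
    "/about": ["/", "/contact"],
    "/products": ["/about"],
    "/contact": [],
}

visited = []
order = crawl("/", lambda url: SITE.get(url, []), visited.append)
print(order)
```

Because already known links are never appended again, pages that link back to each other do not cause an endless loop. A real crawler would probably also want to drop links that point outside the crawled site.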

Why use Selenium

Some smart guy may ask: “Why use Selenium? In PowerShell we can use the InternetExplorer.Application COM object”.

Of course it’s a valid argument. We would not even need any external library, because this feature is available out of the box on Windows. Unfortunately, it’s still a low-level approach. You can take control of your Internet Explorer, but:

  • The implementation for other browsers may be different
  • You have to implement your own system of “sleeps” and “waits” that checks whether IE has finished downloading/rendering the page

Selenium, in contrast, offers:

  • Better access to elements: Selenium allows you to query elements by XPath or CSS selector (like in jQuery)
  • Easy screenshots of the whole page, not just the visible part of the screen or the active window, which cannot be taller than the screen

It’s the last argument that made me choose Selenium .NET as the final solution.
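To give an idea of what a hand-made system of “sleeps” and “waits” involves, here is a minimal polling helper sketched in Python. It is only an illustration: `wait_until` and `FakeBrowser` are hypothetical names, and the fake browser stands in for whatever readiness check the real browser object exposes.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll condition() until it is truthy or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeBrowser:
    """Hypothetical browser that reports ready after a few polls."""

    def __init__(self, polls_until_ready):
        self.remaining = polls_until_ready

    def is_ready(self):
        self.remaining -= 1
        return self.remaining <= 0


browser = FakeBrowser(polls_until_ready=3)
loaded = wait_until(browser.is_ready, timeout=2.0, interval=0.01)
timed_out = wait_until(lambda: False, timeout=0.05, interval=0.01)
print(loaded, timed_out)
```

Every navigation would need a call like this, plus sensible timeouts and error handling, which is the kind of plumbing a library such as Selenium takes off your hands.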

I was inspired to use Selenium for this purpose after reading a great post by Jaykul: http://huddledmasses.org/did-you-know-powershell-can-use-selenium/
You can also find some tips there in case you have problems with IE protected mode.

Download

You can download the full code snippet from here: WebCrawler.zip


IronPython console in your Web application

It isn’t often that small things bring a lot of joy to your work on software. The subject of this post is certainly one of them.

While testing and supporting large applications you are sometimes forced to do things like checking some specific logic or preparing data. What’s more, if you deal with an environment deployed by a Continuous Integration process, or the build process on your local machine simply takes too long, you need to do some operations by hand (I mean with SQL etc.). So you can see that the most comfortable way of communicating with your software is through code and the API it provides. Unfortunately, you cannot (yet) write code on a running system. But there is one thing you can do to improve your work: you can create your own administrator’s terminal.

The module is created using IronPython, a .NET Python engine that can integrate with your application. Maybe it’s not your favourite C# (or VB etc.), but it’s still close to your application/domain logic and certainly more friendly than manipulating the underlying data with SQL. Python is a very programmer-friendly language. You need to get through the basic syntax, and after that you only use your API the way you know it from the source code.

You may think that creating such a terminal is hard work, but it’s not. To achieve the goal you need to implement a mechanism that takes a script from the user and returns the result of the script execution. In other words, you need to implement an interface like this:

public interface IScriptExecutor
{
    // Execute IronPython code
    string ExecuteScript(string script);

    // Bind .NET objects to IronPython variables
    void BindVariable(string key, object value);
}

With the IronPython library it is really simple. The complete source code of the script engine is presented below.

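using System;
using Microsoft.Scripting;          // SourceCodeKind
using Microsoft.Scripting.Hosting;  // ScriptEngine, ScriptScope

public class ScriptExecutor : IScriptExecutor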
{
     private readonly ScriptEngine engine;
     private readonly ScriptScope scope;
     public ScriptExecutor()
     {
         this.engine = IronPython.Hosting.Python.CreateEngine();
         this.scope = this.engine.CreateScope();
     }

     public void BindVariable(string key, object value)
     {
           this.scope.SetVariable(key, value);
     }

     public string ExecuteScript(string script)
     {
         try
         {
             var scriptResult = this.engine
                     .CreateScriptSourceFromString(script, SourceCodeKind.AutoDetect)
                     .Execute(this.scope);
             return scriptResult != null ? scriptResult.ToString() : null;
         }
         catch (Exception e)
         {
             return e.Message;
         }
     }
 }

To use it you need to download the IronPython framework mentioned above. It’s a fairly heavy download, considering that we use only three of its DLLs: IronPython, Microsoft.Dynamic and Microsoft.Scripting.

In the constructor we initialize the Python engine and a scope. BindVariable lets you bind .NET objects to variables in that scope if you need to. The most important method is of course ExecuteScript, where the trickiest part is obtaining feedback from the Python engine: the result of the script is converted to a string, and if the script throws, the exception message is returned instead.
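The same take-a-script, return-a-string idea can be tried without .NET at all. The sketch below is a plain CPython analogue built on the built-in `compile`, `eval` and `exec` functions; it does not use the IronPython hosting API, and the class and method names only mirror the C# ones for readability.

```python
class ScriptExecutor:
    """CPython analogue of the C# class: one shared scope, errors as text."""

    def __init__(self):
        self.scope = {}

    def bind_variable(self, key, value):
        self.scope[key] = value

    def execute_script(self, script):
        try:
            try:
                # A single expression has a value we can show to the user.
                code = compile(script, "<console>", "eval")
            except SyntaxError:
                # Statements (assignments, imports, loops) have no value.
                exec(compile(script, "<console>", "exec"), self.scope)
                return None
            result = eval(code, self.scope)
            return None if result is None else str(result)
        except Exception as error:
            # Like the C# version, report the failure instead of crashing.
            return str(error)


executor = ScriptExecutor()
executor.bind_variable("customers", ["Ann", "Bob", "Eve"])
count = executor.execute_script("len(customers)")
assigned = executor.execute_script("total = len(customers) * 2")
total = executor.execute_script("total")
failure = executor.execute_script("1 / 0")
print(count, assigned, total, failure)
```

Keeping one scope for the lifetime of the executor is what makes the console feel like a session: a variable assigned in one call is still there in the next. Executing arbitrary scripts is of course powerful, so a console like this should only be reachable by administrators.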

Let’s mix it together.

Now you can use the ScriptExecutor to create your administrator console. Because you use dependency injection, you probably have a static ServiceLocator (or something similar; if it’s not static, you can bind the appropriate container to a variable using the BindVariable() method). Thus you have access to every module in the application. Do you want to count the ‘Customer’ entities?

ServiceLocator.Resolve[ICustomerRepository]().GetAll().Count

Do you want to use your logic to create a new entity? Ok.

import clr
clr.AddReference("WebIronPythonConsole")
from WebIronPythonConsole.Domain import *
from WebIronPythonConsole.Infrastructure import *
repository = ServiceLocator.Resolve[IToDoItemsRepository]()

newItem = ToDoItem()
newItem.Name = 'Added by ironpython'
newItem.IsDone = False
repository.Add(newItem)

To get along better with the Python language you may need some tips:

  1. Whitespace is significant
  2. When you need to work with generic methods/classes, use [] instead of <>
  3. Lambda expressions are also called lambdas:

         C#: x => x.Contains("hello")

         Python: lambda x: x.Contains("hello")

  4. To use LINQ you need to import System.Linq from System.Core (see the code below)

import clr
clr.AddReference("System.Core")
clr.AddReference("WebIronPythonConsole")
from System.Linq import Enumerable
from WebIronPythonConsole.Domain import *
from WebIronPythonConsole.Infrastructure import *
repository = ServiceLocator.Resolve[IToDoItemsRepository]()
result = Enumerable.Where(repository, lambda x: x.Name.Contains("MyTask"))
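If you want to practise the lambda syntax before touching your own API, the same kind of query can be tried in any Python interpreter. The snippet below is a plain Python sketch with no .NET involved; the `ToDoItem` class here is a hypothetical stand-in for the entity from the example application.

```python
class ToDoItem:
    """Hypothetical stand-in for the ToDoItem entity."""

    def __init__(self, name, is_done=False):
        self.name = name
        self.is_done = is_done


items = [
    ToDoItem("MyTask 1"),
    ToDoItem("Shopping"),
    ToDoItem("MyTask 2", is_done=True),
]

# Same shape as Enumerable.Where(items, lambda x: ...), using the built-in filter.
matching = list(filter(lambda x: "MyTask" in x.name, items))

# A list comprehension is the more common Python spelling of such a query.
open_tasks = [x.name for x in items if "MyTask" in x.name and not x.is_done]

print([x.name for x in matching], open_tasks)
```

Note how the indentation alone marks where the class body and the constructor end, which is what tip 1 is about.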
    
    

Download

Feel free to play with an example application: WebIronPythonConsole.zip