Playing on a Grid

March 25th, 2009

Nick Rutherford

For my dissertation I'm working on a genetic algorithm solution to the guitar fingering problem. The project has just under two months left to run, so it's time to start wrapping it up.

Part of the process is getting results and evaluating them. I have running code now so it's time to start putting it through its paces.

The department provides students with access to a 'grid' computing setup, which is basically a batch processing system you access by ssh.

Unfortunately it's not a parallel architecture, but not having to play with threads is probably a good thing for me, as they aren't something I'm familiar with. The task I am addressing parallelises naturally: a favourite analogy of mine is natural selection occurring on a number of islands, with the fittest individuals occasionally migrating between them. Batch processing is sufficient to get similar effects, and I'll be toying with some ideas for that over the next week or two.
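One way the island idea could map onto batch jobs is to keep each island's population in its own file and shuffle the best individuals along between runs. This is only a sketch under made-up assumptions: the file layout (one "<fitness> <genome>" line per individual, higher fitness being better) and the function name migrate_ring are hypothetical, not something the fingar code does.

```shell
#!/bin/bash
# Hypothetical layout: one file per island, one individual per line as
# "<fitness> <genome>", where a higher fitness is better (an assumption).

# migrate_ring K FILE...: copy the K fittest individuals of each island
# to the next island in the list, wrapping round at the end.
migrate_ring() {
    local k=$1; shift
    local islands=("$@")
    local n=${#islands[@]}
    local tmpdir i next
    tmpdir=$(mktemp -d) || return 1

    # Pick the emigrants from every island first, so that new arrivals
    # are not passed straight on to the following island.
    for ((i = 0; i < n; i++)); do
        LC_ALL=C sort -k1,1nr "${islands[i]}" | head -n "$k" > "$tmpdir/$i"
    done

    # Append each island's emigrants to its neighbour.
    for ((i = 0; i < n; i++)); do
        next=$(( (i + 1) % n ))
        cat "$tmpdir/$i" >> "${islands[next]}"
    done

    rm -r "$tmpdir"
}

# Typical use between batch runs (file names are placeholders):
#   migrate_ring 10 island0.txt island1.txt island2.txt
```

Note that this copies individuals rather than moving them, so each island grows by K per migration; trimming the weakest back out would be left to the next run.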

I thought I'd share some scripts and terminal output since they are pretty short and it's not something most people come across.

Scripting your application

Getting things running is deceptively simple. Make a script to run your program; I'm hoping to use Ruby later but went with (Ba)sh for now.

bash-3.2$ cat ./run_fingar
cd fingar
java -Xmx2g nruth.fingar.Run pop=400000 gens=500

The JVM argument sets the maximum heap size to 2GB, rather than the default 128MB (though it should auto-configure itself to be higher because of something called Ergonomics that went into Java 5). Yes, I do need that much memory; it's that kind of application. The grid nodes have 3GB available so they can handle this, which is a bonus: it means I can increase population sizes greatly and see how they compare to smaller ones, and try various other settings.

I/O

This is trivial really: write an app that uses stdout and stderr, and when you run the job they will get piped into a specified log file. My logs just sit in the home directory for now. I will be changing that (with Ruby, along with the rest of the statistical analysis scripting) to name them by date, run parameters, git revision of the code, etc.
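As a rough idea of what such a naming scheme might look like, here is a small shell sketch (the plan above is to do this in Ruby eventually). The function name log_name, the "fingar_" prefix and the revision abc1234 are placeholders invented for illustration.

```shell
#!/bin/bash
# log_name STAMP REVISION PARAM...: build a log file name from a date stamp,
# a short git revision and the run parameters. The naming scheme is only an
# illustration, not one the project actually uses.
log_name() {
    local stamp=$1 rev=$2; shift 2
    local params
    params=$(printf '%s_' "$@")   # e.g. "pop=400000_gens=500_"
    printf 'fingar_%s_%s_%s.log\n' "$stamp" "$rev" "${params%_}"
}

# Typical use inside a git checkout (not run here):
#   log_name "$(date +%Y%m%d-%H%M%S)" "$(git rev-parse --short HEAD)" pop=400000 gens=500

# Fixed placeholder values, so the result is predictable:
log_name 20090325-120000 abc1234 pop=400000 gens=500
```

The resulting name could then be handed to the scheduler as the log file in place of a fixed one.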

Submitting your job

Once I ssh into the scheduler (the grid's distribution mechanism) I get presented with the usual home directory and terminal for my network account. There is a settings script to run which bolts on the grid commands I need, so I've made a wrapper that does this and then submits the above script to run my code. It's not ideal, as I have to load the settings again to use qstat and so on, but that's just terminal script hackery and I am sure there is a workaround (for example, I have another script that uses source to load the settings into bash without having to look up the path every time).

bash-3.2$ cat ./run_fingar_et
#!/bin/bash
. /data/sungrid/default/common/settings.sh
qsub -cwd -l qp=LOW -P Basic -j y -o et_log run_fingar
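The leading "." on the settings line matters: it reads the file into the current shell rather than running it as a separate process, and only the former leaves the grid commands available afterwards. A small demonstration, using a temporary stand-in for settings.sh and a made-up variable GRID_DEMO:

```shell
#!/bin/bash
# Stand-in for the grid's settings.sh; GRID_DEMO is a made-up variable.
settings=$(mktemp)
echo 'export GRID_DEMO=loaded' > "$settings"

unset GRID_DEMO
bash "$settings"                       # runs in a child process
echo "after running:  ${GRID_DEMO:-unset}"

. "$settings"                          # read into the current shell
echo "after sourcing: ${GRID_DEMO:-unset}"

rm -f "$settings"
```

The first echo reports unset and the second reports loaded. This is presumably why the wrapper's settings vanish once it exits, and why sourcing the real file from the interactive shell keeps qstat and friends around for the rest of the session.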

qsub is the job submission command for the grid. The parameters mean something along the lines of: run from the current working directory (-cwd), queue priority low (-l qp=LOW), the Basic project (-P Basic, i.e. plebs not academics), merge stderr into stdout (-j y) and write both to et_log (-o et_log), and finally the script to run.
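For reference, the same command can be laid out one option per line with a comment on each. This sketch only prints the command rather than submitting anything; the function name submit_cmd is invented, and the meaning of qp=LOW is specific to this department's grid.

```shell
#!/bin/bash
# submit_cmd LOG SCRIPT: print the qsub command line used by the wrapper,
# one option per array element so each can carry a comment.
submit_cmd() {
    local log=$1 script=$2
    local -a args
    args=(
        -cwd          # run the job from the current working directory
        -l qp=LOW     # resource request; qp is specific to this grid
        -P Basic      # project to run the job under
        -j y          # merge stderr into stdout
        -o "$log"     # file that receives the merged output
        "$script"     # the job script itself
    )
    echo qsub "${args[@]}"
}

# Dry run: shows what would be submitted.
submit_cmd et_log run_fingar
```

Dropping the echo would turn the dry run into a real submission.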

Watching and waiting

Once you submit, the qstat command will show you all the jobs you have queued or running, which is handy as you can tell whether the results are ready or not. In the state column, qw means the job is queued and waiting, and r means it is running.

bash-3.2$ ./run_fingar_et
Your job 7484 ("run_fingar") has been submitted
bash-3.2$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   7484 0.00000 run_fingar   **ntr       qw    03/24/2009 23:08:29                                    1        
bash-3.2$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
   7484 0.56000 run_fingar   **ntr       r     03/24/2009 23:08:43 LOW.q@car2.expresstrain.dcs        1
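Rather than re-running qstat by hand, the state column can be picked out of that output with a little awk. This is a sketch that assumes the column layout shown above; job_state is a made-up helper name.

```shell
#!/bin/bash
# job_state JOB_ID: read qstat-style output on stdin and print the state
# column (e.g. qw or r) for the given job, or nothing if it is not listed.
job_state() {
    awk -v id="$1" '$1 == id { print $5; exit }'
}

# Try it on the running-job line from the listing above.
sample='   7484 0.56000 run_fingar   **ntr       r     03/24/2009 23:08:43 LOW.q@car2.expresstrain.dcs        1'
printf '%s\n' "$sample" | job_state 7484

# Possible use on the grid (not run here), assuming finished jobs drop
# out of the qstat listing:
#   while [ -n "$(qstat | job_state 7484)" ]; do sleep 60; done
```

The header and separator lines are skipped automatically, since their first field never matches a job ID.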

I'm currently waiting for a more ambitious run to complete:

SimpleHandPositionModelGAEvolver: 
population size: 400000
locus crossover likelihood: 0.16
allele mutation likelihood: 0.02
generation 1 of 500

I started running into memory issues at a population of 100k with the JRE defaults, so it'll be interesting to see how this goes. It's been running for an hour so far…
