Azure Package User Guide

The Azure Parameter Sweep package is an easy-to-use package that allows you to parallelize embarrassingly-parallel code over a number of Sho instances running in Azure. It is especially well-suited to parameter sweeps.

Installation and setup

In order to use the Azure package, your Azure administrator must first follow the directions in Admin Guide to get everything up and running on the cloud.

To install the package, simple copy the package directory to your Sho packages directory.

In order to use the package, it will need to know a couple of pieces of information about the Azure environment the Sho instances are running on. The following environment variables should contain the authentication information for your deployment. Ask your administrator for the values to set:

SHOAZUREACCOUNT = <your Azure storage account name>
SHOAZUREKEY = <the primary key for the storage account>
 

Usage

The parameter sweep package uses work functions that take two arguments: one “data” and one “parameter” value:

def workfn(data, param): ...
 

When you tell the package to run your work function in the cloud, it uses a single “data” value common to all of the various instances, and a list of parameter values, which are distributed among them. Here is an example showing how to use the package to distribute a very simple work function over a list of parameter values:

Simple example

For this simple example, we are going to use a very simple predefined work function that simply adds the data value to the parameter value.

>>> import paramsweep # import the paramsweep package
>>> paramsweep.addDemoDir() # this adds the path to the test functions
>>> import paramsweeptest # paramsweeptest contains some demo functions
>>> session = paramsweep.run(paramsweeptest.add, 100, [1,2,3,4,5])
 

The run function returns a parameter sweep session object:

>>> print session
ParamSweepSession('2c574185-9cd3-4ba1-a6f3-9668424374a8')
 

This session runs asynchronously on the cloud, leaving your Sho instance running so you can do work while you wait for it to finish. The isDone method will tell you if the session has finished. If it is, we can call getResults to get the results from each of the jobs. Alternately we can call the waitResults function, which blocks until it finishes, and then return the result values.

>>> session.isDone()
True
>>> session.getResults()
[101, 102, 103, 104, 105]
 

It is good practice to delete the server-side data associated with this session when you’re done.

>>> session.cleanup()
 

Of course, you aren’t limited to using integers as your data or parameter values:

>>> a = rand(4,4)  # This is a Sho array
>>> session = paramsweep.run (paramsweeptest.multiply, a, [1, 3, 5])
>>> for x in session.waitResults(): print x,"\n"
[ 0.4669 0.1889 0.4706 0.3542
0.7747 0.0344 0.3692 0.8417
0.4836 0.3800 0.0677 0.2211
0.6412 0.3844 0.4372 0.1443] 
[ 1.4006 0.5668 1.4117 1.0625
2.3240 0.1033 1.1076 2.5252
1.4509 1.1399 0.2030 0.6633
1.9235 1.1532 1.3116 0.4329] 
[ 2.3344 0.9447 2.3528 1.7709
3.8734 0.1722 1.8460 4.2087
2.4182 1.8999 0.3384 1.1055
3.2058 1.9221 2.1859 0.7215]
 

In addition to the return values, you can get back the data and parameter values:

>>> session.getData()
[ 0.4669 0.1889 0.4706 0.3542
0.7747 0.0344 0.3692 0.8417
0.4836 0.3800 0.0677 0.2211
0.6412 0.3844 0.4372 0.1443]
>>> session.getParams()
[1, 3, 5]
 

If you are planning to run the same work function over and over again in different sessions, you can simplify things further by creating a “sweeper” function from it. The sweeper function takes your work function and returns a thing that will automatically sweep your function over the set of parameters, block until they’re done and return the results:

>>> adder = paramsweep.sweeper(paramsweeptest.add)
>>> adder(10, [1,2,3,4,5])
[11, 12, 13, 14, 15]
>>> adder(100, [1,2,3,4,5])
[101, 102, 103, 104, 105]
 

Note that this is equivalent to:

>>> session = paramsweep.run(paramsweeptest.add, 10, [1,2,3,4,5])
>>> session.waitResults()
[11, 12, 13, 14, 15]
 

More complex examples

Oftentimes, you will want to run multiple parameter sweep sessions over the same large dataset. Since it can take a long time to upload the data to the cloud, you may want to explicitly upload the data once. Then you can tell the Azure instances to use the persisted cloud data, rather than upload it each time. This example uses the “topnsvd” test function, which returns the n largest singular values from a matrix.

>>> import azureutils
>>> arr = rand(1000, 1000)
>>> blob = azureutils.pickleToBlob(arr)
>>> session = paramsweep.run(paramsweeptest.topnsvd, blob, [1,2,3,4])
>>> session.waitResults()
[[ 500.0264], [ 500.0264 18.0877], [ 500.0264 18.0877 18.0564],
[ 500.0264 18.0877 18.0564 17.9239]]
 

All of the examples we’ve shown so far run work functions that take objects as the parameters and produce output in the form of return values. However, many times it is easier to express your work function in terms of input and/or output files. For these kinds of applications, we allow you to optionally specify an inDir and outDir argument to your work function. In this case, your work function should be specified as follows:

def workfn(data, param, inDir, outDir): ...
 

If you specify an inDir variable, the contents of that local directory will be uploaded to the cloud and then downloaded to your instance. For instance, indirlist is a function that returns a list of the files in inDir (the data and param values are ignored, and set to 0 here):

>>> session = paramsweep.run(paramsweeptest.indirlist, 0, [0], inDir="C:/tmp/myindir")
>>> session.waitResults()
[['a.txt', 'b.txt']]
 

If you want to produce output as files, outdirtest is an example:

>>> session = paramsweep.run(paramsweeptest.outdirtest, "foo", [1,2,3], outDir="C:/tmp/outdir")
>>> session.waitDone()
>>> session.downloadOutDir()
>>> ls("C:/tmp/outdir")
1.pickle 27 10/19/2010 5:38:33 PM
2.pickle 27 10/19/2010 5:38:33 PM
3.pickle 27 10/19/2010 5:38:33 PM
 

You can get a list of session objects that still store data on the cloud using the listSessions function, which will return a handle to all of the sessions started by you:

>>> for s in paramsweep.listSessions(): print s
ParamSweepSession('8aee1717-3d23-44b0-8033-21883a332f39')
ParamSweepSession('c6be1ea5-189e-441a-9f31-09a87ffc7102')
 

This is particularly useful if you forgot to clean them up:

>>> for s in paramsweep.listSessions(): s.cleanup()
 

Note that your work function must conform to one of the following forms, or it will not work:

def workFn(data, param): ...
def workFn(data, param, inDir): ...
def workFn(data, param, inDir, outDir): ...
 

Finally, you can retrieve any output text printed by your work function with the getOutputText() method on your session object:

>>> session.getOutputText()
[...]
 

Caching state and data

In order to make sure that one job doesn’t pollute the Sho environment and cause problems for subsequent jobs running on that Azure instance, the entire Sho environment is restarted between jobs by default. Also by default, the data value is re-downloaded for each job, so that and destructive changes one job makes doesn’t affect another job. Since both of these safeguards are frequently unnecessary and can take a significant amount of time, there are ways to disable them to get your jobs to run faster.

These options are controlled by setting special global variables in the main module of your work function. Setting the “can_reuse_state” variable to True tells the system that you can re-use the Sho instance between jobs from the same session. The “can_reuse_data” variable controls whether or not you can reuse the same data object between jobs. Finally, if you want to be able to share state and/or data values between different sessions, you can set the “state_token” variable to some unique value. Different sessions with the same “state_token” value will be able to reuse the Sho environment and data value as if they were jobs from the same session. Here is an example of the settings in the python file that contains the demo job functions:

paramsweeptest.py:

 

can_reuse_state = True
can_reuse_data = True
state_token = "parametersweeptest"
 

Storage Utilities

The paramsweep and auzreutils modules contain a number of useful utilities for dealing with Azure storage.

To store a data object in the cloud, call the azureutils.pickleToBlob() method:

>>> arr = <some data>
>>> blob = azureutils.pickleToBlob(arr)
 

This stores your data into a new blob with a unique name. If you want to pickle to data into a blob with a particluar name or at a particular URL, you can pass that in:

>>> azureutils.pickleToBlob(arr, blobname) # blobname is a filename or fully-qualified URL
 

In order to read the data back from the cloud, you can call the unpickle method:

>>> blob.unpickle()
 

If you don’t have a blob object, you can create one given the URI of an object in blob storage. Or, you can directly unpickle from a URI:

>>> blob = azureutils.BlobLocation("http://<blob-storage-url>")
>>> mydata = blob.unpickle()

or

>>> mydata = azureutils.unpickleBlob(http://<blob-storage-url>)
 

Finally, to delete a pickled blob, call the delete method or call the deleteBlob function:

>>> blob.delete()
>>> azureutils.deleteBlob("http://...")
 

You can retrieve a list of blobs you’ve pickled via the listPickleBlobs function, which returns a list of BlobLocation objects:

>>> azureutils.listPickleBlobs()
[<azureutils.BlobLocation instance at 0x000000000000002B>, <azureutils.BlobLocation instance at 0x000000000000002C>] 
 

Here are a few other useful functions:

uploadFile() – uploads a file into blob storage
downloadFile() – downloads a blob’s contents to a file
uploadDirectory() – uploads a directory into blob storage
downloadBlobDirectory() – downloads a directory of files from blob storage
 

Viewing diagnostic information

If you want to output diagnostic messages from your work function, first import the azureserverutil module, then call the tracemessage function. The message will be saved in Azure diagnostic message storage. Then, you can call the showLog method on your session object in order to view the messages.

def tracetest(data, param):
azureserverutil.tracemessage(str(data) + ": " + str(param))
 

Then, back at the console, call showLog to display the log messages:

>>> session = paramsweep.run(paramsweeptest.tracetest, "message with param value", [1,2,3])
>>> session.waitDone()
(wait a few minutes)
>>> session.showLog()
 
 
 

The results will be displayed in a Sho datagrid, with the results from each job grouped together. Note that there is a delay of a couple of minutes between when your work function calls tracemessage and when the corresponding log message is viewable. Passing in True as the optional second parameter to showLog will display internal trace messages that the service records, which can be useful when troubleshooting:

>>> session.showLog(True)
 
 

Alternately, if you want to show all recent diagnostic messages, regardless of which session they belong to, you can all the azureutils.showLog function. This can sometimes help you determine if the service is having problems. Be aware that the format of these diagnostic messages can make them difficult to read:

>>> azureutils.showLog(10) # show messages from the last 10 minutes
>>> azureutils.showLog(TimeSpan.FromHours(12))