Backend
The backend is a set of Python classes that manages jobs after the initial submission. This typically runs as a daemon on our ‘modbase’ machine, picking up submitted jobs from the frontend, submitting jobs to the cluster and gathering results, and doing any necessary pre- or post-processing.
The backend is implemented by the saliweb.backend Python module.
Classes
Job
The Job class represents a single job known to the backend. This
could have just been submitted by the frontend, it could be running on the
cluster, it could have finished and its results placed on long-term storage,
or the results from a very old job could have been deleted (only the job
metadata remains). Each job corresponds to a single row in the jobs database
table and a directory on disk.
For any web service, the Job class must first be subclassed and then
one or more of its methods implemented to actually do the work of running jobs.
For example, the Job.run() method will be called by the backend when
the job starts; it is expected to start the job running on the cluster,
typically by using a WyntonSGERunner object. There are similar methods
that can be used to do extra processing before the job starts
(Job.preprocess()), after it finishes running on the cluster
(Job.postprocess() and Job.finalize()) or when the job is moved
to long-term storage (Job.archive()), for example. Each method is run in the
directory containing all of the job’s data (i.e. any files uploaded by the end
user when the job was submitted, plus any output files after the job has run).
(The exception is Job.expire(), which is called after the job directory
has been deleted.)
If any of these methods raises an exception, it is caught by the backend;
the job is put into a failed state and the server admin is notified. Thus,
exceptions should be used only to indicate a technical error in the web
service, not something wrong with the user’s input (in the latter case, the
job output should simply indicate what the problem is).
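The distinction between a technical error and a problem with the user's input can be sketched as follows. This is only an illustration and not part of the framework: the helper check_inputs() and the output file name errors.txt are hypothetical. A missing or unreadable job directory is treated as a technical error (the exception propagates), while an upload with no PDB files is reported in the job output.

```python
import os
import tempfile


def check_inputs(job_dir):
    """Hypothetical helper: return True if job_dir contains usable input.

    Problems with the user's input are written to an output file
    (the name errors.txt is an assumption) instead of being raised.
    """
    # os.listdir() raises OSError if job_dir is missing or unreadable.
    # That is a technical problem, so the exception is left to propagate.
    pdbs = [f for f in os.listdir(job_dir) if f.endswith(".pdb")]
    if not pdbs:
        with open(os.path.join(job_dir, "errors.txt"), "w") as fh:
            fh.write("No PDB files were uploaded.\n")
        return False
    return True


# Small demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as job_dir:
    empty_ok = check_inputs(job_dir)
    reported = os.path.exists(os.path.join(job_dir, "errors.txt"))
    open(os.path.join(job_dir, "input.pdb"), "w").close()
    filled_ok = check_inputs(job_dir)
```

A Job method could call a helper of this kind and finish normally when it returns False, so that the job completes and the user sees the message rather than the job being marked as failed.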
Note
Each of these methods is automatically run by the backend at the
correct time; they should not be run manually by any method
in the subclass. For example, to run a new job, call
Job.reschedule_run(), not the Job.run() method directly.
As mentioned above, a WyntonSGERunner class is provided that takes care of
the details of running a script on the cluster and checking if it has
completed. Typically the script run here should use the local /scratch disk
on the cluster nodes if possible - this is not implemented automatically by
the framework, since the best usage of local and network disks is specific
to a given web service.
Database
Each job’s metadata is stored in a database; the Database class manages this
database and creates Job objects automatically when requested to by other
classes. The base Database class interfaces
with a MySQL database on the ‘modbase’ machine and manages database tables
containing fields used by all web services. It can be subclassed to add
additional fields (for example, to store some extra job metadata in the
database rather than within the job’s directory) or (potentially) to use a
different database engine.
Config
The Config class parses the configuration file for the web service
and stores all of the configuration information. It can be subclassed if
desired to read extra service-specific information from the configuration file,
usually by extending the Config.populate() method.
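The general pattern of extending a populate() method can be sketched without the framework. Everything below is an assumption made for illustration: BaseConfig is a stand-in written for this example (it is not the saliweb.backend.Config implementation), the file is assumed to be INI-style, and the section and option names are hypothetical.

```python
import configparser


class BaseConfig:
    """Stand-in for a framework config class (NOT the real implementation)."""

    def __init__(self, text):
        parser = configparser.ConfigParser()
        parser.read_string(text)
        self.populate(parser)

    def populate(self, parser):
        # A setting common to all services (hypothetical names)
        self.service_name = parser.get("general", "service_name")


class Config(BaseConfig):
    def populate(self, parser):
        # Read the common settings first, then the service-specific extras
        super().populate(parser)
        self.max_pdbs = parser.getint("modfoo", "max_pdbs")


EXAMPLE = """
[general]
service_name = ModFoo

[modfoo]
max_pdbs = 10
"""

cfg = Config(EXAMPLE)
```

Calling the base class method first keeps the common settings available, so the subclass only has to handle what is specific to the service.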
WebService
The WebService class provides high-level backend functionality.
The most commonly used method is WebService.do_all_processing(), which
simply runs in an endless loop, submitting new jobs to the cluster, collecting
the results of finished jobs, and archiving old completed jobs.
It is rarely necessary to subclass WebService.
Job states
A single job in the system is represented by a row in the database table and a single directory that contains the job inputs and/or outputs. Each job can be in one of a set of distinct states, described below. In normal operation a job will move from the first state in the list below to the last.
The INCOMING state is for jobs that have just been submitted to the system by the frontend, but not yet picked up by the backend.
Jobs move into the PREPROCESSING state when they are picked up by the backend. At this point the Job.preprocess() method is called, which can be overridden to do any necessary preprocessing. Note that this method runs on the server machine (‘modbase’) and serially (for only a single job at a time), so it should not run any calculations that take more than a few seconds.
Next, jobs usually move to the RUNNING state. At this point the Job.run() method is called, which typically will submit an SGE job to do the bulk of the processing.
When the SGE job finishes, the job moves to the POSTPROCESSING state and the Job.postprocess() method is called. Like preprocessing, this runs serially on the server machine and so should not be computationally expensive.
It may be decided in postprocessing that further runs are required, in which case the job moves back to the RUNNING state for another cycle. Otherwise, it moves to the FINALIZING state and Job.finalize() is called. This is the same as POSTPROCESSING except that it is only called on the last cycle (no further runs can be started here).
Next the job moves to the COMPLETED state and the Job.complete() method is called. If the user provided an email address to the frontend, they are emailed at this point to let them know job results are now available.
After a defined period of time, the job moves to the ARCHIVED state and the Job.archive() method is called. At this point the job results are still present on disk, but are no longer accessible to the end user and may be moved to long-term storage.
After another defined period of time, the job moves to the EXPIRED state, the job directory is deleted, and the Job.expire() method is called. At this point, only the job metadata in the database remains.
If a problem is encountered at any point (usually a Python exception) the job is moved to the FAILED state. At this point the server admin is emailed and is expected to fix the problem (usually a bug in the web service, or a system problem such as a broken or full hard disk).
Note also that the Job.preprocess() method can, if desired, signal to the
framework that running a full SGE job is unnecessary (by calling the
Job.skip_run() method). In this case, the RUNNING and
POSTPROCESSING steps are skipped and the job moves directly from
PREPROCESSING to COMPLETED. Similarly, the Job.postprocess()
method can request that the framework runs a new job (by calling the
Job.reschedule_run() method). In this case, the job moves from
POSTPROCESSING back to RUNNING.
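The state transitions described above can be summarized as a small table. This is a reading of the text only, not the framework's own code: the State enum, the TRANSITIONS table and can_move() are hypothetical names, and treating FAILED as reachable from every other state is an assumption based on "at any point". Any recovery of failed jobs by the server admin is not modeled.

```python
from enum import Enum


class State(Enum):
    INCOMING = "INCOMING"
    PREPROCESSING = "PREPROCESSING"
    RUNNING = "RUNNING"
    POSTPROCESSING = "POSTPROCESSING"
    FINALIZING = "FINALIZING"
    COMPLETED = "COMPLETED"
    ARCHIVED = "ARCHIVED"
    EXPIRED = "EXPIRED"
    FAILED = "FAILED"


# Transitions in normal operation, as described in the text
TRANSITIONS = {
    State.INCOMING: {State.PREPROCESSING},
    # Calling skip_run() in preprocess() sends a job straight to COMPLETED
    State.PREPROCESSING: {State.RUNNING, State.COMPLETED},
    State.RUNNING: {State.POSTPROCESSING},
    # Calling reschedule_run() in postprocess() sends a job back to RUNNING
    State.POSTPROCESSING: {State.RUNNING, State.FINALIZING},
    State.FINALIZING: {State.COMPLETED},
    State.COMPLETED: {State.ARCHIVED},
    State.ARCHIVED: {State.EXPIRED},
    State.EXPIRED: set(),
    State.FAILED: set(),
}


def can_move(old, new):
    """Return True if a job may move from state `old` to state `new`."""
    if new is State.FAILED:
        # Assumption: a problem in any other state moves the job to FAILED
        return old is not State.FAILED
    return new in TRANSITIONS[old]
```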
Each job state (with the exception of EXPIRED) can be given a directory
in the service’s configuration file. Job data are automatically moved between
directories when the state changes. For example, the INCOMING directory
generally needs to reside on a local disk, and have special permissions so that
the frontend can create files within it. The RUNNING directory usually
needs to be accessible by the cluster, so it needs to be on the /wynton
disk. The ARCHIVED directory may live on long-term storage, such
as a park disk.
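Moving a job directory when its state changes can be illustrated with the standard library. This is a sketch of the idea only, not the framework's own code: move_job() and the state_dirs mapping are hypothetical, and temporary directories stand in for the real per-state locations.

```python
import os
import shutil
import tempfile


def move_job(name, old_dir, new_dir):
    """Move the directory of job `name` from old_dir to new_dir."""
    src = os.path.join(old_dir, name)
    dst = os.path.join(new_dir, name)
    # shutil.move() renames when possible and otherwise copies then deletes,
    # so it also works when the two directories are on different disks.
    shutil.move(src, dst)
    return dst


with tempfile.TemporaryDirectory() as root:
    # Hypothetical per-state directories, as might be listed in a config file
    state_dirs = {s: os.path.join(root, s.lower())
                  for s in ("INCOMING", "RUNNING")}
    for d in state_dirs.values():
        os.mkdir(d)
    os.mkdir(os.path.join(state_dirs["INCOMING"], "testjob"))

    new_path = move_job("testjob", state_dirs["INCOMING"],
                        state_dirs["RUNNING"])
    moved = os.path.isdir(new_path)
    old_gone = not os.path.exists(
        os.path.join(state_dirs["INCOMING"], "testjob"))
```

A move between a local disk and a network disk is a copy followed by a delete, so it can take noticeable time for large job directories.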
Examples
Simple job
The example below demonstrates a simple Job subclass that, given a
set of PDB files from the frontend, runs an SGE job on the cluster that
extracts all of the HETATM records from each PDB. This is done by
overriding the Job.run() method to pass a set of shell script commands
to a WyntonSGERunner instance; this instance is then returned to the
backend. The backend will then keep track of the SGE job, and notice when
it finishes.
The subclass also overrides the Job.archive() method, so that when the
job results are moved from short-term to long-term storage, all of the PDB
files are compressed with gzip to save space.
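    import saliweb.backend
    import glob
    import gzip
    import os


    class Job(saliweb.backend.Job):
        runnercls = saliweb.backend.WyntonSGERunner

        def run(self):
            # Shell commands to run on the cluster: extract the HETATM
            # records from each PDB file into a matching .het file
            script = """
    for f in *.pdb; do
      grep '^HETATM' $f > $f.het
    done
    """
            r = self.runnercls(script)
            r.set_sge_options('-l diva1=1G')
            # Return the runner so that the backend can track the SGE job
            return r

        def archive(self):
            # Compress each PDB file with gzip, then delete the original
            for f in glob.glob('*.pdb'):
                with open(f, 'rb') as fin, gzip.open(f + '.gz', 'wb') as fout:
                    fout.writelines(fin)
                os.unlink(f)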
Custom database class
The Database class can be customized by adding additional fields to
the database table. This is useful if you need to pass small amounts of job
metadata between the frontend and backend, or between different stages of the
job, and the metadata are useful to keep after the job has finished.
Note
In many cases, it makes more sense to store job data as files in the job directory itself. For example, it is probably easier to store a PDB file as a real file rather than trying to insert the contents into the database table!
This example adds a new integer field number_of_pdbs to the database. The
field can then be accessed (read or write) from within the Job object
by referencing self._metadata['number_of_pdbs']. The _metadata attribute
stores all of the job metadata in a Python dictionary-like object; it is
essentially a dump of the database row corresponding to the job.
    import saliweb.backend
    import glob


    class Database(saliweb.backend.Database):
        def __init__(self, jobcls):
            saliweb.backend.Database.__init__(self, jobcls)
            # Add an extra integer column to the jobs table
            self.add_field(saliweb.backend.MySQLField('number_of_pdbs', 'INTEGER'))


    class Job(saliweb.backend.Job):
        runnercls = saliweb.backend.WyntonSGERunner

        def preprocess(self):
            # Record the number of uploaded PDB files in the job metadata
            pdbs = glob.glob("*.pdb")
            self._metadata['number_of_pdbs'] = len(pdbs)

        def run(self):
            script = """
    for f in *.pdb; do
      grep '^HETATM' $f > $f.het
    done
    """
            r = self.runnercls(script)
            r.set_sge_options('-l diva1=1G')
            return r
Logging
It is often useful for debugging purposes to log progress of a job. While the
job is running on the cluster, the only way to do this is to write output into
a log file. For other steps in the processing, however, the standard Python
logging module is used. Each job method (such as Job.run() or Job.preprocess()),
with the exception of Job.expire(), can use the Job.logger object
to write out log messages. It is a standard Python Logger object, so it supports
the regular methods of a Logger, such as warning() and critical()
to write log messages, and setLevel() to set the threshold for log output.
By default, anything logged at or above the threshold will be written to a file
called ‘framework.log’ in the job’s directory. The file will only be created
when the first log message is written. This behavior can be modified if desired
by overriding the Job.get_log_handler() method.
    import saliweb.backend
    import logging


    class Job(saliweb.backend.Job):
        runnercls = saliweb.backend.WyntonSGERunner

        def run(self):
            # Uncomment to get all logging output
            # self.logger.setLevel(logging.DEBUG)
            self.logger.info('Starting run method')
            script = """
    for f in *.pdb; do
      grep '^HETATM' $f > $f.het
    done
    """
            r = self.runnercls(script)
            self.logger.warning('Setting SGE options to diva1=1G')
            r.set_sge_options('-l diva1=1G')
            self.logger.info('Ending run method')
            return r
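The behavior of creating the log file only when the first message is written can be reproduced with the standard library alone. The sketch below is not the framework's own handler: make_job_logger() is a hypothetical helper, and the WARNING threshold is an assumption chosen for the demonstration. It relies on the delay argument of logging.FileHandler, which defers opening the file until the first record is emitted.

```python
import logging
import os
import tempfile


def make_job_logger(job_dir, name):
    """Hypothetical helper: logger writing to framework.log in job_dir."""
    # delay=True defers opening (and so creating) the file until the
    # first record is actually emitted.
    handler = logging.FileHandler(os.path.join(job_dir, "framework.log"),
                                  delay=True)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)  # assumed threshold for this demo
    logger.addHandler(handler)
    logger.propagate = False
    return logger, handler


with tempfile.TemporaryDirectory() as job_dir:
    log_path = os.path.join(job_dir, "framework.log")
    logger, handler = make_job_logger(job_dir, "demo_job")

    logger.info("below the threshold, so nothing is written")
    exists_before = os.path.exists(log_path)

    logger.warning("at the threshold, so the file is created")
    exists_after = os.path.exists(log_path)

    # Close the handler before the temporary directory is removed
    logger.removeHandler(handler)
    handler.close()
```

A handler built along these lines is the kind of object an overridden Job.get_log_handler() might return, although the exact contract of that method should be checked against the framework's reference documentation.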
Testing
The best way to test the backend is as part of the entire web service (see Testing).
However, the backend can be tested directly without invoking the frontend, by manually modifying the MySQL database. Note, however, that the interface between the backend and frontend, as well as the details of the MySQL tables, are not guaranteed to be stable (future iterations of the framework may change some of the details for performance or additional features), so this method could fail in future.
To manually submit a job:
Decide on a job name. This must be unique. Create a directory with the same name, as the backend user, under the web service’s incoming directory (as specified in the configuration file).
Put all necessary input files into this directory.
Connect to the MySQL server using the mysql client on modbase, and the username and password from the web service’s configuration file. Either the backend or frontend user can be used; the frontend user can only submit jobs and so is recommended, while the backend user can also delete or modify jobs, which is dangerous as it may break the service. For example:
    mysql -u modfoo_frontend -p -D modfoo
To actually submit a job use something like:
    INSERT INTO jobs (name,passwd,user,contact_email,directory,url,submit_time) VALUES (a,b,c,d..., UTC_TIMESTAMP());
a,b,c,d are the values for the columns, described below:
‘name’ is the name of the job, from above.
‘passwd’ is used by the frontend to protect job results. Any alphanumeric string can be used here.
‘user’ is the user that submitted the job. NULL can be used here.
‘contact_email’ is the email address that the backend will notify when the job completes, or NULL for no email notification.
‘directory’ is the filesystem directory containing the job inputs, which must match that created above.
‘url’ is a web link that the backend will include in the email it sends out, telling the user where the results can be downloaded. A dummy value can be used here, since the frontend usually handles this.
‘submit_time’ is the time (UTC) when the job was submitted. Usually, the MySQL function UTC_TIMESTAMP() is used here to put in the current time.
The job will only be run if the backend is running (use the bin/service.py script as the backend user in the installation directory). The backend polls periodically for new jobs. Alternatively, service.py can be used to restart the backend, to force it to check immediately.
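The INSERT statement above can also be assembled programmatically. The following is a minimal sketch, not part of the framework: build_insert() is a hypothetical helper, and the job name, password, directory and URL are placeholder values. It only builds the statement and its parameters; it assumes they would then be passed to the execute() method of a Python DB-API driver for MySQL that uses %s placeholders.

```python
# Columns filled in by the caller; submit_time is filled in by MySQL
JOB_COLUMNS = ("name", "passwd", "user", "contact_email", "directory", "url")


def build_insert(name, passwd, directory, url, user=None, contact_email=None):
    """Return (sql, params) for submitting one job.

    None is used for columns that should be NULL.
    """
    sql = ("INSERT INTO jobs (%s,submit_time) VALUES (%s, UTC_TIMESTAMP())"
           % (",".join(JOB_COLUMNS), ",".join(["%s"] * len(JOB_COLUMNS))))
    params = (name, passwd, user, contact_email, directory, url)
    return sql, params


# Placeholder values for illustration only
sql, params = build_insert("testjob1", "abc123",
                           "/path/to/incoming/testjob1",
                           "http://example.com/")
```

Passing the values as parameters, rather than pasting them into the SQL text, leaves quoting and NULL handling to the database driver.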