Babar/CM2 A-to-Z at Manchester

James Werner

Warning! This script is obsolete. See the EasyGrid User Manual.

Script for Grid submission, status, and recovery


The grid works by dividing the work between several computers running in parallel. If you have to prepare 100 cookies and spend 100 minutes on each, the job will take you 10,000 minutes. However, if you hand the work to 5 friends, it will take only 2,000 minutes, because the same effort is made in parallel. The grid concept is mostly the infrastructure to perform this distribution in a secure and reliable way.
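The arithmetic of the cookie example can be checked with a few lines of shell. This is only an illustration: the function name wall_minutes is made up here, and it assumes the work splits evenly between the workers with no overhead for handing out the tasks.

```shell
#!/bin/bash
# wall_minutes ITEMS MINUTES_PER_ITEM WORKERS
# Elapsed time when ITEMS tasks are shared evenly between WORKERS
# (hypothetical helper; ignores any distribution overhead).
wall_minutes() {
  local items=$1 minutes=$2 workers=$3
  # Each worker handles ceil(items / workers) tasks, one after another.
  local per_worker=$(( (items + workers - 1) / workers ))
  echo $(( per_worker * minutes ))
}

wall_minutes 100 100 1   # one person alone
wall_minutes 100 100 5   # five friends in parallel
```

The two calls print 10000 and 2000, the figures quoted in the text.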

To understand how it works, let's go back to the example of the 5 friends. You give each friend 20 cookies to bake. You want to keep track of where the cookies are and who has them. To achieve this, when your friends receive the task, they give you back a code (called a HANDLER) that identifies the task for them.

If you want to know how the task is going, you have to give them the handler and they will answer "it is cooking", "it is done", or even "it is burnt!" (if something goes wrong).

When you give the cookies to your friends you use a box, and they return the cookies in a box too. The grid world uses a SANDBOX to send and retrieve files. I use an Input Sandbox to send the files from your directory to the grid, and an Output Sandbox to return the result files.

I submit a separate job for each file generated by relDatasetTcl. There can be up to 500 files for tau data, which means that a single analysis program can have 500 handlers, 500 Input Sandboxes, and 500 Output Sandboxes.

The problem is that when you submit a job to the grid, the handler you receive back looks like this:


https://lcgrb01.gridpp.rl.ac.uk:9000/lBWj-5_hkC9PxjT5GGCRoA

Have you tried to remember or type it? No way!

This is the job my script does. You give the script the dataset name (for example SP-1005-Tau11-R14), and the script does all the work: it prepares the data, submits the jobs, gets the handlers and stores them, queries the status, and recovers the results. You will see no handlers, and need not worry about which commands to type or which parameters to fill in. The available grid resources will be acquired, and an optimisation algorithm will adapt software parameters (such as the number of events per TCL file, the queue size, the number of worker nodes, etc.) to obtain the best performance. This software will be developed from January 2005.
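A minimal sketch of the handler bookkeeping, under stated assumptions: the submission log below is made up (only the handler itself is the one quoted above, the wording around it is invented), and save_handlers is an illustrative name rather than part of the script. The idea is simply to pick every https handler out of the saved output and append it to a per-dataset file, so nobody has to type it.

```shell
#!/bin/bash
# save_handlers SUBMIT_LOG TOKEN_FILE
# Append every https handler found in SUBMIT_LOG to TOKEN_FILE.
# (hypothetical helper, not the author's code)
save_handlers() {
  grep -o 'https://[^[:space:]]*' "$1" >> "$2"
}

workdir=$(mktemp -d)
# Made-up submission log: the layout is an assumption, not real grid output.
cat > "$workdir/submit.log" <<'EOF'
Job submitted. The job identifier is:
 - https://lcgrb01.gridpp.rl.ac.uk:9000/lBWj-5_hkC9PxjT5GGCRoA
EOF

save_handlers "$workdir/submit.log" "$workdir/SP-1005-Tau11-R14.tokens"
cat "$workdir/SP-1005-Tau11-R14.tokens"
```

Keeping one handler per line makes the file easy to feed to later steps, one status query per line.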

The process is modelled as a state machine. Every time you type the command:


./easygrid dataset_name

you start in the initial state. If you already have tcl files in the directory, you do not need to run relDatasetTcl; otherwise you go to the DatasetTcl state. After preparing the dataset tcl files, the script needs to know whether you have already submitted the data to the grid, which it does by looking for handlers. If there are no handlers, it is because you have not submitted yet. You then go to the Submission state, and after submission you return to the initial state, because it is too early to check whether the results are ready.

Next time you type:


./easygrid dataset_name

you already have tcl files and tokens (the stored handlers). You go to the Status state, where the grid answers questions about the status of each of your jobs.

A job can be Ready, Scheduled, Running or, most importantly, Done. If all jobs are Done, the processing is finished and you can recover the results (the Recover state).

If not, you have to try again and again (OK, take it easy! Drink a coffee, read a good book about gauge theory,... and try again).
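The "are all jobs Done?" test can be sketched as a count of the jobs whose status is anything else. The status report below is invented for illustration: it assumes one "Current Status:" line per job with the status word as the third field, and count_pending is a hypothetical helper.

```shell
#!/bin/bash
# count_pending STATUS_FILE
# Print how many "Current Status:" lines report something other than Done.
count_pending() {
  awk '/Current Status:/ { if ($3 != "Done") pending++ } END { print pending + 0 }' "$1"
}

report=$(mktemp)
# Invented report: the layout is an assumption, not real grid output.
cat > "$report" <<'EOF'
Current Status:     Done
Current Status:     Running
Current Status:     Scheduled
Current Status:     Done
EOF

pending=$(count_pending "$report")
if [ "$pending" -eq 0 ]; then
  echo "All jobs done: results can be recovered."
else
  echo "$pending jobs still pending: try again later."
fi
```

With the sample report, two of the four jobs are not Done, so the sketch takes the "try again later" branch.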

The script manages the handlers, stores them in files, and tracks the state by the existence or absence of these control files.
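The idea of reading the state off the control files can be sketched as follows. The function name next_state is illustrative, and the sketch assumes that the tcl files are named after the dataset and that the handlers live in a file with the suffix .tokens. The Recover step is reached from the Status state once every job is Done, so it needs no control file of its own.

```shell
#!/bin/bash
# has_files PATH...: true if every listed path exists.
has_files() { ls "$@" >/dev/null 2>&1; }

# next_state DATASET [DIR]
# Decide the next state from the control files present in DIR.
# (hypothetical helper, written only to illustrate the idea)
next_state() {
  local ds=$1 dir=${2:-.}
  if ! has_files "$dir/$ds".tcl && ! has_files "$dir/$ds"-*.tcl; then
    echo DatasetTcl      # no tcl files yet: prepare the dataset first
  elif [ ! -f "$dir/$ds.tokens" ]; then
    echo Submission      # tcl files but no handlers: not submitted yet
  else
    echo Status          # handlers stored: ask the grid how the jobs are doing
  fi
}

work=$(mktemp -d)
next_state SP-1005-Tau11-R14 "$work"      # nothing there yet
touch "$work/SP-1005-Tau11-R14-1.tcl"
next_state SP-1005-Tau11-R14 "$work"      # tcl files, no handlers
touch "$work/SP-1005-Tau11-R14.tokens"
next_state SP-1005-Tau11-R14 "$work"      # handlers stored
```

The three calls print DatasetTcl, Submission, and Status in turn. Because the state lives in the files and not in a running process, the command can be repeated at any time and picks up where it left off.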

1. The control file generation software.

There are two main modules in job submission software. The submission generator is responsible for searching the grid, finding the optimal way to run the task (this will be implemented next year), and generating all the files and structures needed to do the job (this is what has been done now, because I have only one computing element!). The second module consists of the handler manager and the state machine software. The first module is implemented by the following C program, called "gera.c", which works like a UNIX filter:


      1   /*
      2           Generation of the JDL and of the GRID commands
      3              Author: Dr James Cunha Werner
      4            www.geocities.com/jamwer2002
      5             University of Manchester
      6   */
      7   #include <stdio.h>
      8   #include <stdlib.h>
      9   #include <string.h>
     10   int main (int argc, char *argv[])
     11   {
     12   FILE *arqjdl,*arqtcl,*arqsh;
     13   char nomearq[300],nomedata[300],nomesh[300],nomebase[300];
     14   int i;
     15   if(argc<2){fprintf(stderr,"usage: gera file.tcl\n");return 1;}
     16   strcpy(nomebase,argv[1]);
     17   for(i=strlen(nomebase)-1;i>0;i--)
     18     if(nomebase[i]=='.')
     19       nomebase[i]='\0';
     20
     21   strcpy(nomedata,"Job");
     22   strcat(nomedata,nomebase);
     23   strcat(nomedata,".tcl");
     24   arqtcl=fopen(nomedata,"w");
     25   fprintf(arqtcl,"set ConfigPatch Run2\n");
     26   fprintf(arqtcl,"set levelOfDetail cache\n");
     27   fprintf(arqtcl,"set BetaMiniTuple hbook\n");
     28   fprintf(arqtcl,"set histFileName %s.hbk\n",nomebase);
     29   fprintf(arqtcl,"set NEvent 1000000000\n");
     30   fprintf(arqtcl,"source %s\n",argv[1]);
     31   fprintf(arqtcl,"sourceFoundFile recodata.tcl\n");
     32   fclose(arqtcl);
     33
     34   strcpy(nomesh,nomebase);
     35   strcat(nomesh,".sh");
     36   arqsh=fopen(nomesh,"w");
     37   fprintf(arqsh,"#!/bin/bash\n");
     38   fprintf(arqsh,"echo This program was executed on the computer `/bin/hostname`\n");
     39   fprintf(arqsh,"echo Start time: `/bin/date`\n");
     40   fprintf(arqsh,"echo \n");
     41   fprintf(arqsh,"echo Initialisation of Babar variables\n");
     42   fprintf(arqsh,"local=`pwd`\n");
     43   fprintf(arqsh,"echo Directory with input sandbox: $local \n");
     44   fprintf(arqsh,"if [ -f /etc/bashrc ]; then\n");
     45   fprintf(arqsh,"  . /etc/bashrc\n");
     46   fprintf(arqsh,"fi\n");
     47   fprintf(arqsh,"if [ -r /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc ]; then\n");
     48   fprintf(arqsh,"  . /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc\n");
     49   fprintf(arqsh,"else \n");
     50   fprintf(arqsh,"  echo BaBar setup file /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc not found\n");
     51   fprintf(arqsh,"fi\n");
     52   fprintf(arqsh,"echo \n");
     53   fprintf(arqsh,"echo -----------------------------------------------\n");
     54   fprintf(arqsh,"echo \n");
     55   fprintf(arqsh,"cd $BFDIST/releases/14.5.2 \n");
     56   fprintf(arqsh,"srtpath 14.5.2 Linux24RH72_i386_gcc2953 \n");
     57   fprintf(arqsh,"cd $local \n");
     58   fprintf(arqsh,"echo Available files: $local \n");
     59   fprintf(arqsh,"ls \n");
     60   fprintf(arqsh,"echo \n");
     61   fprintf(arqsh,". ./fullboot.sh \n");
     62   fprintf(arqsh,"ln -s $BFDIST/releases/14.5.2 PARENT\n");
     63   fprintf(arqsh,"/exp_software/babar01/BetaMiniApp %s\n",nomedata);
     64   fprintf(arqsh,"echo \n");
     65   fprintf(arqsh,"echo ----------------------------------------------\n");
     66   fprintf(arqsh,"echo \n");
     67   fprintf(arqsh,"echo End time: `/bin/date`\n");
     68   fclose(arqsh);
     69
     70   strcpy(nomearq,nomebase);
     71   strcat(nomearq,".jdl");
     72   arqjdl=fopen(nomearq,"w");
     73   fprintf(arqjdl,"Executable=\"%s.sh\";\n",nomebase);
     74   fprintf(arqjdl,"InputSandbox={\"%s\",\"%s\",\"%s\",\"fullboot.sh\",\"recodata.tcl\"};\n",
     75     nomesh,nomedata,argv[1]);
     76   fprintf(arqjdl,"StdOutput=\"std.out\";\n");
     77   fprintf(arqjdl,"StdError=\"std.err\";\n");
     78   fprintf(arqjdl,"OutputSandbox={\"std.out\",\"std.err\",\"%s.hbk\"};\n",nomebase);
     79   fprintf(arqjdl,"Requirements = other.GlueCEUniqueID == \"bfa.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-short\" ;\n");
     80   fclose(arqjdl);
     81
     82
     83
     84   printf("edg-job-submit --vo babar -r bfa.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-long %s >> gridtokens\n",nomearq);
     85   return 0;
     86   }
 


The program generates three files. The first is the TCL file (lines 21 to 32), described in Table 6 of the tutorial. The second (lines 34 to 68) is the script that configures the environment and then runs BetaMiniApp. Finally, the JDL file is created (lines 70 to 80) to tell the grid manager what it has to do. Each part is described in the LCG2 User manual. The output of gera.c is the command line that submits the job (line 84).
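To see which files a given input produces, the naming rule in gera.c can be mirrored in a few lines of shell. This is a sketch, not a replacement for the program: gera_names is a made-up name, the input file name is hypothetical, and the sketch assumes the file name does not start with a dot.

```shell
#!/bin/bash
# gera_names INPUT_TCL
# Print the three file names gera.c derives from INPUT_TCL.
# (hypothetical helper that only mirrors the naming, not the file contents)
gera_names() {
  local input=$1
  local base=${input%%.*}     # cut at the first dot, like lines 17 to 19
  echo "Job${base}.tcl"       # TCL file (lines 21 to 32)
  echo "${base}.sh"           # environment and run script (lines 34 to 68)
  echo "${base}.jdl"          # job description (lines 70 to 80)
}

gera_names SP-1005-Tau11-R14-1.tcl
```

For this input the sketch prints JobSP-1005-Tau11-R14-1.tcl, SP-1005-Tau11-R14-1.sh, and SP-1005-Tau11-R14-1.jdl.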

You have to compile the software using:


gcc gera.c -o gera

2. The handler manager software.

The script executes the following tasks:

  1. Obtain a valid proxy for the user from the BaBar community.
  2. Verify if there are TCL data files available.
  3. Verify if there are handlers available.
  4. Verify if all tasks have finished.
  5. Recover the data into the user directory.

The state machine described above is implemented by the following script, called "easygrid":


#!/bin/bash
#            GRID submission script
#            Author: Dr James Cunha Werner
#            www.geocities.com/jamwer2002
#               University of Manchester
#
#edg-voms-proxy-init

# DatasetTcl state: build the tcl files if none exist for this dataset.
echo Searching pre selected skimdata.
if (! (ls "$1".tcl >/dev/null 2>&1 || ls "$1"-*.tcl >/dev/null 2>&1)) then
  echo skimData files not found ! Running BbkDatasetTcl.
  BbkDatasetTcl --site man -t 1000 -ds "$1" -b "$1"
  rm -f "$1.tokens"
fi

echo Searching previous handlers.
if [ ! -f "$1.tokens" ]
then
  # Submission state: no stored handlers, so nothing has been submitted yet.
  echo Handlers not found. Submitting to GRID. Wait for end of process...
  cp /home/jamwer/PgmCM2/workdir/bin/Linux24RH72_i386_gcc2953/BetaMiniApp /exp_software/babar01
  ls "$1"*.tcl > gridtrab
  rm -f gridgera gridsub gridtokens "$1.tokens" "$1.subm" "$1.histo"
  # Generate one "./gera file.tcl" call per tcl file and run them;
  # each call appends its submission command to gridsub.
  awk '// {print "./gera",$1," >> gridsub "}' gridtrab > gridgera
  chmod 700 gridgera
  ./gridgera
  # Run the submission commands; their output is appended to gridtokens.
  chmod 700 gridsub
  ./gridsub
  # Keep only the handlers and archive the full submission output.
  awk '/https/ {print $2}' gridtokens >> "$1.tokens"
  mv gridtokens "$1.subm"
  subok=`wc -l < "$1.tokens"`
  if [ "$subok" -eq 0 ]
  then
    echo Submission ended with errors:
    rm -f "$1.tokens"
    cat "$1.subm"
    echo
    echo ------------------------------------------------------------
    echo Send the file $1.subm to jamwer@hep.man.ac.uk for diagnosis.
    echo ------------------------------------------------------------
    echo
  fi
else
  # Status state: ask the grid about every stored handler.
  echo Checking if jobs finished.
  rm -f gridpend gridstat aux
  awk '// {print "edg-job-status ",$1," >> gridpend"}' "$1.tokens" >> gridstat
  chmod 700 gridstat
  ./gridstat
  # Print a summary for each job and report how many are not Done.
  awk 'BEGIN {final=0}
       /Status info for the Job/ {HandleName=$7; print "### Handle -> " $7}
       /Current Status:/ {print "    " $1,$2,$3
                          if ($3 != "Done") {final+=1; print HandleName " still pending."}}
       /Exit code:/ {print "    " $1,$2,$3}
       END {print final " jobs did not finish ! Try again later."}' gridpend
  final=`awk 'BEGIN {final=0} /Current Status:/ {if ($3 != "Done") {final+=1}} END {print final}' gridpend`
  cat gridpend >> "$1.histo"
  if [ "$final" -eq 0 ]
  then
    # Recover state: every job is Done, so fetch the output sandboxes.
    echo All jobs done. Recovering results in your folder.
    rm -f /exp_software/babar01/BetaMiniApp
    rm -f gridrec "$1.recres" gridgetout
    awk '// {print "edg-job-get-output --dir . ",$1," >> gridgetout"}' "$1.tokens" >> gridrec
    chmod 700 gridrec
    ./gridrec
    mv gridgetout "$1.recres"
    echo
    echo Results in the following folders:
    grep "`pwd`" "$1.recres"
    # Removing the handlers returns the state machine to its start.
    rm -f "$1.tokens"
  fi
fi
 


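The trick is that each step generates a script (gridgera, gridsub, gridstat, gridrec) and runs it on the next line, so the commands that are executed depend on the conditions found at that moment.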
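The generate-then-run pattern can be tried in isolation with a harmless stand-in command. In the sketch below the file names worklist, runall, and done.log are invented, and "echo processing" takes the place of the real per-file command.

```shell
#!/bin/bash
work=$(mktemp -d)
cd "$work" || exit 1

# One work item per line (hypothetical file names).
printf '%s\n' a-1.tcl a-2.tcl a-3.tcl > worklist

# Step 1: turn the list into a script, one command per item.
# "echo processing" stands in for the real per-item command.
awk '{ print "echo processing", $1, ">> done.log" }' worklist > runall

# Step 2: run the script that was just generated.
sh ./runall

cat done.log
```

The generated file runall holds three echo commands, so done.log ends up with one "processing" line per work item. The same pattern scales to any number of items because the list, not the script, decides how many commands are run.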

Copyright 2004 Manchester University
Feedback to: jamwer@hep.man.ac.uk