To understand how it works, consider an example with 5 friends. You give each friend 20 cookies to bake. You want to keep track of where the cookies are and who has them. To achieve this, when your friends receive the task, they give you back a code (called a HANDLER) that identifies the task for them.
If you want to know how the task is going, you give them the handler and they answer "it is cooking", "it is done", or even "it is burnt!" (if something went wrong).
When you give the cookies to your friends you use a box, and they return the cookies in a box too. The grid world uses a SANDBOX to send and retrieve files. I will use an Input Sandbox to send the files from your directory to the grid, and an Output Sandbox to return the result files.
I submit a separate job for each file generated by relDatasetTcl. The number of files can be up to 500 for tau data. This means that a single analysis program can have 500 handlers, 500 Input Sandboxes, and 500 Output Sandboxes.
The problem is that the handler you receive back when you submit to the grid looks like this:
https://lcgrb01.gridpp.rl.ac.uk:9000/lBWj-5_hkC9PxjT5GGCRoA
Have you tried to remember or type it? No way!
This is the job my script does. You give the script the dataset name (for example SP-1005-Tau11-R14), and the script does all the work: it prepares the data, submits the jobs, gets the handlers and stores them, queries the status and recovers the results. You see no handlers, and you do not have to worry about which commands to type or which parameters to fill in. The available grid resources will be acquired, and an optimisation algorithm will adapt software parameters (such as the number of events per TCL file, queue size, number of worker nodes, etc.) to obtain the best performance. This software will be developed from January 2005.
This process is represented as a state machine. Every time you type the command:
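As a minimal sketch of the "get handlers and store them" step, the function below pulls every handler out of saved submission output and appends it to a file, so nobody has to type one. The name store_handlers and its two arguments are hypothetical, and the sketch assumes only that each handler appears as an https URL somewhere in the submission output.

```bash
#!/bin/bash
# store_handlers SUBMIT_LOG TOKENS_FILE   (hypothetical helper, not part of easygrid)
# Appends every https://... handler found in SUBMIT_LOG to TOKENS_FILE,
# one per line. Assumes a handler never contains whitespace.
store_handlers() {
    local log=$1 tokens=$2
    grep -o 'https://[^[:space:]]*' "$log" >> "$tokens"
}
```

For example, store_handlers submit.log mydataset.tokens would leave one handler per line in mydataset.tokens (both file names are placeholders).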
./easygrid dataset_name
you are in the initial state. If you already have tcl files in the directory, you do not need to run relDatasetTcl; otherwise you go to the DatasetTcl state. After preparing the dataset tcl files, you need to know whether you have already submitted the data to the grid, which you check by looking for handlers. If you have no handlers, it is because you have not submitted yet. You then go to the Submission state, and after submission you return to the initial state because it is too early to check whether the results are ready.
Next time you type:
./easygrid dataset_name
you already have tcl files and handlers (tokens). You go to the Status state, where the grid answers questions about the status of each of your jobs.
They can be Ready, Scheduled, Running and, more importantly, Done. If all are Done, the processing is finished and you can recover the results (Recover).
If not, you have to try again and again (OK, take it easy! Drink a coffee, read a good book about gauge theory, ... and try again).
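The "are all jobs Done?" check can be sketched as a small filter over saved status output. This is an illustration only: the name count_pending is hypothetical, and it assumes each job contributes one line of the form "Current Status: <state>", so that the state is the third whitespace-separated field.

```bash
#!/bin/bash
# count_pending: read saved status output on stdin and print how many jobs
# are not yet Done. Assumes one "Current Status: <state>" line per job,
# with the state as the third field.
count_pending() {
    awk '/Current Status:/ { if ($3 != "Done") pending++ }
         END { print pending + 0 }'
}
```

A result of 0 means every job is Done and the results can be recovered; any other value means it is time for that coffee.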
The script manages the handlers and stores them in files, and it tracks the state by the presence or absence of these control files.
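The idea of deriving the state from control files can be sketched as follows. The function names are hypothetical, and the file names (dataset.tcl or dataset-*.tcl for the tcl files, dataset.tokens for the stored handlers) are assumptions about how the control files are named. The sketch only reports the first state that applies; it does not run anything.

```bash
#!/bin/bash
# has_tcl DATASET DIR: succeed if DIR holds DATASET.tcl or any DATASET-*.tcl.
has_tcl() {
    [ -f "$2/$1.tcl" ] || compgen -G "$2/$1-*.tcl" > /dev/null
}

# next_state DATASET [DIR]: print the state implied by the control files
# found in DIR (default: current directory).
next_state() {
    local dataset=$1 dir=${2:-.}
    if ! has_tcl "$dataset" "$dir"; then
        echo DatasetTcl      # no tcl files yet: they must be generated first
    elif [ ! -f "$dir/$dataset.tokens" ]; then
        echo Submission      # tcl files exist but no handlers: not submitted yet
    else
        echo Status          # handlers exist: ask the grid how the jobs are doing
    fi
}
```

Because the state lives entirely in the files, the same command can be typed again at any time and it picks up where the previous run stopped.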
1 /*
2 Generation of the JDL file and the commands for the GRID
3 Author: Dr James Cunha Werner
4 www.geocities.com/jamwer2002
5 University of Manchester
6 */
7 #include <stdio.h>
8 #include <stdlib.h>
9 #include <string.h>
10 int main (int argc, char *argv[])
11 {
12 FILE *arqjdl,*arqtcl,*arqsh;
13 char nomearq[300],nomedata[300],nomesh[300],nomebase[300];
14 int i;
15
16 strcpy(nomebase,argv[1]);
17 for(i=strlen(nomebase)-1;i>0;i--)
18 if(nomebase[i]=='.')
19 nomebase[i]='\0';
20
21 strcpy(nomedata,"Job");
22 strcat(nomedata,nomebase);
23 strcat(nomedata,".tcl");
24 arqtcl=fopen(nomedata,"w");
25 fprintf(arqtcl,"set ConfigPatch Run2\n");
26 fprintf(arqtcl,"set levelOfDetail cache\n");
27 fprintf(arqtcl,"set BetaMiniTuple hbook\n");
28 fprintf(arqtcl,"set histFileName %s.hbk\n",nomebase);
29 fprintf(arqtcl,"set NEvent 1000000000\n");
30 fprintf(arqtcl,"source %s\n",argv[1]);
31 fprintf(arqtcl,"sourceFoundFile recodata.tcl\n");
32 fclose(arqtcl);
33
34 strcpy(nomesh,nomebase);
35 strcat(nomesh,".sh");
36 arqsh=fopen(nomesh,"w");
37 fprintf(arqsh,"#!/bin/bash\n");
38 fprintf(arqsh,"echo This program was run on host `/bin/hostname`\n");
39 fprintf(arqsh,"echo Start time: `/bin/date`\n");
40 fprintf(arqsh,"echo \n");
41 fprintf(arqsh,"echo Initialising BaBar variables\n");
42 fprintf(arqsh,"local=`pwd`\n");
43 fprintf(arqsh,"echo Directory with the input sandbox: $local \n");
44 fprintf(arqsh,"if [ -f /etc/bashrc ]; then\n");
45 fprintf(arqsh," . /etc/bashrc\n");
46 fprintf(arqsh,"fi\n");
47 fprintf(arqsh,"if [ -r /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc ]; then\n");
48 fprintf(arqsh," . /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc\n");
49 fprintf(arqsh,"else \n");
50 fprintf(arqsh," echo BaBar setup file /afs/hep.man.ac.uk/g/bfactory/etc/hepix/bashrc not found\n");
51 fprintf(arqsh,"fi\n");
52 fprintf(arqsh,"echo \n");
53 fprintf(arqsh,"echo -----------------------------------------------\n");
54 fprintf(arqsh,"echo \n");
55 fprintf(arqsh,"cd $BFDIST/releases/14.5.2 \n");
56 fprintf(arqsh,"srtpath 14.5.2 Linux24RH72_i386_gcc2953 \n");
57 fprintf(arqsh,"cd $local \n");
58 fprintf(arqsh,"echo Available files: $local \n");
59 fprintf(arqsh,"ls \n");
60 fprintf(arqsh,"echo \n");
61 fprintf(arqsh,". ./fullboot.sh \n");
62 fprintf(arqsh,"ln -s $BFDIST/releases/14.5.2 PARENT\n");
63 fprintf(arqsh,"/exp_software/babar01/BetaMiniApp %s\n",nomedata);
64 fprintf(arqsh,"echo \n");
65 fprintf(arqsh,"echo ----------------------------------------------\n");
66 fprintf(arqsh,"echo \n");
67 fprintf(arqsh,"echo End time: `/bin/date`\n");
68 fclose(arqsh);
69
70 strcpy(nomearq,nomebase);
71 strcat(nomearq,".jdl");
72 arqjdl=fopen(nomearq,"w");
73 fprintf(arqjdl,"Executable=\"%s.sh\";\n",nomebase);
74 fprintf(arqjdl,"InputSandbox={\"%s\",\"%s\",\"%s\",\"fullboot.sh\",\"recodata.tcl\"};\n",
75 nomesh,nomedata,argv[1]);
76 fprintf(arqjdl,"StdOutput=\"std.out\";\n");
77 fprintf(arqjdl,"StdError=\"std.err\";\n");
78 fprintf(arqjdl,"OutputSandbox={\"std.out\",\"std.err\",\"%s.hbk\"};\n",nomebase);
79 fprintf(arqjdl,"Requirements = other.GlueCEUniqueID == \"bfa.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-short\" ;\n");
80 fclose(arqjdl);
81
82
83
84 printf("edg-job-submit --vo babar -r bfa.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-long %s >> gridtokens\n",nomearq);
85 return 0;
86 }
The program generates three files in turn. The first is the TCL file (lines 21 to 32) described in Table 6 of the tutorial. The second (lines 34 to 68) is the script that configures the environment and then runs BetaMiniApp. Finally, the JDL file is created (lines 70 to 80) to tell the grid manager what it has to do. Each part is described in the LCG2 User manual. The output of gera.c is the command line that submits the job (line 84).
You have to compile the software using:
gcc gera.c -o gera
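To see what the JDL step produces for one input file, here is a shell rendering of lines 70 to 80 of gera.c. It is a sketch, not a replacement for the program: the function name jdl_for is hypothetical, and the input file name used in the example is only assumed to follow the dataset-N.tcl pattern.

```bash
#!/bin/bash
# jdl_for TCLFILE: print the JDL text that gera.c writes for one tcl file
# (hypothetical shell rendering of lines 70 to 80 of the C listing).
jdl_for() {
    local tcl=$1
    local base=${tcl%%.*}   # name up to the first dot, like nomebase in gera.c
    cat <<EOF
Executable="$base.sh";
InputSandbox={"$base.sh","Job$base.tcl","$tcl","fullboot.sh","recodata.tcl"};
StdOutput="std.out";
StdError="std.err";
OutputSandbox={"std.out","std.err","$base.hbk"};
Requirements = other.GlueCEUniqueID == "bfa.tier2.hep.man.ac.uk:2119/jobmanager-lcgpbs-short" ;
EOF
}
```

Calling jdl_for SP-1005-Tau11-R14-1.tcl shows that the Input Sandbox carries the run script, the generated Job tcl file, the original tcl file, fullboot.sh and recodata.tcl, while the Output Sandbox returns std.out, std.err and the .hbk histogram file.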
To implement the state machine described above, there is the following script, called "easygrid":
#!/bin/bash
# GRID submission script
# Author: Dr James Cunha Werner
# www.geocities.com/jamwer2002
# University of Manchester
#
#edg-voms-proxy-init
echo Searching pre selected skimdata.
if (! (ls $1.tcl>/dev/null 2>/dev/null || ls $1-*.tcl>/dev/null 2>/dev/null)) then
    echo skimData files not found ! Running BbkDatasetTcl.
    BbkDatasetTcl --site man -t 1000 -ds $1 -b $1
    rm -f $1.tokens > /dev/null 2>/dev/null
fi
echo Searching previous handlers.
if (! ls $1.tokens>/dev/null 2>/dev/null) then
    echo Handlers not found. Submitting to GRID . Wait end of process...
    cp /home/jamwer/PgmCM2/workdir/bin/Linux24RH72_i386_gcc2953/BetaMiniApp /exp_software/babar01
    ls $1*.tcl > gridtrab
    rm -f gridgera gridsub gridtokens $1.tokens $1.subm $1.histo > /dev/null 2>/dev/null
    awk '// {print "./gera",$1," >> gridsub "}' gridtrab > gridgera
    chmod 700 gridgera
    ./gridgera
    chmod 700 gridsub
    ./gridsub
    awk '/https/ {print $2}' gridtokens >> $1.tokens
    mv gridtokens $1.subm
    subok=`cat $1.tokens | wc -l`
    if [ $subok == 0 ]
    then
        echo Submission abended with errors:
        rm -f $1.tokens 2>/dev/null
        cat $1.subm
        echo
        echo --------------------------------------------------------------------------------------------------
        echo Send the file $1.subm to jamwer@hep.man.ac.uk for diagnostic.
        echo --------------------------------------------------------------------------------------------------
        echo
    fi
else
    echo Checking if jobs finished.
    rm gridpend gridstat aux > /dev/null 2>/dev/null
    awk '// {print "edg-job-status ",$1," >> gridpend"}' $1.tokens >> gridstat
    chmod 700 gridstat
    ./gridstat
    cat gridpend | awk 'BEGIN {final=0} /Status info for the Job/ {HandleName=$7; print "### Handle -> " $7} /Current Status:/ {print " " $1,$2,$3; if ($3 != "Done") {final+=1; print HandleName " still pending.";}} /Exit code:/ {print " " $1,$2,$3} END {print final " jobs have not finished ! Try again later." }'
    final=`cat gridpend | awk 'BEGIN {final=0} /Current Status:/ {if ($3 != "Done") {final+=1}} END {print final}'`
    cat gridpend >> $1.histo
    if [ $final == 0 ]
    then
        echo All jobs done. Recovering results in your folder.
        rm -f /exp_software/babar01/BetaMiniApp
        rm gridrec $1.recres gridgetout > /dev/null 2>/dev/null
        cat $1.tokens | awk '// {print "edg-job-get-output --dir . ",$1," >> gridgetout"}' >> gridrec
        chmod 700 gridrec
        ./gridrec
        mv gridgetout $1.recres
        echo
        echo Results in the following folders:
        cat $1.recres | grep `pwd`
        rm -f $1.tokens > /dev/null 2>/dev/null
    fi
fi
The trick is to generate scripts on the fly and run them on the following line, depending on the conditions!
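The generate-then-run pattern can be shown in isolation. In this sketch the function name run_for_each is hypothetical and "echo checking" is a harmless stand-in for a real grid command; the pattern itself (awk writes one command per list entry into a file, which is then executed) is the same one easygrid uses.

```bash
#!/bin/bash
# run_for_each LIST SCRIPT: write one command per line of LIST into SCRIPT,
# then execute SCRIPT. "echo checking" stands in for a real grid command.
# Entries are pasted into the generated script unquoted, so they must not
# contain spaces or shell metacharacters.
run_for_each() {
    local list=$1 script=$2
    awk '{ print "echo checking", $1 }' "$list" > "$script"
    bash "$script"
}
```

Running run_for_each mydataset.tokens gridcheck (placeholder file names) leaves the generated commands in gridcheck, which is handy for inspecting exactly what was executed when something goes wrong.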
Feedback to: jamwer@hep.man.ac.uk