Execution Statistics

HTCondor DAGMAN Log Files

The HTCondor DAGMAN utility writes a variety of files to your $HOME/condor_scratch directory. One of the files is a log of all processing done through HTCondor. All log messages are in the form of records, which are delimited by an ellipsis ("...").

Here is a portion of one of the files, showing “updated”, “submitted” and “terminated” records:

006 (61916.000.000) 10/04 11:48:51 Image size of job updated: 120060
    0  -  ResidentSetSize of job (KB)
...
000 (61931.000.000) 10/04 11:48:56 Job submitted from host: <141.142.225.136:37267>
    DAG Node: A51
...
005 (62891.000.000) 10/04 12:54:29 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:11:24, Sys 0 00:00:05  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:11:24, Sys 0 00:00:05  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    2647  -  Run Bytes Sent By Job
    1912  -  Run Bytes Received By Job
    2647  -  Total Bytes Sent By Job
    1912  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request
       Cpus                 :                 1
       Disk (KB)            :       12       12
       Memory (MB)          :               871
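
Each record can be recovered by splitting the log on the "..." delimiter lines. The following is a minimal sketch of that parsing (an illustration only, not part of ctrl_stats; the function name and file path are hypothetical):

def read_records(path):
    # Return a list of records (each a list of lines) from a Condor log,
    # splitting on the "..." lines that delimit records.
    records = []
    current = []
    with open(path) as log:
        for line in log:
            if line.strip() == "...":
                if current:
                    records.append(current)
                current = []
            else:
                current.append(line.rstrip("\n"))
    if current:
        records.append(current)
    return records

# Example (hypothetical path):
# records = read_records("logs/worker-pre.log")
# print(records[0][0])   # first line holds the event code, job id, and timestamp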

ctrl_stats package

The ctrl_stats package contains commands to ingest Condor log files into a database by taking the event “records” HTCondor emits during execution. The utilities group all of these records by Condor job id in order to get an overview of what happened during the job. These records are then reduced into two tables: one that describes each submitted DAG node, and one that describes all of the events that happen to each Condor job id. Note that a DAG node can be submitted more than once for a variety of reasons; all of this information is captured in the database records. Note that you must have database write privileges to execute the ingest commands.
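
As an illustration of the grouping step (a sketch only, not the package's actual implementation), the Condor job id can be taken from the parenthesized field on the first line of each record, e.g. “005 (62891.000.000) ...”, and used as the grouping key:

import re
from collections import defaultdict

# Matches the event code and job id at the start of a record's first line,
# e.g. "000 (61931.000.000) 10/04 11:48:56 Job submitted from host: ..."
HEADER_RE = re.compile(r'^(\d{3}) \((\d+\.\d+\.\d+)\)')

def group_by_job_id(records):
    # Group records (lists of lines, as produced by the earlier sketch)
    # by the Condor job id found in each record's header line.
    groups = defaultdict(list)
    for record in records:
        match = HEADER_RE.match(record[0])
        if match:
            groups[match.group(2)].append(record)
    return groups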

The following commands are available:

condorLogIngest.py

Takes a list of log files and ingests them into a database

usage: condorLogIngest.py [-h] -H HOST [-p PORT] -d DATABASE -f FILENAMES [FILENAMES ...] [-v]

example:

cd ~/condor_scratch/srp_2012_1211_090023
condorLogIngest.py -H lsst10 -p 3600 -d testing -f logs/worker-pre.log

condorDirectoryIngest.py

Given a directory in ~/condor_scratch, takes the preJob, worker and postJob log files and ingests them into a database named after the directory.

usage: condorDirectoryIngest -H HOST -p PORT -d directory

example:

cd ~/condor_scratch
condorDirectoryIngest -H lsst10 -p 3600 -d srp_2012_0925_160117

condorLogInfo.py

A debugging utility to view record groups. It is useful for viewing records without having to do a complete database ingest.

usage: condorLogInfo.py [-h] [-v] [-c CONDORIDS [CONDORIDS ...]] -f FILENAMES [FILENAMES ...]

example:

cd ~/condor_scratch
condorLogInfo.py -c 630.000.000 -f srp_2012_0925_160117/*nodes.log

Record Groups

The following are examples of how these individual records are grouped:

Job was submitted, executed, updated its image size to 118472, and was evicted because job id 68101 (the DAGMan job which submitted this DAG) was removed.

Submitted 2012-10-18 16:54:01 jobNum=68103.000.000 dagNode=A1
Executing 2012-10-18 16:54:03 host=198.202.102.162:53706
Updated 2012-10-18 16:54:12 imageSize=118472
Evicted 2012-10-18 16:57:04
Aborted 2012-10-18 16:57:04  removed because <OtherJobRemoveRequirements = DAGManJobId == 68101> fired when j

Job was submitted, but never executed, and was aborted when the DAGMan job was removed.

Submitted 2012-10-18 16:57:02 jobNum=68247.000.000 dagNode=A145
Aborted 2012-10-18 16:57:03  removed because <OtherJobRemoveRequirements = DAGManJobId == 68101> fired when j

Job was submitted, executed, updated its image size to 117796, then again updated its image size to 689000, and terminated.

Submitted 2012-10-03 10:57:10 jobNum=59179.000.000 dagNode=A1078
Executing 2012-10-03 10:57:12 host=198.202.101.70:58180
Updated 2012-10-03 10:57:21 imageSize=117796
Updated 2012-10-03 11:02:22 imageSize=689000
Terminated 2012-10-03 11:07:49

Job was submitted, executed on node 198.202.100.66, and updated its image size to 100916 and then to 891028; communication with the node was lost, reconnection failed once, execution restarted on node 198.202.100.198, and the job then terminated.

Submitted 2012-10-04 09:39:07 jobNum=60540.000.000 dagNode=A6
Executing 2012-10-04 09:39:28 host=198.202.100.66:39897
Updated 2012-10-04 09:39:36 imageSize=100916
Updated 2012-10-04 09:44:36 imageSize=891028
SocketLost 60540.000.000 2012-10-04 09:53:55
SocketReconnectFailure 60540.000.000 2012-10-04 10:13:55
Executing 2012-10-04 10:13:58 host=198.202.100.198:43111
Terminated 2012-10-04 10:25:29

Job was submitted, executed on node 198.202.101.206, and updated its image size to 119476; an exception occurred in the Shadow daemon, execution restarted on 198.202.101.110, and the image size was updated to 120492 and then to 682660; the connection to the node was then lost, reconnection failed, execution restarted on 198.202.101.185, and the job then terminated.

Submitted 2012-10-04 09:41:41 jobNum=60660.000.000 dagNode=A126
Executing 2012-10-04 09:42:08 host=198.202.101.206:51636
Updated 2012-10-04 09:42:17 imageSize=119476
ShadowException 60660.000.000 2012-10-04 09:48:18
Executing 2012-10-04 09:48:45 host=198.202.101.110:48658
Updated 2012-10-04 09:48:53 imageSize=120492
Updated 2012-10-04 09:54:24 imageSize=682660
SocketLost 60660.000.000 2012-10-04 09:56:47
SocketReconnectFailure 60660.000.000 2012-10-04 10:16:47
Executing 2012-10-04 10:16:52 host=198.202.101.185:60256
Terminated 2012-10-04 10:29:13
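
The labels used in the listings above correspond to Condor event codes (see the Event Code Table below). A minimal sketch of that correspondence (illustration only, not code from ctrl_stats), assuming records parsed as lists of lines as in the earlier sketches:

# Sketch only: a few Condor event codes mapped to the labels used above.
EVENT_LABELS = {
    "000": "Submitted",
    "001": "Executing",
    "004": "Evicted",
    "005": "Terminated",
    "006": "Updated",
    "007": "ShadowException",
    "009": "Aborted",
    "022": "SocketLost",
    "024": "SocketReconnectFailure",
}

def label_for(record):
    # The event code is the first three characters of the record's first line.
    return EVENT_LABELS.get(record[0][:3], "Unknown")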

Database Table Formats

Nodes Table

The following is the definition of the nodes table which ctrl_stats writes into:

CREATE TABLE IF NOT EXISTS `nodes` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `condorId` varchar(24) default NULL,
  `dagNode` varchar(10) default NULL,
  `submitTime` datetime default NULL,
  `executionHost` varchar(24) default NULL,
  `executionStartTime` datetime default NULL,
  `executionStopTime` datetime default NULL,
  `updateImageSize` int(11) default NULL,
  `updateMemoryUsageMB` int(11) default NULL,
  `updateResidentSetSizeKB` int(11) default NULL,
  `userRunRemoteUsage` int(11) default NULL,
  `sysRunRemoteUsage` int(11) default NULL,
  `finalMemoryUsageMB` int(11) default NULL,
  `finalMemoryRequestMB` int(11) default NULL,
  `bytesSent` int(11) default NULL,
  `bytesReceived` int(11) default NULL,
  `terminationTime` datetime default NULL,
  `terminationCode` varchar(3) default NULL,
  `terminationReason` varchar(4096) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  • condorId - the number Condor assigns for this job
  • dagNode - the DAGMan node name associated with this job
  • submitTime - the time that DAGMan submitted this job to Condor
  • executionHost - The host on which this job started executing. This may be blank if the job was never executed.
  • executionStartTime - The time at which this job started executing. This may be listed as “0000-00-00 00:00:00” if the job was never started.
  • executionStopTime - The time at which this job stopped executing. This may be listed as “0000-00-00 00:00:00” if the job did not terminate normally.
  • updateImageSize - Condor can sometimes send out information about the imageSize of the job. This is the last update received before the job ended. This may be 0.
  • updateMemoryUsageMB - In updates, Condor sometimes adds additional information about the memory usage in MB. This may be 0, if this information is not reported.
  • updateResidentSetSizeKB - In updates, Condor sometimes adds additional information about the portion of the process’s memory that is held in RAM. This may be 0, if this information is not reported.
  • userRunRemoteUsage - At termination, the amount of remote user time that this job took to execute.
  • sysRunRemoteUsage - At termination, the amount of remote system time that this job took to execute.
  • finalMemoryUsageMB - At termination, the last memory usage statistic reported by Condor.
  • finalMemoryRequestMB - At termination, the last memory request statistic reported by Condor. This may be 0.
  • bytesSent - the number of bytes sent to this job.
  • bytesReceived - the number of bytes received from this job.
  • terminationTime - The time recorded for this job’s termination.
  • terminationCode - The Condor event code for this terminated job.
  • terminationReason - The text description reported by Condor as the reason for this job’s termination.
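
Once the logs have been ingested, the nodes table can be queried directly. Below is a minimal sketch that lists the ten longest node executions by wall-clock time; it assumes the MySQLdb client module (not provided by ctrl_stats) and reuses the host, port, and database names from the earlier examples, with placeholder credentials:

import MySQLdb   # assumed client library; any MySQL client would do

conn = MySQLdb.connect(host="lsst10", port=3600, db="testing",
                       user="reader", passwd="placeholder")
cur = conn.cursor()
# Wall-clock execution time per node, longest first.
cur.execute("""
    SELECT dagNode, condorId,
           TIMEDIFF(executionStopTime, executionStartTime) AS wallClock
    FROM nodes
    WHERE executionStartTime IS NOT NULL
      AND executionStopTime IS NOT NULL
    ORDER BY wallClock DESC
    LIMIT 10
""")
for dagNode, condorId, wallClock in cur.fetchall():
    print("%s %s %s" % (dagNode, condorId, wallClock))
conn.close()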

Event Code Table

Each time Condor outputs a record, it includes an event code.

Condor event codes.
eventCode eventName
000 Job submitted
001 Job executing
002 Error in executable
003 Job was checkpointed
004 Job evicted from machine
005 Job Terminated
006 Image size of job updated
007 Shadow exception
008 Generic log event
009 Job aborted
010 Job was suspended
011 Job was unsuspended
012 Job was held
013 Job was released
014 Parallel node executed
015 Parallel node terminated
016 POST script terminated
017 Job submitted to Globus
018 Globus submit failed
019 Globus resource up
020 Detected Down Globus Resource
021 Remote error
022 Remote system call socket lost
023 Remote system call socket reestablished
024 Remote system call reconnect failure
025 Grid Resource Back Up
026 Detected Down Grid Resource
027 Job submitted to grid resource
028 Job ad information event was triggered
029 The job’s remote status is unknown
030 The job’s remote status is known again
031 unused
032 unused
033 Attribute update