TCS Performance Monitoring

dstat Notes

As of June-2015, we are running dstat every 10 seconds on the tcs1 and tcs2 machines. The logs are sent to the same place as the syslog/event logs for TCS.

To view the data seamlessly with dygraphs, the timestamp has to be Unix time in milliseconds. In Dec-2015, we modified the script to use Unix time and to write CSV output.
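
The timestamp requirement can be illustrated with a short shell sketch. It assumes the epoch column that dstat writes has exactly three digits after the decimal point (for example 1450051210.123), so deleting the point yields integer milliseconds. The helper name to_millis and the sample line are made up for illustration and are not part of the installed script.

```shell
# Hypothetical helper: convert a leading "seconds.mmm" epoch field into
# integer milliseconds by removing the decimal point.
# Lines that do not start with such a field (CSV headers) pass through unchanged.
to_millis() {
  sed 's/^\([0-9][0-9]*\)\.\([0-9][0-9][0-9]\),/\1\2,/'
}

# Invented sample row: epoch seconds with a millisecond fraction, then two values.
printf '%s\n' '1450051210.123,1.5,0.2' | to_millis
```

Unlike a blanket "delete the first dot on every line" substitution, this pattern only touches lines that begin with a numeric timestamp.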

The script is run as a service, installed in /etc/systemd/system/dstat.service:

[tcs@tcs1 2015]$ more /etc/systemd/system/dstat.service
[Unit]
Description=run dstat

[Service]
Type=simple
ExecStart=/usr/bin/su - tcs /home/tcs/bin/dstat.sh

[Install]
WantedBy=multi-user.target

To see that it's running:
[tcs@tcs1 ~]$ service dstat status
Redirecting to /bin/systemctl status  dstat.service
dstat.service - run dstat
   Loaded: loaded (/etc/systemd/system/dstat.service; enabled)
   Active: active (running) since Tue 2015-12-01 16:46:26 UTC; 2 weeks 1 days ago
 Main PID: 14330 (su)
   CGroup: /system.slice/dstat.service
           ‣ 14330 /usr/bin/su - tcs /home/tcs/bin/dstat.sh

The script that's actually executing:
#!/bin/sh

## runs as dstat.service from /etc/systemd/system
## use:
##    systemctl {stop|start|enable} dstat.service

## run dstat every 10 seconds, dumping the output
## to csv files in the /lbt/log directory (rollover daily)

while true ; do 
  logfile="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).dstat.log)"
  csvname="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).dstat.csv)"
  if [ "${csvname}" != "${csvname_old}" ]; then

    ## delete the decimal in the timestamp, we want milliseconds
    if [ "${csvname_old}" != "" ]; then
       perl -p -i -e 's/\.//' "${csvname_old}"
    fi

    while [ ! -d "$(dirname "$csvname")" ]; do
     sleep 1
    done
    if [ -n "${dstat_pid}" ]; then
      kill -9 $dstat_pid
    fi
    /usr/bin/dstat -Tamp --lock --ipc --socket --tcp --output $csvname --udp --vm  10 \
    >> ${logfile} &
    dstat_pid=$!
    csvname_old=$csvname
  fi
  sleep 1
done
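
The daily rollover in the loop above works because the file name embeds the date: once the date changes, the computed name no longer matches the previous one and dstat is restarted on a new file. A minimal sketch of that name computation follows. It assumes GNU date (for the -d option) and takes the day and host as arguments instead of using the current time and hostname -s, so it can be tried for any day. csv_path_for is a hypothetical helper name.

```shell
# Hypothetical helper mirroring the csvname line in dstat.sh.
# $1 = a date GNU date can parse (e.g. 2015-12-14), $2 = short host name.
csv_path_for() {
  date -u -d "$1" "+/lbt/log/%Y/%m/%Y%m%d-$2.dstat.csv"
}

old=$(csv_path_for 2015-12-14 tcs1)
new=$(csv_path_for 2015-12-15 tcs1)
echo "$old"
echo "$new"

# The script restarts dstat exactly when this comparison is true.
if [ "$new" != "$old" ]; then
  echo "rollover"
fi
```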

If you use our "local" dygraphs page, you can tell it to start parsing the CSV file after the dstat header lines: http://telemetry.lbto.org/visualization/graph_local.html?label_start=5&data_start=6

top Notes

In October-2015, we started running top on the tcs1 and tcs2 machines. The top.sh script is installed the same way as the above dstat service.

In January-2016, we found discrepancies in the top data: some spikes occur only on the "first" iteration of top. Because we were running with -n 1, every sample was a "first iteration". We had used -n 1 because top would not dump to a file (a flushing problem?) when we used -n 4300 (roughly a day's worth of iterations).
However, it worked when we piped the output instead of appending it to the file.
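
For reference, the iteration count for one full day follows directly from the sampling interval. The sketch below just does that arithmetic for the 20-second delay used in the script that follows (top -d 20); the variable names are illustrative.

```shell
interval=20                              # seconds between samples (top -d 20)
seconds_per_day=$(( 24 * 60 * 60 ))      # 86400
iterations=$(( seconds_per_day / interval ))
echo "$iterations"                       # prints 4320
```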

The script is now:

#!/bin/sh

## top script that runs for a whole day (-n 4320 iterations), then completes
## writes to dated filename in logs directory
##
## started as a cron job each day just after 5pm (00:00 UTC)
##

today=`date +%Y%m%d`
year=`date +%Y`
month=`date +%m`
host=`hostname -s`

filename="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).top.log)"

/usr/bin/top -d 20 -n 4320 -b | egrep 'top|Tasks|Cpu|Kib|PID|tcs|systemd|vmetad|dsm|gateway|ioc|log|caRepeater|shm' | egrep -v 'ssh|bash|dsm_|sleep' | tee ${filename} 
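
The two-stage filter in the pipeline above first keeps lines that mention any process or header of interest, then drops lines for shells and helpers. A minimal sketch of that behavior, using grep -E in place of egrep and three invented sample lines. filter_top is a hypothetical helper name.

```shell
# Hypothetical helper wrapping the same two patterns used in top.sh.
filter_top() {
  grep -E 'top|Tasks|Cpu|Kib|PID|tcs|systemd|vmetad|dsm|gateway|ioc|log|caRepeater|shm' |
  grep -E -v 'ssh|bash|dsm_|sleep'
}

# Invented sample lines in top's batch layout:
#  1. a TCS subsystem owned by tcs        -> kept
#  2. a bash shell owned by tcs           -> matched, then removed by the second stage
#  3. an unrelated kernel worker          -> never matched
printf '%s\n' \
  ' 1234 tcs   20 0 1000 100 50 S 12.1 0.1 0:01.00 LSS' \
  ' 2345 tcs   20 0 1000 100 50 S  0.0 0.1 0:00.01 bash' \
  ' 3456 root  20 0    0   0  0 S  0.0 0.0 0:00.01 kworker/0:0' | filter_top
```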

The top files can be used with dygraphs after being digested by a Perl script. The script below takes YYYY MM DD as input and reads the matching log files for both tcs1 and tcs2 from the current directory. It builds a CSV file with timestamps in Unix time (milliseconds). The tcs_programs variable can be modified to include more processes (GUIs, for instance).

#!/bin/env perl
use strict;
use POSIX qw(strftime);
use Time::Local qw(timegm);

my @tcs_programs = qw(syslogserver LSS DDS IIF ENV ECS MCS PCS OSS PMCL PMCR PSFL PSFR GCSL GCSR AOSL AOSR );

my $year = shift;
my $month = shift;
my $day = shift;
die "${0} YYYY MM DD\n" unless $year and $month and $day;

print 'timestamp,';
print join(",",@tcs_programs);
print "\n";

for my $system (qw( tcs1 tcs2 )) {
 my $logname = "${year}${month}${day}-${system}.top.log";
 open(my $fh, '<', $logname) or
  die "Could not open ${logname}: $!\n";
 my $time = 0;
 my %cpu;
 while(my $line = <$fh>) {
  $line =~ s/^\s+//;
  if($line =~  /^top\s*-\s*(\S+)\s*up/) {
    unless($time == 0) {
      print "${time}000,";
      print join(",",map { "${cpu{$_}}" } @tcs_programs); 
      print "\n";
    };
    my ($hour,$minute,$second) = ($1 =~ m/(\d\d):(\d\d):(\d\d)/g );
    $time = timegm($second,$minute,$hour,$day,$month - 1,$year - 1900);
  }
  next if $time == 0;   # skip anything before the first "top -" header
  if (grep { $line =~ m/$_/ } @tcs_programs) {
    # after stripping leading whitespace, index 8 is %CPU and index 11 is COMMAND
    my @values = split(/\s+/,$line);
    $cpu{$values[11]} = $values[8];
  }
 }
close($fh);
}
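
The column positions the script relies on can be checked quickly from the shell. This is a minimal sketch, assuming top's default batch column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). awk numbers fields from 1, so %CPU is field 9 and COMMAND is field 12, which correspond to Perl indices 8 and 11. The helper name cpu_of and the sample line are invented for illustration.

```shell
# Hypothetical helper: print "COMMAND,%CPU" for each process line on stdin.
# Only lines with at least 12 fields and a numeric PID are considered.
cpu_of() {
  awk 'NF >= 12 && $1 ~ /^[0-9]+$/ { print $12 "," $9 }'
}

# Invented sample line in top's default batch layout.
printf '%s\n' ' 4321 tcs  20  0 1234567 23456 7890 S 12.1 0.3 1:23.45 LSS' | cpu_of
```

If top is configured with a different set of columns, the field numbers here (and the indices in the Perl script) would need to change accordingly.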

The resulting CSV file looks like this:
timestamp,syslogserver,LSS,DDS,IIF,ENV,ECS,MCS,PCS,OSS,PMCL,PMCR,PSFL,PSFR,GCSL,GCSR,AOSL,AOSR
1450051210000,,12.1,,,6.0,,18.1,,,,30.2,,12.1,,36.2,36.2,
1450051230000,,0.0,,,12.4,,24.9,,,,24.9,,18.6,,43.5,31.1,
1450051250000,,18.6,,,6.2,,18.6,,,,31.0,,12.4,,31.0,12.4,
  .......
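
The timestamps can be cross-checked from the shell. The sketch below mirrors what the Perl script does with timegm and the appended "000": it converts a UTC date and time to Unix seconds, then to milliseconds. It assumes GNU date (for the -d option), and to_epoch_ms is a hypothetical helper name. The first sample row above corresponds to 2015-12-14 00:00:10 UTC.

```shell
# Hypothetical helper: UTC date and time -> Unix time in milliseconds.
# Arguments: YYYY MM DD HH:MM:SS
to_epoch_ms() {
  secs=$(date -u -d "$1-$2-$3 $4" +%s) || return 1
  echo "${secs}000"
}

to_epoch_ms 2015 12 14 00:00:10   # prints 1450051210000
```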

The resulting CSV can be graphed with http://telemetry.lbto.org/visualization/graph_local.html:

[Figure: TOPWithLegend.png. Graph of TCS subsystem CPU usage using top.]
Topic revision: r4 - 21 Jan 2016, KelleeSummers