dstat Notes
As of June-2015, we are running dstat every 10 seconds on the tcs1 and tcs2 machines. The logs are sent to the same place as the syslog/event logs for TCS.
To view the data seamlessly with dygraphs, the timestamp has to be in Unix time, in milliseconds. In Dec-2015, we modified the script to use Unix time and to write CSV.
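The conversion itself is small. dstat's epoch column holds Unix time in seconds with a fractional part, so dropping the decimal point yields milliseconds. A minimal sketch, assuming the fraction always has exactly three digits (as the comment in the script below implies); the helper name epoch_to_ms is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: turn "seconds.mmm" into integer milliseconds by
# dropping the decimal point. Assumes exactly three fractional digits.
epoch_to_ms() {
    printf '%s\n' "$1" | tr -d '.'
}

epoch_to_ms 1450051210.123   # prints 1450051210123
```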
The script is run as a service, installed in /etc/systemd/system/dstat.service:
[tcs@tcs1 2015]$ more /etc/systemd/system/dstat.service
[Unit]
Description=run dstat
[Service]
Type=simple
ExecStart=/usr/bin/su - tcs /home/tcs/bin/dstat.sh
[Install]
WantedBy=multi-user.target
To see that it's running:
[tcs@tcs1 ~]$ service dstat status
Redirecting to /bin/systemctl status dstat.service
dstat.service - run dstat
Loaded: loaded (/etc/systemd/system/dstat.service; enabled)
Active: active (running) since Tue 2015-12-01 16:46:26 UTC; 2 weeks 1 days ago
Main PID: 14330 (su)
CGroup: /system.slice/dstat.service
‣ 14330 /usr/bin/su - tcs /home/tcs/bin/dstat.sh
The script that's actually executing:
#!/bin/sh
## runs as dstat.service from /etc/systemd/system
## use:
## systemctl stop|start|enable dstat.service
## run dstat every 10 seconds, dumping the output
## to csv files in the /lbt/log directory (rollover daily)
while true ; do
    logfile="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).dstat.log)"
    csvname="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).dstat.csv)"
    ## a new csv name means the date has rolled over (or this is the first pass)
    if [ "${csvname}" != "${csvname_old}" ]; then
        ## delete the decimal in the timestamp, we want milliseconds
        if [ -n "${csvname_old}" ]; then
            perl -p -i -e 's/\.//' "${csvname_old}"
        fi
        ## wait for the dated log directory to exist
        while [ ! -d "$(dirname "${csvname}")" ]; do
            sleep 1
        done
        ## stop the previous day's dstat before starting a new one
        if [ -n "${dstat_pid}" ]; then
            kill -9 "${dstat_pid}"
        fi
        /usr/bin/dstat -Tamp --lock --ipc --socket --tcp --output "${csvname}" --udp --vm 10 \
            >> "${logfile}" &
        dstat_pid=$!
        csvname_old=$csvname
    fi
    sleep 1
done
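The daily rollover works because both file names are rebuilt from the current date on every pass, so the name changes when the date does. A sketch of that naming convention as a standalone function; dated_log_path is a hypothetical name, and taking the date and host as arguments instead of calling date and hostname is a change made here purely for illustration:

```shell
#!/bin/sh
# Hypothetical helper: build /lbt/log/YYYY/MM/YYYYMMDD-host.dstat.SUFFIX
# from explicit arguments instead of calling date and hostname.
dated_log_path() {
    year=$1 month=$2 day=$3 host=$4 suffix=$5
    printf '/lbt/log/%s/%s/%s%s%s-%s.dstat.%s\n' \
        "$year" "$month" "$year" "$month" "$day" "$host" "$suffix"
}

dated_log_path 2015 12 01 tcs1 csv   # prints /lbt/log/2015/12/20151201-tcs1.dstat.csv
```

Comparing the path for the current date with the one from the previous pass is all the script needs to detect that a new day has started.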
If you use our "local" version of the dygraphs graph page, you can tell it to start parsing the CSV file after all the header info:
http://telemetry.lbto.org/visualization/graph_local.html?label_start=5&data_start=6
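To check where the header ends in a particular file, a sketch like the following may help. It assumes that header lines never start with a digit and that every data row starts with the epoch timestamp. The name first_data_line is hypothetical, and whether graph_local.html counts label_start and data_start from 0 or from 1 is not stated here, so compare the result against a file that is known to graph correctly.

```shell
#!/bin/sh
# Hypothetical helper: print the 1-based number of the first line that
# starts with a digit, taken here to be the first data row.
first_data_line() {
    awk '/^[0-9]/ { print NR; exit }' "$1"
}
```

It would be called with the path of one of the daily CSV files, for example first_data_line /lbt/log/2015/12/20151201-tcs1.dstat.csv.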
top Notes
In October-2015, we started running top on the tcs1 and tcs2 machines. The top.sh script is installed the same way as the dstat service above.
In January-2016, we found discrepancies in the top data: spikes that occur only on the "first" iteration of top. Since we were running with -n 1, every sample was a "first iteration". We had done this because top wouldn't dump to a file (flushing problem?) when we used -n 4320 (a day's worth of iterations) and appended the output. It did work, however, when we piped the output instead of appending.
The script is now:
#!/bin/sh
## top script that runs for a whole day (-n 4320 iterations), then completes
## writes to dated filename in logs directory
##
## started as a cron job each day just after 5pm (00:00 UTC)
##
today=`date +%Y%m%d`
year=`date +%Y`
month=`date +%m`
host=`hostname -s`
filename="$(date +/lbt/log/%Y/%m/%Y%m%d-$(hostname -s).top.log)"
/usr/bin/top -d 20 -n 4320 -b | egrep 'top|Tasks|Cpu|Kib|PID|tcs|systemd|vmetad|dsm|gateway|ioc|log|caRepeater|shm' | egrep -v 'ssh|bash|dsm_|sleep' | tee ${filename}
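The iteration count in the script above follows from the delay: top samples every 20 seconds (-d 20), and a day has 86400 seconds. As a quick check (iterations_per_day is a made-up helper name):

```shell
#!/bin/sh
# Hypothetical helper: number of top iterations that fit in one day
# for a given delay in seconds.
iterations_per_day() {
    echo $(( 86400 / $1 ))
}

iterations_per_day 20   # prints 4320
```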
The top files can be used with dygraphs after being digested by a perl script. The script below takes YYYY MM DD as input and finds the log file locally for both tcs1 and tcs2. It builds a csv file with a date in unix time (milliseconds). The tcs_programs variable can be modified to include more processes (GUIs for instance).
#!/bin/env perl
use strict;
use POSIX qw(strftime);
use Time::Local qw(timegm);
my @tcs_programs = qw(syslogserver LSS DDS IIF ENV ECS MCS PCS OSS PMCL PMCR PSFL PSFR GCSL GCSR AOSL AOSR );
my $year = shift;
my $month = shift;
my $day = shift;
die "${0} YYYY MM DD\n" unless $year and $month and $day;
print 'timestamp,';
print join(",",@tcs_programs);
print "\n";
for my $system (qw( tcs1 tcs2 )) {
    open(my $fh,"${year}${month}${day}-${system}.top.log") or
        die "Could not open ${year}${month}${day}-${system}.top.log\n";
    my $time = 0;
    my %cpu;
    while(my $line = <$fh>) {
        $line =~ s/^\s+//;
        # each "top - HH:MM:SS up ..." header line starts a new sample
        if($line =~ /^top\s*-\s*(\S+)\s*up/) {
            # print the previous sample before starting the next one
            unless($time == 0) {
                print "${time}000,";
                print join(",",map { "${cpu{$_}}" } @tcs_programs);
                print "\n";
            }
            my ($hour,$minute,$second) = ($1 =~ m/(\d\d):(\d\d):(\d\d)/g );
            $time = timegm($second,$minute,$hour,$day,$month - 1,$year - 1900);
        }
        # skip anything that appears before the first header line
        next if $time == 0;
        if (grep { $line =~ m/$_/ } @tcs_programs) {
            # batch-mode top columns: field 8 is %CPU, field 11 is COMMAND
            my @values = split(/\s+/,$line);
            $cpu{$values[11]} = $values[8];
        }
    }
    close($fh);
}
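The field indexes in the script depend on the column layout of top's batch output. In the default layout, %CPU is the ninth column and COMMAND the twelfth (indexes 8 and 11 when counting from zero). A sketch of the same extraction in awk, which numbers fields from 1; the sample line is invented for illustration and cpu_by_command is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: read top batch-mode process lines on stdin and
# print "COMMAND,%CPU" for each. awk fields are 1-based, so $12 and $9
# correspond to $values[11] and $values[8] in the Perl script.
cpu_by_command() {
    awk '{ print $12 "," $9 }'
}

# Invented sample line, assuming the default top column layout
echo ' 1234 tcs  20  0  123456  7890  1234 S  12.1  0.3  1:23.45 MCS' | cpu_by_command
# prints MCS,12.1
```

If top is configured with a different set of columns, both sets of indexes would need to change.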
The resulting CSV file looks like this:
timestamp,syslogserver,LSS,DDS,IIF,ENV,ECS,MCS,PCS,OSS,PMCL,PMCR,PSFL,PSFR,GCSL,GCSR,AOSL,AOSR
1450051210000,,12.1,,,6.0,,18.1,,,,30.2,,12.1,,36.2,36.2,
1450051230000,,0.0,,,12.4,,24.9,,,,24.9,,18.6,,43.5,31.1,
1450051250000,,18.6,,,6.2,,18.6,,,,31.0,,12.4,,31.0,12.4,
.......
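To sanity-check a row, the millisecond timestamp can be converted back to a readable UTC time. A sketch assuming GNU date (the -d @SECONDS form is a GNU extension); ms_to_utc is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: convert Unix time in milliseconds to a UTC string.
# Relies on GNU date's "-d @SECONDS" syntax.
ms_to_utc() {
    date -u -d "@$(( $1 / 1000 ))" '+%Y-%m-%d %H:%M:%S'
}

ms_to_utc 1450051210000   # prints 2015-12-14 00:00:10
```

The first row of the sample above therefore falls just after 00:00 UTC on 14 December 2015.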
The resulting CSV can be graphed with http://telemetry.lbto.org/visualization/graph_local.html