Why I love awk:
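{
    # Group events by SIP, P, DIP and DP; the source port ($2) is not part of the key.
    key = $1 "_" $3 "_" $4 "_" $5
    if (!((key "-begin") in x)) {
        # First event for this key: it is the start, the end, and the whole list so far.
        x[key "-begin"] = x[key "-c"] = x[key "-end"] = $6
    } else {
        # Later event: append its time to the list and move the end marker.
        x[key "-c"] = x[key "-c"] " " $6
        x[key "-end"] = $6
    }
}
END {
    for (item in x) {
        # Handle each key once, through its "-begin" entry.
        if (match(item, "begin")) {
            split(item, y, "-")
            # Convert HH:MM:SS.MS to seconds since midnight.
            split(x[y[1] "-begin"], t1, ":")
            split(x[y[1] "-end"], t2, ":")
            tim1 = t1[1]*3600 + t1[2]*60 + t1[3]
            tim2 = t2[1]*3600 + t2[2]*60 + t2[3]
            # Keep only groups spanning at least 30 minutes.
            if ((tim2 - tim1) >= 1800) {
                split(y[1], z, "_")
                print z[1] " " z[2] " " z[3] " " z[4] " = " x[y[1] "-c"]
            }
        }
    }
}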
The awk script above takes a large file (about 68,000 lines in my test) in which each line consists of:
SIP SP P DIP DP TIME
SIP = Source IP
SP = Source Port
P = Protocol
DIP = Destination IP
DP = Destination Port
TIME = Time of event (HH:MM:SS.MS)
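To make the layout concrete, here is a minimal sketch that feeds two made-up records through awk and prints the grouping key next to the time. The addresses, ports, and times are invented for illustration and are not from the real data set; the variable names `sample` and `keys` are likewise hypothetical.

```shell
# Made-up sample records in the SIP SP P DIP DP TIME layout (illustrative only).
sample='10.0.0.1 40000 tcp 10.0.0.2 80 08:00:00.000
10.0.0.1 40001 tcp 10.0.0.2 80 08:31:15.250'

# Print the SIP_P_DIP_DP key and the time for each record; $2 (source port) is skipped.
keys=$(printf '%s\n' "$sample" | awk '{ print $1 "_" $3 "_" $4 "_" $5, $6 }')
printf '%s\n' "$keys"
# 10.0.0.1_tcp_10.0.0.2_80 08:00:00.000
# 10.0.0.1_tcp_10.0.0.2_80 08:31:15.250
```

Both records land on the same key even though their source ports differ, which is what lets the script treat them as one group.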
And it:
- Builds a pseudo multi-dimensional array (awk doesn’t do “true” multidimensional arrays) keyed on SIP_P_DIP_DP plus a "-begin", "-c", or "-end" suffix, holding the first TIME, the space-separated list of every TIME, and the last TIME for that key.
- For each key, converts the start and end times to seconds and drops any group that spans less than 30 minutes (1800 seconds).
- Prints out the matching groups in this format:
SIP P DIP DP = <Start Time> <Middle Time> … <Middle Time> <End Time>
This also reduces the data set to 1,620 lines, since groups that fail the 30-minute test are eliminated and each remaining group is collapsed into a single line.
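The time arithmetic behind the 30-minute test can be sketched on its own. The helper name `to_secs` is hypothetical (the script does the same conversion inline), and the two timestamps are made up for illustration.

```shell
# to_secs is a hypothetical helper name; the script above does this inline.
duration=$(awk '
function to_secs(t,    p) {
    split(t, p, ":")                       # p[1]=HH, p[2]=MM, p[3]=SS.MS
    return p[1] * 3600 + p[2] * 60 + p[3]
}
BEGIN { printf "%.2f\n", to_secs("08:31:15.250") - to_secs("08:00:00.000") }
')
echo "$duration"    # 1875.25

# Apply the same 30-minute (1800-second) cutoff as the script.
verdict=$(awk -v d="$duration" 'BEGIN { if (d >= 1800) print "keep"; else print "drop" }')
echo "$verdict"     # keep
```

Because awk converts "15.250" to a number on the fly, the fractional seconds carry through without any extra parsing.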
AND it runs at ridiculous speed. Maybe on a faster box it would be ludicrous speed, but on my test box (rather old), it does this in about 2.25 seconds. Rather fast if you consider the job, I think.
Other tests:
Data Set Size (lines) / Time = Rate (lines/sec)
67,741 / 2.25s = 30,107.10/sec
556,834 / 29.12s = 19,122.05/sec
1,113,668 / 91.95s = 12,111.67/sec
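The rate column is just lines divided by seconds. As a rough sketch, a small shell function (the name `rate` is hypothetical) can recompute a row, here the 556,834-line run:

```shell
# rate is a hypothetical helper: lines processed per second, to two decimals.
rate() {
    awk -v n="$1" -v t="$2" 'BEGIN { printf "%.2f\n", n / t }'
}

second_row=$(rate 556834 29.12)
echo "$second_row"    # 19122.05
```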