extractsyslog.sh: Extracting syslog messages
Usage
extractsyslog.sh <pcapfilename>
Description
This command will display the syslog message text from syslog packets contained in the libpcap file specified on the command line. You can only specify one file at a time because tshark processes only a single -r option and does not read capture data from standard input.
Typical Usage
It is useful for extracting log messages that were sent to a non-existent log server (a deliberate configuration that avoids revealing the IP address of another real host, which would be a potential target).
Script
#!/bin/sh
tshark -Tfields -e syslog.msg -r "$1" syslog
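A hypothetical invocation (the capture filename and the output shown are illustrative only) might look like this:

$ ./extractsyslog.sh capture.pcap
sshd[2345]: Failed password for invalid user admin from 192.0.2.1 port 4242 ssh2
su(pam_unix)[2346]: authentication failure; logname= uid=0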
mergecaps.awk: Merging many libpcap files
Usage
awk -f mergecaps.awk [-v count=num] [-v prefix=prefix]
Description
This awk script generates a shell script that merges pcap files, with num pcap files (the default is 168) merged to create each output pcap file. The output files are named prefixnnnn.pcap, where nnnn is a sequential number starting at 0000.
It reads a list of libpcap filenames from standard input (stdin) and outputs a series of mergecap commands to standard output (stdout).
Typical Usage
It can be used in situations where you have a large number of libpcap files to merge and specifying each of them on the command line will either exceed the shell’s command line length limit or cause mergecap to exceed the number of open files limit.
The default count of 168 will merge the hourly pcap files created by the Honeynet Project's Honeywall into weekly pcap files (168 being 24 hours per day × 7 days per week).
You can generate a list of filenames using the ls -1 command (that option is the digit one, which instructs ls to output one filename per line, as opposed to its usual multi-column output when it is run without any options). Another option would be to use the find command with the -print action.
Whatever command you use, it must generate a list consisting only of filenames, and the filenames must be paths to the libpcap files (accessible from the directory where you will run the output script of mergecap commands) that you wish to merge.
The command used with the Honeywall’s pcap files was ls -1 1*/log in the /var/log/pcap/ directory. The 1* (digit one) pattern matched all of the subdirectories containing the log pcap files.
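Putting the pieces together, a hypothetical end-to-end invocation (the week_ prefix and the merge.sh filename are examples, not requirements) would be:

ls -1 1*/log | awk -f mergecaps.awk -v prefix=week_ > merge.sh
sh merge.sh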
Script
BEGIN {
    file = 0;
    if (count == 0)
        count = 168;
}
{
    if (NR % count == 1) {
        printf("\nmergecap -w %s%04d.pcap %s", prefix, file++, $0);
    } else {
        printf(" %s", $0);
    }
}
END {
    printf("\n");
}
Explanation
file is a variable used to increment a number on the end of the filename of the mergecap output file. Without this, each mergecap command in the output script would clobber the output of the previous mergecap commands.
count is an optional variable specified on the awk command line with ‘-v count=…’. This is the number of pcap files to merge into each output pcap file. It defaults to 168 (24 * 7) if not specified, as I originally wrote it to merge hourly pcap files from a honeywall box into weekly pcap files.
prefix is an optional variable, the value of which is used as the start of the filename of the output files from mergecap. It will have the sequential nnnn.pcap appended to it.
NR is a special awk variable that is equal to the current record number, that is the current line number in the input data. In this case, it will indicate which input pcap file we are processing (the first, second, third, etc.).
NR % 168 == 0 (% is the mod, or modulus, operator and returns the remainder after division) would be a way of saying ‘if NR is divisible by 168’ (that is, the remainder of NR divided by 168 is 0). As you can see, I didn’t use that, but opted for NR % 168 == 1 instead. This was to make the programming a bit more efficient.
The first input record is numbered, by awk, as 1. If I checked for a remainder of 0, then the line to output a new mergecap command wouldn’t run until line 168 and I would have to include a printf statement in the BEGIN block to output an initial mergecap command.
If, instead, I check for a remainder of 1, then the first line will match, as will line 169. Hence we will still get a new mergecap command every 168 lines, but it will start on the first input line instead of on the 168th. Basically it is just so that I don’t have to repeat the printf command to output a new mergecap command, in the BEGIN block.
The rest of the awk script prints a mergecap command with the current pcap filename as its first command line argument, if we are up to pcap file number 168n + 1. It then increments the file variable (which is equal to n in that equation actually) so that the next mergecap command that it outputs will use a different output filename.
For all other input lines, it simply appends the pcap filename to the end of the current mergecap command line.
The END block appends a newline character to finish the last mergecap command line.
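To illustrate the generated output, here is a hypothetical run (the count of 2 and the filenames are for illustration only; the leading blank line comes from the \n printed before each mergecap command):

$ printf '%s\n' a.pcap b.pcap c.pcap d.pcap | awk -f mergecaps.awk -v count=2 -v prefix=out_

mergecap -w out_0000.pcap a.pcap b.pcap
mergecap -w out_0001.pcap c.pcap d.pcap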
getfieldtimes.sh: Generate timestamped list of a particular packet field
Usage
getfieldtimes.sh [-f pcapfilter] <pcapfilename> [tsharkfieldname …]
Description
This command will generate a list of the frame.time_epoch, frame.time, and the specified tsharkfieldname fields (defaulting to ip.dst if none are specified) from packets which match pcapfilter (if specified, otherwise ‘ip’). The fields could be the IP source address, destination TCP port, HTTP URL, MySQL query, SMTP command, or anything that can be referenced by a Wireshark display filter.
Typical Usage
To generate a list of each value of a particular packet field (IP source address, TCP destination port, HTTP URL, SMTP command, MySQL query, for instance), with the time that it was captured.
I used it to generate a list of all the MySQL queries, the time that they were captured, the IP address to which they were sent, and the source IP address and source port from which they came. It is useful to use the output files from this command as input to createtimeline.sh to generate a timeline. It can also be used with getattacks.sh to get all the values of a particular field, grouped by connection.
The output will be:
seconds since epoch (to enable chronological sorting)
human readable timestamp
IP destination address or the specified fields
You can modify the output fields as you wish, however, if you wish to use the output files as input to createtimeline.sh to generate a timeline, it needs the seconds since epoch, human readable timestamp, and the packet field of interest as the first three fields. getattacks.sh requires frame.time_epoch as the first field, and the remaining fields must include the source IP address and the source port (to identify different connections).
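As a hypothetical illustration (the exact timestamp format depends on your tshark version and time zone), the default tab-separated output looks something like:

1234567890.123456789	Feb 13, 2009 23:31:30.123456789 UTC	192.0.2.10
1234567953.000123456	Feb 13, 2009 23:32:33.000123456 UTC	192.0.2.20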
Script
#!/bin/sh
if [ "$1" = "-f" ]; then
    filter="$2"
    shift
    shift
else
    filter="ip"
fi
file="$1"
shift
if [ $# -eq 0 ]; then
    set -- ip.dst
fi
tshark -r "$file" -Tfields -eframe.time_epoch -eframe.time `echo $* | sed 's/^\| / -e/g'` "$filter"
Explanation
After setting variables based on the value/existence of any command line options, it uses tshark to dump the seconds since epoch, human readable timestamp, and any fields named on the command line (or ip.dst if none are specified).
The echo and sed commands inside the backticks are there to add a ‘-e ‘ string before each remaining ($*) command line argument.
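You can see that substitution in isolation; given two hypothetical field names:

$ echo ip.src tcp.dstport | sed 's/^\| / -e/g'
 -eip.src -etcp.dstport

The leading space (from replacing the start of the line) is harmless once the result is substituted into the tshark command line. Note that the \| alternation is a GNU sed extension.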
createtimeline.sh: Generate a timeline of the different values of a particular packet field
Usage
createtimeline.sh < getfieldtimes-output
Description
This command takes the output from getfieldtimes.sh and uses the seconds since epoch to determine when each value of the particular packet field was first seen, when it was last seen, and the number of times it was seen. Note that it also truncates strings of more than eight hex digits prefixed with ‘0x’, to a string of eight hex digits followed by ‘…’.
Typical Usage
I wrote this to determine when the MySQL attack queries were first seen so that I could see when a particular type of MySQL attack started happening. This command, when used with getfieldtimes.sh, would also be useful for looking at attackers’ inbound HTTP requests, or malware’s outbound HTTP requests.
Script
#!/bin/sh
# Note that delimiters are <TAB> characters (ASCII 0x09)
# After copying and pasting, you will need to change the character
# immediately after the '\' character in the awk -F and cut -d options
# from a space (or up to four spaces) to a <TAB>.
# You will then need a space character after the <TAB> and before the
# next character on the command line.
awk -F\ '
{
    cnt[$3]++;
    if ($1 < minsecs[$3] || minsecs[$3] == 0) {
        minsecs[$3] = $1;
        time[$1] = $2;
    }
    if ($1 > maxsecs[$3]) {
        maxsecs[$3] = $1;
        time[$1] = $2;
    }
}
END {
    for (cmd in cnt) {
        start = minsecs[cmd];
        end = maxsecs[cmd];
        ts = time[start];
        te = time[end];
        print start "\t" ts "\t" te "\t" cnt[cmd] "\t" cmd;
    }
}' | sed "/0x[0-9A-Fa-f]\{9,\}/s/0x\([0-9A-Fa-f]\{8\}\)[0-9A-Fa-f]*/0x\1.../g" | sort | cut -d\ -f2-
Explanation
It basically takes the seconds since epoch timestamp, human readable timestamp, and the MySQL query string, and uses associative arrays (or ‘hashes’ if you are a Python or Perl person; ‘hashes’ makes me think of cryptographic hashes, so I prefer the term ‘associative arrays’) to store the earliest time, the latest time, and a count of how many times it saw each MySQL query string.
It uses seconds since epoch to index the array of human readable timestamps because the seconds since epoch is already stored (to determine chronological order) so we can fetch it for each query string, and it is a lot shorter than some of the MySQL query strings. Basically it is a memory saving tactic.
The sed command at the end will take a string of hex digits prefixed by 0x and truncate it to the first eight hex digits. This is in there because I was using this command to determine when certain MySQL queries were first seen, and a number of the attacks contained a command with what looked like a hex encoded Windows binary as an argument. The sed command above simply shortens those lines somewhat. If you are not processing any data with such hex encoded strings in it, then you can remove that sed command, or replace it with something more appropriate for your data.
sort sorts the entries chronologically using the seconds since epoch timestamp in the first column.
The trailing cut command removes the seconds since epoch timestamp because, except on occasions like 23:31:30 on February 13th 2009 (UTC) when it was 1234567890, it generally isn’t that interesting to look at. Sad as it might be, I felt compelled to write an application for my mobile phone so that I could watch the occasion, as I was out trying to find a geocache at the time and hence couldn’t watch it on a UNIX box.
I later realised that I missed another interesting seconds since epoch time of 11:18:04 on 28th January 2010 (UTC) when the seconds since epoch time, when stored as a big-endian 32-bit number and read as a string of four ASCII characters, spelt my name. I must have been bored when I figured that out.
Seconds since epoch does make it easy to sort things chronologically though — better than some of the daft date/time formats that we humans use.
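A hypothetical end-to-end run (the capture filename is illustrative; ‘mysql’ is a Wireshark display filter that matches MySQL traffic) would be:

./getfieldtimes.sh -f mysql capture.pcap mysql.query | ./createtimeline.sh

Each output line then shows when a query was first seen, when it was last seen, how many times it was seen, and the query itself.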
getattacks.sh: Group one field based on the value of two other fields
Usage
getattacks.sh <datafield> <srcip> <srcport>
Description
The script will read the output from getfieldtimes.sh, group the values of the field number given by datafield according to the field numbers given by srcip and srcport, and write them to files. It will also create a different output file for data fields that look like they were from a different connection (based on the timestamp) despite having the same srcip and srcport fields.
Typical Usage
To group data contained in a particular packet field into separate files based on srcip and srcport. In other words, to group data by TCP/UDP connection, which will, in most analysis cases, group it by attack. I used this to create separate files containing all the MySQL commands for each MySQL connection.
Script
#!/bin/sh
# Note that delimiters are <TAB> characters (ASCII 0x09)
# After copying and pasting, you will need to change the character
# immediately after the '\' character in the awk -F option
# from a space (or up to four spaces) to a <TAB>.
# You will then need a space character after the <TAB> and before the
# next character on the command line.
if [ $# -eq 3 ]; then
    awk -F\ -v datafield="$1" -v srcip="$2" -v srcport="$3" '{
        # $1: seconds since epoch
        # Other fields can contain arbitrary data but must include the
        # source IP address and source port somewhere
        cid = $srcip ":" $srcport;
        if ($1 - time[cid] > 3600) {
            time[cid] = $1;
        }
        print $datafield >> ("attack_" time[cid] "_" cid);
    }'
else
    echo "usage: $0 <datafield> <srcip> <srcport>"
fi
Explanation
This script creates a cid (connection ID) which is a string consisting of the source IP address, a ‘:’ character, and the source port. The data field is then output to a file based on this connection ID and the time that this connection was first seen.
The time associative array stores the timestamp of the first packet belonging to this connection. It assumes connections won’t last for more than 3600 seconds (1 hour). If they do, a second file is created. This is in case the same source IP address and port were later reused for a separate connection. We can’t use the TCP flags as we are missing that information.
Note that I have used the awk script to include the information about each connection in the filename of the output file, rather than including it in the output file itself. This is because I want the contents of each output file to contain only the MySQL query data. This enables me to use md5sum to create an MD5 hash of each of the output files to see which attacks were the same.
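For example, in the hypothetical invocation below, getfieldtimes.sh puts the query in field 3 and the source IP address and port in fields 4 and 5 (remember to fix the <TAB> delimiter in the script first, per its comments):

./getfieldtimes.sh -f mysql capture.pcap mysql.query ip.src tcp.srcport > fieldtimes.txt
./getattacks.sh 3 4 5 < fieldtimes.txt
md5sum attack_* | sort

This would produce files named like attack_1234567890_192.0.2.1:4242, and the final md5sum makes identical attacks easy to spot.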
extractbins.sh: Extract hex encoded binary data from MySQL commands
Usage
extractbins.sh <pcapfile>
Description
This script searches through a text file looking for strings of hex digits with a leading ‘0x’ and a trailing ‘)’. It will output each such occurrence to a .bin file named after the input file and an incrementing sequence number for each such string contained in the input file.
Typical Usage
To extract hex encoded binary data from MySQL commands contained in a pcap file.
Script
#!/bin/sh
###
# extractbins.sh <pcapfilename>
#
# Extract binary files from the '@a = concat('','0x...');' SQL commands
# in a pcap file.
# It will then spit out a binsrc file which will list MD5 sums of the
# binaries and the IP address from whence they came.
###
/usr/sbin/tcpick -r "$1" -wR 'port 3306'
for file in `grep "0x" *dat | cut -d\ -f3`; do
    strings "$file" | awk '
        /0x[0-9A-F]*/ {
            sub("^.*0x","");
            output = 1;
        }
        (output) {
            hex = gensub("[^0-9A-Fa-f].*$","",1);
            printf("%s",hex);
        }
        /\)/ && (output) {
            output = 0;
            printf("\n");
        }
    ' > "$file.0x"
done
awk '
{
    if (FILENAME != oldfilename)
        filenameidx = 1;
    hexdigits = "0123456789abcdef";
    word = tolower($0);
    value = 0;
    for (nibble = 1; nibble <= length(word); nibble++) {
        char = substr(word,nibble,1);
        idx = index(hexdigits,char) - 1;
        value += idx * ((nibble % 2) == 1 ? 16 : 1);
        if (idx == -1)
            printf("WARNING: Invalid hex digit %c\n",char) > "/dev/stderr";
        if (nibble % 2 == 0) {
            printf("%c",value) > (FILENAME "." filenameidx ".bin");
            value = 0;
        }
    }
    filenameidx++;
    oldfilename = FILENAME;
}' *0x
md5sum -b *bin | sort | cut -d_ -f1,2 | sed 's/\*tcpick_//' > binsrc
Explanation
The tcpick command is used to extract the data from TCP connections and save it to files. The -wR option specifies that the TCP connection data is to be written to the output files as it appears in the connection. The port 3306 is the filter telling tcpick that we only care about MySQL connections (3306 being the well-known port number for MySQL).
For each of the tcpick output files (.dat), the script checks to see if it contains the string 0x and if so, runs the strings command to extract all string data. The reason for running strings is because the raw TCP connection data contains the binary data from the MySQL network protocol, and not just MySQL commands.
The first embedded awk script looks for the first line containing 0x, removes everything up to and including the 0x and sets the output variable to 1 to indicate that the script should start printing output. This is necessary because some of the MySQL attacks were spreading the hex encoded data over multiple lines.
The second pattern-action rule will process the line if the output variable is set to 1 (that is, if it has seen the start of a hex string), and print the line’s leading hex characters, stopping at the first non-hex character. gensub is used, rather than sub, because it returns the modified string without altering $0; this leaves the ‘)’ bracket on the input line so that the next pattern-action rule can match it.
The third pattern-action rule checks for a ‘)’ character and if found, and the script is currently printing output, the output variable is set to 0 to stop the script from printing any more output as we have reached the end of this particular hex string. A newline is printed to mark the end of this hex string.
This first section creates a series of files ending in .0x for each of the tcpick output .dat files which contain the string 0x. Each of these .0x files will have one line per block of hex encoded data in the corresponding .dat file. These lines will consist purely of the hex encoded data itself.
The second awk script merely converts the hex encoded data in to binary data by first converting the input to lower case characters and then searching for each nibble (4-bit value, that is, a single hex digit) in the hexdigits string. The index of its location in this string will be its numerical value plus one (as awk starts numbering the character positions at 1, and not at 0 like C does).
This value is then either multiplied by 16 or not, depending on whether it is an odd numbered nibble (the first nibble of a byte) or an even numbered nibble (the second nibble of a byte). After the second nibble of each byte, the accumulated byte value is printed to the output file as a single character.
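To make the arithmetic concrete, here is a minimal sketch of the same conversion on a hypothetical two-digit input (‘41’ is the hex encoding of the ASCII character ‘A’: 4 × 16 + 1 = 65):

$ printf '41' | awk '{
    hexdigits = "0123456789abcdef";
    for (nibble = 1; nibble <= length($0); nibble++) {
        # value of this hex digit, multiplied by 16 if it is the high nibble
        value += (index(hexdigits, substr(tolower($0), nibble, 1)) - 1) * ((nibble % 2) == 1 ? 16 : 1);
        if (nibble % 2 == 0) { printf("%c", value); value = 0; }
    }
}'
A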
The filenameidx variable is reset to 1 at the start of each new input file and incremented for each line in the file. Hence the output filename contains an incrementing number identifying which block of hex in the input file this particular binary data came from.
The last line of the shell script creates an MD5 hash of each file, and extracts the MD5 sum and IP address (contained in the filename) from the md5sum output.
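Since each line of binsrc consists of an MD5 hash and a source IP address (assuming tcpick names its output files with the source IP address as the second underscore-separated component, which is what the cut -d_ -f1,2 expects), a quick hypothetical follow-up to see which binaries were sent, and how often, is:

# count how many times each (binary, source IP) pair was seen
uniq -c binsrc
# count occurrences of each distinct binary, most common first
cut -d' ' -f1 binsrc | uniq -c | sort -rn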