Ever wondered whether it would be possible to find strings that have been xored, without undoing the xor? After seeing Lenny Zeltser demonstrate the ‘xorsearch’ command, I started to wonder whether xored strings could be found by considering how each byte differs from the byte next to it.
I should probably include a ‘#define waldo wally’ statement here for the benefit of Americans and Canadians, who I believe call the constantly hiding red and white character ‘Waldo’, whereas to us he is known as ‘Wally’.
As noted in my ‘The Usefulness of Strings’ series, strings can often give away some useful clues about what a piece of malware may get up to. Consequently it is quite common for malware to obfuscate strings, usually by using a better method than the one this blog post describes. However, and unfortunately for you, that isn’t about to stop me from waffling on about it.
I watched Lenny Zeltser’s Malware Analysis Essentials using REMnux webcast, during which he demonstrated the ‘xorsearch’ command. This command searches for an xored version of a given string by, I believe, brute forcing the xor byte.
If it finds a match, it will then xor the rest of the file using that xor byte to see what other strings it can find. This is quite nifty; however, it requires a known string to start with. That probably isn’t too difficult to guess, though: the ‘http’ example that Lenny used is a reasonably safe bet.
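To make the idea concrete, here is a minimal Python 3 sketch of that brute force (my own toy code, not the actual xorsearch tool, which is far more capable):

```python
def xor_byte_search(data, needle):
    # try every possible single xor byte and see whether the
    # xored needle appears anywhere in the data
    hits = []
    for key in range(256):
        encoded = bytes(b ^ key for b in needle)
        offset = data.find(encoded)
        if offset != -1:
            hits.append((key, offset))
    return hits

# 'encrypt' a sample buffer with an arbitrary key byte (0x55 here)
sample = bytes(b ^ 0x55 for b in b"GET http://example.com/")
print(xor_byte_search(sample, b"http"))   # [(85, 4)]
```

Each hit gives a candidate xor byte, which can then be used to decrypt the rest of the file, much as xorsearch does.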
This got me musing about whether there was some sort of mathematical way to find xored strings without having to undo the xor. I’ve come up with something which isn’t brilliant, but it is a start.
My first thought was to check the (mathematical) difference between consecutive bytes, that is, subtract one byte from the byte next to it. My reasoning behind this was that if it is part of a string, then all the bytes will be reasonably close together as they will be between the ASCII code for ‘A’/’a’ and ‘Z’/’z’.
This reasoning was flawed, however, as that difference can change after an xor, depending on which byte the original byte is xored with. For example, the string ‘nop’ would be represented (in ASCII) as 0x6e 0x6f 0x70. Subtracting each byte from the byte next to it, and taking the absolute value (so it doesn’t matter whether we subtract byte ‘n’ from byte ‘n+1’ or byte ‘n+1’ from byte ‘n’), we get 1.
If we then xor each byte of our little string with the byte 0x01 (let’s keep it simple), we get 0x6f 0x6e 0x71. This maintains a difference of 1 between the first two bytes, however, the difference between the second and third bytes is now 3. This is not too bad, as it would still be within our test range of 0 – 25 (as a string consists of letters from our 26 character alphabet — ignoring the difference between upper case and lower case characters for the moment).
However, if the xor byte has a more significant bit set, say one whose place value is greater than 26 (0x20, 0x40 or 0x80), then it will be possible for the difference between consecutive bytes after xoring to be greater than 26. For example, if we xor our little string with 0x30, then we get 0x5e 0x5f 0x40. Still a difference of 1 between the first two bytes, but there is now a difference of 31 (0x1f) between the second and third bytes. That won’t do at all.
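A quick check of that example (scratch code in Python 3, unlike the post’s own Python 2 scripts below):

```python
s = b"nop"                                    # 0x6e 0x6f 0x70
diffs = [abs(s[i + 1] - s[i]) for i in range(len(s) - 1)]
print(diffs)                                  # [1, 1]

# xor every byte with 0x30 and take the differences again
x = bytes(b ^ 0x30 for b in s)                # 0x5e 0x5f 0x40
diffs_x = [abs(x[i + 1] - x[i]) for i in range(len(x) - 1)]
print(diffs_x)                                # [1, 31]
```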
What if we xor consecutive bytes? After all, xor has been described as addition without carry. Well it turns out that the result of xoring consecutive bytes before a simple (encrypting) xor yields the same result as xoring consecutive bytes after. Let me attempt to illustrate this.
Three bits walk in to a bar, a b c, followed closely by another three bits, d e f. We are going to use an xor key consisting of three bits x y z.
If we xor the two three bit words from the bar, before they get stuck in to the beer, we get the three bits:
(a ^ d) (b ^ e) (c ^ f)
If we xor the two three bit words with the xor key word of x y z, we get the two three bit words:
(a ^ x) (b ^ y) (c ^ z) (d ^ x) (e ^ y) (f ^ z)
Now, xoring those two ‘encrypted’ three bit words together, that is doing the xor of two consecutive bytes after ‘encrypting’ with the simple xor operation, we get:
((a ^ x) ^ (d ^ x)) ((b ^ y) ^ (e ^ y)) ((c ^ z) ^ (f ^ z))
Which, since bitwise xor is both associative and commutative, is the same as:
((a ^ d) ^ (x ^ x)) ((b ^ e) ^ (y ^ y)) ((c ^ f) ^ (z ^ z))
Which simplifies to:
((a ^ d) ^ 0) ((b ^ e) ^ 0) ((c ^ f) ^ 0)
As (x ^ x) == 0, because (0 ^ 0) and (1 ^ 1) both equal 0.
Or, if you want to get really excited:
(a ^ d) (b ^ e) (c ^ f)
As (x ^ 0) == x, because (0 ^ 0) == 0, and (1 ^ 0) == 1. A fact that compilers often take advantage of to zero a register more efficiently than by using a mov instruction.
Notice that what we end up with is the same result as xoring the two unencrypted three bit words together. Hold on a second though — don’t go inciting any Mexican Waves just yet. We now need to figure out how to use such knowledge.
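The property is easy to demonstrate in a few lines of Python 3 (the key byte here is arbitrary):

```python
KEY = 0x9c                                    # arbitrary 'encryption' key byte

plain = b"WALLY"
cipher = bytes(b ^ KEY for b in plain)

# xor consecutive bytes before and after 'encrypting'
before = [plain[i] ^ plain[i + 1] for i in range(len(plain) - 1)]
after = [cipher[i] ^ cipher[i + 1] for i in range(len(cipher) - 1)]

print(before == after)                        # True: the key cancels out
```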
Let’s write a quick Python script to see what the results are of xoring every capital letter of our alphabet with every other capital letter, to see what kind of result we can expect if we are xoring consecutive bytes of a string. 0x41 is the ASCII code for ‘A’, 0x5a is the ASCII code for ‘Z’, and the ‘+ 1’ is because the Python range() function doesn’t include the last value in the range:
#!/usr/bin/python
for i in range(0x41,0x5a + 1):
    for j in range(0x41,0x5a + 1):
        print "%c %c %02x" % (i,j,i ^ j)
Now, if we run that we get 676 (26 * 26) lines of output which looks like it is giving you the distance between the first letter and the second letter, but just getting it slightly wrong. We now want to know the range covered by the xor results in the third (last) column. This is where I get to demonstrate why I like UNIX:
$ ./xortst | cut -d' ' -f3 | sort | uniq
00
01
02
03
04
05
06
07
08
09
0a
0b
0c
0d
0e
0f
10
11
12
13
14
15
16
17
18
19
1a
1b
1c
1d
1e
1f
There, much nicer. This tells us that the result of xoring consecutive bytes of capital letters will be between 0x00 and 0x1f (inclusive). What about the case of capital letters next to lower case letters? A lower case letter has the same ASCII code as its capital counterpart ored with 0x20 (that is, with bit five set). Let’s modify our xortst (I called that little test script xortst, by the way) Python script to xor each capital letter with each lower case letter:
#!/usr/bin/python
for i in range(0x41,0x5a + 1):
    for j in range(0x61,0x7a + 1):
        print "%c %c %02x" % (i,j,i ^ j)
This is the same as the previous version, only the inner for loop (j) now has a range of 0x61 (‘a’) to 0x7a (‘z’) plus one (the Python range() function doesn’t include the last value of the range). Note that 0x61 (‘a’) == 0x41 (‘A’) ored with 0x20, and that 0x7a (‘z’) == 0x5a (‘Z’) ored with 0x20. It’s not particularly important here, but it’s a neat trick that you can impress your friends with later. It is also handy to know, as you will often see the constant ‘0x20’ in malware as it looks for a space character in a string, converts characters to upper or lower case, or does a case insensitive search.
Right then, running our new version of xortst reveals that the result of xoring consecutive bytes of different case letters results in a value between 0x20 and 0x3f (inclusive). Note that this is the same as xoring the same case characters together, only with bit 0x20 set. This is to be expected as bit five (0x20) is set in all of the lower case letters but not in the upper case ones, so it is bound to be set in the xor result.
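Rather than eyeballing 676 lines of output, both ranges can also be confirmed in a few lines of Python 3:

```python
upper = range(0x41, 0x5a + 1)   # 'A' - 'Z'
lower = range(0x61, 0x7a + 1)   # 'a' - 'z'

# every pairwise xor result for same-case and mixed-case letters
same_case = {i ^ j for i in upper for j in upper}
mixed_case = {i ^ j for i in upper for j in lower}

print(hex(min(same_case)), hex(max(same_case)))    # 0x0 0x1f
print(hex(min(mixed_case)), hex(max(mixed_case)))  # 0x20 0x3f
```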
If you’re feeling like it, you can also modify the outer for loop such that i ranges from 0x61 to 0x7a + 1 to see what happens when each lower case letter is xored with every other lower case letter. I can assure you the results are the same as xoring upper case letters together, but feel free to try it as there is nothing like learning by doing.
Now that we know this, we can attempt to do something useful with it. Let’s write a Python script to read a file, xor consecutive bytes together, and look for a result between 0x00 and 0x3f. Something that I bet you’ve been dying to do for ages:
import sys;

#
# Read the file in to a variable
#
f = open(sys.argv[1],"rb");
data = f.read();

#
# Initialise variables
#
d = 0      # xor value
start = -1 # offset of first char of potential string

#
# for each byte in the file...
#
charcnt = 1;
for i in range(len(data)):
    # ... except the first byte
    if i > 0:
        #
        # xor the previous byte with the current byte
        #
        prevc = data[i - 1];
        c = data[i];
        d = ord(prevc) ^ ord(c);
        #
        # if the result is between 0x00 and 0x3f (incl)...
        #
        if ((d >= 0x00) and (d <= 0x3f) and (charcnt < 3)):
            #
            # if we weren't already in a run of 0x00 - 0x3f
            # results, make a note of the offset (start)
            #
            if (start == -1):
                start = i - 1;
        else:
            #
            # otherwise, if we were in a run, we've found the
            # end of a possible string, so check to see if this
            # was a run of four or more 0x00 - 0x3f results
            #
            if ((start != -1) and (charcnt < 3)):
                if ((i - start) >= 4):
                    print "Possible string at offset 0x%02x - 0x%02x" % (start,i - 1);
            start = -1;
        if (charcnt >= 3):
            start = -1;
        # Note 1: Keep track of the number of consecutive
        # bytes that are the same.
        if (prevc == c):
            charcnt += 1;
        else:
            charcnt = 1;
f.close();
Now I’ve cheated a bit here and actually put a little bit of extra code in there to fix an issue that I found. I shall explain that a tad later. I’ve also tried to use comments to explain the script so that I don’t ramble on so much about it in the rest of the blog post.
Basically this script takes a file name on the command line, reads the contents of the file in to the data variable, and then proceeds to xor each byte (except the first, for a reason which will shortly become obvious) with the byte before it.
If the xor result is between 0x00 and 0x3f, and the previous xor result wasn’t, then it uses the start variable to note the offset at which this occurred. This will be the start of the potential string.
If the xor result is not between 0x00 and 0x3f, and the previous xor result was, then we’ve found the end of our potential string. At this point we check that the length of our potential string is at least four characters (the default that the UNIX strings command uses).
Now, a problem with this approach of finding strings is that it is really looking for consecutive bytes that are (mathematically) close together, as characters of a string would be. However, this also covers a number of constants, address offsets, and in some cases groups of instructions (for instance, a number of push and/or pop instructions in a row).
The check to make sure that the potential string is at least four characters long helps to weed out some of these false positives, however, there are still a large number of them left. This is where the charcnt variable and the code at ‘Note 1’ come in.
I noticed that my script was identifying large blocks of 0 bytes as strings, which, according to the logic I’d used in the script, they could well have been. To solve this, I originally required the xor result to be between 0x01 and 0x3f, but that obviously broke when encountering duplicate letters in a string (as xoring any number with itself yields 0).
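The false positive is easy to see: every pair of consecutive zero bytes xors to 0x00, which sits comfortably inside the 0x00 to 0x3f window. A quick Python 3 illustration:

```python
padding = b"\x00" * 16
pairwise = [padding[i] ^ padding[i + 1] for i in range(len(padding) - 1)]

# without a further check, this block of padding would be
# reported as one long potential 'string'
print(all(0x00 <= d <= 0x3f for d in pairwise))    # True
```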
Instead, I had a bit of a think and couldn’t remember seeing an English word containing the same letter three times in a row. To make sure that this wasn’t just due to my grasp of the English language, and because it also sounded like a good excuse to pipe some UNIX commands together, I set about confirming this using the words file on my Linux box.
$ time (cat /usr/share/dict/words.pre-dictionaries-common | while read word; do echo "$word" | tr "[A-Z]" "[a-z]" | sed 's/./'"$word"' &\n/g' | sed '/^$/d' | uniq -c;done | tr -s " " | grep "^ [3-9]")
I’ve complicated that a little by running it in a subshell (using the parentheses), and using the time command to time how long it took to complete, mainly because it seemed to be taking a while.
Twenty-eight minutes later, the command completed. One of the thoughts that ran through my mind whilst waiting was that this problem was better suited to an awk script than my Bourne shell/sed combo, so I set about writing one for comparison, and ran it using ‘awk -f a /usr/share/dict/words.pre-dictionaries-common’, where ‘a’ is the imaginative name of my awk script. That ran in a second, if that, so if you are pushed for time I’d recommend the awk approach:
{
    # start of a word so init vars
    cnt = 0;
    prevc = "";

    # for each character in the word
    for (i = 1; i <= length($0); i++) {
        c = substr($0,i,1);

        # if it was the same as the last one
        if (prevc == c) {
            # increase char count
            cnt++;
        } else {
            # otherwise reset char count to 1
            # as this is the first in this run
            # of this character
            cnt = 1;
        }

        # if we have seen more than two in a row
        # we want to know about it, so print
        # the word, the letter, and the count
        if (cnt > 2)
            print $0,c,cnt;

        # on the next iteration, this character
        # will be the previous one
        prevc = c;
    }
}
Both of these scripts showed that the only words with three consecutive letters the same are KKK’s and WWW’s. Whilst this will cause problems identifying strings such as hostnames (www.mymaliciousserver.com, for instance), which are often found in malware, it will still identify other strings, and in the case of that example it will identify ‘mymaliciousserver’.
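For comparison, the same triple-letter test takes only a few lines of Python 3 (the function name is mine):

```python
def has_triple(word):
    # True if the word contains the same letter three (or more)
    # times in a row, ignoring case
    w = word.lower()
    return any(w[i] == w[i + 1] == w[i + 2] for i in range(len(w) - 2))

print(has_triple("banana"))         # False
print(has_triple("www.wally.com"))  # True
```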
So the code at ‘Note 1’ was added and now the output is much more reliable, with the script not identifying blocks of zero bytes. I ran it on both a plain text version of good old ws2_32.dll, and on a version that I had ‘encrypted’ by xoring every byte of the file with a ‘key’ byte. The results were the same for both the unencrypted and encrypted versions of the file.
The script does still pick up a few non-string parts of the file, namely four byte constants/offsets that are scattered throughout it, but it did appear to be picking up all of the four or more character strings (of alphabetic characters only, so spaces, digits, punctuation characters aren’t included).
I tried to improve on this by getting the script to attempt to work out the ‘key’ byte used in the original xor operation. If we had this, we could decrypt the strings and potentially other parts of the malware.
To start with, I xored each byte of the possible strings with the bytes 0x41 – 0x5a (‘A’ – ‘Z’). That’s the good thing about xor — xoring the cipher text with the plain text will give us the key. Since we are looking for a string of bytes, we’ll try xoring each character thereof with each character of the alphabet — being the range of output that we are looking for. This will give us a set of possible key bytes for each byte, that result in an upper case letter.
If we then count how many bytes each candidate key byte successfully decrypts to an upper case letter, we can rule out any candidate key bytes that don’t successfully decrypt all the bytes. They are either incorrect, or the data that we are decrypting is not a string. This leaves us with a set of possible key bytes for the given possible string.
For each of those possible keys, k, k ^ 0x20 is also a possible key as we know from the initial test that xoring with k will produce a capital letter, hence xoring with k ^ 0x20 will produce a lower case letter (as the only difference between the ASCII codes of corresponding upper and lower case letters is the bit 0x20, which will be flipped by the xor 0x20 operation).
Another test we could do here is to check whether bit 0x20 on the first character is the opposite of what it is on the remaining characters. That would suggest that the case of the first character differs from that of the rest, and that the key byte is more than likely the one that decrypts the first character to an upper case letter, as English tends to capitalise the first letter of words rather than capitalise all but the first letter.
This counting to check the decryption of each byte helped to rule out some of the false positives, namely potential strings for which the script couldn’t find a key byte that would result in all of the characters of the potential string being decrypted to an alphabetic character.
My next step will be to assume that all of the potential strings have been xored with the same key byte. The script can then take the resulting candidate key bytes for each string, and look for key bytes that are common to all of them. After that, we may just have to decrypt the strings using the possible key bytes, assume that the strings are English words, and check for them in the UNIX words file.
This could help us deduce a most-likely key, but shouldn’t be used to rule out possible keys as the strings themselves may not actually be encrypted English words. For instance, base64 encoded data. Also, shell commands (which more than likely won’t be detected by the above algorithm anyway) often appear in malware, commonly to create a batch file to use the ftp or tftp commands to download more malware.
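The key-recovery idea can be sketched roughly in Python 3 (the names are mine, and this sketch only handles single-case strings, per the upper case test described above): for each potential string, collect the keys that decrypt every byte to an upper case letter, add the k ^ 0x20 variants, then intersect the sets across all the strings.

```python
import string

UPPER = {ord(c) for c in string.ascii_uppercase}

def candidate_keys(chunk):
    # per byte: keys that decrypt it to an upper case letter,
    # then keep only keys that work for every byte of the chunk
    per_byte = [{b ^ u for u in UPPER} for b in chunk]
    common = set.intersection(*per_byte)
    # for each such k, k ^ 0x20 decrypts the same bytes to
    # lower case letters, so it is also a candidate
    return common | {k ^ 0x20 for k in common}

def common_keys(chunks):
    # assume one key byte for the whole file and intersect
    # the candidate sets of all the potential strings
    return set.intersection(*(candidate_keys(c) for c in chunks))

# demo: two strings 'encrypted' with the same arbitrary key byte
KEY = 0x42
chunks = [bytes(b ^ KEY for b in s) for s in (b"WALLY", b"HTTP")]
print(KEY in common_keys(chunks))   # True
```

Note that a genuinely mixed-case string yields an empty candidate set with this approach, which is exactly the limitation the first-character case test above tries to address.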
Obviously this approach won’t necessarily work if a multi-byte xor key is used, but that does raise the question of whether or not a similar approach is possible for such a problem.
So there is some food for thought. At some point I would like to continue work on my script and update this blog post, however I have lots of other interesting things that I would like to work on. I have at least discovered that finding strings in single byte xor encrypted data without undoing the xor, is not completely out of the question, but does this method really have any benefit over brute forcing the xor byte?