Beyond Automated Unpacking: Extracting Decrypted/Decompressed Memory Blocks

It’s been about a year and a half since I wrote about a behavioural approach to automated unpacking, and I figured it was time to add some more functionality to unpack.py. This time, I’m going to look at malware that decrypts or decompresses code from within itself, and at process hollowing, and see if we can capture the decrypted, decompressed, or newly written memory. Let’s spruce unpack.py up a tad.

I noticed some malware calling RtlDecompressBuffer() and CryptDecrypt() to extract new code to run, and then there’s the classic CreateProcess() / WriteProcessMemory() trick (process hollowing) for running malicious code under the guise of another process. I figured that with WinAppDbg’s ability to hook API calls, we could capture the resulting memory data and save it to a file. To do this, we’ll take version 2013.02.26 of unpack.py and add to it.

Since I’ve also been doing a lot of work writing Splunk search queries and reports recently, I started thinking about getting log data out of unpack.py in a form that would be easy to process in something like Splunk. Consequently, we’ll add some extra log output while we’re at it. All up, it’s shaping up to be a fun and exciting evening.

The plan

Whilst looking at some of the malware samples from my honeynet, I noticed the RtlDecompressBuffer() and CryptDecrypt() Win32 API calls being used to decompress and decrypt new code to run.

I also noticed behaviour known as process hollowing, which is where a process creates a new process in a suspended state, overwrites the new process’ memory image with (usually malicious) code, and then resumes the process. This is often used to run malicious code under the guise of a legitimate process.

It would also be useful if unpack.py spat out some machine readable log data. This would allow other utilities to process the log data more easily, and make it easier to automatically run unpack.py from a script and make decisions based on what unpack.py finds.

The plan then is as follows:

Look for calls to RtlDecompressBuffer() and CryptDecrypt(), and capture the compressed/encrypted memory and the decompressed/decrypted memory.
Look for calls to the CreateProcess() functions that create processes in a suspended state, and for the subsequent resumption of those processes.
Look for calls to WriteProcessMemory() to capture the memory written to other processes’ address spaces.
Spit out log data in JSON format (time will tell whether this was the best format to use).

Note that this post continues on from my Automated Unpacking: A Behaviour Based Approach post (http://malwaremusings.com/2013/02/26/automated-unpacking-a-behaviour-based-approach/), which talks about the previous version of unpack.py. This post only discusses the additional functionality in the new version.

Reading decrypted and decompressed memory

The first step in obtaining decrypted and decompressed memory blocks is to hook the API functions responsible for the decryption and decompression, namely CryptDecrypt() and RtlDecompressBuffer(). These hooks are created by adding entries to the apiHooks{} dictionary in section C.1 of unpack.py. A future addition would be detection of other encryption/decryption library calls, such as those found in OpenSSL.
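As a rough sketch, assuming WinAppDbg’s (function name, parameter count) tuple form of apiHooks, the new entries might look something like the following. The class name UnpackHandler is hypothetical; in unpack.py the dictionary belongs to a class derived from WinAppDbg’s EventHandler, which uses it to decide what to hook and then calls methods named pre_<function> and post_<function>.

```python
# Hypothetical sketch: in unpack.py, apiHooks is a class attribute of a
# WinAppDbg EventHandler subclass. Each key is the DLL exporting the
# function; each value is a list of (function name, parameter count) tuples.
class UnpackHandler(object):
    apiHooks = {
        # BOOL CryptDecrypt(hKey, hHash, Final, dwFlags, pbData, pdwDataLen)
        'advapi32.dll': [('CryptDecrypt', 6)],
        # NTSTATUS RtlDecompressBuffer(CompressionFormat, UncompressedBuffer,
        #     UncompressedBufferSize, CompressedBuffer, CompressedBufferSize,
        #     FinalUncompressedSize)
        'ntdll.dll': [('RtlDecompressBuffer', 6)],
    }
```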

Now, for each of these API calls, we’ll install pre_ and post_ hook handlers (using WinAppDbg’s default handler naming) in sections C.3 and C.4:

pre_CryptDecrypt()
post_CryptDecrypt()
pre_RtlDecompressBuffer()
post_RtlDecompressBuffer()

The pre_ handlers will be called on entry to the API calls, and the post_ handlers will be called when the API calls return (this is automatic WinAppDbg behaviour).

The reason for hooking both the entry to and the exit from the API calls is so that we can get both the encrypted and decrypted data, and both the compressed and uncompressed data. RtlDecompressBuffer() actually has two separate parameters for the compressed and uncompressed buffers, but we’ll still create a pre_RtlDecompressBuffer() handler in case the calling program specified the same buffer for both. I haven’t checked whether this is valid and/or will actually work, but just because something isn’t valid doesn’t mean someone won’t do it. In fact, if someone is trying to get access to a system, they’re probably more likely to do something that isn’t valid.
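The entry/exit split for RtlDecompressBuffer() might be sketched as below. This is not unpack.py’s code: it is duck-typed so it can run without a debugger, assuming an event object with get_tid() and get_process(), and a process object with WinAppDbg-style read(address, size) and read_uint(address) methods. Because a WinAppDbg post_ handler only receives the return value, the sketch remembers the arguments it needs per thread on entry. save(address, kind, data) is a hypothetical callback that writes a buffer to disk.

```python
STATUS_SUCCESS = 0


class DecompressCapture(object):
    """Sketch of pre_/post_RtlDecompressBuffer() handlers (hypothetical)."""

    def __init__(self, save):
        self.save = save      # save(address, kind, data) writes a buffer out
        self.pending = {}     # thread id -> arguments remembered on entry

    def pre_RtlDecompressBuffer(self, event, ra, CompressionFormat,
                                UncompressedBuffer, UncompressedBufferSize,
                                CompressedBuffer, CompressedBufferSize,
                                FinalUncompressedSize):
        process = event.get_process()
        # Grab the compressed input now, before the call can overwrite it
        # (e.g. if the caller passed the same buffer for input and output).
        data = process.read(CompressedBuffer, CompressedBufferSize)
        self.save(CompressedBuffer, 'comp', data)
        self.pending[event.get_tid()] = (UncompressedBuffer,
                                         FinalUncompressedSize)

    def post_RtlDecompressBuffer(self, event, retval):
        args = self.pending.pop(event.get_tid(), None)
        if args is None or retval != STATUS_SUCCESS:
            return  # no matching entry, or the decompression failed
        UncompressedBuffer, FinalUncompressedSize = args
        process = event.get_process()
        # FinalUncompressedSize points to a ULONG receiving the output length.
        size = process.read_uint(FinalUncompressedSize)
        self.save(UncompressedBuffer, 'decomp',
                  process.read(UncompressedBuffer, size))
```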

pre_CryptDecrypt() (section C.3) uses WinAppDbg’s Process.read_uint() method to dereference the pdwDataLen argument, giving us the size of the buffer.

Apart from that difference, both pre_CryptDecrypt() (section C.3) and pre_RtlDecompressBuffer() (section C.4) use WinAppDbg’s Process.read() method to read the data buffers (pbData, CompressedBuffer, and UncompressedBuffer) from the process’ address space, and write them to disk.
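A similarly hedged sketch for CryptDecrypt(), under the same duck-typed event/process assumptions and with the same hypothetical save(address, kind, data) callback. Per the CryptDecrypt() documentation, *pdwDataLen holds the number of bytes to decrypt on entry and the plaintext length on return, so the sketch reads it with read_uint() on both sides of the call.

```python
class DecryptCapture(object):
    """Sketch of pre_/post_CryptDecrypt() handlers (hypothetical)."""

    def __init__(self, save):
        self.save = save      # save(address, kind, data) writes a buffer out
        self.pending = {}     # thread id -> (pbData, pdwDataLen)

    def pre_CryptDecrypt(self, event, ra, hKey, hHash, Final, dwFlags,
                         pbData, pdwDataLen):
        process = event.get_process()
        # On entry, *pdwDataLen is the number of bytes to decrypt.
        size = process.read_uint(pdwDataLen)
        self.save(pbData, 'enc', process.read(pbData, size))
        self.pending[event.get_tid()] = (pbData, pdwDataLen)

    def post_CryptDecrypt(self, event, retval):
        args = self.pending.pop(event.get_tid(), None)
        if args is None or not retval:  # CryptDecrypt() returns FALSE on failure
            return
        pbData, pdwDataLen = args
        process = event.get_process()
        # On return, *pdwDataLen is the plaintext length, which can be shorter
        # than the ciphertext (e.g. once block-cipher padding is removed).
        size = process.read_uint(pdwDataLen)
        self.save(pbData, 'dec', process.read(pbData, size))
```

The decryption happens in place in pbData, which is why the same address is saved twice with different suffixes.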

Both the CryptDecrypt() and RtlDecompressBuffer() hook handlers save the memory buffers to files named after the exe file that unpack.py is running — they simply append .memblk<bufferaddressinhex> followed by .enc (encrypted), .dec (decrypted), .comp (compressed), .decomp (decompressed). These output file names are also logged in both the human readable log output, and in the JSON log output.
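The naming scheme is simple enough to sketch. The helper names are hypothetical, and the sketch assumes the address is written in hex with a 0x prefix, as in the .wpm file names described later; kind is one of the four suffixes listed above.

```python
MEMBLK_KINDS = ('enc', 'dec', 'comp', 'decomp')


def memblk_filename(exe_path, address, kind):
    """Return e.g. 'sample.exe.memblk0x4000.dec' for a decrypted buffer."""
    if kind not in MEMBLK_KINDS:
        raise ValueError('unknown buffer kind: %r' % (kind,))
    return '%s.memblk0x%x.%s' % (exe_path, address, kind)


def save_memblk(exe_path, address, kind, data):
    """Write a captured buffer next to the sample and return the file name,
    so that it can also be recorded in the human readable and JSON logs."""
    filename = memblk_filename(exe_path, address, kind)
    with open(filename, 'wb') as f:
        f.write(data)
    return filename
```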

Tracking potential process hollowing

Next we’ll turn our attention to process hollowing behaviour. Process hollowing is where a process creates another process (usually in a suspended state), overwrites its memory image with (usually malicious) code, and then resumes the process. This makes the malicious process appear to be that of a legitimate executable file (in process listings anyway). We’re going to see if we can capture the (usually malicious) code that is used to overwrite the initial process’ image.

We’ll start by logging processes created using the CreateProcess() API calls. To capture the injection (the overwriting of the initial process’ image), we’ll also log calls to WriteProcessMemory(), and save the written data.

A (long) note about the CreateProcess() API call. CreateProcess() is one of a number of Win32 API calls that are actually implemented as two versions: an A version, which handles ANSI strings, and a W version, which handles wide-character (WCHAR) strings. kernel32.dll only exports CreateProcessA() and CreateProcessW(); it does not export a CreateProcess() function (in the Windows headers, CreateProcess is a macro that maps to one of the two).

Microsoft’s documentation describes a single CreateProcess() function covering both versions, as the only difference is the type of string passed to the call: CreateProcessA() takes ANSI strings, whereas CreateProcessW() takes wide-character (WCHAR) strings.

Like Microsoft’s documentation, I shall use CreateProcess() when referring to the functionality or something which applies to both versions, and name the version of the API call specifically when referring to something which needs the actual API call name (such as function hooking, or symbol references).

For instance, if I talk about hooking CreateProcess(), then I mean hooking both versions. This usually means that you’ll need two hook handlers — one for CreateProcessA() and the other for CreateProcessW().

Just to complicate this further, I’ve created a post_CreateProcess() function (see below) to act as a hook handler for both versions (CreateProcessA() and CreateProcessW()), as the parameters and functionality are the same apart from the string format.

Right, back to unpack.py. We’ll start by adding some new variables, which can be seen in sections A.9 to A.11 (inclusive). Two of them store information about the processes that are created:

createdprocesses
createsuspended

Next, we’ll add CreateProcessA(), CreateProcessW(), and WriteProcessMemory() to the apiHooks{} dictionary (section C.1), causing WinAppDbg to hook them. We’re only interested in hooking the CreateProcess() calls after they’ve completed. This way we can record information such as whether the call was successful, and if so, the process id of the new process.

Similarly, if we hook the end of WriteProcessMemory() then we can log whether or not it was successful and the number of bytes that it wrote.

Section C.5 contains the CreateProcess() hook functions. Since CreateProcessA() and CreateProcessW() differ only in the types of strings that they handle, we’ll create a hook function, post_CreateProcess(), to do the work, and create post_CreateProcessA() and post_CreateProcessW() as wrappers to post_CreateProcess(). These wrapper functions will call post_CreateProcess() with either fUnicode set to False, in the case of post_CreateProcessA() or with fUnicode set to True, in the case of post_CreateProcessW(). This approach will save having to duplicate most of the functionality to handle both CreateProcessA() and CreateProcessW().

The post_CreateProcess() function (the ultimate callback function for the CreateProcessA() and CreateProcessW() hooks) uses WinAppDbg’s Process.peek_string() to extract information from the CreateProcess() arguments and from the PROCESS_INFORMATION structure, and logs the call.

It also adds information to the createdprocesses[] array, which will enable us to access the information from within the WriteProcessMemory() hook callback.

Now, if you have a read of the post_CreateProcessA() and post_CreateProcessW() functions, you’ll also notice that they check the dwCreationFlags parameter to see if bit 2 (0x04) is set. This bit corresponds to the CREATE_SUSPENDED flag, and instructs the CreateProcess() calls to create the process without starting it. This is often done by malware which is about to hollow out the process and replace parts of the process’ address space with, wait for it, malicious code.
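Here’s a sketch of the A/W wrapper arrangement together with the CREATE_SUSPENDED check. It illustrates the idea rather than reproducing unpack.py’s code: the call arguments are passed in explicitly (how they reach a post-call handler is left out), process is assumed to offer WinAppDbg-style peek_string(address, fUnicode) and read(address, size) methods, PROCESS_INFORMATION is decoded using the 32-bit layout (four 4-byte fields: hProcess, hThread, dwProcessId, dwThreadId), and keying createdprocesses by (parent pid, hProcess) is an assumption, chosen so that a WriteProcessMemory() hook could look the target up by handle.

```python
import struct

CREATE_SUSPENDED = 0x00000004  # dwCreationFlags bit 2


class CreateProcessTracker(object):
    """Sketch of shared post-call logic for CreateProcessA/W (hypothetical)."""

    def __init__(self):
        self.createdprocesses = {}  # (parent pid, hProcess) -> details (assumed key)
        self.createsuspended = {}   # (parent pid, hThread) -> child pid

    def post_CreateProcess(self, process, parent_pid, retval,
                           lpApplicationName, lpCommandLine, dwCreationFlags,
                           lpProcessInformation, fUnicode):
        if not retval:  # CreateProcess() returns FALSE on failure
            return None
        # 32-bit PROCESS_INFORMATION: hProcess, hThread, dwProcessId, dwThreadId
        hProcess, hThread, child_pid, child_tid = struct.unpack(
            '<IIII', process.read(lpProcessInformation, 16))
        info = {
            'pid': child_pid,
            'tid': child_tid,
            # Either string may be NULL; fUnicode selects A or W decoding.
            'appname': (process.peek_string(lpApplicationName, fUnicode)
                        if lpApplicationName else None),
            'cmdline': (process.peek_string(lpCommandLine, fUnicode)
                        if lpCommandLine else None),
            'suspended': bool(dwCreationFlags & CREATE_SUSPENDED),
        }
        self.createdprocesses[(parent_pid, hProcess)] = info
        if info['suspended']:
            self.createsuspended[(parent_pid, hThread)] = child_pid
        return info

    # Thin wrappers: the A and W versions differ only in string encoding.
    def post_CreateProcessA(self, process, parent_pid, retval, *args):
        return self.post_CreateProcess(process, parent_pid, retval, *args,
                                       fUnicode=False)

    def post_CreateProcessW(self, process, parent_pid, retval, *args):
        return self.post_CreateProcess(process, parent_pid, retval, *args,
                                       fUnicode=True)
```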

If the CREATE_SUSPENDED flag is set, we use WinAppDbg’s Debug.hook_function() method to dynamically create an API hook for ResumeThread(). The hook_function() call specifies hook_createsuspendedresume() as the pre-call callback function (called before the API function’s code is executed). hook_createsuspendedresume() can be found after the CreateProcess() hooks in section C.5.

Process hollowing makes the creation of a suspended process potentially of interest, so we’ll log it. Not only that, but by instructing WinAppDbg to hook the ResumeThread() API call, we can log the address from where the suspended thread of the new process is resumed. The idea behind this logging is to give the analyst some ideas of where to look in the disassembled code, and where he/she may want to set breakpoints to do further analysis.

The process creation is logged by the CreateProcess() hooks (in post_CreateProcess()), and the resumption of a process created in a suspended state is logged by the dynamically created ResumeThread() hook, hook_createsuspendedresume().

Now, the reason for explicitly specifying a call-back function (using the preCB parameter to hook_function()) when hooking the ResumeThread() call (rather than using the default pre_ResumeThread() handler), is to allow us to potentially hook ResumeThread() for other reasons.

By using a separate hook function (hook_createsuspendedresume()) we know that ResumeThread() was potentially called to resume a process created in a suspended state (as the hook was created in the CreateProcess() hook), so we look for an entry in the createsuspended[] array. If we want to hook ResumeThread() for other reasons, then the corresponding hook function won’t need to check for a suspended process.

The createsuspended[] array (declared in section A.10) stores the process ids of processes that are created in a suspended state, and is indexed by a tuple containing the process id of the process calling CreateProcess() (that is, the new process’ parent process), and the thread handle of the created process’ thread (hThread from the PROCESS_INFORMATION structure) returned by CreateProcess(). The createsuspended[] array is updated by the post_CreateProcess() hook function.
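The lookup that hook_createsuspendedresume() performs might be sketched like this. The signature follows WinAppDbg’s pre-call callback convention (event, return address, then the hooked function’s parameters, here ResumeThread()’s single hThread), but the class, the log callback and the message format are hypothetical, and the event object is only assumed to provide get_pid(). Logging ra records the address that called ResumeThread(), which is the kind of breakpoint hint described above.

```python
class ResumeTracker(object):
    """Sketch of a dynamically registered ResumeThread() pre-call hook."""

    def __init__(self, createsuspended, log):
        # (parent pid, hThread) -> pid of the process created suspended
        self.createsuspended = createsuspended
        self.log = log

    def hook_createsuspendedresume(self, event, ra, hThread):
        key = (event.get_pid(), hThread)
        child_pid = self.createsuspended.pop(key, None)
        if child_pid is None:
            return  # not a thread we saw created with CREATE_SUSPENDED
        self.log('ResumeThread(0x%x) resumes process %d, called from 0x%x'
                 % (hThread, child_pid, ra))
```

Removing the entry once the thread has been resumed is a choice made for this sketch, since handle values can be reused later in the process’ life.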

The WriteProcessMemory() hook (post_WriteProcessMemory() in section C.6), after doing the standard logging about it being called, uses guarded_read() (my wrapper around WinAppDbg’s Process.read(); see Issues below) to dereference the lpNumberOfBytesWritten argument, giving us the number of bytes that were written.

post_WriteProcessMemory() then extracts information that the CreateProcess() hook (post_CreateProcess()) stored in the createdprocesses[] array. It extracts the process id, application name, and command line of the new process. This is done purely for logging reasons.

The main work of post_WriteProcessMemory() is to save the written memory to a file, so it uses guarded_read() to dereference the lpBuffer argument and read the data. The data is saved to a file with the suffix .memblk0x<baseaddr>-<pid>.wpm (wpm being short for WriteProcessMemory), where baseaddr is the target address in the target process, and pid is the process id of the target process.
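The core of post_WriteProcessMemory() could be sketched as follows. Again this is a hedged illustration rather than unpack.py’s code: the arguments are passed in explicitly, guarded_read(pid, address, size) is a stand-in with an assumed signature, createdprocesses is assumed to be keyed by (parent pid, hProcess), and the sketch returns the file name and data instead of writing them. lpNumberOfBytesWritten is an optional parameter (it may be NULL), so the sketch falls back to nSize; labelling unrecognised handles (such as injection into an already running process) as "unknown" is likewise its own choice.

```python
import struct


class WriteCapture(object):
    """Sketch of the post_WriteProcessMemory() logic (hypothetical names)."""

    def __init__(self, exe_path, createdprocesses, guarded_read):
        self.exe_path = exe_path
        self.createdprocesses = createdprocesses  # (parent pid, hProcess) -> info
        self.guarded_read = guarded_read          # guarded_read(pid, addr, size)

    def post_WriteProcessMemory(self, pid, retval, hProcess, lpBaseAddress,
                                lpBuffer, nSize, lpNumberOfBytesWritten):
        if not retval:  # WriteProcessMemory() returns 0 on failure
            return None
        if lpNumberOfBytesWritten:
            # SIZE_T is 4 bytes in a 32-bit debuggee.
            raw = self.guarded_read(pid, lpNumberOfBytesWritten, 4)
            written = struct.unpack('<I', raw)[0]
        else:
            written = nSize  # the optional out-parameter was NULL
        target = self.createdprocesses.get((pid, hProcess), {})
        target_pid = target.get('pid', 'unknown')
        data = self.guarded_read(pid, lpBuffer, written)
        filename = '%s.memblk0x%x-%s.wpm' % (self.exe_path, lpBaseAddress,
                                             target_pid)
        return filename, data
```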

That is all that this version is going to do as far as process hollowing/injection goes, but it paves the way for investigation into automated detection of process hollowing/injection, and the automated analysis of any injected code.

Machine readable log data

Another improvement that we’re going to make is the addition of some machine readable log data. The idea behind this is to make it easier to capture and subsequently search through the log data using log indexing/searching software such as Splunk (which I’ve been using a lot lately).

I’m trying JSON as the log data format, as it provides the key/value information that Splunk can use, and also makes it easy to load the data into JavaScript for web interfaces and the like. Time will tell whether this was the best format to use.

To create the JSON output, we’re going to add an array, eventlog[] (section A.11), into which we’ll throw JSON formatted event information. At the end of the script, we’ll log all the events from the array.

We also have some event types, so that we can later search/find logged Win32 API calls, or just the original entry point in the unpacked code, for instance. At the moment, we’ve got:

Win32 API
WinAppDbg Event
unpack event

Win32 API is used to log Win32 API calls, such as CreateProcess(), WriteProcessMemory(), etc., along with any meaningful arguments passed to them.

WinAppDbg Event is used to log events for which WinAppDbg calls a call-back function (and for which we were interested enough to create the call-back function). For instance, process creation (create_process()), process exiting (exit_process()), thread creation (create_thread()), DLL loading (load_dll()).

unpack event is an event generated by unpack.py itself when it finds the unpacking loop. It logs the unpacked entry point, the calling address, the calling instruction, and the name of the file to which it has dumped the unpacked code.
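A minimal sketch of the eventlog[] idea, assuming one dictionary per event that is written out as one JSON object per line at the end of the run. The field names (time, type, and whatever keyword arguments the caller passes) are hypothetical; only the three event type strings come from the list above. One object per line keeps each event self-contained, which suits line-oriented indexers such as Splunk.

```python
import json
import sys
import time

EVENT_TYPES = ('Win32 API', 'WinAppDbg Event', 'unpack event')

eventlog = []  # event dicts, accumulated during the run


def log_event(eventtype, **fields):
    """Record one event; the field names are up to the caller."""
    if eventtype not in EVENT_TYPES:
        raise ValueError('unknown event type: %r' % (eventtype,))
    event = {'time': time.time(), 'type': eventtype}
    event.update(fields)
    eventlog.append(event)
    return event


def format_eventlog():
    """Serialise each event as a single line of JSON."""
    return [json.dumps(event, sort_keys=True) for event in eventlog]


def dump_eventlog(out=sys.stdout):
    # Called once at the end of the script.
    for line in format_eventlog():
        out.write(line + '\n')
```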

Issues

It wasn’t all plain sailing. Here are some of the issues that I came up against while adding the new functionality.

Trying to read from guarded memory

The RtlDecompressBuffer() and CryptDecrypt() calls both take the addresses of one or two memory buffers. We want to be able to read these memory buffers to save the decompressed/decrypted data to files. This is fine, until you remember that the memory buffers may reside in some of the newly unpacked memory that our automated unpacking code identified, so there may be a memory breakpoint set which covers the memory that we’re about to read from. Don’t worry: those calls won’t make it hang, they’ll only make it feel like reading, from some unguarded memory, from some unguarded memory.

No seriously, that can actually be a problem, and it had me foxed enough to go putting print statements into WinAppDbg’s code to try and figure out why it was telling me that I couldn’t read the memory that I was trying to read. It wasn’t until I got WinAppDbg to dump the memory map, which showed the protection bits on the regions that I was trying to read, that I noticed the protection differed only by the PAGE_GUARD bit and finally twigged as to why the read was failing. Enter the guarded_read() function (section B.2).

The guarded_read() function was created to overcome this issue. This function will check all of the memory breakpoints for one which covers either the first address in the requested range, or the last address in the requested range. If either the first or last address of the requested memory range is covered by a memory breakpoint, then the function disables the breakpoint, performs the read, and then enables the breakpoint again.

Thinking about it afterwards, this algorithm won’t cover the case where a memory read spans more than two pages. This will need fixing in a future version, but as it is, it works for the samples I’ve been testing it against.
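Here’s a hedged sketch of a guarded_read() variant that sidesteps that limitation by testing every memory breakpoint for overlap with the whole requested range, rather than checking only the first and last addresses (so it is a different check from the one described above). It is written against plain callables so it can run without a debugger: read(address, size) could be WinAppDbg’s Process.read(), and the breakpoint list and enable/disable functions would presumably be backed by WinAppDbg’s page breakpoint methods on the Debug object (get_process_page_breakpoints(), enable_page_breakpoint() and disable_page_breakpoint()), though that wiring is an assumption. The try/finally ensures the breakpoints are re-enabled even if the read fails.

```python
def guarded_read(read, breakpoints, disable, enable, address, size):
    """Read memory, temporarily disabling page breakpoints that overlap it.

    read(address, size) -> bytes
    breakpoints         -> iterable of (bp_address, bp_size, enabled) tuples
    disable(bp_address), enable(bp_address) toggle one page breakpoint
    """
    end = address + size
    disabled = []
    for bp_address, bp_size, enabled in breakpoints:
        # Half-open ranges [address, end) and [bp_address, bp_address + bp_size)
        # overlap if each one starts before the other ends.
        if enabled and bp_address < end and address < bp_address + bp_size:
            disable(bp_address)
            disabled.append(bp_address)
    try:
        return read(address, size)
    finally:
        # Re-arm exactly the breakpoints we turned off.
        for bp_address in disabled:
            enable(bp_address)
```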

Missing debuggee address information for VirtualAlloc() calls

VirtualAlloc() calls VirtualAllocEx(-1,…) (-1 being the pseudo-handle for the current process), so the original unpack.py only hooked VirtualAllocEx(). The problem with that is that we lose information: the log output then shows VirtualAllocEx() being called from within kernel32.dll, rather than from within the debuggee’s code. The new version hooks and logs VirtualAlloc() as well as VirtualAllocEx().
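A sketch of logging both calls while keeping the caller’s address might look like this. The handler signatures follow WinAppDbg’s pre_ hook convention (event, return address, then the API parameters); the class, the log callback and the message format are hypothetical. The sketch records VirtualAlloc() with the -1 current-process pseudo-handle it passes on, and flags allocations whose protection includes any of the PAGE_EXECUTE* bits (0x10, 0x20, 0x40, 0x80).

```python
# PAGE_EXECUTE | PAGE_EXECUTE_READ | PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY
EXECUTE_MASK = 0x10 | 0x20 | 0x40 | 0x80
CURRENT_PROCESS = -1  # pseudo-handle returned by GetCurrentProcess()


def is_executable(flProtect):
    return bool(flProtect & EXECUTE_MASK)


class AllocLogger(object):
    """Sketch: log VirtualAlloc() and VirtualAllocEx() with the return address."""

    def __init__(self, log):
        self.log = log

    def pre_VirtualAlloc(self, event, ra, lpAddress, dwSize, flAllocationType,
                         flProtect):
        # For a direct call, ra points into the debuggee's own code.
        self._log_alloc('VirtualAlloc', ra, CURRENT_PROCESS, dwSize, flProtect)

    def pre_VirtualAllocEx(self, event, ra, hProcess, lpAddress, dwSize,
                           flAllocationType, flProtect):
        # When reached via VirtualAlloc(), ra points into kernel32.dll instead.
        self._log_alloc('VirtualAllocEx', ra, hProcess, dwSize, flProtect)

    def _log_alloc(self, name, ra, hProcess, dwSize, flProtect):
        self.log('%s(hProcess=%d, dwSize=0x%x, flProtect=0x%x%s) from 0x%x'
                 % (name, hProcess, dwSize, flProtect,
                    ', executable' if is_executable(flProtect) else '', ra))
```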

Way too slow

unpack.py was completing within a reasonable amount of time when running on the sample with the unpacking loop. However, if it was run on a sample which didn’t have an unpacking loop, then it would take yonks to run!

The slowness was because when unpack.py sees newly allocated executable memory being written to, it sets the processor’s trap flag to put it into single-stepping mode. This causes the processor to raise a single-step exception (interrupt 1, from memory) after each instruction, and each of those exceptions was then running our Python function. No wonder things were slowing down!

To get around this, I changed the tracing variable (section A.3) from a flag to a counter, and changed its initial value to -1 (an invalid counter value) to signify that we’re not tracing. This variable is incremented every time the single_step() exception handler function is called, and so acts as an instruction counter.

If this instruction counter reaches 250,000, then we assume that there isn’t an unpacking loop (or at least not one that is going to be found using this algorithm), and disable single-stepping mode.

The 250,000 value may need some tweaking (I should probably make it a configuration constant somewhere, an exercise for a future version). I picked it because it was the next round (base 10) number above the number of instructions that the particular sample I was testing executed before calling the unpacked code.
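The counter logic can be sketched in a few lines. MAX_TRACE_INSTRUCTIONS is a hypothetical name for the configuration constant suggested above, and stop_tracing is a stand-in callable; in WinAppDbg it might wrap something like event.debug.stop_tracing(event.get_tid()), but that wiring is an assumption.

```python
MAX_TRACE_INSTRUCTIONS = 250000  # hypothetical name for the 250,000 limit


class TraceLimiter(object):
    """Sketch of the tracing counter: -1 means not tracing."""

    def __init__(self, stop_tracing, limit=MAX_TRACE_INSTRUCTIONS):
        self.tracing = -1
        self.limit = limit
        self.stop_tracing = stop_tracing

    def start(self):
        # Called when newly allocated executable memory is written to.
        self.tracing = 0

    def single_step(self, event):
        if self.tracing < 0:
            return  # not tracing
        self.tracing += 1  # one more instruction executed
        if self.tracing >= self.limit:
            # Assume there is no (detectable) unpacking loop and give up.
            self.stop_tracing(event)
            self.tracing = -1
```

Passing a small limit to the constructor makes the give-up path easy to exercise without single-stepping a quarter of a million instructions.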

The End (finally!)

Oops — I accidentally clicked on Publish rather than on Save draft (anyone remember the Lemmings level Careless clicking costs lives!?), so this was published a bit sooner than I was ready to publish it and I’ve just had to rush to get the actual code posted. You may notice some updates as I fix things that aren’t quite right (but right now, I have to eat!).

If you want to have a play with my unpack.py script, you can find this updated version at http://malwaremusings.com/scripts/unpack-py/. The previous version (v2013.02.26) can also be found in the drop-down Scripts tab.

Malware Musings

Thoughts on malware and malware analysis