Certain memory conditions have to be met before malware can unpack code and run it — the memory has to be writeable to unpack code to it, and executable to be able to execute it. The question is, can we use Win32 API calls to detect malware creating these conditions, and subsequently not only detect and identify unpacked code, but also find the original entry point?
My honeynet was showing attacks, mostly from the same couple of class C networks, which were downloading malware from hosts on a different but consistent class C network. The file name of the downloaded file would change periodically, but the MD5 hash of the downloaded file seemed to change on a daily basis regardless of whether the file name changed or not.
This was suggesting that the attacks were mostly downloading the same malware, but that the malware was changing slightly on a daily basis in order to make it harder to identify later downloads as being the same as previously captured samples.
I wanted to test this theory, so I started reversing some of the samples and it wasn’t long before I noticed the same pattern of behaviour:
- Acquire a block of writeable, executable memory.
- Unpack code to the newly allocated memory.
- Transfer execution to the unpacked code in the newly allocated memory.
In order to be able to unpack code, malware needs a block of writeable memory. It has to write to that memory. It has to pass control to that block of memory. The first condition requires allocation of a block of memory with write permissions. The second requires a write memory access to that block. The third requires that the block of memory have execute permissions and also for the processor to start fetching instructions from that block of memory.
In a previous couple of blog posts, Self Modifying Code: Changing Memory Protection and Self Modifying Code: Changing EXE File Section Characteristics, I investigated the conditions that are necessary for a program to modify code which is subsequently going to execute, and how these conditions can be created.
In this post, I am going to discuss how a scriptable debugger can be used to automate detection of the above pattern of behaviour. I first used PyDbg as the debugger, but switched to WinAppDbg as I added more analysis functionality, because I found that WinAppDbg made it easier to debug multiple processes at once.
Acquiring a block of writeable, executable memory
I saw the malware samples using VirtualAlloc() with an flProtect argument of PAGE_EXECUTE_READWRITE, and wondered why it would request executable memory if it wasn’t about to unpack, or otherwise obtain, code and write it there.
So, the first step in automating this process is to detect a VirtualAlloc() request for executable memory. This can be done using an API hook. The API hooks are performed in part C (search for ‘# C.’) of my unpack.py script. Note that VirtualAlloc() calls VirtualAllocEx() with hProcess = 0xffffffff (-1), so I hooked VirtualAllocEx() (to support some further analysis functionality which isn’t included here as this post is about automated unpacking).
To hook API calls in WinAppDbg, add them to the apiHooks{} associative array (dictionary/hash), which is indexed by library file name.
WinAppDbg calls the post_VirtualAllocEx() hook callback function when the VirtualAllocEx() call returns. By hooking this call on exit rather than on entry, we have access to its return value, being the address of the allocated memory.
We need the address and size (passed as an argument to the VirtualAllocEx() call) of the allocated memory block for the next part — determining when the unpacked code is written to the allocated block of memory.
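A minimal sketch of that hook set-up follows. The apiHooks layout and the pre_/post_ callback naming follow WinAppDbg’s documented convention. The class name UnpackHandler, the _pending dictionary and the blocks list are invented for this example and are not taken from unpack.py. The sketch assumes that the post hook only receives the return value, so the size and protection are remembered in a matching pre hook.

```python
try:
    from winappdbg import EventHandler   # real base class when WinAppDbg is usable
except Exception:
    EventHandler = object                # stand-in so the sketch still loads elsewhere


class UnpackHandler(EventHandler):       # hypothetical name, not from unpack.py
    # Library file name -> list of (function name, number of arguments).
    apiHooks = {
        "kernel32.dll": [
            ("VirtualAllocEx", 5),
        ],
    }

    def __init__(self):
        super(UnpackHandler, self).__init__()
        self._pending = {}   # thread id -> (dwSize, flProtect) seen on entry
        self.blocks = []     # (address, size, flProtect) for each allocation

    def pre_VirtualAllocEx(self, event, ra, hProcess, lpAddress,
                           dwSize, flAllocationType, flProtect):
        # Called on entry: remember the arguments that are needed on return.
        self._pending[event.get_tid()] = (dwSize, flProtect)

    def post_VirtualAllocEx(self, event, retval):
        # Called on return: retval is the address of the allocated block.
        dwSize, flProtect = self._pending.pop(event.get_tid(), (0, 0))
        if retval:           # 0 (NULL) means the allocation failed
            self.blocks.append((retval, dwSize, flProtect))
```

Keying the pending arguments by thread id is one way to keep concurrent calls from different threads apart; other designs (such as reading the arguments back off the stack) would work too.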
C.2 The VirtualAllocEx() hook handler
The VirtualAllocEx() hook handler, post_VirtualAllocEx(), can be found at block C.2 in unpack.py, and is called by WinAppDbg when the VirtualAllocEx() library call returns (or is about to return).
At this point, we know that VirtualAllocEx() was called and since it is about to return, we have access to its return value, being the address of the allocated memory.
If flProtect contained any of PAGE_EXECUTE (0x10), PAGE_EXECUTE_READ (0x20), PAGE_EXECUTE_READWRITE (0x40), or PAGE_EXECUTE_WRITECOPY (0x80), then the request was for executable memory which suggests that the malware is about to unpack (or otherwise obtain) code to store there.
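The flProtect test can be written as a single mask check. This is a sketch of the idea rather than the code in unpack.py; the names EXECUTE_MASK and is_executable are made up for the example, while the constant values are the ones listed above.

```python
PAGE_EXECUTE           = 0x10
PAGE_EXECUTE_READ      = 0x20
PAGE_EXECUTE_READWRITE = 0x40
PAGE_EXECUTE_WRITECOPY = 0x80

# Any one of these bits means the pages can be executed.
EXECUTE_MASK = (PAGE_EXECUTE | PAGE_EXECUTE_READ |
                PAGE_EXECUTE_READWRITE | PAGE_EXECUTE_WRITECOPY)


def is_executable(flProtect):
    """Return True if flProtect requests any form of executable memory."""
    return bool(flProtect & EXECUTE_MASK)
```

Using a mask rather than an equality test means the check still works when a modifier such as PAGE_GUARD (0x100) is combined with the protection value.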
Our plan is to catch the malware writing to the newly allocated memory, as that will help to identify the unpacking loop, and to also catch when the processor jumps to (starts executing instructions from) the newly allocated memory.
So the next step is to set a memory breakpoint. Memory breakpoints (in this case) use guard pages to alert us when the allocated block of memory is accessed. In this case we want to know when it is written to, or you could say when we have new code on the block (and that should have generated a bad joke exception).
Memory breakpoints are created in WinAppDbg using the debugger instance’s watch_buffer() method. Now be careful with this as I spent some time trying to figure out why my memory breakpoints weren’t working.
Eventually I examined the WinAppDbg source and realised that it is using watch_buffer()’s address and size parameters to (incorrectly) calculate an end address. This is because it needs to know which memory pages are affected by the breakpoint because guard pages can only be set on whole pages.
The problem arises because of the old fence-post problem. The size of a memory page (on 80x86 processors) is 4096 bytes; however, these bytes are addressed (numbered) from 0 – 4095. That means that if you calculate which pages are affected by simply adding the size on to the base address of the first page, which WinAppDbg does (in breakpoint.py), you get one too many pages.
In other words, WinAppDbg’s watch_buffer() functionality is calculating the end address incorrectly, by adding its address and size parameters. As an example, in the case of a one page guard page starting at address 0, we have an address of 0, and a size of 4096 (the size of a page). Adding them together gives you 4096, which is actually the start address of the second page. WinAppDbg thinks that it needs to create two guard pages, one for the page starting at address 0 to cover addresses 0 – 4095, and then a second page starting at address 4096 to cover address 4096.
This off-by-one error was causing my watch_buffer() calls to fail. So s/long story/short/, that is why I use dwSize – 1 in the watch_buffer() call in unpack.py. Just thinking about it, that makes me wonder if it will ignore access to the last byte of the range now that I have subtracted 1. That is, is it setting the range correctly, but calculating affected pages incorrectly — there’s something to test.
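The arithmetic behind the fence-post problem can be reproduced in a few lines. This only illustrates the calculation described above and is not WinAppDbg’s actual code; the function names are invented for the example.

```python
PAGE_SIZE = 4096


def pages_naive(address, size):
    """Treat address + size as the last byte, which is one past the real end."""
    first = address // PAGE_SIZE
    last = (address + size) // PAGE_SIZE
    return list(range(first, last + 1))


def pages_inclusive(address, size):
    """Use address + size - 1, the true last byte of the buffer."""
    first = address // PAGE_SIZE
    last = (address + size - 1) // PAGE_SIZE
    return list(range(first, last + 1))
```

For a one-page buffer at address 0, pages_naive(0, 4096) covers pages 0 and 1, while pages_inclusive(0, 4096) covers only page 0. Passing dwSize - 1 to the naive version, pages_naive(0, 4095), also gives just page 0, which is the effect of the workaround.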
Unpack code to the newly allocated memory
After setting up guard pages using WinAppDbg’s watch_buffer(), WinAppDbg will call our guard page exception handler, guard_page(), at E.1, whenever that block of memory is accessed.
Guard page exceptions occur because a guard page was accessed. This access could be due to a memory read, memory write, or an instruction fetch (executing an instruction).
We don’t care about memory reads, but if the access was due to a memory write (see block E.1.2) then we want to attempt to find the unpacking loop. If the access was to execute instructions (see block E.1.3), then we want to log both the address, and the instruction that caused the processor to start executing instructions at that address.
Any write access is logged with the address and disassembly of the instruction that caused it. This is to log all the locations from which the allocated memory is written and make it easier to find the unpacking code in the original malware binary. Care is taken to make sure that we only log each instruction(s) once, as it (they) will more than likely be inside a loop and actually execute a number of times.
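Logging each writing instruction only once comes down to remembering which instruction addresses have already been seen. The sketch below assumes the handler is told the access type using the codes that Windows reports for access violations (0 for read, 1 for write, 8 for execute); whether guard_page() in unpack.py receives them in this form is an assumption, and the class and method names are hypothetical.

```python
FAULT_READ, FAULT_WRITE, FAULT_EXECUTE = 0, 1, 8   # access-violation style codes


class WriteLogger(object):
    """Record each instruction that writes to the watched block exactly once."""

    def __init__(self):
        self.seen = set()    # addresses of writing instructions already logged
        self.log = []        # formatted log lines

    def on_guard_page(self, fault_type, pc, disasm):
        # Reads and instruction fetches are not handled here.
        if fault_type != FAULT_WRITE:
            return False
        # The instruction is probably in a loop, so skip repeats.
        if pc in self.seen:
            return False
        self.seen.add(pc)
        self.log.append("written from 0x%x (%s)" % (pc, disasm))
        return True
```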
Since the instruction(s) that wrote to the newly allocated block of memory are more than likely inside some sort of unpacking loop, let’s have some fun and attempt to find the bounds of the loop. To do this, unpack.py calls the WinAppDbg debugger instance’s start_tracing() method to turn on single-step debugging for the thread that caused the guard page write exception.
Single-stepping causes the processor to issue a single-step exception after executing each instruction. WinAppDbg catches these exceptions and calls our single-step exception handler, single_step(), at E.2.
E.2 Single-step exception handler
The single-step exception handler is used so that we can attempt to find the unpacking loop, and the address of the instruction that transfers control to the unpacked code.
The single-step exception handler, single_step(), is located at E.2 in unpack.py. I’m not overly happy with my scripting for single_step(), as I’m not convinced that I’ve arrived at the most elegant algorithm, and it isn’t perfect. Under some circumstances it can fail and end up single-stepping through library functions. It is, however, the best that I’ve come up with to date, so I’m sticking with it for now.
The single-step exception handler attempts to find the bounds of the unpacking loop by assuming that the write instruction (that caused the guard page write exception) is in the middle of the loop. It watches the exception address, which for a single-step exception is the address of the next instruction to be executed, and which, under normal processing, should keep incrementing.
When it notices the exception address go backward, it assumes that the processor has reached the end of the loop body and is either looping back to re-evaluate the loop condition, or that it has just re-evaluated the loop condition and it is looping back for another iteration. Either way this lower address is stored as being the start of the loop. The previous execution address (that is the address before it dropped back) is stored as being the end of the loop, as that address obviously holds the instruction that caused the processor to jump back to the lower address.
Any subsequent instructions that are between these two addresses, that is, they are part of what we suspect is the unpacking loop, are then disassembled (if they haven’t already been disassembled — remember, they’re in a loop, and we only want to disassemble them once).
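The backward-jump heuristic can be sketched as a scan over the sequence of exception addresses. This is an illustration of the idea under the stated assumption that the first backward step marks the loop, not the algorithm from unpack.py; find_loop_bounds is a made-up name.

```python
def find_loop_bounds(trace):
    """Return (start, end) of the suspected loop, or None if none was seen.

    trace is an iterable of exception addresses in the order they occurred.
    The first time an address is lower than the one before it, the lower
    address is taken as the loop start and the previous one as the loop end.
    """
    previous = None
    for address in trace:
        if previous is not None and address < previous:
            return address, previous
        previous = address
    return None
```

Fed the addresses from the sample1 log shown later in this post, starting at the first write (0x40510c) and ending with the jump back to 0x405102, it returns (0x405102, 0x40511f). A call or jump to any lower address would fool it in the same way, which is one reason the approach can end up wandering through library code.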
The final trick of the single-step handler is to remember the last two addresses that were executed. We need to know the second last address to log the address and instruction that transferred control to the unpacked code. The single-step debugging continues until we notice the processor executing instructions from the newly allocated memory.
Now I’m going to attempt to explain why the single-step exception handler needs to remember the last two addresses in order to find the address that passes control to the newly allocated memory. Since I got up at 03:50 after around four hours of sketchy sleep, this could be interesting!
The single-step debug exception occurs after an instruction has executed, and the exception address/return address is that of the next instruction to be executed. This means that when the single-step handler returns, control will resume with the next instruction. The guard page exception occurs upon access to the protected memory.
Consider the following sequence of instructions, but first, allow me to set the scene. It’s a dark and dreary night, seems like nothing’s going right, and this code has just called VirtualAlloc() to request some memory of the executable variety. It has just finished unpacking and making a nice cosy nest for itself in the new block of memory, and is about to execute the following instructions.
The instruction at address 0x3c1530 is an unpacked instruction in the newly allocated block of memory, which the call edx instruction at 0x405153 is hoping to surprise us all by passing control to:
; edx contains 0x3c0000 at this point
0x40514d: adc edx,0x1530
0x405153: call edx ; transfer control to unpacked code
0x405155: add esp,0x0c
...
0x3c1530: push ebp
Here is what I believe happens. The processor executes the instruction at address 0x40514d and generates a single-step debug exception. WinAppDbg catches this exception and passes control to our single-step debug exception handler which goes by the imaginative name of single_step() (the name is set by WinAppDbg).
The exception address, in the case of single-step debug exceptions, is the address to which the handler will return which, since we don’t want to execute the same instruction again, will be the address of the next instruction to execute. In this case, that will be 0x405153.
Not wanting the call edx instruction to feel left out, and the fact that it is the next instruction probably has a lot to do with it, the processor executes the call edx instruction at 0x405153. Again, our single-step debug exception handler gets control but this time the exception address, being the address of the next instruction to execute, will be 0x3c1530 — the destination of the call instruction.
So at the end of single_step(), lasteip[0] will be 0x405153, lasteip[1] will be 0x3c1530.
When the processor attempts to fetch the instruction from address 0x3c1530, it triggers the guard page and our guard page handler, guard_page(), is called. Since the single-step exception handler gets the address of the next instruction to be executed, our guard_page() function sees 0x3c1530 as being the last address to execute. Hence we need to keep track of the last two addresses.
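A two-element history is enough to recover the calling instruction. The sketch below replays the example above; the lasteip name comes from the post, while the StepHistory class and its methods are hypothetical.

```python
from collections import deque


class StepHistory(object):
    """Remember the last two addresses reported by single-step exceptions."""

    def __init__(self):
        self.lasteip = deque(maxlen=2)   # oldest entry first, newest last

    def single_step(self, next_eip):
        # Each single-step exception reports the NEXT instruction to execute.
        self.lasteip.append(next_eip)

    def caller_of(self, fault_address):
        # On an execute fault, lasteip[1] is the faulting address itself,
        # so the instruction that transferred control is lasteip[0].
        if len(self.lasteip) == 2 and self.lasteip[1] == fault_address:
            return self.lasteip[0]
        return None
```

After single_step(0x405153) and single_step(0x3c1530), caller_of(0x3c1530) returns 0x405153, the address of the call edx instruction.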
Transfer of execution to the unpacked code
Right, back to the guard page exception handler, guard_page(), at E.1. If we look out for a guard page exception caused by an instruction fetch (instruction execution) (see block E.1.3), then we can easily find the original entry point of the unpacked code.
The entry point will be the first address to cause a guard page execution exception, in the newly allocated memory. guard_page() checks for this condition (see block E.1.3) and if found, logs the entry point, and the address and disassembly of the instruction which transferred control to the new memory block (see E.1.3.1).
After logging the above information, guard_page() searches the list of executable memory blocks (see E.1.3.2) that were allocated by VirtualAllocEx() and recorded by post_VirtualAllocEx() (see C.2.2.1), until it finds the block of memory containing the address that the processor just jumped to (see E.1.3.3). It deletes the memory breakpoint (see just after E.1.3.3), stops the single-step debugging (see E.1.3.4), resets the variables used to find the unpacking loop (see E.1.3.5), and dumps the block of memory to a file (see E.1.3.6).
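Finding the block that contains the entry point is a simple range test over the recorded allocations. The following is a sketch with invented names (find_block, dump_range); blocks stands in for whatever list post_VirtualAllocEx() maintains, which is assumed here to hold (base, size) pairs.

```python
def find_block(blocks, address):
    """Return the (base, size) pair whose range contains address, else None."""
    for base, size in blocks:
        if base <= address < base + size:
            return base, size
    return None


def dump_range(base, size):
    """Return the inclusive first and last addresses of the block."""
    return base, base + size - 1
```

With the sample1 values, find_block([(0x9c0000, 102400)], 0x9c1530) returns (0x9c0000, 102400), and dump_range(0x9c0000, 102400) gives 0x9c0000 and 0x9d8fff, matching the range in the log.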
Demonstration
I shall use some malware samples which were all obtained in a similar manner, that is, via the similar looking attacks and subsequent download methods mentioned at the start of this post.
I’m going to call the malware samples sample1 through sample3, and I’m going to run them using my unpack.py script, on my Cuckoo VM. I haven’t yet created a Cuckoo package to use unpack.py, but that would be an easy way to automate this.
sample1
c:\unpack.py sample1.exe
[*] <1572:1576> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,102400,0x3000,0x040) = 0x9c0000
[*] Request for EXECUTEable memory
[*] VirtualAlloc()d memory address 0x9c0000 written from 0x40510c (and byte [edi+esi], 0x0)
[*] VirtualAlloc()d memory address 0x9c0000 written from 0x405110 (xor [edi+esi], dl)
0x405102: mov dl, [esi+ebx]
0x405105: inc ecx
0x405106: sub dl, [ecx+sample1!0x4057]
0x40510c: and byte [edi+esi], 0x0
0x405110: xor [edi+esi], dl
0x405113: sub ecx, 0x1
0x405116: jz sample1!start+0xd9
0x40511a: sub ecx, ecx
0x40511c: inc esi
0x40511d: cmp esi, eax
0x40511f: jb sample1!start+0xc1
[*] Found unpacked entry point at 0x9c1530 called from 0x405154 (call esi)
[-] Unpacking loop at 0x405102 - 0x40511f
[-] Dumping 102400 bytes of memory range 0x9c0000 - 0x9d8fff
[*] <1572:1576> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,25600,0x3000,0x004) = 0x9e0000
[*] <1572:1576> 0x402a6e: VirtualAllocEx (0x6c,0x0,4,0x1000,0x040) = 0xf40000
[*] Request for EXECUTEable memory
...
[*] Exit process event for pid 1572 (sample1.exe): 0
[*] Terminating
This log snippet shows what looks like a VirtualAlloc() call. The reason it looks like a VirtualAlloc() call is because it was called from 0x7c809af9 which looks more like the address of library code than of user written code. Plus the first parameter to VirtualAllocEx() is 0xffffffff (-1) and VirtualAlloc() calls VirtualAllocEx() with hProcess == 0xffffffff (-1). Anyway, I digress.
The log snippet starts off with a VirtualAlloc() request for executable memory, as indicated by the flProtect argument of 0x040 (PAGE_EXECUTE_READWRITE). The returned value shows that the memory was allocated at address 0x9c0000. The numbers separated by a ‘:’, in angled brackets, are the process id and thread id respectively.
The next couple of lines show two instructions, and their addresses, which perform memory write operations on the allocated memory block at 0x9c0000. As you can see, the first instruction, and, has a second operand of 0, so it initialises the byte of memory to 0 before the next write instruction, xor, stores the contents of the dl register there (as anything xored with 0 equals itself).
Following that, unpack.py believes that it has identified the unpacking loop between addresses 0x405102 and 0x40511f (inclusive) and has disassembled that block of addresses. It certainly looks like an unpacking, well, more of a simple decryption, loop.
The log snippet then shows that unpack.py has identified the entry point in to the decrypted code, at address 0x9c1530, and that it was a call esi instruction at address 0x405154 that transferred control there.
It then informs us that it is dumping a block of memory, and shows another VirtualAlloc() call which is ignored as flProtect is 0x004 (PAGE_READWRITE). Following this is an actual VirtualAllocEx() call (notice it is called from 0x402a6e, which looks much more like our malware code than library code, and that hProcess is 0x6c).
From previous analysis I can tell you that it has found explorer.exe and that that last VirtualAllocEx(0x6c,…) call is attempting to allocate executable memory inside explorer.exe, but I won’t go in to that here.
Eventually the process terminates (after injecting code in to explorer.exe, again from previous analysis).
sample2
[*] <828:1744> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,102400,0x3000,0x040) = 0x3c0000
[*] Request for EXECUTEable memory
[*] VirtualAlloc()d memory address 0x3c0000 written from 0x40510b (and byte [edx+eax], 0x0)
[*] VirtualAlloc()d memory address 0x3c0000 written from 0x40510f (xor [edx+eax], bl)
0x405101: mov bl, [eax+ecx]
0x405104: inc esi
0x405105: sub bl, [esi+sample2!0x4057]
0x40510b: and byte [edx+eax], 0x0
0x40510f: xor [edx+eax], bl
0x405112: sub esi, 0x1
0x405115: jz sample2!start+0xd8
0x405119: sub esi, esi
0x40511b: inc eax
0x40511c: cmp eax, edi
0x40511e: jb sample2!start+0xc0
[*] Found unpacked entry point at 0x3c1530 called from 0x405153 (call edx)
[-] Unpacking loop at 0x405101 - 0x40511e
[-] Dumping 102400 bytes of memory range 0x3c0000 - 0x3d8fff
[*] <828:1744> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,25600,0x3000,0x004) = 0x3e0000
[*] <828:1744> 0x402a6e: VirtualAllocEx (0x6c,0x0,4,0x1000,0x040) = 0xe80000
[*] Request for EXECUTEable memory
...
[*] Exit process event for pid 828 (sample2.exe): 0
[*] Terminating
sample3
[*] <624:620> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,102400,0x3000,0x040) = 0x890000
[*] Request for EXECUTEable memory
[*] VirtualAlloc()d memory address 0x890000 written from 0x40110f (and byte [ebx+edi], 0x0)
[*] VirtualAlloc()d memory address 0x890000 written from 0x401113 (xor [ebx+edi], al)
0x401105: mov al, [edi+ecx]
0x401108: inc edx
0x401109: sub al, [edx+sample3!start+0x4016]
0x40110f: and byte [ebx+edi], 0x0
0x401113: xor [ebx+edi], al
0x401116: sub edx, 0x1
0x401119: jz sample3!start+0xdc
0x40111d: sub edx, edx
0x40111f: inc edi
0x401120: cmp edi, esi
0x401122: jb sample3!start+0xc4
[*] Found unpacked entry point at 0x891530 called from 0x401157 (call esi)
[-] Unpacking loop at 0x401105 - 0x401122
[-] Dumping 102400 bytes of memory range 0x890000 - 0x8a8fff
[*] <624:620> 0x7c809af9: VirtualAllocEx (0xffffffff,0x0,25600,0x3000,0x004) = 0x8b0000
[*] <624:620> 0x402a6e: VirtualAllocEx (0x44,0x0,4,0x1000,0x040) = 0x15f0000
[*] Request for EXECUTEable memory
...
[*] Exit process event for pid 624 (sample3.exe): 0
[*] Terminating
Notice how all three samples implement the same algorithm but use different registers to do so. Also the instructions are at different addresses. Now for the moment of truth, were all the unpacked/decrypted contents the same?
$ md5sum -b *memblk* | cut -d' ' -f1 | sort | uniq
f3dce8f739fc86b1395ff6458092bf33
Yes they were.
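The same comparison can be done from Python when the dumps need to be grouped rather than just counted. This is a sketch using only the standard library; group_by_md5 is a made-up helper and the file names passed to it are whatever the dump step produced.

```python
import hashlib
from collections import defaultdict


def group_by_md5(paths):
    """Map each MD5 hex digest to the list of files that share it."""
    groups = defaultdict(list)
    for path in paths:
        with open(path, "rb") as handle:
            digest = hashlib.md5(handle.read()).hexdigest()
        groups[digest].append(path)
    return dict(groups)
```

If all of the dumped memory blocks hold the same unpacked code, the returned dictionary has a single key, which corresponds to the single line printed by the md5sum pipeline above.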
Conclusion
Wrapping up then, as this post is getting long and I really need to get some sleep. My unpack.py script checks for malware using VirtualAlloc()/VirtualAllocEx() to request executable memory. It then watches that memory for writes and instruction fetches and, with the help of some single-step debugging, is able to find the instructions that write to the memory, often find the unpacking/decrypting loop, save the resulting memory block to a file, and identify the entry point in to the unpacked/decrypted code along with the address and instruction that transferred control to it.
There are a number of samples that unpack.py doesn’t work on. Some of the ones I tried didn’t seem to exit, nor did unpack.py detect control being passed to the executable memory (I’m assuming it did, but I haven’t analysed it yet). Having said that, it did still prove reasonably useful in that it still identified what looked like the unpacking/decrypting loop of some samples, and showed how some of those loops were pretty similar.
unpack.py doesn’t make any attempt to hide the debugger, and it doesn’t (yet) use WinAppDbg’s hostile mode to counter some anti-debug tricks. Plus it is possible for malware to foil the single-step debugging and guard pages.
For an encore, I created an automation script to take a batch of malware samples and for each one, start a VM, run the sample using unpack.py, and monitor its log file and shut the VM down when the process terminated (or after a predetermined amount of time if the sample didn’t terminate). The log files were written to a shared folder (yes, I know, possibly not the best idea). I then used the md5sum command to identify which samples were ultimately running the same unpacked/decrypted code, and celebrated with a small Mexican Wave involving only myself.
I’m tempted to start work on a Cuckoo package to run samples using unpack.py, and use that to replace my automation script. I also added more functionality to my unpack.py script so that it analyses more than just the unpacking/decrypting component.
I also want to add the ability to detect malware unpacking/decrypting code in to another process, and injecting code in to another process which then unpacks/decrypts code. A good place to start would be to actually add the API hook code for VirtualProtect() to unpack.py, to go with the fancy looking comment that I have in there. That would be one way to bypass unpack.py as it is — VirtualAlloc() non executable memory and then call VirtualProtect() to make it executable.
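One way such a VirtualProtect() hook could close that gap is to track the protection of every allocated region, executable or not, and notice when a region changes from non-executable to executable. The sketch below is purely hypothetical: RegionTracker and its methods do not exist in unpack.py, and it simplifies by applying the new protection to a whole tracked region even when the call only covers part of it.

```python
EXECUTE_MASK = 0x10 | 0x20 | 0x40 | 0x80   # the four PAGE_EXECUTE* values


class RegionTracker(object):
    """Track allocations and spot regions that become executable later."""

    def __init__(self):
        self.regions = {}    # base address -> (size, current protection)

    def on_alloc(self, base, size, protect):
        # Record every allocation; return True if it is executable already.
        self.regions[base] = (size, protect)
        return bool(protect & EXECUTE_MASK)

    def on_protect(self, address, size, new_protect):
        # Return the bases of tracked regions that just became executable.
        became_executable = []
        for base, (region_size, old_protect) in self.regions.items():
            overlaps = address < base + region_size and base < address + size
            if not overlaps:
                continue
            if new_protect & EXECUTE_MASK and not old_protect & EXECUTE_MASK:
                became_executable.append(base)
            self.regions[base] = (region_size, new_protect)
        return became_executable
```

A region returned by on_protect() would then be treated like a block that was allocated as executable in the first place, although by that point the unpacked code may already have been written to it.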
I’ll save all that (and more) though, for another time. Now this post is even longer and I’m still not sleeping — time to stop typing and see if I can stop thinking about this stuff for long enough to get some sleep.
This is very cool. One question, how can I analyze a piece of malware in a DLL?
For example, let’s say the malware is typically launched with this command (and entry point of course):
rundll32.exe malware.dll,DllMain
Hi Mick,
If you are talking about analysing it with my unpack.py, then you can’t at the moment as you’ve pointed out something that I didn’t think of — passing command line arguments to the sample.
I’ve just quickly checked the WinAppDbg documentation and it certainly looks possible, so I’ll work on that at a later date. Unfortunately at the moment, we have a nine day hot air balloon festival going on which is getting me six or so hours of paid work a day, but leaving me reasonably tired.
I shall endeavour to add this functionality to unpack.py though, as some malware requires command line arguments before it will run properly.
If you are wanting to analyse DLL files that were obtained as the result of a MySQL attack, then you can use my misql.py Cuckoo package.
I have noticed similar MS-SQL attacks, and have started work on a similar MS-SQL package.
Musingly,
Karl.