I was asked if I could look at a WordPress website which wasn’t displaying correctly. It was showing an index of files in the document root directory, rather than showing the home page. This suggested that the index.html (UNIX), index.htm (Windows), or in the case of WordPress, index.php file was missing. Read on and I’ll talk you through how I recovered the site.
The first step is to start constructing a timeline so that we can start to piece together what happened, and when. In digital forensics (and I suspect in other forensics fields too), there is a notion of the order of volatility. Basically, some pieces of evidence are more volatile (easily lost) than others and, just like the observer effect in physics (yes, would you believe I ended up reading up on quantum mechanics whilst checking my facts for this blog post!), the act of observing evidence/artefacts can alter or destroy those very evidence/artefacts.
I found various mentions of the order of volatility when I went looking, and while it is a good idea to have a documented order for evidence gathering to help maintain repeatability and so that you can demonstrate how you obtained the evidence, a good technical understanding of the system that you’re examining can help you determine the best order for gathering evidence. However, you need to realise that you don’t necessarily know (and can’t necessarily know) what it is that you don’t know.
For instance, if you didn’t realise that you didn’t know about last accessed time stamps (atime), and didn’t have any prior experience or knowledge to suggest that they may exist, then you’re not likely to go looking for ways to read a file without modifying the atime, and hence your method of gathering evidence may go clobbering them. This, while understandable, is regrettable.
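As a concrete example, GNU stat(1) will show all three time stamps of a file without you having to read its contents (the file name below is just a placeholder, not one from this investigation):

# show a file's time stamps without reading its contents -- the default GNU stat(1)
# output includes Access (atime), Modify (mtime), and Change (ctime) lines
# ("somefile.txt" is just a placeholder)
stat somefile.txt
# note: many Linux systems mount file systems with relatime, so merely reading a
# file won't always bump the atime -- check the mount options if you're unsure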
It is a good idea to have a documented methodology, because it can be useful to either help you prove the reliability of your evidence, or to allow a defence attorney to poke holes in it, depending on your methodology. The latter may be bad for you, but if you did make a mistake, or there was an error in your logic, your documentation of your method may prevent an innocent person from being sentenced, which is a good thing. Anyway, I digress.
As I mentioned above, our first step is to start piecing a timeline together, and one of the quickest ways to kick that off is to record all of the file time stamps. It is also a good idea to do this before doing anything else, so that we don’t inadvertently clobber file time stamps by reading files. So we’ll do this before taking a backup, because taking a backup will read the files, altering the last accessed time stamp (atime) in the process. Reading the file time stamps only reads the file inodes and leaves all three time stamps intact.
I’ve found that the UNIX find(1) command (or more precisely, the Linux find(1) command, because I use -printf which is a GNU extension and not POSIX compliant) is a quick and easy way to get all of the file time stamps, and I’ll often use it to run a file hashing command like md5sum(1), or probably better these days, sha256sum(1). Now, since this website was in a multi-tenant hosted environment, all of the files pertaining to our website are sitting under our user’s home directory, so we’ll run the find(1) command from there and save the output to a file in the /tmp/ directory.
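As a side note on the hashing idea, here is a minimal sketch of what that might look like. Run it only after the time stamps have been recorded, because hashing has to read the file contents and will therefore update the atimes; the output file name is just an example, not one of the files from this investigation:

# hash every regular file under the current directory -- GNU find(1) and coreutils
# (the output file name is just an example)
find . -type f -exec sha256sum {} + > /tmp/filehashes.txt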
Note that there are some important points to make here. Firstly, we’ll unset(1) any history related environment variables that are set, to stop the shell from writing to the .bash_history file (or equivalent) and hence clobbering the last modified time stamp thereof; secondly it is important to NOT save the output to a file in the user’s home directory, nor to any subdirectory thereof, because doing so will alter the last modified time stamp of the directory that the file is written to; and finally, we’ll change the umask(1) so that when the shell creates the output file, it isn’t readable by everyone else on the system!:
# get a list of file time stamps
unset `set |grep HIST |cut -d= -f1`
umask 077
cd "$HOME"
find . -printf "%h/%f\t%TF %TT\t%AF %AT\t%CF %CT\n" > /tmp/filetimes.txt

# sort it by last modified time stamp (second column)
sort -t\ -k2 /tmp/filetimes.txt > /tmp/timeline-s.txt
Before we do anything else, it is a good idea to copy that file off that host and to a safe place. We’ll also discuss what it is we’ve just done.
That find(1) command printed (-printf) the following details for each item (file, directory, symbolic link, etc.) that it found underneath the user’s home directory:
%h
: The path to the item

%f
: The name of the item (this, with the preceding '%h/', prints out the path to the file/item)

\t
: A tab character — we’ll use tab rather than space because we’ve used a space in the time stamps. Using a tab character here will make it easier to pull/sort by individual columns later. Alternatively, we could have separated the date and time using a different character ('T' is a popular choice), and then we could have used a space (instead of a tab character) to separate the columns

%T
: This prints a component of the last modified time stamp, with the following letter dictating which component. %TF prints the date in ISO format, and %TT prints the time in 24-hour format. Combine the two and you get the last modified time stamp in YYYY-MM-DD HH:MM:SS format

%A
: This prints a component of the last accessed time stamp, and is otherwise the same as %T

%C
: This prints a component of the inode’s last changed time stamp, and is otherwise the same as %T

\n
: A newline character. Without this, all of the file names and time stamps would be printed on a single line — try it!
The three time stamps — last modified, last accessed, and inode last change time, are often referred to as the MAC times of a file/directory. Whenever I record file system time stamps, I’ll always record them in that order, to coincide with the MAC acronym, so that there’s no confusion over which time stamp is which. We can tell the UNIX sort(1) command to sort by a different column if we want to reorder them based on last accessed time for instance.
Speaking of sorting by different columns, this is an example of another reason why I separated the fields — the file path and the three time stamps — using a tab character, rather than a space. Using a tab character not only lets us use a space character between the date and the time (making the time stamps easier to read), but ever since Windows 95, users (mostly Windows users, rather than UNIX users) seem awfully keen to put spaces in file names, whereas they’re not so likely to put tab characters in file names. Using a tab character as a delimiter helps protect our columns against spaces in file names.
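For instance, sticking with tab delimiters, we can re-sort the timeline by the atime column, or pull out just the paths, without spaces in file names breaking the columns. A small sketch (bash syntax; $'\t' is just one way of handing sort(1) a literal tab character, and the output file name is arbitrary):

# re-sort the timeline by the last accessed time stamp (third tab-delimited column)
sort -t$'\t' -k3 /tmp/filetimes.txt > /tmp/timeline-a.txt

# pull out just the file paths (first tab-delimited column); cut(1) uses tab by default
cut -f1 /tmp/filetimes.txt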
Now that we’ve got a record of the three time stamps (mtime — last modified, atime — last accessed, and ctime — inode last change time) of all of the files, and we’ve saved it in a safe (and private) place, we’ll take a backup of the site in its compromised state. This is us essentially preserving the current state as best we can. We want to preserve it for two reasons: firstly, in case we bollocks something up whilst attempting to repair it; and secondly, in case we determine that the site was attacked, in which case we’ve preserved evidence.
The following tar(1) command will clobber the last accessed time stamps (atime), but we have them in the timeline file (filetimes.txt) that we created above, should we need to refer to them; consequently I didn’t bother with tar(1)'s --atime-preserve option.
It is worth noting here that tar(1) has an option, --atime-preserve, which will preserve the last accessed time stamp of each file. However, it has two ways of doing so: one is to read the atime, read the file, then write the atime back to what it was; the other is to pass the O_NOATIME flag to the open(2) system call, which tells the operating system not to change the atime when it reads the file. Since the atime is stored in the file’s inode, the former approach — reading it then writing it back again — will actually modify the ctime (the inode’s last change time) of the file. The latter approach (specified by using tar(1)'s --atime-preserve=system command line option) will preserve all three time stamps.
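Had we wanted the backup itself to leave all three time stamps alone, the invocation would have looked roughly like this. This is a sketch only: as noted above I didn't use it here, and it needs GNU tar on an operating system that supports O_NOATIME (which generally also requires that you own the files):

# back up the home directory without touching the MAC times (GNU tar; not the
# command actually used in this investigation)
tar --atime-preserve=system -zcvf /tmp/backup-yyyy-mm-dd.tar.gz .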
Just as an aside, this also got me wondering whether the find(1) command, and/or the ls(1) command for that matter, modify a directory’s atime when they read the files in the directory. I did a test, using the -tu ls(1) command line options to display the last accessed time stamp (atime), and neither the find(1) command nor the ls(1) command modified the atime (possibly because they use opendir(3)/readdir(3) rather than open(2)/read(2)) — at least not on my system.
# take a backup, again storing it outside of our home directory
cd $HOME
tar -zcvf /tmp/backup-yyyy-mm-dd.tar.gz .
In this case, I was fortunate because there were a few backups taken before the website software was upgraded. So I took the fresh backup that I had just taken, and compared it with the latest previous backup which turned out to be about four months old.
Comparing the contents of the two backups will give us a bit of an idea as to which files have been deleted, created, and modified over the previous four months (between the two backups), and hence may give us a clue as to what happened to the website. Bear in mind that at this point, I didn’t know that the website had been hacked — it looked like it could have been a configuration problem, or an administrator having accidentally deleted a file (or more). When you work in security, it’s easy to fall into the trap of thinking that there is malicious intent behind everything (I was reminded once that there is a town in Australia called Port Hacking!).
All I had to go on at this point was the symptom — namely that the web server was giving a list of files in the directory, rather than showing the website’s home page. This suggested that the index.html file, or a similar page like index.php in the case of WordPress, was missing or not readable by the web server.
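In hindsight, a couple of quick checks for exactly that could have been run from a shell on the web server itself. A sketch only, with paths following this site's layout:

# is the index file there at all, and is it readable?
ls -l public_html/index.php

# can the web server descend into the directory (execute/search permission)?
ls -ld public_html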
Here we go — I’ll stop waffling on and get down to the nitty gritty. I was running these commands on a different system — one which had the backup files on it, and not on the website server itself. Here then, is what you’ve all come here to see — some hard core UNIX commands:
# record the contents of each of the two backups -- just the file names to start with
tar -ztf backup-6.18.2020_20-47-22_username.tar.gz |sed 's#^[^/]*##' |sort > 6.18.2020.txt
tar -ztf backup-10.10.2020_08-35-09_username.tar.gz |sed 's#^[^/]*##' |sort > 10.10.2020.txt

# what's the difference between them?
# exclude /homedir/mail/ because that just shows loads of mail files,
# which don't affect the website
# and all the output may make it harder to spot something relevant
diff -u 6.18.2020.txt 10.10.2020.txt |grep -v "^/homedir/mail/" > diffme

# get a list of the directories from which files were removed
# and replace the leading directory with '.' because our timeline has paths
# starting with './' and we want to be able to search for these directories in
# our timeline
grep "^-/" diffme |sed 's#^-/homedir#.#;s#[^/]*$##' |sort |uniq

# search for those directories in our timeline to see when they were last modified
# which indicates file creation/deletion within the directory (either that or
# someone doing something stupid like running vi(1) on the directory and doing a
# write command)
# sort the output by last modified time stamp (the second column/field -- hence
# the '-k2' option)
# that is a tab character after the '-t\' -- note that if you are using the bash(1)
# shell, or other Bourne shell derivative you can enter a tab character by pressing
# CTRL-V then the TAB key
grep "^-/" diffme |sed 's#^-/homedir#.#;s#/[^/]*$# #' |sort |uniq |grep -Ff - timeline-s.txt |sort -t\ -k2 > missingfiledirmtimes.txt
That last command spat out something interesting:
./public_html/wp-content/plugins/wp-file-manager	2020-06-19 08:39:53.2659755220	2020-10-06 20:42:09.4449154700	2020-06-19 08:39:53.2659755220
./public_html/wp-content/plugins/wp-file-manager/js	2020-06-19 08:39:53.2659755220	2020-10-06 20:42:09.4569154700	2020-06-19 08:39:53.2659755220
./public_html/wp-content/plugins/wp-file-manager/languages	2020-06-19 08:39:53.2839755220	2020-10-06 20:42:09.4569154700	2020-06-19 08:39:53.2839755220
./public_html/wp-includes/Requests/Exception/HTTP	2020-09-05 06:36:54.6429875680	2020-10-06 20:42:09.4689154700	2020-09-05 06:36:54.6429875680
./public_html/wp-content/plugins/the-events-calendar/src/views/day	2020-09-05 06:37:33.5899875140	2020-10-06 20:42:09.4099154700	2020-09-05 06:37:33.5899875140
./public_html/wp-content/plugins/wp-file-manager/lib/php	2020-09-06 04:14:27.0149976020	2020-10-06 20:42:09.4459154700	2020-09-06 04:14:27.0149976020
Now, notice anything interesting about those last modified time stamps (the second column/first time stamp on each line)? There is an almost three-month gap where no directories were modified, then some get changed on the 5th of September 2020. The fact that the earlier time stamps aren’t recent suggests that these directories are not normally modified during day-to-day operation of the site, and that changes to them indicate admin or other out-of-the-ordinary activity. That makes the modifications starting on the 5th of September suspicious, especially as there was no recorded admin activity or changes taking place at that time.
This is also a classic example of why it is a good idea to document what changes are made to a system, along with when they were made. I know from such data that the File Manager plugin (wp-file-manager) was upgraded on the 19th of June 2020, which would account for the plugins/wp-file-manager/ directories being modified then, but there was no administrative reason why the other directories were modified on the 5th of September, suggesting that those changes need to be investigated.
Just before I continue with the more interesting part of investigating suspicious behaviour, that last step was a case-in-point highlighting the importance of documenting your investigation procedures. I am creating this blog post using notes that I made when investigating this incident back in October 2020 — notes which included the commands that I used to generate the log data and conclusions.
I just noticed that the sort(1) command that I used for the last step didn’t actually do what I described — I was missing the ‘-t\ ‘ command line option telling sort(1) to use the tab character as the field/column delimiter, so it was sorting using its default delimiter of a non-white space to white space transition (that is, using white space as the field/column delimiter).
This was a problem because some of the file/directory names contain spaces (and the directory names that contained spaces? …'/wp-file-manager/lib/themes/windows – 10' — I’m saying nothing). Consequently the output (missingfiledirmtimes.txt) wasn’t completely sorted by last modified time stamp. I corrected this in the command that I included above, but it highlighted my point about the importance of documenting your investigation procedure and the steps that you took — it allowed me to see that my method was incorrect and hence that any conclusions I had drawn from that output may also be incorrect. Also, because I had kept all of the original files, I was able to regenerate correct output and fix the mistake.
Now here’s some UNIX trickery to list all directories that were last modified after 00:00:00 on the 1st of September 2020 (after the File Manager plugin upgrade and just before the suspicious activity), and then look for files that were deleted from those directories:
# find mtimes after a certain time -- let's go with 2020-09-01 00:00:00
echo "cut 2020-09-01 00:00:00" |cat - missingfiledirmtimes.txt |sort -t\ -k2 |sed '1,/^cut /d'

# now add to that to find files that were deleted from those directories:
echo "cut 2020-09-01 00:00:00" |cat - missingfiledirmtimes.txt |sort -t\ -k2 |sed '1,/^cut /d;s#^\.#-/homedir#;s# .*#/[^/]\\+$#' |grep -f - ./diffme > deletedfilesafter20200901.txt
Also, just as a side note, I’ve learnt that it’s a good idea to document your sed(1) commands, so that when you come back to them three months later, you don’t have to spend ages figuring out what they do!
That last command prefixes (using the cat(1) command with a ‘-‘ to signify standard input — in this case from the piped echo(1) command) the missingfiledirmtimes.txt log data (which is our list of directories from which files were deleted, along with their last modified time stamps) with the word ‘cut’ followed by a tab character followed by a time stamp.
The word ‘cut’ is arbitrary, but should be distinct enough to guarantee that it doesn’t occur in the missingfiledirmtimes.txt file (before we add it that is). The tab character is important, as that puts the following time stamp in to the second field/column so that it will be sorted with the other last modified time stamps in the missingfiledirmtimes.txt file.
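If typing a literal tab character with CTRL-V is fiddly, printf(1) will build the same 'cut' line; a small equivalent sketch:

# produce the same 'cut' line with an explicit \t instead of a typed tab character
printf 'cut\t2020-09-01 00:00:00\n'

That line can then be piped into the same cat(1)/sort(1)/sed(1) pipeline in place of the echo(1).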
The sort(1) command then sorts the 'cut' line along with the missingfiledirmtimes.txt file, by the second field/column — the last modified time stamp — and passes the output to sed(1). The sed(1) command runs a script (between the single quotes) consisting of three sed(1) commands, separated by semicolons (;). Together they build a list of regular expressions to extract lines from our original diff(1) output (diffme), corresponding to files deleted from directories with a last modified time stamp after 2020-09-01 00:00:00:
1,/^cut /d
: That command deletes the lines from the first line down to the line starting with 'cut' followed by a tab character — that is, from the first line down to where our 'cut' line got sorted, which will be at the point in the missingfiledirmtimes.txt file corresponding to a last modified time stamp of 2020-09-01 00:00:00.

s#^\.#-/homedir#
: That command undoes what we did earlier (when we replaced the leading '-/homedir' with a '.' to match our timeline). This command replaces the leading '.' with '-/homedir', because we want to find corresponding lines in our original diff(1) output file (diffme).

s# .*#/[^/]\\+$#
: This command replaces everything after the first tab character, that is, everything after the directory name, with the regular expression string '/[^/]\+$' — there are two '\' characters in the sed(1) script because the first '\' character is there to escape the second '\' character, causing sed(1) to output the second '\' character rather than interpreting it as escaping the '+' character after it. The '/[^/]\+$' regular expression matches a '/' character (being the UNIX directory separator), then one or more occurrences ('\+') of a non-'/' character ('[^/]'), followed by the end of the line ('$'). That is, it matches lines that contain file names after our list of directory names, but no further '/' characters — so files in our listed directories, but not in subdirectories thereof.
Instead of adding the 'cut 2020-09-01 00:00:00' line, re-sorting, then using sed(1) to delete all lines up to and including that extra line, we could have used awk(1) and asked it to only print out lines where the last modified time stamp ($2) is after 2020-09-01 00:00:00 ($2 >= "2020-09-01 00:00:00"). However, by the time I’d thought of that, I was already busy trying to construct a cool looking sed(1) script to do it, and you must admit, the sed(1) approach looks cooler (albeit more complicated and harder to understand) than an awk(1) one-liner. Plus I was already using sed(1) for the next part, so I just tacked it on.
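For completeness though, a sketch of that awk(1) alternative (assuming the tab-delimited missingfiledirmtimes.txt layout described above; the string comparison works because the time stamps are in YYYY-MM-DD HH:MM:SS format, which sorts lexically):

# print only the lines whose last modified time stamp (second tab-delimited field)
# is on or after 2020-09-01 00:00:00
awk -F'\t' '$2 >= "2020-09-01 00:00:00"' missingfiledirmtimes.txt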
Right — still with me?!
So the plan now is to extract the public_html/ files that have been deleted since the backup, and restore them to the public_html/ directory on the server.
# extract the files deleted since 2020-09-01 from the older backup file (tar)
cat deletedfilesafter20200901.txt |sed 's/^-/backup-6.18.2020_20-47-22_username/' |tar -zxvf ../backup-6.18.2020_20-47-22_username.tar.gz -T -

# tar up the ./public_html/ files to copy them to the server
tar -cvf public_html-old.tar ./public_html/
scp -p public_html-old.tar username@server:/tmp/
You can then copy the public_html-old.tar file to the server and extract it, but before we extract the files we’ll do a sanity check to make sure that none of the files exist (they shouldn’t, because the list of files was generated by determining which files had gone missing between the backups — hence it’s a sanity check).
# on the server
cd
ls -l `tar -tf /tmp/public_html-old.tar |grep -v "/$"` 2>&1 |grep -v "No such file or directory"

# the lack of any output tells us that all of the files that we are about to
# restore, are missing from the server, which is what we expected, so let's extract
tar -xvf /tmp/public_html-old.tar
We’ll also extract and copy the public_html/index.php file from the old backup to the server, because the existing one looks weird: it contains a PHP @include directive to include the file /home/username/public_html/wp-content/updraft/emptydir/.684f3ddd.ico, using some octal escapes (\nnn) to unnecessarily (they were standard ASCII characters) specify some of the characters in the file name.
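As an aside, if you want to see what octal escapes like that resolve to without going anywhere near the PHP, printf(1) will decode them. The escape values below are made up for illustration rather than copied from the real file:

# decode \nnn octal escapes from the shell -- these example values spell out "home"
printf '\150\157\155\145\n'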
# extract the old index.php file, as the current one seems weird
tar -zxvf ../backup-6.18.2020_20-47-22_username.tar.gz backup-6.18.2020_20-47-22_username/homedir/public_html/index.php

# copy the extracted file to the server and overwrite the existing (weird)
# index.php file
scp -p backup-6.18.2020_20-47-22_username/homedir/public_html/index.php username@server:public_html/
So, the story so far:
- We determined all the files that had been deleted between the backup-6.18.2020_20-47-22_username.tar.gz and backup-10.10.2020_08-35-09_username.tar.gz backups (so between 20:47:22 on the 18th of June 2020, and 08:35:09 on the 10th October 2020), excluding files under the homedir/mail/ directory.
- We processed the list of deleted files to create a list of directories from which files were removed.
- We looked up the directories (from which files were deleted) in our timeline (timeline-s.txt) to determine when the files may have been deleted (strictly speaking, the mtime on a directory tells you when the directory was modified, so while the mtime on a directory is modified when a file is deleted from that directory, it is also modified when a file is created in that directory — we’re making an assumption here that the mtimes indicate when the last file was deleted).
- We excluded any directories with modification times before 2020-09-01 00:00:00 (because the gap in our timeline suggested that something happened just after that time).
- We then extracted a list of deleted files from those remaining directories (that is, from directories with modification times after 2020-09-01 00:00:00)…
- … and extracted those files from the older backup file (backup-6.18.2020_20-47-22_username.tar.gz).
- We tar(1)ed up the ./public_html/ directory and copied the restored files to the server (the copy wasn’t shown).
- We extracted the old index.php file because the existing one looked weird.
Now, when I got to this point, I was expecting the website to be hunky-dory again, but it wasn’t. However, because this blog post is getting long, vi(1) is telling me that I’m only 8% through my notes, and I want to play around with some honey and comb that I stole off my bees last weekend, not to mention cook dinner and relax in front of the TV while sipping my first attempt at making mead (which just highlighted why I’m an IT engineer and not a brewer) before the weekend is over, I’m going to break the whole story up over what I’m guessing will be three blog posts, and finish this one here.
In the next post, I’ll explain what I had to do to get the website up and running again, including some tests that I did to help assure myself that there weren’t any other malicious artefacts lying around.
In the post after that, I’ll explain how I determined that the website was attacked, how I think the attackers gained entry, and the associated artefacts.