Recovering data from a broken NTFS hard drive

Today I want to tell you about my recent adventure in data recovery. A friend of mine had a broken USB disk that was no longer readable. The single 230 GB partition was formatted with NTFS, and neither Windows nor Ubuntu (with the NTFS-3g driver, I assume) was willing to read it. The disk contained photos, audio files and videos; the mission was to restore at least the photos.

Make a disk image

The first step in a data recovery project should be to make an image of the drive or partition. When a drive starts to lose data because of physical errors on the disk, these errors tend to spread: things get worse, and at some point you will not get any data out of it anymore. Every tool will also run into the same problems when trying to read defective sectors, and these sectors cannot be repaired.

By the way, if your data is really valuable you shouldn’t even try to recover it yourself, that is, if the drive shows signs of some physical damage. You should disconnect it as soon as possible and hand it to a professional data recovery company – if it is worth a few hundred or thousand Euros.

The data I was working with, however, wasn’t business critical. So, after consulting $SEARCH_ENGINE, I made an image of the drive with dd_rescue, which works similarly to a normal “dd” but handles I/O errors more gracefully. But here things started to get confusing: there are two programs for this purpose with almost identical names:

Kurt Garloff’s original dd_rescue tool uses the executable named “dd_rescue”, the Debian/Ubuntu package is named “ddrescue”.
Antonio Diaz Diaz’s new and improved GNU ddrescue provides an executable named “ddrescue”, the Debian/Ubuntu package carries the name “gddrescue”.

The latter is the one to choose: you get much better progress information – copying hundreds of gigabytes takes quite some time, so you want to know what’s going on – and the capability to interrupt the process and continue where you left off with the help of a log file. After finding out about that the hard way I got my image with

sudo ddrescue -r3 /dev/sdb1 hdimage logfile

where “-r3” means: “in case of an error, retry 3 times” and /dev/sdb1 is the name of the partition of the USB disk, obviously.

Unfortunately, the resulting image still couldn’t be mounted. “ddrescue” reported only a few bad sectors on the disk, but that was obviously enough to make file access impossible. Another idea that I wasn’t able to pursue: it might have been possible to repair the NTFS filesystem with a virtualized Windows instance running in VirtualBox, but VirtualBox only takes complete disks as images, not single partitions. If I had made an image of the complete disk instead, including the partition table, this might have worked out. I didn’t feel like copying the 230 GB over into a new image with a partition table, and I also didn’t have enough free disk space to do it.
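
For reference, imaging the complete disk (partition table included) could look roughly like the sketch below. It follows the two-pass pattern from the GNU ddrescue manual: a fast first pass that skips the damaged areas, then a second pass that retries them. The wrapper name rescue_disk and the file names are placeholders I made up, and /dev/sdb is only an assumption about where the disk shows up.

```sh
#!/bin/sh
# Two-pass rescue of a whole disk with GNU ddrescue.
# rescue_disk is a hypothetical wrapper; device and file names are placeholders.
rescue_disk() {
    src=$1 img=$2 map=$3
    # Pass 1: copy everything that reads cleanly and skip the slow
    # scraping of damaged areas (-n).
    ddrescue -n "$src" "$img" "$map" || return 1
    # Pass 2: go back to the bad areas with direct disc access (-d),
    # retrying each of them up to three times (-r3).
    ddrescue -d -r3 "$src" "$img" "$map"
}

# Usage (as root): rescue_disk /dev/sdb hdimage-full logfile
```

Because both passes share the same log file, an interrupted run can be restarted with the same arguments and continues where it left off.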

Recovery tools: file carvers

The next step was trying to recover as much of the data as possible. I had successfully used a “file carver” before to recover images from my digital camera’s memory card after the FAT filesystem became corrupted. A file carver is a program that scans a raw binary stream for the headers of known file types, like that of JPEG images or MP3 audio files, and tries to extract the contents, completely ignoring the file system. The advantage is that it doesn’t matter how broken your filesystem is – the program doesn’t have to know anything about the filesystem’s structure. It can also recover deleted files. The disadvantage is that you lose all information that is stored in the filesystem, the file name and directory structure. It’s also prone to errors for fragmented file systems, which also means that you’re less likely to succeed when recovering large files.
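
To make the idea a bit more concrete, here is a minimal sketch of the very first thing a carver does: finding candidate start offsets of JPEG files by searching the raw image for the byte sequence FF D8 FF. It assumes GNU grep with PCRE support (-P); the function name jpeg_offsets is made up for this illustration, and real carvers do a lot more than this (they also have to find where each file ends).

```sh
#!/bin/sh
# List the byte offset of every JPEG start marker (FF D8 FF) in a raw image.
# Assumes GNU grep with -P support; jpeg_offsets is a hypothetical helper name.
jpeg_offsets() {
    # -a: treat binary data as text   -o: print only the match itself
    # -b: prefix each match with its byte offset   -U: binary mode
    LC_ALL=C grep -obUaP '\xff\xd8\xff' "$1" | LC_ALL=C cut -d: -f1
}

# Usage: jpeg_offsets hdimage | head
```

Every offset printed is only a candidate: the same three bytes can also occur by chance inside other data.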

I tried two tools from this category, “foremost” and “photorec”. “foremost” is a simple command line tool. You call it like this:

foremost -i hdimage -o recovered -v

and it will sort the files it can find by file type into sub folders of “recovered”.

Photorec has a curses interface. It also takes hints about the structure of the image, like presence of a partition table or filesystem type. It is part of the “testdisk” package. The command line invocation shows that this tool was ported from DOS:

photorec /log /debug /d output-directory hdimage

Recovery tools: Sleuth Kit

Researching further, I stumbled upon the Sleuth Kit and Autopsy. These are forensic analysis tools and therefore are designed to recover data that someone deliberately tried to hide or destroy. The Sleuth Kit is a suite of command line tools which Autopsy is a web frontend for. Autopsy comes with its own web server. I started it with these commands:

mkdir my-autopsy-dir/
autopsy -d my-autopsy-dir/
firefox http://localhost:9999/autopsy

Getting around the web interface can be a bit confusing: you have to create a “case” first, then add a host to investigate and finally a hd image to look at. Anyway, the time it took me to get used to autopsy wasn’t wasted because I now was able to see the complete contents of the original NTFS filesystem! I was able to look at the data, browse the filesystem, download single files and compute MD5 sums. However, autopsy offers no feature for copying whole directory trees. This is because it is intended for forensic analysis rather than data recovery. So you, the computer forensics expert, are supposed to look at every single file and make notes about it which in turn are then recorded in the “case”.

I wasn’t really interested in a forensic analysis of the contents of my friend’s drive so I took a closer look at the command line tools. The relevant commands from the Sleuth Kit are “fls” for listing files in an image and “icat” for getting at the contents. You use “fls” like this:

fls -urp hdimage

where -u means that I’m not interested in deleted files, -r that I want a recursive listing and -p that I need to have the full path for every file. The output looks something like this:

d/d 180-144-8:  some-dir
d/d 5192-144-1: some-dir/some sub dir
r/r 5190-128-3: some-dir/some sub dir/some_file.exe
r/r 5188-128-3: some-dir/some sub dir/another_file.jpg

The funny numbers in the second column are the “inode” of the file, which you need to feed into “icat” to get the contents. So how do you recover a whole directory tree with these tools? What I should have done is use a script like this one:

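#!/bin/sh
# Rebuild the directory tree of an image below the current directory,
# using fls for the listing and icat for the file contents.
IMAGE=hdimage
fls -urp "$IMAGE" |
while read -r type inode name; do
    inode=${inode%:}    # strip the trailing colon from the inode column
    case $type in
        d/d) mkdir -p "$name" ;;
        r/r) icat "$IMAGE" "$inode" > "$name" ;;
    esac
done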

But I was lazy and so I saved the file listing in a text file which I turned into a big shell script using Emacs’ rectangle functions, regular expressions and keyboard macros. This wasn’t working so well: there were some funny characters in the file names I forgot to escape, like single quotes and backticks. So, as always, it turned out to be more work doing it “the easy way”. However, in the end I was able to completely recover the data from the partition.
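
If you do want to generate a script from the listing, the file names have to be quoted properly. The following is a minimal sketch of how that could be done: every name is wrapped in single quotes, and embedded single quotes are escaped, so backticks, dollar signs and spaces lose their special meaning. The helper names shell_quote and fls_to_script are made up for this example.

```sh
#!/bin/sh
# Turn an fls listing (on stdin) into a shell script that is safe to run
# even when file names contain quotes, backticks or spaces.
# shell_quote and fls_to_script are hypothetical helper names.

# Wrap a string in single quotes, escaping embedded single quotes.
shell_quote() {
    printf "'%s'\n" "$(printf '%s' "$1" | sed "s/'/'\\\\''/g")"
}

# Read "type inode name" lines and print one mkdir or icat command per line.
fls_to_script() {
    image=$(shell_quote "$1")
    while read -r type inode name; do
        case $type in
            d/d) printf 'mkdir -p %s\n' "$(shell_quote "$name")" ;;
            r/r) printf 'icat %s %s > %s\n' \
                     "$image" "${inode%:}" "$(shell_quote "$name")" ;;
        esac
    done
}

# Usage: fls -urp hdimage | fls_to_script hdimage > restore.sh
```

The generated restore.sh can be inspected before it is run, which is the one real advantage of this detour over calling icat directly in the loop.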

Analyzing the data

Since I now had all the data back, and had already tried the other recovery methods before, this can serve as a nice real-world benchmark of the usefulness of the file carving tools I used.

Just counting how many files these tools think they’ve found doesn’t help us much, we also need to know if the recovered files were really complete and undamaged. I did a quick check with the files the Sleuth Kit recovered, and all files I checked seemed to be ok: the photos were fine and the videos and mp3s played without any hiccups. So, let’s assume that the data I got from the Sleuth Kit is really genuine. To find out about the identity of the recovered files, I computed the MD5 hash for all of them with this little script:

mkdir -p md5sums
for tool in foremost photorec sleuthkit; do
    find "$tool" -type f -print0 | xargs -0 md5sum | tee "md5sums/${tool}.txt"
done

And here’s a script I hacked together to do some analysis on these files:

#!/bin/bash
 
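# Map tool names to the checksum lists written by the md5sum loop
# (assumed here to be md5sums/<tool>.txt).
md5_lists() {
    local tool
    for tool in "$@"; do
        echo "md5sums/${tool}.txt"
    done
}
md5s_by_ext() {
    local ext=$1
    shift
    grep -hi "\.${ext}\$" $(md5_lists "$@") | awk '{ print $1 }'
}
unique_md5s_by_ext() {
    md5s_by_ext "$@" | sort | uniq
}
unique_md5s() {
    cat $(md5_lists "$@") | awk '{ print $1 }' | sort | uniq
}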
clean_wc() {
    wc -l | sed 's/ //g' 
}
 
common_files() {
    local ext="$1"
    echo -ne "${ext}\t"
    echo -ne $(unique_md5s_by_ext $ext sleuthkit | clean_wc) "\t"
    for tools in photorec foremost "photorec foremost"; do
        echo -ne $(unique_md5s_by_ext $ext $tools | clean_wc) "\t" \
            $(comm -12 <(unique_md5s             $tools) <(unique_md5s_by_ext $ext sleuthkit) | clean_wc)"\t"\
            $(comm -12 <(unique_md5s_by_ext $ext $tools) <(unique_md5s_by_ext $ext sleuthkit) | clean_wc)"\t"
    done
    echo
}
 
common_files_total() {
    echo -e "total\t"\
         $(unique_md5s sleuthkit         | clean_wc) "\t"\
         $(unique_md5s photorec          | clean_wc) "\t"\
         $(comm -12 <(unique_md5s photorec) <(unique_md5s sleuthkit) | clean_wc) "\t\t"\
         $(unique_md5s foremost          | clean_wc) "\t"\
         $(comm -12 <(unique_md5s foremost) <(unique_md5s sleuthkit) | clean_wc) "\t\t"\
         $(unique_md5s photorec foremost | clean_wc) "\t"\
         $(comm -12 <(unique_md5s foremost photorec) <(unique_md5s sleuthkit) | clean_wc)
}    
 
echo -e "\tsleuthkit\tphotorec\t\t\tforemost\t\t\tphotorec+foremost"
 
common_files_total
for i in jpg gif mp3 avi mpg zip rar exe cab dll txt htm rtf pdf doc xls; do
    common_files $i
done

And here are the results as a really ugly table:

       sleuthkit  photorec                     foremost                     photorec+foremost
                  found  matching  matching+ext  found  matching  matching+ext  found  matching  matching+ext
total       4391   6600      3669                 1210       771                 6960      3718
jpg          831    768       711           711    853       747           747    901       755           755
gif            1      1         0             0     46         1             1     47         1             1
mp3         3218   4697      2851          2851      0         0             0   4697      2851          2851
avi          128      5         0             0      5         0             0     10         0             0
mpg            1    207         0             0      1         0             0    208         0             0
zip            5      3         3             3     13         0             0     16         3             3
rar           25     29        24            24     30         8             8     50        24            24
exe           37     60         4             4     78         6             6     83         6             6
cab            0      3         0             0      0         0             0      3         0             0
dll           10     69         6             6     71         7             7     80         8             8
txt           12    699         4             4      1         0             0    700         4             4
htm            6      0         2             0      3         1             1      3         2             1
rtf            1      2         0             0      0         0             0      2         0             0
pdf            0      1         0             0      1         0             0      1         0             0
doc            7     15         6             5     16         0             0     31         6             5
xls            0      2         0             0      0         0             0      2         0             0

This table needs a bit of an explanation:

“found” means the number of files the tool extracted from the image
“matching” means the number of files the tool found that are identical to files recovered with the Sleuth Kit
“matching+ext” means that the tool also got the file extension right
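
The “matching” columns boil down to intersecting two sorted lists of checksums. Stripped of everything else, the idea can be sketched like this; count_common is a made-up helper name, and it uses temporary files instead of bash process substitution so that it also runs in a plain sh:

```sh
#!/bin/sh
# Count the distinct MD5 sums that occur in both of two md5sum listings.
# count_common is a hypothetical helper name.
count_common() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # Keep only the checksum column and drop duplicates; comm needs sorted input.
    awk '{ print $1 }' "$1" | sort -u > "$a"
    awk '{ print $1 }' "$2" | sort -u > "$b"
    # comm -12 prints only the lines present in both files.
    comm -12 "$a" "$b" | wc -l | tr -d ' '
    rm -f "$a" "$b"
}

# Usage: count_common md5sums/foremost.txt md5sums/sleuthkit.txt
```

Duplicates are removed first, so a file that a tool extracted twice is only counted once.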

Foremost recovered almost 90% of the images, with Photorec following close behind at about 85%; Photorec found only 8 photos that foremost couldn’t identify. Looking at other data types, Photorec is clearly superior: it found 24 of the 25 RAR files present in the image, while foremost only got 8 of them right. And only Photorec was able to recover any mp3s: it found 89% of them, but it also produced quite a few false positives here (4697 files found, of which only 2851 match). Neither of the tools was able to recover any movies, possibly because they were fragmented on disk.
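
The percentages are simply the “matching” count divided by the number of files the Sleuth Kit recovered for that type. A small helper to recompute them from the table (recovery_rate is a made-up name):

```sh
#!/bin/sh
# Percentage of the reference files that a tool recovered bit for bit.
# recovery_rate is a hypothetical helper; the arguments are the "matching"
# count and the sleuthkit count taken from the table.
recovery_rate() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f%%\n", 100 * m / t }'
}

recovery_rate 747 831     # foremost, jpg
recovery_rate 711 831     # photorec, jpg
recovery_rate 2851 3218   # photorec, mp3
```

This prints 89.9%, 85.6% and 88.6%.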

Conclusion

So here comes the take home message:

Use GNU ddrescue to make a hard drive image first.
If your filesystem is not mountable, try the sleuth kit – you might get all your data back including the file names and directory structure.
If the sleuth kit fails and you’re trying to recover some photos, “foremost” might help you.
If the sleuth kit fails and you’re looking for something other than images, give “photorec” a shot, although in this case it’s less likely that you’ll get your data back.
