Friday, September 13, 2013

Recovering Data from Deleted SQLite Records: Redux

Rising from the Ashes

I’ve received many, many inquiries about recovering deleted records from SQLite databases ever since I posted an article about my first attempt to recover deleted data. Well, the hypothesis of checking the difference between the original database and a vacuumed copy seemed sound at the time and did in fact yield dropped record data, but it also included data from allocated records. The main thing I learned was that I had much to learn about the SQLite database file format.

Since that time, I’ve run into more and more SQLite databases, and the issue of recovering dropped records has become paramount. I have learned how to do so, and I’ll share some of the secrets with you now. But first, you need to know a little about SQLite databases…

This article is not a treatise on the SQLite file format. The best resource for that is located at SQLite.org. I hope to put the salient points here so you can understand the complexity of the task of recovering dropped records from SQLite databases.

SQLite Main Database Header

The first 100 bytes of a SQLite database define and describe the database. The key value to record recovery is the page size, a 16-bit (two-bytes) big-endian integer at byte offset 16. SQLite databases are divided into pages, usually matching the underlying file system block size. Each page has a single use, and those containing the records that are the interest of forensic examiners is the table b-tree leaf page, which I’ll refer to as the TLP. The TLP is distinguished from other page types by its first byte, \x0d or integer 13.

Thus, we can find the TLPs with the knowledge of the database page size we obtain from the database header and check the first byte of each page for \x0d. In python, that might look like:

Python 3: Finding table b-tree leaf pages
from struct import unpack

with open('some.db', 'rb') as f:
    data = f.read()

pageSize = unpack('>h', data[16:18])[0]

pageList = []

for offset in range(0, len(data), pageSize):
    if data[offset] == 13;
        pageList.append(offset)
Note
The code above prints the offset of TLPs. Make sure you are using Python 3 if you want to try this for yourself.

Table B-Tree Leaf Pages

The TLPs hold the records, and consequently, the dropped (deleted) records data when they occur. Each page has an 8-byte header, broken down as follows:

Table 1. Table b-tree leaf page header
Offset Size Value

0

1

Page byte \x0d (int 13)

1

2

Byte offset to first freeblock

3

2

Number of cells

5

2

Offset to first cell

7

1

Number of freebytes

The header introduces some terms that need explaining. A freeblock is unallocated space in the page below one or more allocated records. It is created by the dropping of a record from the table. It has a four-byte header: the first two bytes are a 16-bit big-endian integer pointing to the next freeblock (zero means its the last freeblock), and the second two bytes are a 16-bit big-endian integer representing the size of the freeblock, including the header.

Cells are the structures that hold the records. The are made up of a payload length, key, and payload. The length and key, also known as the rowid, are variable length integers. What are those? I’m glad you asked:

Variable-length Integers

A variable-length integer or "varint" is a static Huffman encoding of 64-bit twos-complement integers that uses less space for small positive values. A varint is between 1 and 9 bytes in length. The varint consists of either zero or more byte which have the high-order bit set followed by a single byte with the high-order bit clear, or nine bytes, whichever is shorter. The lower seven bits of each of the first eight bytes and all 8 bits of the ninth byte are used to reconstruct the 64-bit twos-complement integer. Varints are big-endian: bits taken from the earlier byte of the varint are the more significant and bits taken from the later bytes.

http://www.sqlite.org/fileformat2.html
— SQLite.org

I won’t go into varints any further in this post, because I will not be discussing cell decoding in this post. Suffice it to say that with the payload length, we can define the payload, which itself is made up of a header and columns. The header is a list of varints, the first describing the header length, and the remainder decribing the column data and types. The page header contains number of cells and the offset to the first cell on the page.

The last value in the header, freebytes, describes the number of fragmented bytes on the page. Fragmented bytes are byte groupings of three or less that cannot be reallocated to a new cell (which takes a minimum of four bytes).

Immediately following the page header is a cell pointer array. It is made up of 16-bit big endian integers equal in length to the number of cells on the page. Thus, if there are 10 cells on the page, the array is 20 bytes long (10 2-btye groupings).

Page Unallocated Space

There are three types of unallocated space in a TLP. Freeblocks and freebytes we’ve discussed, and the third is the space between the end of the cell array and the first cell on the page referred to in the SQLite documentation as simply "unallocated". Freeblocks and unallocated can contain recoverable record data, while freebytes are too small for interpretation. Thus, knowing the first freeblock (defined in the page header), the length of the cell array (interpreted from the number of cells defined in the page header) and the offset to the first cell (yep, you guessed it, defined in the page header), we can recover all the unallocated space in the page for analysis.

Python 3: Finding table b-tree leaf page unallocated space
for offset in pageList:
    page = data[offset: offset + pageSize]

    pageHeader = unpack('>bhhhb', page[:8])
    pageByte, fbOffset, cellQty, cellOffset, freebytes = pageHeader

    # get unallocated
    start = 8 + cellQty * 2
    end = cellOffset-start
    unalloc = page[start:end]
    print(offset, unalloc, sep=',')

    # get freeblocks, if any
    if fbOffset > 0:
        while fbOffset != 0:
            start, size = unpack('>hh', page[fbOffset: fbOffset + 4])
            freeblock = page[fbOffset: fbOffset + size]
            print(offset, freeblock, sep = ',')
            fbOffset = start

With the lines from the two code boxes, we have coaxed the unallocated data from the "some.db" SQLite database. We have printed the offset of each unallocated block and the contents (in python bytes format) to stdout. With just a little manipulation, we can turn this into a script a reuseable program, and the content can be grepped for strings. At bare minimum, we now have a way to determine if there is deleted content in the database related to our investigation, e.g., we could grep the output of the Android mmssms.db for a phone number to see if there are deleted records. Searching against the whole database would not be valuable because we cannot separate the allocated from the unallocated content!

Now, this obviously does not reconstruct the records for us, but recovering the unallocated data is a good start. In future posts I will describe how to reconstruct allocated records with an eye towards reconstructing unallocated records.