Useful Kindle Stats

I use a script to export highlights and notes from my Kindle’s My Clippings.txt file to an sqlite3 database.

Having these clippings in a database allows me to get some interesting and potentially useful statistics.

Highlight (Information) Density

I created this measurement to show how densely packed the highlights I’ve made are, and therefore how densely packed the information (good enough to highlight) in the book is. If there are a lot of highlights in a few pages, that’s dense. Books with few highlights but many pages are fluffy.

Of course, Kindle books don’t always have page numbers, so I go by locations instead.
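
As a rough illustration of the arithmetic, the density is the number of highlights divided by the span of locations they cover. The function name and the sample locations below are made up for this sketch:

```python
def highlight_density(locations):
    """Highlights per location, over the span from first to last highlight."""
    if not locations:
        return None
    span = max(locations) - min(locations)
    if span <= 0:
        # every highlight sits at the same location: density is undefined
        return None
    return len(locations) / span


# five highlights spread over 200 locations
print(highlight_density([100, 120, 180, 250, 300]))  # 0.025
```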

I sometimes find it useful to rank books from densest to fluffiest. Given the choice, I’d rather spend time reading or reviewing a dense book than one full of fluff.

As part of the database, I have a table for information about the books themselves. Here’s a minimal version of that table:

CREATE TABLE IF NOT EXISTS "books" (
    `id`	INTEGER,
    `book`	TEXT,
    `Title`	TEXT,
    `SubTitle`	TEXT,
    `Author`	TEXT,
    `density`	REAL,
    `ptscore`	REAL,
    PRIMARY KEY(`id`)
);

To calculate the highlight density of a book, I use the following simple Python function:

def calc_hl_density():
    # cur and con are the module-level sqlite3 cursor and connection
    cur.execute("select id from books")
    for (bid,) in cur.fetchall():
        cur.execute("select min(location), max(location), count(clipID) from clips where bid = ?", (bid,))
        minl, maxl, count = cur.fetchone()
        # skip books with too few clips to say anything useful
        if count > 2:
            span = float(maxl) - float(minl)
            # avoid dividing by zero when every clip is at the same location
            if span > 0:
                density = float(count) / span
                cur.execute("update books set density = ? where id = ?", (density, bid))
    con.commit()
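
Once the density column is filled in, ranking books from densest to fluffiest is a single query. Here is a minimal sketch against an in-memory stand-in for the real database. The rows and the rank_by_density helper are invented for illustration, and the table is cut down to the two columns the query needs:

```python
import sqlite3

# in-memory stand-in for the real database, with made-up rows
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("create table books (id integer primary key, Title text, density real)")
cur.executemany(
    "insert into books (Title, density) values (?, ?)",
    [("Fluffy Book", 0.002), ("Dense Book", 0.031), ("Unscored Book", None)],
)


def rank_by_density(cur):
    """Return (Title, density) rows, densest first, skipping unscored books."""
    cur.execute(
        "select Title, density from books "
        "where density is not null order by density desc"
    )
    return cur.fetchall()


for title, density in rank_by_density(cur):
    print(f"{density:.3f}  {title}")
```

Books that never got a density (too few clips) are left out rather than sorted to one end of the list.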

Page-Turners

Since the Kindle time-stamps all highlights, it’s also possible to measure the number of locations read over time. A higher rate should mean that I found the book hard to put down, so I kept reading it. This is a good thing for a book, but it’s not necessarily going to be caught by the highlight density. It’s possible for a book to be both dense and boring.

The worst is probably a combination of being boring and fluffy. Sometimes I find myself slogging through a boring book because I think there may be some gems hidden in the depths somewhere. This probably isn’t the best use of my time. I figure it represents a high opportunity cost.

NB: Sometimes my Kindle loses track of time and resets back to thinking it’s 1970, so I have to filter out all those dates. Filtering out any date before 1980 is overkill, but it works nicely. I’d have to not notice the problem for ten years for it to fail!
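
The filter can be a plain string comparison, assuming the timestamps are stored as YYYY-MM-DD HH:MM:SS text: that format sorts in the same order as the dates themselves. A small sketch with made-up rows, where the clips table is reduced to the columns the queries in this post rely on:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# assumed, cut-down version of the clips table
cur.execute(
    "create table clips (clipID integer primary key, bid integer, "
    "location integer, datestring text)"
)
cur.executemany(
    "insert into clips (bid, location, datestring) values (?, ?, ?)",
    [
        (1, 150, "1970-01-01 00:03:12"),  # clock had been reset
        (1, 200, "2014-06-02 21:15:00"),
        (1, 900, "2014-06-09 08:40:30"),
    ],
)

# text comparison is enough to drop the 1970 row
cur.execute(
    "select min(datestring), max(datestring) from clips "
    "where bid = ? and datestring > '1980-01-01'",
    (1,),
)
first_date, last_date = cur.fetchone()
print(first_date, last_date)
```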

from datetime import datetime

def calc_pageturn_score():
    # cur and con are the module-level sqlite3 cursor and connection
    cur.execute("select id from books")
    for (bid,) in cur.fetchall():
        # find the first and last highlight/note, ignoring clock-reset dates
        cur.execute("select min(datestring), max(datestring) from clips where bid = ? and datestring > '1980-01-01'", (bid,))
        firstDate, lastDate = cur.fetchone()
        # bail out if there are no usable clips or only one day of reading
        if firstDate is None or firstDate[:10] == lastDate[:10]:
            continue
        # find start location and end location
        cur.execute("select location from clips where bid = ? and datestring = ?", (bid, firstDate))
        firstLoc = cur.fetchone()[0]
        cur.execute("select location from clips where bid = ? and datestring = ?", (bid, lastDate))
        lastLoc = cur.fetchone()[0]
        # calculate locations read
        locsRead = lastLoc - firstLoc
        # calculate time between first and last clips (format is YYYY-MM-DD HH:MM:SS)
        fDate = datetime.strptime(firstDate, "%Y-%m-%d %H:%M:%S")
        lDate = datetime.strptime(lastDate, "%Y-%m-%d %H:%M:%S")
        dDiff = lDate - fDate
        dDays = dDiff.total_seconds() / 86400.0
        # calc page-turn score (locations read per day)
        ptScore = float(locsRead) / dDays
        # update database
        cur.execute("update books set ptscore = ? where id = ?", (ptScore, bid))
    # save the new scores
    con.commit()
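
To make the score concrete, here is the same arithmetic as a standalone function with made-up inputs. The function name and the sample values are only for illustration:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def page_turn_score(first_loc, last_loc, first_date, last_date):
    """Locations read per day between the first and last clip."""
    elapsed = datetime.strptime(last_date, FMT) - datetime.strptime(first_date, FMT)
    days = elapsed.total_seconds() / 86400.0
    if days <= 0:
        # no time elapsed between the clips: the rate is undefined
        return None
    return (last_loc - first_loc) / days


# 700 locations in exactly 7 days
print(page_turn_score(200, 900, "2014-06-02 21:15:00", "2014-06-09 21:15:00"))  # 100.0
```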

Now you have some kind of a score to judge how much of a page-turner a book is. Of course, there are plenty of other factors that contribute to this, so this isn’t a perfect measurement by any means. For example, if you’re on vacation, then any book you’re currently reading might get a boost in its page-turn score just because you have more time for reading. Also, fiction books where you may not make many highlights will be horribly skewed. But it’s a start.