The 4 Memory Tests And Their Implications For Flashcard Applications

In a nutshell
Rationale
- Desirable difficulty
The four memory tests
New material
- Managing new-material workload
‘Learned’ material
- Maintaining a lower recall ability
- Python wordfreq and the zipf score
Upon failure
- Got it wrong (Interference)
  - The confusion card
- No Clue
Ways to test memory in detail

In a nutshell

New items and low-priority items
- Learn initially with Aided Recall or Recognition Tests
- Graduate higher-priority items to Recall Testing
Failure on a recall test
- Remembered incorrectly (interference)
  - score as a failure
  - record incorrect answer
  - possibly create new item(s) to address source of interference
- No clue
  - switch to an Aided Recall or Recognition test
    - Failure
      - score as failure
    - Success
      - score as failure
      - present the situation to the user as a ‘win’

Rationale

New items are hard because of their newness. Beginning with hint-based or recognition-based activities lets you build familiarity with an item more easily before escalating to a harder recall task. This is like starting out with light weights when you start a new exercise, or starting with training wheels when you’re learning to ride a bike; it’s less painful, and more motivating.

More items mean a higher workload, so it’s wise to realize that straight recall may not be needed for all items. Set priorities accordingly.

If you have no clue what the answer is when reviewing a card, you might still be able to recall the answer if you switch to an aided recall or recognition test. I hypothesize that doing so may increase memory strength more than the typical response of showing yourself the answer. I base this on the idea that these tests involve more desirable difficulty than just showing yourself the answer. As desirable difficulty correlates with a greater increase in memory strength after review, this tactic may correlate with a greater memory strength.

Desirable difficulty

Desirable Difficulty is a concept found in the Bjorks’ (of UCLA) theory of memory, which they call the New Theory of Disuse. Basically, the harder you have to work to recall a memory, the more that act of retrieval strengthens that memory.

The four memory tests

It’s helpful to know that there are only four ways that memory researchers can measure memory. (see Your Memory by Kenneth Higbee)

(Strict) Recall
Aided Recall (recall, but with a hint)
Recognition (multiple choice)
Relearning (and measuring how much less time it takes now)

Recall is the hardest task and the least sensitive measurement. Aided Recall and Recognition are of intermediate difficulty and sensitivity. Testing how much time has been saved in relearning material, however, is extremely sensitive. This is what Ebbinghaus studied, and it’s capable of measuring the effects of memories so weak that test subjects themselves are not aware of them (even after receiving hints and clues).

Knowing these four testing methods and their characteristics leads, I think, to some possible conclusions about spaced repetition techniques.

New material

Perhaps start with aided recall or recognition testing.

For new material that is difficult to learn, you might consider initially using aided recall or recognition tests in your study sessions. Giving yourself an easier task should increase your ‘win’ rate and that may have an important psychological impact on your study motivation.

Starting with these easier tasks should also increase the amount of time between reviews while still maintaining a high enough success rate to stay motivated. This spacing out of reviews should allow you to learn more items at once without increasing your overall review load.

After meeting some pre-set criterion, you could then switch to the more challenging recall testing for an item.

e.g. After correctly passing an aided recall or recognition test for an item 12 times, you might then retire the aided recall or recognition test activity and switch to reviewing the item using strict recall.
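
A minimal sketch of how such a graduation rule might look in Python. The `Item` class, its field names, and `record_easy_success` are hypothetical names made up for illustration; the threshold of 12 is simply the number from the example above, not a researched value.

```python
from dataclasses import dataclass

# Assumption: the threshold from the example above; tune to taste.
GRADUATION_THRESHOLD = 12


@dataclass
class Item:
    prompt: str
    answer: str
    test_type: str = "aided_recall"  # or "recognition"; "recall" once graduated
    easy_successes: int = 0          # passed aided-recall/recognition reviews


def record_easy_success(item: Item) -> None:
    """Count a passed aided-recall/recognition review and graduate if due."""
    if item.test_type == "recall":
        return  # already graduated; nothing left to count
    item.easy_successes += 1
    if item.easy_successes >= GRADUATION_THRESHOLD:
        item.test_type = "recall"
```

This counts successes cumulatively. Requiring 12 successes in a row would be a stricter variant.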

Managing new-material workload

I haven’t done any simulations of this, but it seems to me that by initially using an easier task to learn new material, you can space out reviews more aggressively while still maintaining a high ‘win’ rate (which I believe to be important for motivation).

Maintaining new material at a lower, easier level (e.g. at an “I can remember it when I have a hint” level rather than an “I can remember it without hints” level) should allow for longer spacing intervals for newly learned material (while still maintaining the desired ‘win’/‘loss’ ratio). This, in turn, should lower the overall workload.

This initial learning period could be phased out after some criterion is met (e.g. 12 successful recalls) and replaced with straight recall work.

‘Learned’ material

Select an activity that reflects the memory level at which you want to maintain a memory.

For material where you are concerned with efficiency, use some sort of spaced repetition algorithm to decide which items to review and/or when to review them. I believe that for motivational reasons you should target a success rate of 80-90%.

For important information, such as high-frequency words, an 80-90% chance of recall is probably a reasonable goal.

In fact, for the highest-frequency words (e.g. the, he, it, etc.), you may want to maintain them at an even higher level, aiming for what you might call fluency (although their extreme frequency means that you’ll probably get more than enough exposure to them without worrying about it).

Languages contain a very large number of infrequently used words. It’s reasonable to emphasize efficiency for such vocabulary, but maintaining it at a high recall level may not be necessary.

With highly infrequent vocabulary, you may well decide that you are unlikely to come across such words out of context, and therefore, a recognition test might be appropriate. This way, you know such words well enough to (hopefully) recognize them when you see them in context, but you don’t spend time maintaining them at a higher degree than necessary.

Maintaining a lower recall ability

An obvious way to maintain material at a lower level might be to try to space an item’s review such that you have maybe a 50% chance of recalling it. The problem with this is that you’re going to have a 50% failure rate, and you’ll probably find these reviews to be very difficult.

Difficult reviews with a low chance of success are hard to stick with.

However, if we test ourselves using aided recall or recognition, then we can achieve a more motivating 80-90% success rate, while maintaining a much lower level of memory strength compared to straight recall.
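
To get a rough feel for how much spacing is at stake, here is a small sketch. It assumes a simple exponential forgetting curve, R(t) = exp(-t / S), where S is a hypothetical "stability" in days. Both the curve and the function name `interval_for_target` are assumptions chosen for illustration, not something established above.

```python
import math


def interval_for_target(stability_days: float, target_recall: float) -> float:
    """Days to wait until predicted recall has dropped to target_recall.

    Assumes R(t) = exp(-t / stability_days), an illustrative model only.
    """
    if not 0.0 < target_recall < 1.0:
        raise ValueError("target_recall must be strictly between 0 and 1")
    return -stability_days * math.log(target_recall)


# How much longer is the interval at a 50% target than at a 90% target?
ratio = interval_for_target(10, 0.5) / interval_for_target(10, 0.9)
```

Under this assumed model, the ratio works out to roughly 6.6 regardless of the stability value. That is the kind of workload saving on the table; how much of it an easier test can capture while keeping the win rate at 80-90% is exactly the open question.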

Python wordfreq and the zipf score

Wordfreq is a handy library for Python that can estimate word frequencies for words in several languages. It also estimates something called the Zipf score. (The Zipf scale was proposed by Marc Brysbaert, who created the SUBTLEX lists.)

Reasonable Zipf values range between 0 and 8, with higher values corresponding to more frequent words. It’s a logarithmic scale (the base-10 logarithm of how often a word appears per billion words), which makes it more human-friendly than the raw frequency.

Let’s say you wanted to quickly categorize words into high, medium, and low frequencies. By splitting the range 0-8 into thirds, you could decide that:

5.33-8.00 = high frequency words
2.67-5.33 = medium frequency words
0.00-2.67 = low frequency words

Now you might decide to use recall testing for high frequency words, aided recall (using only the first letter as a hint) for the medium frequency words, and recognition testing (multiple-choice) for the low frequency words.
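
A minimal sketch of that mapping. `zipf_frequency(word, lang)` is wordfreq's function for looking up a Zipf score; `choose_test_type` and `choose_test_for_word` are hypothetical helper names, and the cut-offs are just the thirds-of-the-range split from above (a score exactly on a boundary goes to the higher band here).

```python
def choose_test_type(zipf: float) -> str:
    """Map a Zipf score to a test type using the thirds-of-0-to-8 split."""
    if zipf >= 16 / 3:    # about 5.33 and up: high frequency
        return "recall"
    if zipf >= 8 / 3:     # about 2.67 and up: medium frequency
        return "aided_recall"
    return "recognition"  # low frequency


def choose_test_for_word(word: str, lang: str = "en") -> str:
    """Look up the word's Zipf score with wordfreq, then pick a test type."""
    # Third-party dependency (pip install wordfreq), imported lazily so the
    # threshold logic above can be used without it.
    from wordfreq import zipf_frequency
    return choose_test_type(zipf_frequency(word, lang))
```

Words that wordfreq doesn't know get a score of 0 by default, so they would land in the recognition group.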

Upon failure

The different ways of testing memory may also have useful implications for how we respond to a failure to recall an item.

However, first, I think it’s important to realize that there are two ways we can fail a review:

We stare at the prompt and have no clue what the answer is.
We can get the wrong answer.

Got it wrong (Interference)

Log of wrong answers
Consider a confusion card.

If we get the wrong answer, it means there’s some sort of interference going on. Somehow, the wrong information interfered with our recall process and we remembered the wrong thing.

If we just ignore this, view the correct answer, and move along, we may not address this interference; it may continue to linger and cause problems.

A better solution may be to log the wrong answer somehow. Imagine a flashcard application where if you get the wrong answer, you hit w and then type in the incorrect answer. Later, you could review these logs, find common sources of interference and try to address them.
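
One way such a log might be kept and reviewed, as a rough sketch. The in-memory list and the function names `log_wrong_answer` and `common_confusions` are hypothetical; a real application would presumably store this alongside its review history.

```python
from collections import Counter

# Each entry is (prompt, correct_answer, wrong_answer_typed_by_user).
wrong_answer_log = []


def log_wrong_answer(prompt: str, correct: str, wrong: str) -> None:
    """Store what was actually answered so interference can be reviewed later."""
    # Normalize case and whitespace so repeated mistakes are counted together.
    wrong_answer_log.append((prompt, correct, wrong.strip().lower()))


def common_confusions(min_count: int = 2):
    """Return ((correct, wrong), count) pairs seen at least min_count times,
    most frequent first."""
    counts = Counter((correct, wrong) for _, correct, wrong in wrong_answer_log)
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]
```

Pairs that keep showing up in `common_confusions()` are natural candidates for new items that target the interference directly.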

The confusion card

One type of flashcard that I’ve found very useful is what I like to call a confusion card. Basically, it presents two things that I tend to confuse and asks which is which.

e.g. Which word means “making-up” and which means “including”: 1. comprise 2. compose

(answer) comprise=“including”; compose=“making-up”

(note: saying “comprised of” is an abomination made acceptable by common usage)

It’s helpful to include some sort of function to randomize the order of the words and definitions. That way, you can’t use the order they’re listed in as a short-cut.
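
A minimal sketch of that randomizing step, assuming the card is stored as a mapping from each word to its definition. `make_confusion_card` is a hypothetical name, and the optional `rng` argument is only there so the shuffle can be made repeatable.

```python
import random


def make_confusion_card(pairs, rng=None):
    """Shuffle words and definitions independently for a 'which is which' card.

    pairs maps each easily confused word to its definition.
    Returns (words, definitions, answer_key).
    """
    rng = rng or random.Random()
    words = list(pairs)
    definitions = list(pairs.values())
    # Shuffle the two lists separately so position gives nothing away.
    rng.shuffle(words)
    rng.shuffle(definitions)
    return words, definitions, dict(pairs)


words, definitions, answer_key = make_confusion_card(
    {"comprise": "including", "compose": "making-up"}
)
```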

No Clue

Try Aided Recall or Recognition
- Possibly more effective than being shown the answer
  - desirable difficulty
- Gives you another shot at a psychological ‘win’
  - more motivating?

So if we’re not wrong and simply have no clue, the standard approach is to just show ourselves the answer, try to remember it for next time, and move on.

I suspect that switching to an easier memory test may be more effective for two reasons.

First, although the initial task apparently gave us more than enough desirable difficulty, following up with an easier, but still somewhat difficult task, may have a stronger effect on increasing memory storage. This is highly speculative on my part, as I haven’t come across any research that studied this.

It would easily be possible to do such a study, and I may even try this using myself as the guinea pig (as Ebbinghaus did). You would just need to study two sets of items: one which you’d study in a traditional way, and the other where, after failure, you’d use aided recall or recognition testing. Then, after a delay, test yourself on both sets to see if your memory for one is better than for the other. N.B. As relearning is a more sensitive test, that may be the best metric to use to compare the two sets (just remember to establish a baseline for comparison).

Second, getting a second chance at recalling the item (this time in an easier fashion), may be helpful in maintaining motivation even if there’s no benefit memory-wise.

As far as our spacing algorithm goes, we’d want to treat this review as a fail even if we won the rematch, so to speak. But, the rematch should give the outward appearance of a regular win (with any bells and whistles the program would display for a normal win).
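
The split between what the scheduler records and what the user sees could be sketched like this. `ReviewOutcome` and `resolve_review` are hypothetical names, and the "pass"/"fail" grades stand in for whatever a given spacing algorithm actually expects.

```python
from dataclasses import dataclass


@dataclass
class ReviewOutcome:
    scheduler_grade: str   # what the spacing algorithm sees: "pass" or "fail"
    show_win_screen: bool  # what the user sees


def resolve_review(recalled: bool, rematch_passed: bool = False) -> ReviewOutcome:
    """Combine the strict-recall result with an optional easier rematch."""
    if recalled:
        return ReviewOutcome("pass", True)
    # The recall attempt failed, so the scheduler always records a fail;
    # a passed rematch only changes what the user is shown.
    return ReviewOutcome("fail", rematch_passed)
```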

Ways to test memory in detail

Recall

Hardest
Least sensitive

Recall tasks are just what they sound like. I ask you a question and you have to recall the answer. The entire memory retrieval system has to work correctly for this to happen, so this is a difficult task. This kind of test cannot detect memories that are only partially retrievable.

To get around the insensitivity of recall testing, many flashcard applications ask users to rate how easy or hard an item was. I’m not aware of any evidence that such subjective ratings are reliable. Furthermore, what I know of cognitive biases, and the general inaccessibility of such mental processes to the conscious mind, suggests to me that these ratings are highly unreliable. I therefore conclude that these self-rating schemes are a waste of time and mental effort, but remain open to any data to the contrary.

Aided Recall

Medium difficulty
Medium sensitivity

Aided recall is simply a recall activity where the user is given some sort of hint or hints. This could be a mnemonic device (preferable, in my opinion), or the first letter of the answer (which has the advantage of being easily scriptable).
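
As an illustration of how little scripting the first-letter hint needs, here is a short sketch. `first_letter_hint` is a hypothetical helper; masking the remaining letters (which also reveals the word length) is a design choice of this sketch, not a requirement.

```python
def first_letter_hint(answer: str, mask: str = "_") -> str:
    """Show the first letter of each word and mask the remaining characters."""
    hinted_words = []
    for word in answer.split():
        hinted_words.append(word[0] + mask * (len(word) - 1))
    return " ".join(hinted_words)


hint = first_letter_hint("ice cream")  # "i__ c____"
```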

Because aided recall tasks can identify memories with relatively weaker strength, it is a more sensitive test than straight recall.

Recognition

Medium difficulty
Medium sensitivity

This is where you recognize the correct answer when it’s listed among other choices. Similar to aided recall, this is easier and more sensitive than straight recall.
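
A simple recognition question could be assembled along these lines. This is only a sketch: `make_recognition_question` is a hypothetical name, and drawing distractors at random from a pool of other answers is an assumption that tends to produce fairly easy questions.

```python
import random


def make_recognition_question(answer, distractor_pool, n_choices=4, rng=None):
    """Return a shuffled list of choices: the answer plus random distractors."""
    rng = rng or random.Random()
    # Drop duplicates and the answer itself from the pool of distractors.
    candidates = [d for d in dict.fromkeys(distractor_pool) if d != answer]
    if len(candidates) < n_choices - 1:
        raise ValueError("not enough distinct distractors")
    choices = rng.sample(candidates, n_choices - 1) + [answer]
    rng.shuffle(choices)
    return choices
```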

N.B. Multiple choice tests are by no means always recognition tests, although they are often described that way by people who don’t like them (as part of a straw man fallacy). Difficult multiple choice tests can include tempting answers that are close to, but not quite, true. Such tests can be extremely useful in testing students for misconceptions, or misunderstandings related to slight details. Evaluating similar answers to choose the best one is far from simply recognizing the correct answer. Likewise, multiple choice questions that require extensive calculation or reasoning in order to identify the answer are not simple recognition tests either.

Relearning

Most sensitive

When Ebbinghaus performed his now famous experiments, he measured the amount of time it took him to memorize lists of nonsense syllables and how much study time was subsequently saved when re-memorizing the same list later on. Ebbinghaus’s term for the amount of study time saved by previous study sessions was “savings”, and it’s the decline of savings, not recall probability, that you see charted in his works.
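
Savings is usually expressed as the percentage of the original study time that no longer has to be spent. A minimal sketch, with `savings_percent` as a hypothetical function name and times measured in seconds:

```python
def savings_percent(original_seconds: float, relearning_seconds: float) -> float:
    """Percentage of the original study time saved when relearning."""
    if original_seconds <= 0:
        raise ValueError("original_seconds must be positive")
    return 100.0 * (original_seconds - relearning_seconds) / original_seconds


# A list that took 1000 s to learn and 400 s to relearn shows 60% savings.
example = savings_percent(1000, 400)
```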

This was perhaps a fortunate choice, as such relearning tests are the most sensitive test of memory storage. Even memories so weakly stored that a person has no conscious awareness of them can lead to savings in study time for that person. This has been demonstrated experimentally.

Unfortunately, relearning isn’t really a specific activity, so implementing it in a flashcard application isn’t a straightforward proposition. It would be possible to create a system that records study time for a set of items, and then attempts to schedule that set for a later date, presumably targeting a certain savings level. This, however, is not what any flashcard application I know of does, and it would probably confuse most users.