Metrics & Measurement
Want to hear more from Steven Just on “the good, the so-so and the ugly” of exam design? Check out the recording of the May 16, 2025 LTEN Webinar, Exam Quality Improvement: The Good, the So-So and the Ugly.
The basics for creating valid assessments are well-known: exams must be based on good question-writing practices; cover all the key topics, usually through questions that reflect learning objectives; include a balance of difficulty levels; avoid certain question types (such as “all of the above”); set a valid passing score; and allow post-exam results to be reviewed.
Based on years of consulting in assessment design, I’ve encountered a range of requested exam features — some excellent, some misguided. Let’s explore these ideas, separating what genuinely improves exam quality from what might sound appealing but ultimately does more harm than good.
Most exams use closed-response questions, overwhelmingly multiple-choice. Is this a good thing? No. It’s very difficult (though not impossible) to test at higher reasoning levels using closed-ended questions. In an ideal world, we would test a learner’s ability to analyze and respond. So, why are most tests multiple-choice? Simple: They are easy to score.
So, how can we test at higher cognitive levels? How can we assess learners’ abilities in data analysis, judgment, process explanation and decision-making? Clearly, the best way to do this is to use open-response or essay questions.
For example, here’s a basic foundational level multiple choice question:
Which chamber of the heart receives oxygen-poor blood from the body?
A. Right atrium
B. Left atrium
C. Right ventricle
D. Left ventricle
This could be converted into an open-response question that requires a higher level of process explanation:
Explain the role of the right atrium and how it works with the other three chambers of the heart to pump blood throughout the body.
You might think it would be simple to ask open-response questions, and at some level it is. But it is also more complicated than it first appears. To guard against any one rater’s “scoring bias,” you probably need multiple reviewers to read and score the responses and enter comments. You also need to adjust your scoring algorithm because, clearly, an essay question is “worth more” than a multiple-choice question. And so on.
Historically, open-response items have been avoided due to the time and cost of manual scoring. But the landscape is changing. AI-powered scoring tools have made significant strides in accuracy, consistency and efficiency, making essay-style assessment a viable option, even at scale.
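To make those scoring mechanics concrete, here is a minimal sketch, in Python, of how such a blended scoring routine might look. The 3x essay weight, the rater-averaging rule and all function names are illustrative assumptions, not a prescription:

```python
# Minimal sketch of blended exam scoring. Illustrative assumptions:
# essay items are weighted 3x a multiple-choice item, and several
# raters' 0-to-1 scores are averaged per essay to damp individual bias.

def score_multiple_choice(responses, answer_key):
    """One point per correct answer, zero otherwise."""
    return sum(1.0 for q, ans in responses.items() if answer_key.get(q) == ans)

def score_essay(rater_scores, weight=3.0):
    """Average several raters' 0-to-1 scores, then apply the item weight."""
    if not rater_scores:
        return 0.0
    return weight * sum(rater_scores) / len(rater_scores)

# Example: two multiple-choice items plus one essay read by three raters.
mc_points = score_multiple_choice({"q1": "A", "q2": "C"},
                                  {"q1": "A", "q2": "B"})   # -> 1.0
essay_points = score_essay([0.8, 0.7, 0.9])                 # -> 2.4
print(f"Total: {mc_points + essay_points:.1f} points")      # Total: 3.4 points
```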
A typical multiple-choice question contains four choices: the correct answer and three incorrect choices (distractors). But what if each test-taker sees the same correct answer but different distractors? The idea is to make it more difficult to cheat — similar to scrambling question order within the test and choice order within questions.
But it turns out that this feature leads to invalid exams. Why? It’s a little-appreciated property of multiple-choice questions: The incorrect choices influence the difficulty of a question.
Look at this multiple-choice question:
Who was the 12th president of the United States?
A. Zachary Taylor
B. Jimmy Carter
C. Ronald Reagan
D. Bill Clinton
Now compare it with this version:
A. Zachary Taylor
B. James Polk
C. Millard Fillmore
D. Franklin Pierce
Same question, but version two is much more difficult than version one because the distractors are more historically plausible. Having the “same” question but with different difficulty levels compromises test fairness and legal defensibility.
One of the absolute rules of certification testing is that every test-taker must see a comparably difficult exam (this is known as test equivalence). If not, you are opening your company up to legal jeopardy. The same concern applies when drawing random subsets of questions from a larger pool.
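By contrast, the legitimate anti-cheating techniques mentioned above (scrambling question order and choice order) keep every test-taker’s item content identical. A minimal sketch of that safer approach, with a hypothetical data layout:

```python
import random

# Minimal sketch: deter answer-sharing by shuffling question order and
# choice order per test-taker, while every test-taker still sees the
# identical set of distractors (preserving test equivalence).

QUESTIONS = [
    {
        "stem": "Who was the 12th president of the United States?",
        "choices": ["Zachary Taylor", "James Polk",
                    "Millard Fillmore", "Franklin Pierce"],
        "answer": "Zachary Taylor",
    },
    # ... more items, all drawn from the same fixed pool ...
]

def build_form(questions, seed):
    """Return a per-test-taker form: same items, scrambled ordering only."""
    rng = random.Random(seed)   # one seed per test-taker, for reproducibility
    form = []
    for q in rng.sample(questions, len(questions)):  # shuffle question order
        choices = q["choices"][:]
        rng.shuffle(choices)                         # shuffle choice order
        form.append({"stem": q["stem"], "choices": choices,
                     "answer": q["answer"]})
    return form

form = build_form(QUESTIONS, seed=12345)
```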
I advise providing three tries to pass an exam. So, what happens when someone fails? Typically, the test-taker gets a score, feedback and sees the same exam again. And if they fail again? Then they will again see their score, get feedback and take the same exam a third time.
But intuition tells us something is wrong with that process. If someone keeps seeing the same questions with repeated feedback, they will just memorize the correct answers even if they don’t understand why the answers are correct. Parallel questions allow us to ask the same question but in a different form. For example:
First Attempt: Which chamber of the heart receives oxygen-poor blood from the body?
Second Attempt: The function of the right atrium is to:
A. Receive oxygen-poor blood from the body
B. Pump oxygen-poor blood to the lungs
C. Receive oxygen-rich blood from the lungs
D. Pump oxygen-rich blood to the body
It’s the same question but in two different forms, precluding question/answer memorization.
This is a nice feature, but in my experience, few test creators use it. Why? It’s hard enough to write the first set of questions; doubling or even tripling the number of questions is an effort no one wants to go through.
So, parallel questions are perfectly valid, but be careful about test equivalence here as well: It is possible that the second-attempt question will be at a different difficulty level than the first-attempt question.
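Here is a minimal sketch of how attempt-based parallel forms might be wired up. The data layout and the third variant are invented for illustration:

```python
# Minimal sketch: serve a different parallel variant of each item on
# each retake attempt. The item-bank layout is an illustrative assumption.

ITEM_BANK = {
    "heart_ra_function": [
        # Attempt 1 variant
        "Which chamber of the heart receives oxygen-poor blood from the body?",
        # Attempt 2 variant (same concept, different form)
        "The function of the right atrium is to:",
        # Attempt 3 variant (invented here for illustration)
        "Oxygen-poor blood returning from the body first enters which chamber?",
    ],
}

def stem_for_attempt(item_id, attempt):
    """Pick the parallel variant for this attempt; reuse the last variant
    if there are more attempts than variants."""
    variants = ITEM_BANK[item_id]
    return variants[min(attempt - 1, len(variants) - 1)]

print(stem_for_attempt("heart_ra_function", 2))
# -> "The function of the right atrium is to:"
```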
Multiple-choice questions are easy to score: one point for a correct answer, zero points for an incorrect answer. But what about “select all that apply” questions? If a question contains three correct choices (and three distractors) and a learner selects all three correct answers, then clearly the learner gets full credit. But what about a learner who selects two of the three? Should this learner get two-thirds of a point or no credit? Plausible arguments can be made either way, though in my experience learners do get upset if they don’t get partial credit.
That’s straightforward, but what if someone chooses the three correct responses but also one of the incorrect responses? Should they get the same full credit as the person who chooses the three correct responses and none of the incorrect ones? That doesn’t seem fair, so in addition to giving (partial) points for correct responses, you also need to subtract (partial) points for incorrect responses.
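As one concrete way to “do the math,” here is a minimal sketch of a common partial-credit rule in which each incorrect selection cancels one correct selection and the item score floors at zero. The penalty scheme is a design choice, not the only defensible one:

```python
def partial_credit(selected, correct):
    """Partial credit for a select-all-that-apply item.

    Each correct selection earns a fraction of the point; each incorrect
    selection subtracts the same fraction; the item score floors at zero.
    """
    fraction = 1.0 / len(correct)
    hits = len(selected & correct)     # correct choices selected
    misses = len(selected - correct)   # distractors selected
    return max(0.0, (hits - misses) * fraction)

# Three correct choices (A, B, C) among six options:
print(partial_credit({"A", "B"}, {"A", "B", "C"}))            # 2/3 of a point
print(partial_credit({"A", "B", "C", "D"}, {"A", "B", "C"}))  # (3-1)/3 = 2/3
print(partial_credit({"A", "B", "C"}, {"A", "B", "C"}))       # full credit: 1.0
```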
The bottom line is that partial credit is a fine idea and usually welcomed by learners, but be careful how you do the math.
A well-meaning client once told me, “I expect employees to know everything — they need all of it to do their jobs.” They insisted on a 100% passing score. This is an invalid idea for multiple reasons.
To begin with, and this should be obvious, no one is perfect. What I wanted to ask (but didn’t) was: “Did you graduate from college? Did you get an A in every course, a 100% on every test?” Of course they didn’t, and we shouldn’t expect that from our employees either.
We need to set reasonable standards, not super-human standards no one can possibly achieve. And even if someone were to master all the material and score 100%, we know from decades of research that within a week they will forget a significant percentage of what they knew (the Ebbinghaus Forgetting Curve).
There are scientifically proven, legally defensible methods for setting passing scores, but arbitrarily picking a number (and it doesn’t matter what that number is) is not one of them.
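The best known of these is the (modified) Angoff method, in which a panel of subject-matter experts estimates, item by item, the probability that a minimally competent candidate would answer correctly; the cut score is built from those ratings. A minimal sketch, with invented ratings:

```python
# Minimal sketch of an Angoff-style cut score. Each row holds one
# subject-matter expert's per-item probability that a minimally
# competent candidate answers correctly. Ratings are invented.

RATINGS = [
    [0.80, 0.60, 0.90, 0.70],  # SME 1
    [0.75, 0.65, 0.85, 0.60],  # SME 2
    [0.85, 0.55, 0.95, 0.65],  # SME 3
]

def angoff_cut_score(ratings):
    """Average each item's ratings across judges, then sum across items
    to get the expected score of a minimally competent candidate."""
    n_items = len(ratings[0])
    item_means = [sum(judge[i] for judge in ratings) / len(ratings)
                  for i in range(n_items)]
    return sum(item_means)

cut = angoff_cut_score(RATINGS)
print(f"Cut score: {cut:.2f} of {len(RATINGS[0])} points "
      f"({cut / len(RATINGS[0]):.0%})")   # Cut score: 2.95 of 4 points (74%)
```

In practice, Angoff studies add discussion rounds and impact data, but the arithmetic at their core is this simple.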
Exam design is both an art and a science. Incorporating new features like open-response questions, parallel forms and partial credit scoring can improve exam validity and learner engagement. But tread carefully. Features like randomized distractors or excessive passing thresholds, though well-intentioned, can compromise the integrity of your assessment program.
As testing technology evolves, so should our strategies. With thoughtful design and a commitment to fairness, exams can become not just gatekeepers of knowledge — but powerful tools for learning and growth.
Steven Just, Ed.D., is CEO and principal consultant at Princeton Metrics. Email Steven at sjust@princetonmetrics.com or connect through LinkedIn at linkedin.com/in/steven-just-081b76.