METRICS & ANALYSIS
By Steven Just, Ed.D.
We all know that life sciences products emerge from a rigorous, scientific process. Research results must be valid and reliable and stand up to external scrutiny. The consequences of a non-rigorous research & development (R&D) process can be substantial.
It’s reasonable to recognize that learning & development (L&D) is not R&D. But, as learning professionals, shouldn’t we attempt to be scientific and rigorous in our practice as well?
This is especially true when we are administering assessments. If we are giving exams with potential job or career consequences, shouldn’t we be able to demonstrate that our exams are fair, valid and reliable? And if we can’t, have we not potentially placed the company in legal jeopardy?
We know from experience that companies vary widely in how seriously they treat exams. At one end of the spectrum, some companies treat their assessments as mild annoyances to be dispensed with as effortlessly and as quickly as possible (a check-the-box approach), while at the other end some companies follow rigorous, defensible processes.
So, what makes a process defensible? These are the questions (with answers!) I most frequently encounter:
What is validity?
There are several types of validity: construct, predictive, content and face. For the most part, we are concerned with content validity, which is the adequacy with which a domain of content is tested. Importantly, content validity is not a quantitative measure. It flows from a process, including properly constructed questions, domain coverage, setting a valid passing score, ensuring fairness and using a “sensible” method of testing.
What is reliability?
Unlike validity, reliability is a number. It varies from 0 (no reliability) to 1 (perfect reliability). It refers to “replicability.”
That is, if we gave the same test again to the same or an equivalent group of people, would we get the same results? Since you almost never give the same exam to the same people twice, there are statistical methods for calculating reliability from a single exam delivery.
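The article doesn’t name a specific statistic, but the most common single-administration estimate is Cronbach’s alpha (equivalent to KR-20 when items are scored right/wrong). Here is a minimal sketch, using a hypothetical score matrix, of how that estimate is computed:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Internal-consistency reliability: 0 = no reliability, 1 = perfect reliability."""
    k = scores.shape[1]                          # number of questions
    item_vars = scores.var(axis=0, ddof=1)       # variance of each question's scores
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of test-takers' total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 5 test-takers x 4 questions, scored 1 (correct) or 0 (incorrect)
scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])
print(f"reliability (alpha) = {cronbach_alpha(scores):.2f}")  # 0.80 for this toy data
```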
Can an exam be reliable but not valid?
Yes. For example, if we give a calculus test to a second-grade class, the results will be very poor (to say the least). If, two days later, we gave the same exam to these students, the results would be the same. This exam is reliable, but not valid. (It doesn’t have construct validity.) But for an exam to be valid, it must also be reliable.
Does every exam need to be validated?
No, only exams with consequences (also known as high-stakes exams). If you are using assessments for quizzing, pre-testing, gamification, adaptive questioning, module-level tests or for diagnosing learning gaps and targeting remediation, the assessment does not need to be validated.
That said, the questions in any assessment, including low-stakes assessments, should be written according to sound question-writing rules.
Is our passing score defensible?
This is very important. It depends on how you arrived at that number. Was your reasoning something like: “Well, that’s what it’s always been at our company, so that’s what I use”? If it is, you could be placing your company in legal jeopardy.
If exam failure has job or career consequences, you need to be able to justify that number, and the only way to justify a passing score is through a validated, legally defensible standard-setting process.
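The article doesn’t prescribe a particular standard-setting method, but one widely used, defensible approach is the modified Angoff procedure: subject-matter experts estimate, for each question, the probability that a minimally qualified candidate would answer it correctly, and the passing score is derived from those judgments. A minimal sketch, with hypothetical ratings:

```python
# Hypothetical modified-Angoff ratings: for each question, each of three
# subject-matter experts estimates the probability (0 to 1) that a minimally
# qualified candidate would answer correctly.
angoff_ratings = {
    "Q1": [0.90, 0.85, 0.80],
    "Q2": [0.60, 0.70, 0.65],
    "Q3": [0.75, 0.80, 0.70],
}

# The recommended cut score is the average expected score of that
# minimally qualified candidate across all questions.
item_means = [sum(ratings) / len(ratings) for ratings in angoff_ratings.values()]
cut_score = sum(item_means) / len(item_means)
print(f"Recommended passing score: {cut_score:.0%}")  # 75% for these ratings
```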
How long should an exam be?
The simple answer: as long as it needs to be to cover the important content. I have validated more than a hundred product information or package insert (PI) certification exams. Typically, I find that an exam in the 40-question range will cover a PI. Other exams could be shorter or longer depending on the content being tested.
As a rule of thumb, I generally limit exams to no more than 40 or so questions. If more questions are required to cover the content, then I break the exam into two parts.
Should exams be timed?
For certification exams, the answer is absolutely yes. Since most exams are delivered remotely and are not proctored, if you do not time the exam, it effectively becomes an open-book exam.
Timing the exam as a whole (one overall time limit) is still equivalent to making the exam open book. Why? Because test-takers will go through the exam, answer the questions they know, and then use the leftover time to look up the answers to the questions they don’t know.
In an unproctored exam, the only way to ensure exam integrity is to time by the question. If your testing system does not have this basic feature, then you may want to think about acquiring a new testing system.
Side note: I have found that many companies use their learning management systems for testing, and these systems frequently lack key features, including question-level timing.
Why shouldn’t I use “all of the above” as a choice?
Two reasons. First, when “all of the above” is a choice, it is more often than not the correct answer. Second, in a multiple-choice question with four or more choices, you only need to know that two of the choices are correct to know that “all of the above” must be the correct answer.
I recommend using “select all that apply” questions when there are multiple correct responses. Be sure to include at least one incorrect response.
Can I use “select all that apply” questions?
Yes, but see above. There is one problem with “all that apply” questions, however: they are more difficult than standard multiple-choice questions (unless you give partial credit, but very few testing systems have this feature).
Here’s a trick: If you want to simplify an “all that apply” question, tell the test-taker how many responses are correct. (“Select the three correct answers from the list below.”)
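For testing systems that do support the partial credit mentioned above, “select all that apply” items are often scored by rewarding correct selections and penalizing incorrect ones. The article doesn’t specify a formula; this is a minimal sketch of one common scheme:

```python
def partial_credit(selected: set, correct: set) -> float:
    """Score in [0, 1]: credit for each correct pick, equal penalty for each wrong pick."""
    hits = len(selected & correct)           # correct options the test-taker picked
    false_picks = len(selected - correct)    # incorrect options the test-taker picked
    return max(0.0, (hits - false_picks) / len(correct))

# Hypothetical item with three correct options out of five
correct = {"A", "C", "E"}
print(round(partial_credit({"A", "C"}, correct), 2))       # 0.67: two of three correct picks
print(round(partial_credit({"A", "B", "C"}, correct), 2))  # 0.33: penalty for the wrong pick
```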
Can I use “none of the above” as a choice?
I’ve heard people say not to use it, but I have no problem with it as a choice if you don’t overuse it. It actually makes the question more difficult.
Why? In a standard multiple-choice question, even if you are not sure of the correct answer, you know one of the choices must be correct. But if you include “none of the above” as a choice, the correct response could be anything, not just one of the listed choices.
What about true/false questions?
T/F questions are a valid question type. However, as we well know, the test-taker has a 50% chance of guessing the correct answer. So you can use them, but sparingly.
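To see why sparing use matters, here is a minimal sketch (all numbers hypothetical) of the chance that a pure guesser reaches a passing score on an exam made entirely of T/F questions; the longer the exam, the smaller that chance:

```python
from math import ceil, comb

def p_pass_by_guessing(n_items: int, passing_fraction: float, p_correct: float = 0.5) -> float:
    """Probability that a test-taker guessing every item reaches the passing score."""
    needed = ceil(n_items * passing_fraction)          # questions needed to pass
    return sum(
        comb(n_items, k) * p_correct**k * (1 - p_correct)**(n_items - k)
        for k in range(needed, n_items + 1)
    )

print(f"{p_pass_by_guessing(10, 0.8):.1%}")  # about 5.5% on a 10-question all-T/F exam
print(f"{p_pass_by_guessing(40, 0.8):.3%}")  # under 0.01% on a 40-question exam
```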
There is a variation on the T/F question that I like: the “T/F because” question, in which you must select why the statement is true or false.
Creating fair, valid and reliable assessments is a science. As learning professionals, we need to incorporate this science into our practice.
Steven Just, Ed.D., is the CEO and principal consultant of Princeton Metrics. Dr. Just can be reached at sjust@princetonmetrics.com.