Test Reliability

Lucy C. Jacobs, Ph.D.

Would your students get the same scores if they took one of your tests on two different occasions? Would they get approximately the same scores if they took two different forms of one of your tests? These questions have to do with the consistency with which your classroom tests measure students’ achievement. The generic name for consistency is reliability. Reliability is an essential characteristic of a good test, because if a test doesn’t measure consistently (reliably), then one cannot count on the scores resulting from a particular administration to be an accurate index of students’ achievement. One wouldn’t trust bathroom scales if the reading fluctuated according to the temperature or humidity or if the scales had a loose spring. Similarly, we can’t trust the scores from our tests unless we know about the consistency with which they measure. Only to the extent that test scores are reliable can they be useful and fair to students.

Technically, reliability shows the extent to which test scores are free from errors of measurement. No classroom test is perfectly reliable because random errors operate to cause scores to vary or be inconsistent from time to time and situation to situation. The goal is to try to minimize these inevitable errors of measurement and thus increase reliability.

Sources of Error

What are some of the factors that introduce error into measurement?

(1) Item Sampling — Because any test is only a sample of all possible items, the item sample itself can be a source of error. Longer tests are typically more reliable because we get a better sample of the course content and students’ performance. Suppose an instructor, to measure achievement in a biology unit, gave a one-question test. Students who knew this one question would show perfect achievement, but students who didn’t would fail. Obviously, a one-question test would not provide a reliable estimate of the students’ knowledge. But as more and more questions were added, one would obtain a sample that better fits the unit of instruction and yields scores that more accurately reflect real differences in achievement. So by increasing the length of the test (the size of the sample) we increase the consistency of our measurement.

A longer test also tends to reduce the influence of chance factors such as guessing. If an instructor gave, say, a ten-item multiple choice test, a student might know six of the items and guess at the other four. If the student happened to guess all four correctly, he/she would show perfect achievement. If the student happened to guess all four incorrectly, he/she would show 60 percent achievement. If that test, however, had 100 items, the student’s lucky guesses would be more nearly balanced by unlucky ones, and the score would be a more reliable indication of real knowledge.
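The guessing example can be worked out numerically. The sketch below assumes the student knows 60 percent of the items on either test and guesses blindly among four options on the rest; the four-option format and the function name are assumptions made for illustration, not details from the discussion above.

```python
import math

def guessing_spread(total_items, known_fraction=0.6, p_guess=0.25):
    """Return (expected percent score, SD of percent score) for a student
    who knows a fixed share of the items and guesses blindly on the rest."""
    known = round(total_items * known_fraction)
    guessed = total_items - known
    expected_correct = known + guessed * p_guess
    # Number of lucky guesses follows a binomial distribution.
    sd_correct = math.sqrt(guessed * p_guess * (1 - p_guess))
    return 100 * expected_correct / total_items, 100 * sd_correct / total_items

for n in (10, 100):
    mean_pct, sd_pct = guessing_spread(n)
    print(f"{n:3d} items: expected {mean_pct:.1f}%, spread (SD) {sd_pct:.1f} points")
```

Under these assumptions the expected score is the same on both tests, but the chance spread around it is roughly three times smaller on the 100-item test.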

There is a caveat at this point: Lengthening a test improves reliability only when the additional items are of good quality and as reliable as the original ones. Adding poor quality items will actually introduce error and lower reliability. Furthermore, there is a point of diminishing returns: if we add too many items, we risk student fatigue that will lower reliability.
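The effect of test length is commonly estimated with the Spearman-Brown prophecy formula, which predicts the reliability of a test lengthened by a factor k with items of the same quality as the originals. The sketch below applies it to a hypothetical 20-item test with a reliability of 0.60; both numbers are invented for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times
    as long using items comparable to the original ones."""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

base = 0.60  # hypothetical reliability of a 20-item test
for k in (1, 2, 3, 4):
    print(f"{20 * k:3d} items -> predicted reliability {spearman_brown(base, k):.2f}")
```

Each doubling buys less than the one before it, which is the statistical side of diminishing returns. The formula assumes comparable items and says nothing about fatigue or poorly written items, so it should be read as an optimistic estimate.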

(2) Construction of the Items — Another major threat to reliable measurement is poorly worded or ambiguous questions or trick questions. Consider the following simple examples:

1. What is the best index of reliability for the classroom teacher to use?

a. Split-half
b. Kuder-Richardson
c. Coefficient of stability
d. Coefficient of equivalence
e. Standard error of measurement

2. (true or false) Reliability of a test depends on its length.

3. Silas Marner was written by _____________________ .

For item 1, the correct answer would depend on what the instructor means by “best.” “Best” could refer to ease of calculation, meaningfulness, or something else. Item 2 is ambiguous, and may be either true or false, depending upon the interpretation. In item 3, the answer “George Eliot” seems obvious, but the question is phrased in such a way that several other answers are technically correct: “Mary Ann Evans,” “a woman,” “1861,” and “by hand.”

Test questions that permit widely varying interpretations of what is expected are not likely to yield highly reliable scores.

(3) Test Administration–Environmental factors such as heat, light, and noise, as well as confusing directions and different amounts of testing time allowed to different students, can affect students’ scores. The more such factors interfere with a student’s performance, the less faith we can have in the accuracy of the test scores.

(4) Scoring — Objectivity, or the extent to which equally competent scorers obtain the same score, is a factor affecting reliability. An objective test is more reliable because the test scores reflect true differences in achievement among students and not the judgment and opinions of the scorer. Typically, essay tests have lower reliability than multiple choice tests because the subjectivity in scoring lowers reliability. This does not mean, however, that instructors should not use essay tests. There are things we can do to improve the reliability of essay tests.

(5) Difficulty of the Test–A test that is either too easy or too difficult for the class taking it will typically have low reliability. This occurs because the scores will be clustered together at either the high end or the low end of the scale, with small differences among students. Reliability is higher when the scores are spread out over the entire scale, showing real differences among students.
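One way to see why extreme difficulty compresses scores: for an item scored right or wrong, the variance of the item score is p(1 - p), where p is the proportion of students answering correctly. The short sketch below evaluates this for an item that is too hard, one of moderate difficulty, and one that is too easy; the three proportions are chosen only for illustration.

```python
def item_variance(p):
    """Variance of a right/wrong item answered correctly by a proportion p."""
    return p * (1 - p)

for p in (0.10, 0.50, 0.90):
    print(f"proportion correct {p:.2f} -> item variance {item_variance(p):.2f}")
```

The variance peaks when half the class answers correctly and shrinks toward zero at either extreme, so very easy or very hard items contribute little to spreading students out.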

(6) Student Factors–Student fatigue, illness, or anxiety can induce error and lower reliability because they affect performance and keep a test from being a measure of students’ true ability or achievement.

Measures of Reliability

Reliability measures are concerned with determining the degree of inconsistency in scores due to random error. The calculation of reliability indices is beyond the scope of this discussion. The item analysis for objective tests that BEST’s Digitek Test Scoring and Item Analysis Program provides to faculty includes three indices of reliability. Two of these, the Spearman-Brown and the Kuder-Richardson, provide estimates of the extent to which students would receive similar scores if they were re-tested with an equivalent form of the test. The Spearman-Brown reflects consistency due to item sampling only. The Kuder-Richardson (K-R 20) measures consistency of responses to all the items within the test and reflects two error sources: item sampling and heterogeneity of the content domain sampled. Both of these indices report reliability as a coefficient ranging in size from 0.00 (no consistency) to 1.00 (perfect consistency). Obviously, the larger the coefficient, the better. The extent to which the coefficient falls below 1.00 is the extent to which errors of measurement are present.
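Instructors do not need to compute these coefficients by hand, but a small sketch may make them less mysterious. The code below uses the textbook K-R 20 formula and an odd-even split-half correlation stepped up with the Spearman-Brown formula. The response matrix is invented, and the scoring program's exact computations are not described here, so this should not be read as its method.

```python
from statistics import mean, pvariance

def kr20(responses):
    """Kuder-Richardson formula 20 for a students-by-items matrix of 0/1 scores."""
    k = len(responses[0])                        # number of items
    totals = [sum(row) for row in responses]     # each student's total score
    pq_sum = 0.0
    for i in range(k):
        p = mean(row[i] for row in responses)    # proportion correct on item i
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / pvariance(totals))

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def split_half(responses):
    """Correlate odd-item and even-item half scores, then step the result
    up to full test length with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in responses]
    even = [sum(row[1::2]) for row in responses]
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)

# Hypothetical data: 6 students (rows) by 6 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
print(f"K-R 20:     {kr20(responses):.2f}")
print(f"Split-half: {split_half(responses):.2f}")
```

With so few students and items the two estimates differ noticeably; on a real class-sized data set they are usually closer.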

Although it is not possible to obtain perfectly reliable scores in measuring classroom achievement, some instructors are able to construct tests that have reliability coefficients of 0.90 and above. We should strive to write tests that yield reliability coefficients of at least 0.70.

Another way to express reliability is in terms of the standard error of measurement. This measure provides an estimate of how much an individual’s score would be expected to change on re-testing with the same or an equivalent form of the test. Based on the assumption that any test score contains an error component, the standard error of measurement is used to estimate a band or interval within which a person’s true score would fall, that is, the hypothetical score the student would receive if there were no error of measurement. Using an interval takes into account the fact that errors can result in a person’s score on any test appearing higher or lower than it really should be.

For example, assume that Student A has a score of 82 on a test with a standard error of measurement equal to 4. We use the latter index to place a confidence band of ± one standard error of measurement around the observed score and say that we are 68% confident that the student’s true score would be in that range (78 to 86), or we can say with 95% confidence that the student’s true score lies in an interval within two standard errors of measurement of the observed score (between 74 and 90). An alternative interpretation states that 95 times out of 100 the student’s score on a re-test would be between 74 and 90.
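The arithmetic for Student A's bands is simple enough to put in a few lines. The function name below is hypothetical; the numbers are the ones from the example.

```python
def score_band(observed, sem, n_sems=1):
    """Interval of `n_sems` standard errors of measurement around a score."""
    return observed - n_sems * sem, observed + n_sems * sem

print(score_band(82, 4, 1))  # about 68% band: (78, 86)
print(score_band(82, 4, 2))  # about 95% band: (74, 90)
```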

The standard error of measurement is very important because it alerts instructors to the fact that test scores are not exact, but that they always contain some error. Because of the imprecision of classroom measurement, test scores should be considered as an estimate of students’ achievement level. How good an estimate depends on the magnitude of the standard error of measurement.

There is no ideal standard error of measurement. Because the size of the standard error of measurement indicates how far an obtained test score might be from the real error-free score, the smaller it is the better. The size of the standard error of measurement is inversely related to the size of the reliability coefficient. If a test has a low reliability coefficient, one would expect large variations in students’ scores on re-testing (a large standard error of measurement), but a high reliability coefficient indicates little variation in scores and hence a low standard error of measurement.
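The inverse relationship follows from the usual formula for the standard error of measurement: the standard deviation of the test scores multiplied by the square root of one minus the reliability coefficient. The sketch below assumes a hypothetical test whose scores have a standard deviation of 10 points.

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

sd = 10  # hypothetical standard deviation of the test scores
for r in (0.50, 0.70, 0.90):
    sem = standard_error_of_measurement(sd, r)
    print(f"reliability {r:.2f} -> SEM {sem:.2f} points")
```

Moving from a reliability of 0.50 to 0.90 cuts the standard error by more than half for the same spread of scores.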

Improving the Reliability of Classroom Tests

The best suggestions for improving the reliability of classroom tests are:

(1) Write longer tests. Instructors often want to know how many items are needed in order to provide reliable measurement. It is not easy to answer this question, because it depends on the quality of the items, the difficulty of the items, the range of the scores, and other factors. The best advice is to include as many questions as you think the students can complete in the testing time available. Around 40 multiple choice questions would seem an appropriate number for a regular class period.

(2) Pay more attention to the careful construction of the test questions. Phrase each question clearly so that students know exactly what you want. Try to write items that discriminate between good and poor students and are of an appropriate difficulty level. If you use BEST’s Digitek scoring service, you receive information on the discrimination and difficulty level of each objective item. Test items that contribute to test reliability have a discrimination index (R) that is positive and .40 or higher, and a moderate difficulty level (P) of 35 to 80.

(3) Start planning the test and writing the items well ahead of the time the test is to be given. A test written hurriedly at the last minute is not likely to be a reliable test.

(4) Write clear directions and use standard administrative procedures.
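The item screening described in the second suggestion above can be sketched as follows. The code assumes that P is the percentage of students answering the item correctly and that R is the correlation between the item score and the total test score; the scoring service may define these indices differently, and the response data are invented.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either list has no spread."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (sx * sy)

def item_analysis(responses, min_r=0.40, p_range=(35, 80)):
    """Return (item number, P, R, meets criteria) for each item."""
    totals = [sum(row) for row in responses]
    report = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        p = 100 * mean(item)            # assumed: P = percent correct
        r = pearson(item, totals)       # assumed: R = item-total correlation
        meets = r >= min_r and p_range[0] <= p <= p_range[1]
        report.append((i + 1, p, r, meets))
    return report

# Hypothetical data: 6 students (rows) by 4 items (columns), 1 = correct.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
]
for number, p, r, meets in item_analysis(responses):
    verdict = "meets criteria" if meets else "review"
    print(f"item {number}: P = {p:.0f}, R = {r:.2f} -> {verdict}")
```

In this made-up data set only the second item meets both criteria. The first is answered correctly by everyone and so cannot discriminate, and the other two have moderate difficulty but weak or negative discrimination.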

Because students’ grades are dependent on the scores they receive on classroom tests, faculty should strive to improve the reliability of their tests. The overall reliability of classroom assessment can be improved by giving more frequent tests. The composite based on scores from several tests and quizzes typically has higher reliability than the individual components. The positive and negative errors for individual students tend to even out over a semester.
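A rough sense of how much the errors even out: if each test has the same standard error of measurement and the errors on different tests are independent, the standard error of a student's average across n tests shrinks by the square root of n. Both conditions are simplifying assumptions, and the 4-point figure below is hypothetical.

```python
import math

def composite_sem(sem_per_test, n_tests):
    """Standard error of the average of n tests with equal, independent errors."""
    return sem_per_test / math.sqrt(n_tests)

for n in (1, 4, 9):
    print(f"{n} test(s): standard error of the average = {composite_sem(4, n):.2f} points")
```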