1. Background Information
| Authors: | Cynthia B. Schmeiser and Douglas R. Whitney |
| Title: | Effect of Two Selected Item-Writing Practices on Test Difficulty, Discrimination, and Reliability |
| Source: | The Journal of Experimental Education |
| Year: | Spring 1975, pp. 30-34 |
2. Abstract
The goal of the study was to examine the impact of two undesirable item-writing practices on test difficulty. Multiple choice questions were used to examine the effects of i) questions which contain material unnecessary to answer the question (window dressing) and ii) questions which do not express a complete statement or question (incomplete item stems). A randomized block design was used with the "item writing practice" as the treatment variable and course achievement level as the blocking variable. The dependent variable was the score on the test. A separate ANOVA was run for each of the two item-writing practices. The results indicated that the treatment effect was not significant in the case of window dressing, but was significant in the case of incomplete item stems. The incomplete-stem finding is consistent with prior research, while prior findings on window dressing are mixed.
3. Null hypothesis, alpha level, and sample size per group
In both cases, it was predicted that a test which includes the undesirable items would be more difficult than a test in which the deficiency is corrected. So, the null hypotheses, which are not formally stated in the paper, could be stated as:
Ho1: The inclusion of questions in a test which contain unnecessary information (window dressing) does not cause such a test to be more difficult than one in which this practice is corrected.
This hypothesis was tested using a 75 question multiple choice exam administered to an undergraduate class (which appears to be in business) of 71 students, 65 of whom were included in the ANOVA.
Ho2: The inclusion of questions in a test which do not clearly state the question (incomplete item stems) does not cause such a test to be more difficult than one in which this practice is corrected.
This hypothesis was tested using a 61 question multiple choice exam administered to an undergraduate class (which appears to be in sociology) of 210 students, 175 of whom were apparently used in the analysis.
No alpha level is stated. The effects reported as significant in the ANOVAs were significant at the .01 level or beyond.
In both samples, students were blocked into five groups, based on previous achievement in the respective course involved.
4. Independent and dependent variables
There were two independent variables, the blocking and treatment variables, and one dependent variable.
The blocking variable was prior achievement in the applicable course. In each case, students were assigned to one of five levels. In the case of hypothesis 1, based on a business course, students were blocked based on a composite score of their performance on three previous tests in the course. In the case of hypothesis 2, based on a sociology course, students were assigned to blocks based on a composite score of their performance on three previous tests and a paper. Note that the blocking occurred subsequent to the assignment of tests to students.
The treatment was the type of test taken. One version of the test included items exhibiting the deficiency under study, and the second version was the same exam with those items corrected.
The dependent variable was the number of questions answered correctly on the exam.
5. Instrument, commenting briefly on its reliability and validity.
In the case of hypothesis 1, the test was a 75 question multiple choice exam. 20 of the questions were judged to be subject to the problem of window dressing. The second version of the test contained the same questions, except the 20 with window dressing problems were revised to correct for the errors.
In the case of hypothesis 2, the test was a 61 question multiple choice exam. 22 of the questions were judged to contain the problem of incomplete stems. The second version of the test revised those 22 items to correct for the errors.
The tests were written by the instructors who taught the respective classes. In both cases, tests in which at least one fourth of the items were subject to the appropriate problem were selected for the analysis. Also, the instructors who wrote the respective exams were asked to review the revised exams to determine that the revisions did not adversely affect the students' ability to answer the questions. Note that the revisions were made only in the item stems as opposed to the responses from which the students were to select the correct response.
The authors measured both the reliability (using the KR-20 measure of internal consistency) and validity of the tests. In both cases, the two versions of the tests were judged to be reliable and the revisions made did not harm the validity of the tests.
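For readers unfamiliar with the KR-20 coefficient mentioned above, the following is a minimal sketch of how it is commonly computed from right/wrong item scores. The function name `kr20` and the tiny response matrix are hypothetical illustrations, not data or code from the paper, and the sketch assumes the population variance of total scores (some texts use the sample variance instead).

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    responses: list of per-student lists, one 0/1 entry per item.
    Assumption: the variance of total scores is the population variance.
    """
    n_students = len(responses)
    k = len(responses[0])  # number of items on the test

    # Variance of the students' total scores.
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    if var_total == 0:
        raise ValueError("KR-20 is undefined when all total scores are equal")

    # Sum of p*q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n_students
        pq_sum += p * (1 - p)

    return (k / (k - 1)) * (1 - pq_sum / var_total)


# Hypothetical scores for 4 students on a 3-item test.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(example))
```

Values closer to 1 indicate that the items rank students more consistently; the paper's own KR-20 values are not reproduced here.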
6. Experimental procedure
As mentioned before, two versions of a multiple choice exam were administered to students, with one version containing design errors and the second version correcting for those errors. Students were randomly assigned to one of the two tests as every other student received one version of the test. Unfortunately, the blocking of students based on prior course achievement was not completed until after the test was administered, because the information necessary for the blocking was not available prior to the administration of the test. However, the authors state that the "differences in achievement were effectively randomized across forms." The same procedure was used to test both hypotheses.
7. Statistical analysis and conclusion
A randomized block design was used and separate two way ANOVAs were conducted to test each hypothesis. A post hoc analysis was not conducted.
Hypothesis 1 examined whether window dressing had an effect on test performance. As predicted in a randomized block design, the blocking factor of prior performance was significant at the .01 level (F ratio = 14.64, df = 4, 55, p<.01). However, the treatment effect was not significant (F = .02, df = 1, 55, p value not disclosed), indicating that in this case the test containing items for which window dressing is a problem is not more difficult than a test in which this problem has been corrected. Some prior research has encountered similar findings, while other prior studies have had conflicting results. The authors therefore conclude that the "nature of the window dressing is a critical factor in determining whether or not any effect on test difficulty occurs." As is assumed in a randomized block design, the interaction effect was also insignificant (F = .65, df = 4, 55, p value not disclosed).
Hypothesis 2 examined whether incomplete stems had an effect on test performance. As predicted, the blocking factor was significant (F = 31.04, df = 4, 165, p<.0005). Also, the treatment effect was significant (F = 23.88, df = 1, 165, p<.001), indicating that in this case the test containing items with incomplete stems is more difficult than a test which corrects for this problem. This is consistent with previous research as other studies have found, in varying degrees, that incomplete stems affect the difficulty of a test. These results, along with the results of previous studies conducted by the authors, indicate that when this problem is corrected, performance can be expected to improve by 6-11%.
Since the interaction was insignificant in both cases and the assumptions for Paull's criteria were not met, the authors properly used MS within as the denominator for the F ratios.
In both cases, the inclusion of badly designed items does not cause students to be ranked differently, so if grades are assigned on a relative basis, these deficiencies do not pose a problem. However, if students are assigned grades on an absolute basis, the problem of incomplete stems could affect a student's grade as the practice tends to somewhat lower grades.
The authors do indicate that there has been a limited amount of research in this area, and imply that the reader should use caution in generalizing the results to a particular setting different from those present in the study.
8. If I were the researcher, how would I improve the study?
The authors should have clearly stated the null hypotheses and disclosed preset alpha levels. It was possible to determine what the null hypotheses were, but it is totally unclear whether the authors had considered a preset alpha.
There is a disagreement on the sample size used for the testing of hypothesis 2 between the text and table 2. The text indicates that the sample size should be 180 (210 original subjects less 30 that could not be used) while table 2 indicates that 175 were used.
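The reported degrees of freedom offer one way to check which sample size was actually analyzed. The sketch below assumes a 2 (test version) by 5 (achievement block) design with the interaction in the model and a within-cell error term, so that the error df equals N minus the 10 cells. The function name `implied_sample_size` is a hypothetical helper, not something from the paper.

```python
def implied_sample_size(error_df, n_treatments=2, n_blocks=5):
    """Recover N from the error degrees of freedom of a two-way ANOVA.

    Assumes a full factorial model with interaction, where
    error df = N - (n_treatments * n_blocks).
    """
    return error_df + n_treatments * n_blocks


# Error df as reported in the two ANOVAs (55 and 165).
print(implied_sample_size(55))   # hypothesis 1
print(implied_sample_size(165))  # hypothesis 2

# Sample size implied by the text for hypothesis 2: 210 students less 30 omitted.
print(210 - 30)
```

Under these assumptions the error df of 55 implies 65 subjects for hypothesis 1, matching the text, while the error df of 165 implies 175 subjects for hypothesis 2, which agrees with table 2 rather than with the 180 implied by the text.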
Also, 6 subjects were omitted from the test of hypothesis 1 while approximately 30 were omitted from the testing of hypothesis 2. The authors should address this mortality problem more directly and ascertain whether it had any effect on the results. For example, the test results and block membership of the omitted subjects could be examined to determine whether a large portion of them belonged to a particular block.
Also, the blocking was not done until after the test was administered because the levels of prior achievement were not available until after the fact. This probably resulted in unequal cell sizes, although the authors do not address this issue. It seems as if the blocking could have been accomplished before the test was administered and then the tests could have been assigned to students so as to accomplish equal cell sizes. Also, the authors should disclose the cell sizes used in the actual testing so the reader could ascertain whether there was a problem due to unequal cell sizes.
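To illustrate why blocking after the fact can leave the cells unbalanced, here is a minimal sketch that alternates two test forms across students and then assigns five achievement blocks by rank. Everything in it is an assumption made for illustration: the composite scores are invented, the form labels are hypothetical, and the paper does not say how the five levels were defined (equal-sized rank groups are assumed here).

```python
from collections import Counter


def alternate_forms(n_students):
    """Give every other student the same form, in seating order."""
    return ["original" if i % 2 == 0 else "revised" for i in range(n_students)]


def block_by_rank(scores, n_blocks=5):
    """Assign block 0 (lowest) .. n_blocks-1 (highest) by rank of score.

    Assumption: blocks are equal-sized rank groups; the paper does not
    describe the actual cutoffs.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    blocks = [0] * len(scores)
    for rank, student in enumerate(order):
        blocks[student] = rank * n_blocks // len(scores)
    return blocks


def cell_sizes(forms, blocks):
    """Count students in each (block, form) cell."""
    return Counter(zip(blocks, forms))


# Hypothetical composite scores for 10 students, in seating order.
scores = [55, 90, 62, 71, 88, 47, 93, 66, 79, 58]
sizes = cell_sizes(alternate_forms(len(scores)), block_by_rank(scores))
for block in range(5):
    print(block, sizes[(block, "original")], sizes[(block, "revised")])
```

In this invented example two of the five blocks end up with both students on the same form, leaving the other cell empty. Blocking first and then alternating forms within each block would avoid that outcome.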
It is indicated that a certain number of questions were judged to be subject to the appropriate problem studied. It should be stated who judged these questions to contain the errors and on what basis the judgments were made. There are bound to be different degrees of the problems and the authors did not indicate, for example, whether they singled out only serious window dressing errors or any degree of error involving the problem.
No post hoc analysis was conducted. It would have been interesting to see, in the case of hypothesis 2, whether the problem of incomplete stems had an effect for certain blocks (achievement levels) more than others. My guess is that the problem would have had a more adverse effect on students in the lower prior achievement levels.
When the authors discuss the results, they do not make any conjectures as to how widespread these problems are in general. This should be addressed because, if these problems are encountered less frequently in practice, the problem of incomplete stems may not be as serious as the results of the analysis indicate. The authors chose tests which have at least one fourth of the questions subject to the particular problem. They should indicate why they imposed this criterion and how well it reflects the typical frequency of such errors.
There did not appear to be problems with any of the psychometric issues, as these multiple choice tests were both easy to administer and easy to score, and time did not appear to be a problem. In addition, as discussed in part 5, the reliability and validity of the tests were examined, and no problems were evident.