Publication Date: 2003-09-00
Author: Childs, Ruth A.; Jaciw, Andrew P.
Source: ERIC Clearinghouse on Assessment and Evaluation
Costs of Matrix Sampling of Test Items. ERIC Digest.
Matrix sampling of items (that is, division of a set of items into different versions of a test form) is used by several large-scale testing programs. Like other test designs, matrixed designs have advantages and disadvantages. For example, testing time per student is less than if each student received all the items, but the comparability of student scores may decrease. Also, curriculum coverage is maintained, but reporting of scores becomes more complex. In this Digest, nine categories of costs associated with matrix sampling are discussed: development costs, materials costs, administration costs, educational costs, scoring costs, reliability costs, comparability costs, validity costs, and reporting costs.
Development costs include the cost of writing items, subjecting them to sensitivity and technical reviews, pilot and field testing them, and analyzing the pilot and field test results. In general, developing more items requires more staff time and more participation by schools. In small jurisdictions, developing large numbers of items may be particularly burdensome for two reasons: the cost of developing additional items raises the per-student cost of testing more quickly when fewer students take the test, and the numbers of schools and students available to pilot and field test new items are limited.
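The per-student effect described above can be illustrated with simple arithmetic. The sketch below is not taken from the Digest: the function name, the cost per item, the item counts, and the enrollment figures are all hypothetical, and it assumes development cost is spread evenly over every student tested.

```python
def per_student_development_cost(n_items, cost_per_item, n_students):
    """Total development cost spread evenly over every student tested."""
    return n_items * cost_per_item / n_students


# Hypothetical figures, chosen only for illustration.
COST_PER_ITEM = 1000.0   # cost to write, review, and field test one item
SINGLE_FORM_ITEMS = 40   # items needed if every student takes the same form
MATRIXED_ITEMS = 120     # items needed for a matrixed design

for n_students in (5_000, 100_000):
    single = per_student_development_cost(SINGLE_FORM_ITEMS, COST_PER_ITEM, n_students)
    matrixed = per_student_development_cost(MATRIXED_ITEMS, COST_PER_ITEM, n_students)
    extra = matrixed - single
    print(f"{n_students:>7} students: extra cost per student = {extra:.2f}")
```

Under these assumed figures, the same 80 additional items add far more to the per-student cost in the jurisdiction with 5,000 students than in the one with 100,000.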
Materials costs include the expense of printing the test booklets and of shipping them to schools. Longer tests are more expensive to print and, because the resulting booklets are larger and heavier, shipping costs more. In addition, although new computerized printing technologies are helping to decrease the costs of printing multiple versions of the test booklets, the complexity of preparing multiple versions for printing still means that they are more expensive to produce than a single version of a test.
If a test is to be administered by computer, the materials costs are very different, of
Another possibility is to print a single version of a test booklet and instruct different
In addition to test booklets, other materials must be printed and mailed to the schools, including instructions for handling the test booklets and administering the test and explanations of why the test is being given and how the results of the test should be interpreted. Parents may receive materials, either directly or through the schools, explaining the purpose of the test and reporting their child's results. Some of these materials may be distributed via the Internet, but printed materials are still required for parents without Internet access.
Administration costs include the time teachers and other school personnel must devote to preparing for the test administration (reviewing the procedures and sorting the materials), administering the test, and returning the materials to the scoring site.
Educational costs include the time that is taken from other educational activities for test preparation and administration. They also include the effect that anticipating the test has on the way teachers cover the curriculum. For example, teachers may increase the amount of time they spend teaching parts of the curriculum they expect to be on the test and decrease the amount of time spent on other concepts. The test may also affect the way that a district or school allocates resources. For example, resources may be directed disproportionately to the grade levels that will have to take the test, if the district or school believes that doing so might improve test results.
Scoring costs include the costs of scanning and processing "bubble sheets" for
The financial and logistical costs of scoring more student work may be partly offset,
Reliability costs. Reliability refers to how accurate and how consistent scores are.
Different levels of scores must also be considered. Many testing programs are required by law to produce both student scores and school- or district-level scores. In addition, they may provide summary results for the entire jurisdiction. Paradoxically, some test designs can increase the reliability of one score level while decreasing reliability at another level. In particular, if the number of items an individual student answers is small, the reliability of student scores will be low. However, if multiple forms of the test are administered, the number of items contributing to the school- or district-level score may be large.
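One way to see the student-level side of this trade-off is the Spearman-Brown formula from classical test theory, which projects reliability when a test is shortened or lengthened. The Digest does not name this formula; it is used here as an illustrative assumption, it presumes the items are roughly parallel, and the pool size and full-pool reliability below are hypothetical. The sketch does not compute school-level reliability; it only shows that the number of items contributing to the school score stays at the full pool size while each student answers fewer items.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical figures, chosen only for illustration.
POOL_ITEMS = 120          # items in the full pool
POOL_RELIABILITY = 0.95   # reliability of a student score based on all 120 items

for n_forms in (1, 2, 4, 8):
    items_per_student = POOL_ITEMS // n_forms
    student_rel = spearman_brown(POOL_RELIABILITY, items_per_student / POOL_ITEMS)
    print(
        f"{n_forms} form(s): {items_per_student:>3} items per student, "
        f"projected student-score reliability {student_rel:.3f}, "
        f"{POOL_ITEMS} items contribute to the school score"
    )
```

As the pool is divided into more forms, the projected student-score reliability falls, even though the school-level score still draws on every item in the pool.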
As Shoemaker (1971) explains, a classical test theory analysis of the resulting data would yield a mean test score for each group of students who happened to take the same items, and the mean school test score would be computed as a weighted composite of the subgroup scores. The standard error of the school mean test score based on the matrixed test would be smaller than the standard error from a test of the same length in which all student scores were based on the same items. In an item response theory (IRT) analysis of the same data, however, administering different items to different students would not necessarily improve score reliability for students, schools, and districts. To do so without increasing the number of items per student, the test would have to become "adaptive"; that is, it would avoid administering easy items to those students who have better mastery of the material and are almost certain to get those items right, and avoid administering hard items to students who are almost certain to get them wrong.
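The weighted composite mentioned above can be sketched as follows. This is a minimal illustration, not Shoemaker's procedure: it assumes each subgroup's mean is expressed as a proportion correct and that the weights are simply the number of students who took each form. The function name and the data are hypothetical.

```python
def school_mean_score(subgroups):
    """Weighted composite of subgroup means, weighted by number of students.

    subgroups: list of (n_students, mean_proportion_correct) pairs,
    one pair per group of students who took the same items.
    """
    total_students = sum(n for n, _ in subgroups)
    if total_students == 0:
        raise ValueError("at least one student is required")
    return sum(n * mean for n, mean in subgroups) / total_students


# Hypothetical school in which three forms were administered.
forms = [(30, 0.62), (28, 0.55), (32, 0.70)]
print(f"school mean proportion correct: {school_mean_score(forms):.3f}")
```

Weighting by student count keeps a form taken by only a few students from dominating the school mean; other weighting schemes, for example ones that also account for the number of items on each form, are possible.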
Comparability costs. It is usually assumed that the scores of different students taking a test can be compared. Comparability is improved by uniform administration conditions and equivalent marking. It also can depend on the particular items that students receive. If all students receive the same items, then their scores are easier to compare than if they receive different items. The comparability of aggregate scores, such as school- or district-level results, is also important to consider.
The approach chosen to analyze the test results makes a difference. If the items are
However, if classical test theory is used, then the particular items may affect comparability. IRT models require at least several hundred responses per item. The
"Consider the case of a state-level testing program that administers different sets of
In other words, the types of items that will most improve the meaningfulness of
Validity costs. Validity refers to the extent to which a test is measuring what it is
The degree to which a test measures the intended construct can also be affected by
Reporting costs. A more complex test design may require more explanatory materials and more communication with educators, parents, and the media. This is especially true if the complex design supports certain scores at some score levels (e.g., at the school- or district-level) and not at others.
The nine categories of costs will vary in importance depending on the testing program. Testing directors and their staffs must examine relevant costs in light of their mandate(s), the content of the tests, the financial resources available, and the
Bock, R. D., & Mislevy, R. J. (1987). Comprehensive educational assessment for the states: The duplex design. Evaluation Comment (November 1987, pp. 1-16). Los Angeles, CA: Center for Research on Evaluation, Standards and Student Testing, UCLA.
Brennan, R. L., & Johnson, E. G. (1995). Generalizability of performance assessments. Educational Measurement: Issues and Practice, 14, 9-12, 27.
Cronbach, L. J., Linn, R. L., Brennan, R. L., & Haertel, E. (1995). Generalizability
Fitzpatrick, A. R., Lee, G., & Gao, F. (2001). Assessing the comparability of school scores across test forms that are not parallel. Applied Measurement in Education, 14, 285-306.
Gao, X., Shavelson, R. J., & Baxter, G. P. (1994). Generalizability of Large-Scale
Haertel, E. H., & Linn, R. L. (1996). Comparability. In Technical Issues in Large-Scale Performance Assessment (pp. 59-78; Report No. NCES 96-802). Washington, DC: U.S. Department of Education.
Mislevy, R. J., Beaton, A. E., Kaplan, B., & Sheehan, K. M. (1992). Estimating
Shoemaker, D. M. (1971). Principles and Procedures of Multiple Matrix Sampling