08 August 2010

How to Build a Test

Last month, I wrote a series of posts (Parts One, Two, and Three) about how an idea for a test item becomes an actual item. Items can take a lot of different pathways, depending upon the test. Advanced Placement free response questions (which are not reused) get a test drive in college freshman biology classes around the country. The SAT has an entire section on every test devoted to pilot items to replenish the bank. Some states exclusively use contractors to write items---although the need for revision, piloting, and data and content reviews remains part of the process. Other states, like ours, have classroom teachers write as many items as possible.

I had a comment on one of those posts asking about the test build: How is it determined which items get put on the test? What about the psychometrics?

First of all, tests are composed of pieces of very expensive real estate---not just in terms of the money spent developing, scoring, and reporting on them ($1M - $1.5M for our state...per test...per year), but also in terms of what gets tested. The volume of standards that could be assessed far outstrips the number of slots on a given test. This is why it is not accurate to claim that teachers are off in classrooms "teaching to the test." The standards selected differ from year to year---teachers don't know which ones will be specifically targeted, only the pool of standards the items are built from.

Tests typically have some design characteristics. For example, there are set numbers of item types (multiple choice, true/false, fill-in-the-blank, essay...). There are also specifics on how many items from particular areas of content may be asked. On a biology test, there will be percentages allotted to molecular, evolutionary, human, plant, and other sundry biology areas.
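Those blueprint constraints are easy to picture as a quick tally. Here's a minimal sketch in Python---the content areas come from this post, but the target percentages and the toy item list are invented for illustration, not taken from any real test's specifications:

```python
from collections import Counter

# Hypothetical blueprint: target share of the test for each content area.
# These percentages are made up for illustration.
blueprint = {"molecular": 0.30, "evolutionary": 0.20,
             "human": 0.30, "plant": 0.10, "other": 0.10}

# Toy draft build: each item tagged with its content area.
draft = ["molecular", "molecular", "molecular",
         "evolutionary", "evolutionary",
         "human", "human", "human",
         "plant", "other"]

counts = Counter(draft)
total = len(draft)
for area, target in blueprint.items():
    actual = counts[area] / total
    flag = "" if abs(actual - target) < 0.05 else "  <-- off target"
    print(f"{area:>12}: target {target:.0%}, actual {actual:.0%}{flag}")
```

Real test specifications run the same kind of check across item types, too (so many multiple choice, so many essay, and so on).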

A chunk of the test gets eaten up by "anchor items." These are items carried over unchanged from the previous test, and they need to appear in the same half of the test as before. The purpose of the anchor items is twofold. First, they allow for some equating from the old test to the new one (we can look at how two groups of students performed on identical items). Second, they serve as a test within a test: the items selected need to (inasmuch as possible) resemble the make-up of the overall test---the same percentages allotted to content, item types, and so on. Again, this gives us a psychometric comparison.
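The anchor comparison boils down to something simple: if two cohorts perform about the same on identical items, score differences elsewhere on the test are more credibly about the items than about the students. A toy sketch of that idea, with invented p-values (proportion correct) for four anchor items:

```python
# Toy p-values on the SAME four anchor items across two administrations.
# All numbers are invented for illustration.
anchors_last_year = [0.72, 0.55, 0.80, 0.64]
anchors_this_year = [0.70, 0.57, 0.79, 0.66]

mean_last = sum(anchors_last_year) / len(anchors_last_year)
mean_this = sum(anchors_this_year) / len(anchors_this_year)

# A shift near zero suggests the two cohorts are comparable, so the
# anchors can be used to link the two tests' score scales.
shift = mean_this - mean_last
print(f"mean anchor p-value shift: {shift:+.4f}")
```

Actual equating methods are far more sophisticated than a difference of means, but this is the intuition behind using common items.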

I wish I could say that it is a free-for-all beyond that---that you can just fill the remaining slots with items from the bank that are the right content and type. Plugging holes is part of the process; it's just not the whole story. Test builders keep track of how many times A, B, C, D, and E (for AP) get used as answers for multiple choice items. This can be trickier than you'd think, because you can't just randomly reorder the answers. Why not? Because the item was piloted a certain way, and all of the data for that item is based on that structure. If you reorder the answers, you are changing the item. For an item with four answer choices (A - D), it is considered okay to flip A with D and B with C if you have to balance the overall answer selections.
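The balancing act above can be sketched in a few lines. The A-with-D / B-with-C flip is just a reversal of the choice order (so the item's internal structure, and therefore its pilot data, is preserved); everything else here---the answer key itself---is made up for illustration:

```python
from collections import Counter

# Hypothetical answer key for a draft build of eight four-choice items.
key = {1: "A", 2: "C", 3: "A", 4: "D", 5: "A", 6: "B", 7: "A", 8: "C"}
print(Counter(key.values()))  # A is the key 4 times out of 8 -- lopsided

# The one reordering treated as safe for a piloted A-D item: reverse
# the order of the choices, which maps A<->D and B<->C.
FLIP = {"A": "D", "B": "C", "C": "B", "D": "A"}

key[5] = FLIP[key[5]]  # flip one over-represented item: A becomes D
print(Counter(key.values()))
```

In practice a builder would also re-letter the item's choices on the printed form to match; the point is that only this full reversal keeps the pilot statistics meaningful.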

The selected items must also add up psychometrically such that the test is as close as possible in difficulty to the previous test. We don't want to leave room for the scores to rise and fall as a result of using an "easy" test one year and a "hard" one the next. There are several values associated with a test item---I won't pretend to know them all. Two are particularly important: the p-value and point biserial.
  • P-value means something different here than in classical statistics---it is not the p-value of a hypothesis test. In test-speak, it's the proportion of students who answered the item correctly. The lower the p-value, the more difficult the item.
  • Point biserial is a correlation between the item score and the total test score. (Again, the data comes from the pilot.) In other words, did students who got the item correct also do better overall on the test? Here, the higher the number (i.e., the more positive the correlation), the better the item.
The p-values and point biserials are tracked across the entire test build. The psychometricians are looking for a sweet spot where the current test is as close as possible in difficulty to the previous version. This is where test build can get tedious, because sometimes you just need to find one item to swap out...and you don't have it. And once you start swapping, you throw off all the other things you had balanced (item types, content...). The bigger your item bank, the easier it is to handle these little burps.
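Both statistics are straightforward to compute once you have a 0/1 response matrix from the pilot. A minimal sketch---the responses are invented, and this uses the plain item-total correlation (real scoring programs may use a corrected version that excludes the item from the total):

```python
from statistics import mean, pstdev

# Toy pilot data: rows = students, columns = items, 1 = correct.
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]

def p_value(item):
    """Proportion of students who answered the item correctly."""
    return mean(row[item] for row in responses)

def point_biserial(item):
    """Pearson correlation between the 0/1 item score and total score."""
    col = [row[item] for row in responses]
    totals = [sum(row) for row in responses]
    mx, my = mean(col), mean(totals)
    cov = mean((x - mx) * (y - my) for x, y in zip(col, totals))
    return cov / (pstdev(col) * pstdev(totals))

print(p_value(0), round(point_biserial(0), 2))  # easy-ish item, decent correlation
```

With numbers like these in hand for every item in the bank, the builders can tally a draft test's overall difficulty and compare it to the previous form.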

Once you get to this point, you just have the more concrete pieces to deal with. What directions to the student should be included? How many items on a page? What instructions for the proctor are going to be provided? Where are you going to put the pilot items---and how many?

Want to know about a particular test? Check the website (College Board, ETS, your state's department of education...) for information on test specifications (which will tell you ahead of time about the structure) and technical reports (which will tell you about the psychometrics after scoring). If you're a Washington teacher, here are links to the technical reports, released items, and item analysis data for all released items (by state, district, and school). Specific test and item specs for each content area are linked from the main site.

I wish I knew more about how adaptive tests are built, as well as online tests which scramble items (and therefore throw off the careful structure of a build). If you've been involved with these and have some insight to share in the comments, I'd love to learn more about those.

Test builds are intricate processes---far more so than would be reasonable at a classroom level; however, classroom teachers have a much greater capability to capture a range of performance over time and with a variety of tools. I do wonder if I would have looked at tests differently in my classroom if I had access to the psychometrics and the time and knowledge to be more purposeful in how I constructed them.

PS If you want to see what the scoring process looks like, check out my post from the 2005 AP Biology Read.
