20 July 2010

From Standards to Assessment, Part III: Pilot to Test

Assuming that our adolescent items (see Parts I and II) are ready for their debut, they will make their way onto a pilot test. Because the items target the same standard, they would not appear on the same test. If the pilot is embedded in a regular test, the items would be left off any test that already had a similar "live" item. Kids complete the test and their answers are scored.

For Item One, scoring is simple:
1. Which energy transformation creates food for a plant?
a. light to heat
b. light to chemical
c. chemical to kinetic
d. chemical to light

There is only one correct answer (b). We just have to track data on how many students answer it correctly (as well as how many choose each of the other answers). We will also want to know something about the students who answered correctly---did they get credit for most of their answers on the whole test...or not?
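For the curious, here is a minimal sketch of what that bookkeeping could look like in Python. It is only an illustration, not how any particular testing program does it. The function name item_stats, the sample responses, and the total scores are all made up. It reports the proportion of students who answered correctly, the count for each answer choice, and a simple correlation between getting this item right and the total test score (often called a point-biserial).

```python
from collections import Counter
from statistics import mean, pstdev


def item_stats(choices, totals, key):
    """Summarize one multiple-choice item.

    choices: the option each student picked (hypothetical data)
    totals:  each student's total score on the whole test
    key:     the correct option
    """
    n = len(choices)
    correct = [1 if c == key else 0 for c in choices]

    # Proportion of students who answered correctly
    p_correct = sum(correct) / n

    # How many students chose each option, including the wrong ones
    option_counts = Counter(choices)

    # Correlation between getting this item right (0/1) and total score.
    # Undefined if everyone got the same item score or the same total.
    sd_item, sd_total = pstdev(correct), pstdev(totals)
    if sd_item == 0 or sd_total == 0:
        discrimination = None
    else:
        m_item, m_total = mean(correct), mean(totals)
        cov = sum((c - m_item) * (t - m_total)
                  for c, t in zip(correct, totals)) / n
        discrimination = cov / (sd_item * sd_total)

    return p_correct, option_counts, discrimination


# Made-up pilot data for Item One (key is "b")
choices = ["b", "b", "a", "b", "d", "c"]
totals = [18, 16, 7, 15, 9, 8]
p_correct, option_counts, discrimination = item_stats(choices, totals, "b")
print(p_correct, dict(option_counts), discrimination)
```

A positive correlation suggests that students who did well overall also tended to get this item right. One caveat: in this sketch the total score includes the item itself, which a more careful analysis might remove.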

Thing Two, however, has a different row to hoe:

2. Plants transform light energy into ________.

We know from the standard that students might put sugars, food, or chemical energy into the blank. But are there other possibilities? Sure. I can imagine that some students might put starches, carbohydrates, or glucose. Shouldn't those also receive credit? The thing is, there will be all sorts of right answers that were unanticipated. What if a student writes organic compounds? Would you accept that? How about nutrients or nutrition? This is why open-ended items go through rangefinding. Although there may be a pre-established rubric or bank of answers, the simple fact is that an item in the "wild" never behaves like you think it will. While rangefinding will help determine what is acceptable as answers, it is still only a sample. When items are formally scored, all sorts of little oddities will pop up and individual decisions will be made.
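One way to picture the answer bank is as a lookup that grows as decisions are made. The sketch below is hypothetical: the names normalize, score_response, and record_decision are mine, the starting bank holds only the three answers named in the standard, and the decisions recorded at the end are invented for illustration. Anything not yet in the bank is routed to a person rather than scored automatically.

```python
def normalize(response):
    """Lowercase, collapse extra spaces, and drop a trailing period."""
    return " ".join(response.lower().split()).strip(".")


def score_response(response, accepted, rejected):
    """Return 'credit', 'no credit', or 'needs review' for one response."""
    text = normalize(response)
    if text in accepted:
        return "credit"
    if text in rejected:
        return "no credit"
    # Unanticipated answer: a human scorer has to make the call
    return "needs review"


def record_decision(response, accept, accepted, rejected):
    """Store a rangefinding or scoring decision so it is applied consistently."""
    (accepted if accept else rejected).add(normalize(response))


# Starting bank: only the answers named in the standard
accepted = {"sugars", "food", "chemical energy"}
rejected = set()

print(score_response("  Chemical  Energy. ", accepted, rejected))  # credit
print(score_response("glucose", accepted, rejected))               # needs review

# Hypothetical decisions, invented for this example
record_decision("glucose", True, accepted, rejected)
record_decision("water", False, accepted, rejected)
print(score_response("Glucose", accepted, rejected))               # credit
print(score_response("water", accepted, rejected))                 # no credit
```

Exact matching like this would miss misspellings and longer phrases, which is part of why real scoring still relies on individual human decisions.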

After pilot testing, rangefinding, and scoring, the data from the items are then evaluated. Again, an item can die anytime during this process, and some items meet their maker here. For example, an item could be well-written and straightforward to score (e.g. multiple choice), but few kids are able to answer it correctly. There is nothing wrong with having "difficult" items in the bank, but they need to be carefully considered. An item that represents important content might remain in the bank while something more obscure might get the boot. Various pieces of psychometric data are collected and banked with the item.

Assuming the items described above survive all of that process, they go into the real Item Bank and become live items eligible to be placed on a test and count toward a student's score. (Our Bill has finally become a Law!) They await their turn in the Item Bank, to be plucked only if their characteristics match the needs of the test: content, difficulty, type (multiple choice, open-ended), point value. Some items make it this far and never see the light of day. Others are used a few times and then retired. Data from first use is compared to pilot data. Student samples from open-ended items get a second look.
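As a rough picture of that plucking, here is a small Python sketch. Everything in it is assumed for illustration: the Item fields, the eligible function, the standard code "LS1", and the numbers are all made up. It simply filters a bank for items that match a test's needs on content, type, point value, and a difficulty range (expressed here as the proportion of pilot students who answered correctly).

```python
from dataclasses import dataclass


@dataclass
class Item:
    item_id: str
    standard: str      # content the item targets
    item_type: str     # "multiple choice" or "open-ended"
    points: int
    p_correct: float   # proportion correct in the pilot (difficulty)


def eligible(bank, standard, item_type, points, p_min, p_max):
    """Return the items in the bank that fit one slot on a test."""
    return [
        item for item in bank
        if item.standard == standard
        and item.item_type == item_type
        and item.points == points
        and p_min <= item.p_correct <= p_max
    ]


# A made-up bank with hypothetical statistics
bank = [
    Item("A1", "LS1", "multiple choice", 1, 0.62),
    Item("A2", "LS1", "open-ended", 1, 0.48),
    Item("A3", "LS1", "multiple choice", 1, 0.15),
]

picks = eligible(bank, "LS1", "multiple choice", 1, 0.30, 0.80)
print([item.item_id for item in picks])  # ['A1']
```

A real test build juggles many slots at once, so this shows only the matching idea, not the full selection process.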

So there you have it. A journey from a standard to a real test item. Depending upon the test and development needs, the process can take a few months to a couple of years. It's far more intensive and rigorous than anything you will find in a classroom; however, we need to be careful about comparing the two arenas. Just because the items on a teacher-developed test do not undergo the validity measures those on a standardized test endure does not make them less useful. Keep in mind that classroom assessment is more about reliability in the form of multiple measures and a variety of data about student performance against a particular standard. Teachers have so much more to focus on than rigorous item development for classroom use. I've heard some of the assessment experts at conferences state that they don't feel that this sort of thing is appropriate for the classroom (or PLCs) simply because teachers will never have the sample sizes necessary to get the psychometrics right. This does not give teachers free rein to give crappy tests---it just means we shouldn't get obsessive about individual items.

And just as we should not get all Judgey McJudgerson about classroom assessment, the same should be true for large-scale assessment. It's fair to say that there are only a few items per test and usually one for each standard that is measured on a given assessment---but this doesn't make the test "bad" or the information useless. They're great tools. We may not like the way the results get used, but I can think of plenty of classroom examples where the information from assessments was used poorly, too. In the end, we all have to be a little smarter about what, how, and why we ask in terms of student performance at all levels.


Pierce said...

Having worked the last couple of years as an editor on state assessment tests, I find this pretty interesting. Mostly I banged on the grammar, spelling and the occasional math error, but there were times I had to question whether the item made any sense at all. The usual answer was that Item Development said it did, and that was that.

Anyway, very nice series of posts looking at something that I'm sure most people have even less of a clue about than I do. Now if you could just explain the actual psychometry of these tests, we'll be all set.

The Science Goddess said...

Hi, Pierce,

I just wrote a post on Test Builds, which includes a bit about the psychometrics. This might answer some of your questions. The stats associated with the tests aren't my area of strength---I just know the most basic info about what makes an item "good."