Essay questions for STAAR tests to be graded by computers in Texas

The machines will be trained how to evaluate open-ended questions.

By Michael Marks, February 15, 2024, 3:56 pm

The Scantron – or something like it – has long been a normal part of standardized tests. Feeding answer sheets into a scanner is faster than grading by hand. 

Computers aren’t just evaluating multiple choice questions in Texas, however. Now, most of the essay questions on the state’s required assessments will also be graded by a machine, rather than a person.

Talia Richman, reporter for the Dallas Morning News’ education lab, spoke to the Texas Standard about the change. 

This transcript has been edited lightly for clarity:

Texas Standard: I think most of us are familiar with the little bubbles – make sure the bubble is completely covered with your pencil and all that. How is this automated essay grading system supposed to work?

Talia Richman: Yeah, the Texas Education Agency built what they’re calling automated scoring engines that have been trained based on how humans score essays to replicate that faster, more efficiently. That’s important because the STAAR test this year includes more essays at every grade level. That would take humans a very long time, cost a lot more money.

So they’re thinking these computers that have been trained by humans can do it faster – and, they think, just as well.

Computers trained by humans. Do I smell AI somewhere here?

The agency is quick to say this is not the same as the generative AI that programs things like ChatGPT, but, you know, a pretty narrow tool with those capabilities. Each engine is trained to score one question, and they say it can’t do anything beyond that.

I’m a little confused, though, given that essays are such a subjective thing. 

The agency says that each engine looks at thousands of previous essays, and they are able to pick up, based on the rubric that’s established, what makes a good essay. These are supposed to pull from evidence and text and, you know, synthesize ideas.

They say that they scored all of last spring’s essays that were graded by humans again using the computers and that it was a similar distribution.

But I’ve always thought that, in a way, essays were an opportunity for people who might not do well on standardized tests where you have objective answers. With an essay there’s a whole lot more creativity, or certainly room for creativity, involved. And I think a lot of students approach these essays with that in mind, and it gives them an opportunity to sort of flex those creative muscles. Will humans have any role in this process?

So these engines are also trained to detect anomalies. So if they’re seeing something that seems really creatively done or is an unexpected length, they’ll reroute those essays kind of acknowledging like, “hey, I’m a computer, I don’t think I can grade this right,” towards a human scorer who will look at it.

Millions of kids take the STAAR every year, and about 25% of these written answers are going to get routed towards humans to make sure that there are some people’s eyes on these essays still.

Apparently other states have made a similar change to their standardized tests. How’s that gone?

Yeah, Texas is not totally alone here. What’s interesting is Ohio made this change several years ago kind of quietly in the same way that the agency did here in Texas. But what bubbled up is that some districts noticed an irregular number of zeros that their students had scored on essays. And that kind of prompted those Ohio educators to ask questions.

And something similar is happening here, where a large number of high schoolers scored zeros during the recent STAAR test on those essay questions. You know, the agency is saying that’s not because of the computerized scoring, but definitely district officials want to know more about why so many students scored zeros – way more than have in the past.

Because, you know, these STAAR tests, these scores are very important. The schools are graded by the state largely based on how well their students perform on standardized tests. So the stakes feel high to get this right.

They do indeed. And I just want to make sure I understand something: Are these computers already being used to score essays in Texas?

Yes, they were used for the first time in the latest iteration, the December 2023.

And as you mentioned, there are some who are saying, well, there are too many zeros here. What else have you been hearing from local school administrators and teachers, or what are the concerns?

I think a big thing is transparency. They wish that they’d been more involved in the process, had an opportunity to ask more of these technical questions. I think that they are in the process of demanding answers now.

I think that they wish, you know, given some of the difficult history with technology that STAAR’s had in the past, that they’d been really brought in on the front end of this.
