“Robo-readers aren’t as good as human readers — they’re better,” the headline says. Hmmm. Annie Murphy Paul writes,
Instructors at the New Jersey Institute of Technology have been using a program called E-Rater in this fashion since 2009, and they’ve observed a striking change in student behavior as a result. Andrew Klobucar, associate professor of humanities at NJIT, notes that students almost universally resist going back over material they’ve written. But, Klobucar told Inside Higher Ed reporter Scott Jaschik, his students are willing to revise their essays, even multiple times, when their work is being reviewed by a computer and not by a human teacher. They end up writing nearly three times as many words in the course of revising as students who are not offered the services of E-Rater, and the quality of their writing improves as a result. Crucially, says Klobucar, students who feel that handing in successive drafts to an instructor wielding a red pen is “corrective, even punitive” do not seem to feel rebuked by similar feedback from a computer….
When critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment. But it’s the very non-humanness of a computer that may encourage students to experiment, to explore, to share a messy rough draft without self-consciousness or embarrassment. In return, they get feedback that is individualized, but not personal — not “punitive,” to use the term employed by Andrew Klobucar of NJIT.
There are some serious conceptual confusions and evaded questions here. The most obviously evaded question is this: When students are robo-graded, the quality of their writing improves by what measure?
Les Perelman’s objections are vital here. He has written,
Robo-graders do not score by understanding meaning but almost solely by use of gross measures, especially length and the presence of pretentious language. The fallacy underlying this approach is confusing association with causation. A person makes the observation that many smart college professors wear tweed jackets and then believes that if she wears a tweed jacket, she will be a smart college professor.
Robo-graders rely on the same twisted logic. Papers written under time pressure often have a significant correlation between length and score. Robo-graders are able to match human scores simply by over-valuing length compared to human readers. A much publicized study claimed that machines could match human readers. However, the machines accomplished this feat primarily by simply counting words.
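Perelman’s length point is easy to make concrete. Here is a minimal sketch — a toy illustration of my own, not E-Rater’s actual code — of a “grader” that consults nothing but word count. On timed essays, where longer drafts tend to earn higher human scores anyway, even this crude proxy will track the human readers:

```python
# A toy "grader" (invented for illustration; not E-Rater's code) that
# scores essays purely by word count -- the gross measure Perelman
# describes. The 75-words-per-point conversion is made up.

def length_only_score(essay: str, words_per_point: int = 75, max_score: int = 6) -> int:
    word_count = len(essay.split())
    return min(max_score, max(1, word_count // words_per_point))

# Two drafts of equal merit: the padded one wins on length alone.
terse = "Brevity is the soul of wit."
padded = " ".join(["In contemporary society, it is demonstrably evident that"] * 40)

print(length_only_score(terse))   # 1
print(length_only_score(padded))  # 4
```

Nothing in that function reads the essay; it only measures it.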
And there’s this:
ETS says its computer program tests “organization” in part by looking at the number of “discourse units” – defined as having a thesis idea, a main statement, supporting sentences and so forth. But Perelman said that the reward in this measure of organization is for the number of units, not their quality. He said that under this rubric, discourse units could be flopped in any order and would receive the same score – based on quantity.
Other parts of the formula, he noted, punish creativity. For instance, the computer judges “topical analysis” by favoring “similarity of the essay’s vocabulary to other previously scored essays in the top score category.” “In other words, it is looking for trite, common vocabulary,” Perelman said. “To use an SAT word, this is egregious.” Word complexity is judged, among other things, by average word length…. And the formula also explicitly rewards length of essay.
Perelman went on to show how Lincoln would have received a poor grade on the Gettysburg Address (except perhaps for starting with “four score,” since it was short and to the point).
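To see how the pieces named above fit together, here is a minimal sketch of that feature recipe: sheer length, average word length standing in for “word complexity,” and vocabulary overlap with previously top-scored essays standing in for “topical analysis.” The weights, the scoring function, and the sample vocabulary are all hypothetical; only the features themselves come from the description above. Run it on the opening of the Gettysburg Address and Lincoln loses to padded boilerplate:

```python
# Toy sketch of the feature recipe described in the article: length,
# average word length, and vocabulary overlap with already-top-scored
# essays. The weights and the sample vocabulary are invented for
# illustration; this is not ETS's actual formula.

def feature_score(essay: str, top_scored_vocab: set[str]) -> float:
    words = essay.lower().split()
    if not words:
        return 0.0
    length_feature = len(words)                           # rewards sheer length
    complexity = sum(len(w) for w in words) / len(words)  # avg word length
    overlap = len(set(words) & top_scored_vocab) / len(set(words))
    # Invented weights: every term rewards more, longer, or more
    # "approved" words. None of them reads for meaning.
    return 0.01 * length_feature + 1.0 * complexity + 3.0 * overlap

# Vocabulary hypothetically harvested from essays already scored "6":
top_vocab = {"furthermore", "societal", "demonstrate", "significant",
             "individuals", "consequently", "utilize", "fundamental"}

gettysburg = ("Four score and seven years ago our fathers brought forth "
              "on this continent a new nation conceived in liberty")
padded = ("Furthermore it is significant that individuals utilize "
          "fundamental societal structures to demonstrate consequently "
          "that significant individuals furthermore utilize societal means")

print(feature_score(gettysburg, top_vocab))  # lower: short, plain, "wrong" vocabulary
print(feature_score(padded, top_vocab))      # higher: long words, approved vocabulary
```

Short words, a short text, and an unfashionable vocabulary sink Lincoln, exactly as Perelman predicts.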
Notice, not incidentally, that Perelman’s actual arguments belie Paul’s statement that when “critics like Les Perelman of MIT claim that robo-graders can’t be as good as human graders, it’s because robo-graders lack human insight, human nuance, human judgment.” It’s perfectly clear even from these excerpts that Perelman’s point is not that the robo-graders are non-human, but that they reward bad writing and punish good. And since the software only follows the algorithms that have been programmed into it, the problem actually begins with the programmers, who may not have any real understanding of what makes writing effective, or — and this seems to me more likely — can’t find algorithms that identify it.
I suspect, then, that with this automated grading we’re moving perilously close to a model that redefines good writing as “writing that our algorithms can recognize.” So why would any teachers ever adopt such software? That one has a simple answer: because the students are happier when they interact with the machines about their writing than when they have to respond to human teachers. If you read Paul’s whole essay, you’ll see that that’s all the system has to commend it: it pacifies the children, while the teachers just stand by and watch. The software really is teaching the children, and what it’s teaching them is to do what the software tells them to do. The achievement here is not improved writing, but improved obedience to algorithmic machines.
Welcome to the future of education.