
Max. Classroom Capacity: On Robo-Grading

Loren J. Naidoo, California State University, Northridge

Dear readers,

Happy new year! As this column is being published, many of you who are academics are, I hope, basking in the warm glow of having submitted all your grades for the fall semester. As I write it, however, I am buried in a mountain of grading, a mountain born of my own stubborn insistence on assigning written assignments and short-answer exam questions to my 150+ students. Let’s be honest: Grading papers and exams is rarely riveting work (neither is writing nor taking them, to be fair to students). Despite my best attempts to focus on grading, I often find my mind wandering, and of late it has wandered in one specific direction: Robots!

Almost exactly 100 years ago, the Czech play R.U.R. by Karel Capek premiered on January 25, 1921. The play is famous for coining the term “robot.” I have always liked robots, starting with R2D2. I read books about robots, watch movies about robots, and even own an actual robot or two. It’s still a bit shocking to realize that we are now firmly living in the era of functional robots and artificial intelligence (AI). This was brought home for me, quite literally, when we received a robotic vacuum as a Christmas gift last year. We use it enough to make me wonder how many human cleaners are unemployed as a result of this technology. Maybe some of you have already ridden in an autonomous vehicle—it’s not hard to imagine what impact that technology will have on taxi/Uber/Lyft/truck drivers. The appeal of these robot helpers has only increased with the coronavirus pandemic making salient the risk that other humans can pose. In academia, I know many colleagues who are struggling with increased class sizes and workloads resulting from funding shortfalls due to the pandemic.

The challenge of trying to grade hundreds of short-answer questions accurately and consistently, even using a rubric, made me wonder whether a robot could do this better—specifically, the kind of AI technology that many companies now use to efficiently conduct sentiment analysis of, for example, Amazon product reviews. I mentioned this fantasy to my department chair, who, to my surprise, enthusiastically told me about a company called Monkeylearn1 (www.monkeylearn.com), and quite quickly, my dream became a reality! Well, kind of—keep reading and you’ll see how things worked out. I’m going to take you through the process step by step, share the results, and propose some tentative conclusions.

But before I get into the gory details, it’s important to say that I’m not the only one who has dreamed this dream of having a trusty robot sidekick to grade more quickly and reliably than any human could. Large education companies like ETS and Pearson have invested in developing automated assessment systems. Multiple states, including Utah and Ohio, have used robo-grading to evaluate writing ability in standardized test essays. These moves have prompted considerable backlash, with critics providing vivid examples of utterly nonsensical essays that received high grades from automated grading systems. The tone of some of these critiques (e.g., “The use of algorithmic grading is an attack on the value of academic labor of both teachers and students”; Warner, 2018) reminded me of the flesh fair scene in the movie AI, in which an anti-robot showman decries David, the boy-like robot, as “the latest iteration to the series of insults to human dignity.” That is not to say that the critiques lack substance. Standardized tests are high-stakes affairs, and it seems likely that many advocates for robo-grading are more concerned with profit than with quality assessment or educational outcomes. Plus, assessing general writing ability may be too tall an order for AI at this point. My goal was narrower: I wanted to understand how useful robo-grading might be for scoring short-answer exam questions,2 which brings us back to Monkeylearn.

Monkeylearn is a service that allows subscribers to create custom machine-learning models to (a) classify text into categories that the user defines and (b) extract key pieces of information from data that the user uploads to their website. No knowledge of coding is required—everything can be done through their website interface. I was curious to see whether their service could be used to grade students’ answers to several open-ended questions that were part of an exam administered electronically on Canvas, the web-based learning management system, to 124 undergraduate students. These short-answer questions had suggested word limits of 3, 30, or 100 words, depending on the breadth and complexity of the question.
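(For those who would rather script the process, Monkeylearn also provides an API and a Python client. The following is only a rough sketch of how such a call might look; the API key, model ID, and exact client method names are assumptions that should be verified against Monkeylearn’s own documentation.)

# Hypothetical sketch of calling a custom Monkeylearn topic classifier via
# their Python client (pip install monkeylearn). Key and model ID are placeholders.
from monkeylearn import MonkeyLearn

ml = MonkeyLearn("<your-api-key>")
response = ml.classifiers.classify(
    model_id="cl_XXXXXXXX",  # the custom topic classifier created on the website
    data=["A leader who focuses on listening to their employees' concerns"],
)

# Each element of the response lists the predicted tag(s) with a confidence score
for result in response.body:
    for c in result["classifications"]:
        print(c["tag_name"], c["confidence"])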

Let me take you through the steps involved using one exam question that asked students to identify which leadership theory best describes a leader who focuses on listening to their employees’ concerns. This question had a suggested word limit of three, though this wasn’t enforced, and some answers were much longer. I chose to use a “topic classifier model” to categorize answers according to criteria that I created based on my rubric. I downloaded the exam answers from Canvas using the “student analysis” option, uploaded the resulting .CSV file to Monkeylearn, and chose the column containing the answers to this question. Next, the interface asked me to create “tags” for each category that the AI would use. I created these based on my rubric; in this case, they were “individualized consideration,” “relationship oriented,” “other leadership,” and “non-leadership.” In earlier attempts, I created tags like “fully correct,” “partly correct,” and “incorrect” but realized that these categories were too diffuse; the models worked better when the tags were associated with specific content. For example, “individualized consideration” and “relationship oriented” were both considered “fully correct” in my rubric but look quite different from each other in the answer text; therefore, it worked better to code them as separate tags.
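(If you wanted to do the data-preparation step in code rather than through the web interface, it might look something like the sketch below. The file name and column label are hypothetical stand-ins for whatever your own Canvas “student analysis” export contains.)

# Hypothetical sketch: load the Canvas "student analysis" export and isolate
# the column of answers for one short-answer question before tagging.
import pandas as pd

exam = pd.read_csv("student_analysis_export.csv")  # placeholder file name

# Canvas labels each answer column with the question ID and stem (placeholder here)
answer_col = "1234567: Which leadership theory best describes..."
answers = exam[answer_col].fillna("").astype(str)

# The rubric-based tags used for this question
tags = [
    "individualized consideration",
    "relationship oriented",
    "other leadership",
    "non-leadership",
]

answers.to_frame(name="answer").to_csv("q1_answers.csv", index=False)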

The next step was to train the model. Monkeylearn pulled one answer from the dataset, and I clicked on the tag (or tags) that best categorized the student’s answer. Then another answer was shown. These answers are not chosen randomly or sequentially; they are chosen strategically to help the model learn as quickly as possible. Tagging is a critical step because the model learns to associate patterns in the data with the tags you assign. How does it learn? The brief explanation I received from Monkeylearn was that the answer text is transformed into vectors of one- to four-word features, and the model examines which words occur before and after those features and the relationships among them. Like other examples of AI, it’s a black box (at least to me!). But one thing it does NOT seem to be doing is a simple word search. The model handled variations in wording (e.g., “individual consideration,” “transformational: individual consideration,” “consideration [relationship related]”), spelling mistakes, long sentences, and so forth. After a few trials, the model started to preselect tags based on what it had already learned, allowing me to observe what it was learning and to correct it where necessary. The tags I used for this question were meant to be mutually exclusive categories, with the word limit restricting answers to one category, but the model could assign more than one category: The answer “IC/relationship orientation” would likely be tagged as both “individualized consideration” and “relationship oriented.”
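(To make that description more concrete, here is a rough, generic stand-in for this kind of model built with scikit-learn: one- to four-word features feeding a simple classifier. It illustrates the general technique, not Monkeylearn’s actual implementation, and the example answers and tags are invented.)

# Illustrative stand-in for an n-gram text classifier, not Monkeylearn's model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A handful of hand-tagged answers (invented examples)
answers = [
    "individualized consideration",
    "individual consideration",
    "transformational: individual consideration",
    "relationship oriented",
    "consideration (relationship related)",
    "path-goal theory",
    "I don't know",
]
tags = [
    "individualized consideration",
    "individualized consideration",
    "individualized consideration",
    "relationship oriented",
    "relationship oriented",
    "other leadership",
    "non-leadership",
]

# One- to four-word features, mirroring the description above
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(answers, tags)

# predict_proba gives a rough analogue of a "confidence" score
new_answer = ["individual consideration (transformational)"]
print(model.predict(new_answer))
print(model.predict_proba(new_answer).max())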

After about 20 trials—a remarkably small amount of data—the model was considered trained enough to use. It took all of 5 minutes! At this point, I could have continued training the model, but I chose instead to test it by running it on all of the exam data. Ideally, I would have had separate training and testing data, but I didn’t. Running the model returned my original .CSV file with two new columns: a classification (how each student’s answer was tagged) and a confidence score for each classification, ranging from zero to one, with higher scores corresponding to higher confidence. It’s not clear to me how the confidence score is computed, but I’ll say more on that later. I used MS Excel formulas to convert categories into numerical scores (e.g., the categories “individualized consideration” and “relationship orientation” received full marks, whereas “other leadership” and “non-leadership” received different levels of partial marks). Boom—question graded.
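(The Excel step can equally be done in a few lines of code. In the sketch below, the point values are illustrative and the file and column names in the model’s output are assumptions.)

# Hypothetical sketch: map the model's tags to rubric points, mirroring the
# Excel formulas described above. Point values are illustrative only.
import pandas as pd

graded = pd.read_csv("monkeylearn_output.csv")  # placeholder for the returned .CSV

points = {
    "individualized consideration": 2.0,  # full marks
    "relationship oriented": 2.0,         # full marks
    "other leadership": 1.0,              # partial marks (illustrative)
    "non-leadership": 0.5,                # partial marks (illustrative)
}
graded["score"] = graded["Classification"].map(points).fillna(0)
print(graded[["Classification", "Confidence", "score"]].head())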

At this point, I reviewed the categories that the model assigned and compared them to my own grading using the same four categories. My grades disagreed with the model’s 14 times out of 124. The average confidence score for those answers was .438, compared to an overall average of .569. Five of the discrepancies were borderline answers that, upon reflection, I realized the model had coded correctly. The results for two other very-short-answer questions with three-word limits were similar: 12 and 10 disagreements and average confidence scores of .81 and .42, respectively. In summary, robo-grading of very short answers using Monkeylearn worked very well—it learned extremely quickly and was very “accurate” with my own grades as the basis for comparison. The reliability of the model seemed at least as high as the reliability of this human grader—we both made some mistakes, but I took a lot more time. Especially with very large class sizes, you might have Monkeylearn grade the exam questions, review by hand the subset of answers with low confidence ratings, and end up with grades just as reliable as hand grading in much, much less time.
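(For anyone wanting to run the same comparison, the agreement check amounts to a few lines once your hand grades and the model’s tags sit in the same file; the file and column names below are assumptions.)

# Hypothetical sketch: count human-model disagreements and compare confidence
# scores for disagreements versus the overall average.
import pandas as pd

df = pd.read_csv("graded_with_model.csv")  # placeholder: hand tags + model output
disagree = df["my_tag"] != df["model_tag"]

print("Disagreements:", int(disagree.sum()), "of", len(df))
print("Overall mean confidence:        ", round(df["Confidence"].mean(), 3))
print("Mean confidence (disagreements):", round(df.loc[disagree, "Confidence"].mean(), 3))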

I used the same process to robo-grade answers to a multipart case question, each part with a recommended 30-word limit. The case described an organization with an employee motivation problem and asked students to (a) diagnose the problem, (b) propose an intervention as a solution to that problem, and (c) explain why their intervention should work based on relevant theory. Most of the diagnoses had to do with various job characteristics, and most of the interventions and explanations were based on job enrichment, goal setting, or incentives. With the increased complexity of these questions, the tagging process became more time consuming. Whereas almost every answer to the very-short-answer questions fit only one category, here a bit less than half of the answers fit two or three categories. Still, the model itself took about the same number of trials to train. As before, I opted to use a minimally trained model. The results were much worse. Of the 124 answers to Part a, I categorized 48 differently than the model did. This count excludes cases where the primary categorization differed but my category aligned with a second or third category identified by the model. Four of the 48 discrepancies were due to the model failing to provide any category for the answer. More problematically, the average confidence rating for the disagreements (.654) was little different from the overall average (.687). For Part b, the results were just as bad: There were 47 disagreements, of which 32 resulted from the model failing to categorize the answer at all. Again, the average confidence rating of the disagreements (.622) differed little from the average (.697). There were 30 disagreements (average confidence = .644) in Part c, again with little difference from the overall average (.695). In summary, robo-grading of these moderately short answers was still very fast but produced discrepancies with my hand grading about one-third of the time, though it’s important to keep in mind that there is much more room for interpretation in evaluating these answers to begin with. I also used minimally trained models; perhaps investing a bit more time in training them would pare the discrepancies down to a more manageable level.
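(Because these case answers could legitimately carry two or three tags, they call for a multilabel setup rather than a single-category classifier. The sketch below shows the general idea with scikit-learn; it is a stand-in for illustration, not Monkeylearn’s implementation, and the answers and tags are invented.)

# Illustrative multi-tag classification: each answer can receive several rubric tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

answers = [
    "Enrich the jobs and set specific goals for each worker",  # invented answers
    "Pay a bonus for hitting production targets",
    "Redesign the job to add variety and autonomy",
]
tag_sets = [
    {"job enrichment", "goal setting"},
    {"incentives"},
    {"job enrichment"},
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tag_sets)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 4)),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(answers, y)

# With a toy training set this small, predictions may be empty or unstable;
# the point is the multilabel structure, not the accuracy.
pred = model.predict(["Set goals and reward workers who meet them"])
print(mlb.inverse_transform(pred))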

The final question was the most complex. Students were asked to write a script in which they deliver performance feedback to an employee. The suggested limit was 100 words—not quite an essay, but a long short answer. I had a detailed rubric for this question, and working out how to convert it into tags took me about an hour. I developed 11 separate criteria covering feedback content (e.g., friendliness in customer service) and delivery (e.g., feedback that was not “owned”). Training the model took 48 trials and about 90 minutes. The median number of categories assigned per answer was 4, with a range of 1 to 5. Of the 124 answers, 27 produced results identical to my grading, whereas the remaining 97 produced different results. The average confidence rating across all categories was .73. But, because the grade is a product of all of the categories assigned to an answer, it makes more sense to look at the product of the confidence ratings for each answer. The overall average of this product was .32 (.31 for the answers that were graded differently).

I carefully reviewed the discrepancies and corrected the errors, some of which were mine. Because there were so many errors, I decided to recreate the model from scratch, training it using the set of categories derived from the first attempt and “corrected” by hand. It took 52 trials and about 40 minutes to train the new model. This time, there were 62 cases where my answers were identical to the model’s—exactly half. The average confidence rating was .76, and the average product of the confidence ratings per answer was .36. So the model improved, but at considerable cost in terms of time. Moreover, I compared the grades I produced to those produced by the model. The grades were a function of the combination of the different categories for each answer—for some categories, points were added; for others, subtracted. The correlation between my grades and the model’s grades was r = .81. In summary, the effort required to develop a model that can reliably categorize across this many categories is considerable. The results don’t look great but, again, must be understood in the context of the baseline reliability of a human grader. In the process of trying to verify where the model went wrong, I quite often concluded that my original grade was wrong and the model’s was right. If I had had an independent grader to compare against, I’m not certain my grades would have aligned more closely with that person’s than with the robo-grades.
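(The final comparison is just a correlation between two grade vectors, and the per-answer confidence is a product of tag confidences. The sketch below shows both computations with placeholder numbers, not my actual data.)

# Hypothetical sketch: correlate human grades with robo-grades (r = .81 in my data)
# and compute a per-answer confidence as the product of its tags' confidences.
# All numbers below are placeholders for illustration only.
import numpy as np

my_grades = np.array([8.0, 6.5, 9.0, 7.0, 5.5, 8.5])
robo_grades = np.array([7.5, 6.0, 9.0, 6.0, 6.5, 8.0])

r = np.corrcoef(my_grades, robo_grades)[0, 1]
print(f"Pearson r between human and robo grades: {r:.2f}")

tag_confidences = [[0.9, 0.8, 0.7], [0.95, 0.6]]  # illustrative per-tag confidences
print([float(np.prod(c)) for c in tag_confidences])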

Before I get to my conclusions, it’s important to note the many limitations of my methodology. First, I was the only human rater, and clearly a biased one, though not invested in the success or failure of robo-grading. Second, I used data from 124 students in one undergraduate class in management at one university. I used short-answer questions that varied both in terms of the complexity of the rubric and the length of responses, and these two factors covaried, making it difficult to understand the distinct impact of each. Third, I used one platform, Monkeylearn. I didn’t try other options, and I have close to zero understanding of how the coding works.

Here’s what I can tentatively conclude from this exercise:

  1. Robo-grading works better with narrow questions for which there is a small universe of answers, particularly when the answers are short. However, these questions are also the easiest and quickest for humans to grade, so the value comes at scale. It’s difficult and tedious for a human to reliably grade several hundred questions, even easy ones. Robo-grading can help grade very-short-answer questions in very large classes, particularly when a human grader can check low-confidence cases by hand (see the sketch after this list). It’s a nice alternative to automatically graded multiple choice questions (the original robo-grading!), especially if your institution is paying for your Monkeylearn subscription.
  2. As the complexity and length of the answers increase, the amount of thought and time required to develop the rubric, train the model, and check for problems also increases, reducing the efficiency of robo-grading. For the number of students I had, it probably wasn’t worth it, especially for the longer format items. With larger numbers of students (e.g., if the same exam were administered every semester), investing more time in training the model with more data might be worthwhile.
  3. The validity of the models’ output is mostly a function of the design of the rubric and the training of the models. Even though robo-grading did not consistently produce results that corresponded closely to my own grading, the most valuable aspect of trying it was that it forced me to think very carefully about (a) my scoring rubric, in order to generate the tags, and (b) during model training, which specific ideas or phrases are evidence that an answer falls into a category and how categories of answers get translated into numerical grades. Even if I never use robo-grading again, I think I will be a more precise and reliable grader, less prone to relying on vague, holistic impressions of answer quality or of whether criteria were met, which are likely to be unreliable. And I thought I was already relatively detail oriented in my grading procedures.
  4. Robo-grading is smarter than I thought. For example, one model coded the answer “I would most likely use goals, such as if the factory produces x amount of items then employees will receive a certain percentage of profits” as “incentives” rather than “goals.” I had to read that answer twice to make the same determination.
  5. Robo-grading is as dumb as I expected. For example, one model coded the single-letter answer “r” (I assume the student ran out of time) as “job enrichment.”
  6. I’m dumber than I thought. Being forced to revisit my prior grading made me realize how many mistakes I make when grading. Or, even if they weren’t mistakes per se, how subjective and variable judgments can be.
  7. Robo-grading stupidity recapitulates human stupidity. Even one inadvertent mistake during model training can be dutifully learned and applied. I accidentally clicked the wrong button and tagged “Theory Y” as belonging to “charismatic leadership” without realizing it, and the model was quite “accurate” in tagging subsequent Theory Y answers as charismatic.
  8. A right answer combined with less clear content (or even a bunch of nonsense) may be tagged as a right answer. But in fairness, that’s a tricky situation for human graders, too.
  9. If robo-grading was readily available, it might make me more likely to design exams that are more amenable to robo-grading. That might be good because robo-grading may be more reliable and feasible to use with large class sizes. That might be bad because it could limit how much deep-level understanding or complex thought we try to assess. Still, I would argue that it’s probably better than the ubiquitous multiple choice question format.
  10. Robo-grading can be used as a basis for providing students with feedback on written answers in the form of the categories to which their answers conformed. This may be more feedback than most instructors provide to students on exams, particularly in large classes, but it also may be less depending on how much time the instructor devotes to providing feedback.
  11. I didn’t actually use robo-grading to assign grades. The actual use of robo-grading in higher education raises all kinds of complex ethical and practical issues. Do students have a right to know that robo-grading is being used to assist instructors? Will students object to being assessed in this way? How quickly will students develop ways to cheat the system, and how effectively can AI models be trained to detect these attempts? Does adopting robo-grading free up instructor time that can be devoted to other teaching activities that may have a bigger impact on student learning? Does robo-grading exacerbate existing trends, such as those toward larger class sizes and fully online formats, that degrade educational quality?
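As promised in point 1, here is a sketch of the review-the-low-confidence-cases workflow; the threshold and the file and column names are assumptions, not values taken from my exam data.

# Hypothetical sketch: accept high-confidence robo-grades, queue low-confidence
# answers for hand grading. Threshold and column names are placeholders.
import pandas as pd

df = pd.read_csv("monkeylearn_output.csv")  # placeholder output file
THRESHOLD = 0.5

auto_graded = df[df["Confidence"] >= THRESHOLD]
hand_review = df[df["Confidence"] < THRESHOLD]

print(f"Auto-graded: {len(auto_graded)}; flagged for hand review: {len(hand_review)}")
hand_review.to_csv("answers_to_review_by_hand.csv", index=False)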

Interested in this topic or in hearing more about my research on robo-grading? As always, your comments, questions, and feedback are welcome: Loren.Naidoo@CSUN.edu. Stay safe and healthy!

 

Notes

1 A quick note about Monkeylearn: They are a private company; they did not ask me to write about them, and I have no financial relationship with them, but I am taking advantage of their free 6-month “academic” subscription.

2 There’s quite a bit of empirical research on this and related questions in computer science and education, among other disciplines, which I don’t have the space (or expertise, in some cases) to review here.

 

Reference

Warner, J. (2018). More states adopt robo-grading. That's bananas. Just Visiting, Inside Higher Ed. https://www.insidehighered.com/blogs/john-warner?page=18
