I’m thinking about grading again, as I have to do every quarter. My grading scale is usually not set in stone – I say that exams will have weights that float in some range – and I usually try to look at the data and tweak the weights so that things are fair.

None of this has ever sat right with me. Every professor and lecturer sets his or her own grading scale, and we make no effort to harmonize them. Are we hurting the students? What if Professor X and Lecturer Y use scale s, but I’ve been using scale t, and this makes my best students fail while propping up my weaker students? (That seems like an absurd edge case, but how can we be sure we aren’t doing something like this?)

I decided to investigate.

### The experiment

Here’s what I did. I happen to have per-question data for each of my calculus classes going back to 2013. That is, I know how every student did on every midterm and final question. To study the effect of grading scale on grade, I did the following. (Note: I’m not a statistician, so I have no idea if there is a smarter way of doing this, or if there’s a theoretical explanation for some of what I observed. Please let me know if there is!)

1. I chose a random weight for the final exam somewhere between 0.3 and 0.6, and I split the rest evenly between the midterms.
2. Given these weights, I computed the class scores and looked at the rank order of the students.
3. I repeated this process a bunch of times.
4. For each student, I then have a list of ranks (one for each random grading run). I computed the standard deviation of this list for each student.
5. I divided these standard deviations by the number of students.
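My R code isn’t shown here, but the procedure above is easy to sketch. Here is a minimal Python/NumPy version of steps 1–5 (the function and array names are illustrative, and I’m assuming scores are stored as one row per student):

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_deviation(midterms, final, n_runs=1000, rng=rng):
    """Monte Carlo estimate of each student's rank deviation under
    randomly weighted grading scales.

    midterms: (n_students, n_midterms) array of midterm scores
    final:    (n_students,) array of final exam scores
    Returns an (n_students,) array: the standard deviation of each
    student's rank across runs, divided by the class size.
    """
    n_students, n_midterms = midterms.shape
    ranks = np.empty((n_runs, n_students))
    for i in range(n_runs):
        # Step 1: random final weight in [0.3, 0.6]; split the rest
        # evenly among the midterms.
        w_final = rng.uniform(0.3, 0.6)
        w_mid = (1.0 - w_final) / n_midterms
        # Step 2: weighted class score, then rank order (0 = best).
        scores = w_mid * midterms.sum(axis=1) + w_final * final
        ranks[i] = np.argsort(np.argsort(-scores))
    # Steps 4-5: per-student std of ranks, normalized by class size.
    return ranks.std(axis=0) / n_students
```

Running this on a class gives one number per student: roughly, the percentage of the class by which that student’s rank wobbles as the scale varies.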

This gives a score for each student under a random grading scale that measures the percentage deviation in his or her rank one expects overall.

One could wonder what happens if this is done to random data (i.e., randomly assigned question scores). I computed these as well. One might also wonder whether the correlations among the students’ question scores are responsible for this behavior. So I also computed the result for “random data with the given covariance matrix” (i.e., take random data and then use the Cholesky decomposition of the desired covariance matrix to form linear combinations of the random data with the right covariance).
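The Cholesky trick deserves a quick sketch: if the desired covariance is Sigma = L L^T and z has identity covariance, then z L^T has covariance Sigma. A minimal Python version (again, names are illustrative; I’m assuming uniform draws standardized to mean 0 and variance 1):

```python
import numpy as np

rng = np.random.default_rng(1)

def correlated_surrogate(data, rng=rng):
    """Generate random surrogate data with approximately the same
    covariance structure as `data` (n_students x n_questions).

    If Sigma = L @ L.T and z has identity covariance, then z @ L.T
    has covariance L @ I @ L.T = Sigma.
    """
    n, q = data.shape
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)
    # Small ridge keeps the factorization stable if sigma is
    # close to singular.
    L = np.linalg.cholesky(sigma + 1e-9 * np.eye(q))
    # Uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1.
    z = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=(n, q))
    return mu + z @ L.T
```

Feeding the surrogate data through the same rank-deviation simulation isolates how much of the observed behavior is explained by the covariance structure alone.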

### Some results

Here are a couple of images that come from numerical simulations of grading scales run against actual student data, as described above. Each was generated by 10000 runs. (Like I said, this is a Monte Carlo approach to something that might just have a simple analytic solution. Even worse: my R code takes a painful amount of time to run. Does anyone want to weigh in? Send me an email! I’ll update the post if someone explains how this actually works.)

First, a plot showing the expected deviation percentage for real data (from Autumn of 2014) vs random data, ordered from largest deviation to smallest. The red curve is from the true data; the blue is from random data.

Next, a plot showing the same things but comparing real data against minimally correlated random data.

We can also visualize a scatterplot of mean rank (over all runs) versus normalized standard deviation of ranks (over all runs). First, true versus random data:

Finally, true versus minimally correlated data:

### Relevant observations

Several things jump out at us.

1. Any reasonable grading scale (weighting the final somewhere between 0.3 and 0.6) has a reasonably high chance of placing a student within about 5% of his or her “true” rank in the class.
2. The fluctuations are greatest near the middle of the curve.
3. This has good and bad consequences: if you are doing very well or very poorly, the choice of scale has almost no effect, but if you are in the middle, a small change in the weights can produce a larger change in rank.
4. Linear combinations of uniformly sampled data with the given covariance matrix model the true student data almost perfectly. What is going on?

Food for thought!