UK's exam-grade U-turn: A failure of Computational Thinking destroys public trust again

Another day, another fiasco that’s pinned on the failure of algorithms to make reasonable decisions about our lives.

Summary for those not in the UK: the exam regulator Ofqual, alongside the Department for Education (DfE), cancelled public exams in March because of the pandemic, and botched the use of the available data on students, teachers, schools etc. to assign grades. “The algorithm” was widely blamed. Then, after one set of grades was released, there was a U-turn because of an outcry that students were being unfairly downgraded, causing ongoing chaos for university entrance.


Procrustes was the mythical Greek innkeeper who welcomed visitors but then forced them to fit his single size of bed. Too long and your legs were chopped; too short and you were stretched to fit, rather like Ofqual’s algorithm for stretching or squeezing this year’s students to fit last year’s performance curve. Oh, and if you were one of the small number studying Ancient Greek, and perhaps learning about Procrustes, you would have been too elite to have the Procrustes-Ofqual algorithm applied to your results.

Computation is powerful for decision-making. But when it’s misused, like all powerful toolsets, the results can be highly damaging—both to the decision in question, and to confidence in the toolset itself.

I warned in my blogpost COVID-19: Mathematical Modelling on Trial this April that ineptitude at computational thinking (CT) is an urgent societal problem whose effects we’re increasingly feeling.

Even I’m amazed that such a severe case of CT failure should emerge from the very bodies who’ve failed to introduce CT education in schools, and insisted we stick to old-school maths instead. Rote-learning long-division procedures or indeed 1930s statistics isn’t going to get you using modern multiparadigm data science to reach good decisions with computation (including if you’re a Minister, civil servant or analyst in Ofqual).

Talk about being hoist by your own petard!

I was curious what had gone wrong, so took a quick look. I came in thinking that this was a hard problem to “get right”, or to be seen to get right: people are always going to be unhappy if they didn’t score well, or if everyone scores too well, so a public outcry was more or less inevitable. In particular, I noted the tension between individuals’ grades and grades en masse not losing meaning through inflation. The thing is, when you have a definite and long-standing way to measure something complex, however unrepresentative it is of what you really want to know (here, exams as a measure of ability) and therefore however off track or unfair it may actually be, it offers some level of transparency and therefore perceived fairness through experience of the system. That transparency, and with it the ability to challenge obvious errors in its application, gives a certain type of confidence.

If you suddenly remove that basis of transparency and control, the exams themselves, the individual-versus-en-masse tension becomes harder to reconcile. Passing the buck to “algorithms” doesn’t solve the problem unless you can (a) make them understandable, either because they’re simple or because they’re made easy to interrogate; (b) justify why they are “fair”, particularly for the individual; and (c) stress-test them in all sorts of cases to check they match your definitions of “fair”.

Of course, a lot of understanding is required, of maths, of society and of what’s possible: judgements of all sorts about which algorithms, or more generally which computation, to use and how, and with what checking. If there’s a screw-up, as there has been here, the fault lies not with the algorithm per se or the idea of computation but with those who assigned its use and made the decisions about purpose and application. Putting it another way, it’s not whether you are using algorithms on machines, human judgement or some combination of the two; it’s whether you’ve applied CT intelligently and systematically to the problem with the right machinery to optimise your decision.


As I said, for all these reasons, knowing how hard it can be to use CT well on messy, human problems, I began with some sympathy for Ofqual in choosing “the algorithm”. But this evaporated and turned to anger as I started to find out what had actually been done: how multidimensional the CT failure was, and how obvious the mistakes were, mistakes I’d be upset if one of our data science teams didn’t spot and suggest solutions to in a morning, let alone 5 months. Not one isolated mistake, but misjudgements, crappy machinery, the wrong idea, poor verification, muddy presentation and ineffective iteration on the original problem to be solved.

I’m angry not just as a UK taxpayer, parent and employer, but because the brand and power of “algorithms” and “computation” have been so misused, putting the benefit they can offer in jeopardy for the future as mistrust in their power accrues.

The best way to explain why I say this is a multidimensional CT failure is to briefly walk through applying the 4-step CT process, experience of which should be the bedrock of our core computational education at school, but isn’t.

Bear in mind I’m going to do this very casually and have only looked at the problem for an hour or so, but I wanted to give a flavour of what ought to have happened in much more detail over 5 months. (Many others have far more comprehensive analyses from different viewpoints).

APPLYING CT

The 4-STEP CT PROCESS


First, Define. We want “fair” grades, but what does that mean in this case? En masse, that “standards” are preserved and we don’t see significant grade inflation (and so are “fair” to previous cohorts). Individually, that students get what they would have got had they taken the exams. But there are other measures of fairness too: that the performance of previous years in a particular school is not held against you, that elite private schools don’t do better out of the changed system, and that there’s some clarity about why you got what you got as a basis for any challenge. Plus an acceptance that if individual fairness and standards collide, you err on the side of individual fairness because this is a new system; and so on. Did they do this right? If they did in any way to start with, they certainly didn’t verify against these criteria sufficiently (part of step 4) and iterate. In simple terms, either in step 1 or in step 4, those in control failed to ask the right questions or to listen to those who did.
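
To make that concrete, here is a minimal sketch, in Python, of what writing such criteria down as explicit, testable checks might look like. The function names, fields and thresholds are mine, purely illustrative, not Ofqual’s; the point is simply that “fair” can be made checkable rather than left as prose.

```python
# Hypothetical sketch only: none of these checks or thresholds are Ofqual's.
# The point is that "fair" in the Define step can be written as testable criteria.

def grade_inflation_ok(awarded, previous_year, tolerance=0.03):
    """En-masse fairness: the mean awarded grade (on a numeric scale)
    shifts by no more than `tolerance` relative to last year."""
    mean = lambda xs: sum(xs) / len(xs)
    return abs(mean(awarded) - mean(previous_year)) <= tolerance * mean(previous_year)

def individual_plausible(awarded_grade, teacher_assessed_grade, max_drop=1):
    """Individual fairness: no student drops more than `max_drop` grade points
    below their teacher's assessment without a stated reason."""
    return teacher_assessed_grade - awarded_grade <= max_drop

def explainable(student_record):
    """Transparency: every awarded grade carries the inputs used to produce it,
    so any challenge can be mounted against something concrete."""
    required = {"teacher_rank", "teacher_grade", "school_history_used", "awarded_grade"}
    return required.issubset(student_record.keys())
```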

Next, Abstract. Start with what data we could possibly have: mock exam results, teacher assessments, the history of that school, of that subject in that school, each teacher’s past assessments against what actually happened, the derivative of school achievement and so on. Then, crucially, take as much of this data as possible to nuance your computation. This approach is completely at odds with what people are taught in school maths, as I discuss extensively in my book The Math(s) Fix. Because calculating was expensive before computers, people are taught to simplify the problem so they can calculate it. That approach doesn’t work for most nuanced, complex areas where we apply computation today, because the nuances and complexities, handled with sophisticated algorithms, are what enable optimised decision-making. Strip the complexity out and you’ll get simplistic, wrong answers most of the time (often not just wrong in the decimal place, but completely wrong) on anything that isn’t cut and dried (like the physics of planets orbiting the sun).
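
As an illustration only, a richer abstraction might carry every available signal per student into the modelling, rather than collapsing them away at the outset. The field names below are my own assumptions, not the actual dataset:

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative only (field names are mine, not Ofqual's): an abstraction that
# keeps the available signals per student rather than collapsing them away.
@dataclass
class StudentRecord:
    student_id: str
    subject: str
    teacher_assessed_grade: str                 # grade the teacher predicted
    teacher_rank_in_cohort: int                 # rank within the school's cohort
    mock_exam_grade: Optional[str] = None       # mock result, where one exists
    coursework_marks: list = field(default_factory=list)
    school_grade_history: dict = field(default_factory=dict)   # year -> grade distribution
    teacher_prediction_bias: Optional[float] = None  # how this teacher's past predictions tracked real results
```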

This fundamental error in approach appears to have occurred here. I understand they considered many models, then picked just one (not a combination), and one with many problems: get teachers to rank students, throw out their assessments of the grades (except in some edge cases) and map the ranking onto the previous cohorts’ splay of grades. This abstraction is too simplistic to cater, for example, for the fact that previous cohorts included some students who didn’t work at all and got U grades (failures). Mapped onto this year’s cohort, it can mean someone has to get a U simply because a U was in the splay of last year’s grades. Obviously that can’t be “fair”, or explicable, at an individual level. Could we better mesh teachers knowing their students with the grade inflation their overall generosity would cause, say by giving each school a quota of grades, or grades per subject, and letting them propose how to share it out? If so, what would go right and wrong with that? And so on.
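
To see why this abstraction breaks down at the individual level, here is a deliberately over-simplified sketch of the general idea, my own reconstruction rather than Ofqual’s actual algorithm: rank the students, then impose last year’s grade distribution on them by rank position.

```python
# Deliberately simplified reconstruction of the general idea, not Ofqual's
# actual code: rank this year's students, then impose last year's grade
# distribution on them by rank position.

def impose_historical_distribution(ranked_students, last_year_grades):
    """ranked_students: this year's cohort, best to worst (teacher ranking).
    last_year_grades: last year's awarded grades for the same school/subject,
    best to worst. Each student inherits the grade at their rank position."""
    n = len(ranked_students)
    awarded = {}
    for i, student in enumerate(ranked_students):
        j = round(i * (len(last_year_grades) - 1) / max(n - 1, 1))
        awarded[student] = last_year_grades[j]
    return awarded

# A cohort whose teacher expected nobody to fail...
this_year = ["Asha", "Ben", "Chloe", "Dev", "Elena"]
# ...but last year's distribution for this school/subject ended in a U.
last_year = ["A", "B", "B", "C", "U"]

print(impose_historical_distribution(this_year, last_year))
# {'Asha': 'A', 'Ben': 'B', 'Chloe': 'B', 'Dev': 'C', 'Elena': 'U'}
# Elena gets a U purely because someone did last year: not "fair" or explicable individually.
```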

Next, Compute. If you’ve set up your model, or in fact the multiple competing models you should have set up to compare, results should start to churn out. But to do this, and to be able to assess different models easily and make changes to see their effects, the data needs to be in good shape. Good shape for data means “computable” as well as accurate and complete: it has structure and meaning attached, ready for a computer to use, which is crucial to the rapid iteration of models you need to do a good job. This should have been organised years ago in this case, but may not have been.
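
As a minimal sketch of what “computable” means in practice (with hypothetical field names again), you should be able to audit, mechanically, how much of the dataset is actually ready to be modelled:

```python
# Minimal sketch (hypothetical field names) of checking records are "computable"
# before any modelling: complete, consistently typed, and carrying the structure
# the models need.
VALID_GRADES = {"A*", "A", "B", "C", "D", "E", "U"}

def validate_record(record):
    """Return a list of problems with one student record."""
    problems = []
    if record.get("teacher_assessed_grade") not in VALID_GRADES:
        problems.append("missing or unrecognised teacher-assessed grade")
    if not isinstance(record.get("teacher_rank_in_cohort"), int):
        problems.append("rank is absent or not an integer")
    if not record.get("school_grade_history"):
        problems.append("no prior-years grade distribution attached")
    return problems

def audit(records):
    """Summarise how much of the dataset is actually ready to compute on."""
    broken = {}
    for r in records:
        problems = validate_record(r)
        if problems:
            broken[r.get("student_id", "?")] = problems
    print(f"{len(records) - len(broken)} of {len(records)} records computable")
    return broken
```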

Next, Interpret. Ensure you satisfactorily answer what you defined as your problem in step 1. Does the result meet the criteria (and were these in fact the right criteria)? If not, how can we adjust so it does? Have we got completely the wrong abstraction of the problem, or do we need to throw in more model complexity to handle all sorts of cases? Have we thought of or imagined all the scenarios that are significant? How are we stress-testing our model against them, and what happens when we do? Have many different groups within government (educators, PR, No. 10, headteachers) engaged with it sufficiently for effective cross-questioning?
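
One concrete form this can take, sketched here with hypothetical criteria and scenario names: run each candidate model on constructed edge-case cohorts and check every result against the fairness criteria from the Define step.

```python
# Hypothetical sketch: stress-test any candidate model against constructed
# edge-case cohorts, checking each result against the Define-step criteria.

def no_unearned_fails(cohort, awarded):
    """Criterion: nobody is awarded a U unless their teacher predicted one."""
    return all(awarded[s] != "U" or predicted == "U"
               for s, predicted in cohort.items())

def stress_test(model, scenarios, criteria):
    """scenarios: name -> {student: teacher-predicted grade};
    criteria: name -> predicate(cohort, awarded grades). Returns the failures."""
    failures = []
    for scenario_name, cohort in scenarios.items():
        awarded = model(cohort)
        for criterion_name, check in criteria.items():
            if not check(cohort, awarded):
                failures.append((scenario_name, criterion_name))
    return failures

# Scenarios worth constructing: a four-student Ancient Greek class, a strong
# cohort in a school with a weak history, tied teacher rankings, missing mocks.
```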

In order to do this you need the data and models to be interactive and easily re-codable live, so that the Prime Minister can himself see in real time the answers to questions he might have. How was the information made available for verification? In printed-out graphs, or perhaps a dead PDF report or an inert PowerPoint presentation, or…in a live, computable notebook document that the decision-maker could drive? In the modern world of using mass data for decisions, the interface to the data has a big effect on the outcome.

This exam-grade decision is a major one: it affects many people, it’s politically sensitive, and so on. Were the Minister and his civil servants taken in detail through “the model”, live, as I describe? Or was this left to vague questioning about whether “fairness” had been achieved? If not, why not? Did they not have even the basic, requisite CT skills?

Iterating the CT process to verify and improve decisions


Why was CT misapplied?

That’s a smattering of the CT process for this case. What’s shocking is how on every level and at every step (bar perhaps the computation itself) it appears to have been misapplied, and to have failed. Why didn’t the analysts at Ofqual apply CT effectively? What about their management, and the DfE’s supervision, both civil servants and ministers?

I think the answer in the end is pretty simple. Poor education in CT. Because, like for everyone else, their school education doesn’t cover it. Not in maths, not elsewhere. Not for big data problems with modern computation, not in the messy situations that computation today is asked to provide decisions for.

With no education in, and no experience of, meshing human and computer, maths and coding (that is, CT), it’s hardly surprising that the problem fell through so many hands and was such a failure. This isn’t the first time, and it won’t be the last. In fact, expect such failures to accelerate in both frequency and severity.

I’m sure it will emerge that individuals with good CT warned about what was to come. But the system failed, because the system isn’t set up for a modern computational age, and key people operating it are woefully ill-equipped with CT education.

What’s ironic in this case is that these government departments are the ones specifically responsible for setting or approving the very curricula that have failed them. And yet again, this shows how urgent the problem is.

Worse, I understand the DfE has just handed out £150M of funding for extra maths tutoring of students: teaching how to do simplistic, manual calculations when the student has actually got the computer in front of them for Zoom or Teams anyway, and could be using it to do realistic, harder CT problems. If you try to get funding for a full, mainstream CT educational programme to go alongside or replace maths, you’re out of luck. Too much trouble, too much disruption.

We’re in a vicious circle where the government don’t have the CT abilities to analyse why they need to fund core CT education! It is urgent that this cycle is broken, and with The Math(s) Fix I launched the “TMF Campaign for Core Computational Curriculum Change” to help collect support for doing so. Please add yours if you agree.

I finish with two specific fears arising from this episode. Firstly, that exams will become more prescriptive rather than more open-ended, and so more unfair in failing to match the real world. Secondly, that continued misuse of computation will lead to mistrust and the demise of its use. This won’t take us back to a utopian pre-computation age, but to a new pre-enlightenment era where reason has retrenched.

Conrad Wolfram