AI in Re-imagining Assessments

The Problem

The Structural Failure of Current Evaluation

The Default Operating Mode

A student enrolled in a law programme is assigned a research paper on constitutional interpretation, due in seventy-two hours. They open an AI model, type the assignment question, and receive a structured, well-argued response within seconds. After minor edits, the submission is uploaded. Turnitin finds no red flags — the content was generated, not copied — and a first-division grade is awarded.

The student has demonstrated no independent legal reasoning, engaged with no primary sources, and formed no original argument. The credential, however, reflects otherwise.

Why Detection Has Already Failed

AI detection produces probabilistic estimates that do not meet the evidentiary standard required for academic misconduct proceedings. Paraphrasing attacks can bypass most detectors, and tools specifically designed to make AI text undetectable are widely available.
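
To make the evidentiary point concrete, the following is a minimal arithmetic sketch in Python. Every figure in it (cohort size, share of AI-generated submissions, detection rate, false-positive rate) is a hypothetical assumption chosen for illustration and does not describe any real detector or cohort. It applies Bayes' rule to estimate how likely a flagged submission is to be AI-generated, and how many honest students would be flagged.

```python
def flagged_is_ai_probability(prevalence, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged submission is AI-generated."""
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1 - prevalence) * false_positive_rate
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical inputs, assumed only for this illustration.
cohort_size = 500           # submissions in one cohort
prevalence = 0.30           # share of submissions that are AI-generated
true_positive_rate = 0.80   # share of AI-generated work the detector flags
false_positive_rate = 0.05  # share of human-written work the detector flags

posterior = flagged_is_ai_probability(
    prevalence, true_positive_rate, false_positive_rate
)
wrongly_flagged = cohort_size * (1 - prevalence) * false_positive_rate

print(f"P(AI-generated | flagged) = {posterior:.3f}")
print(f"Honest students flagged in a cohort of {cohort_size}: {wrongly_flagged:.1f}")
```

Under these assumed inputs, roughly one flag in eight lands on a student who wrote their own work. A flag of this kind is an estimate, which may help explain why it falls short of what a misconduct proceeding requires.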

More fundamentally, detection is in a permanent race it cannot win. Each model update makes generated text harder to flag. The assignment becomes an exercise in evading AI detection rather than in engaging with the assigned material.

"The problem is not that AI exists, or even that students use it, but that the system provides no mechanism to distinguish between demonstrated understanding and generated output."

— JGU 5 Initial Contribution · Inria AI Grand Challenge 2026

What assessment was meant to do

Academic assessment serves two purposes at once: it produces a grade, and it produces information about what the student knows, where comprehension has failed, and what instruction must still do. Both break down together when AI-generated work is submitted.

What assessment is currently doing

It is verifying that a student has submitted a document. The grade is awarded, but the information is absent. Faculty proceed on false data, unable to identify gaps in understanding, adjust instruction, or meaningfully gauge the effectiveness of their teaching.

Why It Matters

The Stakes

12%

Rural Internet Adoption · IAMAI-Kantar 2025

Access to premium AI tools tracks closely with socioeconomic advantage. This is not a digital divide — it is an evaluation regime that systematically advantages those already ahead, while presenting itself as a meritocracy.

Enforceable UGC Guidelines on Generative AI

The UGC's 2018 plagiarism regulations remain the governing standard. No formal update addresses AI-generated content. AICTE has moved faster — creating a fragmented regulatory landscape with inconsistent accountability standards.

∞

Downstream Credential Erosion

Law firms and consulting firms have already introduced parallel assessment exercises precisely because they no longer treat transcripts as sufficient evidence of competence. The burden of evaluation is migrating beyond universities.

T.1

Assessment has stopped working as a measurement tool

A submitted essay is supposed to show that a student can construct an argument. When a machine can produce that same output in minutes, the instrument stops measuring what it was designed to measure. Policy that tells students how much AI they may use does not fix the underlying problem.

T.2

The professor–student relationship has quietly eroded

AI did not directly cause the erosion of Socratic learning — tutorial-based dialogue was progressively sidelined as institutions shifted toward scalable written assessments. But AI has made the cost of that erosion visible and urgent. The Socratic method is the most resilient form of assessment precisely because it cannot be delegated to a machine.

T.3

Equity is more complicated than it looks

AI tools are often described as democratising because they are widely available. Yet a student at a well-resourced institution with a premium AI model is not using the same tool as a student at a state college on a free-tier platform. The gap is not just one of access; it is also one of capacity to use AI well, which tracks closely with the quality of prior schooling.

T.4

AI companies have interests institutions are not accounting for

Adopting AI tools in academic settings means entering a commercial relationship with companies whose business models depend on data accumulation. Student submissions and reasoning patterns are commercially valuable training data. The Digital Personal Data Protection Act 2023 (DPDPA 2023) is directly relevant, but subordinate regulations for educational AI have not been developed.

T.5

The degree is losing its signal value

When grades increasingly reflect AI-assisted output rather than a student's own reasoning, employers, postgraduate institutions, and professional bodies lose the ability to trust what a degree represents. Faculty have already begun resorting to informal viva-style conversations to verify understanding — that informal workaround points toward the structural reform the system needs.

T.6

The student has quietly checked out

When any assignment can be resolved in ten minutes with the right prompt, the incentive to sit with a problem and build understanding from the struggle disappears. This is not laziness — it is a rational response to a broken incentive structure. A student who consistently outsources academic thinking will graduate with a credential but without the intellectual muscle that credential is supposed to represent.

Our Proposal

The SLAM Framework

The Situated Learning Assessment Model is a governance framework built on a single premise the current system has never adequately operationalised: what a credential certifies is not the ability to produce a document, but the capacity to demonstrate genuine understanding. SLAM does not modify how submissions are staged or disclosed — it moves evaluation away from submissions entirely.

i.

Start From the Learner

Most evaluation reform designs a universal framework, then inserts flexibility. SLAM inverts this: it begins from the learner's year of study, subject, and chosen direction, and builds the evaluation method upward from there. The result is a principled architecture within which different subjects and years are assessed through mechanisms genuinely suited to what is being demonstrated.

ii.

Real-Time Demonstration

The unit of evaluation shifts from what a student produces outside the room to what they can demonstrate inside it. Structured application tasks, Socratic dialogue, and performance-based evaluation make understanding visible through action. A student cannot delegate a live advocacy performance, and cannot outsource the reasoning that a clinical scenario demands in real time.

iii.

Format ≠ Standard

Two students assessed through different formats are not being held to different standards. The format is variable and context-responsive. The standard — the cognitive depth, analytical rigour, and disciplinary competence expected at a given year level — is fixed and institutionally defined.

Differentiation by Subject Type

Subject Type	Evaluation Method	Rationale	Tags
Foundational & Theory	In-classroom application tasks with unseen problems requiring real-time reasoning	Tests whether a student can identify what principle applies, why it applies, and what it requires in a specific situation. This reasoning happens in the student's mind, in the present moment, and cannot be produced in advance.	No AI substitution possible; Design-based resistance
Applied & Clinical	Moot court, case simulations, structured problem-solving under observation, live client scenarios	Professional competence demands evaluation through performance. A student cannot delegate a live advocacy exercise. The assessment instrument is the performance itself, which by definition cannot be generated in advance.	Performance-based; Professional alignment
Elective & Specialisation	Structured faculty-led inquiry in the student's chosen domain, directed intellectual exchange	A student who has elected to focus on technology law should be assessed on their capacity to reason within that field, through a directed exchange in which the faculty member probes the depth and independence of thinking in real time.	Domain-specific; Intellectual ownership

Differentiation by Year of Study

Stage	Evaluative Priority	Assessment Focus
Early Years	Conceptual grounding	Can this student identify the relevant principle, apply it correctly, and explain why it leads to the outcome reached? The question is not complex — the test is whether the answer is genuinely theirs.
Middle Years	Analytical development	Students must explain why one position is stronger than another, what their favoured argument cannot account for, and where reasoning breaks down under pressure. These are precisely the questions that AI can appear to answer in a document but cannot answer when a faculty member pursues them in real time.
Final Years	Intellectual ownership	Dual-track model: shared disciplinary competence (core knowledge every graduate must demonstrate) + learner-specific track (the student's elected specialisation). Both tracks use in-context, real-time demonstration as the evaluation instrument.
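
The two tables can be read as two independent lookups, which a short Python sketch may make easier to see. The dictionary keys, the `AssessmentPlan` type, and the `plan_assessment` function are hypothetical names introduced only for illustration, and the stage's evaluative priority is used here as a simplified stand-in for the institutionally defined standard. The sketch is one possible encoding of the tables, not an implementation of SLAM.

```python
from dataclasses import dataclass

# Evaluation format varies with subject type (summarised from the first table).
FORMAT_BY_SUBJECT_TYPE = {
    "foundational_theory": "in-classroom application tasks with unseen problems",
    "applied_clinical": "moot court, case simulations, observed problem-solving",
    "elective_specialisation": "structured faculty-led inquiry",
}

# Evaluative priority is set per stage of study (from the second table).
STANDARD_BY_STAGE = {
    "early": "conceptual grounding",
    "middle": "analytical development",
    "final": "intellectual ownership",
}


@dataclass(frozen=True)
class AssessmentPlan:
    evaluation_format: str
    standard: str


def plan_assessment(subject_type: str, stage: str) -> AssessmentPlan:
    """Take the format from the subject type and the standard from the stage."""
    return AssessmentPlan(
        evaluation_format=FORMAT_BY_SUBJECT_TYPE[subject_type],
        standard=STANDARD_BY_STAGE[stage],
    )


# Two middle-year students assessed in different kinds of subject.
theory = plan_assessment("foundational_theory", "middle")
clinical = plan_assessment("applied_clinical", "middle")

print(theory.evaluation_format != clinical.evaluation_format)  # formats differ
print(theory.standard == clinical.standard)                    # standard is shared
```

Keeping the two lookups separate mirrors the principle that the format is variable while the standard is fixed: changing a subject's format never touches what is expected at a given year level.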

Addressing the fairness question directly

Language equity. Before enrolment, institutions map the medium of prior schooling for each student. In the first semester, all students participate in academic discourse coaching through credited or co-curricular workshops — a universal preparation, not a deficit intervention. Students may initially articulate reasoning in their first language before self-translating; the argument, not the language of its first expression, is what is being assessed.

After every viva, faculty are required to give feedback that separates content understanding from expression quality — so students can distinguish what they know from how fluently they currently articulate it.

Process anxiety. In-classroom application tasks can disadvantage students who process more slowly or who experience anxiety in high-pressure academic settings. SLAM does not treat these as edge cases. Faculty development must include explicit training in designing tasks that allow for variation in response pace and style without reducing the rigour of what is being tested.

Where language anxiety remains significant, viva questions are made available thirty minutes before the session — removing vocabulary retrieval pressure without reducing the spontaneity of reasoning being tested.

Our Journey

How We Built This

I

Brainstorming

Where it started

The question we kept returning to

The conversation began not in a seminar room but in an informal exchange among five law students who had noticed something unsettling: assignments were being graded on output that no longer reliably reflected the understanding of the person submitting it. Not hypothetically — in our own institution, in our own cohort.

We started from the student's perspective rather than the institution's. Our first instinct was to ask what the assessment was actually trying to do before asking how AI disrupted it. That reorientation — diagnosis before prescription — shaped everything that followed. We read through UGC frameworks, BCI guidelines, and NAAC accreditation criteria, and found the same gap everywhere: the instruments were designed for a world in which written output was evidence of the reasoning that produced it. That assumption had quietly collapsed.

We argued about detection first — whether smarter tools could close the gap. The consensus, after working through the literature, was no. Detection is a probabilistic exercise that cannot meet the evidentiary standards required for misconduct proceedings. We needed to think structurally, not technologically.

The question we eventually landed on was this: if you could no longer trust what a student submitted, what could you trust? The answer, after considerable back-and-forth, was what they could demonstrate in real time, in front of you, under questioning.

II

Initial Contribution

Submitted — April 2026

A framework built from first principles

Our initial contribution to the challenge laid out the structural failure of detection-based approaches and proposed SLAM as an alternative architecture. The core argument was that the credential crisis was not primarily a technology problem but an evidentiary one: institutions had lost the mechanism for distinguishing demonstrated understanding from generated output.

We proposed three interlocking principles. Assessment should start from the learner rather than from an abstract universal standard. Evaluation should move toward real-time demonstration — what a student can show inside the room — rather than what they produce outside it. And format should be understood as variable while standard remains fixed.

We mapped these principles across subject types and year levels, and addressed the fairness questions directly: on language equity, on processing anxiety, and on what it means to prepare students for a form of evaluation that is, in many institutional contexts, unfamiliar.

We were careful about one thing from the start: we did not want to propose a framework that worked only for well-resourced institutions like our own. The students who stand to lose most from the collapse of meaningful evaluation are those for whom a credible degree is the primary instrument of social mobility.

III

Evolving the Proposal

Feedback & community dialogue

What we heard, and what it changed

The peer review process surfaced questions we had anticipated and several we had not. The scalability challenge came up repeatedly — how do you run viva-style assessments at the scale of a public university with hundreds of students in a cohort and faculty already stretched thin? We had addressed this in our framework but not with sufficient concreteness, and that was a legitimate criticism.

We also heard a sharper version of the equity argument. Our proposal assumes that integration — formally recognising AI use and building assessment around it — is preferable to prohibition. The counterargument was that prohibition might be more equitable precisely because it levels access to zero. Working through that argument changed how we articulated our position: prohibition does not in fact eliminate unequal advantage, it merely pushes it underground, where disparities continue to shape outcomes without any institutional mechanism to address them.

A third line of feedback pushed us toward something we had not adequately considered: AI literacy as an assessable competency in its own right. Our framework had largely treated AI as something to render irrelevant to evaluation. The question — can you assess whether a student uses AI well, not just whether they can think without it — opened a parallel track we intend to develop further.

What the feedback clarified is that SLAM is not a finished proposal. It is an architecture. The implementation mechanisms — rubrics, calibration standards, audit trails, pilot structures — need to be built in consultation with institutions, not handed down as a template. The framework, not the format, is what we are asking to be standardised.

Internal Debate

Where We Disagreed

The discussion within the team was structured around two competing paradigms — integration as regulation and prohibition as equalisation. The deliberation focused on which framework more accurately responds to the structural transformation introduced by generative AI.

Position A · Integration

AI integration as structural necessity

Generative AI has already entered the epistemic core of academic work. Prohibition attempts to regulate a practice that is already widespread yet difficult to detect or enforce. Integration enables reconstruction of assessment frameworks — shifting evaluation toward process-based and performance-based verification that directly tests understanding rather than relying on written artefacts AI can replicate.

Position B · Prohibition

Prohibition as a claim to equity

Access to AI is stratified across three axes: tool quality, AI literacy, and institutional capacity. Integration — particularly when paired with disclosure requirements — was seen as insufficient because it renders inequality visible without correcting it. Formalising AI use risks entrenching advantage for already well-resourced students, while overburdened faculty in large public institutions lack capacity to meaningfully assess disclosures.

Resolution · Team Position

Integration, with structural safeguards

Prohibition does not, in fact, eliminate unequal advantage — it pushes AI use into informal and unregulated domains, where differences in access and capability continue to shape outcomes without institutional oversight. Integration, while not eliminating inequality, creates the only viable framework within which disparities can be addressed through standardised access, AI literacy training, and redesigned assessment methods. Regulatory leverage is only possible under conditions of formal integration — the DPDPA 2023 can only be operationalised when institutions explicitly recognise and govern AI use.

Peer Dialogue

What the Reviewers Said

Five reviewers engaged with our initial contribution between April and May 2026 — from France, Mexico, Senegal, and India. Their questions sharpened the proposal in places, challenged it in others, and in one case opened a line of thinking we had not yet pursued. What follows is the exchange in full.

MOUH Mariam · External Reviewer · 14 Apr 2026

In favour

The proposal presents a compelling shift from evaluating polished written outputs to assessing genuine student understanding through real-time methods. However, the proposal would benefit from a more concrete implementation strategy, particularly considering under-resourced institutions and faculty managing large student cohorts. How can these personalised, real-time assessments be conducted at scale without significantly increasing teacher workload?

JGU 5 Response

SLAM is not a suggestion to replace all assessments with individual viva-style evaluations. It is a phased and flexible framework where assessment formats differ based on subject type, year of study, and institutional capacity. Scalability is addressed through supervised in-class application tasks, structured group discussions, simulations, and debate-based assessments that allow multiple students to be evaluated simultaneously. Real-time oral assessments are intended to supplement, not entirely replace, existing evaluation systems.

— Kashish Goel, 11 May 2026

Gaëlle Méchin · External Reviewer · 18 Apr 2026

In favour

The point about the trustworthiness of diplomas in the age of generative AI is highly relevant. More broadly, this reflects a systemic issue across education: students are incentivised to prioritise short-term performance over long-term learning. AI brings this to the forefront. Some systems don't rely on papers much — in France, written in-person exams require real-time reasoning around a "problématique." However, both written and oral exams have limits: students often learn for the test and forget afterward. AI is an opportunity to rethink assessments in favour of deeper, long-term understanding.

JGU 5 Response

Your observation about AI exposing a deeper structural weakness within education is something we strongly agree with. Many assessment systems were already rewarding short-term performance over durable understanding, and generative AI has simply made that gap more visible. We agree that formats centred around real-time reasoning are significantly more resilient than conventional take-home submissions. That is precisely why our proposal attempts to rethink not just the format of assessment, but what assessment is actually meant to verify.

— Akshit Mathur, 11 May 2026

Haziel Álvarez · External Reviewer · 22 Apr 2026

In favour

The format-vs-standard distinction in SLAM is genuinely useful. Two questions: SLAM shifts evaluation to live faculty judgment — what about structural safeguards inside the viva itself? Blind rubrics, cross-faculty calibration, audit trails — to ensure fluency doesn't get conflated with reasoning quality at scale. Also, the proposal largely treats AI as something to render irrelevant to assessment. Do you see a parallel track where AI literacy becomes its own assessable competency — "can you use it well" — not just "can you think without it"?

JGU 5 Response

Mechanisms such as structured rubrics, calibration standards, and audit trails are exactly the kinds of safeguards we believe would become necessary as frameworks like SLAM scale institutionally. We would also clarify that the proposal is not attempting to render AI irrelevant to assessment broadly. Our argument is narrower: AI should not substitute for the evidentiary function of assessment itself. AI literacy would sit within SLAM but not as a substitute for demonstrated cognition. The ability to use AI well cannot replace the need to demonstrate independent reasoning.

— Akshit Mathur, 11 May 2026

Amadou Gaye · ARIA Team · 27 Apr 2026

We question the idea that paid AI tools are necessarily better, since both paid and free versions can make similar errors or biases. Your point about alignment with societal and employer expectations is very relevant, especially as AI reshapes other sectors beyond education. The equity concern already existed before AI — and while inequalities are real, access to knowledge is still broadly possible through the internet and open resources.

JGU 5 Response

Our concern is that AI risks formalising and amplifying existing inequalities within evaluation systems that still assume comparable conditions of access, literacy, and institutional support. Our point regarding paid and free AI tools was less about correctness alone — students with access to premium models benefit from better contextual memory, longer interactions, multimodal capabilities, and greater institutional exposure in how to use these systems effectively. The advantage lies not only in the tool itself, but in the conditions under which students learn to engage with it.

— Akshit Mathur, 11 May 2026

Nipun Ranchhod Navadia · External Reviewer · 9 May 2026

The proposal reframes AI not just as a plagiarism issue, but as a deeper challenge to how universities evaluate genuine understanding. The focus on viva-based assessments and real-time reasoning directly addresses AI-generated submissions. Connecting assessment failures with equity, faculty preparedness, and declining credential credibility makes the proposal realistic and well-researched. One area to improve: simplifying the framework and making implementation more practical. The ideas are strong, but large-scale adoption may be difficult for institutions with limited resources. Adding clearer pilot strategies and measurable outcomes could strengthen it further.

JGU 5 Response

We have attempted to account for implementation through phased rollouts, pilot institutions, mandatory outcome reporting, faculty development mechanisms, and equity audits — precisely because we recognise that large-scale adoption cannot happen uniformly across institutions with differing levels of resources. We completely agree that clearer measurable outcomes and implementation indicators would strengthen the framework further, and that is something we intend to develop more concretely as the proposal evolves through the next stages of the challenge.

— Akshit Mathur, 11 May 2026

What Comes Next

Implementation Roadmap

Pilot Phase

Pilot institutions implement subject-type differentiated evaluation in two disciplines. Faculty development workshops on application-based assessment design. Mandatory outcome reporting to surface early constraints. Language equity mapping for incoming cohorts.

Expansion

Framework extends across all year levels at pilot institutions. Equity audits conducted before further expansion — identifying where differentiated formats are producing unintended disadvantage. Cross-faculty calibration standards developed and tested.

National Rollout

Regulatory guidance enables national rollout with institutional adaptation flexibility built in. The framework's logic scales — but its specific mechanisms must be calibrated to disciplinary and institutional context. The framework, not the format, is what is standardised.

Conditions for implementation

Regulatory definition of discipline-specific, year-level competency standards by UGC, NAAC, and BCI
Institutional investment in faculty development for application-based and inquiry-led evaluation design
Equity audits at each phase of implementation to identify where differentiated formats produce unintended disadvantage
Compliance with the Digital Personal Data Protection Act 2023 governing any digital tools used within the evaluation process
Phased rollout beginning with pilot subjects, with mandatory outcome reporting before expansion proceeds
Cross-faculty calibration standards and structured rubrics to prevent conflation of fluency with reasoning quality
