Designing Programming Courses in the AI Era | Alternative Assessments for Genuine Learning


Generative AI tools have fundamentally changed what programming assessments actually measure. Without deliberate redesign, homework and exams risk evaluating AI proficiency more than student understanding. This article maps assessments—from attendance and timed coding tasks to oral exams and code walkthroughs—against Bloom's Taxonomy, and introduces an "AI-resistance" framework to evaluate each method's robustness against AI substitution. We discuss staffing and scalability constraints that govern what is operationally feasible, and propose concrete course designs at medium and high effort levels. The goal is not to ban AI, but to build courses that recover what AI has eroded: genuine, verifiable competence.

Opening

Every semester, programming instructors encounter a version of the same quiet dilemma: a student submits polished, correct code—but cannot explain why it works. I've sat on both sides of this table. As a TA for MIT 6.2000 and 6.2500, I've graded hundreds of submissions and run my share of office hours, and I can tell you this gap existed long before ChatGPT. What generative AI has done is widen it dramatically and make it nearly invisible in traditional grading workflows. In courses where homework and take-home projects once served as the primary evidence of learning, those formats now measure, at best, a student's ability to prompt an AI effectively (Hanh and Duyen 2025; Delphino 2025).

This article is a practical guide for instructors who want to close that gap. I am not here to argue that AI tools should be banned—that ship has sailed, and frankly, AI-augmented engineering is a skill worth teaching in its own right. The question I care about is more precise: when a student submits work that looks correct, how do we actually know whether they understand it? We examine the full range of assessment options—attendance, homework formats, quizzes, exams, and projects—and evaluate each through the lens of "AI-resistance." We then propose concrete course designs that remain operationally feasible for typical staffing levels, without sacrificing the genuine learning that programming education is meant to produce.

What Are We Trying to Achieve? | Purpose > Goals > Targets

Purpose of the Class

Before we talk about assessment mechanics, it is worth being precise about what we are actually trying to achieve. A well-intentioned instructor can spend hours redesigning assignments and still miss the point if the course goals were vague to begin with.

For a given subfield (e.g., CV, NLP, or deep learning), develop real programming competence and reasoning ability in the AI era.

That single sentence does a lot of work. "Real competence" is different from "correct output." "Reasoning ability" implies the student can explain, adapt, and debug—not just produce. And "in the AI era" acknowledges that we are not trying to pretend 2024 didn't happen; we are teaching students to be effective engineers in the world as it actually exists.

Bloom's Taxonomy as a Foundation for Goal Setting

With that purpose in mind, Bloom's Taxonomy (Anderson et al. 2001) gives us a structured vocabulary for specifying what we want students to be able to do. As shown in Figure 1, it organizes cognitive skills into six hierarchical levels: Remember, Understand, Apply, Analyze, Evaluate, and Create.

I use the taxonomy in two ways here. First, it anchors course goals to specific cognitive levels—turning vague aspirations like "students will understand neural networks" into something measurable, like "students can diagnose why a given architecture fails on a specific input class." Second, and this is the part that matters most in the AI era, it gives us a direct way to evaluate how vulnerable each assessment format is to AI substitution.

Here is the hard truth: generative AI is excellent at the lower levels of the taxonomy. It can recall facts, explain concepts fluently, and produce functional code. Any assessment that only tests these levels is highly substitutable by AI. This does not mean we abandon them—students still need to understand how the code works, even if they did not write it from scratch—but it does mean we need to be intentional about pushing assessments toward Apply, Analyze, Evaluate, and Create, where AI assistance becomes harder to substitute for genuine understanding.

Figure 1: Bloom's Taxonomy provides a hierarchical framework for cognitive skills, progressing from recall and comprehension to application, analysis, evaluation, and creation.

A resource I return to often is the Utica University combined chart of Bloom's Taxonomy Revised (Utica University 2001), which pairs each cognitive level with both action verbs and suggested deliverable types. It is a practical starting point when you are trying to move from a learning goal to a concrete assessment format.

With this framework in place, we can map course goals directly to Bloom's levels:

  • (Remember) Students can recall core algorithms, data structures, and debugging patterns without AI prompts—testable via closed-book quizzes and timed recall exercises.
  • (Understand) Students can explain their code and reasoning in real time to a TA or peer, demonstrating genuine comprehension rather than AI-polished output.
  • (Apply) Students can implement and debug solutions in new contexts, independently and under time constraints.
  • (Analyze) Students can compare algorithmic approaches, identify trade-offs, and diagnose bugs in unfamiliar code.
  • (Evaluate) Students develop judgment—knowing when to use AI tools, when to work independently, and how to critically assess AI-generated code for correctness and efficiency.
  • (Create) Students can design and build original projects that synthesize course concepts, using AI as a collaborator while maintaining authorship and genuine understanding.

Beyond the cognitive hierarchy, a few operational goals are worth making explicit: incentivize time spent on learning-relevant work over AI-shortcutting; maintain fairness and interpretability in grading; and keep the course sustainable for the teaching staff.

Targets by the End of the Semester

Increasingly, engineers code with AI tools to boost productivity. It is impractical—and arguably counterproductive—to ask students not to use AI at all. Instead, I find it more useful to define two categories of targets: what students can do without AI, and what they can do with it.

Part 1: Independent Competence (without AI). These targets assess what students can do purely from their own knowledge, corresponding to the lower-to-middle levels of Bloom's Taxonomy (Remember through Apply):

  • Implement fundamental algorithms and data structures from scratch within a timed setting.
  • Debug unfamiliar code and articulate the root cause of errors.
  • Explain any submitted code in an oral walkthrough, including design choices and edge cases.
  • Answer conceptual questions about trade-offs between approaches without preparation time.

Part 2: AI-Augmented Competence (with AI). These targets assess what students can build when AI tools are available, corresponding to the upper levels of Bloom's Taxonomy (Analyze through Create):

  • Design and implement a project that goes beyond course examples, using AI as a coding assistant while maintaining intellectual ownership.
  • Critically evaluate AI-generated code: identify correctness issues, inefficiencies, and failure modes.
  • Think about scalability: would this solution still work at 10× the data size or number of users? Would it degrade gracefully or collapse?
  • Think about maintainability: which parts of the project will need frequent updates? Can the process be partially automated?
  • Integrate AI tools strategically—knowing when to prompt, when to verify, and when AI output is simply insufficient.

This two-part framing is not a compromise. It reflects what good engineering actually looks like: people who can only work with AI are fragile; people who can do both are adaptable.

Constraints and Tradeoffs in Assessment Design

Having clear goals is necessary but not sufficient. Different assessment formats impose very different costs on teaching staff, and those costs compound fast in large courses. A code walkthrough that takes 10 minutes per student becomes 17 hours of staff time for a class of 100. Before committing to any format, instructors need to honestly reckon with what they can sustain.
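
To make that arithmetic easy to rerun for your own class size, here is a minimal sketch in Python; the per-student minutes are illustrative assumptions, not measured values.

```python
# Rough staff-hour estimates for one round of each assessment format.
# Per-student minutes are illustrative assumptions, not measured values.
FORMATS_MINUTES_PER_STUDENT = {
    "autograded coding homework (spot checks)": 1,
    "recorded chalk-talk review": 4,
    "code walkthrough (10 min, 1-on-1)": 10,
    "oral exam (15 min, two staff present)": 30,
}

def staff_hours(minutes_per_student: float, class_size: int) -> float:
    """Total staff hours for one round of an assessment."""
    return minutes_per_student * class_size / 60

CLASS_SIZE = 100
for name, minutes in FORMATS_MINUTES_PER_STUDENT.items():
    print(f"{name:45s} {staff_hours(minutes, CLASS_SIZE):5.1f} staff-hours")
```

A 10-minute walkthrough for 100 students comes out to roughly 17 staff-hours per round, which is the figure quoted above.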

It is also worth noting that MIT's grading policy mandates criterion-referenced assessment, which actually supports our AI-resistance goals (MIT Registrar's Office 2024):

The grade for each student shall be determined independently of other students in the class, and shall be related to the student's mastery of the material based on the grade descriptions below. Grades may not be awarded according to a predetermined distribution of letter grades.

This is a meaningful design constraint. It means the question "did this student demonstrate mastery?" is always more important than "how did this student rank relative to peers?" That framing actually helps with AI concerns: if a student can demonstrate mastery live, in real time, it does not matter much whether they used AI to draft their code—what matters is that they can own it.

A few principles I try to hold in tension when designing assessments:

  • Increased staff effort is acceptable if assessment quality improves meaningfully—but not all quality improvements are worth the cost.
  • No format is fully AI-proof; the goal is robustness, not perfection.
  • Some convenience can be traded for more authentic evidence of understanding.
  • Different assessments may have different rules about AI use, and that is fine—transparency with students about which is which matters.
  • More complex grading is acceptable if rubrics remain clear and fair.
  • The course should be designed around incentives, not policing.
  • Any solution must still scale to the course's actual staffing and size.

Methods

The rest of this article works through the main categories of assessment—attendance, homework, quizzes and exams, and projects—and evaluates each on practical and AI-resistance grounds. I draw on my experience TAing MIT 6.2000 and 6.2500 throughout, since those are the contexts where I have actually tried or observed most of these approaches.

Attendance

Attendance is the most contentious item to grade, especially in CS, where many students reasonably feel that recorded lectures make physical presence optional. I am sympathetic to that view. But there is something that happens in a live classroom—a question asked at the right moment, a misconception corrected before it calcifies—that recordings simply do not replicate. If attendance is going to count for anything, it should be tied to actual engagement, not just showing up.

  • Attendance Sheet: A physical sheet passed around during lecture with a brief prompt or question—not a graded quiz, just a note-taker. This has been used effectively in MIT 6.2000, where I TA'ed. The handwritten element makes it hard to fake remotely, and coordinated fraud (e.g., one student filling out multiple sheets) is easily spotted by neighboring students.
  • QR Code Sign-in: A QR code displayed during lecture, combined with a rotating code or brief question that students must answer to register attendance. This was a method I suggested and implemented while TAing MIT 6.2500. The code changes every lecture, so students must be present to capture it; a minimal sketch of the rotating-code mechanism appears after this list.
  • In-class Polling: Tools like iClicker or Google Forms with a live question at a randomized point in lecture. This doubles as an engagement check and attendance record. The unpredictable timing discourages students from leaving after the first five minutes.
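
As a concrete illustration of the rotating-code idea, here is a minimal sketch that derives a short per-lecture code from a staff-held secret and the lecture date; the secret and sign-in URL are hypothetical placeholders, and the same approach works whether the code is shown as plain text or embedded in the QR link.

```python
import hmac
import hashlib
from datetime import date

# Hypothetical secret known only to course staff; rotate it each semester.
COURSE_SECRET = b"change-me-per-semester"

def lecture_code(lecture_date: date, digits: int = 6) -> str:
    """Derive an unpredictable per-lecture code from the secret and the date."""
    digest = hmac.new(COURSE_SECRET, lecture_date.isoformat().encode(),
                      hashlib.sha256).hexdigest()
    return str(int(digest[:10], 16) % 10**digits).zfill(digits)

today_code = lecture_code(date.today())
# The QR shown in lecture encodes a sign-in link carrying today's code;
# the form rejects submissions whose code does not match.
signin_url = f"https://forms.example.edu/attendance?code={today_code}"
print(today_code, signin_url)
```

Because the code is derived rather than stored, staff can regenerate any past lecture's code when resolving disputes, while students cannot predict future codes.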

One practical note: attendance credit should have a threshold, not a linear penalty. If attendance is worth 5% of the final grade, students should receive full credit at 90% attendance. This reduces stress on both sides—students who miss a lecture or two are not catastrophically penalized, and staff are not stuck adjudicating disputes over marginal cases.
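
A minimal sketch of the threshold rule, using the 5% weight and 90% cutoff from the example above (both numbers are course-specific choices, not recommendations):

```python
def attendance_credit(attended: int, total_lectures: int,
                      weight: float = 5.0, threshold: float = 0.90) -> float:
    """Attendance points (out of `weight`), with full credit at the threshold.

    Below the threshold, credit scales linearly instead of dropping to zero.
    """
    rate = attended / total_lectures
    return weight if rate >= threshold else weight * rate / threshold

print(attendance_credit(22, 24))  # misses 2 of 24 lectures -> full 5.0
print(attendance_credit(12, 24))  # attends half -> about 2.8
```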

Participation

In-class discussion where students answer questions or share thoughts verbally is currently hard to outsource to AI tools—unless students arrive with real-time assistance like earpieces or smart glasses, which is a separate problem. But participation is also genuinely hard to record and grade at scale. In a fast-moving lecture, a TA running around noting names is disruptive, and extended class discussion often bogs down the pace.

Two lighter-weight approaches tend to work better. Structured small-group discussion: pause lecture for 2–3 minutes, have students discuss a prompt in pairs or triads, then cold-call one group. TAs can circulate and note active participants without disrupting flow. Post-lecture reflection forms: a one-sentence takeaway submitted within 10 minutes of class ending. This is more easily gamed with AI, but it at least creates a record of engagement and takes almost no grading time.

In practice, many large CS courses fold participation into attendance or drop it as a graded component entirely. I think that is a reasonable call at scale, but in smaller courses or recitation sections, participation remains one of the strongest signals of genuine understanding.

Homework

Homework is where AI substitution is most acute, and where the most design work needs to happen. Traditional written and coding submissions—the backbone of most CS courses—are now the most vulnerable format. I'll go through each variant in rough order of AI-resistance.

  • Traditional Coding Submission: Written answers and coding submissions via Google Colab or similar. Easy to submit, easy to grade, and trivially substitutable by AI for students who know how to prompt. This is the format we are primarily trying to mitigate against.
  • After-class Written Note: A one-page handwritten note summarizing key concepts from lecture and homework, submitted as a photo. Deadlines set 1–2 days after lecture give students time to reflect and synthesize. The handwritten format is difficult to fake with AI, and the act of summarizing by hand is itself a learning mechanism—research on retrieval practice is clear on this point.
  • Audio or Chalk Talk Homework: Students record themselves explaining their solution. Chalk talk is the stricter variant: unedited video of the student working through a problem on a whiteboard or paper in front of a camera. The real-time, unedited constraint makes it genuinely hard to fake, because fluency cannot be scripted. A student who does not understand the material will show it within the first minute.
  • Teamed Podcast: Students work in pairs to interview each other about the material and record the exchange as a short podcast episode. The conversational format—particularly the unrehearsed follow-up questions—is harder to script credibly with AI. It also has a side benefit: students who need to explain concepts to a peer tend to understand them better afterward.
  • Video Homework: Students prepare a short video explaining a concept or subtopic. This tests understanding and communication, but is more susceptible to AI-generated scripts or synthetic audio and video than chalk talk. Requiring the student to be on camera, unedited, and responsive to a prompt disclosed at submission time (rather than in advance) raises the AI-resistance considerably.
  • Online Timed Coding Task: Students write code on a platform like CodeGrade or HackerRank under a strict time limit. This directly tests coding ability under pressure. A known workaround is two students coordinating—one starts the test early and screenshots problems for the other—which can be mitigated by randomizing question order and requiring webcam proctoring. Not perfect, but meaningfully harder to game than take-home work.
  • Code Review / Walkthrough: Students submit code and then explain it in a short 1-on-1 or small-group session with a TA. The TA asks follow-up questions: "Why did you use this data structure here?" or "What happens if the input is empty?" This is operationally expensive but is the most reliable method I know of for verifying genuine understanding. Wilson and Nishimoto (2024) found that shifting grading emphasis from code correctness to in-person demonstration of understanding improved learning outcomes in an engineering programming course—a finding consistent with my experience.
  • Iterative Submission with Diffs: Students submit multiple drafts over time, and grading considers the progression. This makes it visible when a student jumps from nothing to a polished submission in a single step—a strong signal of AI-generated work. The operational challenge is that it requires a platform that tracks revision history and makes diffs easy to review; without that infrastructure, grading becomes unwieldy. A sketch of how staff might screen revision histories appears after this list.
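
For the iterative-submission format, the screening itself can be lightweight. Below is a minimal sketch, assuming drafts are available as plain text; the similarity threshold is an illustrative assumption, and a flag here is a prompt for a human conversation, not an accusation.

```python
from difflib import SequenceMatcher

def step_similarities(drafts: list[str]) -> list[float]:
    """Similarity ratio (0..1) between each consecutive pair of drafts."""
    return [SequenceMatcher(None, a, b).ratio() for a, b in zip(drafts, drafts[1:])]

def flag_for_review(drafts: list[str], threshold: float = 0.95) -> bool:
    """Flag histories that show no visible progression.

    A genuine history usually shows several substantial revisions; a solution
    pasted in wholesale shows a first draft nearly identical to the final one.
    """
    if len(drafts) < 2:
        return True
    return SequenceMatcher(None, drafts[0], drafts[-1]).ratio() >= threshold

history = [
    "def median(xs):\n    pass\n",
    "def median(xs):\n    xs = sorted(xs)\n    return xs[len(xs) // 2]\n",
]
print(step_similarities(history), flag_for_review(history))
```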

A few structural notes on homework design are worth stating explicitly. For recording-based formats (audio, chalk talk, video), a clear rubric with a small number of grade levels—A, B, C, D is sufficient—is essential for consistency across graders. Assigning one staff member to grade all submissions of a given type is preferable for uniformity, but 3–5 minutes per submission across 100 students is 5–8 hours of review; distributing across 2–3 graders is more realistic.

The most effective overall strategy is what I think of as a cocktail approach: mix different homework formats across the semester, so students encounter each type only once or twice. This prevents them from optimizing a single AI-assisted workaround. One oral homework, one timed coding task, one iterative submission—none of these is individually airtight, but together they make consistent AI substitution significantly harder.

Quizzes and Exams

  • Traditional Written Exams: In-person, closed-book, handwritten. Still the clearest way to verify individual understanding in a controlled setting, though limited in testing practical coding skills. One underappreciated limitation: because no exam can cover all course material, grades partially reflect which topics a student happened to review—introducing a degree of randomness that criterion-referenced grading cannot fully eliminate. Many MIT CS courses have added or expanded written exams since 2023 specifically to counter AI-assisted take-home work, and this is largely sensible.
  • Open-book / Open-internet Exams: Students can use any resources but must finish within a tight time limit. Time pressure is the key lever here: students who understand the material have a significant advantage over students who are trying to retrieve and prompt AI in real time.
  • Oral Exams: Students explain solutions live to an instructor or TA, unscripted. This is the strongest guard against AI substitution but also the most expensive to administer. In practice, oral exams work best for smaller classes or as random audits on a subset of students—applying them to everyone at scale is usually not feasible.
  • Take-home Exams: Longer time windows (e.g., 24 hours), open-resource. In terms of AI-resistance, these are essentially indistinguishable from homework. They should always be paired with a follow-up oral component or code walkthrough to verify understanding; standalone, they are not an exam but a well-scheduled homework.
  • Conceptual Short-answer Questions: Rather than asking students to produce code, ask them to trace through a given snippet, predict output, identify a bug, or explain why a particular approach fails on a specific edge case. These are harder for AI to help with under time pressure because the questions are context-heavy and require reading code that the student has never seen. An example of this question style appears after this list.
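
To make the format concrete, here is the kind of snippet such a question might pair with a prompt like "predict the printed output, identify the bug, and state the fix"; the example is illustrative, not drawn from any particular exam.

```python
def moving_average(xs, k):
    """Intended: the average of every window of k consecutive elements."""
    out = []
    for i in range(len(xs) - k):  # bug: the final window is never included
        out.append(sum(xs[i:i + k]) / k)
    return out

print(moving_average([1, 2, 3, 4], 2))
# Expected answer: prints [1.5, 2.5], silently dropping the [3, 4] window;
# the fix is range(len(xs) - k + 1).
```

Questions like this are quick to write, quick to grade, and hard to answer through pattern-matching alone under time pressure.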

Projects

Projects typically refer to final assignments where students design and implement a larger system or application. This is also the format that changes least in the AI era—AI-tool usage is generally expected and appropriate in project work, and the value of a project lies in the idea, the execution, and the live defense, not in code written without assistance.

Midterm Project(s)

In the ML courses I have taken at MIT, midterm projects are relatively uncommon. When they do exist, they typically serve as scaffolding for the final: pre-proposals, proposals, and progress reports create checkpoints that catch students heading in a bad direction before they have invested significant effort.

Proposal writing is worth treating as a skill in its own right, not just a bureaucratic step. AI can polish grammar and suggest structure, but generating a novel and defensible research question—and connecting it credibly to existing literature—still requires genuine engagement with the material. Each research proposal in my ideal structure would include: a clearly stated research question and motivation; a description of who benefits from the work and how; a proposed methodology with data sources, algorithms, and evaluation metrics; a Figure 1 showing the architecture or framework; and at least 20 references spanning foundational and recent work.

Presentation-based midterms are another option: students select a topic grounded in at least two research papers, prepare a polished 10-minute presentation, and submit a screen-recorded video with camera enabled. This format resembles video homework but places more emphasis on literature synthesis—less on implementation, more on reading comprehension and the ability to communicate research clearly.

Final Project

  • Report: Traditional project with written report and code. AI tools are generally allowed here, and the emphasis should be on the quality of the idea, the rigor of the execution, and the clarity of the write-up.
  • Poster Presentation: A poster session at the end of the semester where students present and defend their work to peers and instructors. The live Q&A component—particularly when instructors ask probing follow-up questions—makes it resistant to work the student did not actually understand.
  • Oral Presentation: Students present in front of the class and answer questions. Tests both understanding and communication.
  • Video Presentation: Students prepare a polished video presenting their project. Less interactive than live presentations but practical for large classes where scheduling prohibits oral defense.

The project component does not need a major redesign for the AI era. The main design work belongs in homework, quizzes, and exams—where we want to measure genuine understanding while making AI substitution difficult enough that it is no longer the path of least resistance.

How much weight attendance and participation deserve depends on the instructor's priorities. CS courses have traditionally de-emphasized both compared to fields like EE or the humanities. My own view is that students who attend lectures and engage actively with the material learn better, and that small but non-trivial attendance credit—say 3–5%—is a reasonable nudge without becoming punitive.

Summary of Methods and AI-Resistance

Let me now define the "AI-resistance" scale I have been gesturing at throughout. AI-resistance is the degree to which an assessment format requires the student to demonstrate understanding in ways that cannot be substituted by AI-generated output. A Low rating means a competent AI user can achieve a high score without understanding the material—the format evaluates output, and AI can produce that output. A Medium rating means AI assistance is possible but imperfect; workarounds exist, yet prepared students retain a real advantage. A High rating means the format requires real-time, verifiable demonstration of knowledge—typically live, interactive, and responsive to follow-up—where AI cannot stand in for the student.

  • Low AI-resistance. Example formats: traditional written homework, take-home exams, a final project with report only. Description: easily substituted by AI tools; students can achieve high scores without truly understanding the material, and output-based grading cannot distinguish AI-generated work from genuine effort. Tradeoff: lowest cost to administer, familiar to students and staff, and scales well.
  • Medium AI-resistance. Example formats: oral homework recordings, online timed coding tasks, open-book exams with time limits, video homework, iterative submissions with diffs. Description: relying solely on AI is more difficult, but workarounds exist; students could script oral recordings in advance or coordinate on timed tasks, and time pressure favors prepared students without fully blocking AI use. Tradeoff: moderate operational cost; some formats require platform setup, others require review time.
  • High AI-resistance. Example formats: chalk talk (live, unedited whiteboard explanation), oral exams, code reviews/walkthroughs with TAs, in-class poster Q&A. Description: requires students to demonstrate understanding live and in real time; follow-up questions make it nearly impossible to fake knowledge, and these formats are the closest proxy to ground-truth understanding we have. Tradeoff: most expensive to administer, requires significant staff hours, and may cause anxiety for students uncomfortable with live assessment.

My Solution

Here I outline two concrete course design proposals: one for instructors willing to invest moderate extra effort, and one for those with the staff capacity for higher-touch assessment. Both are grounded in the cocktail approach—mixing formats so that no single AI workaround covers the whole course.

Medium Extra Effort Solution

These additions require some upfront setup but are sustainable with normal TA staffing.

  • Attendance sheet with a quick question: A physical sheet passed around during lecture with a simple question or one-line prompt. The handwritten element makes remote faking impractical, and the marginal grading cost is low—TAs scan for completion, not correctness.
  • Weekly handwritten one-page note: Students submit a handwritten summary of key concepts from lectures and homework each week. TAs skim for effort and understanding. This format is hard to fake with AI, and the act of writing by hand is itself a retrieval practice that reinforces memory.
  • Paper-based in-lecture quizzes: Short, in-person quizzes given at random intervals during lecture—not announced in advance. This tests real-time understanding in a low-stakes format and doubles as an implicit attendance check.

High Extra Effort Solution

These additions require dedicated TA capacity but offer the highest gains in assessment quality.

  • Oral exams with two staff present: A portion of the grade is based on a live oral exam where students explain their solutions. Having two staff present—one asking questions, one taking notes—makes the grading more consistent and reduces the risk of a student appealing a grade based on ambiguity. This is the most robust way I know of to verify genuine understanding.
  • Video presentations from a structured topic list: From a list covering roughly 10–30% of course material, students choose one topic, prepare slides, and submit a screen-recorded video with camera enabled. The camera requirement helps distinguish genuine presentation from AI-narrated slides. TAs review for quality and depth.
  • Timed coding tasks on a proctored platform: For a portion of homework or quiz credit, students complete timed coding tasks on CodeGrade or HackerRank with randomized problem ordering. This tests real-time ability and is considerably harder to game with AI. The platform setup cost is non-trivial but amortizes over multiple semesters. A sketch of per-student problem ordering appears after this list.
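
If the chosen platform does not support randomized ordering natively, the same effect can be approximated with a deterministic per-student shuffle. Here is a minimal sketch, where the question identifiers and exam salt are illustrative placeholders:

```python
import hashlib
import random

QUESTION_BANK = ["q1_two_sum", "q2_bfs_grid", "q3_lru_cache", "q4_interval_merge"]
EXAM_SALT = "quiz3-fall"  # change per assessment so orderings are not reusable

def ordering_for(student_id: str) -> list[str]:
    """Reproducible per-student question order, distinct across students."""
    seed = int(hashlib.sha256(f"{EXAM_SALT}:{student_id}".encode()).hexdigest(), 16)
    order = QUESTION_BANK.copy()
    random.Random(seed).shuffle(order)
    return order

print(ordering_for("student_a"))
print(ordering_for("student_b"))
```

Because each seed is derived from the student ID and a per-exam salt, staff can reconstruct exactly what order any student saw when handling regrade requests.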

Closing

At the start of this article, I described a student who submits correct code but cannot explain it. That student exists in every cohort, and they existed before ChatGPT. Generative AI has not created a new problem so much as it has removed the friction that once slowed the problem down. The friction used to buy time—time for the student to engage with the material even imperfectly, time for genuine understanding to develop through iteration. Now that friction is gone, and the gap between output and understanding can open up in a single homework cycle.

The assessments I have described here are attempts to reintroduce friction in the right places. Not friction that punishes students for using tools, but friction that requires them to stand behind their work—to explain it, defend it, and adapt it in real time. Chalk talk, oral exams, timed coding, and code walkthroughs are not clever tricks. They are just formats where the student cannot opt out of understanding.

I will be direct about something: none of this is airtight, and it will require ongoing adjustment. AI tools evolve quickly, and any specific workaround I have described will eventually be circumvented by a sufficiently motivated student with access to better tools. What will not be circumvented is a course culture that makes genuine understanding the default expectation—one that is communicated clearly, assessed consistently, and graded on mastery rather than output. That is the part worth protecting.

The instructors who will navigate the next few years well are not the ones who find the perfect AI detector. They are the ones already asking the right question: not "how do I stop students from using AI?"—but "how do I actually know whether they understand?"

References

Anderson, Lorin W., David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. New York: Longman. https://doi.org/10.1177/0013124502034002008.
Delphino, Kaléu. 2025. "Assessing the Prevalence of AI-Assisted Cheating in Programming Courses: A Pilot Study." https://doi.org/10.48550/arXiv.2507.06438.
Hanh, Nguyen Van, and Nguyen Thi Duyen. 2025. "AI-Assisted Academic Cheating: A Conceptual Model Based on Postgraduate Student Voices." Frontiers in Computer Science 7. https://doi.org/10.3389/fcomp.2025.1682190.
MIT Registrar's Office. 2024. "Guidelines of the Committee on Curricula: Section 4, SUBJECTS: The Communication Requirement." https://registrar.mit.edu/faculty-curriculum-support/faculty-curriculum-committees/committee-curricula/committee-guidelines-4.
Utica University. 2001. "Bloom's Taxonomy Revised: A Combined Chart." Utica University Academic Assessment Resources. https://www.utica.edu/academic/Assessment/new/Bloom%20tx%20revised%20combined.pdf.
Wilson, Sara Ellen, and Matthew Nishimoto. 2024. "Assessing Learning of Computer Programming Skills in the Age of Generative Artificial Intelligence." Journal of Biomechanical Engineering 146 (5): 051003. https://doi.org/10.1115/1.4064364.