
Designing and Grading Exams for Large Introductory Courses

Harry Smith

When I talk about my introductory programming class with colleagues, friends, and family, we often end up discussing the challenge of teaching a course with hundreds of students. It’s at this point that I get to adopt a suitably ironic facial expression and make my favorite joke: teaching hundreds of students is easy. Assessments and grading are hard! Indeed, it’s difficult to imagine a more efficient use of my time than showing up to a packed auditorium and lecturing for an hour: hundreds of student-hours of instruction delivered in one sitting, and all I had to do was prepare some slides. This is an oversimplification, but every hour spent preparing for a course meeting yields added benefit when the meeting is delivered to hundreds of students at once.

Assessments flip that ratio of active-to-background time: if two hundred students sit for a sixty-minute exam, I have two hundred papers to grade and respond to, and anything I do for a single exam item has to be repeated hundreds of times. If teaching is a cycle of “prepare, teach, assess,” that final step marks the key difference between managing small courses and large ones, and it is also the step most critical to students’ learning.

In teaching a large class, then, I carefully consider how the design of assessments affects the cost, measured in instructor and TA hours, of responding to students’ work. Here it helps to rely on the dichotomy between formative assessments, which provide low-stakes feedback that a student can use to adapt, and summative assessments, which measure student understanding at the end of a unit or semester.

The difference in the role feedback plays in each kind of assessment points to where instructor effort is best spent. Since the aim of a formative assessment is student growth, these are the assignments where a student gets the most value from personalized feedback. Evaluation and feedback on purely summative assessments come “too late” to be useful: they measure learning that has already happened. A course best supports its learning outcomes by including numerous assignments that provide formative feedback, in turn preparing students for more cumulative assessments.

Following this principle, my course contains several different types of assignments. There are low-stakes activities that signpost the details from readings, videos, or previous classes that will prepare students for that day’s class. Because of the frequency of these check-ins, I use Ed Lessons (within Ed Discussions), a tool that can autograde these activities and give students immediate feedback (Canvas Quizzes or Google Forms work just as well). These activities help students check their understanding before confronting other challenges.

I also give students challenging programming tasks every week. Here too, student work can be automatically checked against many measures of correctness, so students get nearly instantaneous feedback. Even outside of programming tasks, it is important to me that students can check elements of their work while it is still in progress. I find that a formative feedback loop during the assessment itself helps students learn from their mistakes.

The final set of assessments for my course comprises three exams: two midterms and one final. Midterm exams are unusual in that they fall midway between formative and summative assessment, and midway in the stakes they typically carry. I am struck by how, in my course, a midterm exam worth perhaps 6% of the grade alerts struggling students in ways that failing to complete programming projects worth 60% does not. Since exams send the signals that students are conditioned to pay attention to, I try to make these assessments useful in both formative and summative contexts.

To make this possible, I recommend Gradescope, an online platform for accepting student submissions and providing both evaluation and feedback. I use it for open-ended written work; for online quizzes; for automatically graded programming assignments; and for exams. I think that exams are where Gradescope shines. After students take an exam, I collect all of the papers, scan them, and upload them to Gradescope. Each uploaded scan is matched against a blank exam template that I prepare on the website, which locates the work belonging to each question. Once all scans are processed, I can quickly grade every response to a given question, either by using the automatic grouping of equivalent student answers for multiple-choice, fill-in-the-blank, or short-answer questions, or by manually evaluating open-ended or bespoke question types.

Automatic grouping reduces grading to a quick confirmation that Gradescope has grouped equivalent answers together correctly; the time spent grading one of these questions can be measured in seconds per student. While such rigid question formats only measure learning objectives of low complexity, the cost of adding them is near zero in proportion to the work of creating the exam. I also appreciate that I can easily see common wrong answers, which lets me award partial credit or reevaluate the way I teach a certain topic. (I also use Gradescope to take attendance: I hand out a simple printed worksheet that students complete during lecture and hand in at the end, then collect, scan, and upload the sheets just like exams. Gradescope identifies names, and I can give credit for correctness as well as attendance and participation using the grouping tools.)

Even these restrictive question types can provide scaffolding for a more complicated task. A common pattern I employ on an exam is to include a handful of simply formed questions that first prompt students to identify the definitions and relationships relevant to a challenging programming task, and then to have them complete that task using the pieces they just assembled. Designing questions this way makes it easy to give a subtle hint or to award partial credit for errors carried forward; in any case, it allows for questions that build toward greater complexity without adding significant effort to the grading process.

All of that saved time can be spent grading richer kinds of questions in greater detail, both to tease out the correct parts from the mistakes and to provide actionable feedback that the student can use in future assignments or courses. The rubric system in Gradescope also gives students clear, actionable feedback in more detail than graders can often manage by hand. The grading is flexible, too, supporting the grouping of related items and different preferences for additive versus subtractive grading or mutually exclusive rubric items. Anything you want to communicate to a student outside of the rubric can be marked directly on the scan of their paper, where they will be able to see it.

I believe that instructors can maintain meaningful engagement with hundreds of students without being overwhelmed by costs that scale with class size. The key lies in recognizing that effective teaching requires deliberate choices about where to invest manual effort: prioritizing rich feedback where it matters while relying on automation for the rest.

Harry Smith is a senior lecturer of computer and information science in Penn Engineering.

--

This essay continues the series that began in the fall of 1994 as the joint creation of the College of Arts and Sciences, the Center for Teaching and Learning, and the Lindback Society for Distinguished Teaching. 

See https://almanac.upenn.edu/talk-about-teaching-and-learning-archive for previous essays.
