
I tried using an AI agent to help me mark students' work: here's what I discovered.

Updated: Apr 7


Depth studies are a vital way to get students doing science, and their science project can be one of the more memorable and enjoyable parts of a student's year. Marking them, however, is time consuming, especially when following good practice for feedback: specificity about what was good and why, actionable steps for the student to improve, and personalisation. Several iterations of an AI agent were evaluated on their responses to real student submissions for a Year 9 science Depth Study on the Energy topic in the new NSW science syllabus. The best performance came from an agent trained with a limited focus, and performance improved further when the agent was guided by ultra-short, embedded teacher comments. This also proved the most time-effective approach for the marker, almost eliminating human corrections to the agent's output. Overall, the time to mark each report was comparable to manually marking with minimal comments, but the quality of the feedback was significantly better.


If you are a science teacher, you probably know the pain of marking these projects first hand! Sure, you can split the marking between teachers, but the effort needed to ensure consistency is significant, and we often end up atomising the marking to the point of it being formulaic. This can result in higher marks for less challenging work, with the rigid formula often unable to reward students who tackled more challenging projects. Holistic marking can overcome this, but it generally requires a single teacher-marker (two at most) across a cohort of perhaps 150 students.

So I tried (not for the first time) training an AI agent to assist with the feedback. I will provide details of the agent and a summary later. For now, I hope my journey of discovery will be helpful.

Agent iteration 1: Limited scope

I limited the agent to just two sections of the student report (the abstract and the discussion). The agent was provided with the relevant parts of the marking rubric, as well as specific examples of what to look for. Since it knew the rubric, I had it assign marks against those two sections' criteria as well. After running this over about 25 scripts, I found it was OK at the comments, although it sometimes got the science wrong. What followed was a process of trial and error, developing the prompts to steer it in the right direction. At times I found that more specific prompting made it less agile when it encountered a more novel student investigation. It was also somewhat generous with the marking, but with some more crafting I was able to get it to discriminate better.

Agent iteration 2: An expanded failure

Armed with limited success, I expanded the agent to look at more sections of the student's report. The quality of its performance dropped noticeably the more I gave it to consider, and the AI-allocated marks diverged even further from my own marking. "Was this even worth it?"

Agent iteration 3: Revisit the drawing (or prompting) board

About to give up, buy good coffee and just spend the entire weekend manually marking, I had one last go. I decided to use a Large Language Model (LLM) only for what it is good at - language! So I stopped allowing the agent to allocate marks. That freed it up to craft comments following the pattern I had trained it on (see above). I was getting somewhere, but it still wasn't great at picking the most important things to comment on (positively and constructively). It was certainly no experienced science teacher!

Agent iteration 4: Enhancing with a few choice phrases of my own

If you are still with me, you are probably wondering whether I had spent more time on this than it would have taken me to just mark them all myself. You and me both! Had it only been a class or two, I would have. I am also stubborn; this dumb artificial-intelligence thing was NOT going to beat me :) So I told the agent to wait while I added a few very short key phrases of my own to guide each comment before it wrote it. This was so much better. However, it was still time consuming, flicking between the student's work and the AI input screen.

Agent iteration 4 (again): but with the agent AFTER me

I was marking using SpeedGrader in Canvas. Similar to the grader in Schoolbox, you can click and add a comment anywhere on the student's work. So I tried adding ultra-brief comments on the most important things, placing each comment on the sentence it was relevant to. (Remember, I was also allocating marks using the marking rubric as I went.) Then, by selecting ALL and copying that into the AI agent's prompt, my comments were embedded within the student text, with my teacher ID identifying each comment. Since I had already prompted the AI that it would receive comments from me, it understood this without any further refinement. It recognised my comments and used them to prioritise what its overall comment focused on. Now it was generating high-quality, specific, personalised comments (personalised in that it referred specifically to aspects of the student's work, not just using their name). The student had my ultra-brief comment (e.g., "mixed the variables") within their report AND that idea developed into a quality feedback comment they could understand. The agent was doing a great job of highlighting the best parts, weaving in the training from me to explain why they were good, and taking my somewhat abrupt and cryptic comments and explaining them appropriately for the age group of the student (something that was also part of the agent's instructions). I found that I rarely needed to edit the comment, perhaps less than 1 in 20, and even then the edits were minimal. The other bonus is that it included comments on the other sections (that it was not directed to focus on) without reducing its performance.
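
To make this concrete, here is a minimal sketch of what the copied text might look like once the inline comments are embedded. The "Roger Kennett:" marker is an assumption about how the copied text identifies the teacher; in practice it is whatever your grading tool produces when you select all and copy. The point is simply that the teacher's name flags each ultra-brief comment for the agent to prioritise.

# A sketch (in Python, to match the batch script later in this post) of the
# text pasted into the agent: student prose with the teacher's ultra-brief
# comments embedded. The "Roger Kennett:" marker format is an assumption.
student_text_with_comments = """
...for the last three trials we raised the ramp AND used the heavier trolley...
Roger Kennett: mixed the variables
...the trolley was faster when released from a greater height, so more
gravitational potential energy was available...
Roger Kennett: link this back to energy transformation
"""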

Lessons I learnt

  1. Agents are way better than a chatbot. By "agent", I mean an ecosystem like playlab.ai, the OpenAI Playground, or Cogniti from USYD. These essentially feed a pre-prompt to an LLM of your choice (see the sketch after this list). It is worth trying different LLMs; the difference is significant, and sometimes a "lesser" LLM does a better job.

  2. Co-creation. This is not earth shattering, but it still took me a while... use a tool for what it is best at. Sure, you can use a chisel as a screwdriver, but it might end in blood. At the moment, LLMs are amazing at following language patterns. Create your agent to perform where it is most likely to succeed - taking patterns you have provided and crafting some disparate ideas into nice, logical, age-appropriate paragraphs.

  3. You first, then AI. For tasks like this, give it guidance. Short comments from you within a student's response are powerful guides for the AI. It is faster overall to guide it first than to try to add your comments later.
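
For those curious what "feed a pre-prompt to an LLM" looks like under the hood, here is a minimal sketch assuming the openai Python package and an API key in the environment. Ecosystems like playlab.ai, the OpenAI Playground and Cogniti wrap this same pattern behind a friendlier interface, so nothing here is specific to my agent.

# Minimal sketch of an "agent": a fixed pre-prompt (your marking instructions)
# sent ahead of each student's work. Assumes the openai Python package; the
# model name is whichever LLM you choose to compare.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PRE_PROMPT = """You are helping a science teacher give feedback on a Year 9
Depth Study report. [agent instructions go here - see AGENT DESIGN below]"""

def feedback_for(student_text: str) -> str:
    # student_text is the report, with any embedded teacher comments included
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": PRE_PROMPT},
            {"role": "user", "content": student_text},
        ],
    )
    return response.choices[0].message.content

The pre-prompt stays constant and only the student text changes from run to run, which is what makes an agent more consistent than an ad hoc chatbot conversation.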


Was it worth it?

Like lots of technology, it greatly improves the end result for about the same amount of your time.

Using the AI, it took about 1.5 hours per class. This is the time it would have taken me to put ultra-brief comments on the students' work and allocate marks anyway. To have crafted quality, personalised feedback would have taken me much longer. When I add on the time spent trying all the iterations before I started - well, I probably broke even. But I have two things for next time: experience AND an agent that will only need a little tweaking! So, on balance, it was worth it. This post took me longer than marking another class would have :)


What's next?

I am playing with Python to try a different approach and to eliminate the cut and paste, the waiting for the LLM to respond, and the back and forth between screens. I have the API and Python talking to an LLM; I just need some time to develop an iterative script using a bit of PyPDF2. My plan is to allocate marks and add a few choice comments, then the AI part can grind away on its own time without needing me.
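
As a rough outline of where that script is heading, here is a sketch assuming PyPDF2 for text extraction and the feedback_for() helper sketched in the lessons above; the folder names and output format are placeholders of my own, not a finished tool.

# Rough sketch of the planned batch run: extract each report's text with
# PyPDF2, send it (with my embedded comments) to the agent, save the feedback.
from pathlib import Path
from PyPDF2 import PdfReader

def extract_report_text(pdf_path: Path) -> str:
    # Pull the plain text out of every page of a student's submission
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

Path("feedback").mkdir(exist_ok=True)
for pdf_path in sorted(Path("submissions").glob("*.pdf")):
    report = extract_report_text(pdf_path)
    comment = feedback_for(report)  # the LLM call sketched earlier
    (Path("feedback") / (pdf_path.stem + ".txt")).write_text(comment)
    print(f"Done: {pdf_path.name}")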


Final comments: Treat the AI output as a draft. It still says weird things and can get the science wrong! Maintain student privacy; only use privacy-ensured ecosystems.

Oh, if you've got this far, you might like the AI Classroom.


AGENT DESIGN

I was using an agent-LLM ecosystem that ensured privacy and that the students' work would not be used to train future models. The underlying model was GPT-4 (released March 2023), hosted on Microsoft Azure.

Below is the design for the final iteration of agent:

----------

The author, a [redacted]-year-old science student, has completed... [ redacted - this section provided specific details about the task]
Use Australian or UK English spelling.
Do not start with "Dear student". Instead use their first name within the first sentence. e.g., "[firstName], your report has many ...."
Avoid hyperbole. When describing student understanding, use these terms in rank order: extensive, thorough, sound, basic.
Pay special attention to the sections mentioned below. Do NOT comment yet, wait until you get to the instructions about feedback to provide the student.
Title: Note if it is "witty" with some succinct play on words relevant to their investigation. If not, that is something to improve for next time, to attract interest in your work.
Abstract: 5 sentences. Succinct. Sentences listed below:
1. Real-world context of this investigation, especially why this matters (e.g., renewable electricity production is important for the world to reduce reliance on fossil fuels and minimise Climate Change).
2. How the research you did relates to the context (e.g., by improving efficiency we can generate more renewable electricity and reduce fossil fuel dependence).
3. The method briefly described (e.g., we tested... by measuring...).
4. Your findings (e.g., we found that...).
5. Relates their findings back to the context and/or suggests further experimentation (e.g., this means... Further research into [specifics] could improve [real-world context]).
Background: 
A succinct summary of the relevant background science. 
Discusses gravitational potential energy (GPE). Explains the transformation of GPE into KE. Mentions the law of conservation of energy OR that some energy is transformed into unhelpful forms like heat or sound and is "lost" energy or "waste" energy OR relates energy transformations to energy efficiency. Any other genuine science that is relevant gets a special mention.
Things to note:
a. Does the abstract set the context and relate to real world issues (e.g., Climate Change)?
b. In the "why" section, have they brought in scientific concepts from the background? 
c. In the "further" section have they suggested different or new investigations? If they have related this to what they said in the "why" section, that is especially good. If they have just talked about improving this method, that is something they can improve.
Address your comment to the student. Use their first name once.
Keep your comments succinct. Word limit is 200 words. Redraft if you are too long.
Include this final sentence: "refer to your report for any more specific comments from me".
There will be comments within the work identified as Roger Kennett. Use these to prioritise what to include in your comment.
Your comment should:
Note specific things in their report that are good, and say WHY they are good in the context of a scientific report. Do this for each section I have directed you to.
Explain explicitly how the student could improve for next time. Use any comments from me as your first priority, then other things you were directed to notice.
