What I Learned from Building with LLMs in L&D - Gianni from Third Space Learning

Introduction

Okay, so a few words about who we are. We are a tech company. We provide tutoring for disadvantaged primary and secondary students here in the UK and in the US.

We have done this for a lot of students: 160,000 children in 4,000 schools. And we believe AI will transform our ability to deliver our mission.

Even fewer words on me. I'm the technical architect at Third Space Learning.

I wrote my first piece of code when I was six, and I have a computer science degree.

I worked on my first AI project, kind of, in 2005, when AI and large language models weren't really a thing.

I'm a code polyglot. I code in a lot of different languages.

I started at Third Space Learning in 2019, and since then we have delivered a lot of valuable things for children. I'm driven by an insane passion for everything I can get my hands on and code.

The Vision of AI in Education

So, at Third Space Learning, as I said, we believe AI will transform the way we deliver our mission.

We had mainly two questions. The first is: can an AI agent evaluate a session? We split this question into two sub-questions. Can an AI agent evaluate language?

Because our tutors are based in Sri Lanka and India. And can an AI tool evaluate the session delivery? That means whether the tutor can engage the student, whether the explanation is clear, and so on.

The second question, probably more fascinating, is: can an AI agent deliver a session? That means, can an AI tool teach a learning objective using our own classroom? And can an AI tool deliver the whole program, working weekly with a student, engaging the student, remembering everything the student did, from the assessment to everything the student said, and so on?

We answered both of these questions with a yes. Today I'll talk about the AI evaluation, not the AI tutor, and the reason is that I learned more implementing the AI evaluation than the AI tutor. For this conversation, I think it's more suitable.

So why did we want to implement AI evaluation? Because we want to improve the quality of our service.

At the moment, we evaluate roughly 3% of our sessions. We try to evaluate sessions from all the tutors, but 3% is what we can do right now.

Our target is to evaluate all the sessions, and that's roughly 10,000 sessions per week.

Accuracy. At the moment, evaluations are carried out by specialists, but they are human. That means they have their own taste: the same session could be evaluated differently based simply on how the evaluator woke up that morning.

Our target is an accuracy of 90%, which means the same session will always be evaluated the same way. Cost per evaluation at the moment is £10, because we have to pay the people who evaluate the sessions. We want to reduce that to one cent.

And yeah, that's the amount of money we need to run our process. Then we want to improve our tutors. How? By feeding them feedback constantly.

Right now, we give them feedback after two weeks. I mean, if someone says to me, "two weeks ago you wrote this piece of code", I probably won't remember anything. If the review comes the day after, it's definitely better.

And then, we don't want to spend 200 hours per month evaluating sessions. We want to reduce that, not to zero, because our evaluators still want to listen to, say, five minutes of a session, but not 45 minutes of every single session.

The AI Evaluation Challenge

So let's start talking about how we implemented this tool. The question was: can an AI agent evaluate a session? What kind of process do we want for evaluating a session?

The first step is to isolate the tutor's utterances, because we don't want to assess the student's grammar, for example, or the student's answers. It's just about how the tutor teaches a learning objective.

Then we want to analyze the tutor utterances for grammar issues and readability, whether the language used by the tutor is suitable for a 10-year-old child or thereabouts. And then we want to analyze the same utterances for delivery features like engagement and so on.

And then we want to create an evaluation matrix, store this matrix in our database, analyze it to see whether the tutor's performance is improving or not, and create actionable suggestions for the tutor, sending them straight away so the tutor knows what went well and not so well in the session.
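To give a rough idea of what one row of that evaluation matrix looks like, here is a minimal sketch; the field names, scales, and values are only illustrative, not our real schema.

```python
# Illustrative only: a rough shape for one session's evaluation record.
# Field names and scales are invented for this example, not the real schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SessionEvaluation:
    session_id: str
    tutor_id: str
    language_score: float                      # grammar and readability
    delivery_score: float                      # engagement, clarity, and so on
    positives: List[str] = field(default_factory=list)
    improvements: List[str] = field(default_factory=list)

# One row that would be stored in the database, tracked over time to see
# whether the tutor is improving, and turned into feedback for the tutor.
example = SessionEvaluation(
    session_id="session-001",
    tutor_id="tutor-042",
    language_score=4.2,
    delivery_score=3.8,
    positives=["Clear explanation of the learning objective"],
    improvements=["Check the student understands before moving on"],
)
```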

Exploring GPT-4 for Session Evaluation

So we asked: can we use GPT-4 to evaluate a session? We spent a couple of months working with the Assistants API, trying to write giant prompts, with a lot of prompt engineering. And after those couple of months, we said: no, we cannot.

Why? Because it's not consistent. It's not precise. It costs a lot.

We were talking about roughly £2 per session, and it was slow. I'm not saying GPT-4 is not a good model. Actually, we are using GPT-4 for the other project, the AI Tutor. What I'm saying, and it's a lesson I've learned, is that we should use the right tool for each step.

I said I'd try not to be too technical, but actually this is a technical talk. And I'll show you some code. Well, not the code itself, the result of some code.

So I have a Python script to orchestrate the evaluation, some NLP libraries to analyze readability and grammar errors, and an open source model, not GPT-4 because it's too costly, to create actionable suggestions for tutors. And now I'll show you a quick demo.
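Putting those pieces together, the orchestration is roughly this shape; the helper functions below are placeholders for the real components, not the actual production code.

```python
# Rough sketch of the orchestration, not the production script.
# The three helpers are placeholders for the real components: NLP libraries
# for language, and a small open source model for the suggestions.

def analyze_language(utterances):
    # Placeholder: the real version uses NLP libraries for grammar and readability.
    return {"readability": 0.0, "grammar_issues": 0}

def analyze_delivery(utterances):
    # Placeholder: the real version scores engagement, clarity of explanation, etc.
    return {"engagement": 0.0}

def generate_suggestions(language, delivery):
    # Placeholder: the real version calls a small open source model (e.g. Mixtral).
    return ["..."]

def evaluate_session(transcript):
    # 1. Keep only the tutor's utterances.
    tutor_utterances = [t["text"] for t in transcript if t["speaker"] == "tutor"]
    # 2. Cheap, deterministic language checks.
    language = analyze_language(tutor_utterances)
    # 3. Delivery features.
    delivery = analyze_delivery(tutor_utterances)
    # 4. Actionable suggestions for the tutor.
    suggestions = generate_suggestions(language, delivery)
    # 5. The evaluation matrix to store and send straight away.
    return {"language": language, "delivery": delivery, "suggestions": suggestions}
```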

So this is ChatGPT. You're probably very familiar with it. And I'm asking just a very simple question.

I have a transcript, and I'm asking: can you identify the tutor's utterances in the attached transcript and count them? Oh, "our system has detected unusual activity from your system. Please try again later." OK, let's try again.

Okay, it's reading the document. That's a good first step. Yes, of course, but... okay. It's reading the transcript.

"Based on the content of the transcript, the tutor utterances are identified by the label Speaker B." That's correct. "To count the number of tutor utterances, I will go through the document and count every instance where Speaker B is mentioned." Let's wait.

Actually, if you click here, you can see what it's doing. It's creating a giant text. It should create an array, but yeah. OK.

Spoiler: it will fail. I mean, if you are lucky, you get the wrong number. If you are less lucky, you just get an error, or the other way around. But it will fail.

Why? Because LLMs are great at creating content and summarizing content. They are great at coding, but they are not great at math. They are just not there.

I can continue to show you this or we can carry on if you prefer. Believe me, it will fail. After five minutes, it will fail.

Developing a Cost-Effective Evaluation Model

So, what we have done: I created a script to analyze the sentences of each utterance in the transcript. It's a Python script.

We can go back to this window later. I don't want to bother you. I'll show you just the output and how quick it is.

This script uses a very small, very cost-effective model to identify who the tutor is in the transcript, then simply analyzes the transcript and returns the tutor's utterances. Getting the count right is trivial, because it's just math, and code is one of the best tools for doing math. Okay, so it's done. And this is the output.

109 tutor utterances; that's the right number. And GPT is still failing.
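Just to show what that step boils down to, here is a minimal illustration; the transcript format and the speaker label are assumptions for the example.

```python
# Minimal illustration of the filter-and-count step. It assumes a diarized
# transcript where the tutor's label ("Speaker B" here) has already been
# identified by a small, cheap model.
transcript = [
    {"speaker": "Speaker A", "text": "I think the answer is 12."},
    {"speaker": "Speaker B", "text": "Good try! Let's check it together."},
    {"speaker": "Speaker B", "text": "What is 3 multiplied by 4?"},
]

tutor_label = "Speaker B"
tutor_utterances = [t["text"] for t in transcript if t["speaker"] == tutor_label]

# Plain code gets the count right every time; no LLM arithmetic needed.
print(len(tutor_utterances))  # -> 2
```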

Okay, so the next question is how we want to analyze these 109 tutor utterances. I actually implemented two different methods. One uses another open source model, Mixtral, which is the next tier up from Mistral, a bit more powerful. The second just uses the same kind of tool Word uses to grammar-check your sentences.

I won't run that one live because it takes around five minutes, but I have the results; it actually took three minutes. And I have a score, and the model identifies positive aspects and negative aspects.

Having two different methods to do the same thing has an advantage: you can double-check what you are doing. If both methods return roughly the same rating for the transcript and similar suggestions, it means we are doing well.
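The cross-check itself can be very simple, something along these lines; the tolerance and the scores are just example values.

```python
# Illustrative cross-check between the two evaluation methods.
# The scores and the tolerance are example values, not real thresholds.
def scores_agree(model_score: float, nlp_score: float, tolerance: float = 0.25) -> bool:
    # If the open source model and the rule-based NLP method land close
    # together, we trust the evaluation; if not, flag it for a human.
    return abs(model_score - nlp_score) <= tolerance

print(scores_agree(1.80, 1.75))  # -> True, the two methods roughly agree
```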

So I have this other class that uses a couple of NLP algorithms, and this one I can actually run live because it's extremely fast. Live demo, I hope it works. It worked two seconds ago. Yes, it's working.

So as you can see, it's processing: 20%, 40%, 50%, 60%. And that's done. Okay. So the output is 1.75, and the good thing is that it's always 1.75. It's not 1.70 one time and 1.80 the next; it's always 1.75, because it follows a set of rules, grammar rules. Each rule is weighted, so we get quite a good outcome here.
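To give a feel for why the score never wavers, here is a toy version of a weighted rule set; the rules and weights are invented for illustration, not the ones we actually use.

```python
import re

# Toy version of a weighted, rule-based language check. The rules and weights
# are invented for illustration; the point is that the same transcript always
# produces exactly the same score.
RULES = [
    (re.compile(r"\bain't\b", re.IGNORECASE), 0.50),      # informal grammar
    (re.compile(r"\b(\w+) \1\b", re.IGNORECASE), 0.25),   # accidentally repeated word
    (re.compile(r"[A-Za-z]{15,}"), 0.25),                 # very long word hurts readability
]

def rule_based_score(utterances):
    score = 0.0
    for text in utterances:
        for pattern, weight in RULES:
            score += weight * len(pattern.findall(text))
    return round(score, 2)

utterances = [
    "Let's look look at this fraction together.",
    "It ain't as hard as it seems.",
]
print(rule_based_score(utterances))  # deterministic: same input, same score, every time
```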

And this model returns similar suggestions to the other models. I actually tried passing the same output to different models, GPT-4, Llama, Mixtral, Mistral, and I got similar results from all of them. So what is the difference? The difference is cost and speed.

Speed: you've seen it, 24 seconds compared to three minutes, and ChatGPT is still trying to process the first step. And cost: the NLP-based evaluation costs almost zero. It's not exactly zero, because you still have to identify who the tutor is and create the suggestions, but it's a matter of millipounds, a fraction of a penny. With GPT-4, just for the language part, it would cost 73 cents per session, and it goes all the way down to nearly zero for this algorithm I put together. That's the main thing.
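Putting those numbers against our target volume of 10,000 sessions per week makes the difference obvious; the figures below are the ones quoted above, treated as roughly comparable across currencies.

```python
# Back-of-the-envelope cost at the target scale of 10,000 sessions per week,
# using the per-session figures quoted above (currencies treated as comparable).
sessions_per_week = 10_000

cost_per_session = {
    "GPT-4, full evaluation": 2.00,   # roughly 2 pounds per session
    "GPT-4, language only": 0.73,     # roughly 73 cents per session
    "NLP-based evaluation": 0.01,     # target of about one cent per session
}

for method, unit_cost in cost_per_session.items():
    weekly = unit_cost * sessions_per_week
    print(f"{method}: ~{weekly:,.0f} per week, ~{weekly * 52:,.0f} per year")
```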

Key Learnings and Conclusions

So, what have I learned? Back to the presentation. The first lesson is: use the right tool for the job. We have a lot of tools available. We have really powerful models like GPT-4.

We have open source models that are doing quite well. We have NLP. And we can still write code. I know LLMs seem great at doing everything.

But coding is still a thing, at least for a few more years. When using an LLM, choose the right one based on your needs. We chose Mixtral because we didn't need the power of GPT-4: starting from a very well-defined input, you can get a very well-defined output.

The next lesson: take into consideration how to operationalize the process. Why? Because you can have a great prototype, but then you have to put it into production. And sometimes it's not just "okay, let's run it on AWS and everything will be fine".

Because to run an LLM, especially if you want to run it locally, you need a very powerful GPU. Or there are open source models, serverless inference; there are a lot of options.

There are a lot of options you can take into account, and that's really important, because for us, for example, evaluating 10,000 sessions per week at £2 per session is not conceivable.

And then consider everything you need to make your product production-ready. That's a slightly different aspect, because you need to monitor it, you need to check accuracy; everything has to be perfect, because otherwise, in our case, tutors will complain: why am I being evaluated by a machine that is doing something weird? And all of the above applies to all the projects we are working on, including the AI Tutor.

As I said, we are using GPT-4 for the AI Tutor because we need its reasoning capability, which is definitely better compared to Mistral, Mixtral, Llama, and the rest. Will that always be the case? We don't know. Perhaps in a few months' time Llama 3 will be so amazing that we'll use it instead of GPT-4. But for now, GPT-4 is a good option.
