From the course: Responsible Generative AI and Local LLMs

Coding ELO in Python

"

- [Instructor] A lot of people are excited about Chatbot Arena because it's a great way to do head-to-head testing for large language models. We can see commercial, proprietary models go up against open source models, and what's really cool is that we can dive into who's better from a numerical perspective. The reason we're able to do that is Elo. Elo is a strength-of-schedule metric: what it really scores is how well you do against your opponents, not just your raw record. If you have a 20-0 record but everyone you competed with was an amateur, that doesn't say as much as someone who has a few losses but has faced much tougher opponents. That's what Elo is really telling us, the strength of schedule. So if we go into Python, we can actually code this up ourselves. Let's take a look at how that would work.

Here we have a Fighter class, and we initialize it with a name and a rating. Then we have an update method where K is 50. That's the default value, and we'd adjust it depending on how many competitions there are: 32 is a good number for chess because there are many matches, but in the case of UFC there are very few, so maybe we want K to be a little larger. You would tune it based on your experience with the dataset. In a nutshell, if you win, your rating goes up by an amount that depends on your opponent's rating, and if you lose, it goes down by an amount that also depends on your opponent's rating. From there we create a couple of fighters, in this case Connor and Khabib. We print the ratings before the match, simulate what happens, update their ratings right afterwards, and then print everything out. So it's not a lot of code; most of it is actually print statements. If we run this, we can see that Connor has a 1500, Khabib has a 1500, the winner is Khabib, and after the match Connor regresses to 1475 while Khabib improves to 1523. You can see how, if you do multiple iterations, you get metrics that show numerically who is the best based on their opponents, not just on whether they won or lost. That's one of the powers of Elo: it's a great way to measure competition.

Now we can dive into IPython as well and play around with this a little bit. If we say from elo import Fighter, we can make an instance of the class and experiment with it: we'll create one fighter, copy the line, and make a second one. This is what's nice about using an interactive terminal or something like Jupyter for these kinds of things: the object is in memory, and it's kind of fun to play around with it and start experimenting. In this case, we can again figure out who won, and, one of the cool things you can do with IPython, we can also look at __dict__ to inspect the state and see that it's just a dictionary with the name and the rating.
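Here is a minimal sketch of the kind of code the walkthrough describes, assuming the file is called elo.py, the class is called Fighter, the starting rating is 1500, and K is 50; the actual course file may differ in names and details.

# elo.py -- a minimal sketch of the Fighter class described in the video.
class Fighter:
    K = 50  # update step size; tune per dataset (32 is a common choice for chess)

    def __init__(self, name, rating=1500):
        self.name = name
        self.rating = rating

    def expected_score(self, opponent):
        # Probability of beating the opponent, given the rating gap.
        return 1 / (1 + 10 ** ((opponent.rating - self.rating) / 400))

    def update(self, opponent, won):
        # Winners gain and losers lose an amount weighted by how surprising
        # the result was, given the opponent's rating.
        actual = 1 if won else 0
        self.rating += self.K * (actual - self.expected_score(opponent))


if __name__ == "__main__":
    connor = Fighter("Connor")
    khabib = Fighter("Khabib")
    print("Before:", connor.name, connor.rating, "|", khabib.name, khabib.rating)
    # Khabib wins; update the loser first, then the winner against the
    # loser's already-adjusted rating.
    connor.update(khabib, won=False)
    khabib.update(connor, won=True)
    print("After: ", connor.name, connor.rating, "|", khabib.name, khabib.rating)

Run as a script, this starts both fighters at 1500 and, after Khabib's win, drops Connor to 1475 and lifts Khabib to roughly 1523, in line with the numbers mentioned above. Note that the winner's exact figure depends on whether his update is computed before or after the loser's rating has been adjusted; updating both from the pre-match ratings would give 1525 instead.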
So, Elo isn't a magic metric; it's pretty simple to code up yourself. You can see here that this is just how the score is computed, but it's a very intuitive rating, and it's a great metric for judging large language models, both open source and proprietary, head-to-head against each other to see how they actually perform in a real-world scenario.
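The interactive session would look roughly like this in IPython, again assuming the sketch above lives in elo.py:

In [1]: from elo import Fighter

In [2]: connor = Fighter("Connor", 1500)

In [3]: khabib = Fighter("Khabib", 1500)

In [4]: khabib.__dict__
Out[4]: {'name': 'Khabib', 'rating': 1500}

Inspecting __dict__ is a quick way to confirm that the object's state is nothing more than the name and the rating.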
