Benchmarking LLMs for HR Tech - Part 1

Every week brings new advancements in the realm of LLMs, whether through the introduction of new models or the enhancement of existing ones. Over the past several months, our team has dedicated considerable effort to benchmarking a variety of LLMs. Here, we share some recent insights.

Methodology

Our focus was on how effectively LLMs match candidates' resumes with job descriptions (JDs). We used a dataset of 10 resumes and 10 JDs, split across tech and non-tech roles:

  • Resumes: 4 entry-level, 3 mid-level, 3 executive-level

  • JDs: 4 core tech roles, 3 non-tech roles, 3 semi-tech roles

Models were assessed against a benchmark built from information pre-extracted manually from the same set of documents. Scores above 80 indicate performance at or above the level of manual extraction.
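Below is a minimal sketch of how such a scoring step can work, assuming extraction results arrive as flat key-value dictionaries. The field names, the exact-match comparison, and the 0-100 scale are illustrative assumptions, not our exact rubric.

```python
# Illustrative scoring sketch: compare an LLM's extracted resume fields
# against a manually curated benchmark for the same document.
# NOTE: field names, exact-match comparison, and the 0-100 scale are
# assumptions for illustration, not the precise rubric used in this study.

def score_extraction(llm_fields: dict, benchmark_fields: dict) -> float:
    """Return a 0-100 score: the percentage of benchmark fields the LLM
    reproduced exactly (case-insensitive)."""
    if not benchmark_fields:
        return 0.0
    matched = sum(
        1
        for key, expected in benchmark_fields.items()
        if str(llm_fields.get(key, "")).strip().lower()
        == str(expected).strip().lower()
    )
    return 100.0 * matched / len(benchmark_fields)

# Hypothetical example documents:
benchmark = {"name": "Jane Doe", "years_experience": "7", "primary_skill": "Python"}
llm_output = {"name": "Jane Doe", "years_experience": "7", "primary_skill": "Java"}
print(score_extraction(llm_output, benchmark))  # ~66.7 -- below the 80 bar
```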

[Chart created with GPT-4's Data Analyst plugin]

Key Findings

Here's a summary of the performance analysis across the 13 LLMs:

  • Structural Integrity in Responses: Average accuracy: 76.5%. Many models demonstrated excellent structural accuracy, with some achieving 100%. Top performers: GPT-4 and Claude3 (100%)

  • JSON Response Accuracy: Average accuracy: 68%. Several models parsed into valid JSON on every run (a simple validity check is sketched after this list). Top performers: GPT-4 and Claude3 (100%)

  • Hallucinations per 10 Runs: Average rate: 36.5%, meaning 3-4 responses out of 10 might contain inaccurate or fabricated information. Rates varied widely; lower is better. Top performer: Claude3-sonnet (15%)

  • Overall Accuracy: Average: 59.6%. There was a broad range in model reliability. Top performer: Claude3-sonnet (95%)

  • Useful Additional Information: On average, 52.3% of responses provided valuable additional context or information. Top performer: Llama-2-70b-chat (80%)

  • Keywords Missed in Summarization: Models missed about 53.5% of critical keywords on average, indicating substantial room for improvement in summarization; lower is better (a simple way to compute this metric is sketched below). Top performer: Claude3-haiku (20%)
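As referenced in the list above, here is a minimal sketch of two of these checks, assuming responses are expected as raw JSON strings and that a per-document list of critical keywords is available. The sample summary and keyword list are hypothetical.

```python
import json

def is_valid_json(response: str) -> bool:
    """JSON Response Accuracy: does the raw model response parse at all?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def keyword_miss_rate(summary: str, keywords: list[str]) -> float:
    """Keywords Missed: percentage of critical keywords absent from a summary."""
    if not keywords:
        return 0.0
    missed = [kw for kw in keywords if kw.lower() not in summary.lower()]
    return 100.0 * len(missed) / len(keywords)

# Hypothetical response and keyword list:
summary = "Senior backend engineer with strong Python and AWS experience."
print(is_valid_json('{"match_score": 82}'))                         # True
print(keyword_miss_rate(summary, ["Python", "AWS", "Kubernetes"]))  # ~33.3
```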

Key Takeaways:

  • Claude3-sonnet excels in generating accurate results with minimal hallucinations, proving highly effective for precise data matching.

  • GPT-4 and GPT-4-turbo-preview show strong performance in maintaining structure and JSON parsing accuracy, though they could benefit from fewer hallucinations (a simple grounding check is sketched below).
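One lightweight mitigation worth trying (a sketch under our assumptions, not the method behind the numbers above) is a grounding check: flag any extracted value that never appears in the source document, then retry or route it for review.

```python
# Grounding-check sketch: flag extracted values that cannot be found
# verbatim in the source resume -- likely hallucinations worth a retry.
# The documents and field names below are hypothetical.

def ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return the keys whose values never appear in the source document."""
    haystack = source_text.lower()
    return [
        key
        for key, value in extracted.items()
        if str(value).strip() and str(value).strip().lower() not in haystack
    ]

resume = "Jane Doe. 7 years of Python development at Acme Corp."
fields = {"name": "Jane Doe", "employer": "Acme Corp", "degree": "MBA"}
print(ungrounded_fields(fields, resume))  # ['degree'] -- 'MBA' is not in the resume
```

A verbatim substring check is crude (it misses paraphrases), but it is cheap and catches outright fabrications.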

Stay tuned for more updates as we continue to explore the evolving landscape of language models!

Visit us - Web: www.pricesenz.com | LinkedIn: PriceSenz
