Benchmarking LLMs for HR Tech - Part 1

Every week brings new advancements in the realm of LLMs, whether through the introduction of new models or the enhancement of existing ones. Over the past several months, our team has dedicated considerable effort to benchmarking a variety of LLMs. Here, we share some recent insights.

Methodology

Our focus was on how effectively LLMs match candidates' resumes with job descriptions (JDs). We used a dataset of 10 resumes and 10 JDs, split across tech and non-tech roles:

  • Resumes: 4 entry-level, 3 mid-level, 3 executive-level

  • JDs: 4 core tech roles, 3 non-tech roles, 3 semi-tech roles

Models were assessed against a benchmark built from information pre-extracted manually from the same set of documents. Scores above 80 indicate performance at or above the level of manual extraction.
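Below is a minimal sketch of how such a scoring step can work, assuming extraction results arrive as flat key-value dictionaries. The field names, the exact-match comparison, and the 0-100 scale are illustrative assumptions, not our exact rubric.

```python
# Illustrative scoring sketch: compare an LLM's extracted resume fields
# against a manually curated benchmark for the same document.
# NOTE: field names, exact-match comparison, and the 0-100 scale are
# assumptions for illustration, not the precise rubric used in this study.

def score_extraction(llm_fields: dict, benchmark_fields: dict) -> float:
    """Return a 0-100 score: the percentage of benchmark fields the LLM
    reproduced exactly (case-insensitive)."""
    if not benchmark_fields:
        return 0.0
    matched = sum(
        1
        for key, expected in benchmark_fields.items()
        if str(llm_fields.get(key, "")).strip().lower()
        == str(expected).strip().lower()
    )
    return 100.0 * matched / len(benchmark_fields)

# Hypothetical example documents:
benchmark = {"name": "Jane Doe", "years_experience": "7", "primary_skill": "Python"}
llm_output = {"name": "Jane Doe", "years_experience": "7", "primary_skill": "Java"}
print(score_extraction(llm_output, benchmark))  # ~66.7 -- below the 80 bar
```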

[Chart created with GPT-4's Data Analyst plugin]

Key Findings

Here's a summary of the performance analysis across the 13 LLMs:

  • Structural Integrity in Responses: Average accuracy: 76.5%. Many models demonstrated excellent structural accuracy, with some achieving 100%. Top performers: GPT-4 and Claude3 (100%)

  • JSON Response Accuracy: Average accuracy: 68%. Several models parsed into valid JSON on every run (a simple validity check is sketched after this list). Top performers: GPT-4 and Claude3 (100%)

  • Hallucinations per 10 Runs: Average rate: 36.5%, meaning 3-4 responses out of 10 might contain inaccurate or fabricated information. Rates varied widely; lower is better. Top performer: Claude3-sonnet (15%)

  • Overall Accuracy: Average: 59.6%. There was a broad range in model reliability. Top performer: Claude3-sonnet (95%)

  • Useful Additional Information: On average, 52.3% of responses provided valuable additional context or information. Top performer: Llama-2-70b-chat (80%)

  • Keywords Missed in Summarization: Models missed about 53.5% of critical keywords on average, indicating substantial room for improvement in summarization; lower is better (a simple way to compute this metric is sketched below). Top performer: Claude3-haiku (20%)
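As referenced in the list above, here is a minimal sketch of two of these checks, assuming responses are expected as raw JSON strings and that a per-document list of critical keywords is available. The sample summary and keyword list are hypothetical.

```python
import json

def is_valid_json(response: str) -> bool:
    """JSON Response Accuracy: does the raw model response parse at all?"""
    try:
        json.loads(response)
        return True
    except json.JSONDecodeError:
        return False

def keyword_miss_rate(summary: str, keywords: list[str]) -> float:
    """Keywords Missed: percentage of critical keywords absent from a summary."""
    if not keywords:
        return 0.0
    missed = [kw for kw in keywords if kw.lower() not in summary.lower()]
    return 100.0 * len(missed) / len(keywords)

# Hypothetical response and keyword list:
summary = "Senior backend engineer with strong Python and AWS experience."
print(is_valid_json('{"match_score": 82}'))                         # True
print(keyword_miss_rate(summary, ["Python", "AWS", "Kubernetes"]))  # ~33.3
```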

Key Takeaways:

  • Claude3-sonnet excels in generating accurate results with minimal hallucinations, proving highly effective for precise data matching.

  • GPT-4 and GPT-4-turbo-preview show strong performance in maintaining structure and JSON parsing accuracy, though they could benefit from fewer hallucinations (a simple grounding check is sketched below).
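One lightweight mitigation worth trying (a sketch under our assumptions, not the method behind the numbers above) is a grounding check: flag any extracted value that never appears in the source document, then retry or route it for review.

```python
# Grounding-check sketch: flag extracted values that cannot be found
# verbatim in the source resume -- likely hallucinations worth a retry.
# The documents and field names below are hypothetical.

def ungrounded_fields(extracted: dict, source_text: str) -> list[str]:
    """Return the keys whose values never appear in the source document."""
    haystack = source_text.lower()
    return [
        key
        for key, value in extracted.items()
        if str(value).strip() and str(value).strip().lower() not in haystack
    ]

resume = "Jane Doe. 7 years of Python development at Acme Corp."
fields = {"name": "Jane Doe", "employer": "Acme Corp", "degree": "MBA"}
print(ungrounded_fields(fields, resume))  # ['degree'] -- 'MBA' is not in the resume
```

A verbatim substring check is crude (it misses paraphrases), but it is cheap and catches outright fabrications.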

Stay tuned for more updates as we continue to explore the evolving landscape of language models!

Visit us - Web: www.pricesenz.com | LinkedIn: PriceSenz
