How do you test the intelligence of an LLM? The answer: benchmarks such as MMLU, HumanEval, and AGIEval.
Whether it's GPT-4 or Llama 2, creators typically begin by highlighting their LLMs' benchmark scores in their research papers.
But how do you build a benchmark? Most benchmarks work by putting LLMs through various human-level examinations.
A majority of these benchmarks, primarily originating in the US, are assembled from human examinations. For instance, MMLU assesses 57 tasks, encompassing subjects such as elementary mathematics, US history, computer science, and law. Similarly, AGIEval draws on assessments like the SAT, LSAT, and other examinations, including the Chinese College Entrance Exam (Gaokao), law school admission tests, math competitions, and national civil service assessments.