Phan Nguyen Hoang Long is a co-lead author of the study behind what has been widely described as "the world’s toughest exam" for AI.
In the paper, titled "A benchmark of expert-level academic questions to assess AI capabilities," published by the international journal Nature on Jan. 28, Long and an international team of researchers introduced Humanity’s Last Exam (HLE), a benchmark aimed at evaluating the knowledge and reasoning abilities of large language models (LLMs) such as ChatGPT, Gemini, and Grok at a research and expert level.
HLE is described in the paper as "a multimodal benchmark at the frontier of human knowledge," designed as a closed-ended academic test with broad subject coverage and expert-level difficulty.
The exam consists of 2,500 questions spanning dozens of disciplines, including mathematics, the humanities, and the natural sciences. Developed globally by subject-matter experts, HLE features multiple-choice and short-answer questions suitable for automated grading. Each question has a known, unambiguous solution that is easily verifiable but cannot be answered through simple internet retrieval.
More than 1,000 professors and researchers from over 500 leading universities and research institutions worldwide, including Stanford, Harvard, Princeton, MIT, and Oxford, contributed to the benchmark.
"This is a major milestone after five years of pursuing AI research, with the hope of doing work that is useful and has global impact," Long told VnExpress of his publication in Nature, a prestigious scientific journal founded more than 150 years ago that accepts only around 8% of submissions.
A graduate of Case Western Reserve University in the United States, Long is currently a research engineer specializing in AI safety at the Center for AI Safety (CAIS), which is led by Dan Hendrycks, an adviser to Elon Musk.
The project originated from an idea proposed by Musk and has been jointly developed since 2024 by CAIS and Scale AI, the AI startup founded by billionaire Alexandr Wang. Wang, who also heads Meta's superintelligence laboratory, serves as one of the project's advisers.
Vietnamese engineer Phan Nguyen Hoang Long in a photo he provided.
The New York Times has described HLE as so difficult that "when AI passes it, look out." The benchmark has since become one of the most influential tools used by companies such as DeepMind, OpenAI, and xAI when evaluating and launching new AI models.
In July 2025, xAI used HLE during the development of Grok 4.
According to Long, HLE provides a shared reference point for policymakers, helping to ground discussions on AI development trajectories, associated risks, and potential regulatory responses.
He said he plans to continue working in AI safety, which he believes will play a decisive role in shaping the technology's impact on society.
Results from HLE are publicly released and regularly updated. While top-tier human experts consistently score around 90%, current AI models still fall far short of that mark.
As of early this year, AI performance has improved but remains well below human proficiency, according to leaderboards published by Scale AI and Artificial Analysis, an independent U.S.-based AI benchmarking firm.
Gemini 3 Pro currently scores between 37.5% and 38.3% on HLE, while GPT-5.2 records a score of around 35.4%. Claude 4 Opus achieves approximately 25.2%, though specialized reasoning variants of Claude perform slightly better. Zoom AI has reached a score of 48.1%.
Grok 4 has demonstrated industry-leading performance on the benchmark, with reported scores ranging from 38.6% to 50.7%.
Founded in 1869, Nature is a multidisciplinary scientific journal that publishes pioneering research across the sciences. Articles are selected based on strict criteria for novelty, scientific significance, methodological rigor, and broad relevance to the global scientific community.