The 25-year-old is a research engineer specializing in AI safety at the Center for AI Safety (CAIS) in the U.S. Before joining CAIS he interned at major companies like Samsung and Twitter. He now works with Dan Hendrycks, director of CAIS and an advisor to Musk’s xAI and Alexandr Wang’s Scale AI.
Humanity’s Last Exam (HLE), a collaboration between CAIS and Scale AI, comprises 3,000 challenging questions spanning over 100 disciplines, including classics, ecology, mathematics, and physics. The test assesses AI’s knowledge, reasoning and critical thinking. More than 1,000 professors and experts from 500 leading Western institutions like Stanford, Harvard, Princeton, MIT, and Oxford have contributed to it.
"I am proud to be a Vietnamese contributing to such a significant project," Long says. "Beyond tracking AI capabilities, this will influence AI safety policies and competition among tech companies."
Portrait of Phan Nguyen Hoang Long. Photo courtesy of Long
Long moved to the U.S. in 2015 after completing secondary school in HCMC. In 2018 he was admitted to several top U.S. universities and received financial aid worth US$168,000 for four years. Initially pursuing electrical engineering at Case Western Reserve University, he soon realized his passion lay in computer science. "I love building small, innovative tech projects and working with constantly evolving technologies," he says.
After his first year he interned with KiKi, Zalo’s virtual assistant project, which deepened his interest in AI. He then switched to computer science and focused on AI research, spending two years studying research papers from leading labs like Google DeepMind daily. To enhance his understanding, he replicated and experimented with code from top research papers. He also collaborated on crypto, Web3 and NFT projects, refining his programming skills.
These experiences strengthened his resume, enabling him to secure internships at major institutions like the U.S. National Institutes of Health (NIH), Samsung and Twitter, competing against master’s and PhD candidates. Trinh Hoang Trieu, a researcher at Google DeepMind and Long’s early mentor, praised his technical expertise. "Long excels at working with large teams on complex projects. He quickly grasps and challenges new research trends, but what stands out most is his passion and diligence."
Each internship shaped Long’s career. His first high-impact research paper, which later helped him land interviews at top companies, focused on AI applications in natural language processing for biology at NIH. At Twitter, despite mass layoffs following Musk’s takeover, Long chose to stay and absorb as much AI knowledge as possible. "Even though my research direction has shifted since, the knowledge gained remains an essential foundation," he says.
By 2022 he had published over 10 research papers. Determined to work with AI leaders, he aimed to join top-tier institutions. His persistence led him to CAIS in 2023, where he tackled major AI safety challenges. Despite offers from top AI firms, he chose to stay at CAIS to continue learning from Hendrycks.
Long and Dan Hendrycks at the Center for AI Safety office, February 2025. Photo courtesy of Long
In 2024 he took on his most significant career challenge: spearheading HLE. The idea for HLE had emerged from a conversation between Hendrycks and Musk about the limitations of existing AI evaluation benchmarks.
Many saw the task of coordinating over 1,000 global researchers across multiple fields as technically and logistically impossible, but Long was confident he could manage both software development and AI research responsibilities.
"I built a user-friendly website for senior professors and experts, managed backend systems and evaluated AI performance. My solid programming background enabled me to develop an accessible application," he explains.
One of his biggest challenges was mastering multiple disciplines. As leading academics crafted the test questions, he had to study advanced mathematics, physics, and chemistry to ensure quality and consistency.
Another challenge was managing expectations: "Coordinating with experts while ensuring satisfaction across the project was extremely difficult.
"Working directly with figures like Dan Hendrycks and Alexandr Wang meant every presentation had to be flawless, and I had to be ready for discussions at any time."
AI research is demanding, he notes. Researchers must keep pace with rapid advancements and convince experts to feature their work at global conferences. Yet he believes the ultimate reward is seeing research recognized and debated in influential forums.
On Jan. 31 OpenAI’s o3-mini model scored 13% on HLE. By Sunday, with support from Deep Research and Python integration, its accuracy jumped to 26.6%. Previously, no AI model had surpassed 10% on the test. Given AI’s rapid progress, he and his colleagues predict that models could exceed 50% by year-end.
Still, he views HLE as a critical benchmark before AI can be trusted for advanced roles in research, engineering and systems operations. He believes AI is a fascinating field but warns that to stay competitive, researchers must be well-prepared, passionate and proactive. He advises aspiring AI researchers to seek institutions aligned with their interests. "Do not be discouraged if you are interested in AI but lack knowledge—this is a field where you can learn quickly with effort."
Looking back, he attributes his journey to perseverance, consistency and commitment to his goals.
"I have faced career setbacks, failed ideas and unsuccessful job interviews. But I never gave up on my goal of working on globally impactful projects."