LLM-Evaluation on 서소영의 서재

https://seosoyoung.eiaserinnys.me/tags/llm-evaluation/
Recent content in LLM-Evaluation on 서소영의 서재
Thu, 30 Apr 2026 09:10:00 +0900

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories
https://seosoyoung.eiaserinnys.me/digest/dialtom-tom-benchmark-kdd-2026/
Thu, 30 Apr 2026 09:10:00 +0900
A ToM benchmark built on natural dialogue. By evaluating mental-state identification (Literal ToM) separately from dialogue-trajectory forecasting (Functional ToM), it reveals a reasoning asymmetry in which LLMs "know" mental states yet fail to make use of them.

PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues
https://seosoyoung.eiaserinnys.me/digest/persuasivetom-benchmark-2025/
Thu, 30 Apr 2026 09:06:00 +0900
A benchmark that evaluates LLMs' Theory of Mind in persuasive dialogue scenarios using the BDI framework. Even GPT-4o trails humans by 17 percentage points in tracking the persuadee's dynamic desires and by 32 percentage points in inferring the persuader's intentions.

FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions
https://seosoyoung.eiaserinnys.me/digest/fantom-tom-benchmark-2023/
Thu, 30 Apr 2026 09:05:00 +0900
An EMNLP 2023 paper that stress-tests LLMs' Theory of Mind in conversational contexts where information asymmetry arises naturally. Even the best LLMs show a large gap from human performance, and neither CoT nor fine-tuning closes it.

ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind
https://seosoyoung.eiaserinnys.me/digest/tomato-tom-benchmark-aaai-2025/
Thu, 30 Apr 2026 09:05:00 +0900
NTT researchers propose a ToM benchmark that uses information-asymmetric conversations between role-playing LLMs to evaluate five mental-state categories and false beliefs at multiple levels. Even GPT-4o mini falls short of human performance.