NeurIPS 2024 Workshop on

Large Foundation Models for Educational Assessment

Date: December 15, 2024
Location: MTG 19&20






Overview

Advanced generative artificial intelligence (AI) techniques, such as large language models and large multimodal models, are transforming many aspects of educational assessment. The integration of AI into education has the potential to revolutionize not only test development and evaluation but also the way students learn. Over the past years, successful adoptions of machine learning in this area include using natural language processing for automated scoring and applying collaborative filtering to predict student responses. The rapid advances of large foundation models (e.g., ChatGPT, GPT-4, Llama, Gemini) demonstrate the potential of intelligent assessment with data-driven AI systems. These models could benefit test construct identification, automatic item generation, multimodal item design, automated scoring, and assessment administration. Meanwhile, new research challenges arise at the intersection of AI and educational assessment. For instance, the explainability and accountability of current large foundation models are still inadequate to convince stakeholders in the educational ecosystem, which limits the adoption of AI techniques in large-scale assessments. It also remains unclear whether large foundation models can assist with complex assessment tasks that involve creative thinking or higher-order reasoning. Tackling these research challenges will require collaborative efforts from researchers and practitioners in both AI and educational assessment.


This one-day workshop provides a forum for researchers from AI and educational assessment to review and discuss recent advances in applying large foundation models to educational assessment. The workshop includes keynote talks and peer-reviewed papers (oral and poster). Original, high-quality contributions are solicited on the following topics:

  • Large foundation models for automated scoring
  • Large foundation models for automated item generation
  • Large foundation models for computerized adaptive testing
  • Large foundation models for educational content generation
  • Large foundation models for knowledge tracing
  • Large foundation models for creating technology-enhanced items
  • Knowledge augmentation of large models for educational assessment
  • Knowledge editing of large models for educational assessment
  • Fine-tuning large foundation models for educational assessment
  • Generative AI for assessment security and accountability
  • Trustworthy AI (Fairness, Explainability, Privacy) for educational assessment

Invited Speakers




Hong Jiao

University of Maryland

Diyi Yang

Stanford University

Susan Lottridge

Cambium Assessment

Vered Shwartz

The University of British Columbia

James Sharpnack

Duolingo

Schedule (tentative)

Date: Dec. 15, 2024
Location: MTG 19&20

9:00AM - 9:10AM Opening Remarks
9:10AM - 9:50AM Keynote Talk 1 (Dr. Diyi Yang)
9:50AM - 10:30AM Keynote Talk 2 (Dr. Vered Shwartz)
10:30AM - 11:00AM Coffee Break
11:00AM - 11:40AM Keynote Talk 3 (Dr. Hong Jiao)
11:40AM - 12:25PM Oral Paper Session 1 (3 papers, 15 minutes per paper)
12:25PM - 1:00PM Poster Session
1:00PM - 2:00PM Lunch Break (Poster Session continues)
2:00PM - 2:40PM Keynote Talk 4 (Dr. Susan Lottridge)
2:40PM - 3:25PM Oral Paper Session 2 (3 papers, 15 minutes per paper)
3:25PM - 4:00PM Coffee Break
4:00PM - 4:40PM Keynote Talk 5 (Dr. James Sharpnack)
4:40PM - 5:25PM Oral Paper Session 3 (3 papers, 15 minutes per paper)
5:25PM - 5:30PM Closing Remarks


Oral Paper Sessions (Time | Title | Authors)

Session 1: 11:40AM - 12:25PM
  • 11:40AM - 11:55AM | PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals | Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M Murphy, Nev Jones, Kate V Hardy, Hong Shen, Fei Fang, Zhiyu Chen
  • 11:55AM - 12:10PM | Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs | Elisabetta Mazzullo, Okan Bulut
  • 12:10PM - 12:25PM | MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation | Aniket Deroy, Subhankar Maity, Sudeshna Sarkar

Session 2: 2:40PM - 3:25PM
  • 2:40PM - 2:55PM | DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing | Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh
  • 2:55PM - 3:10PM | Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring | Jiazheng Li, Hainiu Xu, Zhaoyue Sun, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He
  • 3:10PM - 3:25PM | Leveraging Grounded Large Language Models to Automate Educational Presentation Generation | Eric Xie, Guangzhi Xiong, Haolin Yang, Aidong Zhang

Session 3: 4:40PM - 5:25PM
  • 4:40PM - 4:55PM | Enhancing Non-Cognitive Assessments with GPT: Innovations in Item Generation and Translation for the University Belonging Questionnaire | Mingfeng Xue, Yunting Liu, Huaxia Xiong
  • 4:55PM - 5:10PM | A Graph-Based Foundation Model for Sample-Efficient Adaptive Learning | Jean Vassoyan, Anan Schütt, Jill-Jênn Vie, Arun Balajiee Lekshmi Narayanan, Nicolas Vayatis, Elisabeth Andre
  • 5:10PM - 5:25PM | Generating Reading Assessment Passages Using a Large Language Model | Ummugul Bezirhan, Matthias von Davier

Keynote Talk 1: Teaching Social Skill via Large Language Models, Diyi Yang
Today, social skills are essential to success both on the job and in life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Existing mechanisms for practice and feedback largely rely on expert supervision, making training difficult to scale, especially given the shortage of trained professionals. In this talk, I share two of our recent works on social skill training using LLMs. The first explores how to empower therapists to learn therapy skills with LLM-powered feedback, and the second looks at training people in conflict resolution skills via simulated practice. We conclude by discussing risks, concerns, and mitigation strategies related to LLM-based simulation for social skill training.

Keynote Talk 2: Should an LLM grade your students’ exams?, Vered Shwartz
Evaluating and grading students’ work is a difficult and time-consuming task. The general-purpose nature of large language models (LLMs), along with their vast knowledge across a wide range of domains, positions them as strong candidates for automatic assessment of free-text answers. However, there are various limitations pertaining to the reliability of LLMs as evaluators, as well as fairness issues that arise from LLM-based automated assessment. This talk will discuss several factors that need to be considered when deciding whether and how to use LLMs for this task.

Keynote Talk 3: AI-Enhanced Practices in Educational Assessment, Hong Jiao
Assessments of student learning outcomes are expected to be valid, reliable, accurate, and efficient. Technology has played a significant role in shaping and enhancing educational assessment practices. As the landscape of artificial intelligence (AI) evolves rapidly, AI technology is revolutionizing the test development process, particularly in assessment design, item development, implementation, and psychometric analysis. This presentation will first demonstrate the transformative impact of AI on educational assessment by presenting successful use cases of AI in test development, such as automated scoring, cheating detection, and process data analysis. Next, the presentation will explore additional possibilities and opportunities that AI can bring to educational assessment practices, including the use of generative AI for item generation and item parameter prediction modeling. Finally, the presentation will address the challenges in applying AI to educational assessment, highlighting potential bias and fairness issues as well as ethical considerations for the responsible use of AI in large-scale test development. Throughout, it will emphasize the importance of thoughtful implementation and continuous evaluation to ensure the validity, reliability, and fairness of AI-powered assessment systems.

Keynote Talk 4: Building AI Applications for Large Scale Assessment: A Case Study in Writing Feedback, Susan Lottridge
Building AI applications for large-scale assessment requires substantial effort not just from data science but also from psychometrics, hand-scoring, item development, UI/UX, and machine learning engineering, to ensure that products support trustworthy AI (e.g., the NIST AI Risk Management Framework) in ways that are also cost effective. This presentation will discuss the key elements and steps our team took when designing and building a writing feedback tool called "Write On with Cambi!", which walks students through reviewing their essay using structured feedback and highlights organizational and grammatical elements in the essay to emphasize areas for review. The presentation will cover how the overall purpose of the product drove the design, structure, and human labeling of the annotations; the AI modeling using fine-tuned, lightweight open-source models; the evaluation of the human and AI annotations, including bias evaluations; the mapping of elements to structured feedback; the UI/UX; and finally the efficacy studies conducted with students and teachers.

Keynote Talk 5: Building a High-Stakes Online Test with Human-in-the-Loop AI, James Sharpnack
In this talk, we present a comprehensive overview of the construction of the Duolingo English Test (DET), a high-stakes, large-scale, fully online English language proficiency test built using human-in-the-loop (HIL) AI. The DET is a computerized adaptive test where test takers respond to items designed to assess their proficiency in speaking, writing, reading, and listening. Human oversight plays a critical role in ensuring the DET's fairness, lack of bias, construct validity, and security. Additionally, AI and foundation models enable the scalability of key processes, including real-time security features and automated scoring. We will take a tour of this process, organized into five parts. First, items are constructed using generative AI models like GPT, followed by human review to evaluate potential bias and fairness. Second, machine learning models automatically grade items by predicting expert human ratings based on construct-aligned AI features. Third, items are calibrated using explanatory item response theory models and custom item embeddings, aligning them with historical test taker performance. Fourth, tests are administered via Thompson sampling, motivated by framing computerized adaptive testing as a contextual bandit problem, and scored using a fully Bayesian approach. Finally, automated AI signals, such as eye-gaze tracking and LLM response detection, support proctors in detecting cheating and ensuring adherence to test rules. Throughout this process, humans and AI collaborate seamlessly to maintain the DET's integrity and effectiveness.
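To make the adaptive administration step concrete, the following Python snippet is a minimal sketch of Thompson sampling for item selection in a computerized adaptive test framed as a bandit problem. It is not the DET's implementation: it assumes a simple Rasch (1PL) item response model, a Gaussian ability posterior updated on a grid, and item Fisher information as the selection criterion, and all names and parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: each item has a Rasch difficulty parameter.
item_difficulties = rng.normal(0.0, 1.0, size=50)

def prob_correct(theta, b):
    """Rasch (1PL) model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = prob_correct(theta, b)
    return p * (1.0 - p)

def select_item_thompson(mu, sigma, administered):
    """Thompson sampling: draw a plausible ability from the current posterior,
    then pick the unadministered item that is most informative at that draw."""
    theta_sample = rng.normal(mu, sigma)
    best, best_info = None, -np.inf
    for j, b in enumerate(item_difficulties):
        if j in administered:
            continue
        info = item_information(theta_sample, b)
        if info > best_info:
            best, best_info = j, info
    return best

def update_posterior(mu, sigma, b, response, grid=np.linspace(-4, 4, 401)):
    """Grid-based Bayesian update after one response; the posterior is carried
    forward as a Gaussian approximation (mean and standard deviation)."""
    prior = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    p = prob_correct(grid, b)
    likelihood = p if response == 1 else 1.0 - p
    post = prior * likelihood
    post /= post.sum()
    new_mu = float((grid * post).sum())
    new_sigma = float(np.sqrt(((grid - new_mu) ** 2 * post).sum()))
    return new_mu, new_sigma

# Simulate a short adaptive test for one test taker with true ability 0.8.
true_theta, mu, sigma, administered = 0.8, 0.0, 1.0, set()
for _ in range(10):
    j = select_item_thompson(mu, sigma, administered)
    administered.add(j)
    response = int(rng.random() < prob_correct(true_theta, item_difficulties[j]))
    mu, sigma = update_posterior(mu, sigma, item_difficulties[j], response)
print(f"Posterior ability estimate after 10 items: {mu:.2f} (sd {sigma:.2f})")
```

Operational systems such as the DET use richer item models, explanatory IRT calibration with item embeddings, and exposure and content constraints; the sketch only conveys the sample-then-select pattern behind the bandit framing.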

To ensure the accessibility of our workshop for virtual attendees, we will stream all presentations and facilitate questions from online attendees via Rocketchat.

Call for Papers

Submission URL: Please submit your work via OpenReview.

Format: All submissions must be in PDF format and anonymized. Submissions are limited to nine content pages, including all figures and tables; unlimited additional pages containing references and supplementary materials are allowed. Reviewers may choose to read the supplementary materials but will not be required to.

Style file: You must format your submission using the NeurIPS 2024 LaTeX style file. The maximum file size for submissions is 50MB. Submissions that violate the NeurIPS style (e.g., by decreasing margins or font sizes) or page limits may be rejected without further review.

Double-blind reviewing: The reviewing process will be double blind at the level of reviewers (i.e., reviewers cannot see author identities). Authors are responsible for anonymizing their submissions. In particular, they should not include author names, author affiliations, or acknowledgements in their submissions and they should avoid providing any other identifying information (even in the supplementary material).

Important Dates:

  • Submission Deadline: Sept 15, 2024
  • Notification: Oct 9, 2024
  • Camera-ready Deadline: Oct 30, 2024
  • Event: Dec 15, 2024

Camera Ready:

    You should format your camera-ready paper with the FM-Assess LaTeX camera-ready template (download). However, if you opt in to the PMLR proceedings, you should use the PMLR template (download).

Workshop Organizers




Sheng Li

University of Virginia

Zhongmin Cui

CFA Institute

Jiasen Lu

Apple Inc.

Deborah Harris

University of Iowa

Daiqing Qi

University of Virginia

Dongliang Guo

University of Virginia

Program Committee Members

Saed Rezayi

NBME

Peter Organisciak

University of Denver

Jiasen Lu

Allen Institute for Artificial Intelligence

Sheng Li

University of Virginia, Charlottesville

Dongliang Guo

University of Virginia, Charlottesville

Okan Bulut

University of Alberta

Wanjing Anya Ma

Stanford University

Xuansheng Wu

University of Georgia

Hariram Veeramani

University of California, Los Angeles

Wei Ai

University of Maryland, College Park

Jiawei Xiong

University of Georgia

Daiqing Qi

University of Virginia, Charlottesville

Shumin Jing

University of California, Santa Cruz

Jiangang Hao

Educational Testing Service

Deborah J Harris

University of Iowa

Matthias von Davier

Boston College

Zhongmin Cui

CFA Institute

Scott Crossley

Vanderbilt University

Guangya Wan

University of Virginia, Charlottesville

Reiji Suzuki

Nagoya University