NeurIPS 2024 Workshop on

Large Foundation Models for Educational Assessment

Date: December 15, 2024
Location: MTG 19&20






Overview

Advanced generative artificial intelligence (AI) techniques, such as large language models and large multimodal models, are transforming many aspects of educational assessment. The integration of AI into education has the potential to revolutionize not only test development and evaluation but also the way students learn. Over the past years, successful adoptions of machine learning in this area include using natural language processing for automated scoring and applying collaborative filtering to predict student responses. The rapid advances of large foundation models (e.g., ChatGPT, GPT-4, Llama, Gemini) demonstrate the potential of intelligent assessment with data-driven AI systems. These models could benefit test construct identification, automatic item generation, multimodal item design, automated scoring, and assessment administration. Meanwhile, new research challenges arise at the intersection of AI and educational assessment. For instance, the explainability and accountability of current large foundation models are still inadequate to convince stakeholders in the educational ecosystem, which limits the adoption of AI techniques in large-scale assessments. It also remains unclear whether large foundation models can assist with complex assessment tasks that involve creative thinking or higher-order reasoning. Tackling these research challenges will require collaborative efforts from researchers and practitioners in both AI and educational assessment.


This one-day workshop provides a forum for researchers from AI and educational assessment to review and discuss recent advances in applying large foundation models to educational assessment. The workshop includes keynote talks and peer-reviewed papers (oral and poster). Original, high-quality contributions are solicited on the following topics:

  • Large foundation models for automated scoring
  • Large foundation models for automated item generation
  • Large foundation models for computerized adaptive testing
  • Large foundation models for educational content generation
  • Large foundation models for knowledge tracing
  • Large foundation models for creating technology-enhanced items
  • Knowledge augmentation of large models for educational assessment
  • Knowledge editing of large models for educational assessment
  • Fine-tuning large foundation models for educational assessment
  • Generative AI for assessment security and accountability
  • Trustworthy AI (Fairness, Explainability, Privacy) for educational assessment

Invited Speakers




Hong Jiao

University of Maryland

Diyi Yang

Stanford University

Susan Lottridge

Cambium Assessment

Vered Shwartz

The University of British Columbia

James Sharpnack

Duolingo

Schedule (tentative)

Date: Dec. 15, 2024
Location: MTG 19&20

9:00AM - 9:10AM Opening Remarks
9:10AM - 9:50AM Keynote Talk 1 (Dr. Diyi Yang)
9:50AM - 10:30AM Keynote Talk 2 (Dr. Vered Shwartz)
10:30AM - 11:00AM Coffee Break
11:00AM - 11:40AM Keynote Talk 3 (Dr. Hong Jiao)
11:40AM - 12:25PM Oral Paper Session 1 (3 papers, 15 minutes per paper)
12:25PM - 1:00PM Poster Session
1:00PM - 2:00PM Lunch Break (Poster Session continues)
2:00PM - 2:40PM Keynote Talk 4 (Dr. Susan Lottridge)
2:40PM - 3:25PM Oral Paper Session 2 (3 papers, 15 minutes per paper)
3:25PM - 4:00PM Coffee Break
4:00PM - 4:40PM Keynote Talk 5 (Dr. James Sharpnack)
4:40PM - 5:25PM Oral Paper Session 3 (3 papers, 15 minutes per paper)
5:25PM - 5:30PM Closing Remarks


Oral Paper Sessions (Time | Title | Authors)

Session 1: 11:40AM - 12:25PM
  • 11:40AM - 11:55AM | PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals | Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M Murphy, Nev Jones, Kate V Hardy, Hong Shen, Fei Fang, Zhiyu Chen
  • 11:55AM - 12:10PM | Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs | Elisabetta Mazzullo, Okan Bulut
  • 12:10PM - 12:25PM | MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation | Aniket Deroy, Subhankar Maity, Sudeshna Sarkar

Session 2: 2:40PM - 3:25PM
  • 2:40PM - 2:55PM | DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing | Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh
  • 2:55PM - 3:10PM | Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring | Jiazheng Li, Hainiu Xu, Zhaoyue Sun, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He
  • 3:10PM - 3:25PM | Leveraging Grounded Large Language Models to Automate Educational Presentation Generation | Eric Xie, Guangzhi Xiong, Haolin Yang, Aidong Zhang

Session 3: 4:40PM - 5:25PM
  • 4:40PM - 4:55PM | Enhancing Non-Cognitive Assessments with GPT: Innovations in Item Generation and Translation for the University Belonging Questionnaire | Mingfeng Xue, Yunting Liu, Huaxia Xiong
  • 4:55PM - 5:10PM | A Graph-Based Foundation Model for Sample-Efficient Adaptive Learning | Jean Vassoyan, Anan Schütt, Jill-Jênn Vie, Arun Balajiee Lekshmi Narayanan, Nicolas Vayatis, Elisabeth Andre
  • 5:10PM - 5:25PM | Generating Reading Assessment Passages Using a Large Language Model | Ummugul Bezirhan, Matthias von Davier

Keynote Talk 1: Teaching Social Skill via Large Language Models, Diyi Yang
Today, social skills are essential to success both on the job and in life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Existing mechanisms for practice and feedback largely rely on expert supervision, making training difficult to scale, especially given the shortage of trained professionals. In this talk, I share two of our recent works on social skill training using LLMs. The first explores how to empower therapists to learn therapy skills with LLM-powered feedback, and the second looks at training people in conflict resolution skills via simulated practice. We conclude by discussing risks, concerns, and mitigation strategies related to LLM-based simulation for social skill training.

Keynote Talk 2: Should an LLM grade your students’ exams?, Vered Shwartz
Evaluating and grading students’ work is a difficult and time-consuming task. The general-purpose nature of large language models (LLMs), along with their vast knowledge across a wide range of domains, positions them as strong candidates for automatic assessment of free-text answers. However, there are various limitations pertaining to the reliability of LLMs as evaluators, as well as fairness issues that arise from LLM-based automated assessment. This talk will discuss several factors that need to be considered when deciding whether and how to use LLMs for this task.

Keynote Talk 3: AI-Enhanced Practices in Educational Assessment, Hong Jiao
Assessments of student learning outcomes are expected to be valid, reliable, accurate, and efficient. Technology has played a significant role in shaping and enhancing educational assessment practices. As the landscape of artificial intelligence (AI) evolves rapidly, AI technology is revolutionizing the test development process, particularly in assessment design, item development, implementation, and psychometric analysis. This presentation will first demonstrate the transformative impact of AI on educational assessment by presenting successful use cases of AI in test development, such as automated scoring, cheating detection, and process data analysis. Next, the presentation will explore additional possibilities and opportunities that AI can bring to educational assessment practices, including the use of generative AI for item generation and item parameter prediction modeling. Finally, the presentation will address the challenges in applying AI to educational assessment, highlighting potential bias and fairness issues as well as ethical considerations for the responsible use of AI in large-scale test development. Throughout, it will emphasize the importance of thoughtful implementation and continuous evaluation to ensure the validity, reliability, and fairness of AI-powered assessment systems.

Keynote Talk 4: Building AI Applications for Large Scale Assessment: A Case Study in Writing Feedback, Susan Lottridge
Building AI applications for large-scale assessment requires substantial effort not just from data science but also from psychometrics, hand-scoring, item development, UI/UX, and machine learning engineering, to ensure that products support trustworthy AI (e.g., the NIST AI Risk Management Framework) in ways that are also cost effective. This presentation will discuss the key elements and steps our team took when designing and building a writing feedback tool called "Write On with Cambi!", which walks students through reviewing their essay using structured feedback and highlights organizational and grammatical elements in the essay to emphasize areas for review. The presentation will cover how the overall purpose of the product drove the design, structure, and human labeling of the annotations; the AI modeling using fine-tuned, lightweight open-source models; the evaluation of the human and AI annotations, including bias evaluations; the mapping of elements to structured feedback; the UI/UX; and finally the efficacy studies conducted with students and teachers.

Keynote Talk 5: Building a High-Stakes Online Test with Human-in-the-Loop AI, James Sharpnack
In this talk, we present a comprehensive overview of the construction of the Duolingo English Test (DET), a high-stakes, large-scale, fully online English language proficiency test built using human-in-the-loop (HIL) AI. The DET is a computerized adaptive test where test takers respond to items designed to assess their proficiency in speaking, writing, reading, and listening. Human oversight plays a critical role in ensuring the DET's fairness, lack of bias, construct validity, and security. Additionally, AI and foundation models enable the scalability of key processes, including real-time security features and automated scoring. We will take a tour of this process, organized into five parts. First, items are constructed using generative AI models like GPT, followed by human review to evaluate potential bias and fairness. Second, machine learning models automatically grade items by predicting expert human ratings based on construct-aligned AI features. Third, items are calibrated using explanatory item response theory models and custom item embeddings, aligning them with historical test taker performance. Fourth, tests are administered via Thompson sampling, motivated by framing computerized adaptive testing as a contextual bandit problem, and scored using a fully Bayesian approach. Finally, automated AI signals, such as eye-gaze tracking and LLM response detection, support proctors in detecting cheating and ensuring adherence to test rules. Throughout this process, humans and AI collaborate seamlessly to maintain the DET's integrity and effectiveness.
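To make the adaptive administration step concrete, the following Python snippet is a minimal sketch of Thompson sampling for item selection in a computerized adaptive test framed as a bandit problem. It is not the DET's implementation: it assumes a simple Rasch (1PL) item response model, a Gaussian ability posterior updated on a grid, and item Fisher information as the selection criterion, and all names and parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical item bank: each item has a Rasch difficulty parameter.
item_difficulties = rng.normal(0.0, 1.0, size=50)

def prob_correct(theta, b):
    """Rasch (1PL) model: probability of a correct response at ability theta."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def item_information(theta, b):
    """Fisher information of a Rasch item at ability theta."""
    p = prob_correct(theta, b)
    return p * (1.0 - p)

def select_item_thompson(mu, sigma, administered):
    """Thompson sampling: draw a plausible ability from the current posterior,
    then pick the unadministered item that is most informative at that draw."""
    theta_sample = rng.normal(mu, sigma)
    best, best_info = None, -np.inf
    for j, b in enumerate(item_difficulties):
        if j in administered:
            continue
        info = item_information(theta_sample, b)
        if info > best_info:
            best, best_info = j, info
    return best

def update_posterior(mu, sigma, b, response, grid=np.linspace(-4, 4, 401)):
    """Grid-based Bayesian update after one response; the posterior is carried
    forward as a Gaussian approximation (mean and standard deviation)."""
    prior = np.exp(-0.5 * ((grid - mu) / sigma) ** 2)
    p = prob_correct(grid, b)
    likelihood = p if response == 1 else 1.0 - p
    post = prior * likelihood
    post /= post.sum()
    new_mu = float((grid * post).sum())
    new_sigma = float(np.sqrt(((grid - new_mu) ** 2 * post).sum()))
    return new_mu, new_sigma

# Simulate a short adaptive test for one test taker with true ability 0.8.
true_theta, mu, sigma, administered = 0.8, 0.0, 1.0, set()
for _ in range(10):
    j = select_item_thompson(mu, sigma, administered)
    administered.add(j)
    response = int(rng.random() < prob_correct(true_theta, item_difficulties[j]))
    mu, sigma = update_posterior(mu, sigma, item_difficulties[j], response)
print(f"Posterior ability estimate after 10 items: {mu:.2f} (sd {sigma:.2f})")
```

Operational systems such as the DET use richer item models, explanatory IRT calibration with item embeddings, and exposure and content constraints; the sketch only conveys the sample-then-select pattern behind the bandit framing.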

To ensure the accessibility of our workshop for virtual attendees, we will stream all presentations and facilitate questions from online attendees via Rocketchat.

Call for Papers

Submission URL: Please submit your work via OpenReview.

Format: All submissions must be in PDF format and anonymized. Submissions are limited to nine content pages, including all figures and tables; unlimited additional pages containing references and supplementary materials are allowed. Reviewers may choose to read the supplementary materials but will not be required to.

Style file: You must format your submission using the NeurIPS 2024 LaTeX style file. The maximum file size for submissions is 50MB. Submissions that violate the NeurIPS style (e.g., by decreasing margins or font sizes) or page limits may be rejected without further review.

Double-blind reviewing: The reviewing process will be double blind at the level of reviewers (i.e., reviewers cannot see author identities). Authors are responsible for anonymizing their submissions. In particular, they should not include author names, author affiliations, or acknowledgements in their submissions and they should avoid providing any other identifying information (even in the supplementary material).

Important Dates:

  • Submission Deadline: Sept 15, 2024
  • Notification: Oct 9, 2024
  • Camera-ready Deadline: Oct 30, 2024
  • Event: Dec 15, 2024

Camera Ready:

    You should format your camera-ready paper with the FM-Assess LaTeX camera-ready template (download). However, if you opt in to the PMLR proceedings, you should use the PMLR template (download).

Workshop Organizers




Sheng Li

University of Virginia

Zhongmin Cui

CFA Institute

Jiasen Lu

Apple Inc.

Deborah Harris

University of Iowa

Daiqing Qi

University of Virginia

Dongliang Guo

University of Virginia

Program Committee Members

Saed Rezayi

NBME

Peter Organisciak

University of Denver

Jiasen Lu

Allen Institute for Artificial Intelligence

Sheng Li

University of Virginia, Charlottesville

Dongliang Guo

University of Virginia, Charlottesville

Okan Bulut

University of Alberta

Wanjing Anya Ma

Stanford University

Xuansheng Wu

University of Georgia

Hariram Veeramani

University of California, Los Angeles

Wei Ai

University of Maryland, College Park

Jiawei Xiong

University of Georgia

Daiqing Qi

University of Virginia, Charlottesville

Shumin Jing

University of California, Santa Cruz

Jiangang Hao

Educational Testing Service

Deborah J Harris

University of Iowa

Matthias von Davier

Boston College

Zhongmin Cui

CFA Institute

Scott Crossley

Vanderbilt University

Guangya Wan

University of Virginia, Charlottesville

Reiji Suzuki

Nagoya University