Advanced generative artificial intelligence (AI) techniques, such as large language models and large multimodal models, are transforming many aspects of educational assessment. The integration of AI into education has the potential to revolutionize not only test development and evaluation but also the way students learn. Over the past years, successful adoptions of machine learning in this area include using natural language processing for automated scoring and applying collaborative filtering to predict student responses. The rapid advances of large foundation models (e.g., ChatGPT, GPT-4, Llama, Gemini) demonstrate the potential of intelligent assessment built on data-driven AI systems. These models could benefit test construct identification, automatic item generation, multimodal item design, automated scoring, and assessment administration. Meanwhile, new research challenges arise at the intersection of AI and educational assessment. For instance, the explainability and accountability of current large foundation models are still inadequate to convince stakeholders in the educational ecosystem, which limits the adoption of AI techniques in large-scale assessments. It is also unclear whether large foundation models are capable of assisting with complex assessment tasks that involve creative thinking or higher-order reasoning. Tackling these research challenges will require collaborative efforts from researchers and practitioners in both AI and educational assessment.
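As a concrete illustration of the collaborative-filtering approach mentioned above, the following is a minimal sketch of predicting unobserved student responses by factorizing a sparse student-by-item matrix. The toy response matrix, latent dimension, and hyperparameters are entirely illustrative assumptions, not drawn from any system discussed at the workshop.

```python
# Minimal sketch: collaborative filtering for student response prediction.
# All data and hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Toy binary response matrix: rows = students, columns = items, NaN = unanswered.
R = np.array([
    [1, 0, 1, np.nan],
    [1, np.nan, 1, 0],
    [0, 0, np.nan, 1],
    [np.nan, 1, 1, 1],
], dtype=float)

n_students, n_items = R.shape
k = 2                                             # latent dimension
P = 0.1 * rng.standard_normal((n_students, k))    # student factors
Q = 0.1 * rng.standard_normal((n_items, k))       # item factors
lr, reg = 0.05, 0.02

observed = [(s, i) for s in range(n_students) for i in range(n_items)
            if not np.isnan(R[s, i])]

# Stochastic gradient descent on squared error over observed responses only.
for _ in range(500):
    for s, i in observed:
        err = R[s, i] - P[s] @ Q[i]
        P[s] += lr * (err * Q[i] - reg * P[s])
        Q[i] += lr * (err * P[s] - reg * Q[i])

# Predicted scores for the unanswered cells.
pred = P @ Q.T
for s in range(n_students):
    for i in range(n_items):
        if np.isnan(R[s, i]):
            print(f"student {s}, item {i}: predicted response {pred[s, i]:.2f}")
```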
This one-day workshop provides a forum for researchers from AI and educational assessment to review and discuss recent advances in applying large foundation models to educational assessment. The workshop includes keynote talks and peer-reviewed papers (oral and poster). Original high-quality contributions are solicited on the following topics:
Cambium Assessment
The University of British Columbia
Date: Dec. 15, 2024
Location: MTG 19&20
9:00AM - 9:10AM Opening Remarks
9:10AM - 9:50AM Keynote Talk 1 (Dr. Diyi Yang)
9:50AM - 10:30AM Keynote Talk 2 (Dr. Vered Shwartz)
10:30AM - 11:00AM Coffee Break
11:00AM - 11:40AM Keynote Talk 3 (Dr. Hong Jiao)
11:40AM - 12:25PM Oral Paper Session 1 (3 papers, 15 minutes per paper)
12:25PM - 1:00PM Poster Session
1:00PM - 2:00PM Lunch Break (Poster Session continues)
2:00PM - 2:40PM Keynote Talk 4 (Dr. Susan Lottridge)
2:40PM - 3:25PM Oral Paper Session 2 (3 papers, 15 minutes per paper)
3:25PM - 4:00PM Coffee Break
4:00PM - 4:40PM Keynote Talk 5 (Dr. James Sharpnack)
4:40PM - 5:25PM Oral Paper Session 3 (3 papers, 15 minutes per paper)
5:25PM - 5:30PM Closing Remarks
| Oral Session | Time | Title | Speaker | Note |
|---|---|---|---|---|
| Session 1 (11:40AM - 12:25PM) | 11:40AM - 11:55AM | PATIENT-Ψ: Using Large Language Models to Simulate Patients for Training Mental Health Professionals | Ruiyi Wang, Stephanie Milani, Jamie C. Chiu, Jiayin Zhi, Shaun M. Eack, Travis Labrum, Samuel M Murphy, Nev Jones, Kate V Hardy, Hong Shen, Fei Fang, Zhiyu Chen | |
| | 11:55AM - 12:10PM | Automated Feedback Generation for Open-Ended Questions: Insights from Fine-Tuned LLMs | Elisabetta Mazzullo, Okan Bulut | |
| | 12:10PM - 12:25PM | MIRROR: A Novel Approach for the Automated Evaluation of Open-Ended Question Generation | Aniket Deroy, Subhankar Maity, Sudeshna Sarkar | Remote (India) |
| Session 2 (2:40PM - 3:25PM) | 2:40PM - 2:55PM | DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing | Haneul Yoo, Jieun Han, So-Yeon Ahn, Alice Oh | |
| | 2:55PM - 3:10PM | Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring | Jiazheng Li, Hainiu Xu, Zhaoyue Sun, Yuxiang Zhou, David West, Cesare Aloisi, Yulan He | Possible Remote (London) |
| | 3:10PM - 3:25PM | Leveraging Grounded Large Language Models to Automate Educational Presentation Generation | Eric Xie, Guangzhi Xiong, Haolin Yang, Aidong Zhang | |
| Session 3 (4:40PM - 5:25PM) | 4:40PM - 4:55PM | Enhancing Non-Cognitive Assessments with GPT: Innovations in Item Generation and Translation for the University Belonging Questionnaire | Mingfeng Xue, Yunting Liu, Huaxia Xiong | |
| | 4:55PM - 5:10PM | A Graph-Based Foundation Model for Sample-Efficient Adaptive Learning | Jean Vassoyan, Anan Schütt, Jill-Jênn Vie, Arun Balajiee Lekshmi Narayanan, Nicolas Vayatis, Elisabeth Andre | |
| | 5:10PM - 5:25PM | Generating Reading Assessment Passages Using a Large Language Model | Ummugul Bezirhan, Matthias von Davier | |
Keynote Talk 1: Teaching Social Skill via Large Language Models, Dr. Diyi Yang
Today, social skills are essential to success both on the job and in life. However, practice environments for social skills are typically out of reach for most people. How can we make social skill training more available, accessible, and inviting? Existing mechanisms for practice and feedback largely rely on expert supervision, making training difficult to scale, especially given the shortage of trained professionals. In this talk, I share two of our recent works on social skill training using LLMs. The first explores how to empower therapists to learn therapy skills with LLM-empowered feedback, and the second looks at training people in conflict resolution skills via simulated practice. We conclude by discussing risks, concerns, and mitigation strategies related to LLM-based simulation for social skill training.
Keynote Talk 5: Building a High Stakes Online Test with Human-in-the-loop AI, Dr. James Sharpnack
In this talk, we present a comprehensive overview of the construction of the Duolingo English Test (DET), a high-stakes, large-scale, fully online English language proficiency test built using human-in-the-loop (HIL) AI. The DET is a computerized adaptive test where test takers respond to items designed to assess their proficiency in speaking, writing, reading, and listening. Human oversight plays a critical role in ensuring the DET's fairness, lack of bias, construct validity, and security. Additionally, AI and foundation models enable the scalability of key processes, including real-time security features and automated scoring.
We will take a tour of this process, organized into five parts. First, items are constructed using generative AI models like GPT, followed by human review to evaluate potential bias and fairness. Second, machine learning models automatically grade items by predicting expert human ratings based on construct-aligned AI features. Third, items are calibrated using explanatory item response theory models and custom item embeddings, aligning them with historical test taker performance. Fourth, tests are administered via Thompson sampling, motivated by framing computerized adaptive testing as a contextual bandit problem, and scored using a fully Bayesian approach. Finally, automated AI signals, such as eye-gaze tracking and LLM response detection, support proctors in detecting cheating and ensuring adherence to test rules. Throughout this process, humans and AI collaborate seamlessly to maintain the DET's integrity and effectiveness.
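To make the adaptive-testing step concrete, here is a minimal sketch of Thompson-sampling item selection under a simple Rasch model, with a grid-approximated Bayesian posterior over ability. The item bank, the simulated test taker, and every hyperparameter are illustrative assumptions; this is a sketch of the general idea, not the DET's actual algorithm.

```python
# Minimal sketch: Thompson sampling for computerized adaptive testing (Rasch model).
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, b):
    """Rasch model: probability of a correct response at ability theta, difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

item_difficulties = np.linspace(-2.5, 2.5, 30)   # hypothetical item bank
true_theta = 0.8                                 # hypothetical test taker

grid = np.linspace(-4, 4, 201)                   # ability grid
posterior = np.exp(-0.5 * grid**2)               # N(0, 1) prior, unnormalized
posterior /= posterior.sum()

administered = set()
for step in range(10):
    # Thompson sampling: draw an ability from the current posterior ...
    theta_sample = rng.choice(grid, p=posterior)
    # ... then pick the unused item most informative at that draw
    # (for the Rasch model, the item with difficulty closest to the sample).
    candidates = [i for i in range(len(item_difficulties)) if i not in administered]
    item = min(candidates, key=lambda i: abs(item_difficulties[i] - theta_sample))
    administered.add(item)

    # Simulate the response and perform a Bayesian update on the grid.
    b = item_difficulties[item]
    correct = rng.random() < p_correct(true_theta, b)
    likelihood = p_correct(grid, b) if correct else 1.0 - p_correct(grid, b)
    posterior *= likelihood
    posterior /= posterior.sum()

print("posterior mean ability:", float(grid @ posterior))
```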
Keynote Talk 3: AI-Enhanced Practices in Educational Assessment, Dr. Hong Jiao
Assessment of student learning outcomes is expected to be valid, reliable, accurate, and efficient. Technology has played a significant role in shaping and enhancing educational assessment practices. As the landscape of artificial intelligence (AI) evolves rapidly, AI technology is revolutionizing the test development process, particularly in assessment design, item development, implementation, and psychometric analysis. This presentation will first demonstrate the transformative impact of AI technology on educational assessment by presenting successful use cases of AI in test development, such as automated scoring, cheating detection, and process data analysis. Next, the presentation will explore additional possibilities and opportunities that AI can bring to educational assessment practices, including the use of generative AI for item generation and item parameter prediction modeling. Finally, the presentation will address the challenges in applying AI to educational assessment, highlighting potential bias and fairness issues as well as the ethical considerations for the responsible use of AI in large-scale test development. Throughout, it will emphasize the importance of thoughtful implementation and continuous evaluation to ensure the validity, reliability, and fairness of AI-powered assessment systems.
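As one way to picture the "item parameter prediction" idea mentioned above, the sketch below regresses calibrated item difficulties on simple text features so that new, uncalibrated items receive provisional estimates. The four items, their difficulty values, and the TF-IDF-plus-ridge pipeline are all illustrative assumptions, not a recommended production setup.

```python
# Minimal sketch: predicting IRT item difficulty from item text (illustrative data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical calibrated items: stem text paired with an IRT difficulty (b) value.
item_texts = [
    "Identify the main idea of the paragraph.",
    "Solve the quadratic equation and justify each step.",
    "Select the synonym of the underlined word.",
    "Evaluate the author's argument using evidence from two passages.",
]
item_difficulty = [-0.8, 1.2, -0.5, 1.6]

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(item_texts, item_difficulty)

# Provisional difficulty for a newly written, not-yet-calibrated item.
new_item = ["Compare the two experimental designs and explain which controls for bias."]
print("predicted difficulty:", model.predict(new_item)[0])
```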
Keynote Talk 4: Building AI Applications for Large Scale Assessment: A Case Study in Writing Feedback, Dr. Susan Lottridge
Building AI applications for large-scale assessment requires substantial effort not just from data science but also from psychometrics, hand-scoring, item development, UI/UX, and machine learning engineering, to ensure that products support trustworthy AI (e.g., the NIST AI Risk Management Framework) in ways that are also cost effective. This presentation will discuss the key elements and steps our team took when designing and building "Write On with Cambi!", a writing feedback tool that walks students through reviewing their essay using structured feedback and highlights organizational and grammatical elements in the essay to emphasize areas for review. The presentation will cover how the overall purpose of the product drove the design, structure, and human labeling of the annotations; the AI modeling using fine-tuned lightweight open-source models; the evaluation of the human and AI annotations, including bias evaluations; the mapping of elements to structured feedback; the UI/UX; and, finally, efficacy studies conducted with students and teachers.
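To illustrate the kind of human-AI annotation agreement and bias evaluation described above, the following minimal sketch computes quadratic weighted kappa overall and by subgroup. The labels and group assignments are fabricated for illustration; this is not Cambium's evaluation pipeline.

```python
# Minimal sketch: agreement between human and AI annotations, overall and by subgroup.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([2, 3, 1, 0, 2, 3, 1, 2, 0, 3, 2, 1])   # human annotation labels
ai    = np.array([2, 3, 1, 1, 2, 2, 1, 2, 0, 3, 1, 1])   # model annotation labels
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"])

print("overall QWK:", cohen_kappa_score(human, ai, weights="quadratic"))
for g in np.unique(group):
    mask = group == g
    qwk = cohen_kappa_score(human[mask], ai[mask], weights="quadratic")
    print(f"group {g} QWK: {qwk:.3f}")  # large gaps across groups warrant review
```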
To ensure the accessibility of our workshop for virtual attendees, we will stream all presentations and facilitate questions from online attendees via Rocketchat.
Submission URL: Please submit your work via OpenReview.
Format: All submissions must be in PDF format and anonymized. Submissions are limited to nine content pages, including all figures and tables; unlimited additional pages containing references and supplementary materials are allowed. Reviewers may choose to read the supplementary materials but will not be required to.
Style file: You must format your submission using the NeurIPS 2024 LaTeX style file. The maximum file size for submissions is 50MB. Submissions that violate the NeurIPS style (e.g., by decreasing margins or font sizes) or page limits may be rejected without further review.
Double-blind reviewing: The reviewing process will be double blind at the level of reviewers (i.e., reviewers cannot see author identities). Authors are responsible for anonymizing their submissions. In particular, they should not include author names, author affiliations, or acknowledgements in their submissions and they should avoid providing any other identifying information (even in the supplementary material).
Important Dates:
Camera Ready:
University of Virginia
CFA Institute
Apple Inc.
University of Iowa
University of Virginia
University of Virginia
NBME
University of Denver
Allen Institute for Artificial Intelligence
University of Virginia, Charlottesville
University of Virginia, Charlottesville
University of Alberta
Stanford University
University of Georgia
University of California, Los Angeles
University of Maryland, College Park
University of Georgia
University of Virginia, Charlottesville
University of California, Santa Cruz
Educational Testing Service
University of Iowa
Boston College
CFA Institute
Vanderbilt University
University of Virginia, Charlottesville
Nagoya University