UMAP '25: Proceedings of the 33rd ACM Conference on User Modeling, Adaptation and Personalization


SESSION: Full Papers

Assessing Medical Training Skills via Eye and Head Movements

We examined eye and head movements to gain insights into skill development in clinical settings. A total of 24 practitioners participated in simulated baby delivery training sessions. We calculated key metrics, including pupillary response rate, fixation duration, and angular velocity. Our findings indicate that eye and head tracking can effectively differentiate between trained and untrained practitioners, particularly during labor tasks. For example, head-related features achieved an F1 score of 0.85 and an AUC of 0.86, whereas pupil-related features achieved an F1 score of 0.77 and an AUC of 0.85. The results lay the groundwork for computational models that support implicit skill assessment and training in clinical settings by using commodity eye-tracking glasses as a complementary device to more traditional evaluation methods such as subjective scores.

Augmenting Personalized Memory via Practical Multimodal Wearable Sensing in Visual Search and Wayfinding Navigation

Working memory involves the temporary retention of information over short periods. It is a critical cognitive function that enables humans to perform various online processing tasks, such as dialing a phone number, recalling misplaced items’ locations, or navigating through a store. However, inherent limitations in an individual’s capacity to retain information often result in forgetting important details during such tasks. Although previous research has successfully utilized wearable and assistive technologies to enhance long-term memory functions (e.g., episodic memory), their application to supporting short-term recall in daily activities remains underexplored. To address this gap, we present Memento, a framework that uses multimodal wearable sensor data to detect significant changes in cognitive state and provide intelligent in situ cues to enhance recall. Through two user studies involving 15 and 25 participants in visual search navigation tasks, we demonstrate that participants receiving visual cues from Memento achieved significantly better route recall, improving by approximately 20-23% over free recall. Furthermore, Memento reduced cognitive load and review time by 46% and substantially reduced computation time (3.86 secs vs. 15.35 secs), while remaining on average 75% as effective as computer vision-based cue selection approaches.

Comparing Cognitive and Affective Theory of Mind for an Assistive Robotics Application

Human-robot interaction in cooperative and assistive scenarios requires robotic systems to assess the task state and coherently choose their next move. Moreover, it is also fundamental to correctly recognize how the user’s stress and emotional response are changing in order to offer support appropriately. The robot should be able to adapt to different user reactions, considering the situational context and displaying empathetic behaviors that aim to support and encourage the users. In this work, we aim to assess the impact of empathetic supporting behaviors on the perception of the robot and the users’ performance during a collaborative task, as opposed to assistive strategies focusing only on the task’s performance. With this objective in mind, we propose a robotic architecture to assist a user in playing a memory game in real-time using a Furhat robot. We conducted a user study in which 60 participants played with the robot to evaluate the effects of the two types of Theory of Mind on the assistive task and on their perception of the robot. To this end, the participants interacted with a robot endowed with either Cognitive or Affective Theory of Mind, respectively allowing the robot to understand intentions and beliefs, or to show empathetic behaviors to improve the collaboration. The two conditions achieved the same task performance, but the participants rated the emotionally engaged robot higher in perceived social intelligence.

Disentangling Stakeholder Role and Expertise in User-Centered Explainable AI

Identifying explanation needs based on user characteristics has been the focus of human-centred research within XAI for some time. In Ribera et al.’s proposal of user-centred XAI, expertise was used as a proxy for characterising the user and, in turn, to guide explanation design. Since then, the research landscape has evolved to include a broader notion of stakeholders, ranging from AI developers to external regulators to the affected users of AI decisions. However, with this broadening of stakeholder roles, there emerged a pattern of conflating expertise and role, such as the term “end user” being used interchangeably for domain experts using (X)AI for decision-making and lay users impacted by AI decisions, although the two have vastly different explanatory needs. In this work, we revisit previous surveys with the aim of identifying and classifying stakeholders in the XAI ecosystem. We propose to consistently categorise stakeholders along separate expertise and role dimensions. By disentangling both, we present a framework that highlights the diversity of stakeholder goals and the challenges of aligning explanation design with varied user requirements. Our analysis maps stakeholders onto these dimensions and discusses how using both expertise and role can inform the development of more tailored and effective XAI solutions.

Empowering Recommender Systems based on Large Language Models through Knowledge Injection Techniques

Recommender systems (RSs) have become increasingly versatile, finding applications across diverse domains. Large Language Models (LLMs) significantly contribute to this advancement, since the vast amount of knowledge embedded in these models can be easily exploited to provide users with high-quality recommendations. However, current RSs based on LLMs have room for improvement. As an example, knowledge injection techniques can be used to fine-tune LLMs by incorporating additional data, thus improving their performance on downstream tasks. In a recommendation setting, these techniques can be exploited to incorporate further knowledge, which can result in a more accurate representation of the items. Accordingly, in this paper, we propose a pipeline for knowledge injection specifically designed for RSs. First, we incorporate external knowledge by drawing on three sources: (a) knowledge graphs; (b) textual descriptions; (c) collaborative information about user interactions. Next, we lexicalize the knowledge, and we instruct and fine-tune an LLM, which can easily return a list of recommendations. Extensive experiments on movie, music, and book datasets validate our approach. Moreover, the experiments show that knowledge injection is particularly needed in domains (i.e., music and books) where the knowledge encoded within LLMs may not be suitable for recommendation tasks, even if such content was used during the training of the model. This finding points to several promising future research directions.

“Even explanations will not help in trusting [this] fundamentally biased system”: A Predictive Policing Case-Study

In today’s society, where Artificial Intelligence (AI) has gained a vital role, concerns regarding users’ trust have garnered significant attention. The use of AI systems in high-risk domains has often led users to either under-trust them, potentially causing inadequate reliance, or over-trust them, resulting in over-compliance. Therefore, users must maintain an appropriate level of trust. Past research has indicated that explanations provided by AI systems can enhance user understanding of when to trust or not trust the system. However, the utility of presenting different explanation forms remains to be explored, especially in high-risk domains. Therefore, this study explores the impact of different explanation types (text, visual, and hybrid) and user expertise (retired police officers and lay users) on establishing appropriate trust in AI-based predictive policing. While we observed that the hybrid form of explanations increased expert users’ subjective trust in AI, it did not lead to better decision-making. Furthermore, no form of explanation helped build appropriate trust. The findings of our study emphasize the importance of re-evaluating the use of explanations to build [appropriate] trust in AI-based systems, especially when the system’s use is questionable. Finally, we synthesize potential challenges and policy recommendations based on our results to design for appropriate trust in high-risk AI-based systems.

Familiarizing with Music: Discovery Patterns for Different Music Discovery Needs

Humans have a natural tendency to discover and explore. This tendency is reflected in data from streaming platforms as the amount of previously unknown content accessed by users. Additionally, in domains such as music streaming, there is evidence that recommending novel content improves users’ experience with the platform. Therefore, understanding users’ discovery patterns, such as the extent to which and the way users access previously unknown content, is a topic of relevance for both the scientific community and the streaming industry, particularly the music one. Previous works studied how music consumption differs for users of different traits and looked at the diversity, novelty, and consistency over time of users’ music preferences. However, very little is known about how users discover and explore previously unknown music, and how this behavior differs for users of varying discovery needs. In this paper, we bridge this gap by analyzing data from a survey answered by users of the major music streaming platform Deezer in combination with their streaming data. We first address questions regarding whether users who declare a higher interest in unfamiliar music listen to more diverse music, have more stable music preferences over time, and explore more music within the same time window, compared to those who declare a lower interest. We then investigate which type of music tracks users choose to listen to when they explore unfamiliar music, identifying clear patterns of popularity and genre representativeness that vary for users of different discovery needs.

Our findings open up possibilities to infer users’ interest in unfamiliar music from streaming data as well as possibilities to develop recommender systems that guide users in exploring music in a more natural way.

GAL-KARS: Exploiting LLMs for Graph Augmentation in Knowledge-Aware Recommender Systems

In this paper, we propose a recommendation model that exploits a graph augmentation technique based on Large Language Models (LLMs) to enrich the information encoded in its underlying Knowledge Graph (KG). Our work relies on the assumption that the triples encoded in a KG can often be noisy or incomplete, and this may lead to sub-optimal modeling of both the characteristics of items and the users’ preferences. In this setting, graph augmentation can be a suitable solution to improve the quality of the data model and provide users with high-quality recommendations.

Accordingly, in this work, we align with this research line and propose GAL-KARS (Graph Augmentation with LLMs for Knowledge-Aware Recommender Systems). In our framework, we start from a KG, and we design some prompts for querying an LLM and augmenting the graph by incorporating: (a) further features describing the items; (b) further nodes describing the preferences of the users, obtained by reasoning over the items they like. The resulting KG is then passed through a Knowledge Graph Encoder that learns users’ and items’ embeddings based on the augmented KG. These embeddings are finally used to train a recommendation model and provide users with personalized suggestions. As shown in the experimental session, graph augmentation based on LLMs can significantly improve the predictive accuracy of our recommendation model, thus confirming the effectiveness of the model and the validity of our intuitions.

Generative Framework for Personalized Persuasion: Inferring Causal, Counterfactual, and Latent Knowledge

We hypothesize that optimal system responses emerge from adaptive strategies grounded in causal and counterfactual knowledge. Counterfactual inference allows us to create hypothetical scenarios to examine the effects of alternative system responses. We enhance this process through causal discovery, which identifies the strategies, informed by the underlying causal structure, that govern system behaviors. Moreover, we treat the psychological constructs and unobservable noise that might influence user-system interactions as latent factors, and we show that these factors can be effectively estimated. We employ causal discovery to identify strategy-level causal relationships among user and system utterances, guiding the generation of personalized counterfactual dialogues. We model the user utterance strategies as causal factors, enabling system strategies to be treated as counterfactual actions. Furthermore, we optimize policies for selecting system responses based on counterfactual data. Our results on a real-world dataset on social good demonstrate significant improvements in persuasive system outcomes, with increased cumulative rewards validating the efficacy of causal discovery in guiding personalized counterfactual inference and optimizing dialogue policies for a persuasive dialogue system.

Granular Feedback: Leveraging Domain Expertise and Explainable AI to Effectively Steer Models

The use of large language models (LLMs) for automated content generation has seen a steady rise in recent years in domains such as education. While research has increasingly explored human-AI collaboration, including the use of feedback and control to enable teachers to improve model performance with domain knowledge, no studies have explored the level of detail teachers are willing to provide in their feedback to LLM systems. In an automated question generation system, we introduce the concept of granular feedback, which allows teachers to provide feedback on generated questions by critiquing individual features revealed by the model (e.g., question difficulty, Bloom’s taxonomy level), and compare it to a more general but widely used 5-point rating of the question overall. Through in-depth interviews with 16 teachers, we explore how detailed but more time-consuming granular feedback compares to a more general but familiar 5-point rating on the question as a whole. Results show a strong preference for granular feedback over general feedback, driven by factors such as long-term efficiency, personalisation, and personal reassurance. Additionally, we highlight several factors that positively influence a user’s willingness to give feedback, such as the optional nature of giving feedback and explicit disclosures on model improvement. As the usefulness of granular feedback strongly depended on which features users could critique, we discuss how participants perceived these features and how changing them could further improve feedback. To conclude, we propose several design suggestions for granular feedback, such as aligning feedback options with teachers’ mental models and providing means to introduce additional contextual information to limit repetition in provided feedback.

Impact of Adaptive Feedback on Learning Programming with a Serious Game in High Schools’ Classes

This study evaluates the impact of an adaptive feedback system in Pyrates, a programming serious game designed to ease the transition from block-based to text-based programming in high school classes. The adaptive feedback system was implemented to support student learning and lessen teachers’ intervention workload in the classroom. To assess its effectiveness, a field user study was conducted with 190 high school students across two institutions. Results show that students progressed significantly further in the game when using the adaptive feedback system, as compared to playing without feedback, although it did not affect learning gains. We discuss the implications of these results for the design of adaptive feedback in programming serious games.

Legal but Unfair: Auditing the Impact of Data Minimization on Fairness and Accuracy Trade-off in Recommender Systems

Data minimization, required by recent data privacy regulations, is crucial for user privacy, but its impact on recommender systems remains largely unclear. The core problem lies in the fact that reducing or altering the training data of these systems can drastically affect their performance. While previous research has explored how data minimization affects recommendation accuracy, a critical gap remains: How does data minimization impact consumers’ and providers’ fairness? This study addresses this gap by systematically examining how data minimization influences multiple objectives in recommender systems, i.e., the trade-offs between accuracy, user fairness, and provider fairness. Our investigation includes (i) an analysis of how data minimization strategies affect RS performance across these objectives, (ii) an assessment of data minimization techniques to determine which can better balance the trade-off among the considered objectives, and (iii) an evaluation of the robustness of different recommendation models under diverse minimization strategies to identify those that best maintain performance. The findings reveal that data minimization can sometimes undermine provider fairness, while enhancing group-based consumer fairness at the expense of accuracy. Additionally, different strategies can offer diverse trade-offs for the assessed objectives. The source code supporting this study is available at https://github.com/salvatore-bufi/DataMinimizationFairness.

Mindful Escape: a Mobile Serious Game to Predict the Personality Trait Cooperation

Personality plays a crucial role in predicting preferences, behaviors, and interactions. Its importance in accurately characterizing individuals has led to its application in areas such as movies, music, and tourism. Although personality questionnaires have traditionally been used to measure personality, they are prone to biases, such as inflated or false responses. In response to these limitations, serious games have emerged as innovative alternatives for assessing personality by studying the player's behavior. This study developed and evaluated Mindful Escape, a short-duration mobile serious game, as a proof of concept to implicitly measure the personality trait of cooperation. The game adapts concepts from the Prisoner's Dilemma and the Tragedy of the Commons to create an Escape Room environment that encourages both cooperative and competitive interactions. Experiments with real users were performed (n = 78), in which significant correlations between the game's metrics and cooperation were identified. Additionally, other traits, such as modesty, morality, altruism, and anger, also showed correlations. The game's duration exceeded the planned 5 minutes, averaging about 10 minutes, mainly due to gameplay difficulties among less experienced users, which need to be addressed in the future. Nevertheless, the participants’ feedback was highly positive, highlighting the immersive and engaging experience offered by the game. The results show that short-duration mobile games offer a viable and unobtrusive method for assessing users' detailed personality traits, paving the way toward replacing traditional personality questionnaires and integrating such games into personality-based systems.

Personalizing LLM Responses to Combat Political Misinformation

Despite various efforts to tackle online misinformation, people inevitably encounter and engage with it, especially on social media platforms. Recent advances in LLMs present an opportunity to develop personalized interventions to address misinformed beliefs, potentially offering more effective approaches than existing non-tailored methods. In this paper, we design and evaluate a personalized LLM agent that considers users’ demographics and personalities to tailor responses that mitigate misinformed beliefs. Our pipeline is grounded in facts through an external Retrieval Augmented Generation (RAG) knowledge base and is able to generate diverse output as a result of the personalization, with an average cosine similarity of 0.538. Our pipeline scores an average rating of 3.99 out of 5 when evaluated by a GPT-4o-mini LLM judge for response persuasiveness. Our methods can be adapted to design similar personalized agents in other domains.

Pilot Trainees Benefit from Modelling and Adaptive Feedback

Limited training capacity has contributed to a critical shortage of licensed commercial pilots. Adaptive educational technologies and simulators could alleviate current training bottlenecks if these technologies could assess trainee performance and provide appropriate feedback. Agents can be used to assess trainee performance, but there is insufficient guidance on how to provide concurrent feedback in simulation-based learning environments. We therefore designed four feedback conditions that provide varying degrees of elaboration and used a within-subject study (n = 20) to compare the feedback approaches. Trainee performance was best when trainees received highly elaborative feedback that modeled expert behaviour. Variability in participant performance and preferences indicates a need to adapt the feedback type to individual learners and provides insight into the use of concurrent feedback in simulation-based learning environments. Specifically, learners appreciated the expert model because it facilitated a sense of control, which was associated with lower negative affect and lower extraneous cognitive load.

Sentence Encoder-Based Clustering Method for Modeling Students' Learning Programming Behavior

Introductory programming courses are widely known to be difficult for students. Success in these courses is commonly measured by final grades, which may not capture the challenges students face during their learning process. In this paper, we predict students’ success and their future compiler errors based on previously made errors. Furthermore, we examine the effect of applying two clustering techniques before making the predictions and identify the key weeks and errors that have the greatest impact on predictions. Experimental results show that students’ compiler errors observed through the semester are an important predictor of students’ achievement and future struggles. Predictions are further improved using sentence encoder-generated embeddings with the K-Means algorithm. Our study suggests that students’ errors, particularly the most recent ones, enable meaningful clustering that enhances performance prediction after only three weeks of the semester.

Should We Tailor the Talk? Understanding the Impact of Conversational Styles on Preference Elicitation in Conversational Recommender Systems

Conversational recommender systems (CRSs) provide users with an interactive means to express preferences and receive real-time personalized recommendations. The success of these systems is heavily influenced by the preference elicitation process. While existing research mainly focuses on what questions to ask during preference elicitation, there is a notable gap in understanding what role broader interaction patterns—including tone, pacing, and level of proactiveness—play in supporting users in completing a given task. This study investigates the impact of different conversational styles on preference elicitation, task performance, and user satisfaction with CRSs. We conducted a controlled experiment in the context of scientific literature recommendation, contrasting two distinct conversational styles—high involvement (fast-paced, direct, and proactive with frequent prompts) and high considerateness (polite and accommodating, prioritizing clarity and user comfort)—alongside a flexible experimental condition where users could switch between the two. Our results indicate that adapting conversational strategies based on user expertise and allowing flexibility between styles can enhance both user satisfaction and the effectiveness of recommendations in CRSs. Overall, our findings hold important implications for the design of future CRSs.

"Show Me How": Benefits and Challenges of Agent-Augmented Counterfactual Explanations for Non-Expert Users

Counterfactual explanations offer actionable insights by illustrating how changes to inputs can lead to different outcomes. However, these explanations often suffer from ambiguity and impracticality, limiting their utility for non-expert users with limited AI knowledge. Augmenting counterfactual explanations with Large Language Models (LLMs) has been proposed as a solution, but little research has examined their benefits and challenges for non-experts. To address this gap, we developed a healthcare-focused system that leverages conversational AI agents to enhance counterfactual explanations, offering clear, actionable recommendations to help patients at high risk of cardiovascular disease (CVD) reduce their risk. Evaluated through a mixed-methods study with 34 participants, our findings highlight the effectiveness of agent-augmented counterfactuals in improving actionable recommendations. Results further indicate that users with prior experience using conversational AI demonstrated greater effectiveness in utilising these explanations compared to novices. Furthermore, this paper introduces a set of generic guidelines for creating augmented counterfactual explanations, incorporating safeguards to mitigate common LLM pitfalls, such as hallucinations, and ensuring the explanations are both actionable and contextually relevant for non-expert users.

Synthetic Voices: Evaluating the Fidelity of LLM-Generated Personas in Representing People’s Financial Wellbeing

Large Language Models (LLMs) can impersonate the writing style of authors, characters, and groups of people, but can these personas represent their opinions? If so, it creates opportunities for businesses to obtain early feedback on ideas from a synthetic customer-base. In this paper, we test whether LLM synthetic personas can answer financial wellbeing questions similarly to the responses of a financial wellbeing survey of more than 3,500 Australians. We focus on identifying salient biases of 765 synthetic personas using four state-of-the-art LLMs built over 35 categories of personal attributes. We noticed clear biases related to age, and as more details were included in the personas, their responses increasingly diverged from the survey toward lower financial wellbeing. With these findings, it is possible to understand the areas in which creating synthetic LLM-based customer personas can yield useful feedback for faster product iteration in the financial services industry and potentially other industries.

Task-specific, personalized Automatic Speech Recognition

Voice User Interfaces (VUIs) are particularly useful when the operator has to work hands-free or when their cognitive load is very high. This is the case, e.g., when the operator can be easily disturbed by the environment, the operational task induces stress, and there is little or no fault tolerance. However, the factors that make a VUI useful also complicate its design. Automatic Speech Recognition (ASR) must be robust in noisy environments, under non-optimal microphone conditions, and for different types of speech, including stress-induced shouting, hyper-articulation, and heavy breathing, among others. Commercially available, generic ASR solutions do not fulfill these high robustness requirements. However, ASR systems can be made robust if they are tailored to their respective use cases and personalized for specific users. This paper introduces a method to customize a Large Vocabulary Continuous Speech Recognition (LVCSR) system to achieve such robustness. An LVCSR system includes a language model (LM) and an acoustic model (AM), and our customization adapts both to the specific operational context. For LM customization, we employ a Use Case Editor (UCE) that provides an intuitive interface, enabling users to align linguistic models with their unique needs. For AM customization, a Multi-Speaker Text-to-Speech Synthesis (MSTTS) module automatically generates personalized speech data, ensuring the model captures the distinctive characteristics of individual speakers. Together, these adaptations ensure the LVCSR system is configured to meet the demands of challenging environments and diverse users.

The Effect of Nudging Techniques on the Customisation and Usability of Visual Analytics Dashboards

Visual analytics dashboards have become essential tools for decision-making. However, information overload and mismatches between designers’ expected graph literacy and users’ actual graph literacy can limit their effectiveness. Customisation has been proposed to mitigate these challenges and accommodate diverse user needs. Yet, customising dashboards is often time-consuming; users may not be aware of existing customisation features, or they may not have sufficient technical skills to use them. In this paper, we conduct an experiment (N=50) to examine if we can use nudging techniques to promote short-term surface customisations while not sacrificing the usability of interactive visual analytics dashboards. We found that while nudges do not necessarily increase the use of customisation functionalities, they benefit usability. Specifically, the Social Comparisons nudge supports decision-making, while the Just-in-Time Prompts nudge reduces task completion time. Our findings suggest that nudges should be tailored to graph literacy as users with moderate graph literacy can benefit the most from nudges.

The role of GPT as an adaptive technology in climate change journalism

Recent advancements in Large Language Models (LLMs), such as GPT-4o, have enabled automated content generation and adaptation, including summaries of news articles. To date, LLM use in a journalism context has been understudied, but can potentially address challenges of selective exposure and polarization by adapting content to end users. This study used a one-shot recommender platform to test whether LLM-generated news summaries were evaluated more positively than ‘standard’ 50-word news article previews. Moreover, using climate change news from the Washington Post, we also compared the influence of different ‘emotional reframing’ strategies to rewrite texts and their impact on the environmental behavioral intentions of end users. We used a 2 (between: Summary vs. 50-word previews) x 3 (within: fear, fear-hope or neutral reframing) research design. Participants (N = 300) were first asked to read news articles in our interface and to choose a preferred news article, while later performing an in-depth evaluation task on the usability (e.g., clarity) and trustworthiness of different framing strategies. The results showed that evaluations of summaries, while being positive, were not significantly better than those of previews. However, we did observe that a fear-hope reframing strategy of a news article, when paired with a GPT-generated summary, led to higher pro-environmental intentions compared to neutral framing. We discuss the potential benefits of this technology.

UI-JEPA: Towards Active Perception of User Intent through Onscreen User Activity

Generating user intent descriptions from a sequence of user interface (UI) snippets is a core challenge in comprehensive UI understanding and cross-modal text generation. Recent advancements in multi-modal large language models (MLLMs) have led to substantial progress in this area, but their demands for extensive model parameters, computing power, and high latency make them impractical for scenarios requiring lightweight, on-device solutions with low latency or heightened privacy. Additionally, the lack of high-quality datasets has hindered the development of such lightweight models. To address these challenges, we propose UI-JEPA, a novel framework that employs masking strategies to learn abstract UI embeddings from unlabeled UI data through self-supervised learning, combined with an LLM decoder fine-tuned for user intent summarization. We also introduce two new UI-grounded multi-modal datasets, “Intent in the Wild” (IIW) and “Intent in the Tame” (IIT), designed for few-shot and zero-shot UI understanding tasks on mobile phones. IIW consists of 1.7K videos across 219 intent categories, while IIT contains ∼ 900 videos across 10 categories. We establish the first baselines for these datasets, showing that representations learned using a JEPA-style objective, combined with an LLM decoder, can achieve high-quality user intent summarization that matches the performance of state-of-the-art large MLLMs, but with significantly reduced annotation and deployment resources. Measured by a scoring function that aggregates both n-gram overlap and embedding similarity, UI-JEPA outperforms GPT-4 Turbo and Claude 3.5 Sonnet by 10.0% and 7.2% respectively, averaged across the two datasets. Notably, UI-JEPA achieves this performance with a 50.5x reduction in computational cost and a 6.6x improvement in latency on the IIW dataset. These results underscore the effectiveness of UI-JEPA, highlighting its potential for lightweight, high-performance UI understanding and intent summarization.

Uncertainty in Repeated Implicit Feedback as a Measure of Reliability

Recommender systems rely heavily on user feedback to learn effective user and item representations. Despite their widespread adoption, limited attention has been given to the uncertainty inherent in the feedback used to train these systems. Both implicit and explicit feedback are prone to noise due to the variability in human interactions, with implicit feedback being particularly challenging. In collaborative filtering, the reliability of interaction signals is critical, as these signals determine user and item similarities. Thus, deriving accurate confidence measures from implicit feedback is essential for ensuring the reliability of these signals.

A common assumption in academia and industry is that repeated interactions indicate stronger user interest, increasing confidence in preference estimates. However, in domains such as music streaming, repeated consumption can shift user preferences over time due to factors like satiation and exposure. While the literature on repeated consumption acknowledges these dynamics, they are often overlooked when confidence scores are derived from implicit feedback.

This paper addresses this gap by focusing on music streaming, where repeated interactions are frequent and quantifiable. We analyze how repetition patterns intersect with key factors influencing user interest and develop methods to quantify the associated uncertainty. These uncertainty measures are then integrated as consistency metrics in a recommendation task. Our empirical results show that incorporating uncertainty into user preference models yields more accurate and relevant recommendations. Key contributions include a comprehensive analysis of uncertainty in repeated consumption patterns, the release of a novel dataset, and a Bayesian model for implicit listening feedback.
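The Bayesian treatment of repeated listening feedback can be illustrated with a minimal sketch. The paper's actual model is richer; the Beta-Bernoulli form, the function name, and the uniform prior below are illustrative assumptions, not the authors' formulation:

```python
from math import sqrt

def listen_confidence(completions, skips, a0=1.0, b0=1.0):
    """Beta-Bernoulli posterior over the probability that a track is
    listened to in full rather than skipped; the posterior standard
    deviation serves as an uncertainty measure for the implicit signal."""
    a = a0 + completions          # successes: full listens
    b = b0 + skips                # failures: skips
    mean = a / (a + b)            # posterior mean preference
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))  # posterior variance
    return mean, sqrt(var)

# Many consistent listens -> high preference estimate, low uncertainty
m_hi, s_hi = listen_confidence(completions=20, skips=1)
# Mixed behaviour -> uncertain signal despite repeated interaction
m_lo, s_lo = listen_confidence(completions=3, skips=3)
```

Under this toy model, two items with the same play count can carry very different uncertainty, which is the kind of distinction a repetition-aware confidence measure is meant to capture.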

What Is Serendipity? An Interview Study to Conceptualize Experienced Serendipity in Recommender Systems

Serendipity has been associated with numerous benefits in the context of recommender systems, e.g., increased user satisfaction and consumption of long-tail items. Despite this, serendipity in the context of recommender systems has thus far remained conceptually ambiguous. This conceptual ambiguity has led to inconsistent operationalizations between studies, making it difficult to compare and synthesize findings. In this paper, we conceptualize the user’s experience of serendipity. To this effect, we interviewed 17 participants and analyzed the data following the grounded theory paradigm. Based on these interviews, we conceptualize experienced serendipity as a user experience in which a user unintentionally encounters content that feels fortuitous, refreshing, and enriching. We find that all three components—fortuitous, refreshing and enriching—are necessary and together are sufficient to classify a user’s experience as serendipitous. However, these components can be satisfied through a variety of conditions. Our conceptualization unifies previous definitions of serendipity within a single framework, resolving inconsistencies by identifying distinct flavors of serendipity. It highlights underexposed flavors, offering new insights into how users experience serendipity in the context of recommender systems. By clarifying the components and conditions of experienced serendipity in recommender systems, this work can guide the design of recommender systems that stimulate experienced serendipity in their users, and lays the groundwork for developing a standardized operationalization of experienced serendipity in its many flavors, enabling more consistent and comparable evaluations.

With Friends Like These, Who Needs Explanations? Evaluating User Understanding of Group Recommendations

Group Recommender Systems (GRS) employing social choice-based aggregation strategies have previously been explored in terms of perceived consensus, fairness, and satisfaction. At the same time, the impact of textual explanations has been examined, but the results suggest a low effectiveness of these explanations. However, user understanding remains fairly unexplored, even if it can contribute positively to transparent GRS. This is particularly interesting to study in more complex or potentially unfair scenarios when user preferences diverge, such as in a minority scenario (where group members have similar preferences, except for a single member in a minority position). In this paper, we analyzed the impact of different types of explanations on user understanding of group recommendations. We present a randomized controlled trial (n = 271) using two between-subject factors: (i) the aggregation strategy (additive, least misery, and approval voting), and (ii) the modality of explanation (no explanation, textual explanation, or multimodal explanation). We measured both subjective (self-perceived by the user) and objective understanding (performance on model simulation, counterfactuals, and error detection). In line with recent findings on explanations for machine learning models, our results indicate that more detailed explanations, whether textual or multimodal, did not increase subjective or objective understanding. However, we did find a significant effect of aggregation strategies on both subjective and objective understanding. These results imply that when constructing GRS, practitioners need to consider that the choice of aggregation strategy can influence the understanding of users. Post-hoc analysis also suggests that there is value in analyzing performance on different tasks, rather than through a single aggregated metric of understanding.

SESSION: Short Papers

Addressing Personalized Diversity in Eyewear Recommendation: a Lenskart Case Study

This study addresses the challenge of limited diversity in recommender systems on e-commerce category pages, which often leads to reduced user engagement and satisfaction. Recognizing the limitations of traditional Factorization Machines (FM) in generating diverse recommendations, we propose a personalized diversity approach that combines re-ranking strategies with FM, enhanced by Generalist-Specialist (GS) scores to tailor diversity to individual user preferences. The re-ranking strategies explored include Maximal Marginal Relevance (MMR) and Determinantal Point Processes (DPP). Our results show improved balance between relevance and personalized diversity in offline experiments. Additionally, we investigate an alternative approach to personalized diversity through a contextual bandit model (LinUCB), where diversity emerges by balancing exploration and exploitation in predicted preferences. This evaluation highlights LinUCB’s ability to anticipate diverse recommendations by simulating adaptive responses without relying on active user feedback, offering a contrast to traditional re-ranking methods.
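Of the two re-ranking strategies mentioned, MMR is the simpler to sketch. Below is a minimal greedy implementation; the per-user trade-off parameter lam is where a Generalist-Specialist score could plug in, though that wiring, the toy similarity, and the item names are assumptions rather than the paper's exact formulation:

```python
def mmr_rerank(candidates, relevance, sim, k, lam):
    """Maximal Marginal Relevance: greedily select items that are
    relevant yet dissimilar to what has already been selected.
    lam=1.0 is pure relevance; lam=0.0 is pure diversity."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        def mmr_score(i):
            # redundancy = similarity to the closest already-picked item
            redundancy = max((sim(i, j) for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected

# Toy catalog: items share a "category" given by their first letter
rel = {"a1": 0.9, "a2": 0.8, "b1": 0.7}
same_cat = lambda i, j: 1.0 if i[0] == j[0] else 0.0
diverse = mmr_rerank(rel, rel, same_cat, k=2, lam=0.5)  # mixes categories
greedy = mmr_rerank(rel, rel, same_cat, k=2, lam=1.0)   # pure relevance
```

A personalized-diversity variant would set lam per user, e.g. lower (more diverse) for "generalist" users and higher for "specialists".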

Bridging Preferences: Multi-Stakeholder Insights on Ideal News Recommendations

In the evolving realm of recommender systems, our study contributes to the understanding of potential improvements in news recommendation beyond accuracy. Central to our research is the integration of insights from news industry experts and prospective readers, compared with automated news recommendations. We conducted a labeling study with 168 articles, using Best-Worst Scaling (BWS) for ranking and topic modeling. This approach enabled a thorough examination of stakeholder expectations for ideal reading recommendations, specifically by investigating the gap between stated and revealed preferences. Our findings show alignment in ranking behavior among journalists, prospective readers, and the BM-25 algorithm. However, preferences for different beyond-accuracy measures varied. Accompanying this work, a corpus of news articles and the labeled rankings have been made available.
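For reference, the BM-25 baseline used in the ranking comparison scores documents roughly as follows. This is a plain, whitespace-tokenized sketch with default parameters, not the authors' exact configuration:

```python
from math import log

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25:
    term-frequency saturation (k1) plus document-length normalization (b)."""
    N = len(docs)
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / N
    scores = []
    for t in toks:
        s = 0.0
        for q in query_terms:
            df = sum(1 for d in toks if q in d)       # document frequency
            if df == 0:
                continue
            idf = log((N - df + 0.5) / (df + 0.5) + 1)
            tf = t.count(q)                            # term frequency
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = ["climate policy news today", "football match report", "climate change report"]
scores = bm25_scores(["climate"], docs)  # non-matching doc scores 0
```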

Can Path-Based Explainable Recommendation Methods based on Knowledge Graphs Generalize for Personalized Education?

Knowledge graphs enable transparent reasoning in recommender systems. While widely studied in other domains, the generalizability of reasoning methods over knowledge graphs to education remains underexplored due to data and evaluation inconsistencies. In this paper, we investigate three classes of explainable reasoning methods for course recommendation. Comparing them with state-of-the-art baselines, we assess utility, beyond-utility, and explainability metrics. Our results show that methods from the generative class perform well in utility, coverage, and explanation diversity, while baselines are still competitive in some beyond-utility metrics under sparsity. With lower sparsity, the gap among methods decreases. Code: https://bit.ly/kg-reasoning-for-pers-course-recsys.

Circumventing Misinformation Controls: Assessing the Robustness of Intervention Strategies in Recommender Systems

Recommender systems are essential on social media platforms, shaping the order of information users encounter and facilitating news discovery. However, these systems can inadvertently contribute to the spread of misinformation by reinforcing algorithmic biases, fostering excessive personalization, creating filter bubbles, and amplifying false narratives. Recent studies have demonstrated that intervention strategies, such as Virality Circuit Breakers and accuracy nudges, can effectively mitigate misinformation when implemented on top of recommender systems. Despite this, existing literature has yet to explore the robustness of these interventions against circumvention—where individuals or groups intentionally evade or resist efforts to counter misinformation. This research aims to address this gap, examining how well these interventions hold up in the face of circumvention tactics. Our findings highlight that these intervention strategies are generally robust against misinformation circumvention threats when applied on top of recommender systems.

Effects of Quantitative Explanations on Fairness Perception in Group Recommender Systems

Group recommender systems (GRS) aim to deliver recommendations to groups of individuals, assisting them in planning activities such as going to the cinema with friends, organizing a family vacation, or dining out with colleagues. Unlike traditional recommender systems (RS), GRS must account for the preferences of multiple individuals, often balancing potentially conflicting goals. In this context, it is crucial to provide recommendations that are perceived as fair by all group members. While numerous aggregation strategies have been proposed, understanding users’ perspectives on fairness remains an open challenge. In this paper, we present the results of a user study in which real participants acted as external judges, evaluating the fairness of group recommendations. The study investigates the impact of quantitative explanations, conditioned by specific GRS and group types, on fairness perception. Our findings suggest that, without additional information, the task may be too difficult for users, and their ability to distinguish between different group types is limited, further underscoring the importance of explanations. Study data are available from https://osf.io/9fpyr/.

Enhancing Digital Narrative Medicine through Emotion Analysis in Conversational Agents

This paper presents the development of CArEN (Conversational AgEnt supporting Narrative medicine) that integrates a text-based emotional recognition module to personalize therapeutic pathways in the context of Narrative-Based Medicine (NBM). NBM combines traditional medicine, therapies, symptom monitoring, and vital parameters detection with a conversation-based approach that allows considering not only physical well-being but also the psychosocial and emotional impact of illness on the patient’s life. A study was carried out to evaluate the models’ effectiveness in real-world contexts and collect user feedback on the conversational agent’s performance and empathic support. The results demonstrate good accuracy in emotion recognition and positive user feedback, highlighting the conversational agent’s potential as an effective means of supporting narrative medicine techniques.

Exploring Persuasive Engagement to Reduce Over-Reliance on AI-Assistance in a Customer Classification Case

Users often over-rely on AI-assisted decisions without analytically engaging with them, even in practical domains. In this work, we explore persuading users to engage analytically with AI assistance to reduce their over-reliance, using a complex business case of customer classification. We examine the effect of persuasive cognitive engagement, delivered through explanations and communicated system uncertainty, on the behavior of participants with diverse expertise. We leverage their feedback and objective behavior to understand their perception of the AI's performance. Our findings reveal a contrast between participants' subjective impressions and their objective behavior, indicating inappropriate reliance on AI assistance despite their perception of system performance. However, we observe the benefits of interactive cognitive engagement and identify further directions for gaining deeper insights into expert domains with personalized AI assistance and behavioral persuasion.

GNN’s FAME: Fairness-Aware MEssages for Graph Neural Networks

Graph Neural Networks (GNNs) have shown success in various domains but often inherit societal biases from training data, limiting their real-world applications. Historical data can contain patterns of discrimination related to sensitive attributes like age or gender. GNNs can even amplify these biases due to their topology and message-passing mechanism, where nodes with similar sensitive attributes tend to connect more frequently. While many studies have addressed algorithmic fairness in machine learning through pre-processing and post-processing techniques, few have focused on bias mitigation within the GNN training process.

In this paper, we propose FAME (Fairness-Aware MEssages), an in-processing bias mitigation technique that modifies the GNN training’s message-passing algorithm to promote fairness. By incorporating a bias correction term, the FAME layer adjusts messages based on the difference between the sensitive attributes of connected nodes. FAME is compatible with Graph Convolutional Networks, and a variant called A-FAME is designed for attention-based GNNs. Experiments conducted on three datasets evaluate the effectiveness of our approach against three classes of algorithms and six models, considering two notions of algorithmic fairness. Results show that the proposed approaches produce accurate and fair node classifications. These results provide a strong foundation for further exploration and validation of this methodology. The source code is available at https://github.com/HannanJaved/FAME.
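The linked repository holds the real implementation; purely to fix intuition, here is a hypothetical scalar version of a fairness-aware message weight. The weighting scheme, the gamma parameter, and the mean aggregation are all illustrative assumptions, not the paper's actual FAME layer:

```python
def fame_aggregate(h_neighbors, s_neighbors, s_i, gamma=0.5):
    """Aggregate neighbour features while attenuating messages from
    same-group neighbours (same sensitive attribute) and amplifying
    cross-group ones, to counter homophily in message passing."""
    weighted = []
    for h_j, s_j in zip(h_neighbors, s_neighbors):
        same = s_i == s_j
        w = (1.0 - gamma) if same else (1.0 + gamma)  # bias-correction weight
        weighted.append(w * h_j)
    return sum(weighted) / len(weighted)

# A node with one same-group and one cross-group neighbour:
# the cross-group message contributes three times as much (gamma=0.5)
agg = fame_aggregate([1.0, 1.0], [0, 1], s_i=0)
```

Setting gamma=0 recovers a plain mean aggregation, which is the unfair baseline this kind of correction is meant to improve on.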

Integrating Expert Knowledge With Automated Knowledge Component Extraction for Student Modeling

Knowledge tracing is a method to model students’ knowledge and enable personalized education in many STEM disciplines such as mathematics and physics, but has so far still been a challenging task in computing disciplines. One key obstacle to successful knowledge tracing in computing education lies in the accurate extraction of knowledge components (KCs), since multiple intertwined KCs are practiced at the same time for programming problems. In this paper, we address the limitations of current methods and explore a hybrid approach for KC extraction, which combines automated code parsing with an expert-built ontology. We use an introductory (CS1) Java benchmark dataset to compare its KC extraction performance with the traditional extraction methods using a state-of-the-art evaluation approach based on learning curves. Our preliminary results show considerable improvement over traditional methods of student modeling. The results indicate the opportunity to improve automated KC extraction in CS education by incorporating expert knowledge into the process.

Learning User Interface Preferences via Contextual Discrete Choice Experimentation

Designing effective user interfaces (UIs) is a complex decision-making process that often relies on usability testing and understanding users’ preferences. However, user preferences can vary widely based on contextual information (such as age, nationality, or use-case of the software system), posing a significant challenge in creating universally effective studies. To address this, we propose a novel framework of contextual discrete choice experimentation (DCE) to learn the relationship between contextual information and user preferences, enabling the creation of more statistically efficient studies for new cohorts of participants. In this framework, users are presented with a sequence of questions where they choose their preferred option between two or more design alternatives. This preference data, combined with contextual information, is used to develop a statistical model that recommends UI designs to new or existing users. We detail the methodology for designing contextual DCEs and demonstrate its application with a real-world example involving users of a statistical software system. Our results indicate that the contextual DCE framework effectively captures user preferences and provides personalized UI recommendations.
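Discrete choice data of this kind is typically modelled with a multinomial logit in which context enters through interaction terms. The sketch below shows the basic probability computation; the design features, context features, and coefficient values are invented for illustration:

```python
from math import exp

def choice_probs(utilities):
    """Multinomial logit: probability that each alternative is chosen."""
    z = [exp(u) for u in utilities]
    total = sum(z)
    return [v / total for v in z]

def utility(design, context, beta, gamma):
    """Linear utility: design part-worths (beta) plus a full matrix of
    design-by-context interactions (gamma[i][j] for context i, design j)."""
    base = sum(b * x for b, x in zip(beta, design))
    inter = sum(gamma[i][j] * c * x
                for i, c in enumerate(context)
                for j, x in enumerate(design))
    return base + inter

# Two UI alternatives described by (large_font, dark_mode) indicators
design_a, design_b = [1.0, 0.0], [0.0, 1.0]
beta = [0.2, 0.5]          # population-level part-worths
gamma = [[0.8, -0.4]]      # one context feature: age-over-50 indicator
p_old = choice_probs([utility(design_a, [1.0], beta, gamma),
                      utility(design_b, [1.0], beta, gamma)])
p_young = choice_probs([utility(design_a, [0.0], beta, gamma),
                        utility(design_b, [0.0], beta, gamma)])
```

With these made-up coefficients, the older cohort prefers the large-font design while the younger cohort prefers dark mode, which is exactly the kind of context-dependent preference the framework is built to learn.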

Leveraging LLMs to Explain the Consequences of Recommendations

Recommender systems help users make better decisions by suggesting products that match their preferences. However, users often do not understand why certain products are recommended, which can reduce trust and satisfaction. While explanations address this issue, they often fail to communicate the individual impact that deciding on an item will have. To address this, we present an LLM-based framework for generating consequence-based explanations. These explanations provide comprehensible, personalized insights into the positive and negative consequences of user decisions. To support the assessment and selection of the most effective prompting strategy, we introduce evaluation metrics tailored to consequence-aware explanations and systematically compare different prompting strategies on an apartment recommendation example.

LLMs and Emotional Intelligence: Evaluating Emotional Understanding Through Psychometric Tools

This study investigates the Emotional Intelligence (EI) of LLMs by evaluating their ability to replicate human-like emotional reasoning and self-assessment using established psychometric tools. Prior research has utilized standardized tests to evaluate LLMs in unambiguous emotional reasoning contexts, where a single correct answer is expected. However, emotional situations are rarely simple, and subjective evaluation can often produce equally valid alternative interpretations. This preliminary study proposes to investigate LLM EI using psychometric tests developed for humans, where the emotional reasoning contexts are more complex, subject to multiple interpretations, and often subjective. Four mainstream LLMs (GPT-4o, Mixtral, Gemma-2, and Llama-3.1) are evaluated using two well-established EI assessments: the Trait Emotional Intelligence Questionnaire (TEIQUE) and the Situational Test of Emotional Understanding (STEU). TEIQUE assesses self-perceived emotional capabilities across the facets of well-being, self-control, emotionality, and sociability, while STEU evaluates situational emotional reasoning in real-world contexts. The experiments reveal distinctive self-awareness traits in LLMs, varying levels of safety alignment, and their ability to interpret emotional situations. A human study is conducted to evaluate the reasonableness of LLM emotional appraisals that deviate from expected responses. The results highlight the potential of LLMs to act as tools for emotional interpretation, transcending the deterministic outputs of traditional NLP systems. Finally, this study concludes by discussing the need for non-deterministic and more sophisticated EI assessments that better align with human EI.

Mitigating Risks in Marketplace Semantic Search: A Dataset for Harmful and Sensitive Query Alignment

Semantic search engines have transformed user interaction with online marketplaces, creating a need for effective methods to moderate harmful and sensitive content. Existing approaches often struggle with ambiguous query intent, content classification challenges, and noisy data, making it difficult to ensure user safety while maintaining relevance in search results. To address these challenges, we introduce SHIELD, a synthetic dataset designed for classifying user queries into harmful, sensitive, and normal categories. SHIELD is generated using a large language model with a structured taxonomy, followed by automated filtering using a reward model to ensure data quality and relevance. To demonstrate SHIELD’s utility, we evaluate three classification approaches: (1) BM25, a computationally efficient retrieval-based method; (2) a sentence transformer with FAISS, which improves classification by leveraging semantic embeddings; and (3) MoralBERT, a fine-tuned transformer model trained on SHIELD for direct query classification. We discuss the trade-offs among these methods in terms of accuracy, resource requirements, and explainability, highlighting their applicability in real-world semantic search systems. This work provides a foundation for developing AI-driven content moderation systems in semantic search, offering insights into the trade-offs between efficiency, accuracy, and explainability. The SHIELD dataset, pre-trained model, and generation details are publicly available to support future research and real-world deployment: https://github.com/flpspacek/SHIELD.

Warning: This paper contains examples of language and content that some readers may find offensive. These examples are included solely to illustrate challenges in content moderation and to highlight the importance of ethical considerations in semantic search systems.
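As rough intuition for the embedding-based variant (approach 2 above), classification reduces to nearest-neighbour search in embedding space. The centroid classifier below is a deliberately simplified stand-in: a real system would use learned sentence embeddings and a FAISS index, and the two-dimensional vectors here are fabricated:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def classify_query(query_vec, class_centroids):
    """Assign a query embedding to the nearest class centroid
    (e.g., harmful / sensitive / normal) by cosine similarity --
    the operation a FAISS index accelerates at scale."""
    return max(class_centroids, key=lambda c: cosine(query_vec, class_centroids[c]))

centroids = {"harmful": [1.0, 0.0], "sensitive": [0.7, 0.7], "normal": [0.0, 1.0]}
label = classify_query([0.9, 0.1], centroids)
```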

"Strangers in a new culture see only what they know": Evaluating Effectiveness of GPT-4 Omni for Detecting Cross-Cultural Communication Norm Violations

Cross-cultural communication often results in misaligned norms and expectations, leading to misunderstandings or harm. As the internet increasingly facilitates cross-cultural communication online, such misalignments also increase. However, there is an opportunity to use Large Language Models (LLMs) to detect such misunderstandings and assist in addressing them. To that end, this study investigates whether cross-cultural norm violations can be detected and mitigated using popular LLMs. Using a set of carefully constructed cross-cultural communication scenarios, half of which present norm violations, we test the ability of OpenAI’s GPT-4 Omni (GPT-4o) model to identify cross-cultural communication norm violations. We find that GPT-4o classification accuracy varies by the stated age, gender, and nationality of the communicators described in the scenarios, suggesting a lack of fairness and a potential cultural gap in GPT-4o’s detection.

Training Green and Sustainable Recommendation Models: Introducing Carbon Footprint Data into Early Stopping Criteria

With the growing focus on Green AI, there is an urgent need for algorithms designed to minimize their environmental impact while maintaining satisfactory performance. In this paper, we introduce a novel early stopping strategy that considers carbon footprint data while training a recommendation algorithm. In particular, during the training phase, our criterion analyzes, epoch by epoch, the improvement in predictive accuracy and compares it to the increase in carbon emissions. We then weigh the trade-off between the two scores, and when accuracy improves at an unfavorable rate relative to emissions, training is stopped.

In the experimental evaluation, we showed that our strategy could significantly reduce the carbon footprint of several state-of-the-art recommendation models, with a limited decrease in accuracy and fairness. While more work is needed to automatically balance the trade-off between accuracy and emissions, this paper sheds light on the need for more sustainable recommendation models and takes a significant step toward designing green training strategies.
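The epoch-by-epoch criterion can be sketched as a gain-per-emission threshold. The threshold value, the units, and the exact functional form below are illustrative guesses, not the paper's actual stopping rule:

```python
def should_stop(acc_history, epoch_emissions, min_gain_per_kg=0.01):
    """Stop training when the last epoch's accuracy gain, divided by
    the CO2 emitted during that epoch, falls below an acceptable rate.

    acc_history: validation accuracy after each epoch.
    epoch_emissions: kg of CO2 emitted during each epoch."""
    if len(acc_history) < 2:
        return False  # need at least one completed epoch-to-epoch delta
    gain = acc_history[-1] - acc_history[-2]
    emitted = epoch_emissions[-1]
    return gain / emitted < min_gain_per_kg

# Early training: a large gain justifies the emissions -> keep going
keep = should_stop([0.70, 0.80], [5.0, 5.0])
# Plateau: a tiny gain for the same emissions -> stop
stop = should_stop([0.8010, 0.8012], [5.0, 5.0])
```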

Unveiling Creativity in Student Code: A Gaussian Mixture Model Approach

Creativity, characterized by the capacity to generate novel and valuable ideas or solutions through imaginative thinking and unique problem-solving, differs widely between individuals. Despite its importance, this variability is often overlooked in research on personalization in education. In this study, our goal is to personalize creativity within a programming learning platform for school students. Leveraging a unique dataset of students’ initial coding attempts, we employ a Gaussian Mixture Model to identify distinct creativity profiles among learners. By integrating these insights into user modeling, this work lays the foundation for developing personalized programming curricula tailored to each student’s creative strengths, highlighting the potential of creativity-aware adaptive systems in education. We make our data and code publicly available at: https://github.com/sveron/Creativity.
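The clustering step can be illustrated with a tiny expectation-maximization routine for a two-component, one-dimensional mixture. The actual study presumably fits a full GMM library to multi-dimensional code-derived features; everything below, including the "creativity score" data, is a simplified stand-in:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2.0 * var)) / sqrt(2.0 * pi * var)

def fit_gmm2(data, iters=100):
    """EM for a 2-component 1-D Gaussian mixture, initialized at the
    data extremes. Returns (weights, means, variances)."""
    mus = [min(data), max(data)]
    vars_ = [1.0, 1.0]
    ws = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(ws, mus, vars_)]
            s = sum(p)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate parameters from responsibilities
        for j in range(2):
            nj = sum(r[j] for r in resp)
            ws[j] = nj / len(data)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            vars_[j] = max(
                sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-6,  # floor to avoid a degenerate component
            )
    return ws, mus, vars_

# Two well-separated clusters of per-student scores
scores = [0.10, 0.15, 0.20, 0.90, 0.95, 1.00]
ws, mus, vars_ = fit_gmm2(scores)
```

The recovered component means correspond to distinct "creativity profiles", to which students can then be soft-assigned via the responsibilities.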

Why Context Matters: Exploring How Musical Context Impacts User Behavior, Mood, and Musical Preferences

Music consumption is shaped by both internal factors (e.g., mood, motivation) and external factors (e.g., activity, social environment), which together influence listeners’ behavior (e.g., focus, song skips) and reactions (e.g., mood changes). While prior research has explored real-life or survey-based, context-aware music listening with limited available context information, we introduce a dataset comprising 216 music listening sessions collected in real-world settings through a custom-built Android mobile application designed to assess a wide range of contextual factors. The dataset captures static (e.g., activity, social environment, motivation) and dynamic (e.g., mood changes) contextual factors, along with music interaction data (e.g., skipped or fully listened songs), listening focus levels, and participant traits (e.g., demographics, music education, listening preferences, personality).

Our analysis highlights key insights into how different contextual factors influence user behavior and mood, demonstrating significant differences in skipping songs, focus levels, and genre diversity. We show that music listening sessions grouped by context are significantly different in terms of music listening behaviors (focus, skipping, and session genre diversity) and mood changes (happiness, sadness, stress, and energy). Furthermore, we explore the correlations between personality traits and listening behaviors (mean skip rate and genre diversity). Ultimately, our findings emphasize the importance of understanding context, as different situations lead to distinct music preferences and have varying impacts on user behavior and emotional responses.

SESSION: Industry Papers

Adaptive User Modeling in Visual Merchandising: Balancing Brand Identity with Operational Efficiency

Maintaining a consistent brand identity across a global network of retail stores while adhering to local constraints has long challenged Visual Merchandisers. Legacy processes, often reliant on subjective “by-eye” adjustments, can drive up operational costs and lead to inconsistent in-store execution. We formalize a user modeling framework implementing a multi-criteria utility function that balances brand identity and operational overhead. We integrated our framework into a 3D virtual tour design platform, deploying it in the ecosystem of OVS, a global fashion firm. Through a preliminary user study, we show that our solution enables shorter iteration cycles and decreases store-to-store discrepancies.

Enhancing Personalisation in Fantasy Sports with Graph-Based Representations

In this paper, we propose a technique that leverages graph-based representation learning using the GraphSAGE algorithm to furnish diverse personalized communications tailored to each user’s unique engagement patterns within the fantasy sports ecosystem. By curating such personalized user suggestions, we promote diverse user engagement that is more customer-centric while maintaining business metrics. We perform offline and online experiments to evaluate the effectiveness of our approaches in terms of their impact on different user engagement and business metrics.
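GraphSAGE's core idea, aggregating neighbour embeddings before a learned transform, can be sketched for scalar node features. Real GraphSAGE uses learned weight matrices, neighbour sampling, and vector features; the weights and the tiny interaction graph below are made up:

```python
def sage_mean_layer(h, adj, w_self, w_neigh):
    """One GraphSAGE-style update with mean aggregation over scalar
    node features: combine each node's own state with the average of
    its neighbours, then apply a ReLU."""
    out = []
    for i, hi in enumerate(h):
        neigh = [h[j] for j in adj[i]]
        agg = sum(neigh) / len(neigh) if neigh else 0.0
        out.append(max(0.0, w_self * hi + w_neigh * agg))
    return out

# Tiny user-interaction graph: node 0 connects to nodes 1 and 2
h = [1.0, 2.0, 4.0]
adj = {0: [1, 2], 1: [0], 2: [0]}
h1 = sage_mean_layer(h, adj, w_self=1.0, w_neigh=0.5)
```

Stacking such layers lets a node's representation absorb information from progressively larger neighbourhoods, which is what makes the learned embeddings useful for personalization.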

Finding Interest Needle in Popularity Haystack: Improving Retrieval by Modeling Item Exposure

Recommender systems operate in closed feedback loops, where user interactions reinforce popularity bias, leading to over-recommendation of already popular items while under-exposing niche or novel content. Existing bias mitigation methods, such as Inverse Propensity Scoring (IPS) and Off-Policy Correction (OPC), primarily operate at the ranking stage or during training, lacking explicit real-time control over exposure dynamics. In this work, we introduce an exposure-aware retrieval scoring approach, which explicitly models item exposure probability and adjusts retrieval-stage ranking at inference time. Unlike prior work, this method decouples exposure effects from engagement likelihood, enabling controlled trade-offs between fairness and engagement in large-scale recommendation platforms. We validate our approach through online A/B experiments in a real-world video recommendation system, demonstrating a 25% increase in uniquely retrieved items and a 40% reduction in the dominance of over-popular content, all while maintaining overall user engagement levels. Our results establish a scalable, deployable solution for mitigating popularity bias at the retrieval stage, offering a new paradigm for bias-aware personalization.
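One way to read the retrieval-stage adjustment is as an inverse-exposure discount on the engagement score. The abstract does not give the exact functional form, so the exponent alpha and the probability floor below are assumptions:

```python
def exposure_adjusted_score(engagement_score, exposure_prob, alpha=1.0, floor=1e-3):
    """Discount the retrieval score of items with high predicted exposure
    probability, decoupling engagement likelihood from how often the
    item has already been shown. alpha=0 recovers the raw score."""
    return engagement_score / max(exposure_prob, floor) ** alpha

# A heavily exposed popular item vs. a rarely shown niche item
popular = exposure_adjusted_score(0.5, exposure_prob=0.9)
niche = exposure_adjusted_score(0.4, exposure_prob=0.1)
```

After the adjustment the niche item outranks the popular one despite a lower raw engagement score, and alpha gives the real-time control knob over the fairness-engagement trade-off that the abstract describes.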

Personalized Fashion Advertising with Large Language Models: A Case Study on Fine-Tuning for Marketing Copy Generation

The rapid digitalization of the fashion industry has transformed marketing strategies, emphasizing the need for personalized and adaptive advertising content. This paper presents a case study on fine-tuning Large Language Models (LLMs) for fashion advertising, focusing on OVS, a major Italian fashion retailer. By leveraging real-world marketing data from OVS’s newsletters and social media campaigns, we developed a fine-tuned model capable of generating engaging and stylistically coherent promotional content. To evaluate the effectiveness of this approach, we introduced a novel brand compliance index, measuring the alignment of AI-generated text with key branding requirements, such as audience targeting, event specificity, and platform appropriateness. Experimental results show that the fine-tuned model achieved a compliance score of 0.82, significantly outperforming the baseline model (0.63). Although this approach introduces a minor increase in generation latency, the enhanced alignment with brand identity justifies its use in marketing automation. Our findings highlight the potential of fine-tuned LLMs to streamline advertising content generation while maintaining brand consistency, offering valuable insights for the future of AI-driven digital marketing.

Predicting Movie Hits Before They Happen with LLMs

Addressing the cold-start issue in content recommendation remains a critical ongoing challenge. In this work, we focus on tackling the cold-start problem for movies on a large entertainment platform. Our primary goal is to forecast the popularity of cold-start movies using Large Language Models (LLMs) leveraging movie metadata. This method could be integrated into retrieval systems within the personalization pipeline or could be adopted as a tool for editorial teams to ensure fair promotion of potentially overlooked movies that may be missed by traditional or algorithmic solutions. Our study validates the effectiveness of this approach compared to established baselines and those we developed.

SESSION: Doctoral Consortium Papers

AI-Assisted Learning

Introductory programming courses present significant challenges for novice learners, often leading to frustration and difficulty in identifying learning gaps. This research aims to develop an AI-driven tool that provides personalized guidance, moving beyond traditional "one-size-fits-all" approaches. Recognizing the limitations of relying solely on digital interaction logs in the era of generative AI, we explore the integration of student personal characteristics and fine-grained programming interactions to predict learning behavior and performance. We will investigate how to accurately predict student outcomes early in the semester, analyze the dynamics of learning behaviors, and design an AI-assisted tool to recommend tailored learning materials and feedback. Our goal is to foster effective learning and mitigate the risks associated with over-reliance on general-purpose AI, ultimately enhancing knowledge retention and problem-solving skills.

Cognitive-Emotional Modeling and Hybrid Intelligence: A User-Centered Approach to Psychomotor Interventions in Active Aging

Aging presents challenges to society. Scientific research suggests that regular physical activity and cognitive stimulation are essential for active aging. In this context, this research proposes a novel approach that combines hybrid intelligence and neuroscience to model people's cognitive functions and emotional states, generating psychomotor interventions that promote active aging.

Enabling Novices to Diagnose Robot Failures by Aligning Users' Mental Models of Robots

As robots continue to be adopted into our everyday lives, they may encounter unforeseen circumstances, resulting in failures that require assistance from nearby people. When people enter interactions with robots, they leverage their mental models of the system and its functions. These mental models are based on a person’s knowledge of and experiences with that robot and others. For this reason, the models are often incomplete or inaccurate, resulting in inefficient interactions. Understanding a complex robot and its functions is difficult, especially for novices. Therefore, when robots require assistance, it is necessary for them to explain their failures in a manner that not only provides enough context for a person to resolve the error, but that also helps correct people’s misaligned mental models. Through this work, I aim to enable non-experts to more efficiently and effectively diagnose and resolve robot failures.

Integrating Indoor Positioning, Recommendation, and Personalization to Enhance Museum Visitor Experiences

Personalization in Cultural Heritage (CH) settings is crucial for transforming visitor experiences into meaningful interactions accommodating diverse expectations and preferences. This research presents a holistic framework to enhance visitor experiences in physical CH institutions, such as Galleries, Libraries, Archives, and Museums (GLAM), through the combination of Indoor Positioning Systems (IPS), Recommender Systems, and Large Language Models (LLMs). Our Bluetooth beacon-based IPS implementation has been successfully deployed in a major gallery in Rome. The system covers 17 rooms and over 100 artworks and provides the user’s position with high accuracy. We conceptualized a recommendation algorithm to optimize visitor engagement by progressively increasing mean dwell time while considering spatial and temporal constraints. Moreover, our experiments with LLM-generated audioguides demonstrate that visitors prefer content tailored to established visitor categories, validating our approach to personalization. These findings provide empirical support for personalized digital interpretation in GLAM contexts, though challenges remain regarding IPS precision over time and LLM hallucination mitigation. Future work will focus on collecting visitor interaction data, implementing the recommender system, and potentially releasing datasets to address the scarcity of CH-specific positioning and recommendation data.

Investigating Speech and Multimedia Integration in Assistive Robots

This doctoral research investigates the integration of speech and multimedia elements in Human-Robot Interaction to enhance communication between humans and assistive robots. The study aims to define, design, implement, and validate adaptive interaction models that can facilitate social engagement, cognitive stimulation, and emotional support through personalized and dynamic interactions. A key objective is to improve natural language dialogue by adapting robot communication to users’ preferences and emotions, dynamically extracted during interaction. This personalization seeks to make the “social moment” more effective, tailoring dialogue strategies to the unique characteristics of individuals.

Just a Chill Robot: Strategies for Relatable and Personalized Assistive Robots for Autistic Children

This research explores innovative approaches to improving robotic therapeutic assistants for autistic children. It aims to design a social robot that is believable as a peer for the children, implementing a relatable personality, a youth-inspired communication style, and interests reflecting the children’s own. Robot-assisted therapies have proved engaging and have produced better outcomes with autistic children than traditional therapies. Still, there is a need for improved customisation of robot behaviour and communication, as well as greater inclusivity in a field characterised by the unbalanced participation of boys interested in robotics. This research builds on suggestions in the literature that these issues can be addressed by applying a user-centred approach and leveraging a dynamic user model (UM), and it offers proposals to customise the robot’s behaviour, making it a more versatile tool for therapists. As part of the "FeelGood!" project, this study benefits from the expertise of a multidisciplinary team contributing the perspectives of different neurotypes. The co-design process, which innovatively includes both autistic and allistic children, aims to incorporate feedback from every involved group in designing engaging, relatable, and therapeutically meaningful interactions.

Segment, Recommend, and Explain: Advancing Conversational Recommender Systems with Large Language Model Agents

From personalized shopping assistants to streaming service recommendations, Conversational Recommender Systems (CRSs) have become essential tools for decision-making. However, users often struggle to understand why particular items are recommended, leading to reduced trust, lower engagement, and less effective decision-making. At the same time, CRSs usually face challenges in recommendation accuracy as they struggle to integrate historical data, real-time feedback, and contextual signals effectively. This combination of unclear explanations and suboptimal recommendations diminishes user experience and system reliability. To address these challenges, we explore three key questions: (i) how the integration of structured and unstructured data can enhance downstream tasks in a Multi-Agent System (MAS)-based CRS architecture, (ii) how a MAS can generate dynamic and user-centric explanations, and (iii) how cross-agent collaboration can optimize recommendation accuracy. MAS-based architectures distribute critical tasks such as decision-making, data retrieval, optimization, and reasoning across specialized agents. A meta-agent oversees these agents, ensuring coordination and adaptability. This research aims to enhance the clarity of explanations and improve recommendation accuracy by integrating modular agents in a coordinated MAS framework, providing more personalized explanations and adapting to evolving user preferences.

Supporting User Information Processing Through Large Language Models Within the Political Sphere

How do we support information processing within the political domain? By incorporating personalization and guardrails, large language model (LLM) systems can be leveraged to support navigation through the information ecosystem. In this work, I outline a proposal for designing LLM systems in two areas: mitigating misinformation belief and bolstering information processing. These tools draw on theories of persuasion, information processing, and motivated reasoning to speak to the end user and nudge them to pursue accuracy when presented with information. These interventions will not only extend research within these domains, but also support an individual’s ability to interpret the information they are provided.

Teaming in the AI Era: AI-Augmented Frameworks for Forming, Simulating, and Optimizing Human Teams

Effective teamwork is essential across diverse domains. During the team formation stage, a key challenge is forming teams that effectively balance user preferences with task objectives to enhance overall team satisfaction. In the team performing stage, maintaining cohesion and engagement is critical for sustaining high team performance. However, existing computational tools and algorithms for team optimization often rely on static data inputs, narrow algorithmic objectives, or solutions tailored for specific contexts, failing to account for the dynamic interplay of team members’ personalities, evolving goals, and changing individual preferences. As a result, teams may encounter member dissatisfaction, since purely algorithmic assignments can reduce members’ commitment to team goals, or suffer suboptimal engagement due to the absence of timely, personalized guidance that helps members adjust their behaviors and interactions as team dynamics evolve. Ultimately, these challenges can reduce overall team performance.

Driven by these challenges, my Ph.D. dissertation aims to develop AI-augmented team optimization frameworks and practical systems that enhance team satisfaction, engagement, and performance. First, I propose a team formation framework that leverages a multi-armed bandit algorithm to iteratively refine team composition based on user preferences, ensuring alignment between individual needs and collective team goals to enhance team satisfaction. Second, I introduce tAIfa (“Team AI Feedback Assistant”), an AI-powered system that utilizes large language models (LLMs) to deliver immediate, personalized feedback to both teams and individual members, enhancing cohesion and engagement. Finally, I present PuppeteerLLM, an LLM-based framework that simulates multi-agent teams to model complex team dynamics within realistic environments, incorporating task-driven collaboration and long-term coordination. My work takes a human-centered approach to advancing AI-driven team optimization through both theoretical frameworks and practical systems that improve team members’ satisfaction, engagement, and performance.

Towards Intelligent VR Training: A Physiological Adaptation Framework for Cognitive Load and Stress Detection

Adaptive Virtual Reality (VR) systems have the potential to enhance training and learning experiences by dynamically responding to users’ cognitive states. This research investigates how eye tracking and heart rate variability (HRV) can be used to detect cognitive load and stress in VR environments, enabling real-time adaptation. The study follows a three-phase approach: (1) conducting a user study with the Stroop task to label cognitive load data and train machine learning models to detect high cognitive load, (2) fine-tuning these models with new users and integrating them into an adaptive VR system that dynamically adjusts training difficulty based on physiological signals, and (3) developing a privacy-aware approach to detecting high cognitive load and comparing it with the adaptive VR system from Phase 2. This research contributes to affective computing and adaptive VR using physiological sensing, with applications in education, training, and healthcare. Future work will explore scalability, real-time inference optimization, and ethical considerations in physiologically adaptive VR.

Towards Personalized Physiotherapy via Common Semantic Fusion: Multi-Modal Learning, Computer Vision and Empathetic NLP

This research develops an AI-driven framework for personalized physiotherapy by integrating multi-modal learning, computer vision, and empathetic NLP. It focuses on user modeling and personalization to enhance physiotherapy assessments via an optimized YOLO Pose algorithm, fusing visual, auditory, and textual data for comprehensive mobility evaluation. Preliminary results show improved pose estimation, supporting the potential for clinical validation and integration of additional modalities such as inertial sensors.