The Platform for OPen Recommendation and Online eXperimentation (POPROX) is a new resource to allow recommender systems and personalization researchers to conduct online user research without having to develop all of the necessary infrastructure and recruit users themselves. Our first domain is personalized news recommendations: POPROX 1.0 provides a daily newsletter (with content from the Associated Press) to users who have already consented to participate in research, along with interfaces and protocols to support researchers in conducting studies that assign subsets of users to various experimental algorithms and/or interfaces.
The purpose of this tutorial is to introduce the platform and its capabilities to researchers in the UMAP community who may be interested in using the system. Participants will walk through the implementation of a sample experiment to demonstrate the mechanics of designing and running user studies with POPROX.
This tutorial provides practical training in designing and conducting online user experiments with personalized systems, and in statistically analyzing the results of such experiments. It will be useful for anyone seeking to conduct user-centric evaluations of UMAP systems—ranging from adaptive user interfaces to human-like conversational artificial intelligence (AI) systems powered by large language models (LLMs). It covers the development of a research question and hypotheses, the selection of study participants, the manipulation of system aspects and the measurement of behaviors, perceptions, and user experiences, and the evaluation of subjective measurement scales and study hypotheses. Slides will be made available at https://www.usabart.nl/QRMS/.
The Digital Services Act (DSA) establishes a regulatory framework for online platforms and search engines in the European Union, focusing on mitigating systemic risks such as illegal content dissemination, fundamental rights violations, and impacts on electoral processes, public health, and gender-based violence. Very Large Online Platforms (VLOPs) and Very Large Search Engines (VLOSEs), defined as those with over 45 million active recipients, must provide data access for research to enable investigations into these risks and the development of solutions. This tutorial is tailored for the UMAP community, addressing the implications of the DSA for user modelling research. It will cover the DSA’s key provisions and definitions, outline the procedural steps for accessing VLOP and VLOSE data, and discuss the technical aspects of data access requests. Participants will also explore the challenges and opportunities involved in working with this data. By the end of the tutorial, attendees will have a thorough understanding of the DSA’s data access provisions, the technical and procedural requirements for accessing VLOP and VLOSE data, and the regulation’s implications for user modelling research. They will be equipped to navigate the complexities of the DSA and contribute to the development of responsible and transparent online platforms.
Further information and resources about the tutorial are available on the website: https://erasmopurif.com/tutorial-dsa-umap25/.
Well-being is becoming increasingly important as a paradigm shift in human-computer interaction (HCI), especially due to the rise of proactive interfaces and the growing use of artificial intelligence. Beyond the usability and simplicity of interaction, HCI research is confronted with long-term perspectives that demand alternative doctrines, especially concerning the well-being of humans and societies. In this tutorial, three separate research areas are joined to create a comprehensive tutorial on considering well-being orientations in developing intelligent user interfaces. Firstly, the development of intelligent user interfaces will be highlighted from an artificial intelligence (AI) engineering perspective, using modality-based thinking and algorithmic evaluation to build well-being-centered systems. Secondly, this perspective will be extended by the human-centered artificial intelligence (HCAI) approach. Lastly, the tutorial will show how the actual interaction can be designed from a human-centered design perspective with regard to well-being orientations, manifesting specific interaction principles.
This tutorial explores the intersection of sustainability and recommender systems, focusing on aligning user needs and values with sustainable practices. It emphasizes two dimensions: (1) understanding and modeling users to deliver more sustainable recommendations; and (2) fostering sustainability through system design and functionality. Participants will learn how recommender systems can encourage sustainable behaviors and how to enhance system efficiency while minimizing resource consumption and ethical challenges. Through theoretical insights and hands-on sessions, this tutorial offers discussion points and actionable strategies to design human-centered, sustainable recommender systems, addressing both societal impact and technological responsibility.
Fairness in recommender systems is often framed around demographic attributes. In this work, we explore a novel direction—evaluating fairness across latent behavioural communities derived from user interactions on a real-world news platform. Using graph-based community detection (Louvain and Infomap), we identify large user groups and examine how different network modelling choices affect fairness outcomes in both traditional and fairness-aware recommender systems. Experiments on an Austrian news dataset reveal that small changes in graph construction considerably impact community formation and recommendation quality. Notably, fairness-aware algorithms show only marginal improvements over standard approaches, underscoring the complexity of achieving equitable outcomes in real-world systems and raising important questions for future research.
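The graph-based community-detection step described above can be sketched as follows. This is a minimal illustration using NetworkX's Louvain implementation on a toy user-user interaction graph; the node names, edge weights, and graph-construction choice (weighting edges by co-interaction counts) are assumptions for illustration, not the study's actual setup.

```python
# Toy version of the community-detection step: build a weighted user-user
# graph and partition it with Louvain (modularity maximization).
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("u1", "u2", 5), ("u2", "u3", 4), ("u1", "u3", 3),  # densely connected group
    ("u4", "u5", 6), ("u5", "u6", 2), ("u4", "u6", 4),  # second group
    ("u3", "u4", 1),                                    # weak bridge between them
])

# Each returned set is one latent behavioural community; small changes to the
# edges/weights above can change this partition, mirroring the paper's point
# that graph-construction choices affect community formation.
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print(sorted(len(c) for c in communities))
```

Fairness metrics would then be computed per community rather than per demographic group, which is the direction the abstract proposes.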
Lack of data is a recurring problem in Artificial Intelligence, as data are essential for training and validating models. This is particularly true in the field of cultural heritage, where the number of open datasets is relatively limited and where the data collected do not always allow for holistic modeling of visitors’ experience, because the data are gathered ad hoc (i.e., restricted to the sole characteristics required for the evaluation of a specific model). To overcome this lack, we conducted a study between February and March 2019 aimed at obtaining comprehensive and detailed information about visitors, their visit experience, and their feedback. We equipped 51 participants with eye-tracking glasses, leaving them free to explore the three floors of the museum for an average of 57 minutes and to discover an exhibition of more than 400 artworks. On this basis, we built an open dataset combining contextual data (demographic data, preferences, visiting habits, motivations, social context...), behavioral data (spatiotemporal trajectories, gaze data), and feedback (satisfaction, fatigue, liked artworks, verbatim...). Our analysis made it possible to re-enact visitor identities combining the majority of characteristics found in the literature [3, 8, 9, 10, 16, 19] and to reproduce the Veron and Levasseur profiles [17]. This dataset will ultimately make it possible to improve the quality of recommended paths in museums by personalizing the number of points of interest (POIs), the time spent at these different POIs, and the amount of information to be provided to each visitor based on their level of interest. Dataset URL: https://mbanv2.loria.fr/
Group Recommender Systems aim to support groups in making collective decisions, and research has consistently shown that the more we understand about group members and their interactions, the better support such systems can provide. In this work, we propose a conceptual framework for modeling group dynamics from group chat interactions, with a particular focus on decision-making scenarios. The framework is designed to support the development of intelligent agents that provide advanced forms of decision support to groups. It consists of modular, loosely coupled components that process and analyze textual and multimedia content, which is shared in group interactions, to extract user preferences, emotional states, interpersonal relationships, and behavioral patterns. By incorporating sentiment analysis, summarization, dialogue state tracking, and conflict resolution profiling, the framework captures both individual and collective aspects of group behavior. Unlike existing approaches, our model is intended to operate dynamically and adaptively during live group interactions, offering a novel foundation for group recommender and decision support systems.
ChatGPT has demonstrated remarkable versatility across various domains, including Recommender Systems (RSs). Unlike traditional RSs, ChatGPT generates recommendations through natural language, leveraging contextual cues and large-scale knowledge representations. However, it remains unclear whether these recommendations implicitly encode collaborative patterns, rely on semantic item similarities, or follow a fundamentally different paradigm. In this work, we systematically analyze ChatGPT’s recommendation behavior by comparing its generated lists to collaborative and content-based filtering baselines across three domains: Books, Movies, and Music. Using established list similarity metrics, we quantify the alignment of ChatGPT’s recommendations with traditional paradigms. Additionally, we investigate the most recommended items by ChatGPT and the other recommenders, comparing the distribution of frequently recommended items across models. Our findings reveal that ChatGPT exhibits strong similarities to collaborative filtering (CF) and amplifies popular yet underrepresented items in the dataset, suggesting a broader domain knowledge encoded in the language model and the need for future research on leveraging LLMs for recommendation tasks.
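The list-comparison approach described above can be illustrated with a small sketch. The two metrics below (top-k Jaccard overlap and a truncated rank-biased overlap) are common established list similarity measures, but they are illustrative choices here, not necessarily the paper's exact metrics, and the item lists are made up.

```python
# Two simple similarity measures for comparing ranked recommendation lists.
def jaccard_at_k(a, b, k):
    """Set overlap between the top-k items of two recommendation lists."""
    sa, sb = set(a[:k]), set(b[:k])
    return len(sa & sb) / len(sa | sb)

def rbo(a, b, p=0.9, depth=10):
    """Truncated rank-biased overlap: agreement at top ranks weighs more."""
    score = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(a[:d]) & set(b[:d])) / d
        score += (1 - p) * p ** (d - 1) * overlap
    return score

# Hypothetical top lists from an LLM recommender and a CF baseline.
llm_list = ["Dune", "1984", "Foundation", "Hyperion"]
cf_list  = ["Dune", "Foundation", "It", "1984"]

print(jaccard_at_k(llm_list, cf_list, 3))  # 2 shared of 4 distinct -> 0.5
```

Computing such scores between the LLM's lists and each baseline's lists, averaged over users, quantifies how closely the LLM's behavior aligns with collaborative versus content-based paradigms.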
Multitask learning (MTL) has emerged as a promising paradigm for improving recommendation systems by learning multiple related tasks together. In this paper, we present significant improvements in personalized news recommendation by integrating auxiliary and cascaded tasks. Our study compares single-task learning (STL) models with multitask architectures. We evaluated performance on the primary task of news-click prediction, along with predicting user interest in news categories and topics as auxiliary tasks, and fully scrolled prediction and reading pattern prediction as cascaded tasks. Our results indicate that MTL models outperform STL baselines in terms of AUC metrics. These results underscore the benefits of using multiple related tasks to capture richer signals of user behavior, while also highlighting challenges that remain, such as effectively integrating non-click samples in cascaded tasks.
Action Quality Assessment (AQA) plays an important role in evaluating human performance in different domains, including fitness, sports, and healthcare. This work introduces a novel AQA approach by fine-tuning large multimodal models (LMMs) for personalized activity evaluation. We use the Fitness-AQA Dataset, which provides detailed annotations of exercise errors under realistic conditions, and adapt the LLaVA-Video model, a state-of-the-art LMM comprising the Qwen2 large language model and the SigLIP vision encoder. We implemented a customized data preparation pipeline that transforms video-based exercise annotations into a conversational format suited to fine-tuning. To our knowledge, this study is among the first to fine-tune LMMs for AQA tasks and the very first to explore personalized activity evaluation in this context. The experimental evaluation shows that our model achieves results slightly below the baseline, even though it is able to generalize across multiple exercises. The fully reproducible code is available on GitHub: https://github.com/GaetanoDibenedetto/UMAP25.
Aspect-Based Sentiment Analysis (ABSA) aims to identify sentiments associated with specific aspects within a text. It plays a crucial role in applications such as product reviews and customer feedback analysis, where understanding nuanced opinions is essential. However, progress in ABSA remains constrained by the need for fine-grained labeled data, limiting the applicability of supervised models in real-world scenarios. In this study, we propose an unsupervised transformer-based approach that leverages sentence-level sentiment annotations to induce aspect-level sentiment representations. By supervising attention distributions during pretraining, our model learns to aggregate token-level sentiment cues into context-aware aspect sentiment predictions aligned with sentence-level supervision. We further introduce an attention-based correction mechanism to refine aspect sentiment classification by accounting for the local context of each aspect term. Evaluated on benchmark datasets including Restaurants, Laptops, and Twitter domains, our method outperforms unsupervised baselines on aspect category classification while remaining comparable with strong supervised baselines on aspect term sentiment tasks. These results demonstrate that attention-guided pretraining enables robust, domain-adaptive ABSA without requiring aspect-level supervision.
This work presents a platform that facilitates sustainability learning through waste management principles: reducing and reusing waste from everyday items. The platform features (a) an AI assistant that helps users express and develop their ideas about sustainable practices and (b) a commentary interface to support asynchronous collaboration. The AI agent provides context-aware suggestions while encouraging users to modify and personalize these recommendations, creating a collaborative approach to sustainability ideation. A user study was conducted, and the effectiveness of the approach was evaluated. Results demonstrated increased sustainability awareness among participants after using the platform, with varying patterns of improvement across different sustainability approaches. In particular, users who actively modified AI suggestions produced higher-quality contributions with more specific actionable recommendations compared to those who directly copied the AI’s responses. The study also reveals insights into how AI assistance affects content quality. These findings contribute to understanding how AI can be effectively integrated into sustainability education platforms to enhance learning outcomes.
Specifying user stories and epics for feature requirements in agile software development is essential but time-consuming, demanding significant stakeholder effort. To improve this situation, we introduce InteractiveReq, an interactive critiquing-based recommender system that simplifies the generation of high-quality requirements. By using Large Language Models (LLMs), InteractiveReq enables an iterative process where stakeholders refine feature requirements through interactive feedback, addressing the issue of incomplete initial specifications. The system recommends custom drafts of epics and user stories based on the project context and an initial feature description, which users can refine through natural language critiques until their needs are satisfied. This approach aims to reduce workload and offers an intuitive method for requirement management. Preliminary results indicate that InteractiveReq effectively supports the creation of complete and accurate specifications.
Understanding how students perceive and utilize Large Language Models (LLMs) and how these interactions relate to their learning behavior and individual differences is crucial for optimizing educational processes and outcomes. This paper introduces a novel dataset comprising weekly self-reported data from students in an introductory programming course, i.e., students’ AI tool usage, perceived difficulty of weekly subject areas, personality traits, preferred learning styles, and general attitudes toward AI. We present a descriptive overview of the collected data and conduct a correlation analysis to gain first insights into the students’ individual differences and their learning outcomes, frequency of AI tool usage, as well as their attitudes toward AI. The findings reveal that while individual student characteristics did not show significant correlations with final performance or frequency of AI tool usage, the combination of students’ expectations for success and their perceived value of the task (constructs of expectancy theory) was significantly associated with both course outcomes and how often they used the AI tool. Additionally, motivational factors may be the key to fostering positive attitudes toward AI, while personality traits, particularly those related to negative emotionality, may play a more significant role in shaping resistance. This initial analysis lays the groundwork for future investigations on the prospects of AI in support of the students’ learning process.
When shopping for new products, people typically tend to adopt maximizer or satisficer behavior patterns. While satisficers stop searching as soon as they find a suitable product, maximizers seek the best option among all available choices. Even though most people normally behave as satisficers, they tend to adopt maximizer patterns in high-stakes decisions. In this work, we argue that contemporary e-commerce solutions are well-suited to supporting satisficers, but they often lack features that assist maximizers. Among this missing functionality, cut-off alerts, i.e., reassurances that the user has already covered all/most of the potentially relevant options, have not yet been explored in related research. To address this gap, we leverage the observation that high-stakes decisions often occur in content-rich domains. Building on this, we propose an enhanced human-computer interaction model incorporating contextual explanations and cut-off alerts. The proposed functionality was evaluated in a user study, where the stopping-criteria interface variant substantially outperformed the unseen-statistics variant, which resembles the interfaces commonly available on e-commerce platforms.
In this work, we propose JARVIS, a system that aims to provide LLMs with a stronger degree of personalization via a two-hemisphere architecture inspired by the biological organization of the human brain, following a Large Agentic Model (LAM) architecture. The subjective hemisphere operates by dynamically modeling the user’s preferences and iteratively optimizing its behaviors, through a training phase grounded in LoRA (Low-Rank Adaptation), DPO (Direct Preference Optimization), human feedback, and synthetic data ("digital dreams"). Conversely, the objective hemisphere serves a rational-like role, reducing hallucinations and the chances of producing dangerous misinformation using more structural approaches. In JARVIS, these hemispheres are grounded in a dual-level memory capability. Short-term memory keeps track of immediate preferences, ensuring continuity in dialogues and across user behaviors and interactions. Long-term memory is gradually developed to collect all the possible user ground preferences, skills, and general behavioral routines. Unlike current state-of-the-art approaches, JARVIS provides a personalized and context-aware alternative, facilitating seamless and fluent interactions with the end-user.
Effective recommender systems rely on the assumption of well-defined user preferences, an assumption frequently violated in practice. To assist users in developing and understanding their preferences, we designed six different visualizations that juxtapose users’ predicted preferences against those of a larger audience. We conducted think-aloud studies to investigate which visualization best helps users develop and understand their preferences. Our findings contribute to a broader call to develop recommender systems that support users’ self-actualization and long-term perspective.
Public and private organizations rely on opinion surveys to inform business and policy decisions. Yet, empirical surveys are costly and time-consuming. Recent advances in large language models (LLMs) have sparked interest in generating synthetic survey data, i.e., simulated answers based on target demographics, as an alternative to real human data. But how well can LLMs replicate human opinions? In this ongoing project, we develop and critically evaluate methods for synthetic survey sampling. As an empirical benchmark, we collected responses from a representative U.S. sample (n = 461) on preferences for a common consumer good (soft drinks). Then, we developed ASPIRE (Automated Synthetic Persona Interview and Response Engine), a tool that pairs each human participant with a “digital twin” based on their demographic profile and generates synthetic responses via LLM technology. Synthetic data achieved better-than-chance accuracy in matching human responses and approximated aggregate subjective rankings for both binary and Likert-scale items. However, LLM-simulated data overestimated humans’ tendencies to provide positive ratings and exhibited substantially reduced variance compared to real data. The match of synthetic and real data was not systematically related to participants’ age, gender, or ethnicity, indicating no demographic bias. Overall, while synthetic sampling shows promise for modeling aggregate opinion trends, it currently falls short in replicating the variability and complexity of real human opinions. We discuss insights of our ongoing project for accurate and responsible user opinion modeling via LLMs.
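The evaluation described above—matching each human participant against their "digital twin" and comparing response distributions—can be sketched minimally. The paired Likert ratings below are fabricated for illustration; only the two reported effects (better-than-chance matching, reduced synthetic variance) are taken from the abstract.

```python
# Sketch of the human-vs-synthetic comparison: exact-match accuracy per
# paired item, plus a variance check; all numbers here are made up.
from statistics import pvariance

human     = [5, 4, 4, 2, 5, 3, 4, 1, 5, 4]  # hypothetical human Likert ratings
synthetic = [5, 4, 4, 4, 5, 4, 4, 4, 5, 4]  # twin answers skewing positive

# Fraction of items where the digital twin gave exactly the human's answer.
accuracy = sum(h == s for h, s in zip(human, synthetic)) / len(human)
print(accuracy)

# The abstract's variance finding: synthetic data is substantially narrower.
print(pvariance(human), pvariance(synthetic))
```

In the full pipeline this comparison would be run per participant and per item type (binary vs. Likert), and accuracy would be tested against the chance level implied by the response scale.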
AI-driven decision-support systems (DSSs) are increasingly shaping critical choices across industries, yet non-experts often struggle to understand and assess their fairness. This study explores how individuals engage with AI-generated decisions that impact others and how social factors, such as majority or minority opinions, influence fairness judgments. Our initial findings highlight that different groups vary in their interest in understanding AI decisions, with social influence playing a key role in shaping fairness perceptions. These insights provide a valuable foundation for future research on transparency and trust in AI decision-making.
The majority of research in recommender systems, be it algorithmic improvements, context-awareness, explainability, or other areas, evaluates these systems on datasets that capture user interaction over a relatively limited time span. However, recommender systems can very well be used continuously for extended periods of time. Likewise, user behavior may evolve over such extended periods. Although media studies and psychology offer a wealth of research on the evolution of user preferences and behavior as individuals age, there has been scant research in this regard within the realm of user modeling and recommender systems. In this study, we investigate the evolution of user preferences and behavior using the LFM-2b dataset, which, to our knowledge, is the only dataset that encompasses a sufficiently extensive time frame to permit real longitudinal studies and includes age information about its users. We identify specific usage and taste preferences directly related to the age of the user; for instance, while younger users tend to listen broadly to contemporary popular music, older users have more elaborate and personalized listening habits. The findings yield important insights that open new directions for research in recommender systems, providing guidance for future efforts.
We present a systematic study of provider-side data poisoning in retrieval-augmented generation (RAG)-based recommender systems. By modifying only a small fraction of tokens within item descriptions—for instance, adding emotional keywords or borrowing phrases from semantically related items—an attacker can significantly promote or demote targeted items. We formalize these attacks under token-edit and semantic-similarity constraints, and we examine their effectiveness in both promotion (long-tail items) and demotion (short-head items) scenarios. Our experiments on MovieLens, using two large language model (LLM) retrieval modules, show that even subtle attacks shift final rankings and item exposures while eluding naive detection. The results underscore the vulnerability of RAG-based pipelines to small-scale metadata rewrites, and emphasize the need for robust textual consistency checks and provenance tracking to thwart stealthy provider-side poisoning.
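The token-edit constraint mentioned above can be sketched as a simple admission check: a poisoned description is only considered a valid attack if it changes at most a small fraction of the original tokens. The whitespace tokenization, the 15% budget, and the example strings are illustrative assumptions, not the paper's exact formalization.

```python
# Sketch of a token-edit budget: positionwise fraction of tokens changed
# between the original and the poisoned item description.
def edit_fraction(original: str, poisoned: str) -> float:
    """Fraction of token positions that differ (pads the shorter text)."""
    o, p = original.split(), poisoned.split()
    length = max(len(o), len(p))
    diffs = sum(1 for i in range(length)
                if i >= len(o) or i >= len(p) or o[i] != p[i])
    return diffs / length

original = "a quiet drama about family life in rural france"
poisoned = "a thrilling drama about family life in rural france"  # one emotional keyword swapped

frac = edit_fraction(original, poisoned)
print(frac <= 0.15)  # attack stays within a hypothetical 15% token-edit budget
```

A semantic-similarity constraint would be checked analogously, e.g., by requiring the embedding of the poisoned description to stay within some distance of the original's embedding.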
Recommender systems play a vital role in helping users discover content in streaming services, but their effectiveness depends on users understanding why items are recommended. In this study, explanations were based solely on item features rather than personalized data, simulating recommendation scenarios. We compared user perceptions of one-sided (purely positive) and two-sided (positive and negative) feature-based explanations for popular movie recommendations. Through an online study with 129 participants, we examined how explanation style affected perceived trust, transparency, effectiveness, and satisfaction. One-sided explanations consistently received higher ratings across all dimensions. Our findings suggest that in low-stakes entertainment domains such as popular movie recommendations, simpler positive explanations may be more effective. However, the results should be interpreted with caution due to potential confounding factors such as item familiarity and the placement of negative information in explanations. This work provides practical insights for explanation design in recommender interfaces and highlights the importance of context in shaping user preferences.
Conducting user studies that involve physiological and behavioral measurements is very time-consuming and expensive, as it involves not only careful experiment design and device calibration but also careful software testing. We propose Thalamus, a software toolkit for collecting and simulating multimodal signals that helps experimenters prepare in advance for unexpected situations, before reaching out to actual study participants and even before having to install or purchase a specific device. Among other features, Thalamus allows the experimenter to modify, synchronize, and broadcast physiological signals (coming from various data streams) from different devices simultaneously, even when those devices are not located in the same place. Thalamus is cross-platform, cross-device, and simple to use, making it a valuable asset for HCI research.
This paper presents our study on implementing generative AI technology to enhance Reminiscence Therapy (RT). Building on the promising feedback from an earlier prototype tested with individuals aged 70 who were pre-diagnosed with dementia, we refined the system and conducted further evaluations with nonagenarians (aged 90+) diagnosed with late-stage dementia, a group that has been largely underrepresented in existing research. Our system incorporated an image-augmented interaction mechanism, using Azure Computer Vision to analyze photos and Azure Cognitive Services to convert users’ voice input into generative prompts. GPT-4o and Stable Diffusion were used to converse with users and to generate images. Through our system, users were able to engage in conversation and discuss their past experiences as part of a reminiscence exercise, during which corresponding images were generated to increase the vividness of their reminiscence experience. Four pairs of nonagenarians with late-stage dementia and their caregivers participated in our study and used our system. Based on interviews conducted with their caregivers, we found that, due to their severe decline in cognitive capacity, nonagenarians with late-stage dementia were unlikely to tolerate cultural misrepresentation or draw meaningful inferences from the AI-generated images. Moreover, an iterative approach to generating final images might be more effective, as individuals with late-stage dementia could struggle to express themselves fully. Finally, we report the findings and lessons learned from this group of participants and reflect on the practical impact of our study.
Code understanding is a common and important use case for generative AI code assistance tools. Yet, a user’s background, context, and goals may impact the kinds of code explanations that best fit their needs. Our aim was to understand the kinds of configurations users might want for their code explanations and how those relate to their context. We ran an exploratory study with a medium-fidelity prototype and 10 programmers. Participants valued having configurations and desired automated personalization of code explanations. They found particular merit in being able to configure the structure and detail level in code explanations and felt that their needs might change depending on their prior experience and goals.
As artificial intelligence systems take on increasingly agentic roles, they begin making decisions on behalf of users rather than merely supporting them. Consequently, it becomes crucial to understand how closely these systems can replicate human choices. In this study, we examine the extent to which digital traces of user behavior can serve as a foundation for modeling individual preferences. Specifically, we use Facebook status updates, a form of self-disclosed digital traces. Based on these digital traces, the goal is to predict users’ Facebook likes across various categories (e.g., Food, Movies, Public Figures), which serve as behavioral expressions of preference. Tested over 10,000 queries, we find that most categories achieve a prediction accuracy exceeding 60%, indicating generally robust performance of the large language model. These findings suggest that digital traces such as Facebook status updates can reveal meaningful patterns that allow AI systems to learn more about decisions in other contexts.
Human-Robot Interaction (HRI) is a field of study dedicated to understanding, designing, and evaluating robotic systems for use by, or with, humans. In HRI there is a consensus that robotic systems should be able to adapt their behavior based on user actions and behavior. The robot should adapt to emotions and personalities, and it should also have a memory of past interactions with the user to become believable. This is of particular importance in the field of social robotics and social HRI. The aim of this workshop is to bring together researchers and practitioners who are working on various aspects of social robotics and adaptive interaction. The expected result of the workshop is a multidisciplinary research agenda that will inform future research directions and, hopefully, forge some research collaborations.
Theory of Mind (ToM) is described as the capability to attribute mental states to oneself and others, and it can be essential for robots to favor more collaborative, adaptive, and emotionally appropriate behaviors when they are deployed in human-centered environments. In this work, we survey existing methodologies for introducing ToM into Human-Robot Interaction, focusing on two main formalizations: Cognitive ToM, which concerns reasoning about beliefs and intentions in a more task-oriented way, and Affective ToM, which requires the agent to recognize and adapt to others’ emotional states. While both approaches have advanced robot adaptability and user engagement, they are often developed in isolation. For this reason, we discuss the need for integrated models that combine cognitive and affective reasoning, and outline future directions for more socially intelligent and emotionally aware robotic systems.
This paper proposes using social robots to enhance children’s experiences in museums. Specifically, we aim to equip these social robots with multimodal large language models (MLLMs) to generate questions that engage children interactively. To achieve this, we evaluate the capabilities of LLaVA models in generating diverse and relevant questions about artworks, comparing their performance on visual questions with contextual questions. We utilize a subset of the AQUA dataset to assess both quantitative metrics and qualitative aspects of the generated questions. Additionally, we examine the models’ ability to create engaging questions tailored specifically for children. We emphasize how MLLMs can generate questions that may increase enjoyment during visits, promote active observation, and enhance children’s cognitive and emotional engagement with artworks. This approach aims to contribute to more inclusive and effective learning experiences in museum settings.
This paper presents a review of the integration of quantum computing techniques into social robotics, focusing on the potential for enhancing robot adaptability, decision-making, and emotional intelligence. We analyze the current state of research, examining how quantum algorithms, such as Grover’s algorithm, can be applied to improve human-robot interaction. The review, as a task of the QUADRI project (QUAntum-enhanceD human-Robot Interaction: Pioneering Intelligent Social Robotics), provides a comprehensive overview of the opportunities and challenges in this emerging field, setting the foundation for future research and practical applications in domains such as mental health, education, and workplace stress management.
While Human-Robot Interaction (HRI) has seen extensive exploration, Animal-Robot Interaction (ARI) remains a less mature field. This paper presents a first AI-based prototype designed to enable a humanoid robot to recognize emotional and postural states in dogs and adapt its behavior accordingly. Using a deep learning-based pipeline for real-time detection and classification, the robot could adapt its movements to better accommodate canine responses. We propose that such an adaptive approach paves the way for more natural coexistence between robots and animals in domestic settings, raising new challenges in perception, behavior design, and ethics within ARI.
This paper presents a novel approach to the use of the humanoid robot Pepper in educational contexts, focusing on dialogic and multimodal interaction for teaching abstract concepts in mathematics and physics. Unlike traditional lecture-based models, our system supports learner-centered lessons where students engage in spontaneous dialogue with the robot while interacting with visual content displayed on its tablet. The robot responds to user questions through the coordination of speech, synchronized textual output, and dynamic visual cues, such as real-time modifications to vectorial images that highlight relevant elements. To promote accessibility and engagement, the system includes customizable visual features such as color schemes and adjustable font settings for visually impaired users. This approach aims to foster a more inclusive, personalized, and interactive learning experience by adapting the lesson content to the learner’s interests and inquiries.
This paper presents the development of a simulated assistive system based on the NAO humanoid robot, designed to support cognitive engagement and well-being in elderly users. Leveraging the Webots simulation environment, we integrated advanced functionalities including voice interaction through Google Speech Recognition, contextual dialogue using the LLaMA language model, and speech synthesis via pyttsx3. The system enables the virtual NAO (vNAO) to conduct conversational interactions, administer cognitive exercises, issue reminders, and guide users through physical activities, all within a personalized, elderly-friendly virtual environment. Our implementation demonstrates that a simulation-based approach can provide a scalable, accessible framework for testing and deploying socially assistive robotics.
In recent years, adaptive and personalized systems, underpinned by cutting-edge technologies such as Large Language Models (LLMs), have emerged as pivotal forces in reshaping the digital landscape. These systems, seamlessly woven into the fabric of everyday life, manifest in diverse forms—from conversational agents that emulate human-like dialogue to recommendation algorithms that tailor content like music, films, and consumer products to individual preferences. Their pervasive integration into digital platforms has fundamentally altered the ways in which users engage with information, make decisions, and interact with technology, positioning them as indispensable tools in modern society. The transformative potential of these technologies lies in their ability to enhance user engagement, streamline content delivery, and support decision-making processes with unprecedented precision. However, as their influence continues to expand, so too does the urgency to confront critical challenges surrounding transparency, fairness, and user trust. The ExUM workshop seeks to explore these pressing issues, advocating for a balanced approach that prioritizes not only technological efficacy but also ethical integrity and user empowerment.
Developing personalized systems requires architectures that ensure adaptability, explainability, and ethical compliance while maintaining user engagement and trust. To assess whether a system meets these principles, this article puts to the test a novel framework, named CARAIX (Collaborative, Adaptive, and Responsible Artificial Intelligence (AI) assisted by eXplainability), designed to develop intelligent systems with a human-centered approach, to support real-time feedback and bias-aware AI decision-making. CARAIX is inspired by the principles of the Hybrid Intelligence (HI) paradigm and emphasizes the integration of explainable AI techniques in the development process to enhance user interaction and system reliability. This paper analyses, using a peer-validated rubric, how the dimensions of the HI paradigm are integrated across four diverse and real-world learning scenarios, including intelligent tutoring systems, psychomotor skill acquisition, autonomous driving training, and acquiring occupational safety competences. CARAIX is designed for scalability and reuse, facilitating integration into various AI-driven educational domains. We aim to share its potential for sustainable and ethically sound AI-enhanced multidisciplinary learning environments and for the assessment of whether a system complies with HI principles.
Sentiment Analysis (SA) has proven to be an effective tool for recognizing opinions in text. However, the mechanisms by which these models arrive at specific predictions often remain unclear. This paper explores how eXplainable Artificial Intelligence (XAI) techniques can enhance interpretability in sentiment classification. Specifically, we leverage SHAP (SHapley Additive exPlanations) and counterfactual generation to identify words influencing sentiment predictions in movie reviews. Our approach integrates a neural classifier and generates counterfactual examples to reveal how slight text modifications affect model decisions. Experimental results show that SHAP-based attribution and counterfactual analysis provide deeper insights into the linguistic factors driving sentiment classification.
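The workflow described above, attributing sentiment predictions to individual words and then generating counterfactual edits, can be illustrated with a toy sketch. This is not the paper's code: the lexicon weights and antonym map are hypothetical, and leave-one-out occlusion is used here as a simple stand-in for SHAP value computation.

```python
# Toy sentiment scorer over hypothetical lexicon weights.
LEXICON = {"great": 2.0, "boring": -2.0, "fun": 1.5, "dull": -1.5, "plot": 0.0}

def score(tokens):
    return sum(LEXICON.get(t, 0.0) for t in tokens)

def occlusion_attributions(tokens):
    """Attribution of each token = score drop when that token is removed."""
    base = score(tokens)
    return {i: base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))}

def counterfactual(tokens, antonyms):
    """Flip the most influential token to show how a small edit changes the prediction."""
    attr = occlusion_attributions(tokens)
    i = max(attr, key=lambda k: abs(attr[k]))
    flipped = list(tokens)
    flipped[i] = antonyms.get(tokens[i], tokens[i])
    return flipped

review = ["great", "plot", "but", "boring"]
cf = counterfactual(review, {"great": "dull", "boring": "fun"})  # flips the top-attribution word
```

The same loop structure applies when the scorer is a neural classifier, with the occlusion step replaced by a proper SHAP explainer.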
Generative AI, particularly Large Language Models (LLMs), has revolutionized human-computer interaction by enabling the generation of nuanced, human-like text. This presents new opportunities, especially in enhancing explainability for AI systems like recommender systems, a crucial factor for fostering user trust and engagement. LLM-powered AI chatbots can be leveraged to provide personalized explanations for recommendations. Although users often find these chatbot explanations helpful, they may not fully comprehend the content. Our research focuses on assessing how well users comprehend these explanations and identifying gaps in understanding. We also explore the key behavioral differences between users who effectively understand AI-generated explanations and those who do not. We designed a three-phase user study with 17 participants to explore these dynamics. The findings indicate that the clarity and usefulness of the explanations are contingent on the user asking relevant follow-up questions and having a motivation to learn. Comprehension also varies significantly based on users’ educational backgrounds.
The increasing availability of user data on music streaming platforms opens up new possibilities for analyzing music consumption. However, understanding the evolution of user preferences remains a complex challenge, particularly as their musical tastes change over time. This paper uses the dictionary learning paradigm to model user trajectories across different musical genres. We define a new framework that captures recurring patterns in genre trajectories, called pathlets, enabling the creation of comprehensible trajectory embeddings. We show that pathlet learning reveals relevant listening patterns that can be analyzed both qualitatively and quantitatively. This work improves our understanding of users’ interactions with music and opens up avenues of research into user behavior and fostering diversity in recommender systems. A dataset of 2000 user histories tagged by genre over 17 months, supplied by Deezer (a leading music streaming company), is also released with the code.
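The pathlet idea above can be sketched minimally: given a dictionary of genre subsequences, a listening history is embedded as occurrence counts over that dictionary. The genre labels and hand-picked dictionary below are illustrative assumptions; in the paper the dictionary is learned from data rather than fixed by hand.

```python
from collections import Counter

def pathlet_embedding(trajectory, pathlets):
    """Embed a genre trajectory as occurrence counts over a pathlet dictionary."""
    counts = Counter()
    for p in pathlets:
        k = len(p)
        counts[p] = sum(
            1 for i in range(len(trajectory) - k + 1)
            if tuple(trajectory[i:i + k]) == p
        )
    return [counts[p] for p in pathlets]

# A short listening history and a hypothetical two-genre dictionary:
history = ["rock", "rock", "jazz", "rock", "jazz"]
dictionary = [("rock", "jazz"), ("rock", "rock"), ("jazz", "rock")]
embedding = pathlet_embedding(history, dictionary)
```

Because each coordinate of the embedding corresponds to a named subsequence, the resulting trajectory representation stays directly interpretable.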
The PHaSE project promotes healthier and more sustainable eating habits through the integration of recommender systems and conversational agents. It aims to enhance users’ awareness and understanding of responsible eating by enabling them to explore the healthiness of recipes, identify more sustainable ingredient substitutes, and receive personalized advice.
Recommender systems (RS) have gained widespread adoption in digital platforms across a variety of domains. However, how these systems function, and how users might adjust them, typically remains opaque. When users perceive a lack of ability to control or personalize the system, this can lead to a loss of trust and lower perceived usefulness of the RS. In this study, we explore users’ perceptions of control over YouTube and the strategies they employ to exercise agency. Using a thematic analysis of 200 discussion threads from Reddit, we examine how users exercise agency, drawing on self-reported user experiences with YouTube’s recommender system. Our findings provide insights into users’ understanding of the various control mechanisms and their ability to align the system with their personal preferences.
Accurately modeling user preferences is vital not only for improving recommendation performance but also for enhancing transparency in recommender systems. Conventional user-profiling methods—such as averaging item embeddings—often overlook the evolving, nuanced nature of user interests, particularly the interplay between short-term and long-term preferences. In this work, we leverage large language models (LLMs) to generate natural language summaries of users’ interaction histories, distinguishing recent behaviors from more persistent tendencies. Our framework not only models temporal user preferences but also produces natural language profiles that can be used to explain recommendations in an interpretable manner. These textual profiles are encoded via a pre-trained model, and an attention mechanism dynamically fuses the short-term and long-term embeddings into a comprehensive user representation. Beyond boosting recommendation accuracy over multiple baselines, our approach naturally supports explainability: the interpretable text summaries and attention weights can be exposed to end users, offering insights into why specific items are suggested. Experiments on real-world datasets underscore both the performance gains and the promise of generating clearer, more transparent justifications for content-based recommendations.
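The attention-based fusion of short- and long-term profile embeddings described above can be sketched as follows. The 2-d embeddings and the query vector are illustrative assumptions, not the paper's architecture; in practice the embeddings come from encoding the LLM-generated textual profiles.

```python
import numpy as np

def fuse_profiles(short_emb, long_emb, query):
    """Softmax-attention fusion of short- and long-term profile embeddings."""
    embs = np.stack([short_emb, long_emb])      # (2, d)
    logits = embs @ query                       # relevance of each profile to the query
    weights = np.exp(logits - logits.max())
    weights = weights / weights.sum()           # attention weights over the two profiles
    return weights @ embs, weights              # fused (d,) user representation + weights

# Toy example: the query aligns more with the short-term profile.
fused, w = fuse_profiles(np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

Exposing the two attention weights alongside the textual profiles is what makes the fused representation explainable: they indicate how much a recommendation leaned on recent versus persistent interests.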
Understanding how failures occur and how they can be corrected is essential for debugging, maintaining user trust, and developing personalised policies. Counterfactual sequences, which provide an alternative sequence of actions that delivers an improved outcome, have been used to correct failure in sequential decision-making tasks. However, prior work on counterfactual sequences has focused primarily on the algorithmic side of sequence generation, mostly overlooking the potential of counterfactuals as a user-friendly explanation method. In this work, we lay the groundwork for human-centred and personalised counterfactual sequence generation. Informed by insights from psychology and cognitive science, we propose a set of desiderata for understandable and useful counterfactual sequences. We then introduce an algorithm based on these desiderata that generates diverse counterfactual sequences, enabling the user to correct the failure in line with their preferences.
AI assistance can be dynamically adapted to persuade users to build reliance on AI systems. Personalizing AI assistance based on users’ latent traits and real-time behavior can also improve human-AI collaborative decision-making. However, there is limited exploration in the literature on personalizing AI assistance to user traits and behavior. Understanding how users engage and interact with personalized explanations from the lens of reducing over-reliance is also underexplored. In this position paper, we present a rationale for personalized and persuasive interventions to build appropriate reliance and enhance user engagement with AI assistance. We examine the current literature and argue that user-centric persuasion and engagement improve analytical system evaluation and foster reliance on AI assistance. Considering persuasive and personalized AI assistance, we posit a study design for user-centered engagement to improve appropriate reliance.
As generative AI systems become increasingly integrated into real-world applications, the need to analyze and interpret their outputs grows in importance. This paper addresses the challenge of assessing whether generative outputs exhibit specific characteristics—such as toxicity, a certain sentiment, or bias. We borrow a concept from the traditional Explainable AI literature, counterfactual explanations, but argue that it needs to be significantly rethought. We propose a flexible framework that extends counterfactual explanations to non-deterministic generative AI systems, specifically in scenarios where downstream classifiers can reveal characteristics of their outputs.
When users with different preferences and entitlements compete for limited access to a shared resource, maximizing total utility can lead to significant disparities in how users are served. We examine this tension in the context of allocating satellite observation windows, where users differ in their willingness to pay or their contribution to the system. The goal is to schedule observations efficiently while promoting balanced, fair access among users. This challenge is amplified in settings where coordination is decentralized and users negotiate outcomes without a central authority. We propose a hybrid algorithm designed to balance fairness and efficiency in distributed scheduling. Our method produces allocations that retain high efficiency while reducing inequality. Although developed for satellite scheduling, the algorithm applies more broadly to decentralized systems where users with heterogeneous preferences share limited resources.
Family caregivers play a vital role in supporting children with chronic health conditions, such as neonates diagnosed with hypoxic-ischemic encephalopathy (HIE). However, navigating complex medical information can be overwhelming due to the quantity and quality of available literature. This study leverages Retrieval-Augmented Generation (RAG)-based Large Language Models (LLMs) to develop a chatbot that integrates peer-reviewed scientific literature and provides personalized, simplified summaries for caregivers. A user study involving six caregivers and five healthcare providers demonstrated the chatbot’s ability to enhance clarity, improve comprehension, and deliver essential medical information concisely. Our findings highlight the potential of RAG-based LLMs to enhance caregivers’ health literacy and support their information-seeking behavior, while also underscoring the importance of thoughtfully navigating the differing expectations of caregivers and healthcare providers regarding the type, depth, and presentation of medical information.
Hate speech is often subtle and context-dependent, making it especially difficult to detect when interpretation requires contextual familiarity with the targeted group. Exposure to hate speech and toxic content can lead to significant psychological harm, including increased stress and anxiety levels, and content moderators are particularly vulnerable due to their sustained exposure to such harmful material. This work explores the role of personalization in content moderation by examining how alignment between a moderator’s background and the targeted group affects emotional and cognitive responses. We propose a target substitution method that replaces references to real communities in hate speech with fictional characters, aiming to reduce emotional distress while preserving the semantic integrity necessary for accurate moderation. Through both automated and human evaluations, we find that substitution significantly reduces emotional distress across all groups with a trade-off in accuracy. Moreover, we observe that moderators demonstrate higher accuracy when moderating content aligned with their own demographic background, even after substitution. This suggests the key role of contextual familiarity in interpreting implicit hate. Additionally, our study highlights the cumulative impact of prolonged exposure to hate speech, showing that moderators experience increased emotional distress over time, particularly in non-targeted scenarios. Despite this, target substitution consistently mitigates distress while maintaining moderation efficacy.
The nationwide surge of Black Lives Matter (BLM) protests that followed George Floyd's murder in May 2020 offers a rare, large-scale natural experiment for interrogating fairness in information access. We combine three complementary data sources: (1) an original U.S. survey linking political attitudes to self-reported queries; (2) state-level Google Trends signals; and (3) a collection of 1,500 Google Search ranked URLs elicited with attitude-conditioned queries, to trace how user intent and algorithmic ranking jointly shape what people see when they “search for BLM”. The analyses reveal three fairness-relevant patterns. First, survey respondents who opposed BLM reported different queries (e.g., “protester violence”) than supporters (“equality”), indicating that query formulation is shaped by political stance. Second, aggregate Trends data showed that BLM-supportive states generated more search traffic for BLM-affirming queries than states with lower support, indicating politically slanted collective search interest. Third, result-page audits found a slight left-of-center domain bias, even for anti-BLM queries, while topic modeling showed subtly distinct content framings depending on the queries’ stance. Taken together, the study provides empirical evidence that can inform fairness interventions and design implications for adaptive systems to anticipate and counteract ideologically skewed information pathways.
Recommender systems play an essential role in connecting users with items. Traditionally, research in this field has focused on refining recommendation algorithms within monolithic systems that reside in a single platform. We are exploring alternative architectures in which users have a choice over recommendation algorithms. In this work, we use simulation grounded in real-world data to explore the impact of such alternative designs on recommendation stakeholders. We show that consumers of niche items and producers of such items can both benefit from algorithmic choice.
Data is an essential resource for studying recommender systems. While there has been significant work on improving and evaluating state-of-the-art models and measuring various properties of recommender system outputs, less attention has been given to the data itself, particularly how data has changed over time. Such documentation and analysis provide guidance and context for designing and evaluating recommender systems, particularly for evaluation designs making use of time (e.g., temporal splitting). In this paper, we present a temporal explanatory analysis of the UCSD Book Graph dataset scraped from Goodreads, a social reading and recommendation platform active since 2006. We measure the book interaction data using a set of activity, diversity, and fairness metrics; we then train a set of collaborative filtering algorithms on rolling training windows to observe how the same measures evolve over time in the recommendations. Additionally, we explore whether the introduction of algorithmic recommendations in 2011 was followed by observable changes in user or recommender system behavior.
Generative AI tools (GAITs) fundamentally differ from traditional machine learning tools in that they allow users to provide as much or as little information as they choose in their inputs. This flexibility often leads users to omit certain details, relying on the GAIT to infer and fill in less critical information based on distributional knowledge of user preferences. Inferences about preferences lead to natural questions about fairness, since a GAIT’s “best guess” may skew towards the preferences of larger groups at the expense of smaller ones. Unlike more traditional recommender systems, GAITs can acquire additional information about a user’s preferences through feedback or by explicitly soliciting it. This creates an interesting communication challenge: the user is aware of their specific preference, while the GAIT has knowledge of the overall distribution of preferences, and both parties can only exchange a limited amount of information. In this work, we present a mathematical model to describe human-AI co-creation of content under information asymmetry. Our results suggest that GAITs can use distributional information about overall preferences to determine the “right” questions to ask to maximize both welfare and fairness, opening up a rich design space in human-AI collaboration.
As Artificial Intelligence (AI) systems are increasingly integrated into high-stakes domains, the demand for transparency has become paramount. The opacity of "black-box" models poses significant challenges in trust, fairness, and accountability. Explainable AI (XAI) is a vital approach for addressing these concerns by enabling transparency, fostering trust, and ensuring ethical deployment across various sectors, including healthcare, human resources, finance, autonomous systems, and more. This paper explores how XAI methods can be used throughout the AI lifecycle for creating human-centered, ethical, and responsible AI systems by enhancing transparency, reducing bias, and protecting data privacy. Furthermore, the paper introduces XAI4RE, a theoretical framework that links XAI principles and purposes to concrete stages of the AI lifecycle, demonstrating how to address ethical considerations effectively. This approach involves engaging different stakeholders, such as developers, regulators, and users, at each stage. The framework highlights the critical role of XAI in promoting fairness, accountability, and human-centric design using general guidelines that discuss the relevant insights that can be drawn from XAI at each lifecycle stage. Ultimately, this paper underscores the importance of XAI in bridging the gap between technical advancements and ethical AI practices to foster societal trust and responsible systems.
Group Recommender Systems (GRSys) are designed to recommend items that address the needs of groups of people. Compared to individual users, groups are dynamic entities where interpersonal relationships, group dynamics, emotional contagion, etc., substantially affect the group’s needs. Nevertheless, these characteristics are often poorly defined or overlooked in system modeling. The fourth GMAP workshop brought together a community of scholars focused on group modeling, adaptation, and personalization. The event was dedicated to exploring the challenges and opportunities of supporting collective decision-making, fostering interdisciplinary dialogue, and forging new collaborations. The four presented papers covered a diverse range of topics: (i) an exploratory analysis of LLM applications to group meeting transcripts, (ii) an extensive review of the growing methodological divide in group recommender systems, (iii) a novel application of group modeling for personalizing public displays, and (iv) a detailed examination of prompt design for group recommendations using LLMs.
Group recommender systems (GRSys) focus on the challenges of recommending to groups of users with possibly contradicting needs and preferences. Methodologically, we distinguish between approaches aiming to aggregate preferences of group members and aggregating per-user recommendations. In early GRSys research, this methodological duality did not affect the connected research objectives and evaluation methodology much. However, nowadays, we witness a gradual rift in the research induced by both algorithm classes. In this work, based on a survey of 110 recent GRSys papers, we aim to quantify this rift along several aspects, including involved communities, evaluation datasets, objectives, and baselines. We showcase how little both subtrees have in common nowadays and discuss missed opportunities this rift causes. In conclusion, we also highlight novel research avenues that may contribute towards bridging the rift to the benefit of both research areas.
Large interactive displays in semi-public areas are shared by diverse users whose cultural backgrounds influence how they perceive user interfaces. Personalizing such interfaces with cultural differences in mind requires aggregating individual cultural user models (based on Hofstede’s cultural dimensions) into a group profile. This paper investigates the applicability of group modeling strategies for this purpose as a preliminary exploration, addressing the unique characteristics of cultural dimension values, which differ from traditional numeric ratings. Using an example dataset representing an intercultural group, strategies were identified to rank cultural dimensions and aggregate their values. Borda count produced the clearest ranking, while average without misery and fairness emerged as promising value aggregation strategies. These findings demonstrate how group modeling can support intercultural personalization of shared interfaces and extend the use of these strategies to other types of preference values.
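The two strategies highlighted above can be stated in a few lines. The dimension names, profile values, and misery threshold below are hypothetical; the sketch only illustrates how Borda count ranks dimensions and how average-without-misery aggregates their values.

```python
def borda_rank(profiles):
    """Rank cultural dimensions by Borda count over each member's own ordering."""
    dims = list(profiles[0])
    scores = {d: 0 for d in dims}
    for p in profiles:
        for points, d in enumerate(sorted(dims, key=lambda dim: p[dim])):
            scores[d] += points                 # higher personal value -> more points
    return sorted(dims, key=lambda d: -scores[d])

def average_without_misery(values, threshold):
    """Average the members' values, unless any member falls below the misery threshold."""
    if min(values) < threshold:
        return None                             # excluded: some member would be 'miserable'
    return sum(values) / len(values)

# Hypothetical two-member group over three Hofstede dimensions:
group = [{"PDI": 80, "IDV": 20, "MAS": 50}, {"PDI": 70, "IDV": 10, "MAS": 40}]
ranking = borda_rank(group)
```

Unlike ratings of items, Hofstede dimension values are not preferences to maximize, which is why ranking and value aggregation are treated as separate steps here.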
Large Language Models (LLMs) are increasingly applied in recommender systems aimed at both individuals and groups. Previously, Group Recommender Systems (GRS) often used social choice-based aggregation strategies to derive a single recommendation based on the preferences of multiple people. In this paper, we investigate under which conditions language models can perform these strategies correctly based on zero-shot learning and analyse whether the formatting of the group scenario in the prompt affects accuracy. We specifically focused on the impact of group complexity (number of users and items), different LLMs, different prompting conditions, including in-context learning and generating explanations, and the formatting of group preferences. Our results show that performance starts to deteriorate when considering more than 100 ratings. However, not all language models were equally sensitive to growing group complexity. Additionally, we showed that In-Context Learning (ICL) can significantly increase performance at higher degrees of group complexity, while adding other prompt modifications, such as specifying domain cues or prompting for explanations, did not impact accuracy. We conclude that future research should include group complexity as a factor in GRS evaluation due to its effect on LLM performance. Furthermore, we showed that formatting the group scenarios differently, such as rating lists per user or per item, affected accuracy. All in all, our study implies that smaller LLMs are capable of generating group recommendations under the right conditions, making the case for using smaller models that require less computing power and costs.
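The social-choice strategies the LLMs are asked to reproduce have simple reference implementations against which model outputs can be checked. Two common examples are sketched below; the item names and ratings are hypothetical.

```python
def additive(group_ratings):
    """Additive strategy: recommend the item with the highest summed rating."""
    items = group_ratings[0].keys()
    return max(items, key=lambda i: sum(user[i] for user in group_ratings))

def least_misery(group_ratings):
    """Least-misery strategy: recommend the item with the highest minimum rating."""
    items = group_ratings[0].keys()
    return max(items, key=lambda i: min(user[i] for user in group_ratings))

# Hypothetical group of three users rating two items: the strategies disagree,
# since "a" has the higher sum (10 vs. 9) but "b" has the higher minimum (3 vs. 0).
group = [{"a": 5, "b": 3}, {"a": 5, "b": 3}, {"a": 0, "b": 3}]
```

Having such deterministic ground truth is what makes it possible to score an LLM's zero-shot aggregation as correct or incorrect.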
In many situations, groups of people need to collaborate to achieve a shared goal or solve a common problem. However, reaching an agreement can be inefficient, and critical perspectives can be overlooked during lengthy discussions. To improve this situation, this paper introduces a practical approach that uses an LLM to analyze recorded group discussions and provide informed recommendations. It analyzes meeting transcripts to identify discussed options, summarize outcomes, track decision dynamics, and generate helpful recommendations. This automation could save time, enhance transparency, and improve productivity. Through real-world case studies, we evaluate the approach to explore the strengths and limitations of using LLMs to support group decision-making.
Hybrid AI, which integrates symbolic and sub-symbolic methods, has emerged as a promising paradigm for advancing human-centric personalization. By combining machine learning with structured knowledge representations, hybrid AI enables interpretable and adaptive user models that account for human factors such as biases, mental models, and affective states. The HyPer workshop provides a venue to discuss how hybrid AI approaches, combining neural architectures, symbolic representations, and cognitive/behavioral frameworks, can bridge the gap between explainability, cognitive modeling, and automated adaptation to user preferences.
AI assistance is increasingly used to improve human-AI collaborative decision-making. However, how domain experts integrate their knowledge with grounded constraints and formulate intent with AI systems remains underexplored. In this position paper, we argue for “cognitively aligned” AI assistance, where users engage interactively with symbolic (logic-based) and sub-symbolic AI to interpret, influence, and co-construct decisions. Through this lens, we believe that users can build effective reliance on AI assistance, iteratively anchoring their domain knowledge to adapt their mental models and AI assistance. We explore the current literature and emphasize the need for cognitive (analytical) engagement with AI assistance to improve semantic alignment and interactive affordances for domain experts. We outline a plan for a research study that explores users’ interaction with AI assistance and quantitative reasoning in business decision-making.
As recommender systems become increasingly complex, transparency is essential to increase user trust, accountability, and regulatory compliance. Neuro-symbolic approaches that integrate symbolic reasoning with sub-symbolic learning offer a promising approach toward transparent and user-centric systems. In this work-in-progress, we investigate using fuzzy neural networks (FNNs) as a neuro-symbolic approach for recommendations that learn logic-based rules over predefined, human-readable atoms. Each rule corresponds to a fuzzy logic expression, making the recommender’s decision process inherently transparent. In contrast to black-box machine learning methods, our approach reveals the reasoning behind a recommendation while maintaining competitive performance. We evaluate our method on a synthetic dataset and the MovieLens 1M dataset and compare it to state-of-the-art recommendation algorithms. Our results demonstrate that our approach accurately captures user behavior while providing a transparent decision-making process. Finally, the differentiable nature of this approach facilitates an integration with other neural models, enabling the development of hybrid, transparent recommender systems.
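The rule-based transparency described above can be illustrated with a minimal fuzzy-rule scorer. The atoms, the two rules, and the product t-norm choice are illustrative assumptions, not the trained FNN; in the actual approach the rules are learned rather than hand-written.

```python
def fuzzy_and(*memberships):
    """Product t-norm as fuzzy AND over atom membership degrees in [0, 1]."""
    out = 1.0
    for m in memberships:
        out *= m
    return out

def recommend_score(atoms):
    """Score an item via two fuzzy rules; per-rule firing strengths are the explanation."""
    # IF likes_drama AND recent_activity THEN recommend
    r1 = fuzzy_and(atoms["likes_drama"], atoms["recent_activity"])
    # IF likes_comedy AND weekend THEN recommend
    r2 = fuzzy_and(atoms["likes_comedy"], atoms["weekend"])
    return max(r1, r2), {"drama_rule": r1, "comedy_rule": r2}

score, firings = recommend_score(
    {"likes_drama": 0.8, "recent_activity": 0.5, "likes_comedy": 0.9, "weekend": 0.2}
)
```

Because each rule is a readable logical expression and its firing strength is exposed, the score itself carries its explanation, which is the property the abstract claims for the FNN approach.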
Recommender systems often rely on sub-symbolic machine learning approaches that operate as opaque black boxes. These approaches typically fail to account for the cognitive processes that shape user preferences and decision-making. In this vision paper, we propose a hybrid user modeling framework based on the cognitive architecture ACT-R that integrates symbolic and sub-symbolic representations of human memory. Our goal is to combine ACT-R’s declarative memory, which is responsible for storing symbolic chunks along with their sub-symbolic activations, with its procedural memory, which contains symbolic production rules. This integration will help simulate how users retrieve past experiences and apply decision-making strategies. With this approach, we aim to provide more transparent recommendations, enable rule-based explanations, and facilitate the modeling of cognitive biases. We argue that our approach has the potential to inform the design of a new generation of human-centered, psychology-informed recommender systems.
Human motion cannot be fully modeled by subsymbolic representations. While these extract precise hidden patterns in motion data, they are often task-specific and lack a semantic understanding of motion. Symbolic systems that mirror human cognition and explicit expressive processes are necessary for richer motion synthesis and analysis, enabling physical reasoning and expert knowledge encoding. In this work, we propose a neurosymbolic framework that combines Labanotation and Laban Movement Analysis (LMA), originally developed for dance, to represent and analyze human motion symbolically. We expand the existing LabanEditor to support full-body annotation and integrate it with AMASS, Mediapipe, and Kinect inputs through a SMPL-based format. Our system supports automatic annotation for the local functional and expressive aspects of motion, and enables bidirectional conversion between symbols and motion. While still a work in progress, this framework lays the groundwork for explainable, expressive motion modeling that can support human-robot interaction, motion preservation, and psychomotor learning systems.
Large Language Models (LLMs) represent a landmark achievement in Artificial Intelligence (AI), demonstrating unprecedented proficiency in procedural tasks such as text generation, code completion, and conversational coherence. These capabilities stem from their architecture, which mirrors human procedural memory—the brain’s ability to automate repetitive, pattern-driven tasks through practice. However, as LLMs are increasingly deployed in real-world applications, their limitations when operating in complex, unpredictable environments become impossible to ignore. This paper argues that LLMs, while transformative, are fundamentally constrained by their reliance on procedural memory. To create agents capable of navigating “wicked” learning environments—where rules shift, feedback is ambiguous, and novelty is the norm—we must augment LLMs with semantic memory and associative learning systems. By adopting a modular architecture that decouples these cognitive functions, we can bridge the gap between narrow procedural expertise and the adaptive intelligence required for real-world problem-solving.
This study investigated how displaying AI confidence levels affects user trust and effectiveness in decision-making contexts. Current chatbot interfaces lack transparency in response reliability, which can lead to misguided trust in AI-generated content. We addressed this limitation with a confidence rating interface that visually communicates model certainty and provides prompt improvement suggestions. We conducted a between-subjects study (n=20) comparing a standard chatbot interface with a confidence rating interface that displays three features: 1) confidence rating, 2) confidence factors, and 3) prompt improvement suggestions. Participants completed tasks representative of an enterprise setting: asking for travel planning suggestions, verifying facts about unfamiliar topics, multi-step time-zone problem solving, and decision making about a stock value. While the results did not reach statistical significance with this small sample size, they showed that the confidence rating interface tended to improve user effectiveness and confidence, particularly in tasks requiring verification or reasoning. Our findings suggest that combining confidence indicators with prompt suggestions could enhance information evaluation when working with AI systems, with implications for enterprise applications where trust is essential.
Automated Machine Learning (AutoML) has significantly advanced Machine Learning (ML) applications, including model compression, machine translation, and computer vision. Recommender Systems (RecSys) can be seen as an application of ML. Yet AutoML has received little attention from the RecSys community, and RecSys has not received notable attention from the AutoML community. Only a few relatively simple Automated Recommender Systems (AutoRecSys) libraries exist that adopt AutoML techniques. However, these libraries are based on student projects and do not offer the features and thorough development of AutoML libraries. We set out to determine how AutoML libraries perform in the scenario of an inexperienced user who wants to implement a recommender system. We compared the predictive performance of 60 AutoML, AutoRecSys, ML, and RecSys algorithms from 15 libraries, including a mean predictor baseline, on 14 explicit feedback RecSys datasets. We found that AutoML and AutoRecSys libraries performed best. AutoML libraries performed best for six of the 14 datasets (43%), but the same AutoML library did not always perform best. The single-best library was the AutoRecSys library Auto-Surprise, which performed best on five datasets (36%). On three datasets (21%), AutoML libraries performed poorly, and RecSys libraries with default parameters performed best. Although RecSys algorithms obtained 50% of all placements in the top five per dataset, they fell behind AutoML on average. ML algorithms generally performed the worst.
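The mean predictor mentioned above is the simplest possible baseline for explicit-feedback rating prediction: predict every unseen rating as the training-set mean. The sketch below illustrates it with tiny synthetic data (the ratings are invented for illustration, not from the study); any library that cannot beat this number on a dataset has effectively learned nothing.

```python
import numpy as np

# Synthetic explicit-feedback ratings, purely illustrative.
train = np.array([4.0, 5.0, 3.0, 4.0])   # observed ratings
test = np.array([5.0, 2.0, 4.0])         # held-out ratings

# Mean predictor: every prediction is the global training mean.
pred = np.full_like(test, train.mean())

# RMSE, the usual explicit-feedback error metric.
rmse = float(np.sqrt(np.mean((test - pred) ** 2)))
print(round(rmse, 3))
```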
Exploratory search begins without a specific goal, often leading to information overload and search fatigue as users attempt to understand, interpret, and retrieve information. While recent advances have enabled more sophisticated search agents, there remains a gap in understanding user behavior and cognitive processes during exploratory search in complex E-commerce environments, where users navigate through multiple information types while making rapid decisions. In this formative study (n=8), we collected browsing data and user feedback about search stages and information needs in a fashion e-commerce platform. Through qualitative analysis, we identified four distinct user orientations (brand, price, style, and popularity-driven) and mapped behaviors to specific exploratory search stages. These behavioral insights could inform future approaches to personalization that better align with users’ cognitive processes during different search stages, potentially contributing to the development of more human-centric systems for complex online shopping environments.
Large Language Models (LLMs) are transforming personalized services by enabling adaptive, context-aware recommendations and interactions. However, deploying these models at scale raises significant concerns about environmental impact, fairness, privacy, and trustworthiness, including high energy consumption, biased outputs, privacy breaches, and hallucinations. The LLM4Good workshop is a half-day workshop that addresses these challenges by fostering dialogue on sustainable and ethical approaches to LLM-based personalization. Participants will explore energy-efficient techniques, bias mitigation, privacy-preserving methods, and responsible deployment strategies. The workshop aligns with Sustainable Development Goals and Digital Humanism principles. It aims to guide the development of trustworthy, human-centric LLM systems that positively impact education, healthcare, and other domains.
This study evaluates how three prompting strategies (standard prompting, chain-of-thought (CoT) prompting, and informed CoT prompting) affect the performance of the GPT-J model in solving mathematical reasoning tasks from the GSM8K dataset. Using the full test set of 1,319 problems, we assess the model’s performance through accuracy, F1 score, BLEU, and ROUGE metrics. The findings suggest that while providing relevant context, such as math topics, can modestly enhance performance, the gains are limited. This underscores the importance of carefully designing prompts in adaptive systems and indicates that additional strategies may be necessary to achieve practical utility in educational applications.
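The three strategies differ only in how the prompt is assembled. The sketch below shows one plausible construction; the exact wording, the CoT trigger phrase, and the topic hint are assumptions rather than the study's actual prompts, and the question is a GSM8K-style problem.

```python
# A GSM8K-style question (illustrative).
question = ("Natalia sold clips to 48 of her friends in April, and then she "
            "sold half as many clips in May. How many clips did Natalia "
            "sell altogether in April and May?")

# 1) Standard prompting: the bare question.
standard = question

# 2) Chain-of-thought prompting: append a reasoning trigger.
cot = question + "\nLet's think step by step."

# 3) Informed CoT: additionally supply relevant context, e.g. the math topic.
topic = "arithmetic: multiplication and addition"
informed_cot = f"Topic: {topic}\n{question}\nLet's think step by step."

for name, prompt in [("standard", standard), ("cot", cot),
                     ("informed_cot", informed_cot)]:
    print(f"{name}: {len(prompt)} chars")
```

The modest gains reported above suggest that the informed variant's extra context helps the model frame the problem, but does not by itself supply the missing reasoning capacity.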
Tribal communities face unique challenges in disaster response, often lacking the resources and infrastructure to respond effectively to emergencies. This study explores the potential of generative Artificial Intelligence (AI) to enhance disaster response within these communities. We designed a multi-modality generative AI system for disaster assessment from user-generated photos and organized reports with community in-kind cost sharing. We introduced the system prototype at the 2024 National Congress of American Indians (NCAI) conference with emergency department professionals from diverse tribal nations and other stakeholders. Through a workshop-focused group discussion, we discussed the perceptions, ideas, and concerns around introducing generative AI technology to tribal communities to increase disaster resilience. Our findings suggest considerations for developing strategies and possible governance models when introducing LLM-based models to marginalized local communities with limited resources. This research contributes to the literature on the potential and limitations of AI in supporting disaster preparedness and response within indigenous communities, ultimately informing strategies for enhanced tribal disaster resilience and sustainable development goals.
Large Language Models (LLMs) offer scalable opportunities to personalize feedback in education, yet their trustworthiness and effectiveness remain underexplored. We present a study conducted in an introductory programming and data science course with approximately 1,400 first-year university students. A subset of these students received both peer and LLM-generated feedback on their individual programming projects. Our results show that 56% of students preferred the LLM feedback, and 52% could not reliably distinguish it from human-written feedback. Student ratings suggest that LLM feedback is perceived as helpful, constructive, and relevant, though it often lacks personalized depth and motivational nuance. These findings underline the potential of LLMs to support scalable, personalized education, while pointing to key areas for responsible improvement. Based on these insights, we outline a future roadmap for the course in which LLM-generated feedback supports students in their learning journey and also supports instructors by monitoring student performance and helping to allocate instructional resources more effectively. Given limited human resources, this approach enables personalized instructor feedback to be scaled to a large group of students.
Hacking poses a significant threat to cybersecurity, inflicting billions of dollars in damages annually. To mitigate these risks, ethical hacking, or penetration testing, is employed to identify vulnerabilities in systems and networks. Recent advancements in large language models (LLMs) have shown potential across various domains, including cybersecurity. However, there is currently no comprehensive, open, end-to-end penetration testing benchmark to drive progress and evaluate the capabilities of these models in security contexts. This paper introduces a novel open benchmark for LLM-based penetration testing, addressing this critical gap. We first evaluate the performance of LLMs, including GPT-4o and LLama 3.1-405B, using the state-of-the-art PentestGPT tool. Our findings reveal that while LLama 3.1 demonstrates an edge over GPT-4o, both models currently fall short of performing end-to-end penetration testing even with some minimal human assistance. Next, we advance the state-of-the-art and present ablation studies that provide insights into improving the PentestGPT tool. Our research illuminates the challenges LLMs face in each aspect of penetration testing, e.g., enumeration, exploitation, and privilege escalation. This work contributes to the growing body of knowledge on AI-assisted cybersecurity and lays the foundation for future research in automated penetration testing using large language models.
Interviews in social and health sciences are resource intensive and susceptible to interviewer bias, inconsistency, and variability across interviewers. Moreover, human-led interviews may inhibit participant openness, especially regarding sensitive topics, due to judgment, compromised anonymity, or discomfort in face-to-face interactions. These shortcomings limit the quality of the data collected. To this end, we propose the Embodied Conversational Interview Agentic Service (ELIAS). Informed by human-developed interview guides, ELIAS aims to streamline the interview process by combining an empathetic and bias-free embodied conversational interview agent with a semi-supervised content analysis and coding agent. We describe the development of the first version of ELIAS and also present results from a first evaluation study with five participants. We assessed the acceptance of and alliance with the embodied conversational interview agent. The evaluation shows positive perceptions and a strong alliance with the conversational agent. Suggestions for improvement will guide our future work.
School students need to make decisions about their career paths after graduating. In Germany, students can choose between more than 300 vocational training programs, which can be overwhelming. Frequently, students hesitate to talk with career counselors. The objective of this research is therefore to provide a recommendation system that supports school students’ decision-making based on their interests and delivers recommendations with explanations generated by an LLM. The system was developed with a social robot as the user interface to make it easy to use and appealing to the young target group. Based on user observations, preliminary findings indicate that the system is a valuable and engaging approach to supporting career counseling activities.
Following the success of previous editions, PATCH 2025 again serves as a meeting point at the intersection of cutting-edge cultural heritage research and personalized technologies. The workshop focuses on the use of ICT, both onsite and online, to enhance personal experiences in settings of natural and cultural heritage, with particular attention to ubiquitous and adaptive scenarios. PATCH 2025 brings researchers and practitioners from different disciplines together to explore how personalization and technology can enrich cultural heritage experiences. The workshop fosters the exchange of innovative ideas, encourages multidisciplinary dialogue, and aims to shape future research directions through collaboration. This summary provides an overview of the papers accepted for presentation and inclusion in the workshop proceedings, highlighting the latest advances and emerging trends in this dynamic field.
This paper investigates the application of a Multimodal Large Language Model to enhance visitor experiences in cultural heritage settings through Visual Question Answering (VQA) and Contextual Question Answering (CQA). We evaluate the zero-shot capabilities of LLaVA-7b (Large Language and Vision Assistant) on QA using the AQUA dataset. We assess how effectively it can answer questions about artwork, visual content, and contextual information through three experimental approaches. Our findings reveal that LLaVA demonstrates promising performance on visual questions, outperforming previous baselines but facing challenges with questions requiring contextual understanding. The selective knowledge integration approach showed the best overall performance, suggesting that an efficient knowledge retrieval system could further enhance performance. Moreover, we show how to exploit such models to provide correct personalized answers using a well-established visitor model.
Cultural Heritage is opening up from the professional community to a wider public, generating an increasing demand for culture and an associated economic turnaround. This step requires differentiating the behavior of Cultural Heritage systems to deal with a wide variety of backgrounds, expectations, contexts, aims, educational and cultural levels, preferences, and interests. Computer Science and Artificial Intelligence can play a key role in this landscape, fine-tuning the fruition of cultural items for every kind of stakeholder and even for single users. In this paper we present an approach to personalizing Cultural Heritage fruition based on Knowledge Graphs: we propose a way to describe user models and to use them for extracting personalized information, and we describe a platform that embeds this approach.
My Heritage Companion is a mobile-first framework that reimagines cultural heritage engagement through ethically adaptive, simulation-based personalization. The system enables users to upload personal visual artifacts—such as sketches, heirlooms, or travel photographs—which serve as entry points for AI-informed, persona-driven storytelling. Rather than relying on behavioral tracking or social media integration, it employs a cold-start personalization approach using rule-based persona modeling to deliver cognitively accessible, culturally contextualized narratives. The framework integrates four core modules: image ingestion, simulated AI-based visual matching, persona-driven narrative adaptation, and privacy safeguards guided by the FATE principles (Fairness, Accountability, Transparency, and Ethics). The current prototype simulates AI behavior using real heritage imagery from Failaka Island—an archaeological site of multi-era significance spanning the Dilmun, Hellenistic, and early Islamic periods. This simulation pipeline validates user experience logic, interface adaptability, and narrative delivery across five user personas. My Heritage Companion advances digital museology by supporting inclusive access to cultural heritage in privacy-sensitive, low-infrastructure contexts. It demonstrates how mobile-first systems can ethically bridge personal memory and public history through adaptive storytelling—empowering users to become co-creators of heritage experiences.
The cold start problem remains a major challenge in visual art recommendation, where limited user feedback often forces systems to rely on content-based filtering. While effective with sufficient data, content similarity-based recommendation can reinforce filter bubbles, narrowing user exposure to mainstream content. Popularity and diversity are both critical factors in recommendation systems, as they impact the visibility of niche items and overall user satisfaction. Yet, existing platforms often rely on popularity-centric algorithms that may discourage exploration and overshadow lesser-known items. To address this gap, our work investigates whether users’ preferences for popular and diverse recommendations remain stable over short recommendation sessions. We propose an interactive, user-adjustable mechanism allowing individuals to control the balance between mainstream and novel suggestions in real time. We implement this approach within a Web gallery recommendation interface. Through a user study, we examine changes in user behavior. Our findings suggest that while many users initially gravitate toward popular and diverse content, providing controls encourages later adjustments and exploratory behavior. This highlights the need for cultural institutions to move from a tightly managed centralized model to offering users greater affordances for managing the popularity and diversity of personalized recommendations.
Movement disciplines like dance or martial arts are carriers of cultural knowledge, identity, and tradition. However, reliance on oral traditions and video recordings leaves this knowledge susceptible to loss. Expert movement notation, in turn, holds the potential for precise capture and knowledge inheritance. However, motion notation approaches are not widespread, the process is often time-consuming, and the movements are hard to visualize without expert knowledge. In this work, we use Labanotation and Laban Movement Analysis (LMA), a notation system and method originally developed for dance, as a symbolic, interpretable framework for motion representation and preservation. Our contribution resides in the expansion of an existing annotation system, the LabanEditor, to handle full-body motion and data from multiple sources, and to support the work of experts in annotating the movements. Our development, called MoRTELaban, supports motion-to-notation and inverse mapping from notation to keyframes, enabling exchange between video, motion capture, and Labanotation formats. This allows for the documentation and reconstruction of traditional motion practices using expert-readable scores and 3D skeletons.
User models can be enhanced with context-aware models of preferred avatar configurations. These models could be initialized by a set of rules connected to user personality to mitigate the cold start problem. This joint model is well suited to cultural heritage applications. This short paper explores the idea and discusses possible methods of evaluation.
Wearable Devices (WDs), such as smartwatches and fitness trackers, continuously produce extensive data streams that reveal valuable information about physiological states, activity patterns, and user interactions. These devices enable the construction of advanced user models, offering dynamic insights into personal routines, health trends, and behavioural tendencies. Meanwhile, Brain-Computer Interfaces (BCIs) are emerging as a transformative technology, capturing neural activity to provide unprecedented access to cognitive and emotional states. Although BCIs are not conventionally classified as wearables, the latest technological advancements have reduced their size to resemble everyday accessories like earphones, suggesting their potential integration into wearable formats in the near future.
Despite the promise of these technologies, the full exploitation of their data for user modelling and personalization—such as optimizing activities like media consumption or interaction design—remains underexplored. The convergence of WDs and BCIs opens up new avenues for understanding the complexity of human behaviour and preferences, and this potential is amplified by the integration of Large Language Models (LLMs). By synthesizing and interpreting multimodal datasets, LLMs can better understand the intricate interplay between physiological, cognitive, and behavioural signals, ultimately enriching user modelling processes.
Following the success of the first edition, this workshop seeks to delve into the deep impact of combining data from wearable devices, neural interfaces, and advanced machine learning models. Participants will explore the opportunities and challenges that arise in this innovative context, examining how these technologies can be harnessed to enhance the granularity and accuracy of user models. The discussions will also address practical implications, such as ethical considerations and the necessity of privacy-aware approaches when dealing with highly sensitive physiological and neural data.
Through collaborative exchanges, the initiative aspires to chart new directions in the field, fostering novel research trajectories and interdisciplinary partnerships. The interplay of WDs, BCIs, and LLMs can redefine user modelling by creating systems that dynamically adapt to individual needs and behaviours, paving the way for transformative advancements in personalized experiences. By drawing on cutting-edge research and practical expertise, the workshop aims to inspire innovative solutions that capitalize on these emerging synergies, advancing the boundaries of what is possible in user modelling and adaptive systems.
Sound synthesis plays a central role in the compositional process in the academic electroacoustic domain. This research investigates and integrates the use of generative artificial intelligence models as tools for sound synthesis. Three models, SynthIo, MusicLM, and MusicGen, were used in this study. Ten prompts were designed and tested with the aim of generating sound textures. To assess the consistency of the generated samples, three expert electroacoustic music composers evaluated the samples against specific requirements.
Emotion recognition methods using Artificial Intelligence (AI) and wearable/wireless Electroencephalography (wEEG) are promising, as wEEG signals effectively and conveniently capture brain activities related to emotions. However, conventional AI models require separate development for each wEEG channel configuration, limiting adaptability and increasing costs. To address this gap, this paper proposes a framework that leverages text embedding models to transform wEEG signals into a standardised representation, allowing different wEEG channel setups to be compatible with a single AI model. This approach enhances scalability, adaptability, and resource efficiency, making AI-driven emotion recognition more cost-effective and accessible. Our proposed method achieves accuracies of 0.9368 with snowflake-arctic-embed-l-v2.0 using 2-second epoching and 0.9484 with multilingual-e5-large-instruct using 5-second epoching. The method can be applied across various wEEG channel configurations to support tasks that improve or explore human well-being, such as stress monitoring or emotion self-regulation.
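The key step in such a framework is serialising an epoch into text so that headsets with different channel counts land in the same representation space. The sketch below is an assumed, minimal version of that idea: the channel names, the per-channel features (mean and standard deviation), and the output format are illustrative choices, not the paper's actual serialisation.

```python
import numpy as np

def eeg_epoch_to_text(epoch, channel_names):
    """Serialise a wEEG epoch (channels x samples) into a channel-agnostic
    textual description that any text embedding model can consume.
    The feature set (per-channel mean/std) is an illustrative assumption."""
    lines = []
    for name, signal in zip(channel_names, epoch):
        lines.append(
            f"channel {name}: mean={np.mean(signal):.2f}uV, "
            f"std={np.std(signal):.2f}uV"
        )
    return "; ".join(lines)

rng = np.random.default_rng(0)
# Two different headset configurations map into the same text space,
# so a single downstream classifier can serve both.
text_2ch = eeg_epoch_to_text(rng.normal(size=(2, 256)), ["TP9", "TP10"])
text_4ch = eeg_epoch_to_text(rng.normal(size=(4, 256)), ["TP9", "AF7", "AF8", "TP10"])
print(text_2ch)
```

The resulting strings would then be fed to an embedding model such as snowflake-arctic-embed-l-v2.0, and the embeddings to a single shared emotion classifier.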
In the electronic musical instrument scenario, the current paradigm of sound modification during live performance is predominantly based on the use of external control mechanisms to adjust sound configurations predefined by the performer. However, this approach is limited by the introduction of marginal latencies during the transition between sound configurations. To overcome these limitations, this study introduces a novel application of Brain-Computer Interface (BCI) technology in a control system environment for musical instruments during live performances. The proposed system exploits classification of mental states of activation and relaxation, employing a Machine Learning (ML) system that achieves an average accuracy of 0.92. Using the Beta Protocol, the system allows dynamic modulation of sound according to the mental state of the performer. Finally, an explainability analysis was performed to clarify the impact of specific features during the prediction process.
Wearable devices are revolutionizing smart computing in healthcare. In particular, electroencephalography (EEG) devices (e.g., electrode cap bundles, headbands) are currently enabling many healthcare applications that require real-time monitoring of brain electrical activity. Examples of those applications include: epilepsy diagnosis, sleep disorder diagnosis, tumor detection, autonomous navigation (e.g., to control wheelchairs), and stress reduction. In many of these applications, the use of clinical-grade EEG devices may not be feasible because of factors such as high cost, privacy concerns, and inconvenience. In this paper, we used the Granger causality test to study whether consumer-grade EEG devices can detect levels of cognitive stress that can reliably be shown to cause changes in vital signs such as blood volume pulse (BVP), electrodermal activity (EDA), and body temperature. Based on the obtained results, we were able to validate the viability of using consumer-grade wearable devices to build applications for stress monitoring and reduction without the need for advanced, expensive EEG devices.
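The Granger test used above asks whether past values of one signal improve the prediction of another beyond the latter's own past. The following numpy sketch computes only the core F statistic from nested least-squares regressions, with synthetic signals standing in for the EEG-derived stress measure and BVP; it omits the p-value and lag selection of a full test (in practice one would use statsmodels' `grangercausalitytests`).

```python
import numpy as np

def granger_f(y, x, lag=2):
    """F statistic for 'x Granger-causes y': compare an AR(lag) model of y
    (restricted) against one that also includes lagged x (unrestricted).
    A minimal numpy sketch, not a complete statistical test."""
    n = len(y) - lag
    Y = y[lag:]
    # Lagged regressor columns: y_{t-1..t-lag}, plus an intercept.
    Xr = np.column_stack([np.ones(n)] +
                         [y[lag - k:-k or None] for k in range(1, lag + 1)])
    # Unrestricted model additionally includes x_{t-1..t-lag}.
    Xu = np.column_stack([Xr] +
                         [x[lag - k:-k or None] for k in range(1, lag + 1)])
    rss = lambda X: np.sum((Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]) ** 2)
    rss_r, rss_u = rss(Xr), rss(Xu)
    return ((rss_r - rss_u) / lag) / (rss_u / (n - Xu.shape[1]))

rng = np.random.default_rng(1)
stress = rng.normal(size=500)                          # synthetic stress signal
bvp = np.roll(stress, 2) + 0.1 * rng.normal(size=500)  # BVP lags stress by 2 steps
# Large F in the stress -> bvp direction, small F in the reverse direction.
print(granger_f(bvp, stress) > granger_f(stress, bvp))  # -> True
```

Comparing the F statistics in both directions is what licenses a causal-direction claim such as "cognitive stress drives changes in BVP" rather than mere correlation.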
Sonify Collective Human Intelligence (SCHI) is a musical installation that explores collective intelligence's cooperative functions by transforming human interaction into a multisensory experience. Biometric sensors capture real-time physiological variations within a group, translating psycho-emotional shifts into sound and visual design. This interaction creates a data flow that translates emotional state changes into real-time sound textures generated and processed by live electronics. An improvising soloist interacts with this evolving sound design, forming a self-regenerating sound cycle responsive to collective emotions. This research contributes an unexplored framework for integrating biometric data into artistic expression, demonstrating biofeedback's potential in collaborative, emotion-driven interaction, bridging psychology, music technology, and HCI.
Artificial intelligence (AI) is increasingly embedded in rehabilitation technologies designed for children with developmental disorders, offering new opportunities for personalised, adaptive care. However, the translation of these systems from lab to clinic is often decisively shaped by regulatory frameworks such as the EU General Data Protection Regulation (GDPR), the Medical Device Regulation (MDR), and the forthcoming AI Act. This position paper explores how these three regulatory pillars influence the ethical deployment of AI and the design and innovation process behind pediatric rehabilitation tools. Drawing from recent literature and ongoing policy developments, we argue that GDPR, MDR, and the AI Act should not be viewed merely as compliance hurdles but as co-design forces that enable trustworthy, interpretable, and clinically viable AI. We propose a forward-looking framework to align innovation with regulation to facilitate the safe and effective implementation of AI-powered rehabilitation in child-centred healthcare.