During online browsing, e.g., when looking to select a movie to watch, we are often confronted with multiple rejection-selection steps, which can lead to tens or hundreds of decisions made in quick succession. It is unclear whether showing the next “best” item, as standard recommenders often do, is the most efficient way to help users select an item. In this work, we show that we can reduce the number of decisions to selection with a reinforcement learning-based Decision Minimizer Network (DMN). By implementing a step-aware reward function, we penalize long sequences, so that fewer decisions have to be made by humans. Using a task in which users select a movie to watch, we show that we can reduce the number of decisions to selection by 39% compared to heuristic strategies and by 20% compared to a standard recommender, while increasing user selection satisfaction. Minimizing the number of decision steps can also help reduce decision fatigue, i.e., the deteriorating quality of decisions made by an individual after a long session of decision steps, and help prevent infinite scrolling.
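A step-aware reward of the kind described could be sketched as follows; the concrete reward value and per-step penalty are illustrative assumptions, not the parameters used by the DMN:

```python
def step_aware_reward(selected: bool, step: int,
                      selection_reward: float = 1.0,
                      step_penalty: float = 0.05) -> float:
    """Reward reaching a selection, minus a cost per decision step,
    so that policies producing shorter rejection-selection sequences
    score higher (all values here are illustrative)."""
    if selected:
        return selection_reward - step_penalty * step
    # Each extra rejection the user must make is penalized.
    return -step_penalty
```

Under this sketch, a selection after 5 steps (reward 0.75) is preferred to one after 20 steps (reward 0.0), which is what pushes the learned policy toward fewer human decisions.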
There is a notable rise in websites and mobile apps that use manipulative (also known as "deceptive") designs or "dark patterns". Leveraging visual perception effects, cognitive biases, or object manipulations, these designs influence user behavior in ways that may not be beneficial, or can even be harmful, for users. It is important to both warn and educate users about manipulative designs. While numerous studies have investigated warning designs across various domains, little attention has been given to exploring how to warn users about the presence of manipulative designs in applications. We conducted a user study with a three-level warning about the presence of manipulative designs on a simulated app page on the Google Play Store and explored the impact of different warning levels on user attention and decision-making. We also explored possibilities for personalizing warning levels based on the user’s personality (Big 5) characteristics. While our findings did not reveal opportunities for personalization, they underscore the benefit of a multi-level warning design and the pivotal role of visual elements in capturing attention, complemented by the contribution of textual explanations and more details on demand. We discuss the factors influencing users to install an app despite being informed about the presence of manipulative designs and demonstrate how app distribution platforms can embed warnings in the app information to prevent or mitigate the harms of manipulative designs.
Provider fairness aims at regulating recommendation lists so that the items of different providers/provider groups are suggested by respecting notions of equity. When group fairness is among the goals of a system, a common choice is to use coarse groups, since the number of considered provider groups is usually small (e.g., two genders, or three/four age groups) and the number of items per group is large. From a practical point of view, having few groups makes it easier for a platform to manage the distribution of equity among them. Nevertheless, there are sensitive attributes, such as the age or the geographic provenance of the providers, that can be characterized at a fine granularity (e.g., one might group providers at the country level instead of the continent level), which increases the number of groups and decreases the number of items per group. In this study, we show that when state-of-the-art models consider only coarse-grained provider groups, the fine-grained provider groups within large demographic groups are under-recommended. To overcome this issue, we present an approach that brings equity to both coarse- and fine-grained provider groups. Experiments on two real-world datasets show the effectiveness of our approach.
When using web search engines to conduct inquiries on debated topics, searchers’ interactions with search results are commonly affected by a combination of searcher and system biases. While prior work has mainly investigated these biases in isolation, a comprehensive understanding of web search on debated topics is lacking. Addressing this gap, we conducted an exploratory user study (N = 255) aimed at advancing the understanding of the intricate searcher-system interplay. In particular, we investigated the relations between (i) search system exposure, searchers’ attitude strength, prior knowledge, and receptiveness to opposing views, (ii) search interactions, and (iii) post-search epistemic states. We observed that search interaction was shaped by search system exposure, attitude strength, and prior knowledge, and that attitude change was influenced by the level of confirmation bias and initial attitude strength, but not by search system exposure. Insights from this work suggest the need to adapt interventions that mitigate the risks of searcher and system bias to searchers’ nuanced pre-search epistemic states. They further emphasize the threat that customizing the search ranking to enhance user satisfaction poses to responsible opinion formation in the context of debated topics.
To increase trust in systems, engineers strive to create explanations that are as accurate as possible. However, if the system’s accuracy is compromised, providing explanations for its incorrect behavior may inadvertently lead to misleading explanations. This concern is particularly pertinent when the correctness of the system is difficult for users to judge. In an online survey experiment with 162 participants, we analyze the impact of misleading explanations on users’ perceived and demonstrated trust in a system that performs a task that is hard to assess, in an unreliable manner. Participants who used a system that provided potentially misleading explanations rated their trust significantly higher than participants who saw the system’s prediction alone. They also aligned their initial prediction with the system’s prediction significantly more often. Our findings underscore the importance of exercising caution when generating explanations, especially in tasks that are inherently difficult to evaluate. The paper and supplementary materials are available at https://doi.org/10.17605/osf.io/azu72
Conversational Machine Reading (CMR) systems answer high-level user questions by interpreting contextual information, asking clarification questions, and generating human-like responses. While effective, such systems often use knowledge about the task and the user in a non-transparent and non-scrutable way. For example, if a user wants to ask questions like “Why are you asking this?” or “Why is this the correct answer?”, the system should be able to highlight and return the relevant information that led to the decision in an interpretable manner. Similarly, if a user scrutinizes and edits their user profile, the final output of the model should change accordingly. To test the transparency and scrutability of conversational machine reading systems, we formalize two new tasks by extending the ShARC dataset to create the EXtrA-ShARC dataset. For transparency, we propose a baseline model that can simultaneously extract explanations and answer the user’s question. We will also publicly release counterfactual user profiles to test scrutability for all CMR models. Our dataset opens up a range of research directions for using natural language explanations and counterfactual profiles in conversational systems, both for evaluating the model and increasing transparency for end users.
We address demographic bias in neighborhood-learning models for collaborative filtering recommendations. Despite their superior ranking performance, these methods can learn neighborhoods that inadvertently foster discriminatory patterns. Little work exists in this area, highlighting an important research gap. A notable yet solitary effort, the Balanced Neighborhood Sparse LInear Method (BNSLIM), aims at balancing neighborhood influence across different demographic groups. However, BNSLIM is hampered by computational inefficiency, and its rigid balancing approach often impacts accuracy. To address these shortcomings, we introduce two novel algorithms. The first, an enhancement of BNSLIM, incorporates the Alternating Direction Method of Multipliers (ADMM) to optimize all similarities concurrently, greatly reducing training time. The second, Fairly Sparse Linear Regression (FSLR), induces controlled sparsity in neighborhoods to reveal correlations among different demographic groups, achieving comparable efficiency while being more accurate. Their performance is evaluated using standard exposure metrics alongside a new metric for user coverage disparities. Our experiments cover various applications, including a novel exploration of bias in course recommendations by teachers’ country development status. Our results show the effectiveness of our algorithms in imposing fairness compared to BNSLIM and other well-known fairness approaches.
Personalised news recommender systems are effective in disseminating news content based on users’ reading histories but can also amplify and proliferate biased media. This work examines the potential of automated sentence rewriting methods, utilising word replacement methods and large language models (LLMs), to mitigate this side effect of recommender systems. We present a two-step workflow: the application of automated sentence rewriting methods to rewrite biased sentences, and the integration of these rewritten sentences into the recommendation process. We evaluate the effectiveness of sentence rewriting approaches in a simulation framework, to assess how well they mitigate the spread of biased news. Our study demonstrates that applying sentence rewriting to users’ reading histories can result in a significant reduction in the propagation of biased media. Our contributions are threefold: we pioneer the use of LLMs for mitigating the spread of biased news by recommender systems; we demonstrate that algorithms trained on debiased content maintain or improve recommendation accuracy; and we provide a comprehensive exploration of the effectiveness of applying sentence rewriting methods to various components within a recommender system, as well as an investigation of the underlying reasons for their efficacy. This work advances our understanding of media bias mitigation in news content and recommendation algorithms, providing valuable insights into how news recommender systems can prevent the dissemination of biased information.
The research community has become increasingly aware of possible undesired effects of algorithmic biases in recommender systems. One common bias in such systems is to over-proportionally expose certain items to users, which may ultimately result in a system that is considered unfair to individual stakeholders. From a technical perspective, calibration approaches are commonly adopted in such situations to ensure that the individual user’s preferences are better taken into account, thereby also leading to a more balanced exposure of items overall. Given the known limitations of today’s predominant offline evaluation approaches, our work aims to contribute to a better understanding of users’ perception of the fairness and quality of recommendations when these are served in a calibrated way. Therefore, we conducted an online user study (N=500) in which we exposed the treatment groups to recommendations calibrated for fairness in terms of two different item characteristics. Our results show that calibration can indeed be effective in guiding users’ choices towards the “fairness items” without negatively impacting the overall quality perception of the system. However, we also found that calibration did not measurably impact users’ fairness perceptions unless explanatory information was provided by the system. Finally, our study points to challenges in applying calibration approaches in practice, particularly in finding appropriate parameters.
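Calibration of this kind is often implemented as a greedy re-ranking that trades predicted relevance against how closely the list's category distribution matches a target distribution derived from the user's history. Below is a minimal sketch in that spirit; the KL-style objective, the trade-off weight `lam`, and all data shown are illustrative assumptions, not the study's actual system:

```python
import math

def kl(p, q, eps=1e-9):
    """KL-style divergence of target distribution p from list distribution q."""
    return sum(p[c] * math.log((p[c] + eps) / (q.get(c, 0) + eps)) for c in p)

def category_dist(items, categories):
    """Empirical distribution of item categories in a list."""
    dist = {}
    for i in items:
        dist[categories[i]] = dist.get(categories[i], 0) + 1
    n = max(len(items), 1)
    return {c: v / n for c, v in dist.items()}

def calibrated_rerank(candidates, relevance, categories, target, k, lam=0.5):
    """Greedily build a top-k list balancing predicted relevance against
    closeness of the list's category distribution to `target`."""
    chosen, pool = [], set(candidates)
    for _ in range(k):
        best, best_score = None, -math.inf
        for item in pool:
            dist = category_dist(chosen + [item], categories)
            score = (1 - lam) * relevance[item] - lam * kl(target, dist)
            if score > best_score:
                best, best_score = item, score
        chosen.append(best)
        pool.remove(best)
    return chosen
```

For example, with two category-"A" items and one category-"B" item and a 50/50 target, the re-ranker picks one item of each category rather than the two most relevant ones.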
With the integration of AI systems into our daily lives, human-AI collaboration has become increasingly prevalent. Prior work in this realm has primarily explored the effectiveness and performance of individual humans and AI systems in collaborative tasks. While much real-world decision-making occurs within human pairs and groups, there is limited understanding of how such groups collaborate with AI systems. One of the key predictors of human-AI collaboration is the characteristics of the task at hand. Understanding the influence of task characteristics on human-AI collaboration is crucial for enhancing team performance and developing effective strategies for collaboration. Addressing this research and empirical gap, we explore how the features of a task impact decision-making in human-AI group settings. In a 2 × 2 between-subjects study (N = 256), we examine the effects of task complexity and uncertainty on group performance and behaviour. Participants were grouped into pairs and assigned to one of four experimental conditions characterized by varying degrees of complexity and uncertainty. We found that high task complexity and high task uncertainty can negatively impact the performance of human-AI groups, leading to decreased group accuracy and increased disagreement with the AI system. We also found that higher task complexity led to higher efficiency in decision-making, while higher task uncertainty had a negative impact on efficiency. Our findings highlight the importance of considering task characteristics when designing human-AI collaborative systems, as well as when designing future empirical studies exploring human-AI collaboration.
In this study, we present an approach to utilizing variance in students’ performance across different formats (multiple-choice, numeric input, word problems) as a target for personalization. We have developed a measure called challenge variance, which indicates the degree to which different formats pose varying levels of challenge for individual learners. We investigated whether challenge variance could be a useful source of information for developing learner models by analyzing data from an online math tutoring platform. Results demonstrated that challenge variance has a relationship with an external activity, indicating its utility as a means of predicting how well a learner will perform in a new setting. We discuss the affordances and issues of the measure and whether it could be a useful addition to personalized learner models as an intuitive and platform-agnostic measure of performance.
As social media grapples with the proliferation of misinformation, flagging systems have emerged as vital digital tools that alert users to potential falsehoods while preserving free speech. The efficacy of these systems hinges on user interpretation of and reaction to the flags provided. This study probes the influence of warning flags on user perceptions, assessing their effect on the perceived accuracy of information, the propensity to share content, and the trust users have in these warnings, especially when supplemented with fact-checking explanations. Through a within-subject experiment involving 348 American participants, we mimicked a social media feed with a series of COVID-19-related headlines, both true and false, under various conditions: with flags, with flags and explanatory text, and without any intervention. Explanatory content was derived from fact-checking sites linked to the news items. Our findings indicate that false news is perceived as less accurate when flagged or accompanied by explanatory text. The presence of explanatory text correlates with heightened trust in the flags. Notably, participants with high levels of neuroticism and a deliberative cognitive thinking style showed higher trust in explanatory text alongside warning flags. Conversely, participants with conservative leanings exhibited distrust towards social media flagging systems. These results underscore the importance of clear explanations within flagging mechanisms and support a user-centric model in their design, emphasising transparency and engagement as essential in counteracting misinformation on social media.
This work investigates relationships between consistent attendance (attendance rates in a group that maintains the same tutor and students across the school year) and learning in small group tutoring sessions. We analyzed data from two large urban districts consisting of 206 9th-grade student groups (3-6 students per group) for a total of 803 students and 75 tutors. The students attended small group tutorials approximately every other day during the school year and completed a pre- and post-assessment of math skills at the start and end of the year, respectively. First, we found that the attendance rates of the group predicted individual assessment scores better than the individual attendance rates of the students comprising that group. Second, we found that groups with high consistent attendance had more frequent and diverse tutor and student talk centering around rich mathematical discussions. While we acknowledge that changing tutors or groups might sometimes be necessary, our findings suggest that consistently attending tutorial sessions as a group with the same tutor might lead the group to implicitly learn as a team despite not formally being one.
Lifelong personalised learning is often described as the holy grail of the educational data sciences, but work on the topic is sporadic and we are yet to achieve this goal in a meaningful form. In the wake of the skills shortages arising from national responses to COVID-19 this problem has again become a topic of interest. A number of proposals have emerged that some sort of a skills passport would help individuals, educational institutions, and employers to identify training and recruitment needs according to identified skills gaps. And yet, we are a long way from achieving a skills passport that could support lifelong learning despite more than 25 years of work on the topic. This paper draws attention to two of the critical socio-technical challenges facing skills passports, and lifelong learner models in general. This leads to a proposal for how we might move towards a useful skills passport that can cross the “skills sector border”.
Recommender Systems have played an important role in our daily lives for many years. However, it is only recently that their social impact has raised ethical issues and has thus been considered in the design of such systems. In particular, News Recommender Systems (NRS) have a critical influence on individuals. NRS can provide overspecialized recommendations and enclose users in filter bubbles. Moreover, NRS can influence users and make their original opinions diverge. Worse, they can orient users’ opinions towards more radical views. The literature has addressed these issues by leveraging diversity and fairness in recommendation algorithms, but generally only one of these dimensions at a time. We propose to consider both diversity and fairness simultaneously to provide recommendations that are fair, diverse, and, of course, accurate. To this end, we propose a novel recommendation framework, Accuracy-Diversity-Fairness (ADF), which considers that fairness need not come at the expense of diversity. Concretely, fairness is approached as a constraint on diversity. Experiments highlight that constraining diversity by fairness yields recommendations five times more diverse than those of models from the literature, without any loss in accuracy.
Predicting the success of marketing campaigns on social media can help improve campaign managers’ decision-making (e.g., deciding to stop a marketing campaign) and thus increase their profits. Most research in the field of online marketing has focused on analyzing users’ behavior rather than improving campaign manager decision-making. Furthermore, determining the success of marketing campaigns is quite challenging due to the large number of possible metrics that must be analyzed daily. In this study, we propose a method that incorporates machine learning models with traditional business rules to provide daily decision recommendations, based on various metrics and considerations, aimed at achieving the campaign’s goals. We evaluate our approach on a unique dataset collected from two of the most popular social networks, Facebook and Instagram. Our evaluation demonstrates the proposed method’s ability to outperform an expert-based method and the machine learning baselines examined, and to dramatically increase campaign managers’ profits.
Emotions constitute an important aspect when listening to music. While manual annotations from user studies grounded in psychological research on music and emotions provide a well-defined and fine-grained description of the emotions evoked when listening to a music track, user-generated tags provide an alternative view stemming from large-scale data. In this work, we examine the relationship between these two emotional characterizations of music and analyze their impact on the performance of emotion-based music recommender systems, individually and jointly. Our analysis shows that (i) the agreement between the two characterizations, as measured with Cohen’s κ coefficient and Kendall rank correlation, is often low; (ii) leveraging the emotion profile based on the intensity of evoked emotions from high-quality annotations leads to performance that is stable across different recommendation algorithms; and (iii) simultaneously leveraging the emotion profiles based on high-quality and large-scale annotations provides recommendations that are less exposed to the low accuracy that algorithms might reach when leveraging only one type of data.
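The two agreement measures named here are straightforward to compute. A pure-Python sketch over toy emotion labels (Cohen's κ for categorical agreement, and the simple tau-a variant of Kendall's rank correlation, which ignores tie corrections; the labels are illustrative):

```python
from itertools import combinations

def cohens_kappa(a, b):
    """Observed agreement between two categorical labelings,
    corrected for the agreement expected by chance."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    p_e = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / all pairs."""
    n = len(x)
    s = sum(((x[i] > x[j]) - (x[i] < x[j])) * ((y[i] > y[j]) - (y[i] < y[j]))
            for i, j in combinations(range(n), 2))
    return s / (n * (n - 1) / 2)
```

Identical labelings yield κ = 1 and τ = 1; two raters who never agree can still have κ = 0 when chance agreement is also zero, which is why κ is preferred over raw agreement.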
In this paper, we introduce a Knowledge-aware Recommender System (KARS) based on Graph Neural Networks (GNNs) that exploits pre-trained content-based embeddings to improve the representation of users and items. Our approach relies on the intuition that textual features can describe the items in the catalogue from a different point of view, so they are worth exploiting to provide users with more accurate recommendations. Accordingly, we used encoding techniques to learn a pre-trained representation of the items in the catalogue based on textual content, and we used these embeddings to feed the input layer of a KARS based on GCNs. In this way, the GCN is able to encode both the knowledge coming from the unstructured content and the structured knowledge provided by the KG (ratings and item descriptive properties). As shown in our experiments, the exploitation of pre-trained embeddings improves the predictive accuracy of the KARS, which outperforms all the baselines we considered in several experimental settings.
Conversational Recommender Systems (CRS) have recently drawn attention due to their capacity to deliver personalized recommendations through multi-turn natural language interactions. In this paper, we follow this line of research and introduce a Knowledge-Aware Sequential Conversational Recommender System (KASCRS) that exploits transformers and knowledge graph embeddings to provide users with recommendations in a conversational setting.
In particular, KASCRS is able to predict a suitable recommendation based on the elements mentioned in a conversation between a user and a CRS. To do this, we design a model that: (i) encodes each conversation as a sequence of entities mentioned in the dialogue (i.e., items and properties), and (ii) is trained on a cloze task, that is, it learns to predict the final element in the sequence (which corresponds to the item to be recommended) based on the information it has previously seen.
The model has two main hallmarks: first, we exploit Transformers and self-attention to capture the sequential dependencies that exist among the entities mentioned in the training dialogues, in a way similar to session-based recommender systems [25]. Second, we use knowledge graphs (KGs) to improve the quality of the representation of the elements mentioned in each sequence. Specifically, we exploit knowledge graph embedding techniques to pre-train the representation of items and properties, and we feed the input layer of our architecture with the resulting embeddings. In this way, KASCRS integrates both the knowledge from the KG and the dependencies and co-occurrences emerging from conversational data, resulting in a more accurate representation of users and items. Our experiments confirm this intuition: KASCRS outperforms several state-of-the-art baselines on two different datasets.
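The cloze-style training setup described above can be illustrated with a small data-preparation sketch: each conversation, represented as a sequence of mentioned entities ending with the recommended item, becomes an input with its final element masked plus that element as the target (the entity names and mask token below are illustrative, not drawn from the actual datasets):

```python
def cloze_examples(conversations, mask_token="[MASK]"):
    """Turn each conversation (a sequence of mentioned entities whose
    last element is the item to recommend) into a cloze-style pair:
    (sequence with final element masked, final element as target)."""
    examples = []
    for seq in conversations:
        inputs = seq[:-1] + [mask_token]
        target = seq[-1]
        examples.append((inputs, target))
    return examples

convo = [["Inception", "genre:sci-fi", "director:Nolan", "Interstellar"]]
# input:  ["Inception", "genre:sci-fi", "director:Nolan", "[MASK]"]
# target: "Interstellar"
```

A Transformer trained over such pairs then learns to fill the masked slot, which at inference time is the recommendation.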
Music recommender systems play a pivotal role in catering to diverse user preferences and fostering personalized listening experiences. At the same time, sentiments can profoundly influence music by shaping its emotional expression and evoking specific moods in listeners. When expressed in textual content, these sentiments can be analyzed through natural language processing techniques to gauge emotions or opinions, hopefully increasing their relevance when exploited for recommendation. This work investigates how to better integrate such information and understand its potential impact on personalized music suggestions, attempting to enhance recommendation models by incorporating sentiment features into factorization machines. For this purpose, a dataset was collected from Last.fm and enriched with sentiment information extracted from Wikipedia. Empirical results show that not all sentiment-related features are equally useful, and that each tested factorization machine approach varies in its sensitivity to these features. Source code and data are available at https://github.com/abellogin/SentiFMRecSys.
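A factorization machine scores a feature vector (which could include sentiment features alongside user and item indicators) with a global bias, linear weights, and pairwise interactions factorized through latent vectors. A minimal sketch of the 2-way model, using the well-known O(kn) reformulation of the interaction term; all weights below are toy values, not learned parameters:

```python
def fm_predict(x, w0, w, V):
    """2-way factorization machine:
        y = w0 + sum_i w_i x_i + sum_{i<j} <V_i, V_j> x_i x_j
    The pairwise term is computed in O(k*n) as
        0.5 * sum_f [(sum_i V_if x_i)^2 - sum_i (V_if x_i)^2]."""
    n, k = len(x), len(V[0])
    linear = sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += s * s - s_sq
    return w0 + linear + 0.5 * pairwise
```

With two active features whose latent vectors are [1.0] and [2.0], the pairwise term contributes their inner product, 2.0, which is how the model captures interactions (e.g., between a user and a sentiment feature) without an explicit weight per feature pair.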
In recent years, people have been spending more and more time on social media. Within the multimedia content used by platforms, visuals are certainly growing in significance. Interaction data reveals users’ favourite images. This information can be exploited to gain deeper insight into their psychological profile, since the literature on automatic personality recognition suggests that personality traits may correlate with aesthetic preferences. In this paper we explore the use of personal preferences over multiple images to predict the personality traits of users. Unlike previous works, we propose a model that exploits ResNet50, a Convolutional Neural Network, to automatically extract features from the images in the PsychoFlickr dataset. We then fit five independent linear regressors on these features to predict personality. To determine whether using more than one image leads to better results, we train the model multiple times, using one to five images as input, and compare the performances. Our method seems to outperform related state-of-the-art works.
Next-item recommender systems are often trained using only positive feedback with randomly-sampled negative feedback. We show the benefits of using real negative feedback both as inputs into the user sequence and as negative targets when training a next-song recommender system for internet radio. In particular, using explicit negative samples during training reduces training time by ∼ 60% while also improving test accuracy by 6%; adding user skips as additional inputs can also considerably increase user coverage alongside improving accuracy. We test the impact of using a large number of random negative samples to capture a ‘harder’ one and find that test accuracy increases with more randomly-sampled negatives, but only to a point. Too many random negatives lead to false negatives that limit the lift, which remains lower than when using true negative feedback. We also find that test accuracy is fairly robust with respect to the proportion of different feedback types, and we compare the learned embeddings for different feedback types.
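The "many random negatives to find a harder one" strategy can be sketched as follows; the function and variable names are illustrative, not the paper's implementation:

```python
import random

def hardest_negative(score, positives, catalog, n_samples=100, seed=None):
    """Sample up to `n_samples` random non-positive items and keep the one
    the current model scores highest: the 'hardest' negative. As n_samples
    grows, so does the chance of sampling a false negative (an item the
    user would actually like), which caps the benefit of this trick."""
    rng = random.Random(seed)
    candidates = [i for i in catalog if i not in positives]
    sample = rng.sample(candidates, min(n_samples, len(candidates)))
    return max(sample, key=score)
```

True negative feedback (e.g., skips or explicit dislikes) avoids the false-negative problem entirely, which is consistent with the higher lift reported for real negatives.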
In news media, recommender system technology faces several domain-specific challenges. The continuous stream of new content and users means that content-based recommendation strategies, based on similar-item retrieval, remain popular. However, a persistent challenge is selecting relevant features and corresponding similarity functions, and determining whether this depends on the specific context. We evaluated feature-specific similarity metrics using human similarity judgments across national and local news domains. We performed an online experiment (N = 141) in which we asked participants to judge the similarity between pairs of randomly sampled news articles. We make three contributions: (1) comparing novel metrics based on large language models to ones traditionally used in news recommendations, (2) exploring differences in similarity judgments across national and local news domains, and (3) examining which content-based strategies are perceived as appropriate in the news domain. Our results showed that one of the novel large language model-based metrics (SBERT) was highly correlated with human judgments, while there were only small, mostly non-significant differences across national and local news domains. Finally, we found that while it may be possible to automatically recommend similar news using feature-specific metrics, their representativeness and appropriateness varied. We explain how our findings can guide the design of future content-based and hybrid recommender strategies in the news domain.
In the context of recommender systems (RS), the concept of diversity is probably the most studied perspective beyond mere accuracy. Despite the extensive development of diversity measures and enhancement methods, the understanding of how users perceive diversity in recommendations remains limited. This gap hinders progress in multi-objective RS, as it challenges the alignment of algorithmic advancements with genuine user needs. Addressing this, our study delves into two key aspects of diversity perception in RS. We investigate user responses to recommendation lists generated using varied diversity metrics but identical diversification thresholds, and lists created with the same metrics but differing thresholds. Our findings reveal a user preference for metadata and content-based diversity metrics over collaborative ones. Interestingly, while users typically recognize more diversified lists as being more diverse in scenarios with significant diversification differences, this perception is not consistently linear and quickly diminishes when the diversification variance between lists is less pronounced. This study sheds light on the nuanced user perceptions of diversity in RS, providing valuable insights for the development of more user-centric recommendation algorithms. Study data and analysis scripts are available from https://osf.io/9y8gx/.
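A common way to score the diversity of a recommendation list, and the basis of several metadata and content-based metrics of the kind compared in such studies, is intra-list diversity: the average pairwise distance between the items in the list. A sketch with a Jaccard-on-genres distance; the item representation is an illustrative assumption:

```python
from itertools import combinations

def jaccard_distance(genres_a, genres_b):
    """1 minus the Jaccard similarity of two genre sets."""
    sa, sb = set(genres_a), set(genres_b)
    return 1 - len(sa & sb) / len(sa | sb)

def intra_list_diversity(items, distance):
    """Average pairwise distance over all item pairs in a list;
    higher values indicate a more diverse recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)
```

Swapping in a collaborative distance (e.g., one derived from co-rating patterns) instead of the metadata-based one yields the kind of metric that, per the findings above, users tend to prefer less.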
Maintaining correct exercise posture is a significant challenge in physical fitness: correct form is critical for the safety and effectiveness of fitness routines, yet it is often difficult for individuals to keep proper form without professional guidance, which is usually expensive. This paper presents a novel method that utilizes the capabilities of YOLOv7 and an ordinary web camera to offer immediate feedback and correction on body posture during gym activities. Such a method empowers individuals to correct themselves and promotes motivation even without the presence of a professional trainer. The system provides immediate, personalized feedback for various fitness exercises; it efficiently counts repetitions and provides textual guidance for improvement, tailored to the specific requirements of fitness enthusiasts. To determine the efficacy of our technology, we carried out a user study in a controlled laboratory setting simulating a gym environment. The study compared our interactive system with a traditional training method, involving participants of varied fitness levels, and showed significant improvements in exercise technique with real-time feedback. These findings are crucial for AI-supported training systems in strength training, underscoring the need for adaptive technologies for different user experiences. The research contributes to discussions in human-computer interaction and fitness technology, highlighting the potential of interactive models to augment, and sometimes replicate, the benefits of personal training in exercise form and posture improvement.
Affective computing has the potential to enrich the development lifecycle of Graphical User Interfaces (GUIs) and of intelligent user interfaces by incorporating emotion-aware responses. Yet, affect is seldom considered when determining whether a GUI design will be perceived as good or bad. We study how physiological signals can be used as an early, effective, and rapid affective assessment method for GUI design, without having to ask for explicit user feedback. We conducted a controlled experiment in which 32 participants were exposed to 20 good and 20 bad GUI designs while we recorded their eye activity through eye tracking, facial expressions through video recordings, and brain activity through electroencephalography (EEG). We observed noticeable differences in the collected data, so we trained and compared different computational models to tell good and bad designs apart. Our results suggest that each modality has its own “performance sweet spot” in terms of both model architecture and signal length. Taken together, our findings suggest that it is possible to distinguish between good and bad designs using physiological signals. Ultimately, this research paves the way toward implicit evaluation methods of GUI designs through user modeling.
Previous research on an Intelligent Tutoring System (referred to as ACSP) showed the need to personalize explanations of its AI-driven hints for users with low Need for Cognition (N4C) and low Conscientiousness (Cons.). Specifically, this work found that explanations should be presented to these users with the objective of increasing their interaction with them. In this paper, we present and evaluate design alterations to the original ACSP explanation interface aimed at achieving this objective. Our results provide initial evidence that the implemented personalization, in the form of these design alterations, had a positive impact on users with low N4C and Cons., increasing attention to explanations and contributing to learning gains.
Algorithmic Recourse aims to provide actionable explanations, or recourse plans, to overturn potentially unfavourable decisions taken by automated machine learning models. In this paper, we propose an interaction paradigm based on a guided interaction pattern aimed at both eliciting users’ preferences and guiding them toward effective recourse interventions. In a fictional money-lending task, we compare this approach with an exploratory interaction pattern based on a combination of alternative plans and the possibility for users to freely change the configurations themselves. Our results suggest that users may recognize that the guided interaction paradigm improves efficiency. However, they also feel less free to experiment with “what-if” scenarios. Moreover, time spent on the purely exploratory interface tends to be perceived as a lack of efficiency, which reduces attractiveness, perspicuity, and dependability. Conversely, for the guided interface, more time on the interface seems to increase its attractiveness, perspicuity, and dependability without affecting perceived efficiency. This suggests that such interfaces should combine the two approaches, supporting exploratory behavior while gently nudging users toward a guided, effective solution.
While much research in recommender systems has focused on improving the accuracy of recommendations, issues pertaining to their presentation remain under-explored. Considering the uptake of recommendations as one of their success indicators, we investigate the role of proponents in users’ decisions to accept a recommendation. We define a proponent as a person or avatar advocating in favor of the recommended item. This paper reports on a user study that evaluated several types of proponents included in the recommender interface and their impact on the uptake of recommendations. We observe that, among the studied proponents, real-world contacts have the strongest impact on uptake, a finding that can inform the design of recommender system interfaces.
In recommender systems, the presentation of explanations plays a crucial role in supporting users’ decision-making processes. Although numerous existing studies have focused on the effects (e.g., transparency) of explanation content, explanation expression is largely overlooked. Tone, such as formal or humorous, is directly linked to expressiveness and is an important element in human communication. However, studies on the impact of tone on explanations in the context of recommender systems are scarce. Therefore, this study investigates the tonal effects of explanations through an online user study. We focus on the hotel domain and six types of tones. Analysis of the collected data reveals that the tone of explanations influences the perceived effects, such as trust and effectiveness, of recommender systems. Our findings suggest that the tone of explanations can enhance user experience in recommender systems.
With the rapid advances in deep learning, we have witnessed strongly increasing interest in conversational recommender systems (CRS). Until recently, however, even the latest generative models exhibited major limitations and, according to previous studies, frequently returned non-meaningful responses. With the latest Generative AI-based dialog systems built on Generative Pre-Trained Transformer (GPT) models, a new era has arrived for CRS research. In this work, we study the use of ChatGPT as a movie recommender system. To this purpose, we conducted an online user study involving N=190 participants, who were tasked with evaluating ChatGPT’s responses in a multitude of dialog situations. As a reference point for the analysis, we included in the experiment a retrieval-based conversational method that previous research found to be a robust approach.
Our study results indicate that the responses by ChatGPT were perceived as significantly better, in terms of their meaningfulness, than those of the retrieval-based reference system. A detailed inspection of the results showed that ChatGPT excelled when providing recommendations, but sometimes missed the context when asked questions about a movie within a longer dialog. A statistical analysis revealed that the information adequacy and recommendation accuracy of the responses had the strongest influence on their perceived meaningfulness. Finally, an additional analysis showed that human perceptions of meaningfulness correlated only very weakly with computational metrics such as BLEU or ROUGE, emphasizing the importance of involving humans in the evaluation of a CRS.
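The relationship between an automatic metric and human judgments is commonly quantified with a rank correlation such as Spearman’s rho, computed over per-response metric scores and human ratings. A minimal self-contained sketch (the score values are made-up illustrations, not numbers from the study):

```python
def rank(values):
    """1-based average ranks, with tied values sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical BLEU scores vs. human meaningfulness ratings for 5 responses.
bleu = [0.12, 0.30, 0.05, 0.22, 0.18]
human = [2.5, 4.0, 4.6, 3.5, 3.0]
rho = spearman(bleu, human)   # near zero here: ranks barely agree
```

A rho near zero, as in this toy example, is what a “very weak” correlation looks like; in practice `scipy.stats.spearmanr` would typically be used instead of hand-rolled ranking.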
Context has been an important topic in recommender systems over the past two decades. Most prior work on context-aware recommender systems (CARS) manually selected and considered only a few crucial contextual variables in an application, such as the time, location, and company of a person. This prior work demonstrated significant improvements in recommendation performance when various CARS-based methods were deployed in numerous applications. In this paper, we study “context-rich” applications that deal with a large variety of context types. We demonstrate that supporting only the few most important contextual variables that can be manually identified, although useful, is not sufficient. In particular, we develop an approach to extract a large number of contextual variables for dialogue-based recommender systems. In our study, we processed dialogues between bank managers and their clients and identified over two hundred types of contextual variables forming the Long Tail of Context (LTC). We empirically demonstrate that LTC matters: using all the contextual variables from the Long Tail leads to better recommendation performance.
Machine learning models that use users’ gaze and hand data to encode user interaction behavior in VR are often tailored to a single task and sensor set, limiting their applicability in settings with constrained compute resources. We propose GEARS, a new paradigm that learns a shared feature extraction mechanism across multiple tasks and sensor sets to encode gaze and hand tracking data of users’ VR behavior into multi-purpose embeddings. GEARS leverages a contrastive learning framework to learn these embeddings, which we then use to train linear models to predict task labels. We evaluated our paradigm across four VR datasets with eye tracking that comprise different sensor sets and task goals. The performance of GEARS was comparable to that of models trained for a single task on data from a single sensor set. Our research advocates a shift from sensor-set- and task-specific models towards one shared feature extraction mechanism for encoding users’ interaction behavior in VR.
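Contrastive frameworks of this kind typically train the shared encoder with an InfoNCE-style objective that pulls embeddings of two views of the same interaction window together and pushes embeddings of other windows apart. A minimal NumPy sketch of that loss (the random embeddings stand in for encoder outputs; GEARS’s actual encoder and loss details are in the paper):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: row i of z1 should match row i of z2 (the positive pair)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (N, N) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # positives on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))                 # 8 windows, 16-dim embeddings
aligned = info_nce(z, z)                     # matched views: low loss
shuffled = info_nce(z, z[::-1].copy())       # mismatched views: higher loss
```

Once the encoder is trained with such a loss, the frozen embeddings can be fed to simple linear classifiers per downstream task, which is the probing setup the abstract describes.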
The inherent social characteristics of humans make them prone to easily adopting distributed and collaborative applications. Although fundamental methods and technologies for constructing these applications have been defined and developed over the years, their adoption in practice is uncommon because end-users are often puzzled about how to use them. Indeed, these applications commonly require a certain level of technical expertise and awareness to be used correctly. Fortunately, AI-chatbot interventions are envisioned to assist and support various human tasks. In this paper, we contribute pervasive chatbots as a solution that fosters a more transparent and user-friendly interconnection of devices in distributed and collaborative environments. Through two rigorous user studies, we first quantify users’ perception of distributed and collaborative applications (N = 56 participants). Second, we analyze the benefits of adopting pervasive chatbots compared with a reference chatbot model designed for assistance and recommendations (N = 24 participants). Our results suggest that pervasive chatbots can significantly enhance the practicability of distributed and collaborative applications, reducing the time and effort needed to collaborate with surrounding devices by 57%. Building on these results, we provide design and development implications for integrating pervasive chatbot interventions in distributed and collaborative environments. Moreover, we highlight the challenges and opportunities that remain to be addressed to realize the full vision of pervasive chatbots for any multi-device application. Our work paves the way towards the proliferation of sophisticated, highly decentralized computing environments that are easily interconnected.
In everyday life, we make group decisions about a variety of issues. In group decision-making, group members discuss options, exchange preferences and opinions, and reach a common decision. Decision support systems and group recommender systems facilitate this process by enabling preference elicitation, generating recommendations, and supporting the process. Here, we are interested in building a conversational system, namely a chat app enhanced with an AI agent, that supports the group decision-making process. To design the system, rather than relying solely on our assumptions, we took a step back and conducted a comprehensive focus group study. This approach allowed us to gain original insights into the specific needs and preferences of the future end-users, i.e., group members, ensuring that our system design aligns closely with their requirements. The focus group study involved fourteen participants in three group compositions: friends, families, and couples. Our findings reveal that most group members define a good choice as one that maximizes overall satisfaction without leaving any member dissatisfied. Dealing with competing group members emerged as a primary concern, with study participants requesting specific help from the AI agent to address this challenge. Participants identified personality and group structure as crucial characteristics for the AI agent to operate properly, though some expressed privacy concerns. Lastly, participants expected an AI agent to provide private interactions with individual members, proactively guide discussions when necessary, continually analyze group interactions, and tailor its support to those interactions.
Health-promoting digital agents, taking on the role of an assistant, coach, or companion, are expected to have knowledge about a person’s medical and health status, yet they typically lack knowledge about the person’s activities. These activities may vary daily or weekly and are contextually situated, posing challenges for human-agent interaction. This pilot study explored the experiences and behaviors of older adults when interacting with an initially unknowledgeable digital agent that queries them about an activity they are simultaneously engaged in. Five older adults participated in a scenario involving preparing coffee and then having coffee with a guest. While performing these activities, participants educated the smartwatch-embedded agent, named Virtual Occupational Therapist (VOT), about their activity performance by answering a set of activity-ontology-based questions posed by the VOT. Participants’ interactions with the VOT were observed, followed by a semi-structured interview focusing on their experience with it. Collected data were analyzed using an activity-theoretical framework. Results revealed that participants exhibited agency and autonomy, deciding whether to adapt to the VOT’s actions in three phases: adjustment to the VOT, partial adjustment, and the exercise of agency by putting the VOT to sleep once the social conditions and activity changed. The results imply that the VOT should be able to distinguish when humans collaborate as it expects and when they choose not to comply and instead act according to their own agenda. Future research will focus on how collaboration evolves and how the VOT needs to adapt in the process.