Incorporating Knowledge Graphs (KGs) into recommendation has attracted growing attention in industry, due to the great potential of KGs in providing abundant supplementary information and interpretability for the underlying models. However, simply integrating a KG into recommendation usually brings negative feedback in industry, mainly because two factors are ignored: i) users' multiple intents, which involve diverse nodes in the KG. For example, in e-commerce scenarios, users may exhibit preferences for specific styles, brands, or colors. ii) knowledge noise, which is a prevalent issue in Knowledge Enhanced Recommendation (KGR) and even more severe in industrial scenarios. Irrelevant knowledge properties of items may result in inferior model performance compared to approaches that do not incorporate knowledge at all. To tackle these challenges, we propose a novel approach named Knowledge Enhanced Multi-intent Transformer Network for Recommendation (KGTN), which comprises two primary modules: Global Intents Modeling with Graph Transformer, and Knowledge Contrastive Denoising under Intents. Specifically, Global Intents Modeling with Graph Transformer captures learnable user intents by incorporating global signals from user-item-relation-entity interactions with a well-designed graph transformer, while learning intent-aware user/item representations. Knowledge Contrastive Denoising under Intents, in turn, is dedicated to learning precise and robust representations: it leverages the intent-aware user/item representations to sample relevant knowledge, and then applies a local-global contrastive mechanism to enhance noise-irrelevant representation learning. Extensive experiments on three benchmark datasets show the superior performance of our proposed method over state-of-the-art approaches, and online A/B testing on Alibaba's large-scale industrial recommendation platform further confirms the real-world effectiveness of KGTN. The implementations are available at: https://github.com/CCIIPLab/KGTN.
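To make the local-global contrastive mechanism above concrete, the following minimal sketch shows a symmetric InfoNCE-style objective between a local (knowledge-denoised) view and a global (graph-transformer) view of the same user/item batch; the function name, temperature, and exact form are illustrative assumptions rather than KGTN's actual implementation.

```python
import torch
import torch.nn.functional as F

def local_global_contrastive(local_emb, global_emb, tau=0.2):
    """InfoNCE-style objective: the local (knowledge-denoised) view and the
    global (graph-transformer) view of the same user/item form a positive
    pair; other rows in the batch act as negatives."""
    z1 = F.normalize(local_emb, dim=-1)
    z2 = F.normalize(global_emb, dim=-1)
    logits = z1 @ z2.t() / tau                      # (B, B) cosine similarities
    labels = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```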
With the surge in mobile gaming, accurately predicting user spending on newly downloaded games has become paramount for maximizing revenue. However, the inherently unpredictable nature of user behavior poses significant challenges in this endeavor. To address this, we propose a robust model training and evaluation framework aimed at standardizing spending data to mitigate label variance and extremes, ensuring stability in the modeling process. Within this framework, we introduce a collaborative-enhanced model designed to predict user game spending without relying on user IDs, thus ensuring user privacy and enabling seamless online training. Our model adopts a unique approach by separately representing user preferences and game features before merging them as input to the spending prediction module. Through rigorous experimentation, our approach demonstrates notable improvements over production models, achieving a remarkable 17.11% enhancement on offline data and an impressive 50.65% boost in an online A/B test. In summary, our contributions underscore the importance of stable model training frameworks and the efficacy of collaborative-enhanced models in predicting user spending behavior in mobile gaming. The code associated with this paper has also been released at the following link https://doi.org/10.5281/zenodo.10775846.
In the realm of e-commerce search, the significance of semantic matching cannot be overstated, as it directly impacts both user experience and company revenue. Along this line, query rewriting, an important technique for bridging the semantic gaps inherent in the semantic matching process, has attracted wide attention from industry and academia. However, existing query rewriting methods often struggle to effectively optimize long-tail queries and to alleviate the "few recall" phenomenon caused by the semantic gap. In this paper, we present BEQUE, a comprehensive framework that Bridges the sEmantic gap for long-tail QUEries. In detail, BEQUE comprises three stages: multi-instruction supervised fine-tuning (SFT), offline feedback, and objective alignment. We first construct a rewriting dataset based on rejection sampling and auxiliary-task mixing to fine-tune our large language model (LLM) in a supervised fashion. Subsequently, with the well-trained LLM, we employ beam search to generate multiple candidate rewrites and feed them into the Taobao offline system to obtain a partial order. Leveraging this partial order of rewrites, we introduce a contrastive learning method to highlight the distinctions between rewrites and align the model with the Taobao online objectives. Offline experiments prove the effectiveness of our method in bridging the semantic gap. Online A/B tests reveal that our method can significantly boost gross merchandise volume (GMV), the number of transactions (#Trans), and unique visitors (UV) for long-tail queries. BEQUE has been deployed on Taobao, one of the most popular online shopping platforms in China, since October 2023.
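To illustrate how a partial order over candidate rewrites can be turned into a training signal, the sketch below uses a simple margin ranking loss over preference pairs; the names and the specific loss form are assumptions for exposition, not necessarily BEQUE's exact objective-alignment loss.

```python
import torch
import torch.nn.functional as F

def partial_order_alignment_loss(rewrite_scores, preference_pairs, margin=1.0):
    """rewrite_scores: 1-D tensor of model scores (e.g., length-normalized
    log-likelihoods) for the candidate rewrites of one query.
    preference_pairs: (i, j) index pairs meaning rewrite i is ranked above
    rewrite j by the offline feedback system. A margin ranking loss pushes
    preferred rewrites to score higher, aligning the LLM with the partial order."""
    better = rewrite_scores[torch.tensor([i for i, _ in preference_pairs])]
    worse = rewrite_scores[torch.tensor([j for _, j in preference_pairs])]
    return F.relu(margin - (better - worse)).mean()
```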
Efficient retrieval and ranking of relevant products in e-commerce product search relies on accurate mapping of queries to product categories. This query classification typically utilizes a combination of textual and customer behavioral signals. However, new product categories often lack customer interaction data, leading to poor performance. In this paper, we present a novel approach to mitigate this cold-start problem in product ranking via synthetic generation of queries as well as simulation of customer interactions. Specifically, we study two strategies for synthetic data generation: (i) fine-tuning a generative large language model (LLM) on historical product-query interactions and using it to generate synthetic queries from the product catalog, and (ii) Bayesian prompt optimization with an instruction-tuned LLM to directly generate queries from the catalog. Empirical evaluation of the proposed approaches on public datasets and real-world customer queries demonstrates significant benefits (+2.96% and +2.34% in PR-AUC on e-commerce queries) relative to the baseline approach without synthetic data augmentation. Furthermore, evaluation of the augmented model on the live search page results in a substantial increase in highly relevant product results (+3.35%) and a reduction (-3.07%) in irrelevant results.
Feature crosses, which represent joint features synthesized from two single features, are critical for deep recommender systems to model sophisticated feature relations. In practice, only a tiny fraction of the massive number of possible feature crosses are informative, while introducing irrelevant or noisy ones may increase online service latency and raise the risk of overfitting. Therefore, picking high-quality feature crosses is essential in practical recommender systems. However, even for selecting quadratic feature crosses, existing algorithms still incur either O(n²) time complexity or O(n²) space complexity, which is inefficient and unscalable in industrial scenarios. In this paper, we present an efficient and accurate quadratic feature cross selection method with both linear time and space complexity. Motivated by the idea of quasi-Newton methods, we propose to use the 2nd-order derivative matrix to evaluate all theoretically possible feature crosses concurrently without the need to construct them explicitly, where an approximation of the 2nd-order gradient is applied to guarantee both low time and space complexity. Furthermore, we decouple the feature crosses' novelty from single features' joint importance. Experiments on two public recommendation datasets and a private dataset validate the efficiency and effectiveness of our method, and it has also become a fundamental feature cross selection tool used by the Huawei Ads Platform.
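As a rough illustration of scoring quadratic crosses without materializing an n×n matrix, the sketch below uses the outer product of per-feature gradients as a cheap second-order proxy and only pairs a small pool of top features; this is an assumed simplification for exposition, not the paper's actual approximation.

```python
import numpy as np

def select_quadratic_crosses(grad, k=50, pool=200):
    """grad: length-n vector of accumulated per-feature gradients from the
    trained base model. As a quasi-Newton-flavoured proxy, the interaction
    score of a cross (i, j) is taken as |g_i * g_j|, so no n x n matrix is
    ever built: only the `pool` features with the largest |g| are paired,
    keeping the cost linear in n plus O(pool^2) for the final ranking."""
    idx = np.argsort(-np.abs(grad))[:pool]
    candidates = [(abs(grad[i] * grad[j]), int(i), int(j))
                  for a, i in enumerate(idx) for j in idx[a + 1:]]
    candidates.sort(reverse=True)
    return [(i, j) for _, i, j in candidates[:k]]
```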
Effective user representations are pivotal in personalized advertising. However, stringent constraints on training throughput, serving latency, and memory often limit the complexity and input feature set of online ads ranking models. This challenge is magnified in extensive systems like Meta's, which encompass hundreds of models with diverse specifications, rendering the tailoring of user representation learning for each model impractical. To address these challenges, we present Scaling User Modeling (SUM), a framework widely deployed in Meta's ads ranking system, designed to facilitate efficient and scalable sharing of online user representations across hundreds of ads models. SUM leverages a few designated upstream user models to synthesize user embeddings from massive amounts of user features with advanced modeling techniques. These embeddings then serve as inputs to downstream online ads ranking models, promoting efficient representation sharing. To adapt to the dynamic nature of user features and ensure embedding freshness, we designed the SUM Online Asynchronous Platform (SOAP), a latency-free online serving system complemented with model freshness and embedding stabilization, which enables frequent user model updates and online inference of user embeddings upon each user request. We share our hands-on deployment experiences for the SUM framework and validate its superiority through comprehensive experiments. To date, SUM has been launched to hundreds of ads ranking models in Meta, processing hundreds of billions of user requests daily, yielding significant online metric gains and infrastructure cost savings.
Query intent classification is an essential module for helping customers quickly find desired products in e-commerce applications. Most existing query intent classification methods rely on users' click behavior as a supervised signal to construct training samples. However, these methods based entirely on posterior labels may lead to serious category imbalance problems because of the Matthew effect in click samples. Compared with popular categories, it is difficult for products under long-tail categories to obtain traffic and user clicks, which makes the models unable to detect users' intent for products under long-tail categories. This in turn aggravates the problem that long-tail categories cannot obtain traffic, forming a vicious circle. In addition, due to the randomness of users' clicks, posterior labels are unstable for queries with similar semantics, which makes the model very sensitive to the input and leads to unstable and incomplete recall of categories. In this paper, we propose a novel Semi-supervised Multi-channel Graph Convolutional Network (SMGCN) to address the above problems from the perspective of label association and semi-supervised learning. SMGCN extends category information and enhances posterior labels by utilizing the similarity score between queries and categories. Furthermore, it leverages category co-occurrence and semantic similarity graphs to strengthen the relations among labels and weaken the influence of posterior label instability. We conduct extensive offline and online A/B experiments, and the experimental results show that SMGCN significantly outperforms strong baselines, demonstrating its effectiveness and practicality.
Cloud failure prediction (e.g., disk failure prediction, memory failure prediction, node failure prediction, etc.) is a crucial task for ensuring the reliability and performance of cloud systems. However, the problem of class imbalance poses a huge challenge for accurate prediction, as the number of healthy components (majority class) in a cloud system is much larger than the number of failed components (minority class). The consequences of this class imbalance include biased model performance and insufficient learning, as the model may lack adequate information to learn the characteristics associated with cloud failures effectively. Moreover, current methods for addressing the class imbalance problem, such as SMOTE and its variants, exhibit certain drawbacks, such as generating noisy samples and struggling to maintain sample diversity, which limit their effectiveness in addressing the challenges presented by class imbalance in cloud failure prediction. In this paper, we propose a novel oversampling method for imbalanced classification, named SOIL (Score cOnditioned dIffusion modeL), which employs a score-conditioned diffusion model to generate high-quality synthetic samples for the minority class, more accurately representing real-world cloud failure patterns. By incorporating classification probabilities as conditional scores, SOIL supervises the generation process, effectively limiting noise production while maintaining sample diversity. In extensive experiments on various public and industrial datasets, adopting our method improves the cloud failure prediction model's F1-score by an average of 5.39%, and SOIL consistently outperforms state-of-the-art competitors in addressing the class imbalance problem, confirming its effectiveness and robustness. In addition, SOIL has been successfully applied to a global large-scale cloud platform serving billions of customers, demonstrating its practicability.
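The sketch below shows, under assumptions, what a score-conditioned diffusion training step could look like: minority-class feature vectors are noised with a DDPM-style schedule and a small denoiser is conditioned on the classifier probability. The network, names, and schedule are illustrative; SOIL's actual architecture may differ.

```python
import torch
import torch.nn as nn

class ScoreConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the injected noise from the noisy minority-class
    sample, the diffusion step, and the conditional score (the classifier's
    failure probability for the sample)."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim))

    def forward(self, x_t, t_frac, score):
        return self.net(torch.cat([x_t, t_frac, score], dim=-1))

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_train_step(model, opt, x0, score):
    """One DDPM-style step: corrupt minority samples x0 with Gaussian noise at a
    random step and train the denoiser to recover the noise, conditioned on the
    classification score (shape (B, 1))."""
    t = torch.randint(0, T, (x0.size(0),))
    a = alpha_bar[t].unsqueeze(-1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * noise
    pred = model(x_t, (t.float() / T).unsqueeze(-1), score)
    loss = ((pred - noise) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```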
Neural Architecture Search (NAS) has demonstrated its efficacy in computer vision and its potential for ranking systems. However, prior work has focused on academic problems, which are evaluated at small scale under well-controlled, fixed baselines. In industrial systems, such as the ranking systems at Meta, it is unclear whether NAS algorithms from the literature can outperform production baselines because of: (1) scale - Meta's ranking systems serve billions of users; (2) strong baselines - the baselines are production models optimized by hundreds to thousands of world-class engineers for years since the rise of deep learning; (3) dynamic baselines - engineers may have established new and stronger baselines during the NAS search; and (4) efficiency - the search pipeline must yield results quickly in alignment with the productionization life cycle. In this paper, we present Rankitect, a NAS software framework for ranking systems at Meta. Rankitect seeks to build brand new architectures by composing low-level building blocks from scratch. Rankitect implements and improves state-of-the-art (SOTA) NAS methods for comprehensive and fair comparison under the same search space, including sampling-based NAS, one-shot NAS, and Differentiable NAS (DNAS). We evaluate Rankitect by comparing it to multiple production ranking models at Meta. We find that Rankitect can discover new models from scratch that achieve a competitive trade-off between Normalized Entropy loss and FLOPs. When utilizing a search space designed by engineers, Rankitect can generate better models than engineers, achieving positive offline evaluation and online A/B test results at Meta scale.
This paper proposes the USer ViewING FLow ModEling (SINGLE) method for the article recommendation task, which models users' constant preferences and instant interests from their clicked articles. Specifically, we first employ a user constant viewing flow modeling method to summarize the user's general interest for recommending articles. In this case, we utilize Large Language Models (LLMs) to capture constant user preferences, such as skills and positions, from previously clicked articles. Then we design a user instant viewing flow modeling method to build interactions between the user's clicked-article history and candidate articles. It attentively reads the representations of user-clicked articles and aims to learn the user's different interest views to match the candidate article. Our experimental results on the Alibaba Technology Association (ATA) website show the advantage of SINGLE, which achieves a 2.4% improvement over previous baseline models in the online A/B test. Our further analyses illustrate that SINGLE can build a more tailored recommendation system by mimicking users' different article viewing behaviors and recommending more appropriate and diverse articles that match user interests.
In real-world coupon recommendation, the coupon allocation process is influenced both by the recommendation model trained with historical interaction data and by marketing tactics aimed at specific commercial goals. These tactics can cause an imbalance in user-coupon interactions, leading to a deviation from users' natural preferences. We refer to this deviation as the matching bias. Theoretically, unbiased data, which is assumed to be collected via a randomized allocation policy (i.e., without model or tactic intervention), is ideal training data because it reflects users' natural preferences. However, obtaining unbiased data in real-world scenarios is costly and sometimes infeasible.
To address this problem, we propose a novel model-agnostic training paradigm named Counterfactual Data Augmentation for debiased coupon recommendations based on Potential Knowledge (CDAPK) for the marketing scenario that allocates coupons with discounts. We leverage the counterfactual data augmentation technique to answer the following key question: if a user is offered a coupon that he has never seen before in his history, will he use this coupon? By creating counterfactual interaction data and assigning labels based on the potential knowledge of the given scenario, CDAPK shifts the original data distribution toward an unbiased distribution, facilitating model optimization and debiasing. The advantage of CDAPK lies in its ability to approximate the ideal state of the training data without depleting real-world traffic. We implement CDAPK on five representative models: FM, DNN, NCF, MASKNET, and DEEPFM, and conduct extensive offline and online experiments against SOTA debiasing methods to validate the superiority of CDAPK.
Sequential recommendation systems (SRS) are crucial in various applications as they enable users to discover relevant items based on their past interactions. Recent advancements involving large language models (LLMs) have shown significant promise in addressing intricate recommendation challenges. However, these efforts exhibit certain limitations. Specifically, directly extracting representations from an LLM based on items' textual features and feeding them into a sequential model offers no guarantee that the semantic information of the texts is preserved in these representations. Additionally, concatenating the textual descriptions of all items in an item sequence into a long text and feeding it into an LLM for recommendation results in lengthy token sequences, which largely diminishes practical efficiency.
In this paper, we introduce SAID, a framework that utilizes LLMs to explicitly learn Semantically Aligned item ID embeddings based on texts. For each item, SAID employs a projector module to transform an item ID into an embedding vector, which is fed into an LLM to elicit the exact descriptive text tokens accompanying the item. The item embeddings are thus forced to preserve fine-grained semantic information of the textual descriptions. Further, the learned embeddings can be integrated with lightweight downstream sequential models for practical recommendation. In this way, SAID circumvents the lengthy token sequences of previous works, reducing the resources required in industrial scenarios while achieving superior recommendation performance. Experiments on six public datasets demonstrate that SAID outperforms baselines by about 5% to 15% in terms of NDCG@10. Moreover, SAID has been deployed in Alipay's online advertising platform, achieving a 3.07% relative improvement in cost per mille (CPM) over baselines, with an online response time of under 20 milliseconds.
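A minimal sketch of the alignment idea, assuming a generic frozen causal LM interface (any wrapper that maps input embeddings to vocabulary logits) and its frozen token embedding table: the projector's item embedding is prepended to the description embeddings and trained so the LM can reproduce the description tokens. Module and argument names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ItemProjector(nn.Module):
    """Maps an item ID to a single embedding in the LM's input space."""
    def __init__(self, num_items, lm_dim):
        super().__init__()
        self.table = nn.Embedding(num_items, lm_dim)

    def forward(self, item_ids):
        return self.table(item_ids)                        # (B, lm_dim)

def said_alignment_loss(projector, frozen_lm, token_emb, item_ids, text_ids):
    """frozen_lm: maps input embeddings (B, L, D) to vocab logits (B, L, V).
    The item embedding is prepended to the description embeddings and the LM
    must reproduce every description token, so the item embedding is forced
    to carry the text's semantics."""
    item_emb = projector(item_ids).unsqueeze(1)                 # (B, 1, D)
    text_emb = token_emb(text_ids)                              # (B, L, D)
    logits = frozen_lm(torch.cat([item_emb, text_emb], dim=1))  # (B, 1+L, V)
    pred = logits[:, :-1]                                       # position t predicts token t
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), text_ids.reshape(-1))
```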
Combining contextual information (i.e., side information) of items beyond IDs has become an important way to improve performance in recommender systems. Existing self-attention-based side information fusion methods can be categorized into early, late, and hybrid fusion. In practice, naive early fusion may interfere with the representation of IDs, resulting in negative effects, while late fusion misses effective interactions between IDs and side information. Some hybrid methods have been proposed to address these issues, but they only utilize side information in calculating attention scores, which may lead to information loss. To harness the full potential of side information without noisy interference, we propose an Aligned Side Information Fusion (ASIF) method for sequential recommendation, consisting of two parts: Fused Attention with Untied Positions and Representation Alignment. Specifically, we first decouple the positions to exclude noisy interference from the attention scores. Secondly, we adopt a contrastive objective to maintain the semantic consistency between IDs and side information and then employ orthogonal decomposition to extract the homogeneous parts. By aligning the representations and fusing them together, ASIF makes full use of the side information without interfering with the IDs. Offline experimental results on four datasets demonstrate the superiority of ASIF. Additionally, we successfully deployed the model in Alipay's advertising system and achieved improvements of 1.09% in clicks and 1.86% in Cost Per Mille (CPM).
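The orthogonal decomposition step can be illustrated in a few lines: project each side-information embedding onto the direction of its item-ID embedding and keep the parallel (homogeneous) component for fusion. This is a sketch of the general technique, not necessarily ASIF's exact formulation.

```python
import torch
import torch.nn.functional as F

def decompose_side_info(side_emb, id_emb):
    """Split each side-information embedding into the component parallel to the
    corresponding item-ID embedding (the homogeneous part kept for fusion) and
    the orthogonal remainder treated as potentially noisy."""
    id_dir = F.normalize(id_emb, dim=-1)
    coeff = (side_emb * id_dir).sum(dim=-1, keepdim=True)
    parallel = coeff * id_dir
    orthogonal = side_emb - parallel
    return parallel, orthogonal
```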
In this paper, we present OmniSearchSage, a versatile and scalable system for understanding search queries, pins, and products for Pinterest search. We jointly learn a unified query embedding coupled with pin and product embeddings, leading to an improvement of >8% relevance, >7% engagement, and >5% ads CTR in Pinterest's production search system. The main contributors to these gains are improved content understanding, better multi-task learning, and real-time serving. We enrich our entity representations using diverse text derived from image captions from a generative LLM, historical engagement, and user-curated boards. Our multitask learning setup produces a single search query embedding in the same space as pin and product embeddings and compatible with pre-existing pin and product embeddings. We show the value of each feature through ablation studies, and show the effectiveness of a unified model compared to standalone counterparts. Finally, we share how these embeddings have been deployed across the Pinterest search stack, from retrieval to ranking, scaling to serve 300k requests per second at low latency. Our implementation of this work is available at this link https://github.com/pinterest/atg-research/tree/main/omnisearchsage.
User response modeling can enhance the learning of user representations and further improve a reinforcement learning (RL) recommender agent. However, since users' behaviors are influenced by both their long-term preferences and short-term stochastic factors (e.g., weather, mood, or fashion trends), this remains challenging for previous works that focus on recurrent neural network-based user response modeling. Meanwhile, due to the dynamic interests of users, it is often unrealistic to assume that user dynamics are stationary. Drawing inspiration from opponent modeling, we propose a novel network structure, the Deep User Q-Network (DUQN), which incorporates a user response probabilistic model into the Q-learning ads allocation strategy to capture the effect of the non-stationary user policy on Q-values. Moreover, we utilize the Recurrent State-Space Model (RSSM) to develop the user response model, which includes deterministic and stochastic components, enabling us to fully consider users' long-term preferences and short-term stochastic factors. In particular, we design a RetNet version of RSSM (R-RSSM) to support parallel computation. The R-RSSM model can be further used for multi-step predictions to enable bootstrapping over multiple steps simultaneously. Finally, we conduct extensive experiments on a large-scale offline dataset from the Meituan food delivery platform and a public benchmark. Experimental results show that our method yields superior performance to state-of-the-art (SOTA) baselines. Moreover, our model demonstrates a significant improvement in the online A/B test and has been fully deployed on the industrial Meituan platform, serving more than 500 million customers.
Despite significant reliability efforts, large-scale cloud services inevitably experience production incidents that can significantly impact service availability and customer satisfaction. Worse, in many cases one incident can lead to multiple downstream failures due to cascading effects that create several related incidents across different dependent services. Oftentimes, On-call Engineers (OCEs) examine these incidents in silos, which leads to significant manual effort and increases the overall time to mitigate incidents. Therefore, developing efficient incident linking models is of paramount importance for grouping related incidents into clusters so as to quickly resolve major outages and reduce on-call fatigue. Existing incident linking methods mostly leverage textual and contextual information of incidents (e.g., title, description, severity, impacted components), thus failing to exploit the inter-dependencies between services. In this paper, we propose the dependency-aware incident linking (DiLink) framework, which leverages both textual and service dependency graph information to improve the accuracy and coverage of incident links that emerge not only within the same service but also across different services and workloads. Furthermore, we propose a novel method to align the embeddings of multi-modal (i.e., textual and graphical) data using Orthogonal Procrustes. Extensive experimental results on real-world incidents from 5 workloads of Microsoft demonstrate that our alignment method achieves an F1-score of 0.96 (a 14% gain over current state-of-the-art methods). We are also in the process of deploying this solution across 610 services from these 5 workloads to continuously support OCEs, improve incident management, and reduce manual effort.
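The embedding-alignment step relies on the classical orthogonal Procrustes problem, whose closed-form SVD solution is sketched below; how DiLink applies the learned mapping is simplified here for illustration.

```python
import numpy as np

def procrustes_align(text_emb, graph_emb):
    """Classical orthogonal Procrustes: find the orthogonal map R minimizing
    ||text_emb @ R - graph_emb||_F, then project the textual embeddings into
    the service-dependency-graph embedding space. Rows of the two matrices
    are assumed to correspond to the same incidents."""
    u, _, vt = np.linalg.svd(text_emb.T @ graph_emb, full_matrices=False)
    r = u @ vt
    return text_emb @ r, r
```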
Recommender systems are essential for finding personalized content for users on online platforms. These systems are often trained on historical user interaction data, which collects user feedback on system recommendations. This creates a feedback loop leading to popularity bias; popular content is over-represented in the data, better learned, and thus recommended even more. Less popular content struggles to reach its potential audiences. Popularity bias limits the diversity of content that users are exposed to, and makes it harder for new creators to gain traction. Existing methods to alleviate popularity bias tend to trade off the performance of popular items. In this work, we propose a new method for alleviating popularity bias in recommender systems, called the cluster anchor regularization, which partitions the large item corpus into hierarchical clusters, and then leverages the cluster information of each item to facilitate transfer learning from head items to tail items. Our results demonstrate the effectiveness of the proposed method with offline analyses and live experiments on a large-scale industrial recommendation platform, where it significantly increases tail recommendation without hurting the overall user experience.
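A minimal sketch of what a cluster anchor regularizer could look like, assuming item embeddings and hard cluster assignments are available: each item is pulled toward its cluster centroid so tail items share statistical strength with head items in the same cluster. The exact form of the regularizer in the deployed system is not specified here; this is only an assumed variant.

```python
import torch

def cluster_anchor_reg(item_emb, cluster_ids, weight=0.1):
    """Pull each item embedding toward the centroid (anchor) of its cluster.
    item_emb: (N, D) float tensor; cluster_ids: (N,) long tensor of assignments."""
    num_clusters = int(cluster_ids.max()) + 1
    anchors = torch.zeros(num_clusters, item_emb.size(1), device=item_emb.device)
    counts = torch.zeros(num_clusters, device=item_emb.device)
    anchors.index_add_(0, cluster_ids, item_emb)
    counts.index_add_(0, cluster_ids, torch.ones_like(cluster_ids, dtype=item_emb.dtype))
    anchors = anchors / counts.clamp(min=1).unsqueeze(-1)
    # stop gradients through the anchors so only the items move toward them
    return weight * ((item_emb - anchors[cluster_ids].detach()) ** 2).mean()
```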
Reranking plays a crucial role in modern multi-stage recommender systems by rearranging the initial ranking list to model the interplay between items. Considering the inherent challenges of reranking, such as the combinatorial search space, some previous studies have adopted the evaluator-generator paradigm, with a generator producing feasible sequences and an evaluator selecting the best one based on estimated listwise utility. This paper explores the potential of diffusion models for generating high-quality sequences in reranking tasks, as the intrinsic nature of diffusion models is to improve generation quality by iteratively refining generated samples. However, we argue that it is nontrivial to use diffusion models as the generator in the context of recommendation. Firstly, diffusion models primarily operate in continuous data space, differing from the discrete data space of item permutations. Secondly, the recommendation task is different from conventional generation tasks, as the purpose of recommender systems is to fulfill user interests. Lastly, real-life recommender systems require efficiency, posing challenges for the inference of diffusion models.
To overcome these challenges, we propose a novel Discrete Conditional Diffusion Reranking (DCDR) framework for recommendation. DCDR extends traditional diffusion models by introducing a discrete forward process with tractable posteriors, which adds noise to item sequences through step-wise discrete operations (e.g., swapping). Additionally, DCDR incorporates a conditional reverse process that generates item sequences conditioned on expected user responses. For efficient and robust inference, we propose several optimizations to enable the deployment of DCDR in real-life recommender systems. Extensive offline experiments conducted on public datasets demonstrate that DCDR outperforms state-of-the-art reranking methods. Furthermore, DCDR has been deployed in a real-world video app with over 300 million daily active users, significantly enhancing online recommendation quality.
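To illustrate the discrete forward process, the sketch below corrupts an item permutation with a number of random adjacent swaps that grows with the diffusion step; the swap schedule is an assumption for exposition, since DCDR only specifies step-wise discrete operations such as swapping.

```python
import random

def swap_noise(sequence, step, total_steps):
    """Discrete forward process: corrupt an item permutation with random
    adjacent swaps; later diffusion steps apply more swaps, moving the list
    closer to a uniform shuffle."""
    seq = list(sequence)
    if len(seq) < 2:
        return seq
    n_swaps = max(1, int(len(seq) * step / total_steps))
    for _ in range(n_swaps):
        i = random.randrange(len(seq) - 1)
        seq[i], seq[i + 1] = seq[i + 1], seq[i]
    return seq
```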
People enjoy sharing "notes" including their experiences within online communities. Therefore, recommending notes aligned with user interests has become a crucial task. Existing online methods only input notes into BERT-based models to generate note embeddings for assessing similarity. However, they may underutilize some important cues, e.g., hashtags or categories, which represent the key concepts of notes. Indeed, learning to generate hashtags/categories can potentially enhance note embeddings, both of which compress key note information into limited content. Besides, Large Language Models (LLMs) have significantly outperformed BERT in understanding natural languages. It is promising to introduce LLMs into note recommendation. In this paper, we propose a novel unified framework called NoteLLM, which leverages LLMs to address the item-to-item (I2I) note recommendation. Specifically, we utilize Note Compression Prompt to compress a note into a single special token, and further learn the potentially related notes' embeddings via a contrastive learning approach. Moreover, we use NoteLLM to summarize the note and generate the hashtag/category automatically through instruction tuning. Extensive validations on real scenarios demonstrate the effectiveness of our proposed method compared with the online baseline and show major improvements in the recommendation system of Xiaohongshu.
In online advertising scenarios, sellers often create multiple creatives to provide comprehensive demonstrations, making it essential to present the most appealing design to maximize the Click-Through Rate (CTR). However, sellers generally struggle to account for users' preferences in creative design, leading to lower aesthetic quality and smaller quantities compared with Artificial Intelligence (AI)-based approaches. Traditional AI-based approaches still face the same problem of not considering user information, while also having limited aesthetic knowledge from designers. In fact, by fusing user information, the generated creatives can be more attractive, because different users may have different preferences. To optimize the results, the creatives generated by traditional methods are then ranked by another module, named the creative ranking model, which predicts a CTR score for each creative by considering user features. However, the two stages above (generating creatives and ranking creatives) are regarded as two different tasks and are optimized separately. Specifically, generating creatives in the first stage without considering the goal of improving CTR may produce several creatives of poor quality, which dilutes online impressions and directly degrades online results.
In this paper, we propose a new automated Creative Generation pipeline for Click-Through Rate (CG4CTR), with the goal of improving CTR during the creative generation stage. In this pipeline, a new creative is automatically generated and selected by the Stable Diffusion method with a LoRA model and two novel models: a prompt model and a reward model. Our contributions have four parts: 1) The inpainting mode of the Stable Diffusion method is applied to the creative image generation task in the online advertising scene for the first time, and a self-cyclic generation pipeline is proposed to ensure the convergence of training. 2) The prompt model is designed to generate individualized creative images for different user groups, which can further improve the diversity and quality of the generated creatives. 3) The reward model comprehensively considers the multi-modal features of image and text to improve the effectiveness of the creative ranking task, and it is also critical in the self-cyclic generation pipeline. 4) The significant benefits obtained in online and offline experiments verify the significance of our proposed method.
Click-Through Rate (CTR) prediction plays a critical role in sponsored search. Modeling the semantic relevance between queries and ads is one of the most crucial factors affecting the performance of CTR prediction. However, different users have different sensitivities to semantic relevance due to their personalized relevance preferences. Therefore, semantic relevance may have different incentive effects on a user's click probability (i.e., a stimulative, inhibitive, or irrelevant incentive). Unfortunately, few works have studied this phenomenon, thereby ignoring the complicated incentive effects of semantic relevance and limiting the performance of CTR prediction.
To this end, we propose a novel Personalized Relevance Incentive NeTwork (PRINT for short) to explicitly model the personalized incentives of query-ad semantic relevance on a user's click probability. Specifically, we introduce a User Relevance Preference Module to extract the user's personalized relevance preference from the historical query-ad interaction sequence. Then, a RElevance Incentive Module (REIM) is designed to discern the three incentive types and model the personalized incentive effects on CTR prediction. Experiments on public and industrial datasets demonstrate the significant improvements of PRINT. Furthermore, PRINT has been deployed in the sponsored search advertising system of Meituan, obtaining improvements of 1.94% and 2.29% in CTR and Cost Per Mille (CPM), respectively. We publish the source code at https://anonymous.4open.science/r/PRINT-D365/.
Integrated ranking is a critical component in industrial recommendation platforms. It combines candidate lists from different upstream channels or sources and ranks them into an integrated list, which is then exposed to users. During this process, to be accountable to channel providers, the integrated ranking system needs to consider exposure fairness among channels, which directly affects the opportunities of different channels to be displayed to users. Besides, personalization also requires the integrated ranking system to consider the user's diverse preferences over channels in addition to items. Existing methods struggle to address both problems effectively. In this paper, we propose a Hierarchical Fairness-aware Integrated ranking (HiFI) framework. It contains a channel recommender and an item recommender, and the fairness constraint on channels is enforced with constrained RL. We also design a gated attention layer (GAL) to effectively capture users' multi-faceted preferences. We compare HiFI with various baselines on public and industrial datasets, and HiFI achieves state-of-the-art performance on both utility and fairness metrics. We also conduct an online A/B test to further validate the effectiveness of HiFI.
The problem of search relevance in the e-commerce domain is a challenging one, since it involves understanding the intent of a user's short, nuanced query and matching it with the appropriate products in the catalog. This problem has traditionally been addressed using language models (LMs) and graph neural networks (GNNs) to capture semantic and inter-product behavior signals, respectively. However, the rapid development of new architectures has created a gap between research and the practical adoption of these techniques. Evaluating the generalizability of these models for deployment requires extensive experimentation on complex, real-world datasets, which can be non-trivial and expensive. Furthermore, such models often operate on latent space representations that are incomprehensible to humans, making it difficult to evaluate and compare the effectiveness of different models. This lack of interpretability hinders the development and adoption of new techniques in the field. To bridge this gap, we propose the Plug and Play Graph LAnguage Model (PP-GLAM), an explainable ensemble of plug and play models. Our approach uses a modular framework with uniform data processing pipelines. It employs additive explanation metrics to independently decide whether to include (i) language model candidates, (ii) GNN model candidates, and (iii) inter-product behavioral signals. For the task of search relevance, we show that PP-GLAM outperforms several state-of-the-art baselines as well as a proprietary model on real-world multilingual, multi-regional e-commerce datasets. To promote better model comprehensibility and adoption, we also provide an analysis of the explainability and computational complexity of our model. We also release the public codebase and provide a deployment strategy for practical implementation.
Compared to business-to-consumer (B2C) e-commerce systems, consumer-to-consumer (C2C) e-commerce platforms usually encounter the limited-stock problem, that is, a product can only be sold once in a C2C system. This poses several unique challenges for click-through rate (CTR) prediction. Due to the limited user interactions for each product (i.e., item), the corresponding item embedding in the CTR model may not easily converge. As a result, conventional sequence-modeling-based approaches cannot effectively utilize user history information, since historical user behaviors contain a mixture of items with different volumes of stock. In particular, the attention mechanism in a sequence model tends to assign higher scores to products with more accumulated user interactions, causing limited-stock products to be ignored and to contribute less to the final output. To this end, we propose the Meta-Split Network (MSNet), which splits the user history sequence according to the stock volume of each product and adopts differentiated modeling approaches for the different sequences. For limited-stock products, a meta-learning approach is applied to address the convergence problem, achieved by designing meta scaling and shifting networks with ID and side information. In addition, a traditional approach can hardly update an item embedding once the product is sold out. We therefore propose an auxiliary loss that keeps the parameters updatable even when the product is no longer in distribution. To the best of our knowledge, this is the first solution addressing the recommendation of limited-stock products. Experimental results on a production dataset and online A/B testing demonstrate the effectiveness of our proposed method.
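A minimal sketch of the sequence-splitting idea, assuming a per-item stock lookup is available; the threshold that defines "limited stock" is an illustrative assumption.

```python
def split_by_stock(behavior_seq, stock_lookup, threshold=1):
    """Split a user's clicked-item sequence into a limited-stock sub-sequence
    and a normal sub-sequence so each can be modeled by its own branch.
    stock_lookup maps item_id -> stock volume."""
    limited, normal = [], []
    for item_id in behavior_seq:
        target = limited if stock_lookup.get(item_id, 0) <= threshold else normal
        target.append(item_id)
    return limited, normal
```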
Uplift modeling, vital in online marketing, seeks to accurately measure the impact of various strategies, such as coupons or discounts, on different users by predicting the Individual Treatment Effect (ITE). In an e-commerce setting, user behavior follows a defined sequential chain, including impression, click, and conversion. Marketing strategies exert varied uplift effects at each stage within this chain, impacting metrics like click-through and conversion rates. Despite its utility, existing research has neglected the inter-task impacts across all stages within a specific treatment and has insufficiently utilized the treatment information, potentially introducing substantial bias into subsequent marketing decisions. We identify these two issues as the chain-bias problem and the treatment-unadaptive problem. This paper introduces the Entire Chain UPlift method with context-enhanced learning (ECUP), devised to tackle these issues. ECUP consists of two primary components: 1) the Entire Chain-Enhanced Network, which utilizes user behavior patterns to estimate ITE throughout the entire chain space, models the various impacts of treatments on each task, and integrates task prior information to enhance context awareness across all stages; and 2) the Treatment-Enhanced Network, which facilitates fine-grained treatment modeling through bit-level feature interactions, thereby enabling adaptive feature adjustment. Extensive experiments on public and industrial datasets validate ECUP's effectiveness. Moreover, ECUP has been deployed on the Meituan food delivery platform, serving millions of daily active users, with the related dataset released for future research.
The deployment of Large Multimodal Models (LMMs) within Ant Group has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, deploying such sizable models introduces challenges, particularly increased latency and carbon emissions, which are antithetical to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LMM, AntGMM. Our methodology pivots on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months, starting in September 2023. Notably, our approach achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to green AI initiatives.
Modern Web APIs allow developers to provide extensively customized experiences for website visitors, but the richness of the device information they provide also makes them vulnerable to being abused to construct browser fingerprints: device-specific identifiers that enable covert tracking of users even when cookies are disabled.
Previous research has established entropy, a measure of information, as the key metric for quantifying fingerprinting risk. However, earlier studies had two major limitations. First, their entropy estimates were based on either a single website or a very small sample of devices. Second, they did not adequately consider correlations among different Web APIs, potentially grossly overestimating their fingerprinting risk.
We provide the first study of browser fingerprinting which addresses the limitations of prior work. Our study is based on actual visited pages and Web APIs reported by tens of millions of real Chrome browsers in-the-wild. We accounted for the dependencies and correlations among Web APIs, which is crucial for obtaining more realistic entropy estimates. We also developed a novel experimental design that accurately and efficiently estimates entropy while never observing too much information from any single user. Our results provide an understanding of the distribution of entropy for different website categories, confirm the utility of entropy as a fingerprinting proxy, and offer a method for evaluating browser enhancements which are intended to mitigate fingerprinting.
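The core measurement here is standard Shannon entropy, and the role of correlations can be seen by comparing the sum of per-API entropies with their joint entropy, as in the small self-contained example below (the toy values are illustrative).

```python
import math
from collections import Counter

def entropy_bits(values):
    """Shannon entropy (bits) of the observed values of one Web API across devices."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def joint_entropy_bits(value_tuples):
    """Joint entropy of several APIs observed together; comparing it with the sum
    of individual entropies shows how much correlation reduces the combined
    fingerprinting risk."""
    return entropy_bits(value_tuples)

# Toy example: two perfectly correlated APIs add no information beyond one of them.
api_a = ["x", "x", "y", "y"]
api_b = ["1", "1", "2", "2"]
print(entropy_bits(api_a) + entropy_bits(api_b))        # 2.0 bits under an independence assumption
print(joint_entropy_bits(list(zip(api_a, api_b))))      # 1.0 bit once the correlation is accounted for
```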
Recommender systems have made significant strides in various industries, primarily driven by extensive efforts to enhance recommendation accuracy. However, this pursuit of accuracy has inadvertently given rise to echo chamber/filter bubble effects. Especially in industry, these effects can impair users' experiences and prevent users from accessing a wider range of items. One solution is to take diversity into account. However, most existing works focus on users' explicit preferences, while rarely exploring users' non-interaction preferences. These neglected non-interaction preferences are especially important for broadening users' interests and alleviating echo chamber/filter bubble effects. Therefore, in this paper, we first give diversity two distinct definitions, i.e., user-explicit diversity (U-diversity) and user-item non-interaction diversity (N-diversity), based on users' historical behaviors. Then, we propose a succinct and effective method, named the Controllable Category Diversity Framework (CCDF), to achieve both high U-diversity and N-diversity simultaneously. Specifically, CCDF consists of two stages, User-Category Matching and Constrained Item Matching. The User-Category Matching stage utilizes the DeepU2C model and a combined loss to capture users' preferences over categories, and then selects the top-K categories with a controllable parameter K. These top-K categories are used as trigger information in Constrained Item Matching. Offline experimental results show that our proposed DeepU2C outperforms state-of-the-art diversity-oriented methods, especially on the N-diversity task. The whole framework is validated in a real-world production environment through online A/B testing. The improved conversion rate and diversity metrics demonstrate the superiority of our proposed framework in industrial applications. Further analysis supports the complementary effect between recommendation and search: diversified recommendation effectively helps users discover new needs and then inspires them to refine their demands in search.
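A minimal sketch of the two-stage idea, assuming precomputed user, category, and item vectors: a user-to-category score (a dot product stands in for DeepU2C here) selects the top-K trigger categories, and item matching is then constrained to those categories. Names and scoring functions are illustrative.

```python
import numpy as np

def recommend_with_category_control(user_vec, cat_embs, item_embs, item_cats,
                                    k_cats=5, n_items=20):
    """Stage 1: score categories against the user and keep the top-K, where K is
    the controllable diversity knob. Stage 2: score items but restrict the
    candidates to the selected trigger categories."""
    cat_scores = cat_embs @ user_vec
    top_cats = set(np.argsort(-cat_scores)[:k_cats])
    allowed = np.array([c in top_cats for c in item_cats])
    item_scores = item_embs @ user_vec
    item_scores[~allowed] = -np.inf                    # constrained item matching
    return np.argsort(-item_scores)[:n_items]
```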
Session-based recommender systems (SBRSs) predict users' next interacted items based on their historical activities. While most SBRSs capture purchasing intentions locally within each session, capturing items' global information across different sessions is crucial for characterizing their general properties. Previous works capture this cross-session information by constructing graphs and incorporating neighbor information. However, this incorporation cannot vary adaptively according to the unique intention of each session, and the constructed graphs consist of only one type of user-item interaction. To address these limitations, we propose knowledge graph-based session recommendation with session-adaptive propagation. Specifically, we build a knowledge graph by connecting items with multi-typed edges to characterize various user-item interactions. Then, we adaptively aggregate items' neighbor information considering the user intention within the learned session. Experimental results demonstrate that equipping session recommendation backbones with our constructed knowledge graph and session-adaptive propagation enhances them by 10%-20%. Moreover, we provide an industrial case study showing that our proposed framework achieves a 2% performance boost over an existing well-deployed model at The Home Depot e-platform.
The high false positive (FP) rate of authentication alerts remains a prominent challenge in cybersecurity. We identify two problems that cause this issue and that are unaddressed in existing learning-based anomaly detection methods. First, in industrial applications, ground-truth labels for malicious authentication events are extremely scarce. Therefore, learning-based methods must optimize their procedures for auto-generating high-quality training instances, an aspect that existing works have overlooked. Second, every existing model is based on a single form of data representation, either a stream or graph snapshots, which may not be expressive enough to capture the heterogeneity in the behaviors of networked entities. This results in misclassifying a legitimate but differently-behaved authentication event as an anomalous one. We address these problems by proposing a new framework based on self-supervised link prediction on dynamic authentication networks, with two highlighted features: (1) our framework is based on the unification of the two most popular views of dynamic interconnected systems, graph snapshots and link streams, ensuring the best coverage of behavioral heterogeneity; (2) to generate high-quality training samples, we propose a carefully designed negative sampling procedure called filtered rewiring, which ensures that the negative samples used for training are both truly negative and instructive. We validate our framework on 4 months of authentication data from 125 randomly selected, real organizations that subscribe to Microsoft's defense services.
Modern large-scale recommender systems are built upon computation-intensive infrastructure and usually suffer from a huge difference in traffic between peak and off-peak periods. In peak periods, it is challenging to perform real-time computation for each request due to the limited budget of computational resources. Recommendation with a cache is a solution to this problem, where a user-wise result cache is used to provide recommendations when the recommender system cannot afford a real-time computation. However, the cached recommendations are usually suboptimal compared to real-time computation, and it is challenging to determine the items in the cache for each user. In this paper, we provide a cache-aware reinforcement learning (CARL) method to jointly optimize the recommendations made by real-time computation and by the cache. We formulate the problem as a Markov decision process with user states and a cache state, where the cache state represents whether the recommender system performs recommendations by real-time computation or by the cache. The cache state is determined by the computational load of the recommender system. We perform reinforcement learning based on such a model to improve user engagement over multiple requests. Moreover, we show that the cache introduces a challenge called critic dependency, which deteriorates the performance of reinforcement learning. To tackle this challenge, we propose an eigenfunction learning (EL) method to learn independent critics for CARL. Experiments show that CARL can significantly improve users' engagement when the result cache is considered. CARL has been fully launched in the Kwai app, serving over 100 million users.
Recent breakthroughs in large models have highlighted the critical significance of data scale, labels, and modalities. In this paper, we introduce MS MARCO Web Search, the first large-scale information-rich web dataset, featuring millions of real clicked query-document labels. This dataset closely mimics real-world web document and query distributions, provides rich information for various kinds of downstream tasks, and encourages research in areas such as generic end-to-end neural indexer models, generic embedding models, and next-generation information access systems with large language models. MS MARCO Web Search offers a retrieval benchmark with three web retrieval challenge tasks that demand innovations in both machine learning and information retrieval system research. As the first dataset that meets the large, real, and rich data requirements, MS MARCO Web Search paves the way for future advancements in AI and systems research. The MS MARCO Web Search dataset is available at: https://github.com/microsoft/MS-MARCO-Web-Search.
Carpooling Route Planning (CRP) has become an important issue with the growth of low-carbon traffic systems. We investigate a meaningful and challenging scenario for CRP in industry, where each passenger may have several potential positions to get on and off the car. Traditional graph search algorithms or indexing methods usually consume a lot of time and space or perform poorly.
In this paper, we propose an end-to-end encoder-decoder model to plan a route for each many-to-one carpooling order, with various data-driven mechanisms such as graph partitioning and feature crossover. The encoder is a filter-integrated Graph Convolutional Network with external information fusion, combined with a supervised pre-training classification task, while the decoder mimics a pointer network with a rule-based mask mechanism and a domain feature crossover module. We validate the effectiveness and efficiency of our model on both synthetic and real-world datasets.
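One decoding step of a pointer-style decoder with a rule-based mask might look like the sketch below, where a drop-off node becomes selectable only after its paired pick-up has been visited; all names and the encoding of the masking rule are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pointer_step(query, node_embs, visited, pickup_done):
    """One pointer-decoder step with a rule-based mask.
    query: (D,) decoder state; node_embs: (N, D) candidate node embeddings;
    visited: (N,) bool, nodes already placed on the route;
    pickup_done: (N,) bool, True for pick-up nodes and for drop-off nodes
    whose paired pick-up has already been served."""
    scores = node_embs @ query                        # (N,) attention logits
    forbidden = visited | ~pickup_done                # no revisits, no premature drop-offs
    scores = scores.masked_fill(forbidden, float("-inf"))
    probs = F.softmax(scores, dim=-1)
    return int(torch.argmax(probs))                   # index of the next node on the route
```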
Click-through rate (CTR) prediction is a core task in recommender systems. Existing methods (IDRec for short), which rely on unique identities to represent distinct users and items, have prevailed for decades. On the one hand, IDRec often faces significant performance degradation on the cold-start problem; on the other hand, IDRec cannot use longer training data due to constraints imposed by iteration efficiency. Most prior studies alleviate the above problems by introducing pre-trained knowledge (e.g., a pre-trained user model or multi-modal embeddings). However, the huge number of parameters in the pre-trained model leads to an explosive growth in online latency. Therefore, most such approaches cannot be trained end-to-end in a unified model with IDRec in industrial recommender systems, limiting the potential of the pre-trained model.
To this end, we propose a pre-trained plug-in CTR model, namely PPM. PPM employs multi-modal features as input and utilizes large-scale data for pre-training. PPM is then plugged into the IDRec model to enhance the unified model's performance and iteration efficiency. Upon incorporation into the IDRec model, certain intermediate results within the network are cached, with only a subset of the parameters participating in training and serving. Hence, our approach can successfully deploy an end-to-end model without causing a huge increase in latency. Comprehensive offline experiments and online A/B testing at JD E-commerce demonstrate the efficiency and effectiveness of PPM.
The task of stock earnings forecasting has received considerable attention due to the demand from investors in real-world scenarios. However, compared with financial institutions, it is not easy for ordinary investors to mine factors and analyze news. On the other hand, although large language models in the financial field can serve users in the form of dialogue robots, they still require users to have financial knowledge to ask reasonable questions. To improve the user experience, we aim to build an automatic system, FinReport, that helps ordinary investors collect information, analyze it, and generate reports after summarizing.
Specifically, FinReport is based on financial news announcements and a multi-factor model to ensure the professionalism of the report. FinReport consists of three modules: a news factorization module, a return forecasting module, and a risk assessment module. The news factorization module understands news information and combines it with stock factors, the return forecasting module aims to analyze the impact of news on market sentiment, and the risk assessment module is adopted to control investment risk. Extensive experiments on real-world datasets verify the effectiveness and explainability of our proposed FinReport. Our code and datasets are available at https://github.com/frinkleko/FinReport.
In online platforms like eBay, sponsored search advertising has become instrumental for businesses aiming for enhanced visibility. However, in automated ad auctions, the sellers (ad campaigns) run the risk of exhausting their budgets prematurely in the absence of proper pacing strategies. In response to this, online platforms have been prompted to employ budget pacing strategies to maintain consistent spending patterns for their sellers. While numerous budget pacing strategies have been introduced, they predominantly stem from either empirical or theoretical perspectives, often functioning in isolation. This paper aims to bridge this gap by investigating the performance of a theoretically inspired optimization-based bid shading method, AdaptivePacing, within eBay's sponsored search environment and proposing variants of the algorithm tailored to real-world environments. Our findings highlight the benefits of applying theoretical pacing approaches in practical contexts. Specifically, the optimization-based AdaptivePacing method offers the platform flexible control over campaign spending patterns, accounts for business constraints, and suggests tailored strategies for distinct advertisers. Furthermore, when evaluating AdaptivePacing alongside established empirical methods, we demonstrate its practical effectiveness and pinpoint areas for further refinement.
E-commerce platforms typically store and structure product information and search data in a hierarchy. Efficiently categorizing user search queries into a similar hierarchical structure is paramount in enhancing user experience on e-commerce platforms as well as news curation and academic research. The significance of this task is amplified when dealing with sensitive query categorization or critical information dissemination, where inaccuracies can lead to considerable negative impacts. The inherent complexity of hierarchical query classification is compounded by two primary challenges: (1) the pronounced class imbalance that skews towards dominant categories, and (2) the inherent brevity and ambiguity of search queries that hinder accurate classification.
To address these challenges, we introduce a novel framework that leverages hierarchical information through (i) enhanced representation learning that utilizes a contrastive loss to discern fine-grained instance relationships within the hierarchy, called "instance hierarchy", and (ii) a nuanced hierarchical classification loss that attends to the intrinsic label taxonomy, named "label hierarchy". Additionally, based on our observation that certain unlabeled queries share typographical similarities with labeled queries, we propose a neighborhood-aware sampling technique to intelligently select these unlabeled queries to boost classification performance. Extensive experiments demonstrate that our proposed method outperforms the state-of-the-art (SOTA) on the proprietary Amazon dataset, and is comparable to SOTA on the public Web of Science and RCV1-V2 datasets. These results underscore the efficacy of our proposed solution and pave the path toward the next generation of hierarchy-aware query classification systems.
In the field of Natural Language Processing (NLP), sentence pair classification is important in various real-world applications. Bi-encoders are commonly used to address these problems due to their low latency and their ability to act as effective retrievers. However, bi-encoders often under-perform cross-encoders by a significant margin. To address this gap, many Knowledge Distillation (KD) techniques have been proposed. Most existing KD methods focus solely on the prediction scores of cross-encoder models and overlook the fact that cross-encoders and bi-encoders have fundamentally different input structures. In this work, we introduce a novel knowledge distillation approach called DISKCO, which DISentangles the Knowledge learned in Cross-encoder models, especially multi-head cross-attention models, and transfers it to bi-encoder models. DISKCO leverages the information encoded in the cross-attention weights of the trained cross-encoder model and provides it as contextual cues for the student bi-encoder model during training and inference. DISKCO combines the benefits of independent encoding for low-latency applications with the knowledge acquired from cross-encoders, resulting in improved performance. Empirically, we demonstrate the effectiveness of DISKCO on proprietary and various publicly available datasets. Our experiments show that DISKCO outperforms traditional knowledge distillation methods by up to 2%.
In the recommender system of Meituan Waimai, we are dealing with ever-lengthening user behavior sequences, which pose an increasing challenge to modeling user preference effectively. A number of existing sequential recommendation models struggle to capture long-term dependencies, or they exhibit high complexity, both of which make it difficult to satisfy the unique business requirements of Meituan Waimai's recommender system.
To better model user interests, we consider selecting relevant sub-sequences from users' extensive historical behaviors based on their preferences. In this specific scenario, we have noticed that the contexts in which users interact have a significant impact on their preferences. For this purpose, we introduce a novel method called Context-based Fast Recommendation Strategy (CoFARS) to tackle the issue of long sequences. We first identify contexts that share similar user preferences with the target context and then locate the corresponding Points of Interest (PoIs) based on these identified contexts. This approach eliminates the need to select a sub-sequence for every candidate PoI, thereby avoiding high time complexity. Specifically, we implement a prototype-based approach to pinpoint contexts that mirror similar user preferences. To improve accuracy and interpretability, we employ the Jensen-Shannon (JS) divergence of PoI attributes, such as categories and prices, as a measure of similarity between contexts. Subsequently, we construct a temporal graph that encompasses both prototype and context nodes to integrate temporal information. We then identify appropriate prototypes considering both target contexts and short-term user preferences. Following this, we utilize contexts aligned with these prototypes to generate a sub-sequence, aimed at predicting CTR and CTCVR scores with target attention.
Since its inception in 2023, this strategy has been adopted in Meituan Waimai's display recommender system, leading to a 4.6% surge in CTR and a 4.2% boost in GMV.
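As a small illustration of the context-similarity measure mentioned above, the following sketch computes the Jensen-Shannon divergence between two hypothetical PoI-category distributions observed under different contexts; the distributions and context names are made up, and the actual CoFARS pipeline is considerably more involved.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))  # KL divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical PoI-category distributions under two contexts
# (e.g., "weekday lunch at the office" vs. "weekend dinner at home").
ctx_a = [0.55, 0.25, 0.15, 0.05]
ctx_b = [0.50, 0.30, 0.10, 0.10]
print(js_divergence(ctx_a, ctx_b))  # small value -> similar preferences
```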
Information retrieval (IR) is a pivotal component in various applications. Recent advances in machine learning (ML) have enabled the integration of ML algorithms into IR, particularly in ranking systems. While there is a plethora of research on the robustness of ML-based ranking systems, these studies largely neglect commercial e-commerce systems and fail to establish a connection between real-world and manipulated query relevance. In this paper, we present the first systematic measurement study on the robustness of e-commerce ranking systems. We define robustness as the consistency of ranking outcomes for semantically identical queries. To quantitatively analyze robustness, we propose a novel metric that considers both ranking position and item-specific information that are absent in existing metrics. Our large-scale measurement study with real-world data from e-commerce retailers reveals an open opportunity to measure and improve robustness since semantically identical queries often yield inconsistent ranking results. Based on our observations, we propose several solution directions to enhance robustness, such as the use of Large Language Models. Note that the issue of robustness discussed herein does not constitute an error or oversight. Rather, in scenarios where there exists a vast array of choices, it is feasible to present a multitude of products in various permutations, all of which could be equally appealing. However, this extensive selection may lead to customer confusion. As e-commerce retailers use various techniques to improve the quality of search results, we hope that this research offers valuable guidance for measuring the robustness of the ranking systems.
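The paper's metric is not reproduced here, but the following illustrative sketch shows one way to score ranking consistency between the result lists of two semantically identical queries, combining item overlap with a position discount and a penalty for displaced items. It is an assumption-laden stand-in, not the proposed metric, which additionally incorporates item-specific information.

```python
import math

def rank_consistency(list_a, list_b, k=10):
    """Position-discounted overlap between two top-k result lists.
    Shared items contribute more when they appear near the top and at
    similar positions; 1.0 means identical top-k rankings."""
    pos_b = {item: r for r, item in enumerate(list_b[:k])}
    score, norm = 0.0, 0.0
    for r, item in enumerate(list_a[:k]):
        w = 1.0 / math.log2(r + 2)  # DCG-style position discount
        norm += w
        if item in pos_b:
            score += w / (1 + abs(r - pos_b[item]))  # penalize displacement
    return score / norm if norm else 0.0

print(rank_consistency(["a", "b", "c", "d"], ["a", "c", "b", "e"]))
```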
Web-scale ranking systems at Meta serving billions of users are complex. Improving ranking models is essential but engineering-heavy. Automated Machine Learning (AutoML) can potentially release engineers from the labor-intensive work of tuning ranking models; however, it is unknown whether AutoML is efficient enough to meet tight production timelines in real-world applications and, at the same time, bring additional improvements over already strong baselines. Moreover, to achieve higher ranking performance, there is an ever-increasing demand to scale up ranking models to even larger capacity, which imposes more challenges on AutoML efficiency. The large scale of models and tight production schedules require AutoML to outperform human baselines using only a small number of model evaluation trials (~100). This paper presents a sampling-based AutoML search method, focusing on neural architecture search and hyperparameter optimization, with a particular emphasis on addressing the aforementioned challenges in Meta-scale production when building large-capacity models. Our approach efficiently handles large-scale data demands. It leverages a lightweight predictor-based searcher and reinforcement learning to explore vast search spaces, significantly reducing the number of model evaluations. Through experiments in large-capacity modeling for CTR and CVR applications, we demonstrate that our method achieves outstanding Return-on-Investment (ROI) versus human-tuned baselines, with up to 0.09% Normalized Entropy (NE) loss reduction or 25% Query-per-Second (QPS) increase by sampling only one hundred models on average from a curated search space. The proposed AutoML method has already made real-world impact: a discovered Instagram CTR model with up to -0.36% NE gain (over the existing production baseline) was selected for a large-scale online A/B test and showed statistically significant gains. These production results prove AutoML's efficacy and have accelerated its adoption in ranking systems at Meta.
The dissemination of information is a complex process that plays a crucial role in real-world applications, especially when intertwined with friend invitations and their ensuing responses. Traditional diffusion models, however, often do not adequately capture this invitation-aware diffusion (IAD), rendering inferior results. These models typically focus on describing the social influence process, i.e., how a user is informed by friends, but tend to overlook the subsequent behavioral changes that invitations might precipitate. To this end, we present the Independent Cascade with Invitation (ICI) model, which incorporates both the social influence process and multi-stage behavior conversions in IAD. We validate our design through an empirical study on in-game IAD. Furthermore, we conduct extensive experiments to evaluate the effectiveness of our proposal against 6 state-of-the-art models on 6 real-world datasets. In particular, we demonstrate that our solution can outperform the best competitor by up to 5× in cascade estimation and 17.2% in diffusion prediction. We deploy our proposal in the seed selection and friend ranking scenarios of Tencent's online games, where it achieves improvements of up to 170% and 20.3%, respectively.
Multi-index vector search has become the cornerstone of many applications, such as recommendation systems. Efficient search in such a multi-modal hybrid vector space is challenging since no single index design performs well for all kinds of vector data. Existing approaches to processing multi-index hybrid queries either suffer from algorithmic limitations or processing inefficiency. In this paper, we propose OneSparse, a unified multi-vector index query system that incorporates multiple posting-based vector indices, enabling highly efficient retrieval of multi-modal datasets. OneSparse introduces a novel multi-index query engine design with inter-index intersection push-down. It also optimizes the vector posting format to expedite multi-index queries. Our experiments show OneSparse achieves more than 6x search performance improvement while maintaining comparable accuracy. OneSparse has already been integrated into Microsoft online web search and advertising systems, with a 5x+ latency gain for Bing web search and a 2.0% Revenue Per Mille (RPM) gain for Bing sponsored search.
In the ever-evolving digital audio landscape, Spotify, well-known for its music and talk content, has recently introduced audiobooks to its vast user base. While promising, this move presents significant challenges for personalized recommendations. Unlike music and podcasts, audiobooks, initially available for a fee, cannot be easily skimmed before purchase, posing higher stakes for the relevance of recommendations. Furthermore, introducing a new content type into an existing platform confronts extreme data sparsity, as most users are unfamiliar with this new content type. Lastly, recommending content to millions of users requires the model to react fast and be scalable. To address these challenges, we leverage podcast and music user preferences and introduce 2T-HGNN, a scalable recommendation system comprising Heterogeneous Graph Neural Networks (HGNNs) and a Two Tower (2T) model. This novel approach uncovers nuanced item relationships while ensuring low latency and complexity. We decouple users from the HGNN graph and propose an innovative multi-link neighbor sampler. These choices, together with the 2T component, significantly reduce the complexity of the HGNN model. Empirical evaluations involving millions of users show significant improvement in the quality of personalized recommendations, resulting in a +46% increase in new audiobooks start rate and a +23% boost in streaming rates. Intriguingly, our model's impact extends beyond audiobooks, benefiting established products like podcasts.
User response prediction is essential in industrial recommendation systems, such as online display advertising. Among all the features in recommendation models, user behaviors are among the most critical. Many works have revealed that a user's behavior reflects her interest in the candidate item, owing to the semantic or temporal correlation between behaviors and the candidate. While the literature has individually examined each of these correlations, researchers have yet to analyze them in combination, that is, the semantic-temporal correlation. We empirically measure this correlation and observe intuitive yet robust patterns. We then examine several popular user interest models and find that, surprisingly, none of them learn such correlation well.
To fill this gap, we propose a Temporal Interest Network (TIN) to capture the semantic-temporal correlation between behaviors and the target simultaneously. We achieve this by incorporating target-aware temporal encoding, in addition to semantic encoding, to represent behaviors and the target. Furthermore, we conduct explicit 4-way interaction by deploying target-aware attention and target-aware representation to capture both semantic and temporal correlation. We conduct comprehensive evaluations on two popular public datasets, and our proposed TIN outperforms the best-performing baselines by 0.43% and 0.29% on GAUC, respectively. During online A/B testing on Tencent's advertising platform, TIN achieves a 1.65% cost lift and a 1.93% GMV lift over the base model. It has been successfully deployed in production since October 2023, serving the WeChat Moments traffic. We have released our code at https://github.com/zhouxy1003/TIN.
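A simplified PyTorch sketch of target-aware temporal attention follows, assuming bucketized time gaps to the target as the temporal encoding and an elementwise behavior-target interaction as the "target-aware representation". The module and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetAwareTemporalAttention(nn.Module):
    """Sketch: behaviors carry semantic + target-relative temporal encodings;
    both the attention weights and the representations are target-aware."""

    def __init__(self, num_items, num_time_buckets, dim=32):
        super().__init__()
        self.item_emb = nn.Embedding(num_items, dim)
        # temporal encoding indexed by the bucketized time gap to the target
        self.time_emb = nn.Embedding(num_time_buckets, dim)

    def forward(self, behavior_ids, time_gap_buckets, target_ids):
        b = self.item_emb(behavior_ids) + self.time_emb(time_gap_buckets)  # (B, L, d)
        t = self.item_emb(target_ids).unsqueeze(1)                         # (B, 1, d)
        attn = F.softmax((b * t).sum(-1), dim=-1)                          # target-aware attention
        rep = b * t                                                        # target-aware representation
        return (attn.unsqueeze(-1) * rep).sum(dim=1)                       # (B, d) user interest vector
```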
Cross-domain CTR (CDCTR) prediction is an important research topic that studies how to leverage meaningful data from a related domain to help CTR prediction in the target domain. Most existing CDCTR works design implicit ways to transfer knowledge across domains, such as parameter sharing that regularizes model training in the target domain. More effectively, recent researchers have proposed explicit techniques to extract user interest knowledge and transfer it to the target domain. However, such methods mainly face two issues: 1) they usually require a super domain, i.e., an extremely large source domain, to cover most users or items of the target domain, and 2) the extracted user interest knowledge is static regardless of the context in the target domain. These limitations motivate us to develop a more flexible and efficient technique for explicit knowledge transfer. In this work, we propose a cross-domain augmentation network (CDAnet) that can perform explicit knowledge transfer between two domains. Specifically, CDAnet contains a translation network and an augmentation network that are trained sequentially. The translation network computes latent features from the two domains and learns meaningful cross-domain knowledge for each input in the target domain by using a designed cross-supervised feature translator. The augmentation network then employs the explicit cross-domain knowledge as augmented information to boost target-domain CTR prediction. Through extensive experiments on two public benchmarks and one industrial production dataset, we show that CDAnet can learn meaningful translated features and largely improve the performance of CTR prediction. CDAnet has been evaluated in an online A/B test in image2product retrieval in the Taobao app, bringing an absolute 0.11 point CTR improvement, a relative 0.64% deal growth, and a relative 1.26% GMV increase.
Online advertising plays a pivotal role in sustaining the accessibility of free content on the Internet, serving as a primary revenue source for websites and online services. This dynamic marketplace sees advertisers allocating budgets and competing for the opportunity to present ads to users engaging with web pages, online services, and mobile apps. Modern online advertising often employs first-price auctions to determine ad placements. Yet, conducting auctions as isolated events in a greedy manner may lead to sub-optimal results, necessitating some form of budget pacing. Traditionally, budget pacing has been achieved through hard throttling, where ads or campaigns are selectively made eligible for each auction using a biased coin-toss with a specified probability (or pacing signal). More recently, the pacing signal has been leveraged to soft throttle ads: it is used as a multiplicative factor on their bids, thus enabling participation in all auctions but with potentially modified bids.
In this study, we introduce Mystique, a "soft" throttling-based budget pacing system. Mystique operates on two levels: it utilizes spending data to establish a daily target spending curve for each campaign, and it continuously updates a pacing signal to align the actual spending with this curve. Our offline evaluation in a complex simulated marketplace demonstrates Mystique's ability to outperform several baseline algorithms, enabling budget depletion while securing more opportunities. Mystique has been in production for several years, serving a major native advertising marketplace and successfully pacing over one billion USD annually.
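A minimal sketch of the soft-throttling feedback loop described above: the pacing signal multiplies bids and is nudged up or down depending on whether cumulative spend leads or lags the target curve. The gain, bounds, and update rule here are illustrative assumptions, not Mystique's actual controller.

```python
def update_pacing_signal(signal, spent_so_far, target_so_far,
                         gain=0.1, lo=0.01, hi=1.0):
    """Multiplicative-feedback update of a soft-throttling pacing signal.
    If actual spend runs ahead of the target curve, the signal (and hence
    the effective bids) is lowered; if it lags behind, the signal is raised."""
    if target_so_far <= 0:
        return signal
    error = (spent_so_far - target_so_far) / target_so_far
    new_signal = signal * (1.0 - gain * error)
    return min(hi, max(lo, new_signal))

def paced_bid(base_bid, signal):
    """Bid actually submitted to the auction under soft throttling."""
    return base_bid * signal

# campaign is overspending relative to its curve, so the signal shrinks
print(update_pacing_signal(signal=0.8, spent_so_far=120.0, target_so_far=100.0))
```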
Massively Multiplayer Online Games (MMOs) feature intricate virtual economies that permeate various in-game activities. However, the balancing act between profitability and equality in MMO economic design proves to be a persistent conundrum, especially in nascent business models like Pay-to-Win (P2W). Conventional efforts are curtailed by two primary constraints: the inability to verify and the provision of suboptimal solutions. In light of these predicaments, this paper delves into MMO economies and explores the promising potential of integrating emerging AI methodologies into economic design. Specifically, we introduce a novel hierarchical Reinforcement Learning (RL) solution for achieving Pareto optimality between profitability and equality in P2W economies. Leveraging our substantial industrial acumen and expertise, we establish an economic simulation environment that facilitates authentic and realistic assessments of MMO economic evolution. Building upon this foundation, we reconceptualize the P2W economic design process within the paradigm of a Markov Decision Process (MDP) and tackle it as a standard RL problem. Comprehensive evaluations corroborate that our solution demonstrates consistent personality specialization in economic simulations akin to real-world MMOs and significantly outperforms other baselines in economic design. Further discussions highlight its superiority in both frontier research and practical applications within the game industry.
We propose a general model-agnostic Contrastive learning framework with Counterfactual Samples Synthesizing (CCSS) for modeling the monotonicity between the neural network output and numerical features, which is critical for the interpretability and effectiveness of recommender systems. CCSS models the monotonicity via a two-stage process: synthesizing counterfactual samples and contrasting the counterfactual samples. The two techniques are naturally integrated into a model-agnostic framework, forming an end-to-end training process. Abundant empirical tests are conducted on a publicly available dataset and a real industrial dataset, and the results demonstrate the effectiveness of our proposed CCSS. Besides, CCSS has been deployed in our real large-scale industrial recommender, successfully serving hundreds of millions of users.
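A compact PyTorch sketch of the two-stage idea, assuming a single numerical feature that should act monotonically increasing on the output: a counterfactual sample is synthesized by perturbing that feature, and a hinge-style contrast penalizes outputs that fail to increase. The function signature and hyperparameters are hypothetical, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def monotonicity_contrastive_loss(model, x, feat_idx, delta=1.0, margin=0.0):
    """Stage 1: synthesize a counterfactual sample by increasing one numerical
    feature. Stage 2: contrast the two predictions, penalizing the model
    whenever the counterfactual output does not exceed the original one."""
    x_cf = x.clone()
    x_cf[:, feat_idx] = x_cf[:, feat_idx] + delta   # counterfactual synthesis
    y, y_cf = model(x), model(x_cf)
    return F.relu(margin + y - y_cf).mean()          # contrastive monotonicity term
```

In practice, such a term would be added to the main recommendation loss with a weighting coefficient, keeping the overall training end-to-end and model-agnostic.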
In an era of digital evolution, banking sectors face the dual challenge of nurturing a digitally savvy demographic and managing potential dormant account holders. This study delves deep into the prediction of customer contributions, particularly considering the skewed nature of such data. The inherent skewness in customer contribution data, highlighted by the substantial low-value contribution group and the vast variability among high-value contributors, necessitates an advanced prediction model. Addressing this, we present the Skewness-aware Boosting Regression Trees (SBRT) framework to predict customer contributions whose distribution exhibits high skewness. SBRT seamlessly combines the strength of Gradient Boosted Decision Trees with a novel mechanism of random tree deactivation, adeptly tackling distribution skewness. The model's effectiveness is rooted in four principles: cross-feature extraction, a percentile-based calibration and rebalancing method, tree deactivation during the boosting phase, and the utilization of Huber loss. Extensive testing on real-world bank data underscores SBRT's promising capability in managing skewed distributions, setting it apart in predicting customer contributions. The culmination of this work lies in its practical validation, where online A/B tests highlight SBRT's tangible industrial applicability.
Online controlled experiments have emerged as the industry gold standard for assessing new web features. As new web algorithms proliferate, experimentation platforms face increasing demand for higher velocity of online experiments, which encourages adaptive traffic testing methods that speed up identifying the best variant by efficiently allocating traffic. This paper proposes four Bayesian batch bandit algorithms (NB-TS, WB-TS, NB-TTTS, WB-TTTS) for eBay's experimentation platform, using summary batch statistics without incurring new engineering technical debt. The novel WB-TTTS, in particular, proves to be an efficient, trustworthy, and robust alternative to fixed-horizon A/B testing. Another novel contribution is to bring the trustworthiness of best-arm identification algorithms into the evaluation criteria and highlight the existence of severe false positive inflation with equivalent best arms. To gain the trust of experimenters, an experimentation platform must consider both efficiency and trustworthiness; however, to the best of the authors' knowledge, trustworthiness as an important topic is rarely discussed. This paper shows that Bayesian bandits without neutral posterior reshaping, particularly naive Thompson sampling (NB-TS), are untrustworthy because they can always identify one arm as the best from a set of equivalent best arms. To restore trustworthiness, a novel finding uncovers connections between the convergence distribution of posterior optimal probabilities of equivalent best arms and neutral posterior reshaping, which controls false positives. Lastly, this paper presents lessons learned from eBay's experience, as well as thorough evaluations. We hope that this paper is useful to other industrial practitioners and inspires academic researchers interested in the trustworthiness of adaptive traffic experimentation.
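For intuition, the sketch below shows a naive batch Thompson sampling allocation (in the spirit of NB-TS) computed from batch summary statistics with Beta posteriors; it omits the neutral posterior reshaping that the paper argues is needed for trustworthiness, and the prior and draw count are illustrative choices.

```python
import numpy as np

def batch_thompson_allocation(successes, trials, batch_draws=10000, seed=0):
    """Naive batch Thompson sampling from Beta posteriors built on summary
    statistics; returns the traffic split proposed for the next batch."""
    rng = np.random.default_rng(seed)
    successes = np.asarray(successes, dtype=float)
    trials = np.asarray(trials, dtype=float)
    # Beta(1 + s, 1 + n - s) posterior on each arm's conversion rate
    samples = rng.beta(1 + successes, 1 + trials - successes,
                       size=(batch_draws, len(successes)))
    wins = np.bincount(samples.argmax(axis=1), minlength=len(successes))
    return wins / batch_draws  # posterior optimal probabilities -> traffic split

print(batch_thompson_allocation(successes=[120, 135], trials=[5000, 5000]))
```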
Graph plays an important role in representing complex relationships in real-world applications such as social networks, biological data and citation networks. In recent years, Large Language Models (LLMs) have achieved tremendous success in various domains, which makes applying LLMs to graphs particularly appealing. However, directly applying LLMs to graph modalities presents unique challenges due to the discrepancy and mismatch between the graph and text modalities. Hence, to further investigate LLMs' potential for comprehending graph information, we introduce GraphPrompter, a novel framework designed to align graph information with LLMs via soft prompts. Specifically, GraphPrompter consists of two main components: a graph neural network to encode complex graph information and an LLM that effectively processes textual information. Comprehensive experiments on various benchmark datasets under node classification and link prediction tasks demonstrate the effectiveness of our proposed method. The GraphPrompter framework unveils the substantial capabilities of LLMs as predictors in graph-related tasks, enabling researchers to utilize LLMs across a spectrum of real-world graph scenarios more effectively.
While revolutionizing social networks, recommendation systems, and online web services, graph neural networks are vulnerable to adversarial attacks. Recent state-of-the-art adversarial attacks rely on gradient-based meta-learning to selectively perturb a single edge with the highest attack score until they reach the budget constraint. While effective in identifying vulnerable links, these methods are plagued by high computational costs. By leveraging continuous relaxation and parameterization of the graph structure, we propose a novel attack method -- DGA to efficiently generate effective attacks and meanwhile eliminate the need for costly retraining. Compared to the state-of-the-art, DGA achieves nearly equivalent attack performance with 6 times less training time and 11 times smaller GPU memory footprint on different benchmark datasets. Additionally, we provide extensive experimental analyses of the transferability of DGA among different graph models, as well as its robustness against widely-used defense mechanisms.
As concerns over data privacy intensify, unlearning in Graph Neural Networks (GNNs) has emerged as a prominent research frontier in academia. This concept is pivotal in enforcing the right to be forgotten, which entails the selective removal of specific data from trained GNNs upon user request. Our research focuses on edge unlearning, a process of particular relevance to real-world applications. Current state-of-the-art approaches like GNNDelete can eliminate the influence of specific edges yet suffer from over-forgetting, which means the unlearning process inadvertently removes excessive information beyond what is needed, leading to a significant performance decline for the remaining edges. Our analysis identifies the loss functions of GNNDelete as the primary source of over-forgetting and also suggests that loss functions may be redundant for effective edge unlearning. Building on these insights, we simplify GNNDelete to develop Unlink to Unlearn (UtU), a novel method that facilitates unlearning exclusively by unlinking the forget edges from the graph structure. Our extensive experiments demonstrate that UtU delivers privacy protection on par with that of a retrained model while preserving high accuracy in downstream tasks, upholding over 97.3% of the retrained model's privacy protection capabilities and 99.8% of its link prediction accuracy. Meanwhile, UtU requires only constant computational demands, underscoring its advantage as a highly lightweight and practical edge unlearning solution.
Suggesting relevant questions to users is an important task in various applications, such as community Q&A or e-commerce websites. To ensure that there is no redundancy in the selected set of candidate questions, it is essential to filter out any near-duplicate questions. Identifying near-duplicate questions has another use case in light of the adoption of Large Language Models (LLMs) - fetching pre-computed answers for similar questions. However, identifying the similarity of questions is a bit more complex in comparison to generic text, as questions entail open-ended information that is not explicitly contained within the wording of the question itself. We introduce a taxonomy that accounts for the subtle intricacies characteristic of near-duplicate questions and propose a method for detecting them utilizing the capabilities of LLMs.
Most Temporal Knowledge Graphs (TKGs) exhibit a long-tail entity distribution, where the majority of entities have sparse connections. Existing TKG completion methods struggle with managing new or unseen entities that often lack sufficient connections. In this paper, we introduce a model-agnostic enhancement layer that can be integrated with any existing TKG completion method to improve its performance. This enhancement layer employs a broader, global definition of entity similarity, transcending the limitations of local neighborhood proximity found in Graph Neural Network (GNN) based methods. Additionally, we conduct our evaluations in a novel, realistic setup that treats the TKG as a stream of evolving data. Evaluations on two benchmark datasets demonstrate that our framework surpasses existing methods in overall link prediction, inductive link prediction, and in addressing long-tail entities. Notably, our approach achieves a 10% improvement in MRR on one dataset and a 15% increase on another.
Recent recommender system advancements have focused on developing sequence-based and graph-based approaches. Both approaches have proved useful in modeling intricate relationships within behavioral data, leading to promising outcomes in personalized ranking and next-item recommendation tasks while maintaining good scalability. However, they capture very different signals from the data. While the former approach represents users directly through ordered interactions with recent items, the latter aims to capture indirect dependencies across the interaction graph. This paper presents a novel multi-representational learning framework that exploits the synergies of these two paradigms. Our empirical evaluation on several datasets demonstrates that mutual training of sequential and graph components with the proposed framework significantly improves recommendation performance.
In this paper, we first define the problem of item-ranking promotion (IRP) in recommender systems as (Goal 1) maintaining a high level of overall recommendation accuracy while (Goal 2) recommending the items with extra values (i.e., RP-items) to as many users as possible. Our novel framework, proposed to address the IRP problem, is based on our own loss function that simultaneously aims to achieve the two goals above and employs a learning-to-rank scheme for training a recommender model. Via extensive experiments, we validate the effectiveness of our framework in terms of the exposure rate of RP-items and the accuracy of recommendation.
Federated learning is an approach to privacy-preserving machine learning. It is increasingly being used in a number of classification as well as ranking tasks. Protocols for federated learning involve model updates at the edge devices and aggregation at the central servers over multiple rounds. In practice, most deep learning models deployed on the edge are already trained and in use. Federated learning protocols lead to an oscillation in the performance of these local models over the epochs, and the drop in accuracy is more prominent in the early phases. In this article, we study such effects for the popular FedAvg federated learning algorithm and propose the modified HBIAS FedAvg algorithm, which employs a heuristic-based initialization adoption strategy for this purpose. We find that this protocol leads to smoother performance variation in experiments on benchmark datasets.
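For reference, here is a minimal sketch of the FedAvg aggregation step that the study builds on: client parameters are averaged with weights proportional to their local data sizes. The HBIAS initialization heuristic itself is not detailed in the abstract, so it is not shown.

```python
import numpy as np

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted FedAvg aggregation: the new global parameter vector is the
    data-size-weighted average of the clients' parameter vectors."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    return (coeffs[:, None] * stacked).sum(axis=0)

# three clients with different amounts of local data
clients = [[0.2, 0.4], [0.1, 0.3], [0.6, 0.0]]
print(fedavg_aggregate(clients, client_sizes=[100, 50, 10]))
```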
The rampant spread of fake news has adversely affected society, resulting in extensive research on curbing its spread. As a notable milestone in large language models (LLMs), ChatGPT has gained significant attention due to its exceptional capabilities. In this study, we present an exploration of ChatGPT's proficiency in generating, explaining, and detecting fake news as follows. Generation -- We employ different prompt methods to generate fake news and prove the high quality of these instances through both self-assessment and human evaluation. Explanation -- We obtain nine features to characterize fake news based on ChatGPT's explanations and analyze the distribution of these factors across multiple public datasets. Detection -- We examine ChatGPT's capacity to identify fake news. We propose a reason-aware prompt method to improve its performance. We further probe into the potential extra information that could bolster its effectiveness in detecting fake news.
Node embedding is one of the most widely adopted techniques in numerous graph analysis tasks, such as node classification. Methods for node embedding can be broadly classified into three categories: proximity matrix factorization approaches, sampling methods, and deep learning strategies. Among the deep learning strategies, graph contrastive learning has attracted significant interest. Yet, it has been observed that existing graph contrastive learning approaches do not adequately preserve the local topological structure of the original graphs, particularly when neighboring nodes belong to disparate categories. To address this challenge, this paper introduces a novel node embedding approach named Locally Linear Contrastive Embedding (LLaCE). LLaCE is designed to maintain the intrinsic geometric structure of graph data by utilizing locally linear formulation, thereby ensuring that the local topological characteristics are accurately reflected in the embedding space. Experimental results on one synthetic dataset and five real-world datasets validate the effectiveness of our proposed method.
Deceptive patterns are design practices embedded in digital platforms to manipulate users, representing a widespread and long-standing issue in the web and mobile software development industry. Legislative actions highlight the urgency of globally regulating deceptive patterns. However, despite advancements in detection tools, a significant gap exists in assessing deceptive pattern risks. In this study, we introduce a comprehensive approach involving the interactions between the Adversary, Watchdog (e.g., detection tools), and Challengers (e.g., users) to formalize and decode deceptive pattern threats. Based on this, we propose a quantitative risk assessment system. Representative cases are analyzed to showcase the practicability of the proposed risk scoring system, emphasizing the importance of involving human factors in deceptive pattern risk assessment.
As a paradigm that preserves privacy, Federated Learning (FL) enables distributed clients to cooperatively train global models using local datasets. However, this approach also provides opportunities for adversaries to compromise system stability by contaminating local data, such as through Label-Flipping Attacks (LFAs). In addressing these security challenges, most existing defense strategies presume the presence of an independent and identically distributed (IID) environment, resulting in suboptimal performance under Non-IID conditions. This paper introduces RSim-FL, a novel and pragmatic defense mechanism that incorporates Representational Similarity Analysis (RSA) into the detection of malevolent updates. This is achieved by calculating the similarity between uploaded local models and the global model. The evaluation, conducted against five state-of-the-art baselines, demonstrates that RSim-FL can accurately identify malicious local models and effectively mitigate divergent Label-Flipping Attacks (LFAs) in a Non-IID setting.
Recent studies show that deep neural networks are extremely vulnerable, especially to adversarial examples of image classification models. However, current defense technologies exhibit a series of limitations in terms of adaptability to different attacks, the trade-off between clean-instance accuracy and robust accuracy, as well as training-time overhead. To tackle these problems, we present a novel component, named the redundant fully connected layer, which can be combined with existing model backbones in a pluggable manner. Specifically, we design a tailor-made loss function for it that leverages cosine similarity to maximize the difference and diversity of multiple fully connected parts. We conduct extensive experiments against 12 representative attacks (white-box and black-box) on a popular dataset. The empirical evaluations show that our scheme achieves significant outcomes against various attacks with negligible additional training overhead, while causing hardly any collateral damage to clean-instance accuracy.
The Embedding-based Retrieval (EBR) system is a fundamental component that supplies candidates for downstream ranking mechanisms in the sponsored search system. To enhance the search experience and ensure effective retrieval, EBR usually accounts for various objectives, including the semantic relevance and personalization of search results. However, traditional multi-task EBR models ignore the intrinsic progressive relationship between relevant and personalized candidates during a search. Recognizing this gap, we make the very first attempt to utilize the representation generation capabilities of Diffusion Models in EBR. In this paper, we present a novel model, DiffuRetrieval, to address these progressive objectives for high-quality item retrieval. In the forward process, DiffuRetrieval incrementally corrupts item representations through controlled noise injection. Conversely, in the reverse process, we refine the representations based on query information in a chain-of-thought manner, initially establishing coarse-grained relevance and progressively moving towards fine-grained personalization. Online A/B tests on the Meituan sponsored search platform demonstrate that our approach markedly surpasses the baselines, delivering substantial improvements in revenue, relevance, and personalization.
We present the Thought Graph as a novel framework to support complex reasoning and use gene set analysis as an example to uncover semantic relationships between biological processes. Our framework stands out for its ability to provide a deeper understanding of gene sets, significantly surpassing GSEA by 40.28% and LLM baselines by 5.38% based on cosine similarity to human annotations. Our analysis further provides insights into future directions of biological processes naming, and implications for bioinformatics and precision medicine.
Citation networks have been thought to exhibit the scale-free property for many years; however, this assertion has recently been called into question. In this paper, we conduct extensive experiments to resolve this controversial issue. We first demonstrate the scale-free property in scale-free networks sampled from the popular Barabasi-Albert (BA) model. To this end, we employ a merged rank distribution, which is divided into outliers, a power-law segment, and non-power-law data, to characterize network degrees; we propose a random sample consensus (RANSAC)-based method to identify power-law segments from merged rank distributions; and we use the Kolmogorov-Smirnov (KS) test to examine the scale-free property in the power-law segments. Subsequently, we apply the same methods to examine the scale-free property in real-world citation networks. Experimental results confirm the scale-free property in citation networks and attribute previous skepticism to the presence of outliers.
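A small numpy sketch of the tail-fitting step, assuming the continuous-approximation maximum-likelihood estimate of the power-law exponent and a KS distance between the empirical and fitted tail CDFs; the RANSAC segmentation of the merged rank distribution is not shown, and x_min is simply taken as given.

```python
import numpy as np

def powerlaw_alpha_ks(degrees, x_min):
    """Continuous-approximation MLE of the power-law exponent and the KS
    distance between the empirical and fitted tails (values >= x_min)."""
    x = np.sort(np.asarray([d for d in degrees if d >= x_min], dtype=float))
    n = len(x)
    alpha = 1.0 + n / np.sum(np.log(x / x_min))        # MLE of the exponent
    emp_cdf = np.arange(1, n + 1) / n                  # empirical CDF on the tail
    fit_cdf = 1.0 - (x / x_min) ** (1.0 - alpha)       # fitted power-law CDF
    return alpha, np.max(np.abs(emp_cdf - fit_cdf))    # exponent, KS statistic

rng = np.random.default_rng(0)
# synthetic Pareto-tailed sample standing in for a degree sequence
sample = (rng.pareto(1.5, size=5000) + 1.0) * 2.0
print(powerlaw_alpha_ks(sample, x_min=2.0))
```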
Documents describing information with expiration time often include time expressions specifying the expiration time. To train a classifier determining if a time expression represents an expiration time, we need a labeled dataset. We propose a method of automatically constructing such a dataset. Our method collects tweets including time expressions, and automatically determines whether the time expressions represent expiration times based on temporal changes in the frequency of retweets. Our experimental result shows that our method produces an effective dataset.
Relevant recommendation is a distinctive recommendation scenario in e-commerce platforms, which provides an extended set of items that are relevant to the trigger item (the item that triggers the relevant recommendation). Different from the general recommendations whose item feeds are diversified, relevant recommendation regards the trigger item as a key component. From one perspective, the trigger item reveals users' current interests and determines the range of the recommendation results. From the other perspective, users may have the mindset to look for items that have directional attribute differences from the trigger item. In this paper, we present an attribute-aware personalized item comparison framework. Under this framework, an item subtraction module is first applied over the trigger item and the candidate item, which calculates their directional difference with consideration of their intrinsic similarity. Then two modules are used to estimate users' preference for this current item pair: one learns the collective preference of all users, and the other learns the current user's personal evolutional preference. Experiments on a CTR prediction task over both a public dataset and an industrial dataset from our shopping app show that the proposed method outperforms the state-of-the-art algorithms and also achieves better generalization ability.
How can we detect a group of individuals whose connectivity persists and even strengthens over time? Despite extensive research on temporal networks, this practically pertinent question has been scantily investigated. In this paper, we formulate the problem of selecting a subset of nodes whose induced subgraph maximizes the overall edge count while abiding by time-aware spectral connectivity constraints. We solve the problem via a semidefinite programming (SDP) relaxation. Our experiments on a broad array of synthetic and real-world data establish the effectiveness of our method and deliver key insights on real-world temporal graphs.
Quantum computing (QC) has recently achieved significant technological advancements, attracting widespread attention. Current users mainly access QC resources through cloud services. However, cloud-based quantum services provide convenience while also introducing security risks. For example, attackers could steal private information or inject malicious programs into quantum devices, while quantum device fingerprinting may be the first step for these malicious intents. In this paper, we propose a novel Task-Driven Quantum Device Fingerprinting (TD-QDF) identification method based on quantum neural network (QNN) task outcomes. Unlike previous research, our method does not require any hardware details, resulting in high availability in practice. Extensive experiments involving 3 QNN circuits on 10 real IBM quantum computers show that our method can effectively identify quantum devices. This research contributes to advancing quantum fingerprinting technologies and holds promising implications for enhancing the security and accountability of quantum computing systems.
Most previous heterogeneous graph embedding models represent elements in a heterogeneous graph as vector representations in a low-dimensional Euclidean space. However, because heterogeneous graphs inherently possess complex structures, such as hierarchical or power-law structures, distortions can occur when representing them in Euclidean space. To overcome this limitation, we propose Hyperbolic Heterogeneous Graph Attention Networks (HHGAT) that learn vector representations in hyperbolic spaces with metapath instances. We conducted experiments on three real-world heterogeneous graph datasets, demonstrating that HHGAT outperforms state-of-the-art heterogeneous graph embedding models in node classification and clustering tasks. This superior performance is attributed to HHGAT's ability to capture the complex structure of heterogeneous graphs effectively.
Social network users often maintain multiple active accounts, sometimes referred to as alter egos. Examples of alter egos include personal and professional accounts or named and anonymous accounts. If alter egos are common on a platform, they can affect the results of A/B testing because a user's alter egos can influence each other. For a single user, one account may be assigned treatment, while another is assigned control. Alter-ego bias is relevant when the treatment affects the individual user rather than the account. Through experimentation and theoretical analysis, we examine the worst and expected case bias for different numbers of alter egos and for a variety of network structures and peer effect strengths. We show that alter egos moderately bias the results of simulated A/B tests on several network structures, including a real-world Facebook subgraph and several types of synthetic networks: small world networks, forest fire networks, stochastic block models, and a worst-case structure. We also show that bias increases with the number of alter egos and that different network structures have different upper bounds on bias.
This paper introduces a novel approach to stock movement prediction using multi-label classification, leveraging the interconnections between news articles and related company stocks. We present the Label-Prior Graph Attention (LPGA) model, which significantly enhances the performance of news-driven stock price movement forecasting. The model comprises a unique graph attention architecture, incorporating a label encoder and a text encoder, designed to effectively capture and utilize the relationships between labels in a graph-based context. Our model demonstrates superior performance over several benchmark models. The LPGA model's efficacy is further validated through experiments on two multi-label datasets, where it outperforms established baseline models across various evaluation metrics. The success of the LPGA model in both stock movement prediction and general multi-label classification tasks indicates its potential as a versatile tool in the realm of machine learning and financial analysis.
Count questions are an important type of information need, though often present in noisy, contradictory, or semantically not fully aligned form on the Web. In this work, we propose CardiO, a lightweight and modular framework for searching entity counts on the Web. CardiO extracts all counts from a set of relevant Web snippets, and infers the most central count based on semantic and numeric distances from other candidates. In the absence of supporting evidence, the system relies on peer sets of similar size, to provide an estimate. Experiments show that CardiO can produce accurate and traceable counts better than small LLM-only methods. Although larger models have higher precision, when used to enhance CardiO components, they do not contribute to the final precision or recall.
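A toy sketch of the count-consensus idea: among candidate counts extracted from web snippets, pick the one with the smallest total relative distance to the others (a medoid-style choice). CardiO additionally uses semantic distances and peer-set estimates, which are omitted here; the example values are made up.

```python
def central_count(counts):
    """Pick the candidate count with the smallest total relative numeric
    distance to all other candidates (a medoid-style consensus)."""
    def rel_dist(a, b):
        return abs(a - b) / max(a, b) if max(a, b) > 0 else 0.0
    return min(counts, key=lambda c: sum(rel_dist(c, o) for o in counts))

# candidate counts extracted from snippets for the same entity question
print(central_count([195, 193, 200, 27, 196]))  # the outlier 27 loses to a central value
```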
Social recommendation aims to integrate social relationships to improve recommendation performance and has attracted increasing attention in the field of recommender systems. Recently, Graph Neural Network (GNN) based methods for social recommendation have become very competitive, but most of them overlook the fact that social relationships may contain noise. Through the message passing mechanism of GNNs, this noise can be propagated and amplified, ultimately reducing recommendation performance. In view of this, we propose a novel GNN-based Adaptive Denoising Social Recommendation (ADSRec) method. It devises a denoising network, which can alleviate the impact of noisy social relationships via an adaptive weight adjustment strategy. By further introducing contrastive learning, the representations of users and items can be enhanced, leading to better recommendation results. Extensive experiments on three widely used datasets demonstrate the superiority of ADSRec over baselines.
Disease prediction holds considerable significance in modern healthcare because of its crucial role in facilitating early intervention and implementing effective prevention measures. However, most recent disease prediction approaches rely heavily on laboratory test outcomes (e.g., blood tests and medical imaging from X-rays). Gaining access to such data for precise disease prediction is often a complex task from the standpoint of a patient and is only available after a patient consultation. To make disease prediction available from the patient side, we propose Personalized Medical Disease Prediction (PoMP), which predicts diseases using patient health narratives, including textual descriptions and demographic information. By applying PoMP, patients can gain a clearer comprehension of their conditions, empowering them to directly seek appropriate medical specialists and thereby reducing the time spent navigating healthcare communication to locate suitable doctors. We conducted extensive experiments using real-world data from Haodf to showcase the effectiveness of PoMP.
With the proliferation of Location-based Social Networks (LBSNs), user check-in data at Points-of-Interest (POIs) has surged, reshaping user-environment interaction. However, POI recommendation remains a challenging task for two primary reasons. First, external incentives often drive users' check-ins, potentially misrepresenting their genuine preferences. Second, while much current research models the temporal dynamics of user preferences in a discrete space, it fails to capture the continuous evolution of these preferences. To address these challenges, we propose GraphSAGE-based POI Recommendation via Continuous-Time Modeling (GSA-CTM). We first utilize GraphSAGE to identify real user preferences and filter out noise beyond those preferences. After GraphSAGE captures the complex interactions, we use a Gated Recurrent Unit (GRU) combined with neural Ordinary Differential Equations (ODEs) to capture the temporal information embedded in the interactions, and we then use neural ODEs to model the user's continuously evolving preferences in a continuous space. Experiments on two widely used public datasets validate the superiority of our method.
Recommendation systems often neglect global patterns that can be provided by clusters of similar items or even additional information such as text. Therefore, we study the impact of integrating clustering embeddings, review embeddings, and their combinations with embeddings obtained by a recommender system. Our work assesses the performance of this approach across various state-of-the-art recommender system algorithms. Our study highlights the improvement of recommendation performance through clustering, particularly evident when combined with review embeddings, and the enhanced performance of neural methods when incorporating review embeddings.
Recent studies have introduced privacy-preserving graph neural networks to safeguard the privacy of sensitive link information in graphs. However, existing link protection mechanisms in GNNs, particularly over decentralized nodes, struggle to strike an optimal balance between privacy and utility. We argue that a pivotal issue is the separation of noisy topology denoising and private GNN learning into distinct phases on the server side, leading to an under-denoising problem in the noisy topology. To address this, we propose an adaptive Link LDP framework that performs noisy topology denoising on the server side in a dynamic manner. This approach aims to mitigate the impact of local noise on the GNN training process, reducing the uncertainty introduced by local noise. Furthermore, we integrate the noise generation and private training processes of all existing Link LDP GNNs into a unified framework. Experimental results demonstrate that our method surpasses existing approaches, obtaining around a 7% performance improvement under strong privacy strength and achieving a better trade-off between utility and privacy.
In the realm of social media, understanding and predicting post reach is a significant challenge. This paper presents a Crowd Reaction AssessMent (CReAM) task designed to estimate if a given social media post will receive more reaction than another, a particularly essential task for digital marketers and content writers. We introduce the Crowd Reaction Estimation Dataset (CRED), consisting of pairs of tweets from The White House with comparative measures of retweet count. The proposed Generator-Guided Estimation Approach (GGEA) leverages generative Large Language Models (LLMs), such as ChatGPT, FLAN-UL2, and Claude, to guide classification models for making better predictions. Our results reveal that a fine-tuned FLANG-RoBERTa model, utilizing a cross-encoder architecture with tweet content and responses generated by Claude, performs optimally. We further use a T5-based paraphraser to generate paraphrases of a given post and demonstrate GGEA's ability to predict which post will elicit the most reactions. We believe this novel application of LLMs provides a significant advancement in predicting social media post reach.
Question-answering (QA) retrieval is the task of retrieving the most relevant answer to a given question from a collection of answers. Various approaches to QA retrieval have been developed recently. One successful and popular model is Contextualized Late Interaction over BERT (ColBERT), a transformer-based approach that adopts a query-document scoring mechanism that retains the granularity of transformer matching, whilst improving on efficiency. However, one key limitation is that it requires further fine-tuning for new query or collection types. In this work, we explore and propose several non-parametric retrieval augmentation methods based on explicit signals of term importance that improve over ColBERT's baseline performance. In particular, we consider the QA retrieval task in the context of StackExchange question-answering forum, verifying the effectiveness of our methods in this setting.
The challenge of managing immigration data is exacerbated by its reliance on paper-based, evidence-driven records maintained by legal professionals, creating obstacles for efficient processing and analysis due to inherent trust issues with AI-based systems. This paper introduces a cutting-edge framework to surmount these hurdles by synergizing Large Language Models (LLMs) with Knowledge Graphs (KGs), revolutionizing traditional data handling methods. Our method transforms archaic, paper-based immigration records into a structured, interconnected knowledge network that intricately mirrors the legal and procedural nuances of immigration, ensuring a dynamic and trustworthy platform for data analysis. Utilizing LLMs, we extract vital entities and relationships from diverse legal documents to forge a comprehensive knowledge graph, encapsulating the complex legalities and procedural disparities in immigration processes and mapping the multifaceted interactions among stakeholders like applicants, sponsors, and legal experts. This graph not only facilitates a deep dive into the legal stipulations but also incorporates them, significantly boosting the system's reliability and precision. With the integration of Retrieval Augmented Generation (RAG) for exact, context-aware data retrieval and Augmented Knowledge Creation for developing a conversational interface via LLMs, our framework offers a scalable, adaptable solution to immigration data management. This innovative amalgamation of LLMs, KGs, and RAG techniques marks a paradigm shift towards more informed, efficient, and trustworthy decision-making in the sphere of global migration, setting a new benchmark for legal technology and data source management.
Companies track user data and sell it to advertisers. They claim to protect user privacy through anonymization, but our research shows that significant risks remain. Even with anonymous data, attackers can identify users on other websites from tracking records. We propose an identity alignment method for deanonymization attacks, which analyzes tracker data to align identities. We explore the key factors affecting the effectiveness of identity alignment and analyze its impact on user privacy. We use crawled data to create tracker data close to ground-truth scenarios and propose an evaluation framework for online-tracking-based identity alignment.
Graph clustering is a challenging task, especially when there is a hierarchical structure. The availability of multiple graphs (or relational graphs), in the multi-graph setting, provides additional information that can be leveraged to improve clustering results. This paper aims to develop a new hierarchical clustering algorithm for multi-graphs, the HTGM algorithm. This algorithm represents the set of graphs in the multi-graph as a 3-way tensor, and maximizes a modularity measure, extending the modularity-based graph clustering algorithm to multi-graphs and tensor structures. We evaluate the proposed algorithm over synthetic and real-world datasets and show the effectiveness of the proposed algorithm by benchmarking it to alternative clustering algorithms.
Collecting statistics from online public SPARQL endpoints is hampered by their fair usage policies. These restrictions hinder several critical operations, such as aggregate query processing, portal development, and data summarization. Online sampling enables the collection of statistics while respecting fair usage policies. However, sampling has not yet been integrated into the SPARQL standard. Although integrating sampling into the SPARQL standard appears beneficial, its effectiveness must be demonstrated in a practical semantic web context. This paper investigates whether online sampling can generate summaries useful in cutting-edge SPARQL federation engines. Our experimental studies indicate that sampling allows the creation and maintenance of summaries by exploring less than 20% of datasets.
In this work, we aim to manipulate and share an entire sparse dataset with a third party privately. As our first main result, we prove that any differentially private mechanism that maintains a reasonable similarity with the initial dataset is doomed to have a very weak privacy guarantee. Next, we consider a variation of k-anonymity, which we call smooth-k-anonymity, and design a simple large-scale algorithm that efficiently provides smooth-k-anonymity. We further perform an empirical evaluation and show that our algorithm improves the performance in downstream machine learning tasks on anonymized data.
In response to growing interest in sustainable living from both governmental and public spheres, there is an increased effort to understand environmental implications. Recommendation systems, which are widely applied in various aspects of daily life, are crucial tools in encouraging and guiding users toward sustainable choices. However, existing public recommendation datasets primarily focus on user-item interactions and lack sufficient emphasis on sustainability, posing significant challenges to developing recommendations for sustainable items. In this work, we enrich a public food recommendation dataset by assigning environmental impact, nutritional impact, and health scores to each recipe, following well-recognized sustainability measurements. Through this work, we aim to lay a groundwork for recommending foods that are both healthy and environmentally conscious, all while maintaining recommendation accuracy.
With the expansion of social network use, the topic of information dissemination has gained significant importance. The spread of rumors and efforts to stop them have led researchers to pay more attention than ever to predicting the influence of each individual on the network. Various methods have been proposed for this purpose, such as the Hybrid Global Structure Model (HGSM), Generalized Gravity Centrality (GGC), and Degree and Neighborhood Centrality (DNC). However, alongside their advantages, they have drawbacks such as high time complexity, low accuracy, or inefficiency in distinguishing between the dissemination abilities of different individuals. Therefore, this paper focuses on a method based on the degree, K-shell, and K-shell diversity in the neighborhood of each individual. Simulations were conducted using the Susceptible-Infected-Recovered (SIR) model and compared with 9 recent methods. Evaluations on 7 different networks in terms of resolution, accuracy, time complexity, and correlation demonstrate the superiority of the proposed method.
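To make the ranking idea above concrete, the following minimal sketch scores nodes by degree, k-shell index, and the number of distinct k-shell values among their neighbors; the additive combination and the diversity definition are our assumptions for illustration, not necessarily the paper's exact formula.

    import networkx as nx

    def spreader_scores(G: nx.Graph) -> dict:
        """Score nodes by degree, k-shell, and neighborhood k-shell diversity."""
        shell = nx.core_number(G)  # k-shell (core) index of every node
        scores = {}
        for v in G:
            # diversity: number of distinct k-shell values among v's neighbors
            diversity = len({shell[u] for u in G.neighbors(v)})
            scores[v] = G.degree(v) + shell[v] + diversity  # assumed additive mix
        return scores

    if __name__ == "__main__":
        G = nx.karate_club_graph()
        top5 = sorted(spreader_scores(G).items(), key=lambda kv: -kv[1])[:5]
        print(top5)  # candidate influential spreaders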
Monocular 3D face reconstruction plays a crucial role in avatar generation, with significant demand in web-related applications such as generating virtual financial advisors in FinTech. Current reconstruction methods predominantly rely on deep learning techniques and employ 2D self-supervision to guide model learning. However, these methods struggle to capture the comprehensive 3D structural information of the face because they are trained on 2D images. To overcome this limitation and enhance the reconstruction of 3D structural features, we propose an innovative approach that integrates existing 2D features with 3D features to guide the model learning process. Specifically, we introduce the 3D-ID Loss, which leverages high-dimensional structural features extracted by a Spectral-Based Graph Convolution Encoder applied to the facial mesh, going beyond the 3D information provided by the facial mesh vertex coordinates alone. Our model is trained using 2D-3D data pairs from a combination of datasets and achieves state-of-the-art performance on the NoW benchmark.
Causal extraction from text plays a crucial role in various downstream analytical and predictive tasks, such as constructing repositories of causal insights for reasoning. However, existing models often overlook the rich contextual commonsense knowledge that could enhance the reasoning process and evaluate underlying causal mechanisms. In this study, we introduce a knowledge-induced transformer architecture for predicting causality. Our model accepts an antecedent and a set of contextual knowledge as input, then ranks plausible consequences from a given set of hypotheses. To enhance semantic understanding, we augment the transformer with a relational graph network, which computes fine-grained semantic information between the antecedent, knowledge, and hypotheses using a similarity matrix that quantifies word-to-word similarity. We evaluate the proposed architecture against state-of-the-art models using openly available datasets and demonstrate its superior performance.
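As a rough illustration of the word-to-word similarity matrix mentioned above, the sketch below computes pairwise cosine similarities between two sets of token embeddings; the random embeddings stand in for whatever contextual encoder the model uses, and the normalization choice is an assumption.

    import torch
    import torch.nn.functional as F

    def similarity_matrix(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        """a: (n, d) and b: (m, d) token embeddings -> (n, m) cosine similarities."""
        return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).t()

    antecedent = torch.randn(6, 768)   # 6 antecedent tokens (toy embeddings)
    knowledge = torch.randn(12, 768)   # 12 contextual-knowledge tokens
    S = similarity_matrix(antecedent, knowledge)  # fed to the relational graph network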
Predicting customer preferences for each item is a prerequisite module for most recommender systems in e-commerce. However, the sparsity of behavioral data is often a challenge to learn accurate prediction models. Given millions of items, each customer may only be able to interact with a small subset of them over time. This sparse behavioral data is insufficient to represent item-customer and item-item relations for a machine learning model to digest, resulting in limited prediction accuracy that hinders recommendation performance. To mitigate this issue, this study introduces an inter-sequence data augmentation method, SDAinter, that enhances data density by leveraging cross-customer behavioral patterns to enrich item relations. Tested on three public and one proprietary e-commerce dataset, SDAinter significantly increases data density, leading to notable improvements in both evaluation and business metrics. Our findings demonstrate SDAinter's effectiveness and its potential to complement existing data augmentation strategies in recommender systems. See https://github.com/ML-apollo/SDA_inter.
Websites utilize several approaches to detect automated agents. Such agents are deployed either for beneficial purposes, such as search engine crawling, or to perform tasks on behalf of an adversary, such as scanning for vulnerabilities. Recent detection methods analyze the behavior that agents exhibit when visiting a website. In this paper, (i) we describe a deep learning framework that analyzes triggered browser events to classify the visitor; (ii) we develop two adversarial attacks that bypass the defense by generating adversarial vectors misclassified by the model; and (iii) we discuss how applicable the attacks are by reviewing the limitations of the popular tools (i.e., Selenium and Puppeteer) used to develop automated agents based on full-fledged browsers.
Selecting urban regions for metro network expansion to meet maximal transportation demand is crucial for urban development, yet computationally challenging to solve. The expansion process relies not only on complicated features such as urban demographics and origin-destination (OD) flow but is also constrained by the existing metro network and urban geography. In this paper, we introduce a reinforcement learning framework to address a Markov decision process over an urban heterogeneous multi-graph. Our approach employs an attentive policy network that intelligently selects nodes based on information captured by a graph neural network. Experiments on real-world urban data demonstrate that our proposed methodology substantially improves satisfied transportation demand, by over 30% compared with state-of-the-art methods. Codes are published at https://github.com/tsinghua-fib-lab/MetroGNN.
Recently, with the increasing size of real-world networks, graph engines have been studied extensively for efficient graph analysis. As one of the state-of-the-art single-machine-based graph engines, RealGraph^GPU processes large-scale graphs very efficiently thanks to its well-designed architecture and the strong parallel-computing power of the GPU. Via a preliminary analysis, we first observe that RealGraph^GPU has substantial room for further performance improvement in the IOs between storage and the GPU's device memory. This motivates us to present RealGraph^GPU++, a solution that substantially reduces IO time by establishing a direct data path between storage and device memory. Additionally, it employs asynchronous processing of CPU and GPU tasks to issue IO requests more frequently, thereby improving overall performance by achieving higher IO bandwidth. Experimental results on real-world datasets show that RealGraph^GPU++ dramatically outperforms 11 existing state-of-the-art graph engines, including RealGraph^GPU.
Search engine configuration can be quite difficult for inexpert developers. Instead, an auto-configuration approach can be used to speed up development time. Yet such an automatic process usually requires relevance labels to train a supervised model. In this work, we suggest a simple yet highly effective extension to the probabilistic query performance prediction (QPP) framework that allows search algorithms to be auto-configured without relevance labels. Our solution only assumes the availability of a sample of queries in a given domain. We demonstrate the merits of our solution on two common auto-configuration tasks. The first task is similarity model selection, including selection over "compound" models such as re-ranking and fusion. The second is similarity parameter auto-tuning, where we choose among dozens of possible parameter configurations so as to optimize the potential search quality.
Cross-domain recommendation (CDR) has emerged as a promising approach to improve click-through rate (CTR) in the target domain by effectively transferring user interests from the source domain. However, existing methods either use a uniform interest transfer function or focus on user-level personalized transfer functions, neglecting the fact that the transition of user states in the target domain also influences the interests in the source domain. To address this issue, we present the User-State based Interest Transfer network (USIT), a novel method that takes user state evolution into account. USIT contains two main components: a User-State Transition module (UST) and a State-Level Interests Transfer module (SLIT). UST models the evolution of user states by predicting the next state in the target domain. As the user's state evolves, SLIT adaptively weights the interests in the source domain via interest-level mask attention. Extensive offline experiments and online A/B tests demonstrate that USIT significantly outperforms current state-of-the-art models in CDR scenarios. Currently, we have deployed USIT on NetEase Cloud Music, affecting millions of users.
In recent years, JavaScript has become the most widely used programming language, especially in web development. However, writing secure JavaScript code is not trivial, and programmers often make mistakes that lead to security vulnerabilities in web applications. Large Language Models (LLMs) have demonstrated substantial advancements across multiple domains, and their evolving capabilities indicate their potential for automatic code generation based on a required specification, including automatic bug fixing. In this study, we explore the accuracy of LLMs, namely ChatGPT and Bard, in finding and fixing security vulnerabilities in JavaScript programs. We also investigate the impact of context in a prompt on directing LLMs to produce a correct patch of vulnerable JavaScript code. Our experiments on real-world software vulnerabilities show that while LLMs are promising in automatic program repair of JavaScript code, achieving a correct bug fix often requires an appropriate amount of context in the prompt.
Existing Neural Machine Translation (NMT) models mainly handle translation in the general domain, while overlooking domains with special writing formulas, such as e-commerce and legal documents. Taking e-commerce as an example, the texts usually include large numbers of domain-specific terms and exhibit more grammatical irregularities, which leads to inferior performance of current NMT methods. To address these problems, we collect two domain-related resources: a set of term pairs (aligned Chinese-English bilingual terms) and a parallel corpus annotated for the e-commerce domain. Furthermore, we propose a two-step fine-tuning paradigm (named G2ST) with self-contrastive semantic enhancement to transfer a general NMT model to a specialized NMT model for e-commerce. The paradigm can be applied to NMT models based on large language models (LLMs). Extensive evaluations on real e-commerce titles demonstrate the superior translation quality and robustness of our G2ST approach compared with state-of-the-art NMT models such as LLaMA, Qwen, GPT-3.5, and even GPT-4.
Users rely on clever recommendations for items they might like to buy, and service providers rely on clever recommender systems to ensure that their products are recommended to their target audience. Providing explanations for recommendations helps increase transparency and users' overall trust in the system, besides helping practitioners debug their recommendation models. Modern recommendation systems utilize multi-modal data such as reviews and images to provide recommendations. In this work, we propose CAVIAR (Counterfactual explanations for VIsual Recommender systems), a novel method to explain recommender systems that utilize visual features of items. Our explanation is counterfactual and is optimized to be simultaneously simple and effective. Given an item in the user's top-K recommended list, CAVIAR makes a minimal yet meaningful perturbation to the item's image embedding such that it is no longer part of the list. In this way, CAVIAR aims to find the visual features of the item that were most relevant to the recommendation. To lend meaning to the perturbations, we leverage the CLIP model to connect the perturbed image features to textual features. We frame the explanation as a natural language counterfactual by contrasting the observed visual features of the item before and after the perturbation.
Monero is a privacy-focused cryptocurrency that incorporates anonymity networks (such as Tor and I2P) and deploys the Dandelion++ protocol to prevent malicious attackers from linking transactions with their source IPs. However, this paper highlights a vulnerability in Monero's integration of the Tor network, which allows an attacker to successfully deanonymize transactions originating from Monero Tor hidden service nodes at the network-layer level.
Our approach involves injecting malicious Monero Tor hidden service nodes into the Monero P2P network to correlate the onion addresses of incoming Monero Tor hidden service peers with their originating transactions. By sending a signal watermark embedding the onion address over the Tor circuit, we then establish a correlation between the onion address and the IP address of a Monero Tor hidden service node. Ultimately, we correlate transactions with the IPs of Monero Tor hidden service nodes.
Through experimentation on the Monero testnet, we provide empirical evidence of the effectiveness of our approach in successfully deanonymizing transactions originating from Monero Tor hidden service nodes.
Social media has become a crucial conduit for the swift dissemination of information during global crises. However, this also paves the way for the manipulation of narratives by malicious actors. This research delves into the interaction dynamics between coordinated (malicious) entities and organic (regular) users on Twitter amidst the Gaza conflict. Through the analysis of approximately 3.5 million tweets from over 1.3 million users, our study uncovers that coordinated users significantly impact the information landscape, successfully disseminating their content across the network: a substantial fraction of their messages is adopted and shared by organic users. Furthermore, the study documents a progressive increase in organic users' engagement with coordinated content, which is paralleled by a discernible shift towards more emotionally polarized expressions in their subsequent communications. These results highlight the critical need for vigilance and a nuanced understanding of information manipulation on social media platforms.
We report the results of a yearlong effort at the Laboratory for Web Algorithmics and Inria to port the WebGraph framework [4] from Java to Rust. For two decades WebGraph has been instrumental in the analysis and distribution of large graphs for the research community of TheWebConf, but the intrinsic limitations of the Java Virtual Machine had become a bottleneck for very large use cases, such as the Software Heritage Merkle graph [2] with its half a trillion arcs. As part of this clean-slate implementation of WebGraph in Rust, we developed a few ancillary projects bringing to the Rust ecosystem some missing features of independent interest, such as easy, consistent and zero-cost memory mapping of data structures. WebGraph in Rust offers impressive performance improvements over the previous implementation, enabling open-source graph analytics on very large datasets like Common Crawl, on top of a modern systems programming language.
One of the byproducts of message passing neural networks (MPNNs) is their potential bias towards weakly connected nodes, which can result in degraded performance. This paper confirms that as the number of layers increases, this bias becomes more closely associated with an imbalance in the distribution of eigenvector centrality, known as localization, which further amplifies the discrepancy in label influence on nodes, resulting in a performance gap. Therefore, we explore the effectiveness of non-backtracking centrality and PageRank centrality in mitigating this bias in MPNNs.
User lifelong behavior sequences are essential for click-through rate (CTR) prediction tasks in industrial recommender systems. Attention-based modules, especially multi-head target attention (MHTA), have proven effective in aggregating behavior features given a certain target item. However, we found a common phenomenon in which the attention weights in MHTA tend to over-concentrate on merely a small subset of a user's historical behaviors, gradually producing sparse, one-hot-like attention distributions during training, which we call Attention Polarization (AP). These polarized weights on certain behaviors (which we call "attention anchors") can make the model fail to capture a user's diversified interests and harm the learning of behavior embeddings, as the gradients on these features are nearly zero. We introduce two indicators, anchor rate and attention entropy, to measure the magnitude of AP, and propose De-Anchor, a novel method to alleviate it, which can serve as a stand-alone and parameter-efficient plug-in to existing CTR backbones. De-Anchor contains two modules: Anchor-aware Gradient Dropout (AGD), which forces the model to capture diversified interest information from behavior sequences by discarding gradients of non-behavior features, and Target-aware Attention Anchor (TAA), which provides a pseudo behavior to offload excessive weights in MHTA. Extensive offline experiments and industrial online A/B tests demonstrate the efficacy of our method.
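The two polarization indicators can be illustrated with a short sketch; the entropy follows the standard definition, while the threshold-based anchor-rate computation below is an assumed simplification rather than the paper's exact definition.

    import torch

    def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
        """attn: (batch, heads, seq) softmax weights over historical behaviors."""
        return -(attn * (attn + eps).log()).sum(dim=-1).mean()

    def anchor_rate(attn: torch.Tensor, tau: float = 0.9) -> torch.Tensor:
        """Fraction of attention rows putting more than tau of their mass on one behavior."""
        return (attn.max(dim=-1).values > tau).float().mean()

    # sharp toy weights mimic polarized attention: low entropy, high anchor rate
    attn = torch.softmax(5.0 * torch.randn(32, 4, 50), dim=-1)
    print(attention_entropy(attn).item(), anchor_rate(attn).item())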
Gradient Inversion Attacks (GIAs) have shown that private training data can be recovered from gradient updates in Federated Learning (FL). However, these GIAs can only recover the entire batch of data with limited performance or stochastically restore some random instances. In this paper, we propose a class-wise targeted attack, named GradFilt, which can reconstruct the training data of some specified class(es) from the batch-averaged gradients. By modifying the parameters of the classification layer, we create a filter within the FL model that eliminates the gradients of non-target data while preserving the gradients of target data. We evaluate GradFilt with image datasets on popular FL model architectures. The results show that GradFilt can effectively reconstruct the desired samples with higher accuracies than the existing GIAs. Moreover, we can also achieve 100% success rate in restoring the batch labels. We hope this work can raise awareness of the privacy risks in FL and inspire effective defense mechanisms.
The evolution of political campaigns is evident with the rise of social media. Ideological beliefs are increasingly disseminated through politically affiliated fan pages. The interaction between politicians and the general public on these platforms plays a pivotal role in election outcomes. In this study, we utilize a multimodal approach to explore and quantify similarities of ideologies among political fan pages. We employ visualization techniques to illustrate the political stance of each fan page. To validate our proposal, we concentrate on an analysis of the 2021 national referendums in Taiwan, encompassing a collection of fan pages and their corresponding posts related to these referendums. Through a qualitative analysis of the content of these fan pages, we evaluate the efficacy of our multimodal framework in clustering fan pages according to their respective political ideologies. The findings of this study underscore the significant improvement in stance detection accuracy when integrating multiple modalities of data, namely textual content, visual imagery, and user interactions.
Open Source Software (OSS) projects play a critical role in the digital infrastructure of companies and services provided to millions of people. Given their importance, understanding the resilience of OSS projects is paramount. A primary reason for OSS project failure is the shock caused by the dropout of a core developer, which can jeopardize productivity and project survival. Using a difference-in-differences (DiD) analysis, this study investigates the repercussions of this shock on the productivity of 8,234 developers identified among 9,573 OSS GitHub projects. Our findings reveal the indirect impact of a core developer's dropout: the remaining developers experienced a 20% productivity drop. This observation is troubling because it suggests that the shock might push other developers to drop out, putting the collaboration structure of the project at risk. Moreover, projects with higher productivity before the shock experienced a larger drop afterwards. This points to a tradeoff between productivity and resilience, i.e., the ability of OSS projects to recover from the dropout of a core developer. Our findings underscore the importance of a balanced approach in OSS project management, harmonizing productivity goals with resilience considerations.
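For readers unfamiliar with the estimator, the toy sketch below shows a bare difference-in-differences specification on a simulated developer panel; the column names and the simulated 20% effect are illustrative, and the study's actual specification includes richer controls.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "treated": rng.integers(0, 2, n),  # developer's project lost a core developer
        "post": rng.integers(0, 2, n),     # observation taken after the dropout
    })
    # simulate a ~20% productivity drop for treated developers after the shock
    df["commits"] = 10.0 * (1 - 0.2 * df["treated"] * df["post"]) + rng.normal(0, 1, n)

    # the coefficient on treated:post is the DiD estimate of the shock's effect
    model = smf.ols("commits ~ treated * post", data=df).fit()
    print(model.params["treated:post"])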
Podcast content modeling is crucial for a variety of practical web uses, such as the recommendation and classification of podcasts. However, previous studies on podcast content modeling rely on task-specific datasets to train dedicated models for each downstream application, which are heavily dependent on labels, and the learned representations do not generalize across tasks. In addition, the rich and intricate structural information among users, podcasts, and topics is neglected. In this paper, we propose to model podcast content without labels and learn general podcast representations without prior knowledge of downstream tasks. Moreover, the learned podcast representations encode crucial structural information, complementary to the independent content information of each podcast. In particular, we first collect a new and large-scale podcast graph from Spotify. Then, we propose Podcast2Vec, a novel self-supervised podcast content modeling method to learn podcast representations. Podcast2Vec captures general transferable knowledge across different tasks and complex structures via a metapath-based neighbor sampling strategy and a multi-view relational modeling framework. Thorough experiments demonstrate the superiority of our method on four real-world podcast content modeling tasks.
This paper proposes a new defense mechanism, GCAMA, against model poisoning attacks on federated learning (FL), which integrates Gradient-weighted Class Activation Mapping (GradCAM) and an autoencoder to offer a more powerful detection capability than existing Euclidean distance-based approaches. In particular, GCAMA generates a heat map for each uploaded local model update, transforming it into a lower-dimensional visual representation, thereby accentuating the hidden features of the heat maps and increasing the success rate of identifying anomalous heat maps and malicious local models. We test the ResNet-18 and MobileNetV3-Large deep learning models with the CIFAR-10 and GTSRB datasets, respectively, under a Non-Independent and Identically Distributed (Non-IID) setting. The results demonstrate that GCAMA yields superior test accuracy for the FL global model compared to state-of-the-art methods. Our code is available at: https://github.com/jjzgeeks/GradCAM-AE
As social media users can easily access, generate, and spread information regardless of its authenticity, the proliferation of fake news related to public health has become a serious problem. Since these rumors have caused severe social issues, detecting them at an early stage is imperative. Therefore, in this paper, we propose a deep learning model that can debunk fake news on COVID-19, as a case study, at the initial stage of emergence. The evaluation on a newly collected dataset consisting of both COVID-19 and non-COVID-19 fake news claims demonstrates that the proposed model achieves high performance, indicating that it can identify fake news on COVID-19 at an early stage with a small amount of data. We believe that our methodology and findings can be applied to detect fake news on newly emerging and critical topics, a task that must be performed with insufficient resources.
Conversational search engines such as YouChat and Microsoft Copilot use large language models (LLMs) to generate responses to queries. It is only a small step to also let the same technology insert ads within the generated responses - instead of separately placing ads next to a response. Inserted ads would be reminiscent of native advertising and product placement, both of which are very effective forms of subtle and manipulative advertising. Considering the high computational costs associated with LLMs, for which providers need to develop sustainable business models, users of conversational search engines may very well be confronted with generated native ads in the near future. In this paper, we thus take a first step to investigate whether LLMs can also be used as a countermeasure, i.e., to block generated native ads. We compile the Webis Generated Native Ads 2024 dataset of queries and generated responses with automatically inserted ads, and evaluate whether LLMs or fine-tuned sentence transformers can detect the ads. In our experiments, the investigated LLMs struggle with the task but sentence transformers achieve precision and recall values above 0.9.
The reasoning and generalization capabilities of LLMs can help us better understand user preferences and item characteristics, offering exciting prospects to enhance recommendation systems. Though effective when user-item interactions are abundant, conventional recommendation systems struggle to recommend cold-start items without historical interactions. To address this, we propose utilizing LLMs as data augmenters to bridge the knowledge gap on cold-start items during training. We employ LLMs to infer user preferences for cold-start items based on textual descriptions of users' historical behaviors and new item descriptions. The augmented training signals are then incorporated into learning the downstream recommendation models through an auxiliary pairwise loss. Through experiments on public Amazon datasets, we demonstrate that LLMs can effectively augment the training signals for cold-start items, leading to significant improvements in cold-start item recommendation for various recommendation models.
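A minimal sketch of the auxiliary pairwise loss, assuming a BPR-style formulation over items the LLM judged the user would or would not prefer; the pairing scheme, the weighting coefficient, and the placeholder main loss are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def auxiliary_pairwise_loss(pos_scores: torch.Tensor,
                                neg_scores: torch.Tensor) -> torch.Tensor:
        """Scores of cold-start items the LLM inferred the user prefers vs. does not."""
        return -F.logsigmoid(pos_scores - neg_scores).mean()

    pos = torch.randn(64, requires_grad=True)  # recommender scores for LLM-preferred items
    neg = torch.randn(64, requires_grad=True)  # recommender scores for LLM-dispreferred items
    main_loss = torch.tensor(0.7)              # placeholder for the usual training objective
    loss = main_loss + 0.1 * auxiliary_pairwise_loss(pos, neg)
    loss.backward()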
As emerging digital assets, NFTs are susceptible to anomalous trading behaviors due to the lack of stringent regulatory mechanisms, potentially causing economic losses. In this paper, we conduct the first systematic analysis of four non-fungible token (NFT) markets. Specifically, we analyze more than 25 million transactions within these markets to explore the evolution of wash trading activities. Furthermore, we propose a heuristic algorithm that integrates the network characteristics of transactions with behavioral analysis to detect wash trading activities in NFT markets. Our findings indicate that NFT markets with incentivized structures exhibit higher proportions of wash trading volume compared to those without incentives. Notably, the LooksRare and X2Y2 markets are detected with wash trading volume proportions as high as 94.5% and 84.2%, respectively.
The emergence of filter bubbles leads to various harms. To mitigate filter bubbles, some recent works select seeds for different viewpoints to minimize the formation of bubbles under the influence propagation model. Different from these works, in which the diffusion network remains unchanged, in this paper we conduct the first attempt to mitigate filter bubbles via edge insertion. To be more general, we focus on mitigating filter bubbles for a given target node set, since the audience can differ across scenarios. Specifically, we propose the concept of an openness score for each target node, which serves as a metric to assess the likelihood of this node being influenced by multiple viewpoints simultaneously. Given a directed graph G, two seed sets, a positive integer k, and a target node set, we aim to find k edges incident to the given seeds such that the total openness score is maximized. We prove the NP-hardness of the studied problem. A baseline method is first presented by extending the greedy framework. To handle large graphs efficiently, we develop a sampling-based strategy, and a data-dependent approximation method is developed with theoretical guarantees. Experiments on real social networks demonstrate the advantages of the proposed techniques.
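The greedy baseline can be sketched generically as below; the openness(G, targets) oracle is a placeholder for the paper's estimate of total openness under the influence propagation model, and the candidate-edge enumeration is simplified.

    import networkx as nx

    def greedy_edge_insertion(G: nx.DiGraph, candidate_edges, targets, k, openness):
        """Greedily pick k edges (incident to seeds) that maximize total openness."""
        chosen = []
        for _ in range(k):
            base = openness(G, targets)
            best_edge, best_gain = None, float("-inf")
            for e in candidate_edges:
                if e in chosen or G.has_edge(*e):
                    continue
                G.add_edge(*e)                      # tentatively insert the edge
                gain = openness(G, targets) - base  # marginal openness gain
                G.remove_edge(*e)
                if gain > best_gain:
                    best_edge, best_gain = e, gain
            if best_edge is None:
                break
            G.add_edge(*best_edge)
            chosen.append(best_edge)
        return chosen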
The presence of a large number of bots in Online Social Networks (OSNs) leads to undesirable social effects. Graph neural networks (GNNs) are effective in detecting bots as they utilize user interactions. However, class-imbalance issues can degrade bot detection performance. To address this, we propose an over-sampling strategy for GNNs (OS-GNN) that generates samples for the minority class without edge synthesis. First, node features are mapped to a feature space through neighborhood aggregation. Then, we generate samples for the minority class in that feature space. Finally, the augmented features are used to train the classifiers. This framework is general and can be easily extended to different GNN architectures. The proposed framework is evaluated on three real-world bot detection benchmark datasets and consistently exhibits superiority over the baselines.
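A minimal sketch of the over-sampling idea, assuming mean neighborhood aggregation and SMOTE-style interpolation between minority-class nodes; the actual OS-GNN aggregation and sampling details may differ.

    import numpy as np

    def aggregate(adj: np.ndarray, X: np.ndarray) -> np.ndarray:
        """Mean aggregation over each node's neighborhood (including itself)."""
        A = adj + np.eye(adj.shape[0])
        return (A @ X) / A.sum(axis=1, keepdims=True)

    def oversample_minority(H, y, minority, n_new, rng=np.random.default_rng(0)):
        """Interpolate between minority-class nodes in the aggregated feature space."""
        idx = np.where(y == minority)[0]
        a, b = rng.choice(idx, n_new), rng.choice(idx, n_new)
        lam = rng.random((n_new, 1))
        H_new = H[a] + lam * (H[b] - H[a])  # synthetic minority samples, no new edges
        return np.vstack([H, H_new]), np.concatenate([y, np.full(n_new, minority)])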
The global rollout of 5G mobile networks has prompted discussions on deployment strategies. Given the knowledge gap in the current deployment strategies of 5G base stations, understanding the deployment experience from regions with widespread 5G base stations is valuable for guiding future deployments elsewhere. In this study, based on a large data set collected from a metropolitan city in China, we discover the misalignment between 5G traffic demand and the number of base stations. Then we introduce a factor to quantify the misalignment. Our analysis indicates the following important observations. Firstly, unique traffic patterns of functional areas contribute to different misalignment factors, i.e., transport areas exhibit a positive factor, in contrast to the negative factor observed in urban comprehensive and residential areas. Secondly, regions with a high density of base stations still suffer from low energy and resource utilization efficiency due to their high energy consumption. Thirdly, our analysis reveals that 5G base stations are frequently located in areas with large 4G traffic, yet the incomplete migration of traffic to 5G results in misalignment. This understanding of the 5G deployment experience can help further studies on optimizing energy efficiency and network utilization rate of the mobile networks.
Industrial recommender systems usually consist of a retrieval stage and a ranking stage to handle billions of users and items. The retrieval stage retrieves candidate items relevant to user interests for recommendation and has attracted much attention. Frequently, a user shows refined multi-interests in a hierarchical structure. For example, a user may like Conan and Kuroba Kaito, which are roles in the hierarchical structure "Animation, Japanese Animation, Detective Conan". However, most existing methods ignore this hierarchical nature and simply average the fine-grained interest information. Therefore, we propose a novel two-stage approach that explicitly models refined multi-interests in a hierarchical structure for recommendation. In the first stage, hierarchical multi-interest mining, hierarchical clustering and a transformer-based model adaptively generate the circles or sub-circles that users are interested in. In the second stage, the partition of the retrieval space allows the EBR models to deal only with items within each circle and accurately capture users' interests. Experimental results show that the proposed approach achieves state-of-the-art performance. Our framework has also been deployed at Lofter.
Rumor detection aims to identify and mitigate potentially damaging falsehoods, thereby shielding the public from misleading information. However, existing methods fall short of tackling class imbalance, i.e., that rumors are less common than true messages, as they lack specific adaptation for the context of rumor dissemination. In this work, we propose Dual Graph Networks with Synthetic Oversampling (SynDGN), a novel method that can determine whether a claim made on social media is a rumor in the presence of class imbalance. SynDGN properly utilizes dual graphs to integrate social media contexts and user characteristics to make accurate predictions. Experiments conducted on two well-known datasets verify that SynDGN consistently outperforms state-of-the-art models, regardless of whether the data is balanced.
Data binding in web front-end development has made a significant contribution to removing complexity from development and simplifying programming. However, data binding reduces the burden on programmers at the cost of degraded website performance. In this paper, we propose Visible Anchor to resolve the performance degradation caused by data binding, and we develop a compiler called FaST that implements the method. We then compare the rendering time of websites built with existing methods and with the FaST compiler. The evaluation reveals that websites built with the FaST compiler render at least 2.9 times faster than those built with existing methods. FaST thus makes a significant contribution to improving the performance of web front-end data binding, and data binding with FaST can be a better choice for web front-end development.
Automated fact-checking is a crucial task in the governance of internet content. Although various studies utilize advanced models to tackle this issue, a significant gap persists in addressing complex real-world rumors and deceptive claims. To address this challenge, this paper explores the novel task of flaw-oriented fact-checking, including aspect generation and flaw identification. We also introduce RefuteClaim, a new framework designed specifically for this task. Given the absence of an existing dataset, we present FlawCheck, a dataset created by extracting and transforming insights from expert reviews into relevant aspects and identified flaws. The experimental results underscore the efficacy of RefuteClaim, particularly in classifying and elucidating false claims.
Travel demand forecasting is a vital problem in the development of smart cities, infrastructure planning, and transportation management. The advent of contactless smart card systems has enabled the collection of data regarding daily transit and purchasing activities, providing a rich source of insights into citizen behavior. In this paper, we introduce a new problem of predicting changes in travel demand resulting from the installation of a new facility while preserving privacy. To address this problem, we propose a simple but effective supervised learning method that can capture the relationships between residential areas and existing facility locations, and exploit spatial features to forecast future demand in response to a new facility location. As a workable example, we employ real-world data to predict the future travel demand triggered by the installation of a new station in a railway system. Through extensive experiments, we demonstrate that our method improves the prediction accuracy.
Non-Fungible Tokens (NFTs) are digital assets recorded on the blockchain, providing cryptographic proof of ownership over digital or physical items. Although Solana has only begun to gain popularity in recent years, its NFT market has seen substantial transaction volumes. In this paper, we conduct the first systematic research on the characteristics of Solana NFTs from two perspectives: longitudinal measurement and wash trading security audit. We gathered 132,736 Solana NFTs from Solscan and analyzed the sales data within these collections. Investigating users' economic activity and NFT owner information reveals that purchase activity in the Solana NFT ecosystem is heavily skewed toward top users. Subsequently, we employ the Local Outlier Factor algorithm to conduct a wash trading audit on 2,175 popular Solana NFTs. We discovered that 138 NFT pools are involved in wash trading, with 8 of these NFTs having a wash trading rate exceeding 50%. Fortunately, none of these NFTs have been entirely washed out.
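The wash-trading audit step can be approximated with an off-the-shelf Local Outlier Factor, as in the sketch below; the per-trade features and the contamination rate are illustrative stand-ins for the paper's actual feature set and thresholds.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(42)
    # toy per-trade features: [log sale price, holding time, buyer-seller reuse count]
    trades = rng.normal(size=(1000, 3))
    trades[:20] += 6.0  # a small cluster of anomalous trades

    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.02)
    labels = lof.fit_predict(trades)  # -1 marks potential wash trades
    print(f"flagged {(labels == -1).mean():.1%} of trades as potential wash trading")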
Existing pairing and authentication mechanisms adopt either fuzzy commitment or fuzzy password-authenticated key exchange for device fingerprint generation, which must detect and correct multiple symbol errors, leaving them vulnerable to guessing attacks and increasing pairing time. In this study, we propose a one-shot pairing and authentication approach that generates a device fingerprint from selected contextual data using Median-of-medians (Moms), ensuring randomness and preventing guessing attacks. Moreover, we integrate the Moms secret into Password Authenticated Key Exchange (PAKE) to reduce pairing time and improve security. The evaluation demonstrates that our proposed one-shot pairing and authentication approach ensures strong resistance against information gain, reduces the probability of guessing attacks, and significantly decreases pairing time compared to state-of-the-art approaches.
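The median-of-medians step is a classical selection routine; the sketch below shows one plain variant, while how the resulting value is quantized into fingerprint bits is scheme-specific and omitted here.

    def median_of_medians(values, group_size=5):
        """Recursively take the median of group medians (classical MoM selection)."""
        groups = [sorted(values[i:i + group_size])
                  for i in range(0, len(values), group_size)]
        medians = [g[len(g) // 2] for g in groups]
        if len(medians) <= group_size:
            return sorted(medians)[len(medians) // 2]
        return median_of_medians(medians, group_size)

    # e.g., contextual samples (toy values) -> a robust representative value
    print(median_of_medians([7, 1, 9, 3, 5, 8, 2, 6, 4, 0]))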
In recent years, natural language processing (NLP) models have demonstrated remarkable performance in text classification tasks. However, trust in the decision-making process requires a deeper understanding of the operational principles of these networks, so there is an urgent need to enhance the transparency and interpretability of these "black boxes". Aligned with this, we propose a model-agnostic interpretability method named MCG. This method generates counterfactual interpretations that are more faithful to the original model's behavior through a multi-round dialogue, in which a new template is generated based on the evaluation of the previous counterfactual interpretation. In addition, MCG offers a solution to improve model performance through counterfactual data augmentation for cases where the model to be interpreted misclassifies an input, a setting rarely covered by existing counterfactual methods. Extensive experiments on three datasets demonstrate that MCG outperforms current state-of-the-art methods in counterfactual generation for interpretability.
Core computations in Graph Neural Network (GNN) training and inference are often mapped to sparse matrix operations such as sparse-dense matrix multiplication (SpMM). These sparse operations are harder to optimize by manual tuning because their performance depends significantly on the sparsity of input graphs, GNN models, and computing platforms. To address this challenge, we present iSpLib, a PyTorch-based C++ library equipped with auto-tuned sparse operations. iSpLib expedites GNN training with a cache-enabled backpropagation that stores intermediate matrices in local caches. The library offers a user-friendly Python plug-in that allows users to take advantage of our optimized PyTorch operations out-of-the-box for any existing linear algebra-based PyTorch implementation of popular GNNs (Graph Convolution Network, GraphSAGE, Graph Isomorphism Network, etc.) with only two lines of additional code. We demonstrate that iSpLib obtains up to 27x overall training speedup compared to the equivalent PyTorch 2.1.0 and PyTorch Geometric 2.4.0 implementations on the CPU. Our library is publicly available at https://github.com/HipGraph/iSpLib (archived at https://doi.org/10.5281/zenodo.10806511).
Providing online content monetized via ads to users is a lucrative business. But what if the content is pirated or illicit, thus harming the brand safety of the advertiser? In this paper, we are the first to investigate Ad Laundering: a technique with which bad actors deceive advertisers by hiding illicit content within evidently lawful websites to monetize the generated traffic. We develop a client-side detection methodology to detect and analyze websites performing ad laundering. We describe in detail the techniques these websites use to cloak content, and provide estimations for the ad revenues they are able to collect on a monthly basis. Finally, we attribute the generated revenue to different traffic channels and establish that even popular brands have their ads rendered next to undesirable content.
Decentralized finance (DeFi) protocols are crypto projects developed on the blockchain to manage digital assets. Attacks on DeFi have been frequent and have resulted in losses exceeding $77 billion. However, detection methods for malicious DeFi events are still lacking. In this paper, we propose DeFiTail, the first framework that utilizes deep learning to detect access control and flash loan exploits that may occur on DeFi. Since DeFi protocol events involve invocations spanning multi-account transactions, DeFiTail unifies execution paths across different contracts. Moreover, to mitigate the impact of mistakes in Control Flow Graph (CFG) connections, we validate the data path by employing the symbolic execution stack. Furthermore, we feed the data paths through our model to inspect DeFi protocols. Experimental results indicate that DeFiTail achieves the highest accuracy, with 98.39% on access control and 97.43% on flash loan exploits. DeFiTail also demonstrates an enhanced capability to detect malicious contracts, reaching 86.67% accuracy on the CVE dataset.
The yield of a chemical reaction quantifies the percentage of the target product formed in relation to the reactants consumed during the chemical reaction. Accurate yield prediction can guide chemists toward selecting high-yield reactions during synthesis planning, offering valuable insights before dedicating time and resources to wet lab experiments. While recent advancements in yield prediction have led to overall performance improvement across the entire yield range, an open challenge remains in enhancing predictions for high-yield reactions, which are of greater concern to chemists. In this paper, we argue that the performance gap in high-yield predictions results from the imbalanced distribution of real-world data skewed towards low-yield reactions, often due to unreacted starting materials and inherent ambiguities in the reaction processes. Despite this data imbalance, existing yield prediction methods continue to treat different yield ranges equally, assuming a balanced training distribution. Through extensive experiments on three real-world yield prediction datasets, we emphasize the urgent need to reframe reaction yield prediction as an imbalanced regression problem. Finally, we demonstrate that incorporating simple cost-sensitive re-weighting methods can significantly enhance the performance of yield prediction models on underrepresented high-yield regions.
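As a concrete instance of such re-weighting, the sketch below weights a squared-error loss by the inverse frequency of each sample's yield bin so that rare high-yield reactions count more; the bin count and normalization are illustrative choices, not the paper's prescription.

    import torch

    def inverse_frequency_weights(yields: torch.Tensor, n_bins: int = 10) -> torch.Tensor:
        """Per-sample weights proportional to the inverse frequency of their yield bin."""
        bins = torch.clamp((yields * n_bins).long(), max=n_bins - 1)  # yields in [0, 1]
        counts = torch.bincount(bins, minlength=n_bins).float().clamp(min=1.0)
        w = 1.0 / counts
        w = w / w.mean()  # normalize weights around 1
        return w[bins]

    def weighted_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return (inverse_frequency_weights(target) * (pred - target) ** 2).mean()

    y = torch.rand(256) ** 3  # skewed toward low yields, as in real reaction data
    p = torch.rand(256, requires_grad=True)
    weighted_mse(p, y).backward()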
Existing multigraph convolution methods either ignore the cross-view interaction among multiple graphs or induce extremely high computational cost due to standard cross-view polynomial operators. To alleviate this problem, this paper proposes Simple MultiGraph Convolution Networks (SMGCN), which first extracts consistent cross-view topology from multigraphs, including edge-level and subgraph-level topology, and then performs polynomial expansion based on the raw multigraphs and the consistent topologies. In theory, SMGCN utilizes the consistent topologies in polynomial expansion rather than standard cross-view polynomial expansion, which performs credible cross-view spatial message-passing, follows the spectral convolution paradigm, and effectively reduces the complexity of standard polynomial expansion. Experimental results demonstrate that SMGCN achieves state-of-the-art performance on the ACM and DBLP multigraph benchmark datasets. Our codes are available at https://github.com/frinkleko/SMGCN.
Recent studies have revealed that federated learning (FL), once considered secure due to clients not sharing their private data with the server, is vulnerable to attacks such as client-side training data distribution inference, where a malicious client can recreate the victim's data. While various countermeasures exist, they are not practical, often assuming server access to some training data or knowledge of label distribution before the attack.
In this work, we bridge the gap by proposing InferGuard, a novel Byzantine-robust aggregation rule aimed at defending against client-side training data distribution inference attacks. In our proposed InferGuard, the server first calculates the coordinate-wise median of all the model updates it receives. A client's model update is considered malicious if it significantly deviates from the computed median update. We conduct a thorough evaluation of our proposed InferGuard on five benchmark datasets and perform a comparison with ten baseline methods. The results of our experiments indicate that our defense mechanism is highly effective in protecting against client-side training data distribution inference attacks, even against strong adaptive attacks. Furthermore, our method substantially outperforms the baseline methods in various practical FL scenarios.
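A minimal sketch of the median-based screening described above: compute the coordinate-wise median of the received updates and drop updates that deviate too far from it. The concrete deviation test used here (a multiple of the median distance) is an illustrative choice rather than InferGuard's exact rule.

    import numpy as np

    def median_screen_aggregate(updates: np.ndarray, factor: float = 2.0):
        """updates: (n_clients, n_params) flattened model updates."""
        median = np.median(updates, axis=0)              # coordinate-wise median
        dist = np.linalg.norm(updates - median, axis=1)  # deviation of each client
        benign = dist <= factor * np.median(dist)        # screen out large deviations
        return updates[benign].mean(axis=0), np.where(~benign)[0]

    updates = np.random.normal(size=(10, 1000))
    updates[0] += 5.0  # one malicious-looking client
    aggregated, flagged = median_screen_aggregate(updates)
    print("flagged clients:", flagged)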
We study the link between K-anonymity and differential privacy as the basis for deriving a novel method for noise estimation. We make the following contributions. First, we use the birthday-bound paradox for uniqueness to estimate the noise level ε in an (ε, δ)-differential privacy scheme. Second, our group-aware formulation provides resilience to a series of inference attacks by using the group privacy property in our group-centric formulation. Third, we draw a connection between the attacker advantage δ and ε for the univariate and multivariate cases. Finally, we demonstrate applicability to the Laplace, Gaussian, and exponential mechanisms.
Educational Data Records (EDR) are crucial for capturing teaching behavior and student information, forming the basis for achieving educational intelligence. However, ensuring educational privacy has become a pressing concern, posing practical challenges to the use and sharing of educational data. To address the issue of EDR privacy preservation, we present EduSyn, a privacy-preserving data release scheme that utilizes generative diffusion models and differential privacy methods. Specifically, we adopt a diffusion modeling scheme that can be applied to both discrete and continuous data types to accommodate the characteristics of EDR, while an invariant Post Randomization (PRAM) perturbation method that satisfies local differential privacy is applied, before model training, to data attributes that require special protection. We conduct comprehensive validation of this scheme within the domain of education applications, showing that EduSyn generates a superior private EDR dataset compared to similar generative methods and strikes a better privacy-utility trade-off.
Decentralized Exchanges (DEXs), leveraging blockchain technology and smart contracts, have emerged in decentralized finance. However, the DEX project with multi-contract interaction is accompanied by complex state logic, which makes it challenging to solve state defects. In this paper, we conduct the first systematic study on state derailment defects of DEXs. These defects could lead to incorrect, incomplete, or unauthorized changes to the system state during contract execution, potentially causing security threats. We propose StateGuard, a deep learning-based framework to detect state derailment defects in DEX smart contracts. StateGuard constructs an Abstract Syntax Tree (AST) of the smart contract, extracting key features to generate a graph representation. Then, it leverages a Graph Convolutional Network (GCN) to discover defects. Evaluating StateGuard on 46 DEX projects with 5,671 smart contracts reveals its effectiveness, with a precision of 92.24%. To further verify its practicality, we used StateGuard to audit real-world smart contracts and successfully authenticated multiple novel CVEs.
Social media platforms have become one of the main channels through which people disseminate and acquire information, and their reliability is severely threatened by rumors spreading through the network. Existing approaches to combating rumors, such as suspending users or broadcasting real information, are either costly or disruptive to users. In this paper, we introduce a novel rumor mitigation paradigm in which only a minimal set of links in the social network is intervened on to decelerate the propagation of rumors, countering misinformation with low business cost and low user awareness. A knowledge-informed agent embodying rumor propagation mechanisms is developed, which intervenes in the social network with a graph neural network for capturing information flow on social media platforms and a policy network for selecting links. Experiments on real social media platforms demonstrate that the proposed approach can effectively alleviate the influence of rumors, substantially reducing the affected populations by over 25%. Codes for this paper are released at https://github.com/tsinghua-fib-lab/DRL-Rumor-Mitigation.
Spambot activity has become increasingly pervasive on social media platforms such as X (formerly known as Twitter), leading to concerns over information quality and user experience. This study presents an innovative approach for real-time detection and reporting of spambots on the Twitter platform. Using data analytics techniques, we adapt a comprehensive framework capable of accurately identifying and categorizing spambot accounts based on their behavioral patterns and characteristics. By providing an efficient solution to this growing issue, our research aims to enhance user trust in social media communication channels and promote a more transparent and authentic online environment for users to engage with each other and share information.
This paper aims to answer the question of whether to use the impression log in evaluating news recommendation models. We start with the claim that testing with an impression log composed only of hard-negative news (i.e., the impression (IMP)-based test) is not beneficial for evaluating models precisely. Based on this claim, we discuss a way of evaluating models by employing all kinds of negative news articles (i.e., the Total test). We also propose a more efficient way of evaluating models by sampling only a small number of negative articles (i.e., the random-sampling (RS)-based test). We verify our claim by extensively comparing the evaluation results of six models under the IMP-based, Total, and RS-based tests: the RS-based test determines the superiority among the models more accurately than the IMP-based test while providing higher efficiency than the Total test. Therefore, our answer to the question above is "do not employ the impression log in testing models, even if it is available." This result is meaningful because it enables news recommendation researchers and practitioners, who have been using the impression log and thus heading in the wrong direction, to turn to the right one.
In this paper, we explore the potential of large language models (LLMs) in generating personalized online advertisements (ads) tailored to specific personality traits, focusing on openness and neuroticism. We conducted a user study involving two tasks to understand the performance of LLM-generated ads compared to human-written ads in different online environments. Task 1 simulates a social media environment where users encounter ads while scrolling through their feed. Task 2 mimics a shopping website environment where users are presented with multiple sponsored products side-by-side. Our results indicate that LLM-generated ads targeting the openness trait positively impact user engagement and preferences, with performance comparable to human-written ads. Furthermore, in both scenarios, the overall effectiveness of LLM-generated ads was found to be similar to that of human-written ads, highlighting the potential of LLM-generated personalised content to rival traditional advertising methods with the added advantage of scalability. This study underscores the need for cautious consideration in the deployment of LLM-generated content at scale. While our findings confirm the scalability and potential effectiveness of LLM-generated content, there is an equally pressing concern about the ease with which it can be misused.
Click-through rate (CTR) prediction plays an indispensable role in online recommendation and advertising platforms. Numerous deep learning based models have been proposed to improve CTR prediction accuracy, and they typically leverage user behavior sequences to capture users' shifting preferences. However, these historical sequences of user interactions often suffer from severe homogeneity and scarcity compared to the extensive item pool. Relying solely on such sequences for user representations is inherently restrictive, as user interests extend beyond the scope of items they have previously engaged with. To address this challenge, we propose a data-driven approach to enrich user representations. We recognize user profiling and recall items as two ideal data sources within the cross-stage framework, encompassing the u2u (user-to-user) and i2i (item-to-item) aspects, respectively, because of their higher relevance to target users and ranking items, as well as their greater diversity. In this paper, we propose a novel architecture named Recall-Augmented Ranking (RAR). RAR consists of two key sub-modules, namely the Cross-Stage User and Item Selection Module and the Co-Interaction Module. These sub-modules synergistically gather information from a vast pool of look-alike users and recall items, resulting in enriched user representations. Notably, RAR is orthogonal to many existing CTR models, allowing for seamless integration and consistent performance improvements in a plug-and-play manner. Extensive experiments are conducted on CTR prediction benchmarks, which verify the efficacy and compatibility of RAR against state-of-the-art methods.
As the digital commerce landscape continues to expand with rating platforms, the consumer base has similarly grown, marking a pivotal reliance on user ratings and reviews. However, the rise of fraudulent users leveraging deceitful conduct, such as manipulation of product rankings, challenges the credibility of these platforms and compels the necessity for an effective detection model. Amid the challenges of evolving fraudulent patterns and label scarcity, this paper presents a novel model, Burstiness-aware Bipartite Graph Neural Networks (BurstBGN), which combats fraud by exploiting user-product bipartite graphs and timestamped rating activities. BurstBGN encapsulates two key ideas: the modeling of user-product interaction via historical rating data through an Edge-time GNN module and the exhaustive mapping of bursty fraudulent user activities. The performance of BurstBGN is demonstrated through rigorous benchmarking against established methods across three datasets. Our results show that BurstBGN consistently outperforms these methods under both transductive and inductive settings, confirming its effectiveness in detecting fraudulent users from limited annotated data, and thereby providing a safeguard for maintaining user trust in e-commerce platforms.
Common click-through rate (CTR) prediction recommender models tend to exhibit feature-level bias, which leads to unfair recommendations among item groups and inaccurate recommendations for users. While existing methods address this issue by adjusting the learning of CTR models, such as through additional optimization objectives, they fail to consider how the bias is caused within these models. In this paper, we discover a generation path of feature-level bias: biased positive sample ratios → biased linear weights in the CTR model → biased prediction scores → biased recommendations. Based on this understanding, we propose a minimally invasive yet effective strategy to counteract feature-level bias in CTR models by removing the biased linear weights from well-trained models. Additionally, we present a linear weight adjusting strategy that requires fewer random exposure records than relevant debiasing methods. The superiority of our proposed strategies is validated through extensive experiments on three real-world datasets. The code is available at https://github.com/mitao-cat/feature-level_bias
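As an illustration of the "remove the biased linear weights" idea, here is a minimal sketch assuming a CTR model whose logit contains an explicit linear term per feature (as in logistic-regression or factorization-machine style models); the function and variable names are hypothetical, and the paper's actual adjustment strategy is more involved.

```python
import numpy as np

# Toy CTR logit with a linear term per feature plus a higher-order part:
#   logit(x) = w_0 + sum_f w[f] * x[f] + interaction(x)
# The paper's insight (paraphrased): biased positive-sample ratios inflate the
# linear weights of certain item-group features; a minimally invasive fix is to
# drop those weights from the already-trained model at inference time.

def debiased_logit(x, w0, w, interaction, biased_feature_ids):
    w_adj = w.copy()
    w_adj[biased_feature_ids] = 0.0          # remove only the biased linear weights
    return w0 + float(np.dot(w_adj, x)) + interaction(x)
```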
As natural language models like ChatGPT become increasingly prevalent in applications and services, the need for robust and accurate methods to detect their output is of paramount importance. In this paper, we present GPT Reddit Dataset (GRiD), a novel Generative Pretrained Transformer (GPT)-generated text detection dataset designed to assess the performance of detection models in identifying generated responses from ChatGPT. The dataset consists of a diverse collection of context-prompt pairs based on Reddit, with human-generated and ChatGPT-generated responses. We provide an analysis of the dataset's characteristics, including linguistic diversity, context complexity, and response quality. To showcase the dataset's utility, we benchmark several detection methods on it, demonstrating their efficacy in distinguishing between human and ChatGPT-generated responses. This dataset serves as a resource for evaluating and advancing detection techniques in the context of ChatGPT and contributes to the ongoing efforts to ensure responsible and trustworthy AI-driven communication on the internet. Finally, we propose GpTen, a novel tensor-based GPT text detection method that is semi-supervised in nature since it only has access to human-generated text and performs on par with fully-supervised baselines.
Online news articles encompass a variety of modalities such as text and images. How can we learn a representation that incorporates information from all those modalities in a compact and interpretable manner? In this paper, we propose CITEM (Compact Interpretable Tensor graph multi-modal news EMbedding), a tensor-based framework for compact and interpretable multi-modal news representations. CITEM generates a tensor graph consisting of a news similarity graph for each modality and employs a tensor decomposition to produce compact and interpretable embeddings, each dimension of which is a heterogeneous co-cluster of news articles and corresponding modalities. We extensively validate CITEM compared to baselines on two news classification tasks: misinformation news detection and news categorization. The experimental results show that CITEM performs within the same range of AUC as state-of-the-art baselines while producing 7x to 10.5x more compact embeddings. In addition, each embedding dimension of CITEM is interpretable, representing a latent co-cluster of articles.
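As a rough illustration of the tensor-graph idea, the sketch below stacks one news-news similarity graph per modality into a three-mode tensor and applies a CP decomposition with TensorLy; the rank, the random initialization, and the use of `parafac` are assumptions for illustration rather than CITEM's exact pipeline.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

def embed_news(similarity_graphs, rank=32):
    """Stack one (n_news, n_news) similarity graph per modality into a
    (n_news, n_news, n_modalities) tensor and factorize it with CP."""
    tensor = tl.tensor(np.stack(similarity_graphs, axis=-1))
    weights, (news_a, news_b, modality) = parafac(tensor, rank=rank, init="random")
    # Each of the `rank` components couples a soft cluster of articles (news_a)
    # with the modalities (modality factor) under which they are similar.
    return news_a  # compact news embeddings of shape (n_news, rank)
```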
Generating mobile traffic in urban contexts is important for network optimization. However, existing solutions struggle to capture the complex temporal features of mobile traffic. In this paper, we propose a Knowledge-Guided Conditional Diffusion model (KGDiff) for controllable mobile traffic generation, where a customized denoising network of the diffusion model is designed to explore the temporal features of mobile traffic. Specifically, we design a frequency attention mechanism that incorporates an Urban Knowledge Graph (UKG) to adaptively capture implicit correlations between mobile traffic and urban environments in the frequency domain. This approach enables the model to generate network traffic corresponding to different environments in a controlled manner, enhancing the model's controllability. Experiments on one real-world dataset show that the proposed framework has good controllability and can improve generation fidelity with gains surpassing 19%.
Stack Overflow is widely recognized by software practitioners as the go-to resource for addressing technical issues and sharing practical solutions. While not typically seen as a scholarly forum, users on Stack Overflow commonly refer to academic sources in their discussions. Yet, little is known about these referenced academic works and how they intersect with the needs and interests of the Stack Overflow community. To bridge this gap, we conducted an exploratory large-scale study on the landscape of academic references in Stack Overflow. Our findings reveal that Stack Overflow communities with different domains of interest engage with academic literature at varying frequencies and speeds. These contrasting patterns suggest that some disciplines may have diverged in their interests and development trajectories from the corresponding practitioner community. Finally, we discuss the potential of Stack Overflow in gauging the real-world relevance of academic research.
Inspired by the success of graph contrastive learning, researchers have begun exploring the benefits of contrastive learning over hypergraphs. However, these works have the following limitations in modeling the high-order relationships over unlabeled data: (i) They primarily focus on maximizing the agreements among individual node embeddings while neglecting the capture of group-wise collective behaviors within hypergraphs; (ii) Most of them disregard the importance of the temperature index in discriminating contrastive pairs during contrast optimization. To address these limitations, we propose a novel dual-level Hypergraph Contrastive Learning framework with Adaptive Temperature (HyGCL-AdT) to boost contrastive learning over hypergraphs. Specifically, unlike most works that merely maximize the agreement of node embeddings in hypergraphs, we propose a dual-level contrast mechanism that not only captures the individual node behaviors in a local context but also models the group-wise collective behaviors of nodes within hyperedges from a community perspective. Besides, we design an adaptive temperature-enhanced contrastive optimization to improve the discrimination ability between contrastive pairs. Empirical experiments conducted on seven benchmark hypergraphs demonstrate that HyGCL-AdT exhibits excellent effectiveness compared to state-of-the-art baseline models. The source code is available at https://github.com/graphprojects/HyGCL-AdT.
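To show where the temperature enters contrastive optimization, here is a generic node-level InfoNCE loss with a learnable temperature in PyTorch; this is a simplified stand-in for, not a reproduction of, HyGCL-AdT's dual-level contrast and adaptive-temperature design.

```python
import torch
import torch.nn.functional as F

class InfoNCEWithLearnableTemperature(torch.nn.Module):
    """Node-level contrastive loss whose temperature is a trainable parameter
    (a simplified stand-in for adaptive-temperature contrastive optimization)."""
    def __init__(self, init_tau=0.5):
        super().__init__()
        self.log_tau = torch.nn.Parameter(torch.tensor(float(init_tau)).log())

    def forward(self, z1, z2):
        # z1, z2: (n_nodes, dim) embeddings of two augmented views, row-aligned.
        z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
        tau = self.log_tau.exp().clamp(min=1e-3)
        logits = z1 @ z2.t() / tau                    # (n, n) similarity matrix
        labels = torch.arange(z1.size(0), device=z1.device)
        return F.cross_entropy(logits, labels)        # positives lie on the diagonal
```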
Various cohesive models are widely employed for the analysis of social networks to identify critical users or key relationships, with the k-core being a particularly popular approach. Existing works, such as the anchor k-core problem, aim to maximize the k-core by anchoring nodes (the degrees of anchor nodes are set to infinity). However, we find that node merging can also enlarge the k-core size. Different from anchoring nodes, node merging can cause both degree increases and decreases, which brings more challenges. In this paper, we study the core maximization by node merging problem (CMNM) and prove its hardness. Given this hardness, we first present a greedy framework. To scale to large networks, we categorize potentially influential nodes and provide a detailed analysis of all node merging pairs. Then, based on these analyses, a fast and effective algorithm is developed. Finally, we conduct comprehensive experiments on real-world networks to evaluate the effectiveness and efficiency of the proposed method.
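A brute-force version of such a greedy framework can be sketched with NetworkX as below: at each step it tries candidate merges and keeps the pair that most enlarges the k-core. This is only meant to make the problem concrete; the paper's scalable algorithm prunes the candidate pairs instead of enumerating all of them.

```python
import networkx as nx

def greedy_core_maximization(G, k, budget):
    """Naive greedy sketch of CMNM: repeatedly merge the node pair that most
    enlarges the k-core (O(n^2) per step; for illustration only)."""
    G = G.copy()
    for _ in range(budget):
        base = nx.k_core(G, k).number_of_nodes()
        best_gain, best_pair = 0, None
        nodes = list(G.nodes())
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                H = nx.contracted_nodes(G, u, v, self_loops=False)  # merge u and v
                gain = nx.k_core(H, k).number_of_nodes() - base
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:
            break
        G = nx.contracted_nodes(G, *best_pair, self_loops=False)
    return G
```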
Predicting click-through rates (CTR) is a fundamental task for Web applications, where a key issue is to devise effective models for feature interactions. Current methodologies predominantly concentrate on modeling feature interactions within an individual sample, while overlooking the potential cross-sample relationships that can serve as a reference context to enhance the prediction. To address this deficiency, this paper develops a Retrieval-Augmented Transformer (RAT), aiming to acquire fine-grained feature interactions within and across samples. By retrieving similar samples, we construct augmented input for each target sample. We then build Transformer layers with cascaded attention to capture both intra- and cross-sample feature interactions, facilitating comprehensive reasoning for improved CTR prediction while retaining efficiency. Extensive experiments on real-world datasets substantiate the effectiveness of RAT and suggest its advantage in long-tail scenarios. The code will be open-sourced.
Recommender systems mainly tailor personalized recommendations according to user interests learned from user feedback. However, such recommender systems passively cater to user interests and even reinforce existing interests in the feedback loop, leading to problems like filter bubbles and opinion polarization. To counteract this, proactive recommendation actively steers users towards developing new interests in a target item or topic by strategically modulating recommendation sequences. Existing work for proactive recommendation faces significant hurdles: 1) overlooking the user feedback in the guidance process; 2) lacking explicit modeling of the guiding objective; and 3) insufficient flexibility for integration into existing industrial recommender systems. To address these issues, we introduce an Iterative Preference Guidance (IPG) framework. IPG performs proactive recommendation in a flexible post-processing manner by ranking items according to their IPG scores that consider both interaction probability and guiding value. These scores are explicitly estimated with iteratively updated user representation that considers the most recent user interactions. Extensive experiments validate that IPG can effectively guide user interests toward target interests with a reasonable trade-off in recommender accuracy. The code is available at https://github.com/GabyUSTC/IPG-Rec.
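The post-processing nature of IPG can be illustrated with a small re-ranking sketch that mixes an item's interaction probability with its guiding value; the linear combination, the `alpha` weight, and the dictionary inputs are illustrative assumptions, whereas the paper estimates these quantities with an iteratively updated user representation.

```python
def ipg_rerank(candidate_items, interaction_prob, guiding_value, alpha=0.5, top_k=10):
    """Post-processing sketch: rank candidates by a score that trades off the
    chance that the user interacts with the item against how much the item
    steers the user's preference toward the target interest."""
    scores = {
        item: (1 - alpha) * interaction_prob[item] + alpha * guiding_value[item]
        for item in candidate_items
    }
    return sorted(candidate_items, key=lambda i: scores[i], reverse=True)[:top_k]
```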
In public blockchains, leaking secret keys can cause the permanent loss of crypto assets. It is imperative to understand the illicit activities on blockchains related to leaked keys. This paper presents the first measurement study that uncovers, quantifies, and characterizes the actual misuses of the leaked keys from top websites on the Internet to withdraw assets on Ethereum. By finding key-leaking web pages and joining them with transactions, the study reveals that assets worth 7.29 million USD on the Ethereum mainnet and 0.59 million USD on Binance Smart Chain (BSC) are withdrawn from 1,421 and 1,514 leaked secret keys, respectively. Mitigations are proposed to avoid the financial loss caused by leaked keys.
The homepage of an E-Commerce website may accommodate multiple and diverse recommendation modules, with each module designed to cover some facet of the user's needs. Commonly, the recommendation modules are ordered in the same way for all homepage users, which leads to a sub-optimal user experience. In this work, we present a novel personalized module ordering solution that provides a more informed way to determine an ordering of the homepage modules based on historical user interactions. Overall, we evaluate our solution and demonstrate its merits.
Our society is facing rampant misinformation harming public health and trust. To address the societal challenge, we introduce FACT-GPT, a framework leveraging Large Language Models (LLMs) to assist fact-checking. FACT-GPT, trained on a synthetic dataset, identifies social media content that aligns with, contradicts, or is irrelevant to previously debunked claims. Our evaluation shows that our specialized LLMs can match the accuracy of larger models in identifying related claims, closely mirroring human judgment. This research provides a solution for efficient claim matching, demonstrates the potential of LLMs in supporting fact-checkers, and offers valuable resources for further research in the field.
Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. In practice, this can work better, but sometimes also worse, than the unnormalized dot product between the embedded vectors. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless 'similarities.' For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations is employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
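A small NumPy demonstration of the underlying non-uniqueness: a linear model's predictions A Bᵀ are unchanged by an arbitrary diagonal rescaling (A D⁻¹)(B D)ᵀ, yet the cosine similarities among the rows of B and of BD differ, so "the" cosine similarity of such embeddings is not well defined. The matrices here are random toy data, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.normal(size=(5, 3)), rng.normal(size=(4, 3))   # user / item embeddings
D = np.diag([1.0, 10.0, 0.1])                             # arbitrary diagonal rescaling

A2, B2 = A @ np.linalg.inv(D), B @ D                       # same predictions ...
assert np.allclose(A @ B.T, A2 @ B2.T)

def cosine(M):
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ M.T

print(np.max(np.abs(cosine(B) - cosine(B2))))              # ... but item-item cosines differ
```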
Search results hold paramount importance for both users and advertisers. However, re-ranking results based on the timing of user clicks on listing advertisements poses a considerable challenge. In this study, we introduce a dynamic re-ranking method that re-ranks search results when a listing advertisement containing the search terms is viewed after the initial presentation of search results. The proposed method leverages relevance feedback to enhance search results, providing valuable support to users in their searches. We conducted a comparative evaluation of accuracy between the methods with and without the additional weight. The results suggest that considering the timing of clicks on listing ads when re-ranking results can reflect user interest more effectively than approaches that do not take this timing into account. This study contributes to the advancement of search result presentation strategies by incorporating dynamic user behavior considerations.
Estimating position bias is a well-known challenge in Learning to Rank (L2R). Click data in e-commerce applications, such as targeted advertisements and search engines, provides implicit but abundant feedback to improve personalized rankings. However, click data inherently includes various biases like position bias. Based on the position-based click model, Result Randomization and the Regression Expectation-Maximization algorithm (REM) have been proposed to estimate position bias, but they require various paired observations of (item, position). In real-world advertising scenarios, marketers frequently display advertisements in a fixed pre-determined order, which creates difficulties in estimation due to the limited availability of various pairs in the training data, resulting in a sparse dataset. We propose a variant of REM that utilizes item embeddings to alleviate the sparsity of (item, position) pairs. Using a public dataset and an internal carousel advertisement click dataset, we empirically show that item embedding with Latent Semantic Indexing (LSI) and Variational Auto-Encoder (VAE) improves the accuracy of position bias estimation, and that the estimated position bias enhances Learning to Rank performance. We also show that LSI is more effective as an embedding creation method for position bias estimation.
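For reference, the vanilla EM update for the position-based click model is sketched below; the paper's variant additionally ties item relevance to item embeddings (e.g., from LSI or a VAE) so that sparse (item, position) pairs can still be estimated, which this sketch omits.

```python
import numpy as np

def pbm_em(logs, n_positions, n_items, n_iter=50):
    """Vanilla EM for the position-based model P(click) = theta[pos] * gamma[item].
    `logs` is a list of (item, position, clicked) tuples."""
    theta = np.full(n_positions, 0.5)   # examination probability per position
    gamma = np.full(n_items, 0.5)       # relevance probability per item
    for _ in range(n_iter):
        t_num = np.zeros(n_positions); t_den = np.zeros(n_positions)
        g_num = np.zeros(n_items);     g_den = np.zeros(n_items)
        for item, pos, clicked in logs:
            if clicked:
                e_exam, e_rel = 1.0, 1.0
            else:
                denom = 1.0 - theta[pos] * gamma[item]
                e_exam = theta[pos] * (1 - gamma[item]) / denom   # examined but not relevant
                e_rel = (1 - theta[pos]) * gamma[item] / denom    # relevant but not examined
            t_num[pos] += e_exam; t_den[pos] += 1.0
            g_num[item] += e_rel; g_den[item] += 1.0
        theta = t_num / np.maximum(t_den, 1e-9)
        gamma = g_num / np.maximum(g_den, 1e-9)
    return theta, gamma
```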
Many geographic information systems applications rely on data provided by user devices in the road network, including traffic monitoring, driving navigation, and road closure detection. The underlying signal is generally collected by sampling locations from user trajectories. The sampling process, though critical for various applications, has not been studied sufficiently in the literature. While the most natural way to sample a trajectory may be to use a frequency based algorithm, e.g., sampling locations every x seconds, such a sampling strategy can be quite wasteful in resources (e.g., server-side processing, user battery) as well as stored user data. In this work, we conduct a horizontal study of various location sampling algorithms (based on frequency, road geography, reservoir sampling, etc.) and assess their trade-offs in terms of the size of the stored data and the induced quality of training for prediction tasks (specifically predicting speeds on road segments).
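Among the sampling strategies compared, reservoir sampling admits a particularly compact sketch: it keeps a uniform random sample of k locations from an unbounded trajectory stream. The implementation below is the classic Algorithm R, included only to make that baseline concrete; it is not tied to any specific system in the study.

```python
import random

def reservoir_sample_locations(location_stream, k, seed=0):
    """Algorithm R: maintain a uniform random sample of k locations from a
    trajectory stream without storing the whole trajectory."""
    rng = random.Random(seed)
    reservoir = []
    for i, loc in enumerate(location_stream):
        if i < k:
            reservoir.append(loc)
        else:
            j = rng.randint(0, i)        # inclusive; element i kept with prob k/(i+1)
            if j < k:
                reservoir[j] = loc
    return reservoir
```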
Training Graph Neural Networks (GNNs) efficiently remains a challenge due to the high memory demands, especially during recursive neighborhood aggregation. Traditional sampling-based GNN training methods often overlook the data's inherent structure, such as the power-law distribution observed in most real-world graphs, which results in inefficient memory usage and processing. We introduce a novel framework, Memory-Aware Dynamic Exiting GNN (MADE-GNN), which capitalizes on the power-law nature of graph data to enhance training efficiency. MADE-GNN is designed to be data-aware, dynamically adjusting the depth of feature aggregation based on the connectivity of each node. Specifically, it routes well-connected "head" nodes through extensive aggregation while allowing sparsely connected "tail" nodes to exit early, thus reducing memory consumption without sacrificing model performance. This approach not only addresses the challenge of memory-intensive GNN training but also turns the power-law distribution from a traditional "curse" into a strategic "blessing". By enabling partial weight sharing between the early-exit mechanism and the full model, MADE-GNN effectively improves the representation of cold-start nodes, leveraging the structural information from head nodes to enhance generalization across the network. Our extensive evaluations across multiple public benchmarks, including industrial-level graphs, show that MADE-GNN outperforms existing GNN training methods in both memory efficiency and performance, offering significant improvements particularly for tail nodes. This demonstrates MADE-GNN's potential as a versatile solution for GNN applications facing similar scalability and distribution challenges.
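The routing idea behind degree-based early exiting can be caricatured as follows: nodes whose degree falls below a threshold take the representation produced after an early layer, while well-connected nodes use the full stack. This toy sketch computes the full stack for all nodes and only blends the outputs, so it illustrates the routing decision rather than the memory savings or weight sharing of MADE-GNN; all names and thresholds are illustrative.

```python
import torch

def degree_aware_forward(x, adj, layers, degrees, head_threshold=10, exit_layer=1):
    """Toy routing sketch: "tail" nodes exit after `exit_layer` hops, "head" nodes
    use all layers. `layers` is a list of torch.nn.Linear modules preserving the
    feature dimension; `adj` is a dense (n, n) adjacency; `degrees` is a length-n tensor."""
    h, early_out = x, None
    for depth, layer in enumerate(layers, start=1):
        h = torch.relu(layer(adj @ h))            # simple neighborhood aggregation
        if depth == exit_layer:
            early_out = h                          # snapshot for early-exiting nodes
    is_head = (degrees >= head_threshold).unsqueeze(1).float()
    return is_head * h + (1 - is_head) * early_out
```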
Discovering and reusing research assets such as datasets and computational notebooks is crucial for building research workflows in data-centric studies. The rapid growth of research assets in scientific communities provides scientists with great opportunities to enhance research efficacy but also poses significant challenges in finding suitable materials for specific tasks. Scientists, especially those focusing on cross-disciplinary research, often find it difficult to formulate effective queries to retrieve desired resources. Previous work has proposed query reformulation methods to increase the efficiency of research asset search. However, it relies on existing knowledge graphs and is constrained to computational notebooks only. As research assets utilized by data analytic workflows are in essence heterogeneous, i.e., of distinct kinds and from diversified sources, query reformulation methods in this regard should consider the relationships between different types of research assets. To address the above challenges, we propose a retrieval-augmented query reformulation method for heterogeneous research asset retrieval. It is developed in the context of a Notebook-based virtual research environment (VRE) and offers query reformulation services to other VRE components. We demonstrate the effectiveness of the proposed query reformulation service with experiments on dataset and notebook retrieval. To date, we have indexed 8,954 datasets and 18,158 notebooks. The experimental results show that the proposed service can create useful query suggestions.
Wikipedia is a highly essential platform because of its informative, dynamic, and easily accessible nature. To identify topics/titles warranting their own Wikipedia article, editors of Wikipedia defined "Notability" guidelines. So far, notability has been enforced by humans, which makes scalability an issue. There has been no significant work on notability determination for titles with complex category dependencies. We design a mechanism to identify such titles. We construct a dataset with 9k such titles and propose a category-agnostic approach utilizing graph neural networks for their notability determination. Our system outperforms machine-learning-based and transformer-based classifiers as well as entity salience methods. It provides a scalable alternative for notability detection.
Sarcastic comments are often used to express dissatisfaction with products or events. Mining their topics and targets can provide clues for analyzing the underlying reasons behind the sarcasm, which helps understand user demands and improve products and services. Existing research mainly focuses on mining a single facet of sarcasm, such as the topic or the target, ignoring the complex interrelations between them. To overcome these challenges, this paper proposes a Heterogeneous Information Network fused with Context-Aware Contrastive Learning (HINCCL) method. This approach aims to model multi-view features, including syntactic style, domain knowledge, and textual semantics, through a hierarchical attention aggregation mechanism. Furthermore, a context-aware negative contrastive training strategy is designed to learn differentiated representations between different topic-target pairs. The effectiveness of the proposed method is validated on a dataset constructed in the digital domain.
Incorporating item content information into click-through rate (CTR) prediction models remains a challenge, especially with the time and space constraints of industrial scenarios. The content-encoding paradigm, which integrates user and item encoders directly into CTR models, prioritizes space over time. In contrast, the embedding-based paradigm transforms item and user semantics into latent embeddings, subsequently caching them to optimize processing time at the expense of space. In this paper, we introduce a new semantic-token paradigm and propose a discrete semantic tokenization approach, namely UIST, for user and item representation. UIST facilitates swift training and inference while maintaining a conservative memory footprint. Specifically, UIST quantizes dense embedding vectors into discrete tokens with shorter lengths and employs a hierarchical mixture inference module to weigh the contribution of each user-item token pair. Our experimental results on news recommendation showcase the effectiveness and efficiency (about 200-fold space compression) of UIST for CTR prediction.
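One common way to realize such discrete semantic tokens is residual quantization, where a dense embedding is replaced by a handful of codebook indices; the sketch below is a generic illustration of this idea and is not claimed to be UIST's exact quantizer or its hierarchical mixture inference module.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Turn a dense embedding into a short sequence of discrete tokens by
    repeatedly snapping the residual to the nearest codeword (one token per level).
    `codebooks` is a list of (codebook_size, dim) arrays."""
    residual, tokens = embedding.astype(float), []
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        tokens.append(idx)
        residual = residual - codebook[idx]
    return tokens  # e.g., a few small integers instead of a long float vector
```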
CoSimRank, a favorable measure for assessing node similarity based on graphs, faces computational challenges on real evolving graphs. The best-of-breed algorithm, D-CoSim, for incremental CoSimRank search evaluates similarity changes by summing dot products between two vectors. These vectors are iteratively generated from scratch in the original high-dimensional space, leading to significant costs. In this paper, we propose I-CoSim, a novel efficient dynamic CoSimRank algorithm for evolving graphs. I-CoSim resorts to two low-dimensional Krylov subspaces and maximally reuses previously computed similarities in the original graph, which substantially expedites CoSimRank search on evolving graphs. We also theoretically provide an error bound on the I-CoSim estimation with guaranteed accuracy. Experimental results on real datasets show that I-CoSim is up to 28 times faster than the best-known competitor, with only a slight compromise in accuracy.
Blockchain technology enables a new form of online community: Decentralized Autonomous Organizations (DAOs), where members typically vote on proposals using tokens. Enthusiasts claim DAOs provide new opportunities for openness, horizontality, and democratization. However, this phenomenon is still under-researched, especially given the lack of quantitative studies. This paper presents the first census-like quantitative analysis of the whole ecosystem of DAOs, including 30K DAO communities on the main DAO platforms. This enables us to provide insights into the allegedly "democratic" nature of DAOs, building metrics concerning their lifespan, participation, and power concentration. Most DAOs have a short lifespan and low participation. There is also a positive correlation between community size and voting power concentration. Like other online communities, DAOs seem to follow the iron law: becoming increasingly oligarchic as they grow. Still, a significant number of DAOs of varying sizes defy this idea by being egalitarian by design.
Anomaly detection in time series, which aims to identify unusual patterns, has attracted a lot of attention recently. However, the representations of abnormal and normal data are difficult to distinguish because they are usually entangled. Recently, disentanglement theory based on the variational auto-encoder (VAE) has shown great potential in machine learning and achieved great success in computer vision and natural language processing. In this paper, we propose a novel disentangled anomaly detection approach that adopts VAE-based disentanglement networks for anomaly detection in multivariate time series. The proposed method learns high-quality disentangled latent factors in a continuous representation space to facilitate the identification of anomalies from normal data. Extensive experiments demonstrate that our proposed lightweight model DA-VAE achieves state-of-the-art performance.
This paper addresses the gap between general-purpose text embeddings and the specific demands of item retrieval tasks. We demonstrate the shortcomings of existing models in capturing the nuances necessary for zero-shot performance on item retrieval tasks. To overcome these limitations, we propose generating an in-domain dataset from ten tasks tailored to unlocking models' representation ability for item retrieval. Our empirical studies demonstrate that fine-tuning embedding models on this dataset leads to remarkable improvements in a variety of retrieval tasks. We also illustrate the practical application of our refined model in a conversational setting, where it enhances the capabilities of LLM-based Recommender Agents like Chat-Rec. Our code is available at https://github.com/microsoft/RecAI.
The Digital Services Act (DSA) represents a major legislative framework that mandates large social media providers to file Statements of Reasons (SoRs) to the DSA Transparency Database whenever they remove or restrict access to certain content on their platforms in the EU. In this work, we empirically analyze this unique data source and provide an early look at content moderation decisions of social media platforms in the EU. Our empirical analysis based on more than 156 million SoRs reveals significant differences in content moderation practices and how large social media platforms implement their obligations under the DSA. Our findings have important implications for regulators, suggesting the need to lay out more specific rules that ensure common standards on how social media providers handle rule-breaking content on their platforms.
In copy-move tampering operations, perpetrators often employ techniques such as blurring to conceal tampering traces, posing significant challenges to the detection of object-level targets with intact structures. Focusing on these challenges, this paper proposes an Object-level Copy-Move Forgery Image Detection method based on Inconsistency Mining (IMNet). To obtain complete object-level targets, we customize prototypes for both the source and tampered regions and dynamically update them. Additionally, we extract inconsistent regions between coarse similar regions obtained through self-correlation calculations and regions composed of prototypes. The detected inconsistent regions are used as supplements to the coarse similar regions to refine pixel-level detection. We conduct experiments on three public datasets, which validate the effectiveness and robustness of the proposed IMNet.
This research explores the efficacy of four state-of-the-art Large Language Models (LLMs): GPT-3.5-turbo-0301, Vicuna, PaLM 2, and Dolly in predicting (i) movie genres using audio transcripts of movie trailers and (ii) meta-information such as director and cast details using movie name and its year-of-release (YoR) for Hindi movies. In the contemporary landscape, training models for movie meta-information prediction often demand extensive data and parameters, posing significant challenges. We aim to discern whether LLMs mitigate these challenges. Focusing on Hindi movies within the Flickscore dataset, our study concentrates on trailer data. Preliminary findings reveal that GPT-3.5 stands out as the most effective LLM in predicting movie meta-information. Despite the inherent complexities of predicting diverse aspects such as genres and user preferences, GPT-3.5 exhibits promising capabilities. This research not only contributes to advancing our understanding of LLMs in the context of movie-related tasks but also sheds light on their potential application in Recommendation Systems (RS), indicating a notable leap forward in user preference comprehension and personalized content recommendations.
Backdoor attacks have posed a significant threat to deep neural networks, highlighting the need for robust defense strategies. Previous research has demonstrated that attribution maps change substantially when exposed to attacks, suggesting the potential of interpreters in detecting adversarial examples. However, most existing defense methods against backdoor attacks overlook the untapped capabilities of interpreters, failing to fully leverage their potential. In this paper, we propose a novel approach called interpretation-empowered neural cleanse (IENC) for defending against backdoor attacks. Specifically, integrated gradients (IG) are adopted to bridge interpreters and classifiers to reverse and reconstruct high-quality backdoor triggers. Then, an interpretation-empowered adaptive pruning strategy (IEAPS) is proposed to cleanse the backdoor-related neurons without a pre-defined threshold. Additionally, a hybrid model patching approach is employed to integrate the IEAPS with preprocessing techniques to enhance the defense performance. Comprehensive experiments are conducted on various datasets, demonstrating the potential of interpretations in defending against backdoor attacks and the superiority of the proposed method.
Recent studies have found thousands of malware source code repositories on GitHub. For the first time, we propose to understand the origins and motivations behind the creation of such malware repositories. For that, we collect and profile the authors of malware repositories using a three-fold systematic approach. First, we identify 14K users in GitHub who have authored at least one malware repository. Second, we leverage a pretrained large language model (LLM) to estimate the likelihood of malicious intent of these authors. This innovative approach led us to categorize 3339 as Malicious, 3354 as Likely Malicious, and 7574 as Benign authors. Further, to validate the accuracy and reliability of our classification, we conduct a manual review of 200 randomly selected authors. Third, our analysis provides insights into the authors' profiles and motivations. We find that Malicious authors often have sparse profiles and focus on creating and spreading malware, while Benign authors typically have complete profiles with a focus on cybersecurity research and education. Likely Malicious authors show varying levels of engagement and ambiguous intentions. We see our study as a key step towards understanding the ecosystem of malware authorship on GitHub.
Traditional discriminative approaches in mental health analysis are known for their strong capacity but lack interpretability and demand large-scale annotated data. Generative approaches, such as those based on large language models (LLMs), have the potential to dispense with heavy annotation and to provide explanations, but their capabilities still fall short of discriminative approaches, and their explanations may be unreliable because explanation generation is a black-box process. Inspired by the psychological assessment practice of using scales to evaluate mental states, our method, called Mental Analysis by Incorporating Mental Scales (MAIMS), incorporates two procedures via LLMs: first, the patient completes mental scales; second, the psychologist interprets the collected information from the mental scales and makes informed decisions. Experimental results show that MAIMS outperforms other zero-shot methods and generates more rigorous explanations based on the outputs of the mental scales.
In the age of digital music streaming, playlists on platforms like Spotify have become an integral part of individuals' musical experiences. People create and publicly share their own playlists to express their musical tastes, promote the discovery of their favorite artists, and foster social connections. In this work, we aim to address the question: can we infer users' private attributes from their public Spotify playlists? To this end, we conducted an online survey involving 739 Spotify users, resulting in a dataset of 10,286 publicly shared playlists comprising over 200,000 unique songs and 55,000 artists. Then, we utilize statistical analyses and machine learning algorithms to build accurate predictive models for users' attributes.
Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, a prevailing trend in existing methods involves representing equally-sized music clips in either waveform or spectrogram formats, often overlooking the intrinsic part-whole hierarchies within music. In our quest to comprehend the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while considering their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, a hierarchical contrastive learning objective is crafted to align part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. The effectiveness of our music representation learning from part-whole hierarchies has been empirically validated across multiple downstream tasks, including music classification and cover song identification.
Over recent years, news recommender systems have gained significant attention in both academia and industry, emphasizing the need for a standardized benchmark to evaluate and compare the performance of these systems. Concurrently, Green AI advocates for reducing the energy consumption and environmental impact of machine learning. To address these concerns, we introduce the first Green AI benchmarking framework for news recommendation, known as GreenRec, and propose a metric for assessing the tradeoff between recommendation accuracy and efficiency. Our benchmark encompasses 30 base models and their variants, covering traditional end-to-end training paradigms as well as our proposed efficient only-encode-once (OLEO) paradigm. Through experiments consuming 2000 GPU hours, we observe that the OLEO paradigm achieves competitive accuracy compared to state-of-the-art end-to-end paradigms and delivers up to a 2992% improvement in sustainability metrics.
Generative document retrieval, an emerging paradigm in information retrieval, learns to build connections between documents and identifiers within a single model, garnering significant attention. However, there are still two challenges: (1) neglecting inner-content correlation during document representation; (2) lacking explicit semantic structure during identifier construction. Meanwhile, events have rich relations and a well-defined taxonomy, which could facilitate addressing the above two challenges. Inspired by this, we propose Event GDR, an event-centric generative document retrieval model, integrating event knowledge into this task. Specifically, we utilize an exchange-then-reflection method based on multi-agents for event knowledge extraction. For document representation, we employ events and relations to model the document to guarantee comprehensiveness and inner-content correlation. For identifier construction, we map the events to a well-defined event taxonomy to construct identifiers with explicit semantic structure. Our method achieves significant improvement over the baselines on two datasets, and we hope it also provides insights for future research.
Recent years have witnessed the abundant emergence of heterogeneous graph neural networks (HGNNs) for link prediction. In heterogeneous graphs, different meta-paths connected to nodes reflect different aspects of the nodes' properties. Existing work fuses the multi-aspect properties of each node into a single vector representation, which makes them fail to capture fine-grained associations between multiple node properties. To this end, we propose a heterogeneous graph neural network with Multi-Aspect Node Association awareness, namely MANA. MANA leverages key associations among multi-aspect node properties to achieve link prediction. Specifically, to avoid the loss of effective association information for link prediction, we design a transformer-based Multi-Aspect Association Mining module to capture multi-aspect associations between nodes. Then, we introduce the Multi-Aspect Link Prediction module, empowering MANA to focus on the key associations among all, thus avoiding the negative impact of ineffective associations on the model's performance. We conduct extensive experiments on three widely used datasets from Heterogeneous Graph Benchmark (HGB). Experimental results show that our proposed method outperforms state-of-the-art baselines.
This paper explores the development and application of an automated system designed to extract information from semi-structured interview transcripts. Given the labor-intensive nature of traditional qualitative analysis methods, such as coding, there exists a significant demand for tools that can facilitate the analysis process. Our research investigates various topic modeling techniques and concludes that the best model for analyzing interview texts is a combination of BERT embeddings and HDBSCAN clustering. We present a user-friendly software prototype that enables researchers, including those without programming skills, to efficiently process and visualize the thematic structure of interview data. This tool not only facilitates the initial stages of qualitative analysis but also offers insights into the interconnectedness of topics revealed, thereby enhancing the depth of qualitative analysis.
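The reported pipeline of BERT embeddings followed by HDBSCAN clustering can be approximated with off-the-shelf packages as below; the specific sentence-transformers checkpoint and the minimum cluster size are illustrative choices rather than the prototype's actual configuration.

```python
from sentence_transformers import SentenceTransformer
import hdbscan

def cluster_interview_segments(segments, min_cluster_size=5):
    """Embed interview segments with a BERT-based sentence encoder, then let
    HDBSCAN discover a variable number of themes (label -1 marks noise)."""
    model = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative model choice
    embeddings = model.encode(segments, show_progress_bar=False)
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size, metric="euclidean")
    return clusterer.fit_predict(embeddings)               # one topic label per segment
```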
The rapid explosion of linked data demands effective and efficient storage, management, and querying methods. Apache Spark is one of the most widely used engines for big data processing, with more and more systems adopting it for efficient query answering. Existing approaches, exploiting Spark for querying RDF data, adopt partitioning techniques for reducing the data that need to be accessed in order to improve efficiency. However, simplistic methods for data partitioning fail to minimize data access at query answering and effectively improve query efficiency. In this demonstration, we present DIAERESIS, a novel platform that exploits a summary-based partitioning strategy achieving a significant improvement in minimizing data access and as such improving query-answering efficiency. DIAERESIS first identifies the top-k most important schema nodes and distributes the other schema nodes to the centroid they mostly depend on. Then, it allocates the corresponding instance nodes to the schema nodes they are instantiated under, creating vertical sub-partitions and indexes. We allow conference participants to actively identify the impact of our partitioning methodology on data distribution and replication, data accessed for query answering, and query answering efficiency. Further, we contrast our approach with existing partitioning approaches adopted by state-of-the-art systems in the domain, providing a deep understanding of the challenges in the area.
Fashion analysis refers to the process of examining and evaluating trends, styles, and elements within the fashion industry to understand and interpret its current state, generating fashion reports. It is traditionally performed by fashion professionals based on their expertise and experience, which requires high labour cost and may also produce biased results due to its reliance on a small group of people. In this paper, to tackle the Fashion Report Generation (FashionReGen) task, we propose an intelligent Fashion Analyzing and Reporting system based on advanced Large Language Models (LLMs), dubbed GPT-FAR. Specifically, to deliver FashionReGen based on effective catwalk analysis, the proposed GPT-FAR system is equipped with several key procedures, namely catwalk understanding, collective organization and analysis, and report generation. By posing and exploring such an open-ended, complex, and domain-specific task as FashionReGen, we are able to test the general capability of LLMs in the fashion domain. It also inspires the exploration of more high-level tasks with industrial significance in other domains. A video illustration and more materials on GPT-FAR can be found at https://github.com/CompFashion/FashionReGen.
Bidders often take a long time to read and understand tender documents because they require specialized knowledge and tender documents are generally long. Bidders first overview the specific items, such as payment and warranty, in a tender document and then check the overall document. Therefore, a function that can extract specific items (i.e., an item extractor) and a function that can highlight words or phrases related to specific items (i.e., a word-phrase highlighter) are in great demand. To develop these two types of functions, we need to solve two problems. The first problem relates to the annotated dataset. The second concerns the BERT NER-based prediction approach in a small-training-dataset setting. To solve the first problem, we created two types of sequence labeling datasets, one for the item extractor and one for the word-phrase highlighter. To solve the second problem, we propose an Information Extraction (IE) method that combines (1) a supervised learning approach using Bidirectional Encoder Representations from Transformers (BERT) and (2) a large language model (LLM)-based improver. We then developed a web application system called Tender Document Analyzer (TDDA), which includes the "Item Extractor" and the "Word-Phrase Highlighter". Experimental evaluation shows that our approach is practical. First, the evaluation of extraction ability shows that the performance of our proposed method is much higher than that of a baseline approach using GPT-3.5 and demonstrates that the proposed LLM-based improver can improve IE ability. In addition, the usability evaluation shows that bidders can solve the task in less time using our system.
In recent years, social media has emerged as a pivotal source of emergency response for natural disasters. Causal analysis of disaster sub-events is one of the crucial concerns. However, the design and implementation of its application scenario present significant challenges due to the intricate nature of events and information overload. In this work, we introduce GRACE, a system designed for generating the causes and effects of disaster sub-events from social media text. GRACE aims to provide rapid, comprehensive, and real-time analysis of disaster intelligence. Different from conventional information digestion systems, GRACE employs event evolution reasoning by constructing a causal knowledge graph for disaster sub-events (referred to as DSECG) and fine-tuning GPT-2 on DSECG. This system offers users a comprehensive understanding of disaster events and supports human organizations in enhancing response efforts during disaster situations. Moreover, an online demo is accessible, allowing user interaction with GRACE and providing a visual representation of the causes and effects of disaster sub-events.
Infoboxes can be useful for quickly learning about the contents of text collections, but manually creating them is error-prone and time-consuming, and existing automatic approaches require training data or resources like ontologies that are not available for every domain. Moreover, they lack techniques for adaptation to the user. We therefore propose a system that automatically fills user-defined attributes of infoboxes with a human in the loop, which provides this adaptation and works without training data and domain-specific resources. Our approach generalizes simple user feedback to explore a joint embedding space and find the correct values for the attributes. These structured representations of the texts can be used for collaborative exploration of text collections on the web. We provide a prototypic implementation of such a collaborative web application and demonstrate its usage.
In Hong Kong, the number of elderly citizens will reach one-third of the population within the next decade. To mitigate this problem, timebanking has received attention in recent years. In timebanking, an NGO helper earns time credits through providing voluntary services (e.g., household duties) to elders. These time credits can be used to acquire other services. Although timebanking has shown the promise of promoting mutual care in many countries, its potential has not been fully utilized, due to the lack of IT and data support. We thus develop HINCare, a software platform that supports timebanking for multiple NGOs. Besides providing convenience to NGO supervisors, helpers, and elders, HINCare makes use of a heterogeneous information network (HIN) for recommending suitable helpers to elders. This is the first time a graph-based recommender system is used for such purposes. Currently, HINCare is used by 12 NGOs to serve more than 5000 users in Hong Kong. In this demonstration, participants can play the role of helpers and elders in the HINCare environment.
In this demo paper, we present RealGraphGPUWeb, a web-based graph analysis platform with the following features: (1) an easy-to-use, user-friendly GUI, (2) high processing performance, (3) support for various graph algorithms and data formats, (4) high accessibility anywhere on the web, and (5) no coding requirements. In our demo, we show how a naive user (e.g., a non-CS researcher) obtains graph analysis results conveniently and efficiently with a few clicks through the web-based GUI in RealGraphGPUWeb. We also let the user experience the performance improvement obtained by the optimization strategies employed in RealGraphGPUWeb. We believe that RealGraphGPUWeb can be a good platform not only for CS users but also for non-CS users who want to analyze big graphs for their applications easily and efficiently.
The way scholarly knowledge and in particular literature reviews are communicated today rather resembles static, unstructured, pseudo-digitized articles, which are hardly processable by machines and AI. This demo showcases a novel way to create and publish scholarly literature reviews, also called semantic reviews. The neuro-symbolic approach consists of extracting key insights from scientific papers leveraging neural models and organizing them using a symbolic scholarly knowledge graph. The food information engineering review case study will allow participants to see how this approach is implemented using the Open Research Knowledge Graph (ORKG). The real-time demo will allow participants to play with the ORKG and create their own living, semantic review.
Graph-structured data is ubiquitous among a plethora of real-world applications. However, as graph learning algorithms have been increasingly deployed to help decision-making, there has been rising societal concern about the bias these algorithms may exhibit. In certain high-stakes decision-making scenarios, the decisions made may be life-changing for the involved individuals. Accordingly, abundant explorations have been made to mitigate the bias of graph learning algorithms in recent years. However, there is still no library that collectively consolidates existing debiasing techniques and helps practitioners easily perform bias mitigation for graph learning algorithms. In this paper, we present PyGDebias, an open-source Python library for bias mitigation in graph learning algorithms. As the first comprehensive library of its kind, PyGDebias covers 13 popular debiasing methods under common fairness notions together with 26 commonly used graph datasets. In addition, PyGDebias also comes with comprehensive performance benchmarks and well-documented API designs for both researchers and practitioners. To foster convenient accessibility, PyGDebias is released under a permissive BSD license together with performance benchmarks, API documentation, and usage examples at https://github.com/yushundong/PyGDebias.
Interactive Machine Translation (IMT) advances the computer-aided translation (CAT) paradigm, enabling collaboration between machine translation systems and human translators for high-quality outputs. This paper presents Synslator, a CAT tool designed for IMT and proficient in online learning with real-time translation memories. Synslator accommodates different CAT service deployments by integrating two neural translation models for online learning and a language model to boost translation fluency interactively. Our evaluations demonstrate the system's online learning effectiveness, showing a 13% increase in post-editing efficiency with Synslator's interactive features. A tutorial video is provided at: https://youtu.be/K0vRsb2lTt8.
Recommender systems significantly impact user experience across diverse domains, yet existing frameworks often prioritize offline evaluation metrics, neglecting the crucial integration of A/B testing for forward-looking assessments. In response, this paper introduces a new framework seamlessly incorporating A/B testing into the Cornac recommendation library. Leveraging a diverse collection of model implementations in Cornac, our framework enables effortless A/B testing experiment setup from offline trained models. We introduce a carefully designed dashboard and a robust backend for efficient logging and analysis of user feedback. This not only streamlines the A/B testing process but also enhances the evaluation of recommendation models in an online environment. Demonstrating the simplicity of on-demand online model evaluations, our work contributes to advancing recommender system evaluation methodologies, underscoring the significance of A/B testing and providing a practical framework for implementation. The framework is open-sourced at https://github.com/PreferredAI/cornac-ab.
This paper introduces RecAI, a practical toolkit designed to augment or even revolutionize recommender systems with the advanced capabilities of Large Language Models (LLMs). RecAI provides a suite of tools, including Recommender AI Agent, Recommendation-oriented Language Models, Knowledge Plugin, RecExplainer, and Evaluator, to facilitate the integration of LLMs into recommender systems from multifaceted perspectives. The new generation of recommender systems, empowered by LLMs, is expected to be more versatile, explainable, conversational, and controllable, paving the way for more intelligent and user-centric recommendation experiences. We hope that open-sourcing RecAI can help accelerate the evolution of new advanced recommender systems. The source code of RecAI is available at https://github.com/microsoft/RecAI.
Large language model evaluation plays a pivotal role in enhancing model capabilities. Numerous methods for evaluating large language models have previously been proposed in this area. Despite their effectiveness, these existing works mainly focus on assessing objective questions, overlooking the evaluation of subjective questions, which are extremely common for large language models. Additionally, these methods predominantly utilize centralized datasets for evaluation, with question banks concentrated within the evaluation platforms themselves. Moreover, the evaluation processes employed by these platforms often overlook personalized factors, neglecting to consider the individual characteristics of both the evaluators and the models being evaluated. To address these limitations, we propose BingJian, a novel anonymous crowd-sourcing evaluation platform for large language models that employs a competitive scoring mechanism where users participate in ranking models based on their performance. This platform stands out not only for its support of centralized evaluations to assess the general capabilities of models but also for offering an open evaluation gateway. Through this gateway, users have the opportunity to submit their own questions, testing the models on a personalized and potentially broader range of capabilities. Furthermore, our platform introduces personalized evaluation scenarios, leveraging various forms of human-computer interaction to assess large language models in a manner that accounts for individual user preferences and contexts. A demonstration of BingJian can be accessed at https://github.com/Mingyue-Cheng/Bingjian.
When multiple users adopt collaborative cloud services like Google Drive to work on a shared resource, incorrect or missing permissions may cause conflicting or inconsistent access or use privileges. These issues (or conflicts) compromise resource confidentiality, integrity, or availability, leading to a lack of trust in cloud services. An example conflict is when a user with editor permissions changes the permissions on a shared resource without consent from the original resource owner. In this demonstration, we introduce ACCORD, a web application built on top of Google Drive that can detect and resolve multi-user conflicts. ACCORD employs a simulator to help users preemptively identify potential conflicts and assists them in defining action constraints. Using these constraints, ACCORD can automatically detect and resolve any future conflicts.
Entity resolution, which involves identifying and merging records that refer to the same real-world entity, is a crucial task in areas like Web data integration. This importance is underscored by the presence of numerous duplicated and multi-version data resources on the Web. However, achieving high-quality entity resolution typically demands significant effort. The advent of Large Language Models (LLMs) like GPT-4 has demonstrated advanced linguistic capabilities, which can be a new paradigm for this task. In this paper, we propose a demonstration system named BoostER that examines the possibility of leveraging LLMs in the entity resolution process, revealing advantages in both easy deployment and low cost. Our approach optimally selects a set of matching questions and poses them to LLMs for verification, then refines the distribution of entity resolution results with the response of LLMs. This offers promising prospects to achieve a high-quality entity resolution result for real-world applications, especially to individuals or small companies without the need for extensive model training or significant financial investment.
Cyber-Physical-Social System (CPSS) is a pioneering solution in Crowd Computing that integrates heterogeneous resources from cyber, physical, and social spaces, possessing collaborative capabilities in perception, computation, and control. However, existing CPSSs usually confine their functionality to rigid scenarios or tasks, and often oversimplify human resource modeling, failing to dynamically recognize human capabilities. In this work, we propose a Scenario-Driven CPSS to enable adaptive resource choreography across scenarios. More concretely, we leverage temporal environments to identify events and disassemble these events into workflows, triggering the execution of corresponding capability units, where a capability unit abstracts the shared functionality of a group of heterogeneous resources. Meanwhile, instead of pre-assuming human capabilities, we construct the relationship between humans and their capabilities during the execution of workflows, more fully leveraging human intelligence. Our real-world demo on fire rescue demonstrates the effectiveness of the solution.
Personal knowledge graphs (PKGs) offer individuals a way to store and consolidate their fragmented personal data in a central place, improving service personalization while maintaining full user control. Despite their potential, practical PKG implementations with user-friendly interfaces remain scarce. This work addresses this gap by proposing a complete solution to represent, manage, and interface with PKGs. Our approach includes (1) a user-facing PKG Client, enabling end-users to administer their personal data easily via natural language statements, and (2) a service-oriented PKG API. To tackle the complexity of representing these statements within a PKG, we present an RDF-based PKG vocabulary that supports this, along with properties for access rights and provenance.
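To make the idea of such a vocabulary concrete, the following is a minimal sketch of how a single natural-language statement could be stored in a PKG together with provenance and access-rights properties, written with the Python rdflib library. The pkg: namespace and every property name below are hypothetical placeholders for illustration, not the vocabulary proposed in the paper.

```python
# Minimal sketch (not the paper's actual vocabulary): one personal statement in RDF
# with hypothetical provenance and access-control properties, built with rdflib.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

PKG = Namespace("https://example.org/pkg/")          # hypothetical PKG vocabulary
g = Graph()
g.bind("pkg", PKG)

stmt = URIRef("https://example.org/pkg/statement/42")
g.add((stmt, RDF.type, PKG.Statement))
g.add((stmt, PKG.naturalLanguageText,
       Literal("I am allergic to peanuts", lang="en")))
g.add((stmt, PKG.derivedFrom, URIRef("https://example.org/sources/health-app")))
g.add((stmt, PKG.createdAt,
       Literal("2024-03-01T10:00:00Z", datatype=XSD.dateTime)))
# Hypothetical access-rights property: only the dietician service may read this.
g.add((stmt, PKG.accessibleBy, URIRef("https://example.org/services/dietician")))

print(g.serialize(format="turtle"))
```

A PKG API in the spirit of the one described above could then expose such triples to authorized services while the PKG Client renders them back as plain-language statements.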
In the modern Web landscape, data privacy and control are increasingly unattainable for users. Solid, a decentralized Web ecosystem, restores individual privacy and control by separating data from applications, allowing integration across applications while enabling users to control access. The growth in Solid's decentralized pods, and the escalating amounts of data stored in them, necessitate a decentralized search mechanism to query data within personal datastores, while respecting varying access constraints. This demo paper presents our decentralized search system (ESPRESSO) for querying RDF and non-RDF data in Solid datastores, tackling challenges like query propagation, data indexing, privacy, and results aggregation.
Processing SPARQL queries over large federations of SPARQL endpoints is essential for keeping the Semantic Web decentralized. However, existing federation engines struggle to query more than a dozen endpoints. We recently proposed FedUP, a new type of federation engine based on unions-over-joins query plans that outperforms state-of-the-art federation engines by orders of magnitude on large federations. This demonstration paper introduces Fediscount, a federated online shopping application based on the FedShop benchmark, illustrating the capabilities of FedUP. The application is built on standard Semantic Web technologies, enabling end-users to shop online in a virtual federated store comprising 20, 100, or even 200 SPARQL endpoints. This breakthrough opens up promising new avenues for developing and deploying federated applications.
The proliferation of hate speech (HS) has compromised the safety and trustworthiness of the internet, exacerbating social divides by promoting hatred and discrimination. Although recent studies have produced guidelines and developed advanced technologies for the automated detection of HS, their efficacy and adaptability in real-world applications remain unclear. Furthermore, existing guidelines on what constitutes HS might not reflect the perspectives and beliefs of individuals and communities. This paper introduces Brinjal, a multifaceted web plugin designed for the collaborative detection of HS. Brinjal enables individuals to identify instances of HS and engage in discussions to verify such content, thereby enhancing the collective understanding of HS. Additionally, Brinjal serves as a practical platform for deploying and evaluating advanced HS detection models, facilitating user interaction and performance assessment. Lastly, Brinjal includes an analytical tool for analyzing HS, offering insights based on the crowdsourced instances and discussions about HS across various websites. The video demonstration of Brinjal can be viewed here: https://youtu.be/_JxziIVWBO4. Disclaimer: This paper contains violent and discriminatory content that may be disturbing to some readers.
We present SocialGenPod, a decentralised and privacy-friendly way of deploying generative AI Web applications. Unlike centralised Web and data architectures that keep user data tied to application and service providers, we show how one can use Solid - a decentralised Web specification - to decouple user data from generative AI applications. We demonstrate SocialGenPod using a prototype that allows users to converse with different Large Language Models, optionally leveraging Retrieval Augmented Generation to generate answers grounded in private documents stored in any Solid Pod that the user is allowed to access, directly or indirectly. SocialGenPod makes use of Solid access control mechanisms to give users full control of determining who has access to data stored in their Pods. SocialGenPod keeps all user data (chat history, app configuration, personal documents, etc.) securely in the user's personal Pod, separate from specific model or application providers. Besides better privacy controls, this approach also enables portability across different services and applications. Finally, we discuss challenges, posed by the large compute requirements of state-of-the-art models, that future research in this area should address. Our prototype is open-source and available at: https://github.com/Vidminas/socialgenpod/.
Finding the best short- or long-term accommodation is troublesome in unfamiliar areas. Current tools provided by the real-estate market offer valuable information regarding the property, such as price, photos, and descriptions of the space; however, the market has given little attention to other relevant information about the surrounding area, such as what is nearby and users' subjective perception of the property's neighborhood. To address this gap, we propose REAL-UP, an interactive tool designed to enrich real-estate marketplaces. In addition to information commonly provided by such applications, e.g., rent price, REAL-UP also provides subjective neighborhood information based on Location-Based Social Network (LBSN) messages. This novel tool helps represent users' complex subjective perceptions of urban areas, which can ease the process of finding the best accommodation.
In this work, we introduce Ducho 2.0, the latest stable version of our framework. Unlike Ducho, Ducho 2.0 offers a more personalized user experience through the definition and import of custom extraction models fine-tuned on specific tasks and datasets. Moreover, the new version can extract and process features through multimodal-by-design large models. Notably, all these new features are supported by optimized data loading and storing to local memory. To showcase the capabilities of Ducho 2.0, we demonstrate a complete multimodal recommendation pipeline, from extraction/processing to the final recommendation. The idea is to provide practitioners and experienced scholars with a ready-to-use tool that, placed on top of any multimodal recommendation framework, allows them to run extensive benchmarking analyses. All materials are accessible at: https://github.com/sisinflab/Ducho/
The research on table representation learning, data retrieval, and data integration in the context of data lakes requires large table corpora for the training and evaluation of the developed methods. Over the years, several large table corpora such as WikiTables, GitTables, or the Dresden Web Table Corpus have been published and are used by the research community. This paper complements the set of public table corpora with the Web Data Commons Schema.org table corpora, two table corpora consisting of 4.2 million (Release 2020) and 5 million (Release 2023) relational tables describing products, events, local businesses, job postings, recipes, movies, books, as well as 37 further types of entities. The feature that distinguishes the corpora from all other publicly available large table corpora is that all tables describing entities of a specific type use the same attributes to describe these entities, i.e., all tables use a shared schema, the schema.org vocabulary. The shared schema eases the integration of data from different sources and allows training processes to focus on specific types of entities or specific attributes. Altogether the tables contain ~653 million rows of data, which have been extracted from the Common Crawl web corpus and grouped into separate tables for each class/host combination, i.e., all records of a specific class that originate from a specific website are put into a single table. This paper describes the creation of the WDC Schema.org Table Corpora, gives an overview of the content of the corpora, and discusses their use cases.
Predicting vehicle flow is crucial for traffic management but is often limited by the scope of sensors. In contrast, extensive mobile network coverage enables us to utilize counts of mobile users' network activities (cellular traffic) on roadways as a proxy for vehicle flow. However, cellular traffic counts, which encompass various user types, may not directly align with vehicle flow. To address this issue, we present a new task: utilizing cellular traffic to predict vehicle flow in camera-free areas. This is supported by our Tel2Veh dataset, which comprises extensive cellular traffic and sparse vehicle flows. To tackle this task, we propose a two-stage framework. It first independently extracts features from multimodal data, and then integrates them using a graph neural network (GNN)-based fusion to generate predictions of vehicle flow in camera-free areas. We pioneer the fusion of telecom and vision-based data, paving the way for future expansions in web-based applications and systems.
There is a great diversity of RDF datasets publicly available on the web. Choosing among them requires assessing their "fitness for use" for a particular use case, and thus, finding the right quality measures and evaluating data sources according to them. However, this is not an easy task due to the large number of possible quality measures, and the multiplicity of implementation and assessment platforms. Therefore, there is a need for a common way to define measures and evaluate RDF datasets, using open standards and tools. IndeGx is a SPARQL-based framework to design indexes of Knowledge Graphs declaratively. We extend it to support more advanced data quality measures. We demonstrate our approach by reproducing two existing measures, showing how one can formalize and add measures using such an open declarative framework.
Fact-centric question answering (QA) often requires access to multiple, heterogeneous information sources. By jointly considering several sources like a knowledge base (KB), a text collection, and tables from the web, QA systems can enhance their answer coverage and confidence. However, existing QA benchmarks are mostly constructed with a single source of knowledge in mind. This limits the ability of these benchmarks to fairly evaluate QA systems that can tap into more than one information repository. To bridge this gap, we release CompMix, a crowdsourced QA benchmark that naturally demands the integration of a mixture of input sources. CompMix has a total of 9,410 questions, and features several complex intents like joins and temporal conditions. Evaluation of a range of QA systems on CompMix highlights the need for further research on leveraging information from heterogeneous sources.
Personalization in Information Retrieval has been studied for a long time. Nevertheless, there is still a lack of high-quality, real-world datasets to conduct large-scale experiments and evaluate models for personalized search. This paper contributes to filling this gap by introducing SE-PQA (StackExchange - Personalized Question Answering), a new curated resource to design and evaluate personalized models for the task of community Question Answering (cQA). The contributed dataset includes more than 1 million queries and 2 million answers, annotated with a rich set of features modeling the social interactions among the users of a popular cQA platform. We describe the characteristics of SE-PQA and detail the features associated with questions and answers. We also provide reproducible baseline methods for the cQA task based on the resource, including deep learning models and personalization approaches. The preliminary experiments show the appropriateness of SE-PQA for training effective cQA models; they also show that personalization remarkably improves the effectiveness of all the methods tested. Furthermore, we show the benefits, in terms of robustness and generalization, of combining data from multiple communities for personalization purposes.
We present CNER-UAV, a fine-grained Chinese Named Entity Recognition dataset specifically designed for the task of address resolution in Unmanned Aerial Vehicle (UAV) delivery systems. The dataset encompasses a diverse range of five categories, enabling comprehensive training and evaluation of NER models. To construct this dataset, we sourced the data from a real-world UAV delivery system and conducted a rigorous data cleaning and desensitization process to ensure privacy and data integrity. The resulting dataset, consisting of around 12,000 annotated samples, was annotated by human experts and a Large Language Model. We evaluated classical NER models on our dataset and provide an in-depth analysis. The dataset and models are publicly available at https://github.com/zhhvvv/CNER-UAV.
User and Entity Behavior Analytics (UEBA) is key for managing security risks on information systems and comprehending user activities' impact on the network infrastructure. However, accessing network traffic and Web logs is challenging due to encryption or decentralized systems. Qualifying activities also requires contextualizing them according to the network's topology, as it determines potential exchanges and carries information about which services are used. This complexity hinders learning behavioral patterns when precise user action sequences are needed. We propose to tackle these challenges with Graphameleon, an open-source Web extension for capturing Web navigation traces. We model user activities in an RDF Knowledge Graph (KG), drawing from the UCO and NORIA-O ontologies. With this approach, we are able to distinguish analytics strategies implemented across different websites.
Since the inception of the Internet and the WWW, providing time to the many nodes on the Internet has been one of its most critical challenges. David Mills pioneered timekeeping on the Internet, inventing the Network Time Protocol (NTP) and synchronizing the clocks in computer systems. Today, NTP is used predominantly across the Internet and the WWW. In this paper, we revisit NTP and present an overview of the protocol. In particular, we highlight an advanced research effort, SpaceNTP, which synchronizes the clocks among assets and entities in space. SpaceNTP, designed for space environments, will be a fundamental medium and enabling block for providing future web services in space.
Adobe Flash, once a ubiquitous multimedia platform, played a pivotal role in shaping the digital landscape for nearly two decades. Its capabilities ranged from animated banners and immersive websites to complex online applications and games. Flash content was embedded on websites with the embed or the object element. To the browser, such embedded content is opaque by default, which means Flash content cannot be used for the accessibility tree that the browser creates based on the DOM tree and that platform-specific accessibility APIs use to provide a representation understandable by assistive technologies, such as screen readers. As Flash lost popularity, HTML 5 introduced the canvas element, which for the first time allowed developers to draw graphics and animations with either the canvas scripting API or the WebGL API natively in the browser. Similar to Flash, such canvas-rendered content is opaque by default and unusable for the accessibility tree. Lastly, the implementation of the WebAssembly Garbage Collection (WasmGC) standard in browsers allowed developers to port applications, written in non-Web programming languages like Kotlin for non-Web platforms like Android, to the Web by compiling them to WasmGC and rendering the entire app into a canvas. In the most extreme cases, this means that the entire HTML code of an application can consist of a sole canvas tag, which evidently is opaque to the browser and impossible to leverage for the accessibility tree. Without judging their quality, this paper documents approaches, then and now, for making such opaque Web content more accessible to users of assistive technologies.
Web 2.0 provided impactful tools, based on user-generated content, for political campaigns and opinion engineering. In recent months, however, AI advances and the ease of access to AI-generated content (AIGC) have led to a paradigm shift in political participation by politicians and electorates alike. This paper presents a historical analysis of this shift. We provide anecdotal evidence of new trends, potential impact, and challenges. We discuss the usage of AIGC in political campaigns, and how AIGC is used as a substitute for incarcerated politicians. Such usage presents novel ways for leaders to reach the public and keep them politically active. However, AIGC also carries risks when used for disinformation, such as DeepFake media and caller bots deployed to undermine and malign opponents. On the other hand, the evidence shows that governments can nudge AIGC content by censoring Internet services. We also report challenges facing AIGC usage, such as model bias and hallucinations, along with a governance perspective on ethics and regulations.
Over the last 30 years, the World Wide Web has changed significantly. In this paper, we argue that common practices for preparing web pages for delivery conflict with many efforts to present content with minimal latency, one fundamental goal that has driven changes in the WWW. To bolster our arguments, we revisit the reasons that led to changes in HTTP and compare them systematically with techniques for preparing web pages. We find that the structure of many web pages leverages features of HTTP/1.1 but hinders the use of recent HTTP features to present content quickly. To improve the situation, we propose fine-grained content segmentation. This would make it possible to exploit the streaming capabilities of recent HTTP versions and to render content as quickly as possible without changing the underlying protocols or web browsers.
This essay will briefly narrate the relationship between my professional growth and the evolution of web technology and the internet, especially in Brazil, where I live. I start by briefly discussing my background and how I came to be involved with technology, from my earliest computer experiences to getting online and continuing my schooling. At this time, several significant Web and Internet events occurred, influencing my career path choice. I also write about how my work has focused on digital accessibility and how my engagement has impacted the Web in my country. I close the article by examining my present work environment and how the Web has impacted my life, motivating me to use free and open technology.
2024 will be the largest election year in history, involving over 50 countries and approximately 4.2 billion people. Since 1996, the Web has been instrumental in political campaigns, enhancing public engagement and creating new communication avenues for elections. Nevertheless, the proliferation of generative AI technologies has made the dissemination of false information simpler and quicker, posing a substantial threat to election integrity and democratic processes. The 2024 global elections underscore the need to comprehend and tackle the impact of such technologies on democracy. In this paper, we undertake a detailed meta-analysis, scrutinizing 44 papers published at The Web Conference that detail the influence of the Web on elections. Our research reveals key historical trends in how the Web has impacted elections: first, social media has revolutionized election strategies through direct voter-candidate interactions. Second, big data and algorithm-driven campaigns have become commonplace. Third, AI advancements have exacerbated the spread of fake news, risking election fairness. Drawing predominantly on the studies published since 2018 among these 44 papers, we underscore the necessity for advanced detection tools, policy formulation, and responsible AI use to maintain electoral integrity. This analysis offers insight into the Web and AI's impact on elections, presenting pointers for addressing challenges and leveraging opportunities in the 2024 and future elections.
Diagnosis prediction is becoming crucial for developing healthcare plans for patients based on Electronic Health Records (EHRs). Existing works usually enhance diagnosis prediction by learning accurate disease representations, many of which try to capture inclusive relations based on the hierarchical structures of existing disease ontologies such as those provided by ICD-9 codes. However, they overlook exclusive relations that can reflect different and complementary perspectives of the ICD-9 structures, and thus fail to accurately represent relations among diseases and ICD-9 codes. To this end, we propose to project disease embeddings and ICD-9 code embeddings into boxes, where a box is an axis-aligned hyperrectangle with a geometric region, so that two boxes can clearly "include" or "exclude" each other. On top of the box embeddings, we further obtain patient embeddings by aggregating the disease representations for diagnosis prediction. Extensive experiments on two real-world EHR datasets show significant performance gains brought by our proposed framework, yielding average improvements of 6.04% for diagnosis prediction over state-of-the-art competitors.
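As a rough illustration of the box-embedding idea described above, the sketch below scores how much one axis-aligned box "includes" another via the volume of their intersection. The soft containment score is a simplified stand-in for illustration and is not the paper's exact formulation or training objective.

```python
# Minimal sketch of axis-aligned box embeddings, assuming each disease / ICD-9
# code is a box given by (min corner, max corner). Simplified illustration only.
import torch

def intersection(box_a, box_b):
    """Both boxes are (min, max) tensors of shape (d,). Returns the overlap box."""
    lo = torch.maximum(box_a[0], box_b[0])
    hi = torch.minimum(box_a[1], box_b[1])
    return lo, hi

def volume(lo, hi):
    """Volume of a box; zero if the box is empty along any dimension."""
    side = torch.clamp(hi - lo, min=0.0)
    return torch.prod(side)

def containment_score(child, parent):
    """Vol(child ∩ parent) / Vol(child): near 1 if parent includes child, near 0 if they exclude each other."""
    lo, hi = intersection(child, parent)
    return volume(lo, hi) / (volume(child[0], child[1]) + 1e-9)

# Toy 2-D example: an ICD-9 category box that fully contains a disease box.
icd_box     = (torch.tensor([0.0, 0.0]), torch.tensor([4.0, 4.0]))
disease_box = (torch.tensor([1.0, 1.0]), torch.tensor([2.0, 2.0]))
print(containment_score(disease_box, icd_box))   # close to 1.0 -> inclusion
```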
AI-driven severity assessment techniques for dysarthric disorders show promise in aiding speech-language pathologists with diagnostics and therapeutic follow-ups for patients. Existing solutions generally focus on the average intelligibility and hoarseness of an individual speaker's speech (i.e., speaker-level classification). This potentially ignores the slight variations in pronunciation attributed to the speaker's dysarthric disorders, e.g., between /t/ and /d/. To address this issue, we rethink the inherent characteristics of dysarthric speech and propose a non-intrusive severity assessment approach called DysarNet. Specifically, we first design a prosodic emphasis module based on frame-level speech features to highlight fine-grained temporal changes in pronunciation content, rhythm, and timing. Second, we design a multi-scale aggregation strategy to collect statistical cues on articulatory information at different scales, i.e., frame level and utterance level. By doing so, multi-scale prosodic and articulatory cues directly assist the prediction network in assessing dysarthria severity from multiple views, naturally achieving speaker-independent generalization. Experimental results on the VCC 2018 and TORGO datasets show that our DysarNet excels in assessing dysarthria severity.
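To give a sense of what aggregating statistical cues at different scales can look like, here is a minimal sketch that pools frame-level features both over local windows and over the whole utterance using mean/std statistics. The window size and feature dimensions are invented, and this is not DysarNet's actual aggregation module.

```python
# Minimal sketch (assumptions, not DysarNet's code): aggregating frame-level
# speech features at two scales, local windows and the whole utterance, via
# mean/std statistics pooling before a severity classifier.
import torch

def stats_pool(x):                       # x: (batch, frames, feat_dim)
    return torch.cat([x.mean(dim=1), x.std(dim=1)], dim=-1)

def multi_scale_features(x, window=25):
    b, t, d = x.shape
    # Frame scale: pool over fixed-size windows of frames, then average the windows.
    t_trim = (t // window) * window
    windows = x[:, :t_trim].reshape(b, -1, window, d)
    local = stats_pool(windows.reshape(-1, window, d)).reshape(b, -1, 2 * d).mean(dim=1)
    # Utterance scale: pool over all frames at once.
    global_ = stats_pool(x)
    return torch.cat([local, global_], dim=-1)              # (b, 4 * d)

frames = torch.randn(8, 300, 40)           # e.g. 8 utterances, 300 frames, 40-dim features
print(multi_scale_features(frames).shape)  # torch.Size([8, 160])
```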
Identifying electrical signatures preceding a ventricular arrhythmia from the implantable cardioverter-defibrillators (ICDs) can help predict an upcoming ICD shock. To achieve this, we first deployed a large-scale study (N=326) to continuously monitor the electrogram (EGM) data from the ICDs and select the EGM segments prior to a shock event and under the normal condition. Next, we design a novel cohesive framework that integrates metric learning, prototype learning, and few-shot learning, enabling learning from an imbalanced dataset. We implement metric learning by leveraging a Siamese neural network architecture, which incorporates LSTM units. We innovatively utilize triplet and pair losses in a sequential manner throughout the training process on EGM samples. This approach generates embeddings that significantly enhance the distinction of EGM signals under different conditions. In the inference stage, k-means clustering identifies prototypes representing pre-shock and normal states from these embeddings. In summary, this framework leverages the predictive potential of signals before ICD shocks, addressing the gap in early cardiac arrhythmia detection. Our experimental results show a notable F1 score of 0.87, sensitivity of 0.97, and precision of 0.79. Our framework offers a significant advancement in cardiac care predictive analytics, promising enhanced ICD decision-making for improved patient outcomes.
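The training signal described above can be sketched as follows: an LSTM encoder maps EGM segments to embeddings trained with a triplet margin loss, and k-means later extracts prototypes from those embeddings. All shapes, hyperparameters, and the synthetic tensors standing in for EGM segments are invented for illustration; this is not the study's implementation, and the paper's sequential pair loss stage is omitted.

```python
# Toy sketch under stated assumptions: an LSTM encoder trained with a triplet
# margin loss on EGM segments, followed by k-means to extract pre-shock /
# normal prototypes from the learned embeddings.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class EGMEncoder(nn.Module):
    def __init__(self, in_dim=1, hidden=64, emb=32):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb)

    def forward(self, x):                 # x: (batch, time, in_dim)
        _, (h, _) = self.lstm(x)          # h: (1, batch, hidden)
        return self.proj(h[-1])           # (batch, emb)

encoder = EGMEncoder()
triplet = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# Synthetic stand-ins for anchor / positive / negative EGM segments.
anchor, positive, negative = (torch.randn(16, 200, 1) for _ in range(3))
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
opt.zero_grad(); loss.backward(); opt.step()

# Inference: cluster embeddings into two prototypes (pre-shock vs. normal).
with torch.no_grad():
    emb = encoder(torch.randn(100, 200, 1)).numpy()
prototypes = KMeans(n_clusters=2, n_init=10).fit(emb).cluster_centers_
```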
Mental health is a state of mental well-being that enables people to cope with the stresses of life, realize their abilities, learn well and work well, and contribute to their community. It has intrinsic and instrumental value, is integral to our well-being, and its correlation with environmental factors has been a subject of growing interest. As societal pressure keeps growing, depression has become a severe problem in modern cities, and finding a way to estimate depression rates is important for alleviating the problem. In this study, we introduce a novel Contrastive Language-Image Pretraining (CLIP)-based approach to predict mental health indicators, especially depression rates, from satellite and street view images. Our methodology uses a state-of-the-art Multimodal Large Language Model (MLLM), GPT4-vision, to generate health-related captions for satellite and street view images; we then use the generated image-text pairs to fine-tune the CLIP model, making its image encoder extract health-related features such as green spaces, sports fields, and infrastructural characteristics. The fine-tuning process bridges the semantic gap between textual descriptions and visual representations, enabling a comprehensive analysis of geo-tagged images. Consequently, our methodology achieves a notable R2 value of 0.565 for predicting the depression rate in New York City with the combination of satellite and street view images. The successful deployment of Health CLIP in a real-world scenario underscores the practical applicability of our approach.
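The CLIP fine-tuning step hinges on a symmetric contrastive objective that pulls each image embedding toward the embedding of its generated caption. The sketch below shows that objective only; the encoders, the GPT4-vision captioning step, and the downstream depression-rate regressor are omitted, and the random tensors merely stand in for encoder outputs.

```python
# Minimal sketch of the symmetric contrastive (CLIP-style) loss used to align
# image embeddings with generated health-related captions. Tensors below are
# stand-ins for the image / text encoder outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(img_emb.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Stand-ins for a batch of satellite/street-view embeddings and caption embeddings.
image_features, text_features = torch.randn(32, 512), torch.randn(32, 512)
print(clip_contrastive_loss(image_features, text_features))
```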
Cancer patients face a heightened risk of venous thromboembolism (VTE), which has emerged as the second most prevalent cause of death within this population. Central venous catheterization (CVC), a routine procedure in cancer care, amplifies the VTE risk, leading to catheter-related thrombosis (CRT). Although traditional risk-assessment models (RAMs) and certain AI methods exist for VTE prediction, their capability and application in CRT risk prediction for cancer patients remain limited.
This paper addresses the shortcomings of current RAMs by crafting a dedicated AI model to predict CRT risk for cancer patients. Leveraging a dataset encompassing 10,512 cancer patients undergoing catheterization over a decade, we meticulously select nine specific features for model construction, achieving an AUROC of 0.794, 54.9% higher than the baseline. Furthermore, we estimate the CRT-free probability using the Kaplan-Meier method. We also develop a WeChat Mini Program designed for efficient data collection and risk prediction, enhancing the efficiency of CRT risk detection for both doctors and patients.
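For readers unfamiliar with the Kaplan-Meier step, the following sketch estimates a CRT-free probability curve from follow-up durations and event indicators using the lifelines library. The column names and the tiny toy cohort are invented; the paper's actual data and model are not reproduced here.

```python
# Illustrative sketch (column names and data are invented): estimating CRT-free
# probability with the Kaplan-Meier method via the lifelines library.
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.DataFrame({
    "days_since_catheterization": [30, 45, 60, 90, 120, 180, 200, 365],
    "crt_observed":               [ 0,  1,  0,  1,   0,   0,   1,   0],  # 1 = CRT event
})

kmf = KaplanMeierFitter()
kmf.fit(df["days_since_catheterization"], event_observed=df["crt_observed"],
        label="CRT-free probability")
print(kmf.survival_function_.head())   # estimated CRT-free probability over time
print(kmf.predict(90))                 # CRT-free probability at day 90
```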
As health and social care data networks evolve and adapt to greater digitalization and datafication of health, data and analytics systems are developing and bringing forward new ways to share, access and analyze data. Organizations and individuals making data sharing decisions for AI-enabled health and social care services need to be able to balance the benefits of such uses with the possible risks that may ensue - including those related to issues of privacy and security. In this paper, we provide an overview of our approach to privacy risk assessment for cross-domain access and re-use of sensitive data for research purposes using Spyderisk - an automated risk assessment tool. We apply Spyderisk to a real AI research scenario and consider the ways in which such techniques could support multiple stakeholders to assess privacy and security risks.
In the digital age, where health data and digital lives converge, data privacy and control are crucial. The advent of AI and Large Language Models (LLMs) brings advanced data analysis and healthcare predictions, but also privacy concerns. The ESPRESSO project asserts that for AI to be trustworthy and effective in healthcare, it must prioritize user control over corporate interests. The shift towards decentralized personal online datastores (pods) and Solid principles represents a new era of private, controllable Web interactions, balancing AI data protection and machine intelligence. This balance is particularly important for applications involving health data. However, decentralization poses challenges, particularly in secure, efficient data search and retrieval, that need to be addressed first. We argue that a decentralized search system that provides large-scale search across Solid pods, while respecting data owners' control of their data and users' different access rights, is crucial for this new paradigm. In this paper, we describe how our current decentralized search system prototype (ESPRESSO) helps to query structured and unstructured personal health data in Solid servers. The paper also describes a search scenario that shows how ESPRESSO can search health data combined with personal fitness data stored in different personal datastores.
As AI systems are built and deployed to support mental health services, it is imperative to fully understand the stakeholder acceptability of such systems so that these concerns can be taken into account in system design. As such, we undertook a consultation with staff (therapists) and service-users at Adferiad Recovery (a large mental health charity). The aim was to capture insights about their understanding of trust, and different trust factors for AI in mental health care. Surveys, interviews and focus groups were conducted with service users and therapists. Key takeaways for computer scientists and the developers of AI systems are presented.
Patient risk prediction models are crucial as they enable healthcare providers to proactively identify and address potential health risks. Large pre-trained foundation models offer remarkable performance in risk prediction tasks by analyzing multimodal patient data. However, a notable limitation of pre-trained foundation models lies in their deterministic predictions (i.e., lacking the ability to acknowledge uncertainty). We propose Gaussian Process-based foundation models to enable the generation of accurate predictions with instance-level uncertainty quantification, thus allowing healthcare professionals to make more informed and cautious decisions. Our proposed approach is principled and architecture-agnostic. Experimental results show that our proposed approach achieves competitive performance on classical classification metrics. Moreover, we observe that the accuracy of certain predictions is much higher than that of the uncertain ones, which validates the uncertainty awareness of our proposed method. Therefore, healthcare providers can trust low-uncertainty predictions and conduct more comprehensive investigations on high-uncertainty predictions, ultimately enhancing patient outcomes with less expert intervention.
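The triage logic implied above, trusting low-uncertainty predictions and escalating high-uncertainty ones, can be illustrated with an off-the-shelf Gaussian Process classifier. The paper's approach is architecture-agnostic and sits on top of foundation-model representations; here synthetic features, the uncertainty heuristic, and the 0.3 threshold are all invented stand-ins used only to show the idea.

```python
# Illustrative sketch only: synthetic features and a scikit-learn Gaussian
# Process classifier stand in to show routing high-uncertainty predictions to
# human review. Not the paper's model or data.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))                  # stand-in patient representations
y_train = (X_train[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X_train, y_train)

X_new = rng.normal(size=(5, 8))
proba = gpc.predict_proba(X_new)[:, 1]
uncertainty = 1.0 - np.abs(proba - 0.5) * 2          # 0 = confident, 1 = maximally unsure
for p, u in zip(proba, uncertainty):
    action = "trust prediction" if u < 0.3 else "flag for clinician review"
    print(f"risk={p:.2f}  uncertainty={u:.2f}  -> {action}")
```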
Progressive diagnosis prediction in healthcare is a promising yet challenging task. Existing studies usually assume a pre-defined prior for generating patient distributions (e.g., Gaussian). However, the inferred approximate posterior can deviate from the real-world distribution, which further affects the modeling of continuous disease progression over time. To alleviate such inference bias, we propose an enhanced progressive diagnostic prediction model (i.e., ProCNF), which integrates continuous normalizing flows (CNF) and neural ordinary differential equations (ODEs) to achieve more accurate approximations of patient health trajectories while capturing the continuity underlying disease progression. We first learn patient embeddings with CNF to construct a complex posterior approximation of patient distributions. Then, we devise a CNF-enhanced neural ODE module for progressive diagnostic prediction, which aims to improve the modeling of disease progression for individual patients. Extensive experiments on two real-world longitudinal EHR datasets show significant performance gains brought by our method over state-of-the-art competitors.
Prostate cancer risk prediction (PCRP) is crucial in guiding clinical decision-making and ensuring accurate diagnoses. The area under the receiver operating characteristic curve (AUC) is typically used for the evaluation of PCRP models. However, AUC considers regions with high false positive rates (FPRs), which are not applicable in clinical practice. To address this concern, we propose to use partial AUC (pAUC) as a more clinically meaningful metric which evaluates PCRP models with restricted FPR. Moreover, we propose a new PCRP framework named pAUCP, which optimizes pAUC to train PCRP models and adopts model ensemble to further enhance its usability. We construct clinical datasets obtained from two medical centers over an extended period to evaluate the proposed pAUCP framework. Extensive experiments demonstrate the rationality and superiority of the pAUCP framework, especially the cross-time and cross-center transferability of the obtained PCRP model.
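As a small illustration of evaluating under a restricted false positive rate, scikit-learn's roc_auc_score accepts a max_fpr argument and returns a standardized partial AUC. The synthetic labels and scores below are invented; the pAUCP training objective and model ensemble from the paper are not reproduced.

```python
# Sketch of evaluating a risk model with partial AUC at a restricted FPR, using
# scikit-learn's max_fpr option (standardized partial AUC). Data is synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_score = np.clip(y_true * 0.3 + rng.normal(0.4, 0.25, size=1000), 0, 1)

print("full AUC:              ", roc_auc_score(y_true, y_score))
print("partial AUC (FPR<=0.2):", roc_auc_score(y_true, y_score, max_fpr=0.2))
```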
Disinformation refers to the deliberate dissemination of fake or misleading information, which significantly threatens modern social stability by undermining trust, intensifying polarization, and manipulating public opinion. With the advances of generative AI, the landscape of modern disinformation is changing following the rise of Large Language Models. Recent studies have revealed the capability of generative language models to create convincing and misleading content against the truth, and have warned that such models can be maliciously abused for deceptive generation.
However, AI-driven disinformation is by nature a human-centered societal issue, the realization of which requires not only an in-depth discussion of the latest trends on both sides, generative AI and disinformation, but also a critical analysis of the uncertainty of their potential interaction in practice. The paper introduces a new vision of AI-driven disinformation campaigns from the perspective of human-centered AI, proposes a framework of core research questions based on the existing research gap, discusses preliminary findings from the literature and initial experiments, and elaborates the main lines of research for future work.
Distributed Transparent Data Layer (DTDL) aims to overcome the significant storage inefficiencies in blockchain technology. The proposed scheme enhances scalability and enables broader adoption by allowing nodes to store only portions of the blockchain history, diverging from traditional methods like sharding. It consists of three main components: a transparent authentication scheme, a verifiable search tree, and a data availability sampling scheme, supporting diverse applications including zero-knowledge machine learning. This approach not only maintains transparency to the blockchain's upper layer but also offers seamless integration with existing systems without requiring forks. Additionally, the paper introduces an innovative transparent authentication method for Luby Transform (LT) codes using KZG commitments, enabling efficient and secure verification of encoded symbols without decoding. Addressing the challenges of data outsourcing in blockchain, our proposed model ensures data integrity and robust security in a potentially malicious publisher environment, marking a significant advancement in blockchain storage and data integrity solutions.
Clinical text in electronic health records (EHRs) holds vital cues about a patient's journey, often absent in structured EHR data. Evidence-based healthcare decisions demand accurate extraction and modeling of these cues. The goal of our study is to predict Type-II Diabetes by utilizing concept-based models of visit sequences from longitudinal EHR data. We undertake the challenging task of fine-grained temporal information extraction from clinical text using a recent span-based approach with pre-trained transformers. We achieve a new state of the art in end-to-end relation extraction on the 2012 clinical temporal relations corpus. We propose to apply our model to a new dataset and extract patient-centric temporal knowledge graphs from patient visits, fusing temporal orderings within documents and across visits. Beyond the current focus of our work on Type-II Diabetes risk prediction from EHRs, our versatile framework can be extended to other domains, including web-based healthcare systems for personalized medicine. It can model not only health outcomes with long progression timelines but also various socio-economic outcomes such as conflict, natural disasters, and financial markets, by leveraging news, reports, and social-media text to extract and model irregular time series, informing a variety of web-based applications and policies.
Utilizing graph analytics and learning has proven to be an effective method for exploring aspects of crypto economics such as network effects, decentralization, tokenomics, and fraud detection. However, the majority of existing research predominantly focuses on leading cryptocurrencies, namely Bitcoin (BTC) and Ethereum (ETH), overlooking the vast diversity among the more than 10,000 cryptocurrency projects. This oversight may result in skewed insights. In our paper, we aim to broaden the scope of investigation to encompass the entire spectrum of cryptocurrencies, examining various coins across their entire life cycles. Furthermore, we intend to pioneer advanced methodologies, including graph transfer learning and the innovative concept of "graph of graphs". By extending our research beyond the confines of BTC and ETH, our goal is to enhance the depth of our understanding of crypto economics and to advance the development of more intricate graph-based techniques.
Prioritizing long-term engagement rather than immediate benefits has garnered increasing attention in recent years. However, current research on long-term recommendation faces substantial challenges in terms of model evaluation and design: 1) Traditional evaluation approaches suffer from limitations due to the sparsity and bias in offline data and fail to capture user psychological influences. 2) Existing recommenders based on Reinforcement Learning (RL) are entirely data-driven and constrained by sparse and long-tail distributed offline data. Fortunately, recent advancements in Large Foundation Models (LFMs), characterized by remarkable simulation and planning capacity, offer significant opportunities for long-term recommendation. Despite this potential, due to the substantial scenario divergence between LFM pre-training and recommendation, employing LFMs in long-term recommendation still faces certain challenges. To this end, this research focuses on adapting the remarkable capabilities of LFMs to long-term recommendation to devise reliable evaluation schemes and efficient recommenders.
Modeling and forecasting online activity data is difficult because such data is high-dimensional and composed of multiple time-varying dynamics such as trends, seasonality, and diffusion of interest. In this paper, we propose D-Tracker, designed to capture latent dynamics in online activity data streams and forecast future values. Our proposed method has the following properties: (a) Interpretable: it uses interpretable differential equations to model the latent dynamics in online activity data, which enables us to capture trends and interest diffusion among locations; (b) Automatic: it determines the number of latent dynamics and the number of seasonal patterns fully automatically; (c) Scalable: it incrementally and adaptively detects shifting points of patterns for a semi-infinite collection of tensor streams, and its computation time is independent of the time series length. Experiments using web search volume data obtained from Google Trends show that the proposed method achieves higher forecasting accuracy in less computation time than existing methods while extracting the patterns of interest diffusion among locations.
TikTok has become a dominant force in the social media landscape of the United States and has spawned other social media offerings emulating its algorithmically-driven short-form content recommendation platform (e.g., YouTube Shorts and Instagram Reels). The short-form vertical content is designed to be consumed on mobile phones, but existing audits have predominantly, and only to a limited degree, investigated TikTok through its web application. Additionally, there are no advertisements on the web version of TikTok, and as such the advertising ecosystem of the platform has thus far gone largely unstudied. In this work we propose a technique for auditing TikTok's recommendation algorithm by interfacing with emulators and intercepting network traffic. In this way we are able to measure the personalization that comes from user-specified demographics such as gender and age and better understand how ads are delivered to these groups. Future work will investigate personalization arising from user interactions such as liking posts and following creators based on their interests, and will study the role that algorithmic personalization plays in ad targeting.
Intergroup dehumanisation represents a pressing concern for today's society. It hinders empathy and prosocial behaviour and contributes to between-group aggression. Its consequences are particularly dangerous in the context of international military conflicts, as dehumanisation contributes to support for war and war-related violence and usually accompanies genocidal conflicts. This motivates our focus on blatant forms of dehumanisation towards an outgroup defined in political or national terms, specifically the relations between Ukrainians, Russians, and Belarusians around the time of the Russian invasion of Ukraine in 2022.
The study draws attention to a previously under-researched aspect of outgroup dehumanisation, namely the role of ingroup perception. Outgroup dehumanisation involves excluding the outgroup from the community one identifies with, thus reinforcing the boundary between ingroup and outgroup. This highlights the comparative nature of dehumanisation, suggesting its basis might lie more in a comparative ingroup superiority bias than in an outgroup inferiority bias. Existing research, however, generally concentrates solely on negative aspects of outgroup perception in dehumanising attitudes. While some studies have gauged dehumanisation through ingroup-outgroup perception differences, they lacked a ground-truth measure of dehumanisation, leaving its comparative nature largely unexamined. Employing a generative Large Language Model, we develop a dataset of Telegram channel posts classified as dehumanising or neutral. Utilising NLP tools, we analyse the role of ingroup-outgroup perception disparities in dehumanisation, specifically addressing its relation to affective polarisation.
With the development of information technology, a large amount of information and text corpora has accumulated on the Web, stimulating an increasingly high demand for summarization. Document summarization is a Natural Language Processing task that aims to generate abridged versions of a given single document or multiple documents that are as concise and coherent as possible while preserving salient information from the source texts. Recent research in the area has started to use knowledge graphs, as they can capture factual and applicable information from more facets alongside the source information, benefiting the factual consistency and informativeness of generated summaries beyond a purely linguistic perspective. However, there is no explicit investigation of the effects of different kinds of knowledge graphs on document summarization. The proposed method incorporates structured, informative auxiliary knowledge, especially knowledge graphs, into pre-trained summarization models to improve summary quality. Expected outcomes are exploring knowledge and knowledge graph incorporation for multi-document summarization, and achieving more informative, coherent, and factually consistent summaries.
Relation extraction is the task of extracting relationships from input text, where input can be a sentence, document, or multiple documents. This task has been popular for decades and is still of keen interest. Various techniques have been proposed to solve the relation extraction problem, among which the most popular are using distant supervision, deep learning-based models, reasoning-based models, and transformer-based models. We propose three approaches (named ReOnto, DocRE-CLip, and KDocRE) for relation extraction from text at three levels of granularity (sentence, document and across documents). These approaches embed knowledge in a deep learning based model to improve performance. ReOnto and DocRE-CLip have been evaluated and the source code is publicly available. We are currently implementing and evaluating KDocRE.
Sentiment analysis based on lexical corpora is widely employed despite its inherent limitations in capturing nuances such as sarcasm and irony. This research delves into the application of sentiment analysis to political communication. To address the limitations of the Bag of Words methodology, a comparative study of sentiment analysis tools and emotion detection from speech is conducted, using automated speech recognition as a benchmark. Emotion recognition from speech has shown promising results, indicating its potential superiority over other methods [1], [2]. This study uses media material from Polish radio and television broadcasts, focusing on political interviews during a significant period marked by a high-profile assassination attempt. Results indicate challenges at the micro-level, but aggregated data reveals a significant correlation between valence measured from voice and text. While sentiment analysis may lack sensitivity in capturing mourning-related discourse, it proves effective in political communication devoid of such nuances. This suggests that valence in sentiment analysis reflects emotional content derived from intonation fairly accurately.
Graph Neural Networks (GNNs) have achieved remarkable success in various real-world applications. However, GNNs may be trained on undesirable graph data, which can degrade their performance and reliability. To enable trained GNNs to efficiently unlearn unwanted data, a desirable solution is retraining-based graph unlearning, which partitions the training graph into subgraphs and trains sub-models on them, allowing fast unlearning through partial retraining. However, the graph partition process causes information loss in the training graph, resulting in the low model utility of sub-GNN models. In this paper, we propose GraphRevoker, a novel graph unlearning framework that better maintains the model utility of unlearnable GNNs. Specifically, we preserve the graph property with graph property-aware sharding and effectively aggregate the sub-GNN models for prediction with graph contrastive sub-model aggregation. We conduct extensive experiments to demonstrate the superiority of our proposed approach.
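The retraining-based unlearning workflow underlying such approaches can be sketched schematically: partition the nodes into shards, keep one sub-model per shard, and on a deletion request retrain only the affected shard. The sketch below uses a stand-in training stub and random sharding; GraphRevoker's property-aware sharding and contrastive sub-model aggregation are not reproduced.

```python
# Schematic sketch of retraining-based graph unlearning with a stand-in trainer.
import random

def shard_nodes(nodes, num_shards=4, seed=0):
    random.Random(seed).shuffle(nodes)
    return [nodes[i::num_shards] for i in range(num_shards)]

def train_gnn(shard_node_ids):
    # Stand-in stub for training a sub-GNN on the subgraph induced by these nodes.
    return {"trained_on": sorted(shard_node_ids)}

nodes = list(range(100))
shards = shard_nodes(nodes)
sub_models = [train_gnn(s) for s in shards]

def unlearn(node_id):
    """Remove a node and retrain only the shard that contained it."""
    for i, shard in enumerate(shards):
        if node_id in shard:
            shard.remove(node_id)
            sub_models[i] = train_gnn(shard)   # partial retraining; other shards untouched
            return i

affected = unlearn(42)
print(f"retrained shard {affected}; other {len(shards) - 1} sub-models reused as-is")
```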
Online communities are powerful tools to connect people and are used worldwide by billions of people. Nearly all online communities rely upon moderators or admins to govern the community in order to mitigate potential harms such as harassment, polarization, and deleterious effects on mental health. However, online communities are complex systems, and studying the impact of community governance empirically at scale is challenging because of the many aspects of community governance and outcomes that must be quantified. In this work, we develop methods to quantify the governance of online communities at web scale. We survey community members to build a comprehensive understanding of what it means to make communities 'better,' then assess existing governance practices and associate them with important outcomes to inform community moderators. We collaborate with communities to deploy our governance interventions to maximize the positive impact of our work, and, at every step of the way, we make our datasets and methods public to support further research on this important topic.
Creativity is the ability to develop innovative functional ideas through unconventional associations. The consensus view on creativity in the literature involves divergence from stereotypical and habitual thought patterns [29, 39]. Creativity relies on search to explore diverse solutions. Search requires charting the mental terrain, leveraging past experiences and knowledge to manipulate and reconfigure components for new solutions [28]. The generally-accepted and overly-narrow view on creativity, however, neglects the fact that creativity is multidimensional [14]. This one-dimensional view of creativity triggers questions such as "Does one consider an unethical but novel creation to be creative?" and "Does one consider a new iPhone with mainstream functionalities but advanced camera features to be creative?" This research challenges the one-dimensional view of creativity, offering a more all-encompassing conceptualization of creativity [14]. The research examines the multidimensional nature of creativity by building a computational model of a designer's mutual search process across multiple mutually dependent search spaces. The research examines the trajectory of mutual search across multiple cognitive search spaces using a Generalized Additive Model (GAM). The field experiment employs 108 designers who develop their web designs through five iterations, utilizing computer graphics methods to extract the images. Through measuring the distance of search by considering changes in visual and source code in each iteration, the study argues that the search patterns differ in the degree of exploration in these search spaces over time. The research concludes that designers' search processes are non-linear and argues that there are more than one or two search spaces. The research also provides perceptual explanations of the multiple search processes in designs and argues for a more encompassing view of creativity.
Cryptocurrencies are becoming increasingly important for the modern economy. Prior literature focuses on aligning actor incentives to ensure the secure and efficient operation of cryptocurrencies against adversarial threats that are unobserved in the wild. In this work, we address the gap between the theory and practice of cryptocurrencies by advancing realistic approaches to analyze the economics and security of key cryptocurrency components: consensus mechanisms, transaction fee mechanisms (TFMs), and the application layer. We present novel models of these components that we evaluate both theoretically and using cryptocurrency clients. We augment our evaluation with the first evidence of an in-the-wild attack on a major cryptocurrency, highlighting our approach's practicality. Results contained in our work were adopted by cryptocurrency platforms that hold user assets worth over 300 billion.
Advancements in conversational systems have revolutionized information access, surpassing the limitations of single queries. However, developing dialogue systems requires a large amount of training data, which is a challenge in low-resource domains and languages. Traditional data collection methods like crowd-sourcing are labor-intensive and time-consuming, making them ineffective in this context. Data augmentation (DA) is an effective approach to alleviate the data scarcity problem in conversational systems. This tutorial provides a comprehensive and up-to-date overview of DA approaches in the context of conversational systems. It highlights recent advances in open-domain and task-oriented conversation generation, and different paradigms for evaluating these models. We also discuss current challenges and future directions to help researchers and practitioners further advance the field.
Generative retrieval (GR) has witnessed significant growth recently in the area of information retrieval. Compared to the traditional "index-retrieve-then-rank" pipeline, the GR paradigm aims to consolidate all information within a corpus into a single model. Typically, a sequence-to-sequence model is trained to directly map a query to its relevant document identifiers (i.e., docids). This tutorial offers an introduction to the core concepts of the GR paradigm and a comprehensive overview of recent advances in its foundations and applications. We start by providing preliminary information covering foundational aspects and problem formulations of GR. Then, our focus shifts towards recent progress in docid design, training approaches, inference strategies, and applications of GR. We end by outlining challenges and issuing a call for future GR research. This tutorial is intended to be beneficial to both researchers and industry practitioners interested in developing novel GR solutions or applying them in real-world scenarios.
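One ingredient commonly used when a seq2seq model must emit valid docids is constrained decoding over a prefix trie of docid token sequences: for any generated prefix, only tokens that extend some valid docid are admissible. The sketch below shows just that trie lookup; the seq2seq model, tokenizer, and the toy digit docids are placeholders, and this is a generic GR technique rather than a specific system covered in the tutorial.

```python
# Minimal sketch of docid-constrained decoding: a prefix trie over docid token
# sequences yields the set of admissible next tokens for any generated prefix.
def build_trie(docids):
    trie = {}
    for tokens in docids:
        node = trie
        for tok in tokens:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}                 # mark the end of a complete docid
    return trie

def allowed_next_tokens(trie, prefix):
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()                   # prefix is not part of any valid docid
        node = node[tok]
    return set(node.keys())

# Toy docids tokenized as digit sequences.
docid_trie = build_trie([("1", "2", "7"), ("1", "3"), ("4", "0", "5")])
print(allowed_next_tokens(docid_trie, ()))          # {'1', '4'}
print(allowed_next_tokens(docid_trie, ("1",)))      # {'2', '3'}
print(allowed_next_tokens(docid_trie, ("1", "3")))  # {'<eos>'}
```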
Web applications serve as vital interfaces for users to access information, perform various tasks, and engage with content. Traditional web designs have predominantly focused on user interfaces and static experiences. With the advent of large language models (LLMs), there is a paradigm shift as we integrate LLM-powered agents into these platforms. These agents bring crucial human-like capabilities such as memory and planning, enabling them to behave like humans in completing various tasks, effectively enhancing user engagement and offering tailored interactions in web applications. In this tutorial, we delve into the cutting-edge techniques of LLM-powered agents across various web applications, such as web mining, social networks, recommender systems, and conversational systems. We will also explore the prevailing challenges in seamlessly incorporating these agents and hint at prospective research avenues that can revolutionize the way we interact with web platforms.
In the dynamic landscape of online businesses, recommender systems are pivotal in enhancing user experiences. While traditional approaches have relied on static supervised learning, the quest for adaptive, user-centric recommendations has led to the emergence of the contextual bandit formulation. This tutorial investigates contextual bandits as a powerful framework for personalized recommendations. We delve into the challenges, advanced algorithms and theories, collaborative strategies, and open challenges and future prospects within this field. Different from existing related tutorials, (1) we focus on the exploration perspective of contextual bandits to alleviate the "Matthew Effect" in recommender systems, i.e., the rich get richer and the poor get poorer, with respect to the popularity of items; (2) in addition to conventional linear contextual bandits, we also dedicate attention to neural contextual bandits, which have emerged as an important branch in recent years, investigating how neural networks benefit contextual bandits for personalized recommendation both empirically and theoretically; (3) we will cover the latest topic, collaborative neural contextual bandits, which incorporate both user heterogeneity and user correlations customized for recommender systems; (4) we will present and discuss newly emerging challenges and open questions for neural contextual bandits with applications in personalized recommendation, especially for large neural models.
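To make the exploration-exploitation trade-off concrete, here is a toy neural epsilon-greedy contextual bandit: a small reward network scores items given a user context, and with probability epsilon a random item is recommended instead of the greedy one. The network size, epsilon value, and single-step update are invented for illustration and do not represent the neural UCB or collaborative algorithms the tutorial covers.

```python
# Toy sketch: a neural epsilon-greedy contextual bandit for item recommendation.
import random
import torch
import torch.nn as nn

N_ITEMS, CTX_DIM, EPSILON = 20, 8, 0.1

class RewardNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(CTX_DIM, 32), nn.ReLU(), nn.Linear(32, N_ITEMS))

    def forward(self, ctx):                     # ctx: (CTX_DIM,) -> predicted reward per item
        return self.net(ctx)

model = RewardNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

def choose_item(context):
    if random.random() < EPSILON:               # explore: counteracts popularity feedback loops
        return random.randrange(N_ITEMS)
    with torch.no_grad():
        return int(model(context).argmax())     # exploit current reward estimates

def update(context, item, reward):
    pred = model(context)[item]
    loss = loss_fn(pred, torch.tensor(float(reward)))
    opt.zero_grad(); loss.backward(); opt.step()

ctx = torch.randn(CTX_DIM)                      # user/context features
item = choose_item(ctx)
update(ctx, item, reward=1.0)                   # observed feedback, e.g. a click
```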
Compared with other greedy personalized recommendation approaches, Contextual Bandits techniques provide distinct ways of modeling user preferences. We believe this tutorial can benefit researchers and practitioners by appreciating the power of exploration and the performance guarantee brought by neural contextual bandits, as well as rethinking the challenges caused by the increasing complexity of neural models and the magnitude of data.
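As a concrete illustration of the linear contextual bandit setting the tutorial covers (not material from the tutorial itself), the following minimal sketch implements disjoint LinUCB, where each arm keeps a ridge-regression estimate of reward plus an upper-confidence exploration bonus; the arm count, feature dimension, and synthetic reward signal are made up for the toy usage.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB: one ridge-regression model per arm,
    with an upper-confidence exploration bonus."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors

    def select(self, x):
        """Pick the arm with the highest UCB score for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # exploration term
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy usage: 3 candidate items, 5-dimensional user/item context features.
bandit = LinUCB(n_arms=3, dim=5)
rng = np.random.default_rng(0)
for _ in range(100):
    x = rng.normal(size=5)
    arm = bandit.select(x)
    reward = float(rng.random() < 0.3 + 0.1 * arm)       # synthetic feedback
    bandit.update(arm, x, reward)
```

The exploration bonus is what counteracts the "rich get richer" dynamic: arms whose reward estimates are still uncertain keep receiving a chance to be shown.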
Social computing platforms typically deal with data that are either related to humans or generated by humans. Consequently, effective design of these platforms needs to be cognizant of social psychology theories. In this tutorial, we review and summarize the research thus far into the paradigm of psychology theory-informed design of social computing platforms, where the design is guided by theories from social psychology in addition to theories from computer science. Specifically, we review techniques and frameworks that embrace this paradigm in the arena of social influence. In addition, we suggest open problems and new research directions.
Users routinely interact with the Web via information access systems such as search engines and recommender systems. How to accurately evaluate such interactive systems with reproducible experiments is an important, yet difficult challenge. To address this challenge, user simulation has emerged as a promising solution. This half-day tutorial focuses on providing a thorough introduction to user simulation techniques designed specifically for evaluating information access systems on the Web. We systematically review major research progress, covering both general frameworks for designing user simulators, and specific models and algorithms for simulating user interactions with search engines, recommender systems, and conversational assistants. We also highlight some important future research directions.
Modern machine learning and AI have revolutionized the generation of ranking and recommendations across many domains, taking data-driven approaches to inferring the candidate items a user is most likely to select. The theory of discrete choice provides the theoretical underpinnings for study of these problems. This theory is central to economics and behavioral sciences, and was recognized with the 2000 Nobel Prize in economics awarded to Daniel McFadden for his work on the analysis of discrete choice. Classical work in this area, and a wide range of recent advances, have much to offer in thinking about how to support users in making choices. However, many of the central tools of discrete choice are not broadly known to the researchers in our area. In this proposed tutorial, we will cover the foundations of the field, and provide connections to common tools of machine learning such as logistic regression, multinomial regression, and softmax. We will cover a number of results about the representational power and complexity of learning various models of choice. And finally, we will suggest both open problems in the field, and areas where discrete choice tools are relevant to key problems in the Web Conference community.
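To make the stated connection between discrete choice and softmax tangible, here is an illustrative sketch (not part of the tutorial materials) of the multinomial logit model: each item's utility is a linear function of its features, and the probability of choosing an item from a slate is exactly a softmax over those utilities. The feature values and preference vector are invented for the example.

```python
import numpy as np

def choice_probabilities(item_features, beta):
    """Multinomial logit: utility u_i = beta . x_i, and the probability of
    choosing item i from the slate is softmax(u)_i."""
    utilities = item_features @ beta
    exp_u = np.exp(utilities - utilities.max())   # numerically stabilized softmax
    return exp_u / exp_u.sum()

# Toy slate of 4 items with 3 features each, and an assumed preference vector.
items = np.array([[1.0, 0.2, 0.0],
                  [0.5, 0.9, 0.1],
                  [0.0, 0.4, 0.8],
                  [0.3, 0.3, 0.3]])
beta = np.array([0.7, 1.2, -0.4])
print(choice_probabilities(items, beta))          # probabilities sum to 1 over the slate
```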
In World Wide Web (WWW) systems, networks (or graphs) serve as a fundamental tool for representing, analyzing, and understanding linked data, providing significant insights into the underlying systems. Naturally, most real-world systems have inherent temporal information, e.g., interactions in social networks occur at specific moments in time and last for a certain period. Temporal networks, i.e., network data modeling temporal information, enable novel and fundamental discoveries about the underlying systems they model, otherwise not captured by static networks that ignore such temporal information.
In this tutorial, we present state-of-the-art models and algorithmic techniques for mining temporal networks that can provide precious insights into a plethora of web-related applications. We present how temporal networks can be used to extract novel information, especially in web-related network data, and highlight the challenges that arise when modeling temporal information compared to traditional static network-based approaches. We first overview different temporal network models. We then show how such powerful models can be leveraged to extract novel insights through suitable mining primitives. In particular, we present recent advances addressing most foundational problems for temporal network mining---ranging from the computation of temporal centrality measures, temporal motif counting, and temporal communities to bursty events and anomaly detection.
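As a small illustration of one of the mining primitives mentioned above, the sketch below counts a simple temporal motif: time-respecting two-edge paths in which the second edge follows the first within a window delta. The edge-list format and window size are assumptions for the example, not a specific method from the tutorial.

```python
from collections import defaultdict
from bisect import bisect_right

def count_time_respecting_2paths(edges, delta):
    """Count temporal 2-paths u->v->w where the v->w edge occurs
    within (t, t + delta] of the u->v edge at time t."""
    # Index outgoing edge timestamps per source node (sorted by time).
    out_times = defaultdict(list)
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        out_times[u].append(t)

    count = 0
    for u, v, t in edges:
        times = out_times.get(v, [])
        # Number of v->w edges with timestamp in (t, t + delta].
        lo = bisect_right(times, t)
        hi = bisect_right(times, t + delta)
        count += hi - lo
    return count

# Toy temporal edge list: (source, target, timestamp).
edges = [("a", "b", 1), ("b", "c", 2), ("b", "d", 10), ("c", "a", 3)]
print(count_time_respecting_2paths(edges, delta=5))   # -> 2
```

A static view of the same graph would count every two-edge path regardless of ordering, which is exactly the kind of information loss the tutorial highlights.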
Emerging as fundamental building blocks for diverse artificial intelligence applications, foundation models have achieved notable success across natural language processing and many other domains. Concurrently, graph machine learning has gradually evolved from shallow methods to deep models to leverage the abundant graph-structured data that constitute an important pillar in the data ecosystem for artificial intelligence. Naturally, the emergence and homogenization capabilities of foundation models have piqued the interest of graph machine learning researchers. This has sparked discussions about developing a next-generation graph learning paradigm, one that is pre-trained on broad graph data and can be adapted to a wide range of downstream graph-based tasks. However, there is currently no clear definition or systematic analysis for this type of work.
In this tutorial, we will introduce the concept of graph foundation models (GFMs), and provide a comprehensive exposition on their key characteristics and underpinning technologies. Subsequently, we will thoroughly review existing works that lay the groundwork towards GFMs, which are summarized into three primary categories based on their roots in graph neural networks, large language models, or a hybrid of both. Beyond providing a comprehensive overview and in-depth analysis of the current landscape and progress towards graph foundation models, this tutorial will also explore potential avenues for future research in this important and dynamic field. Finally, to help the audience gain a systematic understanding of the topics covered in this tutorial, we present further details in our recent preprint paper, "Towards Graph Foundation Models: A Survey and Beyond"[4], available at https://arxiv.org/pdf/2310.11829.pdf.
Large language models (LLMs) have significantly influenced recommender systems. Both academia and industry have shown growing interest in developing LLMs for recommendation purposes, an approach commonly referred to as LLM4Rec. This involves efforts such as utilizing LLMs for generative item retrieval and ranking, along with the potential for creating universal LLMs for varied recommendation tasks, signaling a possible paradigm shift in recommender systems. This tutorial is designed to review the progression of LLM4Rec and provide an in-depth analysis of the prevailing studies. We will discuss how LLMs advance recommender systems in model architecture, learning paradigms, and capabilities like conversation, generalization, planning, and content generation. Additionally, the tutorial will highlight open problems and challenges in this nascent field, addressing concerns related to trustworthiness, efficiency, online training, and recommendation data modeling. Concluding with a summary of the takeaways from previous research, the tutorial will suggest avenues for future investigations. Our aim is to help the audience grasp the developments in LLM4Rec, as well as to spark inspiration for further research. By doing so, we expect to contribute to the growth and success of LLM4Rec, possibly leading to a fundamental change in recommender paradigms.
Personalized recommendation stands as a ubiquitous channel for users to explore information or items aligned with their interests. Nevertheless, prevailing recommendation models predominantly rely on unique IDs and categorical features for user-item matching. While this ID-centric approach has witnessed considerable success, it falls short in comprehensively grasping the essence of raw item contents across diverse modalities, such as text, image, audio, and video. This underutilization of multimodal data poses a limitation to recommender systems, particularly in the realm of multimedia services like news, music, and short-video platforms. The recent surge in pretraining and generation techniques presents both opportunities and challenges in the development of multimodal recommender systems. This tutorial seeks to provide a thorough exploration of the latest advancements and future trajectories in multimodal pretraining and generation techniques within the realm of recommender systems. The tutorial comprises four talks, addressing multimodal pretraining, multimodal fusion, multimodal generation, and presenting successful stories alongside open challenges in the field of recommendation. Our target audience encompasses scholars, practitioners, and other parties interested in this domain. By providing a succinct overview of the field, we aspire to facilitate a swift understanding of multimodal recommendation and foster meaningful discussions on the future development of this evolving landscape.
Internet memes, in their ubiquitous spread across the digital landscape, have transformed into a potent communicative force. Their significance beckons keen interest from researchers and practitioners alike, necessitating a deep comprehension of their nuanced forms and functions. Recent studies have honed in on diverse facets of memes, particularly in detecting offensive material and discerning sarcasm, yet comprehensive instructional resources remain sparse. Addressing this void, our tutorial delivers an integrated framework for dissecting the complex humor of memes. It weaves together disciplines such as natural language processing, computer vision, and multimodal modeling, empowering participants to decode meanings, analyze sentiments, and identify offensive content within memes. Attendees will engage in hands-on exercises and observe demonstrations, tapping into established datasets and cutting-edge algorithms. This equips them with the expertise to navigate the intricacies of meme analysis and to contribute substantively to this dynamic domain. For more information, please check out our tutorial teaser video.
Given the sheer volume of contemporary e-commerce applications, recommender systems (RSs) have gained significant attention in both academia and industry. However, traditional cloud-based RSs face inevitable challenges, such as resource-intensive computation, reliance on network access, and privacy breaches. In response, a new paradigm called on-device recommender systems (ODRSs) has emerged recently in various industries like Taobao, Google, and Kuaishou. ODRSs unleash the computational capacity of user devices with lightweight recommendation models tailored for resource-constrained environments, enabling real-time inference with users' local data. This tutorial aims to systematically introduce methodologies of ODRSs, including (1) an overview of existing research on ODRSs; (2) a comprehensive taxonomy of ODRSs, where the core technical content spans three major research directions: on-device deployment and inference, on-device training, and privacy/security of ODRSs; (3) limitations and future directions of ODRSs. This tutorial expects to lay the foundation and spark new insights for follow-up research and applications concerning this new recommendation paradigm.
Graph neural networks (GNNs) have emerged as fundamental methods for handling structured graph data in various domains, including citation networks, molecule prediction, and recommender systems. They enable the learning of informative node or graph representations, which are crucial for tasks such as link prediction and node classification in the context of graphs. To achieve high-quality graph representation learning, certain essential factors come into play: clean labels, accurate graph structures, and sufficient initial node features. However, real-world graph data often suffer from noise and sparse labels, while different datasets have unique feature constructions. These factors significantly impact the generalization capabilities of graph neural networks, particularly when faced with unseen tasks. Recently, owing to the efficient text processing and task generalization capabilities of large language models (LLMs), a promising line of work has emerged that addresses the challenges above by combining large language models with graph data.
This tutorial offers an overview of incorporating large language models into the graph domain, accompanied by practical examples. The methods are categorized into three dimensions: utilizing LLMs as augmenters, predictors, and agents for graph learning tasks. We will delve into the current progress and future directions within this field. By introducing this emerging topic, our aim is to enhance the audience's understanding of LLM-based graph learning techniques, foster idea exchange, and encourage discussions that drive continuous advancements in this domain.
Privacy in general, and differential privacy (DP) in particular, have become important topics in data mining and machine learning. Digital advertising is a critical component of the internet and is powered by large-scale data analytics and machine learning models; privacy concerns around these are on the rise. Despite the central importance of private ad analytics and training privacy-preserving ad prediction models, there has been relatively little exposure of this subject to the broader Web community. In the past three years, the interest in privacy and the interest in online advertising have been steadily growing. The aim of this tutorial is to provide researchers with an introduction to the problems that arise in private analytics and modeling in advertising, survey recent results, and describe the main research challenges in the space.
This tutorial will delve into the fascinating realm of simulating human society using Large Language Model (LLM)-driven agents, exploring their applications in cities, social media, and economic systems. Through this tutorial, participants will gain insights into the integration of LLMs into human society simulation, providing a comprehensive understanding of how these models can accurately represent human interactions, decision-making processes, and societal dynamics across cities, social media, and economic systems. The tutorial will introduce the essential background, discuss the motivation and challenges, and elaborate on the recent advances.
Knowledge graph reasoning plays an important role in data mining, AI, the Web, and social science. These knowledge graphs serve as intuitive repositories of human knowledge, allowing for the inference of new information. However, traditional symbolic reasoning, while powerful in its own right, faces challenges posed by incomplete and noisy data in the knowledge graphs. In contrast, recent years have witnessed the emergence of Neural Symbolic AI, an exciting development that fuses the capabilities of deep learning and symbolic reasoning. It aims to create AI systems that are not only highly interpretable and explainable but also incredibly versatile, effectively bridging the gap between symbolic and neural approaches. Furthermore, with the advent of large language models, the integration of LLMs with knowledge graph reasoning has emerged as a prominent frontier, offering the potential to unlock unprecedented capabilities. This tutorial aims to comprehensively review different aspects of knowledge graph reasoning applications and to introduce recent advances in Neural Symbolic reasoning and in combining knowledge graph reasoning with large language models. It is intended to benefit researchers and practitioners in the fields of data mining, AI, the Web, and social science.
Text documents are usually connected in a graph structure, resulting in an important class of data named the text-attributed graph, e.g., paper citation graphs and Web page hyperlink graphs. On the one hand, Graph Neural Networks (GNNs) consider the text in each document as a general vertex attribute and do not specifically deal with text data. On the other hand, Pre-trained Language Models (PLMs) and Topic Models (TMs) learn effective document embeddings. However, most such models focus on the text content of each single document only, ignoring link adjacency across documents. The above two challenges motivate the development of text-attributed graph representation learning, which combines GNNs with PLMs and TMs into a unified model and learns document embeddings that preserve both modalities, supporting applications such as text classification, citation recommendation, and question answering.
In this lecture-style tutorial, we will provide a systematic review of text-attributed graph, including its formal definition, recent methods, diverse applications, and challenges. Specifically, i) we will formally define text-attributed graph and briefly review GNNs, PLMs, and TMs, which are the fundamentals of some existing methods. ii) We will then revisit the technical details of text-attributed graph models, which are generally split into two categories, PLM-based and TM-based. iii) Besides, we will show diverse applications built on text-attributed graph. iv) Finally, we will discuss some challenges of existing models and propose solutions for future research.
The pervasive abuse of misinformation to influence public opinion on social media has become increasingly evident in various domains, encompassing politics, as seen in presidential elections, and healthcare, most notably during the recent COVID-19 pandemic. This threat has grown in severity as the development of Large Language Models (LLMs) empowers manipulators to generate highly convincing deceptive content with greater efficiency. Furthermore, the recent strides in chatbots integrated with LLMs, such as ChatGPT, have enabled the creation of human-like interactive social bots, posing a significant challenge to both human users and the social-bot-detection systems of social media platforms. These challenges motivate researchers to develop algorithms to mitigate misinformation and social media manipulation. This tutorial introduces advanced machine learning research that supports this goal, including (1) detection of social manipulators, (2) learning causal models of misinformation and social manipulation, and (3) LLM-generated misinformation detection. In addition, we also present possible future directions.
This tutorial focuses on curriculum learning (CL), an important topic in machine learning that is gaining an increasing amount of attention in the research community. CL is a learning paradigm that enables machines to learn from easy data to hard data, imitating the meaningful procedure of human learning with curricula. As an easy-to-use plug-in, CL has demonstrated its power in improving the generalization capacity and convergence rate of various models in a wide range of scenarios such as computer vision, natural language processing, data mining, reinforcement learning, etc. Therefore, it is essential to introduce CL to more scholars and researchers in the machine learning community. However, there have been no tutorials on CL so far, motivating the organization of our tutorial on CL at WWW 2024. To give a comprehensive tutorial on CL, we plan to organize it around the following aspects: (1) theories, (2) approaches, (3) applications, (4) tools and (5) future directions. First, we introduce the motivations, theories and insights behind CL. Second, we advocate novel, high-quality approaches, as well as innovative solutions to the challenging problems in CL. Then we present the applications of CL in various scenarios, followed by some relevant tools. In the end, we discuss open questions and future directions in the era of large language models. We believe this topic is at the core of the scope of WWW and is attractive to the audience interested in machine learning from both academia and industry.
The recent achievements and availability of Large Language Models have paved the road to a new range of applications and use cases. Pre-trained language models are now being used at scale in many fields from which they were until now absent. More specifically, the progress made by causal generative models has opened the door to using them through textual instructions, a.k.a. prompts. Unfortunately, the performance of these prompts is highly dependent on the exact phrasing used, and therefore practitioners need to adopt fail-retry strategies. This first international workshop on prompt engineering aims at gathering practitioners (from both academia and industry) to exchange good practices, optimizations, results and novel paradigms regarding the design of efficient prompts to make use of LLMs.
In the private residential sales market, obtaining orders for exterior design requires a proposal that considers the constraints of the location and the customer. The design summary is a proposal document that describes concepts at the earliest stage of exterior design. It not only appeals to the customer but also serves as a reference for the design drawing. However, its quality varies with the skill of the creator, because constructing a design summary requires the knowledge of human experts. This paper aims to generate the design summary using generative AI. Firstly, we analyze the characteristics of the design summary to identify its essential elements. Then, we propose a sequence of prompts to generate the design summary. Finally, we conduct a comparative evaluation between design summaries created by experts and those generated by generative AI.
Emotion classification in text is a challenging task due to the processes involved when interpreting a textual description of a potential emotion stimulus. In addition, the set of emotion categories is highly domain-specific. For instance, literature analysis might require the use of aesthetic emotions (e.g., finding something beautiful), and social media analysis could benefit from finer-grained sets (e.g., separating anger from annoyance) rather than only the basic categories proposed by Paul Ekman (anger, disgust, fear, joy, surprise, sadness). This renders the task an interesting field for zero-shot classification, in which the label set is not known at model development time. Unfortunately, most resources for emotion analysis are in English, and therefore, most studies on emotion analysis have been performed in English, including those that involve prompting language models for text labels. This leaves us with a research gap that we address in this paper: In which language should we prompt for emotion labels on non-English texts? This is particularly of interest when we have access to a multilingual large language model, because we could request labels with English prompts even for non-English data. Our experiments with natural language inference-based language models show that it is consistently better to use English prompts even if the data is in a different language.
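The setup described above can be illustrated with an off-the-shelf multilingual NLI model used through a zero-shot classification interface: a non-English input is paired with English emotion labels and an English hypothesis template. The specific model name and template below are illustrative assumptions, not necessarily the paper's exact configuration.

```python
from transformers import pipeline

# Publicly available multilingual NLI model, used here purely for illustration;
# the paper's exact models and prompt wording may differ.
classifier = pipeline(
    "zero-shot-classification",
    model="joeddav/xlm-roberta-large-xnli",
)

# German input text with English emotion labels and an English hypothesis
# template, mirroring the finding that English prompts work well on
# non-English data.
text = "Ich kann nicht glauben, dass sie das schon wieder getan haben."
emotions = ["anger", "disgust", "fear", "joy", "surprise", "sadness"]
result = classifier(
    text,
    candidate_labels=emotions,
    hypothesis_template="This text expresses {}.",
)
print(result["labels"][0], result["scores"][0])   # top-ranked emotion and its score
```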
Smishing, which refers to social engineering attacks delivered through mobile devices such as smartphones, poses significant threats, yet limited data hinder the development of effective countermeasures. To tackle this, we propose a novel prompt engineering method for data augmentation in smishing detection. Distinguished by its utilization of insights from social science on smishing mechanisms, our approach offers a promising avenue for improving machine learning models in combating smishing attacks.
Prompt Engineering has emerged as a pivotal technique in Natural Language Processing, providing a flexible approach for leveraging pre-trained language models. In particular, a prompt is used to instruct the model to follow the behavior specified by the given instructions, an approach that has been widely adopted across domains. Yet, existing prompt-guided frameworks face various challenges, such as crafting prompts for specific tasks that are clear, concise, and unambiguous, which requires time and computational resources. Furthermore, existing methods rely heavily on extensive labelled datasets, yet many domain-specific challenges remain, particularly in healthcare. This study presents Prompt-Eng, a novel framework emphasizing its wide-ranging applications in healthcare, in which we design precise prompts with positive and negative aspects; we hypothesize that designing prompts in pairs helps models generalize effectively. We delve into the significance of prompt design and optimization, highlighting its influence in shaping model responses. In addition, we explore the increasing demand for context-aware prompts in multimodal data analysis and the incorporation of prompt engineering into new machine-learning approaches. The essence of our approach is the creation of tailored prompts, which serve as instructive guidelines for the models during the prediction procedure. The proposed methodology emphasizes utilizing context-aware prompt pairs to facilitate interpreting and extracting healthcare information from a health corpus. The study uses the medical MIMIC-III corpus (https://physionet.org/content/mimiciii/1.4/) to predict medicine prescriptions. The paper also explores visual and textual prompts for X-ray image analysis for pneumonia prediction on the MIMIC-CXR dataset (https://physionet.org/content/mimic-cxr/2.0.0/). This approach stands out from existing methods by addressing challenges such as clarity, conciseness, and context awareness, thereby enabling improved interpretation and extraction of healthcare information from diverse data sources.
When interacting with Retrieval-Augmented Generation (RAG)-based conversational agents, the users must carefully craft their queries to be understood correctly. Yet, understanding the system's capabilities can be challenging for the users, leading to ambiguous questions that necessitate further clarification. This work aims to bridge the gap by developing a suggestion question generator. To generate suggestion questions, our approach involves utilizing dynamic context, which includes both dynamic few-shot examples and dynamically retrieved contexts. Through experiments, we show that the dynamic contexts approach can generate better suggestion questions as compared to other prompting approaches.
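A minimal sketch of the dynamic-context idea is shown below: few-shot examples are selected at query time by similarity to the user's question and combined with retrieved context into the generation prompt. The TF-IDF similarity, example bank, and prompt format are illustrative stand-ins, not the authors' actual components.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def select_dynamic_few_shot(query, example_bank, k=3):
    """Pick the k stored examples whose questions are most similar to the new
    query (TF-IDF similarity as a stand-in for a neural embedding model)."""
    corpus = [ex["query"] for ex in example_bank] + [query]
    tfidf = TfidfVectorizer().fit_transform(corpus)          # rows are L2-normalized
    sims = (tfidf[:-1] @ tfidf[-1].T).toarray().ravel()      # cosine similarities
    top = np.argsort(-sims)[:k]
    return [example_bank[i] for i in top]

def build_prompt(query, retrieved_context, example_bank):
    """Assemble a prompt with dynamic few-shot demos plus retrieved context."""
    shots = select_dynamic_few_shot(query, example_bank)
    demos = "\n\n".join(
        f"User question: {ex['query']}\nSuggested follow-ups: {ex['suggestions']}"
        for ex in shots
    )
    return (f"{demos}\n\nRetrieved context:\n{retrieved_context}\n\n"
            f"User question: {query}\nSuggested follow-ups:")

# Toy example bank of previously curated (question, suggestion) pairs.
bank = [
    {"query": "How do I reset my password?",
     "suggestions": "What if I no longer have access to my email?"},
    {"query": "What payment methods are supported?",
     "suggestions": "Can I pay with a corporate invoice?"},
    {"query": "How do I export my data?",
     "suggestions": "Which file formats are available for export?"},
]
print(build_prompt("I forgot my password", "Password policy: ...", bank))
```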
This is the 10th edition of the workshop series labeled "AW4City - Web Applications and Smart Cities", which started back in Florence in 2015 and has kept on taking place every year in conjunction with the WWW conference series. Last year the workshop was held in Austin, Texas. The workshop series aims to investigate the role of the Web and of Web applications in establishing smart city (SC) promises and in SC growth. This year, the workshop focuses on the new era of the Web and web intelligence in cities and communities. In the era of digital twinning and the metaverse (the so-called citiverse for cities), and under the UN 2030 Agenda for sustainable growth, cities are being transformed into virtual spaces that generate new types of value and new experiences for their citizens and enterprises, which can enhance living and offer new opportunities for economic growth. Moreover, AI and web intelligence generate new types of automated transactions in these virtual spaces, while they can utilize data spaces and standardization for optimal data flow. This workshop aims to demonstrate how the Web transforms cities into new virtual environments.
In the transformative landscape of smart cities, the integration of cutting-edge web technologies into time series forecasting presents a pivotal opportunity to enhance urban planning, sustainability, and economic growth. The advancement of deep neural networks has significantly improved forecasting performance. However, a notable challenge lies in the ability of these models to generalize well to out-of-distribution (OOD) time series data. The inherent spatial heterogeneity and domain shifts across urban environments create hurdles that prevent models from adapting and performing effectively in new urban environments. To tackle this problem, we propose to derive invariant representations for more robust predictions across different urban environments, rather than relying on spurious correlations, thereby improving generalizability. Through extensive experiments on both synthetic and real-world data, we demonstrate that our proposed method outperforms traditional time series forecasting models when tackling domain shifts in changing urban environments. The effectiveness and robustness of our method can be extended to diverse fields including climate modeling, urban planning, and smart city resource management.
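The abstract does not specify how the invariant representations are learned, so the following is only an illustrative sketch of one common recipe: an IRM-style penalty that encourages the predictor to be simultaneously optimal across environments (here, hypothetical per-city batches). The model, batches, and penalty weight are placeholders, not the paper's actual method.

```python
import torch
import torch.nn.functional as F

def irm_penalty(preds, targets):
    """IRM-v1 style penalty: squared gradient of the risk with respect to a
    fixed dummy scale of 1.0; small values mean the current representation is
    (locally) optimal for this environment."""
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.mse_loss(preds * scale, targets)
    grad = torch.autograd.grad(loss, [scale], create_graph=True)[0]
    return grad.pow(2)

def invariant_training_step(model, env_batches, optimizer, lam=1.0):
    """One step over a list of per-environment (x, y) batches, e.g. one batch
    per city or district in the urban-forecasting setting."""
    total_risk, total_penalty = 0.0, 0.0
    for x, y in env_batches:
        preds = model(x)
        total_risk = total_risk + F.mse_loss(preds, y)
        total_penalty = total_penalty + irm_penalty(preds, y)
    loss = total_risk + lam * total_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a linear forecaster over two hypothetical environments (cities).
model = torch.nn.Linear(8, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
envs = [(torch.randn(32, 8), torch.randn(32, 1)) for _ in range(2)]
invariant_training_step(model, envs, opt, lam=1.0)
```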
With the content evolution on the web and the Internet, there is a need for a cyberspace that can be used to work, live, and play in digital worlds regardless of geography. The Metaverse offers a vision of the future Internet, representing a future trend for the web. In the future, the Metaverse will be a dataspace in which the real and the virtual are combined, rather than a purely virtual space. In this paper, we present a comprehensive survey of the Metaverse, covering its open issues, its evolution, and its future. We hope this survey can provide some helpful prospects and insightful directions for the Metaverse.
This paper examines Smart Cities transition with the focus on urban mobility. We demonstrate how a multi-stakeholder, multi-disciplinary approach can support integration of systems, data, people, and organisations with a case study on new integrated mobility solutions and urban decarbonisation.
Context-based data receive increasing attention for Smart City (SC) flow homogenization and standardization. SC hubness can be a solution that simplifies these flows and transforms them into standardized message exchanges, which can be easily retrieved. Several use cases can justify the potential of SC hubness and will be summarized in this paper. However, the aim of this article is to analyze and explain this potential in the era of the metaverse in cities, the so-called "citiverse". The role of data in the citiverse gains additional importance since it is a data-oriented ecosystem, but it has to be considered in light of the new capabilities and values that the citiverse may bring.
The TempWeb workshop series is an established co-located event at The Web Conference that aims at bringing together researchers and practitioners across various domains. Naturally, submissions address core fields of study in computer science and aspects such as temporal IE/IR, Web mining, Web archiving and large-scale data analysis, to name just a few. Aiming at the investigation of infrastructures, scalable methods, and innovative software for aggregating, querying, and analyzing heterogeneous data at Web scale, TempWeb has developed into a forum for a community from science and industry. However, TempWeb's sweet spot is its close link to application domains such as the social sciences, marketing, economics, etc. Thus, the workshop continuously attracts not only technical innovations, but also studies on the societal implications of digital media usage along the temporal dimension. To this end, TempWeb can be considered a melting pot of interdisciplinary research. Frequent contributors and emerging collaborations show the advantages of collaborations triggered by workshop participation. In its 2024 edition, TempWeb again covers a wide spectrum of Web-related research, ranging from studies of content control, via terminological studies, up to community analysis and content recommendation.
Behavioral web data such as social web activity streams, query logs or behavioral traces from web search and navigation are crucial for understanding the temporal evolution of the web, the human interactions that produce web data, and the models trained on such data. Thus, behavioral web data empowers research in various fields, such as (temporal) information retrieval, computational social science or cognitive and behavioral modeling of users over time. On the other hand, archiving and using such data is associated with a number of technical and non-technical challenges, e.g., legal and ethical concerns, fast decay of data over time, as well as dependency on third-party gatekeepers, such as Twitter/X or Google. We present case studies from many years of research into archiving and temporal analysis of behavioral web data, including large-scale social web archives such as TweetsKB, based on an archive of 14 billion tweets harvested continuously since 2013, and web search and navigation behavior tracked through user studies, longitudinal panels or crowdsourcing-based quasi-experiments.
In this talk, we will explore challenges and opportunities of spatio-temporal information access, which connects the temporal and spatial dimensions of mining and analysis. We focus on use cases and examples from the development of systems and services in smart sustainable cities, and in urban energy and climate transitions.
This paper presents a case study regarding a comparative examination of the Louvain and Leiden community detection algorithms. The case study was conducted on a real-world communication network consisting of 3,222,623 nodes and 27,423,553 edges. In particular, the network in our case study models the communication between Twitter users during the initial four weeks of the 2022 war in Ukraine. In addition, we also applied dynamic topic modeling in order to examine differences in the detected communities.
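For readers unfamiliar with the two algorithms, the snippet below runs both on a small toy graph with python-igraph and compares the resulting partitions; it is an illustration of the general procedure, not the study's actual pipeline or data.

```python
import igraph as ig

# Toy undirected graph standing in for the (much larger) Twitter
# communication network used in the study.
g = ig.Graph.Famous("Zachary")

louvain = g.community_multilevel()                          # Louvain
leiden = g.community_leiden(objective_function="modularity")  # Leiden

print("Louvain:", len(louvain), "communities, modularity =",
      round(louvain.modularity, 3))
print("Leiden: ", len(leiden), "communities, modularity =",
      round(leiden.modularity, 3))

# Compare how similarly the two algorithms partition the nodes.
print("Similarity (NMI):",
      round(ig.compare_communities(louvain, leiden, method="nmi"), 3))
```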
As generative AI continues to evolve, it becomes increasingly important for site owners to effectively communicate their conditions and preferences to web agents to maintain data sovereignty. This necessity underscores the importance of an ecosystem where the technical means to prevent unauthorized data mining and to set conditions on the usage of web resources are readily available. Our research focuses on the temporal development of such technical content control methods, examining two primary mechanisms: the regulation of web robots via the Robots Exclusion Protocol and the semantic annotation of web documents with licensing information. Through a longitudinal study, we analyze the implementation and recent modifications of robots.txt files, robot directives (such as noindex, nofollow, etc.), and license-related HTML annotations. This study is driven by the growing awareness among site owners regarding the control over their content in the face of the progression of AI, highlighting the critical need for effective web content control strategies to protect and appropriately manage the wealth of texts, images, videos, and other content populating the internet.
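The two control mechanisms studied above can be inspected programmatically; the sketch below checks the Robots Exclusion Protocol with Python's standard library and extracts meta robots directives with a naive regex. The user agent, site, and page are placeholders, and the snippet is illustrative rather than the paper's measurement code.

```python
import re
import urllib.request
from urllib import robotparser

def robots_allows(site, path, user_agent="ExampleResearchBot"):
    """Check the Robots Exclusion Protocol for a given path on a site."""
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{site}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, f"{site}{path}")

def meta_robots_directives(html):
    """Extract <meta name="robots" content="..."> directives such as
    noindex or nofollow from a fetched page (naive regex for illustration)."""
    pattern = r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)["\']'
    return [d.strip() for m in re.findall(pattern, html, flags=re.I)
            for d in m.split(",")]

if __name__ == "__main__":
    site = "https://example.org"            # placeholder site
    print(robots_allows(site, "/some/page"))
    html = urllib.request.urlopen(site).read().decode("utf-8", "replace")
    print(meta_robots_directives(html))
```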
In modern NLP applications, word embeddings are a crucial backbone that can be readily shared across a number of tasks. However, as text distributions change and word semantics evolve over time, the downstream applications using the embeddings can suffer if the word representations do not conform to the data drift. Thus, maintaining word embeddings that are consistent with the underlying data distribution is a key problem. In this work, we tackle this problem and propose TransDrift (codebase: https://github.com/data-iitd/transdrift), a transformer-based prediction model for word embeddings. Leveraging the flexibility of transformers, our model accurately learns the dynamics of the embedding drift and predicts the future embedding. In experiments, we compare with existing methods and show that our model makes significantly more accurate predictions of the word embedding than the baselines. Crucially, by applying the predicted embeddings as a backbone for downstream classification tasks, we show that our embeddings lead to superior performance compared to the previous methods.
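As a rough illustration of the general idea (see the released codebase for the actual TransDrift model), the sketch below feeds a word's embedding history to a small transformer encoder and regresses the embedding at the next time step; the dimensions, layer counts, and random data are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class DriftPredictor(nn.Module):
    """Toy drift model: given a word's embeddings at time steps 1..T,
    predict its embedding at time step T+1."""

    def __init__(self, dim=100, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, history):               # history: (batch, T, dim)
        encoded = self.encoder(history)
        return self.head(encoded[:, -1])       # predict from the last time step

model = DriftPredictor(dim=100)
history = torch.randn(32, 4, 100)              # 32 words, 4 past embedding snapshots
target = torch.randn(32, 100)                  # embeddings at the next snapshot
loss = nn.functional.mse_loss(model(history), target)
loss.backward()
```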
Temporal question answering (QA) involves explicit (e.g., "...before 2024") or implicit (e.g., "...during the Cold War period") time constraints. Implicit constraints are more challenging; yet benchmarks for temporal QA largely disregard such questions. This shortcoming spans three aspects. First, implicit questions are scarce in existing benchmarks. Second, questions are created based on hand-crafted rules, thus lacking diversity in formulations. Third, the source for answering is either a KB or a text corpus, disregarding cues from multiple sources. We propose a benchmark, called TIQ (Temporal Implicit Questions), based on novel techniques for constructing questions with implicit time constraints. First, questions are created automatically, with systematic control of topical diversity, timeframe, head vs. tail entities, etc. Second, questions are formulated using diverse snippets and further paraphrasing by a large language model. Third, snippets for answering come from a variety of sources including KB, text, and infoboxes. The TIQ benchmark contains 10,000 questions with ground-truth answers and underlying snippets as supporting evidence.
Contemporary sequential recommendation systems predominantly leverage statistical correlations derived from user interaction histories to predict future preferences. However, these correlations often mask implicit challenges. On the one hand, user data is frequently plagued by implicit, noisy feedback, misdirecting users towards items that fail to align with their actual interests, which is magnified in sequential recommendation contexts. On the other hand, prevalent methods tend to over-rely on similarity-based attention mechanisms across item pairs, which are prone to utilizing heuristic shortcuts, thereby leading to suboptimal recommendation.
To tackle these issues, we put forward a causality-driven user modeling approach for sequential recommendation, which pivots towards a causal perspective. Specifically, our approach involves the application of a causal graph to identify confounding factors that give rise to spurious correlations and to isolate conceptual variables that causally encapsulate user preferences. By learning the representation of these disentangled causal variables at the conceptual level, we can distinguish between causal and non-causal associations while preserving the inherent sequential nature of user behaviors. This enables us to ascertain which elements are critical and which may induce unintended biases. Our framework is compatible with various mainstream sequential models, which offers a robust foundation for reconstructing more accurate and meaningful user and item representations driven by causality.
Online advertising contributes a considerable part of the tech sector's revenue and has had a remarkable influence on the public agenda. With evolving developments, AI is playing an increasingly significant role in online advertising. We propose to create a forum for researchers, developers, users, ventures, policymakers, and other stakeholders to exchange ideas, research, innovations, etc., with emphasis on (1) AI-driven mechanism design for distributing advertisements, (2) generative AI for creating content in advertisements, such as promotion images/videos, and (3) ethics issues, especially in political advertisements, such as user privacy, fairness, hate speech, misinformation, etc. Relevant but not mentioned areas are also much encouraged. We plan to organize a half-day workshop.
In this work, we demonstrated the use of Partial Convolutions for RGBD image inpainting. We proposed two models, L-PConv and Attn-PConv. The baseline partial convolution model is outperformed by both of our proposed models, with the Attn-PConv model performing the best. The proposed Attn-PConv model is able to infill missing pixels with relatively less training time when compared to some other GAN-based models. As far as we know, this is the first time a partial convolution model has been used successfully for RGBD image inpainting. The results (Image SSIM: 0.9787, Image PSNR: 30.9665, Depth SSIM: 0.9818, Depth PSNR: 35.7311) indicate that our model is successful in RGBD image inpainting. The additional loss terms and the Attentive Normalization technique help improve the performance of the model significantly. We envision practical applications for our model in augmented reality, particularly in scenarios where frequent pixel infilling is required for both RGB and Depth images together. In the future, our research aims to incorporate higher resolutions, aligning with the capabilities of modern cameras capable of capturing images at 4K resolution.
Expressing opinions and interacting with others on the Web has led to the production of an abundance of online discourse data, such as claims and viewpoints on controversial topics, their sources and contexts (events, entities). This data constitutes a valuable source of insights for studies into misinformation spread, bias reinforcement, echo chambers or political agenda setting. Computational methods, mostly from the field of NLP, have emerged that tackle a wide range of tasks in this context, including argument and opinion mining, claim detection, check-worthiness detection, stance detection or fact verification. However, computational models require robust definitions of the classes and concepts under investigation. Thus, these computational tasks require a strong interdisciplinary and epistemological foundation, specifically with respect to the underlying definitions of key concepts such as claims, arguments, stances, check-worthiness or veracity. This calls for a highly interdisciplinary approach combining expertise from fields such as communication studies, computational linguistics and computer science. As opposed to facts, claims are inherently more complex. Their interpretation strongly depends on the context and a variety of intentional or unintended meanings, where terminology and conceptual understandings strongly diverge across communities. From a computational perspective, in order to address this complexity, the synergy of multiple approaches, coming from both symbolic (knowledge representation) and statistical AI, seems promising to tackle such challenges. This workshop aims at strengthening the relations between these communities, providing a forum for shared work on the modeling, extraction and analysis of discourse on the Web. It will address the need for a shared understanding and structured knowledge about discourse data in order to enable machine interpretation, discoverability and reuse, in support of scientific or journalistic studies into the analysis of societal debates on the Web. Beyond research into information and knowledge extraction, data consolidation and modeling for knowledge graph building, the workshop targets communities focusing on the analysis of online discourse, relying on methods from machine learning, natural language processing, large language models and Web data mining.
This study investigates engagement patterns related to OpenAI's ChatGPT on Japanese Twitter, focusing on two distinct user groups, early and late engagers, inspired by the Innovation Theory. Early engagers are defined as individuals who initiated conversations about ChatGPT during its early stages, whereas late engagers are those who began participating at a later date. To examine the nature of the conversations, we adopt a dual methodology encompassing both quantitative and qualitative analyses. The quantitative analysis reveals that early engagers often engage with more forward-looking and speculative topics, emphasizing the technological advancements and potential transformative impact of ChatGPT. Conversely, the late engagers interact more with contemporary topics, focusing on the optimization of existing AI capabilities and considering their inherent limitations. Through our qualitative analysis, we propose a method to measure the proportion of shared or unique viewpoints within topics across both groups. We found that early engagers generally concentrate on a more limited range of perspectives, whereas late engagers exhibit a wider range of viewpoints. Interestingly, a weak correlation was found between the volume of tweets and the diversity of discussed topics in both groups. These findings underscore the importance of identifying semantic diversity, rather than relying solely on the volume of tweets, for understanding differences in communication styles between groups within a given topic. Moreover, our versatile dual methodology holds potential for broader applications, such as studying online discourse patterns within different user groups, or in contexts beyond ChatGPT.
While the use of machine learning for the detection of propaganda techniques in text has garnered considerable attention, most approaches focus on "black-box'' solutions with opaque inner workings. Interpretable approaches provide a solution; however, they depend on careful feature engineering and costly expert-annotated data. Additionally, language features specific to propagandistic text are generally the focus of rhetoricians or linguists, and there is no dataset labeled with such features suitable for machine learning. This study codifies 22 rhetorical and linguistic features identified in the literature on the language of persuasion for the purpose of annotating an existing dataset labeled with propaganda techniques. To help human experts annotate natural language sentences with these features, RhetAnn, a web application, was specifically designed to minimize an otherwise considerable mental effort. Finally, a small set of annotated data was used to fine-tune GPT-3.5, a generative large language model (LLM), to annotate the remaining data while optimizing for financial cost and classification accuracy. This study demonstrates how combining a small number of human-annotated examples with GPT can be an effective strategy for scaling the annotation process at a fraction of the cost of traditional annotation relying solely on human experts. The results are on par with the best performing model at the time of writing, namely GPT-4, at a tenth of the cost. Our contribution is a set of features, their properties, definitions, and examples in a machine-readable format, along with the code for RhetAnn and the GPT prompts and fine-tuning procedures for advancing state-of-the-art interpretable propaganda technique detection.
In today's digital era, the rapid spread of misinformation poses threats to public well-being and societal trust. As online misinformation proliferates, manual verification by fact checkers becomes increasingly challenging. We introduce FACT-GPT (Fact-checking Augmentation with Claim matching Task-oriented Generative Pre-trained Transformer), a framework designed to automate the claim matching phase of fact-checking using Large Language Models (LLMs). This framework identifies new social media content that either supports or contradicts claims previously debunked by fact-checkers. Our approach employs LLMs to generate a labeled dataset consisting of simulated social media posts. This data set serves as a training ground for fine-tuning more specialized LLMs. We evaluated FACT-GPT on an extensive dataset of social media content related to public health. The results indicate that our fine-tuned LLMs rival the performance of larger pre-trained LLMs in claim matching tasks, aligning closely with human annotations. This study achieves three key milestones: it provides an automated framework for enhanced fact-checking; demonstrates the potential of LLMs to complement human expertise; offers public resources, including datasets and codes, to further research and applications in the fact-checking domain.
This paper introduces a Bayesian framework designed to measure the degree of association between categorical random variables. The method is grounded in the formal definition of variable independence and is implemented using Markov Chain Monte Carlo (MCMC) techniques. Unlike commonly employed techniques in Association Rule Learning, this approach enables a clear and precise estimation of confidence intervals and the statistical significance of the measured degree of association. We applied the method to non-exclusive emotions identified by annotators in 4,613 tweets written in Portuguese. This analysis revealed pairs of emotions that exhibit associations and mutually opposed pairs. Moreover, the method identifies hierarchical relations between categories, a feature observed in our data, and is utilized to cluster emotions into basic-level groups.
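The paper's exact model and MCMC scheme are not reproduced here; as a simplified illustration of the same idea, the sketch below places a conjugate Dirichlet posterior over the cells of a toy 2x2 emotion co-occurrence table and derives a credible interval for a pointwise-mutual-information-style association measure via plain Monte Carlo sampling. The counts are invented.

```python
import numpy as np

# Toy 2x2 contingency table of co-annotated emotions in tweets:
# rows = "joy" present / absent, columns = "surprise" present / absent.
counts = np.array([[120,  80],
                   [ 60, 340]], dtype=float)

rng = np.random.default_rng(42)
prior = 1.0                                    # symmetric Dirichlet prior
samples = rng.dirichlet(counts.ravel() + prior, size=20_000)

# Association measure: log p(joy, surprise) - log p(joy) - log p(surprise);
# 0 means independence, > 0 means positive association.
p = samples.reshape(-1, 2, 2)
p_joint = p[:, 0, 0]
p_joy = p[:, 0, :].sum(axis=1)
p_surprise = p[:, :, 0].sum(axis=1)
assoc = np.log(p_joint) - np.log(p_joy) - np.log(p_surprise)

lo, hi = np.percentile(assoc, [2.5, 97.5])
print(f"posterior mean: {assoc.mean():.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
print("P(positive association):", float((assoc > 0).mean()))
```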
Social media influence campaigns pose significant challenges to public discourse and democracy. Traditional detection methods fall short due to the complexity and dynamic nature of social media. Addressing this, we propose a novel detection method using Large Language Models (LLMs) that incorporates both user metadata and network structures. By converting these elements into a text format, our approach effectively processes multilingual content and adapts to the shifting tactics of malicious campaign actors. We validate our model through rigorous testing on multiple datasets, showcasing its superior performance in identifying influence efforts. This research not only offers a powerful tool for detecting campaigns, but also sets the stage for future enhancements to keep up with the fast-paced evolution of social media-based influence tactics.
YouTube's recommendation system is integral to shaping user experiences by suggesting content based on past interactions using collaborative filtering techniques. Nonetheless, concerns about potential biases and homogeneity in these recommendations are prevalent, with the danger of leading users into filter bubbles and echo chambers that reinforce their pre-existing beliefs. Researchers have sought to understand and address these biases in recommendation systems. However, traditionally, such research has relied primarily on metadata, such as video titles, which does not always encapsulate the full content or context of the videos. This reliance on metadata can overlook the nuances and substantive content of videos, potentially perpetuating the very biases and echo chambers that the research aims to unravel. This study advances the examination of sentiment, toxicity, and emotion within YouTube content by conducting a comparative analysis across various depths of titles and narratives extracted by leveraging GPT-4. Our analysis reveals a clear trend in sentiment, emotion, and toxicity levels as the depth of content analysis increases. Notably, there is a general shift from neutral to positive sentiments in both YouTube video titles and narratives. Emotion analysis indicates an increase in positive emotions, particularly joy, with a corresponding decrease in negative emotions such as anger and disgust in narratives, while video titles show a steady decrease in anger. Additionally, toxicity analysis presents a contrasting pattern, with video titles displaying an upward trend in toxicity, peaking at the greatest depth analyzed, whereas narratives exhibit a high initial toxicity level that sharply decreases and stabilizes at lower depths. These findings suggest that the depth of engagement with video content significantly influences emotional and sentiment expressions.
Fact-check consumers can have different preferences regarding the amount of text used for explaining a claim veracity verdict. Dynamically adapting the size of a fact-check report is thus an important functionality for systems designed to convey claim verification explainability. Recent works have experimented with applying transformer-based or LLM-based text summarization methods in a zero-shot or few-shot manner, making use of existing texts available in the summary parts of fact-check reports (e.g., called "justification'' in PolitiFact). However, for complex fact-checks, purely sub-symbolic summarizers tend to either omit some elements of the fact-checker's argumentation chains or include contextual statements that may not be essential at the given level of granularity. In this paper, we propose a new method for enhancing fact-check summarization by injecting elements of structured fact-checker argumentation. This argumentation is, in turn, not only captured at the discourse level but also tied to an entity graph representing the fact-check, for which we employ the PURO diagrammatic language. We performed a manual analysis of fact-check reports from two fact-checker websites, yielding (1) textual snippets containing the argumentation essence of the fact-check report and (2) categorized argumentation elements tied to entity graphs. These snippets are then fed, as an additional input, to a state-of-the-art hybrid summarizer which has previously produced accurate fact-check summaries. We observe mild improvements on various ROUGE metrics, even if the validity of the results is limited given the small size of the dataset. We also compare the human-provided argumentation element categories with those returned, for the given fact-check ground-truth summary, by a pre-trained language model under both basic and augmented prompting. This yields a moderate accuracy, as the model often fails to comply with the explicit instructions given.
The emergence of Data-centric AI (DCAI) represents a pivotal shift in AI development, redirecting focus from model refinement to prioritizing data quality. This paradigmatic transition emphasizes the critical role of data in AI. While past approaches centered on refining models, they often overlooked potential data imperfections, raising questions about the true potential of enhanced model performance. DCAI advocates the systematic engineering of data, complementing existing efforts and playing a vital role in driving AI success. This transition has spurred innovation in various machine learning and data mining algorithms and their applications on the Web. Therefore, we propose the DCAI Workshop at WWW'24, which offers a platform for academic researchers and industry practitioners to showcase the latest advancements in DCAI research and their practical applications in the real world.
Over the past decades, text classification has undergone remarkable evolution across diverse domains. Despite these advancements, most existing model-centric methods in text classification cannot generalize well on class-imbalanced datasets that contain high-similarity textual information. Instead of developing new model architectures, data-centric approaches enhance performance by manipulating the data structure. In this study, we investigate robust data-centric approaches that can help text classification on our collected dataset, the metadata of survey papers about Large Language Models (LLMs). In the experiments, we explore four paradigms and observe that leveraging arXiv's co-category information on graphs helps classify the text data more robustly than the other three paradigms: conventional machine-learning algorithms, fine-tuning of pre-trained language models, and zero-shot / few-shot classification using LLMs.
Time series forecasting holds significant value in various application scenarios. However, existing forecasting methods primarily focus on optimizing model architecture while neglecting the substantial impact of data quality on model learning. In this study, we aim to enhance model performance by optimizing data utilization based on data quality, and propose a Data Quality-based Gradient Optimization (DQGO) method to facilitate the training of recurrent neural networks. Firstly, we define sample quality as the degree of matching between a sample and the model, and suggest calculating it from the attention entropy produced by an attention mechanism. Secondly, we optimize the model's gradient vector by giving different weights to samples of different quality. Through experiments conducted on six datasets, the results demonstrate that DQGO significantly improves LSTM's performance. In certain cases, it even surpasses state-of-the-art models.
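One plausible instantiation of the quality-to-weight mapping (the paper's exact formulation may differ) is sketched below: the entropy of each sample's attention distribution is computed, and samples with more uniform attention are down-weighted in the loss. The toy attention scores and per-sample losses are placeholders.

```python
import torch

def attention_entropy(attn_weights, eps=1e-9):
    """Entropy of an attention distribution over time steps for each sample;
    attn_weights has shape (batch, seq_len) and its rows sum to 1."""
    return -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)

def quality_weighted_loss(per_sample_loss, attn_weights):
    """Down-weight samples whose attention is close to uniform (high entropy),
    i.e. samples the model matches poorly; one plausible instantiation of the
    quality-to-weight mapping, not necessarily the paper's."""
    entropy = attention_entropy(attn_weights)
    weights = torch.softmax(-entropy, dim=0) * entropy.numel()   # mean weight ~ 1
    return (weights.detach() * per_sample_loss).mean()

# Toy usage: an LSTM-style forecaster producing per-step hidden states.
batch, seq_len, hidden = 16, 24, 32
hidden_states = torch.randn(batch, seq_len, hidden)
scores = hidden_states @ torch.randn(hidden, 1)                  # attention scores
attn = torch.softmax(scores.squeeze(-1), dim=-1)                 # (batch, seq_len)
per_sample_loss = torch.randn(batch).abs()                       # placeholder losses
loss = quality_weighted_loss(per_sample_loss, attn)
```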
Graph summarization as a preprocessing step is an effective and complementary technique for scalable graph neural network (GNN) training. In this work, we propose the Coarsening Via Convolution Matching (ConvMatch) algorithm and a highly scalable variant, A-ConvMatch, for creating summarized graphs that preserve the output of graph convolution. We evaluate ConvMatch on six real-world link prediction and node classification graph datasets, and show it is efficient and preserves prediction performance while significantly reducing the graph size. Notably, ConvMatch achieves up to 95% of the prediction performance of GNNs on node classification while trained on graphs summarized down to 1% the size of the original graph. Furthermore, on link prediction tasks, ConvMatch consistently outperforms all baselines, achieving up to a 2X improvement.
Document-level event extraction faces numerous challenges in accurately modeling real-world financial scenarios, particularly due to the inadequacies in existing datasets regarding data scale and fine-grained annotations. The development of datasets is a crucial factor in driving research progress; therefore, we present a high-quality Chinese document-level event extraction dataset, CFinDEE. This dataset, grounded in real-world financial news, defines 22 event types and 116 argument roles, annotating 26,483 events and 107,096 event arguments. CFinDEE aims to address these shortcomings by providing more comprehensive annotations and data augmentation, offering richer resources for document-level event extraction in the financial domain. CFinDEE extends data both horizontally and vertically, where horizontal expansion enriches the types of financial events, enhancing the diversity of the dataset; vertical expansion, by increasing the scale of the data, effectively boosts the practical value of the dataset. Experiments conducted on multiple advanced models have validated the high applicability and effectiveness of the CFinDEE dataset for document-level event extraction tasks in the financial field.
Computing related entities for a given seed entity is an important task in exploratory search and comparative data analysis. Prior works, using the seed-based set expansion paradigm, have focused on the single aspect of identifying homogeneous sets with high pairwise relatedness. A few recent works discuss cluster-based approaches to tackle multi-faceted set expansion; however, they fail to harness the specificity of the clusters and to generate explanations for them. This paper poses multi-faceted set expansion as an optimization problem, where the goal is to compute multiple groups of entities that convey different aspects in an explainable manner, with high similarity within each group and diversity across groups. To extend a seed entity, we collect a large pool of candidate entities and facets (e.g., categories) from Wikipedia and knowledge bases, and construct a candidate graph. We propose FASETS, an efficient algorithm for computing faceted groups of bounded size, based on random walks over the candidate graph. Our extensive evaluation shows the superiority of FASETS over prior baselines with regard to ground truth collected through crowdsourcing.
The rapid evolution of text-to-image diffusion models has opened the door to generative AI, enabling the translation of textual descriptions into visually compelling images of remarkable quality. However, a persistent challenge within this domain is optimizing prompts so that abstract concepts are effectively conveyed through concrete objects. For example, text encoders can hardly express "peace", yet can easily illustrate olive branches and white doves. This paper introduces a novel approach named Prompt Optimizer for Abstract Concepts (POAC), specifically designed to enhance the performance of text-to-image diffusion models in interpreting and generating images from abstract concepts. We propose a Prompt Language Model (PLM), which is initialized from a pre-trained language model and then fine-tuned on a curated dataset of abstract concept prompts. The dataset is created with GPT-4 by extending each abstract concept into a scene with concrete objects. Our framework employs a Reinforcement Learning (RL)-based optimization strategy that focuses on the alignment between the images generated by a stable diffusion model and the optimized prompts. Through extensive experiments, we demonstrate that POAC significantly improves the accuracy and aesthetic quality of generated images, particularly in the depiction of abstract concepts and in alignment with optimized prompts. We also present a comprehensive analysis of our model's performance across diffusion models under different settings, showcasing its versatility and effectiveness in enhancing abstract concept representation.
With the widespread adoption of deep learning-based models in practical applications, concerns about their fairness have become increasingly prominent. Existing research indicates that both the models themselves and the datasets on which they are trained can contribute to unfair decisions. In this paper, we address the data-related aspect of the problem, aiming to enhance the data to guide the model towards greater trustworthiness. Due to their uncontrolled curation and limited understanding of fairness drivers, real-world datasets pose challenges in eliminating unfairness. Recent findings highlight the potential of Foundation Models in generating substantial datasets. We leverage these foundation models in conjunction with state-of-the-art explainability and fairness platforms to generate counterfactual examples. These examples are used to augment the existing dataset, resulting in a fairer learning model. Our experiments were conducted on the CelebA and UTKface datasets, where we assessed the quality of generated counterfactual data using various bias-related metrics. We observed improvements in bias mitigation across several protected attributes in the fine-tuned model when utilizing counterfactual data.
In the evolving landscape of machine learning research, two significant developments have risen to prominence: foundation models and federated learning. The FL@FM-TheWebConf'24 workshop provides an exciting forum for researchers and practitioners to discuss promising works and chart potential trajectories forward for this emerging field.
Federated learning (FL) is a promising approach for solving multilingual tasks, potentially enabling clients with their own language-specific data to collaboratively construct a high-quality neural machine translation (NMT) model. However, communication constraints in practical network systems present challenges for exchanging large-scale NMT engines between FL parties. In this paper, we propose a meta-learning-based adaptive parameter selection methodology, MetaSend, that improves the communication efficiency of model transmissions from clients during FL-based multilingual NMT training. Our approach learns a dynamic threshold for filtering parameters prior to transmission without compromising the NMT model quality, based on the tensor deviations of clients between different FL rounds. Through experiments on two NMT datasets with different language distributions, we demonstrate that MetaSend obtains substantial improvements over baselines in translation quality in the presence of a limited communication budget.
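The following minimal sketch shows the kind of deviation-based parameter filtering the abstract describes: only parameters whose change since the last FL round exceeds a threshold are transmitted. In MetaSend the threshold is learned dynamically via meta-learning; here a fixed value and the dictionary-of-arrays model format are assumptions for illustration.

```python
import numpy as np

def select_parameters(current, previous, threshold):
    """Keep only parameters whose change since the last FL round
    exceeds the threshold; the rest are not transmitted."""
    sent = {}
    for name, w in current.items():
        delta = w - previous[name]
        mask = np.abs(delta) > threshold
        sent[name] = np.where(mask, w, 0.0)   # sparse update to transmit
    return sent

# Toy usage with one weight matrix observed over two rounds.
prev = {"encoder.weight": np.random.randn(4, 4)}
curr = {"encoder.weight": prev["encoder.weight"] + 0.01 * np.random.randn(4, 4)}
update = select_parameters(curr, prev, threshold=0.015)
```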
Federated Multilingual Modeling (FMM) has become an essential approach in natural language processing (NLP) due to increasing linguistic diversity and the heightened emphasis on data privacy. However, FMM faces two primary challenges: 1) the high communication costs inherent in network operations, and 2) the complexities arising from parameter interference, as languages exhibit both unique characteristics and shared features. To tackle these issues, we introduce a communication-efficient framework for Multilingual Modeling (MM) that combines low-rank adaptation with a hierarchical language tree structure. Our method maintains the base model's weights while focusing on updating only the Low-rank adaptation (LoRA) parameters, significantly reducing communication costs. Additionally, we mitigate parameter conflicts by organizing languages based on their familial ties rather than merging all LoRA parameters together. Our experimental findings reveal that this novel model surpasses established baseline models in performance and markedly decreases communication overhead.
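A minimal sketch of the family-wise aggregation idea follows: LoRA updates are averaged only among clients whose languages share a family, rather than merging every client's parameters together. The flat two-level grouping, the parameter names, and plain averaging are illustrative assumptions; the paper's hierarchical language tree may organize and weight clients differently.

```python
import numpy as np

def aggregate_by_family(client_lora, families):
    """Average LoRA updates only among clients of the same language family."""
    merged = {}
    for family, clients in families.items():
        stacked = {}
        for c in clients:
            for name, w in client_lora[c].items():
                stacked.setdefault(name, []).append(w)
        merged[family] = {name: np.mean(ws, axis=0) for name, ws in stacked.items()}
    return merged

# Toy usage: two Romance-language clients and one Germanic-language client.
client_lora = {c: {"lora_A": np.random.randn(8, 4)} for c in ["es", "it", "de"]}
families = {"romance": ["es", "it"], "germanic": ["de"]}
family_adapters = aggregate_by_family(client_lora, families)
```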
Generative AI has made impressive strides in enabling users to create diverse and realistic visual content such as images, videos, and audio. However, training generative models on large centralized datasets can pose challenges in terms of data privacy, security, and accessibility. Federated learning (FL) is an approach that uses decentralized techniques to collaboratively train a shared deep learning model while retaining the training data on individual edge devices to preserve data privacy. This paper proposes a novel method for training a Denoising Diffusion Probabilistic Model (DDPM) across multiple data sources using FL techniques. Diffusion models, a newly emerging class of generative models, show promising results, achieving higher-quality images than Generative Adversarial Networks (GANs). Our proposed method, Phoenix, is an unconditional diffusion model that leverages strategies to improve the data diversity of generated samples even when trained on data with statistical heterogeneity, i.e., Non-IID (Non-Independent and Identically Distributed) data. We demonstrate how our approach outperforms the default diffusion model in an FL setting. These results indicate that high-quality samples can be generated while maintaining data diversity, preserving privacy, and reducing communication between data sources, offering exciting new possibilities in the field of generative AI.
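As context for how a diffusion model could be trained across clients, the sketch below shows the standard FedAvg aggregation step that such a system builds on: client weights averaged in proportion to local data size. It does not reproduce Phoenix's diversity-preserving strategies; the parameter names and toy sizes are assumptions.

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Standard FedAvg: average client model weights, weighted by local data size."""
    total = float(sum(client_sizes))
    agg = {}
    for name in client_weights[0]:
        agg[name] = sum(w[name] * (n / total)
                        for w, n in zip(client_weights, client_sizes))
    return agg

# Toy usage: three clients with Non-IID data sizes sharing a (tiny) "denoiser".
clients = [{"denoiser.weight": np.random.randn(3, 3)} for _ in range(3)]
global_weights = fedavg(clients, client_sizes=[120, 40, 300])
```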
With the tremendous success of large language models such as ChatGPT, artificial intelligence has entered a new era of large models. Multimodal data, which can comprehensively perceive and recognize the physical world, has become an essential path towards general artificial intelligence. However, multimodal large models trained on public datasets often underperform in specific industrial domains. In this paper, we tackle the problem of building large vision-language intelligent models for specific industrial domains by leveraging the general large models and federated learning. We compare the challenges faced by federated learning in the era of small models and large models from different dimensions, and propose a technical framework for federated learning in the era of large models. Specifically, our framework mainly considers three aspects: heterogeneous model fusion, flexible aggregation methods, and data quality improvement. Based on this framework, we conduct a case study of leading enterprises contributing vision-language data and expert knowledge to city safety operation management. The preliminary experiments show that enterprises can enhance and accumulate their intelligence capabilities through federated learning, and jointly create an intelligent city model that provides high-quality intelligent services covering energy infrastructure security, residential community security and urban operation management.
The advent of large language models (LLMs) presents both opportunities and challenges for the information retrieval (IR) community. On one hand, LLMs will revolutionize how people access information, while retrieval techniques can play a crucial role in addressing many inherent limitations of LLMs. On the other hand, there are open problems regarding the collaboration of retrieval and generation, the potential risks of misinformation, and the concerns about cost-effectiveness. Seizing this critical moment for development calls for joint effort from academia and industry on many key issues, including identification of new research problems, proposal of new techniques, and creation of new evaluation protocols. It has been one year since the launch of ChatGPT in November 2022, and the entire community is currently undergoing a profound transformation in techniques. Therefore, this workshop will be a timely venue to exchange ideas and forge collaborations. The organizers, committee members, and invited speakers are composed of a diverse group of researchers coming from leading institutions in the world. This event will be made up of multiple sessions, including invited talks, paper presentations, hands-on tutorials, and panel discussions. All the materials collected for this workshop will be archived and shared publicly, which will present a long-term value to the community.
Large Language Models (LLMs) are renowned for their ability to encode a vast amount of general-domain knowledge, enabling them to excel in question answering, dialogue systems, and summarization tasks. However, the medical domain presents a unique challenge because medical knowledge follows a long-tail distribution. Existing approaches address this challenge by injecting medical knowledge into LLMs from a single source, such as medical textbooks or medical knowledge bases. However, medical knowledge is distributed across multiple heterogeneous information sources, and a medical question-answering system can improve answer coverage and confidence by considering these diverse knowledge sources together. To bridge this gap, we propose a novel approach called Heterogeneous Knowledge Retrieval-Augmented LLM for medical domain question answering. Our experiments, conducted on the MedQA-USMLE dataset, demonstrate promising performance improvements. These results underscore the importance of harnessing heterogeneous knowledge sources in the medical domain.
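The sketch below illustrates the general retrieval-augmented pattern the abstract describes: evidence is pooled from several heterogeneous sources and assembled into a single prompt for the LLM. The toy word-overlap retriever, the source names, and the prompt template are assumptions for illustration, not the paper's system.

```python
def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank snippets by word overlap with the question."""
    q = set(question.lower().split())
    scored = sorted(corpus, key=lambda s: -len(q & set(s.lower().split())))
    return scored[:k]

def build_prompt(question, sources):
    """Pool evidence from several heterogeneous sources into one LLM prompt."""
    evidence = []
    for name, corpus in sources.items():
        for snippet in retrieve(question, corpus):
            evidence.append(f"[{name}] {snippet}")
    context = "\n".join(evidence)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Toy usage with two hypothetical knowledge sources.
sources = {
    "textbook": ["Metformin is a first-line therapy for type 2 diabetes."],
    "knowledge_base": ["Metformin -- contraindicated_in -- severe renal impairment."],
}
prompt = build_prompt("What is a first-line drug for type 2 diabetes?", sources)
```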
This paper addresses Video Moment Retrieval (VMR) in a weakly-supervised fashion, which aims to retrieve local video clips with only global video-level descriptions. Scrutinizing recent advances in VMR, we find that fully-supervised models achieve strong performance but rely heavily on precise temporal annotations. Weakly-supervised methods do not rely on temporal annotations; however, their performance is much weaker than that of fully-supervised ones. To fill this gap, we propose to take advantage of a pretrained video-text model as a hitchhiker to generate pseudo temporal labels. The pseudo temporal labels, together with the descriptive labels, are then used to guide the training of the proposed VMR model. The proposed Location-irrelevant Proposal Learning (LPL) model is based on a pretrained video-text model with cross-modal prompt learning, together with different strategies to generate reasonable proposals of various lengths. Despite its simplicity, our method performs much better than previous state-of-the-art methods on standard benchmarks, e.g., +4.4% and +1.4% in mIoU on the Charades and ActivityNet-Caption datasets respectively, benefiting from training with fine-grained video-text pairs. Further experiments on two synthetic datasets with shuffled temporal locations and longer videos demonstrate our model's robustness to temporal localization bias as well as its strength in handling long video sequences.
Query keyword matching plays a crucial role in sponsored search advertising by retrieving semantically related keywords of the user query to target relevant advertisements. Conventional technical solutions adopt the retrieve-judge-then-rank retrieval framework structured in cascade funnels. However, it has limitations in accurately depicting the semantic relevance between the query and keyword, and the cumulative funnel losses result in unsatisfactory precision and recall. To address the above issues, this paper proposes a Large Language Model (LLM)-based keyword generation method (LKG) that generates related keywords from the search query in one step. LKG models query-keyword matching as an end-to-end keyword generation task based on the LLM through multi-match prompt tuning. Moreover, it employs feedback tuning and prefix tree-based constrained beam search to improve generation quality and efficiency. Extensive offline experiments and online A/B testing demonstrate the effectiveness and superiority of LKG, which is fully deployed in the Baidu sponsored search system and brings significant improvements.
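To illustrate prefix tree-based constrained decoding, the sketch below builds a token-level trie over an ad keyword inventory and restricts each decoding step to the trie's children, so only catalogued keywords can be generated. For brevity it uses greedy decoding with a stub scorer rather than beam search with a fine-tuned LLM, and the keyword list is hypothetical; it is not the deployed LKG system.

```python
def build_trie(keywords):
    """Prefix tree over tokenised keywords from the ad inventory."""
    root = {}
    for kw in keywords:
        node = root
        for tok in kw.split():
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return root

def allowed_next(trie, prefix):
    """Tokens the decoder is allowed to emit after the given prefix."""
    node = trie
    for tok in prefix:
        node = node[tok]
    return list(node.keys())

def constrained_greedy_decode(score_fn, trie):
    """At every step, restrict candidates to children of the current trie node."""
    prefix = []
    while True:
        candidates = allowed_next(trie, prefix)
        best = max(candidates, key=lambda t: score_fn(prefix, t))
        if best == "<eos>":
            return " ".join(prefix)
        prefix.append(best)

# Toy usage with a stub scorer standing in for the fine-tuned LLM.
trie = build_trie(["running shoes", "running socks", "trail shoes"])
decoded = constrained_greedy_decode(lambda p, t: len(t), trie)
```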
While dense retrieval methods have made significant advancements, sparse retrieval techniques continue to offer advantages in terms of interpretability and generalizability. However, query-document term mismatch in sparse retrieval persists, rendering it infeasible for many practical applications. Recent research has shown that Large Language Models (LLMs) hold relevant information that can enhance sparse retrieval through the application of prompt engineering. In this paper, we build upon this concept to explore various strategies employing LLMs for information retrieval purposes. Specifically, we utilize LLMs to enhance sparse retrieval through query rewriting and query expansion. In query rewriting, the original query is refined by creating several new queries. For query expansion, LLMs are employed to generate extra terms, thereby enriching the original query. We conduct experiments on a range of well-known information retrieval datasets, including MSMARCO-passage, TREC2019, TREC2020, Natural Questions, and SCIFACT. The experiments show that LLMs can benefit sparse methods, since the added information helps diminish the discrepancy between the term frequencies of important terms in the query and in the relevant document. In certain domains, we find that the effectiveness of LLMs is limited, indicating that they may not consistently perform optimally; we leave this for future research.
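A minimal sketch of LLM-based query expansion for a sparse retriever follows: expansion terms are appended to the original query before lexical scoring. The hard-coded `llm_terms` stand in for terms a prompted LLM would generate, and the raw term-frequency scorer stands in for BM25; both are assumptions for illustration.

```python
from collections import Counter

def expand_query(query, llm_terms):
    """Append LLM-generated expansion terms to the original query."""
    return query + " " + " ".join(llm_terms)

def tf_score(query, doc):
    """Very small stand-in for a sparse scorer: sum of query-term frequencies."""
    tf = Counter(doc.lower().split())
    return sum(tf[t] for t in query.lower().split())

# Toy usage: the expanded query now shares vocabulary with the relevant document.
docs = [
    "acetaminophen relieves mild fever and pain",
    "ibuprofen is a nonsteroidal anti-inflammatory drug",
]
query = "fever medicine"
expanded = expand_query(query, llm_terms=["acetaminophen", "antipyretic"])
ranked = sorted(docs, key=lambda d: -tf_score(expanded, d))
```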
The number of individuals having identical names on the internet is increasing, making the task of searching for a specific individual tedious: the user must vet many profiles with identical names to get to the actual individual of interest. The online presence of an individual forms the profile of that individual. We need a solution that helps users by consolidating the profiles of such individuals, retrieving factual information available on the web and providing it as a single result. We present a novel solution that retrieves web profiles belonging to those bearing identical full names through an end-to-end pipeline. Our solution involves information retrieval from the web (extraction), LLM-driven Named Entity Extraction (retrieval), and standardization of facts using Wikipedia, which returns profiles with fourteen multi-valued attributes. After that, profiles that correspond to the same real-world individuals are determined. We accomplish this by identifying similarities among profiles based on the extracted facts using a Prefix Tree-inspired data structure (validation) and utilizing ChatGPT's contextual comprehension (revalidation). The system offers varied levels of strictness while consolidating these profiles, namely strict, relaxed, and loose matching. The novelty of our solution lies in the innovative use of GPT, a highly powerful yet unpredictable tool, for such a nuanced task. A study involving twenty participants, along with other results, found that the system can effectively retrieve information for a specific individual.
Retrieval and ranking lie at the heart of several applications such as search, question answering, and recommendation. The use of large language models (LLMs) such as BERT in these applications has shown promising results in recent times. Recent works on text-based retrievers and rankers report promising results by using a bi-encoder (BE) architecture with BERT-like LLMs for retrieval and a cross-attention transformer (CAT) architecture with BERT or other LLMs for ranking the retrieved results. Although the use of the CAT architecture for re-ranking improves ranking metrics, its robustness to data shifts is not guaranteed. In this work we analyze the robustness of CAT-based rankers. Specifically, we show that CAT rankers are sensitive to item distribution shifts conditioned on a query, which we refer to as conditional item distribution shift (CIDS). CIDS naturally occurs in large online search systems as retrievers keep evolving, making it challenging to consistently train and evaluate rankers with the same item distribution. In this paper, we formally define CIDS and show that while CAT rankers are sensitive to it, BE models are far more robust. We propose a simple yet effective approach, referred to as BI-CAT, which augments BE model outputs with CAT rankers to significantly improve the robustness of CAT rankers without any drop in in-distribution performance. We conducted a series of experiments on two publicly available ranking datasets and one dataset from a large e-commerce store. Our results on datasets with CIDS demonstrate that the BI-CAT model significantly improves the robustness of CAT rankers by roughly 100-1000 bps in F1 without any reduction in in-distribution model performance.
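One simple way to combine the two signals, sketched below, is to interpolate the robust bi-encoder similarity with the cross-attention ranker's score; the abstract does not specify BI-CAT's exact mechanism, so the interpolation weight `alpha`, the stub CAT scores, and the random embeddings are purely illustrative assumptions.

```python
import numpy as np

def bi_encoder_score(q_vec, d_vec):
    """Bi-encoder relevance: cosine similarity of independently encoded query/doc."""
    return float(q_vec @ d_vec / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def combined_score(q_vec, d_vec, cat_score, alpha=0.5):
    """Blend the distribution-shift-robust BE signal with the CAT ranker's score."""
    return alpha * bi_encoder_score(q_vec, d_vec) + (1 - alpha) * cat_score

# Toy usage: CAT scores would come from a cross-attention ranker.
q = np.random.randn(16)
doc_vecs = [np.random.randn(16) for _ in range(3)]
cat_scores = [0.9, 0.2, 0.6]
ranking = sorted(range(3),
                 key=lambda i: -combined_score(q, doc_vecs[i], cat_scores[i]))
```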
Conversational search provides a more convenient interface for users to search by allowing multi-turn interaction with the search engine. However, the effectiveness of the conversational dense retrieval methods is limited by the scarcity of training data required for their fine-tuning. Thus, generating more training conversational sessions with relevant labels could potentially improve search performance. Based on the promising capabilities of large language models (LLMs) on text generation, we propose ConvSDG, a simple yet effective framework to explore the feasibility of boosting conversational search by using LLM for session data generation. Within this framework, we design dialogue/session-level and query-level data generation with unsupervised and semi-supervised learning, according to the availability of relevance judgments. The generated data are used to fine-tune the conversational dense retriever. Extensive experiments on four widely used datasets demonstrate the effectiveness and broad applicability of our ConvSDG framework compared with several strong baselines.
Graphs are widely applied to encode entities with various relations in web applications such as social media and recommender systems. Meanwhile, graph learning-based technologies, such as graph neural networks, are in demand to support the analysis, understanding, and usage of data in graph structures. Recently, the boom of language foundation models, especially Large Language Models (LLMs), has advanced several main research areas in artificial intelligence, such as natural language processing, graph mining, and recommender systems. The synergy between LLMs and graph learning holds great potential to prompt the research in both areas. For example, LLMs can facilitate existing graph learning models by providing high-quality textual features for entities and edges, or enhancing the graph data with encoded knowledge and information. It may also innovate with novel problem formulations on graph-related tasks. Due to the research significance as well as the potential, the convergent area of LLMs and graph learning has attracted considerable research attention. Therefore, we propose to hold the workshop Large Language Models for Graph Learning at WWW'24, in order to provide a venue to gather researchers in academia and practitioners in the industry to present the recent progress on relevant topics and exchange their critical insights.
The field of SocialNLP stands at the confluence of two pivotal domains: natural language processing (NLP) and social computing, offering a multidisciplinary framework for exploring the intersections between these areas. This domain is characterized by its tri-directional focus. First, it endeavors to tackle prevalent challenges within the realm of social computing by leveraging advanced NLP methodologies. Second, it aims to address traditional NLP problems by harnessing the vast and dynamic data available from social networks and social media platforms. Third, SocialNLP is dedicated to identifying and solving emerging issues that lie at the nexus of social computing and NLP. The 12th iteration of the SocialNLP workshop, a hallmark event in this vibrant field, is scheduled to convene at TheWebConf (WWW) 2024. This edition has witnessed the acceptance of nine rigorously peer-reviewed papers, chosen from a competitive pool, reflecting an acceptance ratio of 53.8%. This achievement underscores the high caliber of research and the innovative approaches that the submitted works propose to advance the field of SocialNLP. The organizing committee extends its heartfelt gratitude to all contributing authors for their submissions, to the members of the program committee for their indispensable role in the rigorous review process, and to the workshop chairs for their visionary leadership and dedication. Their collective efforts have been instrumental in fostering an environment of academic excellence and collaboration that continues to drive forward the frontiers of knowledge in the interdisciplinary area of SocialNLP.
In the event of a disaster, many people post information on SNS that promotes or inhibits action; we designate this information "behavioral facilitation information". Such information is likely to have various effects on user behavior. A wide variety of people browse SNSs, and different readers perceive the same information in different ways. Therefore, in this study we focus on users' attributes, namely personality traits, age, and gender, and analyze how different users perceive information that promotes behavior. Specifically, we extract behavioral facilitation information from SNSs at the time of a disaster using deep learning, and classify the information into four categories: "suggestion," "inhibition," "encouragement," and "wish." Then, we conduct an experiment in which subjects, grouped by these attributes, read behavioral facilitation information and judge how they feel about it. We then analyze the results to determine the relationship between behavioral facilitation information and readers' attributes.
Authorship Attribution (AA) seeks to determine the authorship of texts by examining distinctive writing styles. Although current AA methods have shown promising results, they often underperform in scenarios with significant topic shifts. This limitation arises from their inability to effectively separate topical content from the author's stylistic elements. Furthermore, most studies have focused on individual-level AA, overlooking the potential of regional-level AA to uncover linguistic patterns influenced by cultural and geographical factors. To bridge these gaps, this paper introduces ContrastDistAA, a novel framework that leverages contrastive learning and mutual information maximization to disentangle content and stylistic features in latent representations for AA. Our extensive experiments demonstrate that ContrastDistAA surpasses existing state-of-the-art models in both individual and regional-level AA tasks. This breakthrough not only improves the accuracy of authorship attribution but also broadens its applicability to include regional linguistic analysis, making a substantial contribution to the field of computational linguistics.
A crucial element in the combat against hate speech is the development of efficient algorithms for automatically detecting hate speech. Previous research, however, has largely neglected important insights from the psychology literature, particularly the relationship between personality and hate, resulting in suboptimal performance in hate speech detection. To this end, we propose a novel framework for detecting hate speech that focuses on people's personality factors as reflected in their writing. Our framework has two components: (i) a knowledge distillation model for fully automating the process of personality inference from text and (ii) a personality-based deep learning model for hate speech detection. Our approach is unique in that it incorporates low-level personality factors, which have been largely neglected in prior literature, into automated hate speech detection and proposes novel deep learning components for fully exploiting the intricate relationship between personality and hate (i.e., intermediate personality factors). The evaluation shows that our model significantly outperforms state-of-the-art baselines. Our study paves the way for future research by incorporating personality aspects into the design of automated hate speech detection. In addition, it offers substantial assistance to online social platforms and governmental authorities facing challenges in effectively moderating hate speech.
Neural network models have been applied to social media tasks such as hate speech detection and sentiment analysis, but they are susceptible to adversarial attacks. For instance, in a text classification task, the attacker elaborately introduces perturbations to the original texts that hardly alter the original semantics in order to trick the model into making different predictions. By studying textual adversarial attack methods, the robustness of language models can be evaluated and then improved. Currently, most research in this field focuses on English, with some work on Chinese; however, there is little research targeting Chinese minority languages. With the rapid development of artificial intelligence technology and the emergence of Chinese minority language models, textual adversarial attacks become a new challenge for the information processing of Chinese minority languages. In response, we propose TSTricker, a multi-granularity Tibetan textual adversarial attack method based on masked language models. We utilize masked language models to generate candidate substitution syllables or words, adopt a scoring mechanism to determine the substitution order, and then apply the attack to several fine-tuned victim models. The experimental results show that TSTricker reduces the accuracy of the classification models by more than 28.70% and makes them change their predictions on more than 90.60% of the samples, an evidently higher attack effect than the baseline method.
Memes are important because they serve as conduits for expressing emotions, opinions, and social commentary online, providing valuable insight into public sentiment, trends, and social interactions. By combining textual and visual elements, multi-modal fusion techniques enhance meme analysis, enabling the classification of offensive and sentimental memes effectively. Early and late fusion methods effectively integrate multi-modal data but face limitations. Early fusion integrates features from different modalities before classification. Late fusion combines classification outcomes from each modality after individual classification and reclassifies the combined results. This paper compares early and late fusion models in meme analysis. It showcases their efficacy in extracting meme concepts and classifying meme reasoning. Pre-trained vision encoders, including ViT and VGG-16, and language encoders such as BERT, AlBERT, and DistilBERT, were employed to extract image and text features. These features were subsequently utilized for performing both early and late fusion techniques. This paper further compares the explainability of fusion models through SHAP analysis. In comprehensive experiments, various classifiers such as XGBoost and Random Forest, along with combinations of different vision and text features across multiple sentiment scenarios, showcased the superior effectiveness of late fusion over early fusion.
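The contrast between the two fusion strategies can be sketched in a few lines: early fusion concatenates modality features before a single classifier, while late fusion classifies each modality separately and then combines the outputs. The stub classifiers, feature dimensions, and simple probability averaging below are illustrative assumptions, not the paper's ViT/BERT/XGBoost pipelines.

```python
import numpy as np

def early_fusion(img_feat, txt_feat, clf):
    """Early fusion: concatenate modality features, then classify once."""
    return clf(np.concatenate([img_feat, txt_feat]))

def late_fusion(img_feat, txt_feat, img_clf, txt_clf):
    """Late fusion: classify each modality separately, then combine the outputs."""
    p_img, p_txt = img_clf(img_feat), txt_clf(txt_feat)
    return (p_img + p_txt) / 2.0   # simple average; a meta-classifier also works

# Toy usage with stub classifiers that return class probabilities.
rng = np.random.default_rng(0)
img, txt = rng.random(512), rng.random(768)      # e.g. vision and text features
stub = lambda x: np.array([x.mean(), 1 - x.mean()])
p_early = early_fusion(img, txt, stub)
p_late = late_fusion(img, txt, stub, stub)
```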
Our approach to automatically summarizing online mental health posts could help counselors by reducing their reading time, enabling quicker and more effective support for individuals seeking mental health assistance. Neural text summarization methods demonstrate promising performance owing to their strong pre-training procedures. However, existing pre-trained language models often rely on random token/span masking, an approach that overlooks the importance of content when learning word representations. To rectify this, we propose using source-summary alignments as a saliency signal to enhance the pre-training strategy of the language model for better representation learning of important content, paving the way for a positive impact on the model fine-tuning phase. Our experiments on MentSum, a mental health-related dataset for user post summarization, reveal improved performance, as evidenced by human evaluation metrics, surpassing the current state-of-the-art system.
The social restrictions, disruptions in daily activities, and psychological stressors arising from the COVID-19 pandemic constitute a psychological burden for people worldwide, which can be especially detrimental for individuals with mental disorders like Eating Disorders (ED). In this research, we aim to comprehend how COVID-19 has affected individuals with eating disorders through a comparative analysis of data obtained from online communities. We collected data spanning two years before and after the declaration of the pandemic from the subreddits r/AnorexiaNervosa, r/BingeEatingDisorder, and r/EatingDisorders. The research presents multi-faceted tasks in which we analyze the content of each subreddit by combining topic modeling, social network analysis, and time series modeling, for a better understanding of these communities at both the content and network levels. Through a comparative analysis, we address the changes in discussion topics based on users' content and determine how COVID-19 led to changes in communication patterns within the communities. Finally, we apply time series models such as ARIMA, Prophet, LSTM, and Transformer to daily post and comment counts to forecast users' activities within the subreddits and compare the performance of these time series models. The findings indicate that both the content of users' discussions and the level of communication and online support-seeking related to eating disorders on Reddit underwent significant changes during the pandemic. The data used in this study is available on GitHub: https://github.com/alamincse32/Reddit-Data-for-Eating-Disoder-Community-During-Covid-Pandemic
The DESERE Workshop, our First Workshop on Decentralised Search and Recommendation, offers a platform for researchers to explore and share innovative ideas on decentralised web services, mainly focusing on three major topics: (i) societal impact of decentralised systems: their effects on privacy, policy, and regulation; (ii) decentralising applications: algorithmic and performance challenges that arise from decentralisation; and (iii) infrastructure to support decentralised systems and services: peer-to-peer networks, routing, and performance evaluation tools.
Due to the economic and societal problems being caused by the Web's growing centralization, there is an increasing interest in decentralizing data on the Web. This decentralization does, however, cause a number of technical challenges. If we want to give users in decentralized environments the same level of user experience as they are used to with centralized applications, we need solutions to these challenges. We discuss how query engines can act as a layer between applications on the one hand and decentralized environments on the other; query engines thereby serve as an abstraction layer that hides the complexities of decentralized data management from application developers. In this article, we outline the requirements for query engines over decentralized environments. Furthermore, we show how existing approaches meet these requirements and which challenges remain. As such, this article offers a high-level overview of a roadmap in the query and decentralization research domains.
Federated Learning (FL) and the Social Linked Data (Solid, https://solidproject.org/) framework represent decentralized approaches to machine learning and web development, respectively, with a focus on preserving privacy. Federated learning enables the distributed training of machine learning models across datasets partitioned across multiple clients, whereas applications developed with the Solid approach store data in Personal Online Data Stores (pods) under the control of individual users. This paper discusses the merits and challenges of executing Federated Learning on Solid pods and the readiness of the Solid server architecture to support this. We aim to detail these challenges, in addition to identifying avenues for further work to fully harness the benefits of Federated Learning in Solid environments, where users retain sovereignty over their data.
The rise of generative models has driven significant advancements in recommender systems, leaving unique opportunities for enhancing users' personalized recommendations. This workshop serves as a platform for researchers to explore and exchange innovative concepts related to the integration of generative models into recommender systems. It primarily focuses on five key perspectives: (i) improving recommender algorithms, (ii) generating personalized content, (iii) evolving the user-system interaction paradigm, (iv) enhancing trustworthiness checks, and (v) refining evaluation methodologies for generative recommendations. With generative models advancing rapidly, an increasing body of research is emerging in these domains, underscoring the timeliness and critical importance of this workshop. The related research will introduce innovative technologies to recommender systems and contribute to fresh challenges in both academia and industry. In the long term, this research direction has the potential to revolutionize the traditional recommender paradigms and foster the development of next-generation recommender systems.
Sequential recommendation tasks often face performance bottlenecks, mainly in two aspects: (i) previous research relies on a single item embedding distribution, which limits overall modeling ability, and (ii) the implicit dynamic preferences reflected in user interaction sequences are not distinguished, so the feature representations are insufficient. To address these issues, we propose a novel model called Diffusion Recommendation with Implicit Sequence Influence (DiffRIS). Specifically, we establish an implicit feature extraction module comprising multi-scale CNN and residual LSTM networks that learn local and global features of the sequence, respectively, to capture length-dependent patterns in the data. Subsequently, we use the output of this module as a conditional input to the diffusion model, guiding the denoising process based on historical interactions. Through experiments on two open-source datasets, we find that implicit sequence features have a positive impact on the diffusion process. The proposed DiffRIS framework performs well compared to multiple baseline models, effectively improving the accuracy of sequential recommendation. We believe that DiffRIS can provide useful research directions for diffusion-based sequential recommendation.
Conversational Recommender System (CRS) interacts with users through natural language to understand their preferences and provide personalized recommendations in real-time. CRS has demonstrated significant potential, prompting researchers to focus on developing more realistic and reliable user simulators. Recently, the capabilities of Large Language Models (LLMs) have attracted a lot of attention in various fields, and efforts are underway to construct user simulators based on LLMs. While these works showcase innovation, they also come with certain limitations that require attention. In this work, we aim to analyze the limitations of using LLMs in constructing user simulators for CRS, to guide future research. To achieve this goal, we conduct analytical validation on the notable work, iEvaLM. Through multiple experiments on two widely-used datasets in the field of conversational recommendation, we highlight several issues with the current evaluation methods for user simulators based on LLMs: (1) Data leakage, which occurs in conversational history and the user simulator's replies, results in inflated evaluation results. (2) The success of CRS recommendations depends more on the availability and quality of conversational history than on the responses from user simulators. (3) Controlling the output of the user simulator through a single prompt template proves challenging. To overcome these limitations, we propose SimpleUserSim, employing a straightforward strategy to guide the topic toward the target items. Our study validates the ability of CRS models to utilize the interaction information, significantly improving the recommendation results.
Multimodal recommendation aims to model the feature distributions of items by using their multi-modal information. Prior efforts typically focus on denoising the user-item graph with a degree-sensitive strategy, which may not handle users' consistent preferences across modalities well. More importantly, existing methods may learn ill-posed item embeddings because they focus on a specific auxiliary optimization task for multimodal representations rather than explicitly modeling them. This paper therefore presents a solution that takes advantage of the explicit uncertainty injection ability of the Diffusion Model (DM) for the modeling and fusion of multi-modal information. Specifically, we propose a novel Multimodal Conditioned Diffusion Model for Recommendation (MCDRec), which tailors DM with two technical modules to model high-order multimodal knowledge. The first module is multimodal-conditioned representation diffusion (MRD), which integrates pre-extracted multimodal knowledge into item representation modeling via a tailored DM. This smoothly bridges the gap between multi-modal content features and collaborative signals. Second, with the diffusion-guided graph denoising (DGD) module, MCDRec effectively denoises the user-item graph by filtering occasional interactions in users' historical behaviors. This is achieved with the power of DM in aligning users' collaborative preferences with the content information of their shared items. Extensive experiments against several SOTA baselines on two real-world datasets demonstrate the effectiveness of MCDRec. The visualization results also reveal the potential of MRD to precisely handle the high-order correlations between user embeddings and the multi-modal heterogeneous representations of items.
The third iteration of the half-day workshop on Cryptoasset Analytics continues to offer a forum for researchers from diverse fields to share their latest discoveries in the domain of cryptoassets. This workshop retains its importance to the Web research community, motivated by how fundamental concepts of cryptoassets are progressively melding with Web technologies on a technical level, and through the observation that the evolution of socio-technical cryptoasset ecosystems are intricately linked to the Web.
The program features a mix of invited talks alongside a carefully curated selection of peer-reviewed contributions. Workshop topics encompass a variety of themes, including empirical analyses, statistical methodologies, privacy measures, and architectural and regulatory insights.
The integration of bots in Distributed Ledger Technologies (DLTs) fosters efficiency and automation. However, their use is also associated with predatory trading and market manipulation, and can pose threats to system integrity. It is therefore essential to understand the extent of bot deployment in DLTs; despite this, current detection systems are predominantly rule-based and lack flexibility. In this study, we present a novel approach that utilizes machine learning for the detection of financial bots on the Ethereum platform. First, we systematize existing scientific literature and collect anecdotal evidence to establish a taxonomy for financial bots, comprising 7 categories and 24 subcategories. Next, we create a ground-truth dataset consisting of 133 human and 137 bot addresses. Third, we employ both unsupervised and supervised machine learning algorithms to detect bots deployed on Ethereum. The highest-performing clustering algorithm is a Gaussian Mixture Model with an average cluster purity of 82.6%, while the highest-performing model for binary classification is a Random Forest with an accuracy of 83%. Our machine learning-based detection mechanism contributes to understanding the Ethereum ecosystem dynamics by providing additional insights into the current bot landscape.
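A minimal sketch of the two modeling steps described above follows: unsupervised clustering of address features with a Gaussian Mixture Model and supervised human-vs-bot classification with a Random Forest, using scikit-learn. The synthetic three-dimensional features and class sizes are illustrative assumptions; the paper's taxonomy-driven features and exact settings are not reproduced here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.ensemble import RandomForestClassifier

# Toy address features, e.g. tx frequency, mean gas price, active hours per day.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (133, 3)),      # human-like addresses
               rng.normal(3, 1, (137, 3))])     # bot-like addresses
y = np.array([0] * 133 + [1] * 137)

# Unsupervised: cluster addresses without using the labels.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
clusters = gmm.predict(X)

# Supervised: binary human-vs-bot classification on the ground-truth labels.
rf = RandomForestClassifier(random_state=0).fit(X, y)
bot_probability = rf.predict_proba(X)[:, 1]
```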
The security of blockchain systems depends on the distribution of mining power across participants. If sufficient mining power is controlled by one entity, they can force their own version of events. This may allow them to double spend coins, for example. For Proof of Work (PoW) blockchains, however, the distribution of mining power cannot be read directly from the blockchain and must instead be inferred from the number of blocks mined in a specific sample window. We introduce a framework to quantify this statistical uncertainty for the Nakamoto coefficient, which is a commonly-used measure of blockchain decentralization. We show that aggregating blocks over a day can lead to considerable uncertainty, with Bitcoin failing more than half the hypothesis tests (α=0.05) when using a daily granularity. For these reasons, we recommend that blocks are aggregated over a sample window of at least 7 days. Instead of reporting a single value, our approach produces a range of possible Nakamoto coefficient values that have statistical support at a particular significance level α.
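To make the measure concrete, the sketch below computes the Nakamoto coefficient from per-entity block counts (the smallest number of entities jointly controlling more than half the blocks) and derives a range of plausible values by resampling the sample window. The multinomial bootstrap is an illustrative assumption and not necessarily the paper's statistical framework; the block counts are hypothetical.

```python
import numpy as np

def nakamoto_coefficient(blocks_mined):
    """Smallest number of entities that jointly control more than half the blocks."""
    shares = np.sort(np.asarray(blocks_mined))[::-1] / np.sum(blocks_mined)
    return int(np.argmax(np.cumsum(shares) > 0.5) + 1)

def coefficient_range(blocks_mined, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap a range of plausible coefficients from a finite sample window."""
    rng = np.random.default_rng(seed)
    total = int(np.sum(blocks_mined))
    p = np.asarray(blocks_mined) / total
    draws = rng.multinomial(total, p, size=n_boot)
    coeffs = np.array([nakamoto_coefficient(d) for d in draws])
    return np.quantile(coeffs, [alpha / 2, 1 - alpha / 2])

# Toy usage: blocks mined per pool in a 7-day window.
print(nakamoto_coefficient([300, 250, 200, 150, 100]))   # -> 2
print(coefficient_range([300, 250, 200, 150, 100]))
```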
This paper presents the results of a comprehensive empirical study of losses to arbitrageurs (following the formalization of loss-versus-rebalancing, or LVR, by [Milionis et al., 2022]) incurred by liquidity providers on automated market makers (AMMs). Through a systematic comparison between historical returns from trading fees and losses to arbitrageurs, our findings indicate that fees insufficiently compensate for arbitrage losses across many of the largest AMM liquidity pools (on Uniswap). Remarkably, we identify higher profitability among less capital-efficient Uniswap v2 pools compared to their Uniswap v3 counterparts.
Moreover, we investigate one possible LVR mitigation by quantifying how arbitrage losses decrease with shorter block times. We observe notable variations in how arbitrage losses decline across different trading pairs. For instance, when comparing 100ms block times to Ethereum's current 12-second block times, the decrease in losses to arbitrageurs ranges between 20% and 70%, depending on the specific trading pair.
Stealth addresses are a privacy-enhancing technology that provides recipient anonymity on blockchains. In this work, we investigate the recipient anonymity and unlinkability guarantees of Umbra, the most widely used implementation of the stealth address scheme on Ethereum, and its three off-chain scalability solutions, i.e., Arbitrum, Optimism, and Polygon. Specifically, we define and evaluate four heuristics to uncover the real recipients of stealth payments. We find that for the majority of Umbra payments, it is straightforward to establish the recipient, hence nullifying the benefits of using Umbra. In particular, we identify the real recipient of 48.5%, 25.8%, 65.7%, and 52.6% of all Umbra transactions on the Ethereum main net, Polygon, Arbitrum, and Optimism networks, respectively. Finally, we suggest easily implementable countermeasures to evade our deanonymization and linking attacks.
The increasing number of distinct blockchains has led to a growing need for data exchange and asset transfer across various isolated blockchains. To address this, cross-chain bridges have emerged as a critical mechanism for enabling interoperability and facilitating data and asset exchange across diverse blockchains. Among these bridges, the Layer-0 bridge stands out as a scalability solution that enhances blockchain performance at the foundational layer of data transition, without altering the blockchain's structure. Stargate is a notable Layer-0 Lock-and-Unlock cross-chain bridge that supports transactions across various EVM-based blockchains, with the highest Total Value Locked (TVL) among cross-chain bridges of the same kind. While previous cross-chain research has primarily focused on Layer-2 bridges, this study specifically examines Stargate and analyzes its dynamics as well as potential vulnerabilities. We collect transaction data of Stargate on six blockchains including Ethereum, Polygon, Binance Smart Chain, Avalanche, Arbitrum and Optimism. Our findings reveal the transaction patterns and evidence of exploitations of Stargate by investigating its transaction dynamics over time.
Learning on graphs (LOG) has a profound impact on various high-impact domains, such as information retrieval, social network analysis, computational chemistry and transportation. Despite decades of theoretical development, algorithmic advancements, and open-source systems that answer what the optimal learning results are, concerns about the trustworthiness of state-of-the-art LOG techniques have emerged in practical applications. Consequently, crucial research questions arise: why are LOG techniques untrustworthy with respect to critical social aspects like fairness, transparency, privacy, and security? How can we ensure the trustworthiness of learning algorithms on graphs? To address the increasingly important safety and ethical challenges in learning on graphs, it is essential to achieve a paradigm shift from solely addressing what questions to understanding how and why questions. Building upon the success of the first TrustLOG workshop in 2022, the second TrustLOG workshop aims to bring together researchers and practitioners to present, discuss, and advance cutting-edge research in the realm of trustworthy learning on graphs. The workshop serves as a platform to stimulate the TrustLOG community, fostering the identification of new research challenges, and shedding light on potential future directions.
Foundation models, such as GPT-4 for natural language processing (NLP) and Flamingo for computer vision (CV), have set new benchmarks in AI by delivering state-of-the-art results across various tasks with minimal task-specific data. Despite their success, the application of these models to the graph domain is challenging due to the relational nature of graph-structured data. To address this gap, we propose the Graph Foundation Model (GFM) Workshop, the first workshop for GFMs, dedicated to exploring the adaptation and development of foundation models specifically designed for graph data. The GFM workshop focuses on two critical questions: (1) How can the underlying capabilities of existing foundation models be effectively applied to graph data? (2) What foundational principles should guide the creation of models tailored to the graph domain? Through a curated set of panel sections, keynote talks, and paper presentations, our workshop intends to catalyze innovative approaches and theoretical frameworks for Graph Foundation Models (GFMs). We target a broad audience, encompassing researchers, practitioners, and students, and aim to lay the groundwork for the next wave of breakthroughs in integrating graph data with foundation models.
A variety of knowledge graph embedding approaches have been developed. Most of them obtain embeddings by learning the structure of the knowledge graph within a link prediction setting. As a result, the embeddings reflect only the structure of a single knowledge graph, and embeddings for different knowledge graphs are not aligned, e.g., they cannot be used to find similar entities across knowledge graphs via nearest neighbor search. However, knowledge graph embedding applications such as entity disambiguation require a more global representation, i.e., a representation that is valid across multiple sources. We propose to learn universal knowledge graph embeddings from large-scale interlinked knowledge sources. To this end, we fuse large knowledge graphs based on the owl:sameAs relation such that every entity is represented by a unique identity. We instantiate our idea by computing universal embeddings based on DBpedia and Wikidata yielding embeddings for about 180 million entities, 15 thousand relations, and 1.2 billion triples. We believe our computed embeddings will support the emerging field of graph foundation models. Moreover, we develop a convenient API to provide embeddings as a service. Experiments on link prediction suggest that universal knowledge graph embeddings encode better semantics compared to embeddings computed on a single knowledge graph. For reproducibility purposes, we provide our source code and datasets open access.
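The fusion step described above, collapsing entities linked by owl:sameAs into a single identity before training embeddings, can be sketched with a small union-find structure, as below. The toy triples, prefixes, and the choice of canonical representative are illustrative assumptions and not the paper's DBpedia/Wikidata processing pipeline.

```python
class UnionFind:
    """Merge entities connected by owl:sameAs into a single identity."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def fuse_triples(triples, same_as_links):
    """Rewrite every subject/object to its canonical identity before training."""
    uf = UnionFind()
    for a, b in same_as_links:
        uf.union(a, b)
    return [(uf.find(s), p, uf.find(o)) for s, p, o in triples]

# Toy usage: the DBpedia and Wikidata entities for Berlin become one node.
triples = [("dbr:Berlin", "capitalOf", "dbr:Germany"),
           ("wd:Q64", "population", "3_600_000")]
fused = fuse_triples(triples, [("dbr:Berlin", "wd:Q64")])
```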
In the realm of personalization, integrating diverse information sources such as consumption signals and content-based representations is becoming increasingly critical to build state-of-the-art solutions. In this regard, two of the biggest trends in research around this subject are Graph Neural Networks (GNNs) and Foundation Models (FM). While GNNs emerged as a popular solution in industry for powering personalization at scale, FMs have only recently caught attention for their promising performance in personalization tasks like ranking and retrieval. In this paper, we present a graph-based foundation modeling approach tailored to personalization. Central to this approach is a Heterogeneous GNN (HGNN) designed to capture multi-hop content and consumption relationships across a range of recommendable item types. To ensure the generality required from a Foundation Model, we employ a Large Language Model (LLM) text-based featurization of nodes that accommodates all item types, and construct the graph using co-interaction signals, which inherently transcend content specificity. To facilitate practical generalization, we further couple the HGNN with an adaptation mechanism based on a two-tower (2T) architecture, which also operates agnostically to content type. This multi-stage approach ensures high scalability; while the HGNN produces general-purpose embeddings, the 2T component models the sheer volume of user-item interaction data in a continuous space. Our comprehensive approach has been rigorously tested and proven effective in delivering recommendations across a diverse array of products within a real-world, industrial audio streaming platform.
The first international workshop on multimedia content analysis for social good (MM4SG) was held in conjunction with the Web Conference 2024. The workshop aimed to address the challenge of effectively analyzing and moderating multimodal content across digital platforms. In an era where multimodal data including memes, text-embedded images, and fabricated content swiftly capture the public's attention and influence societal discourse, the need for advanced content moderation strategies is more pressing than ever. This workshop serves as a platform for research and collaboration between experts in natural language processing, machine learning, computational social science, and ethics. In this paper, we describe the inaugural edition of the MM4SG workshop. We also include the future directions for our workshop's upcoming editions.
The emergence of Large Language Models (LLMs) has marked a substantial advancement in Natural Language Processing (NLP), contributing significantly to enhanced task performance both within and outside specific domains. However, amidst these achievements, three key questions remain unanswered: 1) The mechanism through which LLMs accomplish their tasks and their limitations, 2) Effectively harnessing the power of LLMs across diverse domains, and 3) Strategies for enhancing the performance of LLMs. This talk aims to delve into our research group's endeavors to address these pivotal questions. Firstly, I will outline our approach, which involves utilizing ontology-guided prompt perturbations to unravel the primary limitations of LLMs in solving mathematical problems. Moving on to the second question, we will explore the utilization of synthetic data generated by LLMs to bolster challenging downstream tasks, particularly focusing on structured prediction where LLMs face persistent challenges. I will elaborate on our initiatives aimed at improving LLMs by incorporating highly effective retrieval strategies, specifically addressing the prevalent challenge of hallucinations that often plagues contemporary LLMs. Finally, I will present a technique on LLM realignment to restore safety lost during fine-tuning.
Do longitudinal studies reveal a skewed gender distribution among newborn babies depicted in Bollywood movies? Who dominates the speaking time in political conversations on 24x7 news networks in the United States-men or women? How does Twitter discourse on gender equality evolve when a woman dies in police custody in Iran after being arrested (reportedly) due to improper headscarf-wearing? What is the representation of women in divorce court proceedings in India? This broad talk, where cutting-edge AI intersects with social science research questions, encompasses a diverse array of studies that unveil gender bias in various forms. In this presentation, I will describe the substantive findings, social impact, methodological challenges, scope for multimodal investigations, and the novelties entailed in this research. I will conclude the talk with our findings on worrisome gender bias in several large language models.
War brings about strong feelings of hate, but also showcases the love of humanity. During war, social media is used for citizen journalism, supply organization, and activism, but also for people to express their emotions. In this paper, we present a multi-modal, multi-platform dataset that depicts the expression of love and hate towards both sides of the Israel-Hamas War. The dataset comprises English-language posts from Facebook and Instagram that contain the terms "love" or "hate" in the context of the war, from 7 October 2023 (the onset of the war) to 31 December 2023. We find that over time, the number of posts on the war decreased, suggesting that interest in the war has waned; posts about Love reference religion while posts about Hate reference hostility; and emojis in Love posts represent hearts, peace, and listening, while emojis in Hate posts represent being watched, sadness, and warning. Finally, we generated Instagram posts with GPT-4V using our dataset as a reference, and the model returned generic love messages accompanied by artistic images. We hope our dataset is useful to researchers studying multi-modal and multi-platform information and emotions on social media during a war.
Traditionally, sentiment analysis methods rely solely on text or image data. However, most user-generated social media posts include both text and images. In this study, we propose a novel Dual-Pipeline based Attentional method that uses multiple modalities of data, including text and images, to analyse and interpret the emotions and sentiments expressed in tweets. Our proposed method simultaneously extracts meaningful local and global contextual features from multiple modalities. Local fusion layers within each pipeline combine modality-specific features using an attention mechanism to enrich the joint multimodal representation. A global fusion layer consolidates the collective sentiment representation by intermixing the outputs of both pipelines. We evaluate our proposed method using performance metrics such as accuracy and F1-score. Through extensive experimentation on the MVSA dataset, our method demonstrates superior performance compared to state-of-the-art techniques in identifying the sentiment conveyed in social media data.
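The following sketch is one plausible reading of the dual-pipeline design: attention-based local fusion of text and image features inside each pipeline, followed by a global fusion layer over both pipeline outputs. Layer sizes, the exact attention wiring, and the placeholder features are assumptions for illustration, not the paper's implementation.

```python
# Illustrative PyTorch sketch of dual-pipeline attentional fusion.
import torch
import torch.nn as nn

class LocalFusion(nn.Module):
    """Attention-weighted combination of text and image features within one pipeline."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim, 1)

    def forward(self, text_feat, image_feat):
        stacked = torch.stack([text_feat, image_feat], dim=1)   # (B, 2, D)
        weights = torch.softmax(self.attn(stacked), dim=1)      # per-modality attention weights
        return (weights * stacked).sum(dim=1)                   # (B, D) joint representation

class DualPipelineSentiment(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 3):
        super().__init__()
        self.local_fusion_a = LocalFusion(dim)   # e.g. local-context pipeline
        self.local_fusion_b = LocalFusion(dim)   # e.g. global-context pipeline
        self.global_fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                           nn.Linear(dim, num_classes))

    def forward(self, text_a, image_a, text_b, image_b):
        joint_a = self.local_fusion_a(text_a, image_a)
        joint_b = self.local_fusion_b(text_b, image_b)
        return self.global_fusion(torch.cat([joint_a, joint_b], dim=-1))

# Toy forward pass with random placeholder features.
B, D = 4, 256
model = DualPipelineSentiment(D)
logits = model(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(logits.shape)  # torch.Size([4, 3])
```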
Multimodal sentiment analysis of social media has attracted growing attention from the research community, as it opens up applications to social issues such as cyberbullying, hate speech, healthcare, politics, and business analysis. At the same time, learning intrinsic representations of multiple modalities and identifying correlated patterns across them remains an open problem. In this work, we present a sentiment analysis framework, a Textual Context guided Vision Transformer with Rotated Multi-Head Attention, that exploits the correlation between image-text pairs and mines rich discriminative features for multimodal sentiment analysis. A novel Rotated Multi-Head Attention mechanism translates the visual or textual embeddings into a distinct feature space, yielding an Adaptively Rotated Refined Embedding (ARREmb). To demonstrate the performance of the proposed approach, extensive experiments are carried out on three publicly available datasets (BG, Twitter, and MVSA-Single) in terms of precision, recall, F1-score, and accuracy. The experiments show the superior performance of the proposed approach through comparison with state-of-the-art methods, followed by an ablation study.
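As a heavily hedged interpretation of the rotation idea, the sketch below rotates one modality's embeddings with a learned orthogonal matrix (the exponential of a skew-symmetric parameter) before standard multi-head attention. This is only one way to realize "adaptively rotated" embeddings and is not the authors' formulation; all names and shapes are illustrative.

```python
# Sketch: rotate image embeddings into a learned feature space, then attend with text.
import torch
import torch.nn as nn

class RotatedMultiHeadAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.skew = nn.Parameter(torch.zeros(dim, dim))          # learnable rotation generator
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def rotation(self):
        a = self.skew - self.skew.t()                            # skew-symmetric => exp(a) is orthogonal
        return torch.linalg.matrix_exp(a)

    def forward(self, text_tokens, image_tokens):
        rotated = image_tokens @ self.rotation()                 # "rotated refined" embedding (sketch)
        out, _ = self.mha(query=text_tokens, key=rotated, value=rotated)
        return out

# Toy usage: 4 samples, 12 text tokens, 16 image patches, 64-dim features.
text = torch.randn(4, 12, 64)
image = torch.randn(4, 16, 64)
print(RotatedMultiHeadAttention(64)(text, image).shape)  # torch.Size([4, 12, 64])
```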
Internet memes have emerged as a novel format for communication and expressing ideas on the web. Their fluidity and creative nature are reflected in their widespread use, often across platforms and occasionally for unethical or harmful purposes. While computational work has already analyzed their high-level virality over time and developed specialized classifiers for hate speech detection, there have been no efforts to date that aim to holistically track, identify, and map internet memes posted on social media. To bridge this gap, we investigate whether internet memes across social media platforms can be contextualized by using a semantic repository of knowledge, namely, a knowledge graph. We collect thousands of potential internet meme posts from two social media platforms, namely Reddit and Discord, and develop an extract-transform-load procedure to create a data lake with candidate meme posts. By using vision transformer-based similarity, we match these candidates against the memes cataloged in IMKG, a recently released knowledge graph of internet memes. We leverage this grounding to highlight the potential of our proposed framework to study the prevalence of memes on different platforms, map them to IMKG, and provide context about memes on social media.
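A minimal sketch of the matching step follows: candidate posts and IMKG-cataloged memes are embedded with a vision transformer, and each candidate is linked to its most similar catalog entry by cosine similarity. The random tensors stand in for real ViT features, and the similarity threshold is an assumed value for illustration.

```python
# Cosine-similarity matching of candidate meme posts against an IMKG catalog.
import torch
import torch.nn.functional as F

def match_candidates(candidate_emb, catalog_emb, threshold=0.8):
    cand = F.normalize(candidate_emb, dim=-1)
    cat = F.normalize(catalog_emb, dim=-1)
    sims = cand @ cat.t()                       # (num_candidates, num_catalog)
    best_sim, best_idx = sims.max(dim=-1)
    matched = best_sim >= threshold             # keep only confident matches
    return best_idx, best_sim, matched

candidates = torch.randn(1000, 768)             # placeholder ViT embeddings of Reddit/Discord posts
imkg_memes = torch.randn(200, 768)              # placeholder ViT embeddings of IMKG memes
idx, sim, matched = match_candidates(candidates, imkg_memes)
print(matched.sum().item(), "candidates matched to IMKG entries")
```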
We introduce an ensemble model approach for multimodal sentiment analysis, focusing on the fusion of textual and video data to enhance the accuracy and depth of emotion interpretation. By integrating three foundational models (IFFSA, BFSA, and TBJE) using advanced ensemble techniques, we achieve a significant improvement in sentiment analysis performance across diverse datasets, including MOSI and MOSEI. Specifically, we propose two novel models, IFFSA and BFSA, which utilise the language models BERT and GPT-2 to extract features from the text modality and ResNet and VGG for the video modality. Our work uniquely contributes to the field by demonstrating the synergistic potential of combining the analytical strengths of different modalities, thereby addressing the intricate challenge of nuanced emotion detection in multimodal contexts. Through comprehensive experiments and an extensive ablation study, we not only validate the superior performance of our ensemble model against current state-of-the-art benchmarks but also reveal critical insights into the model's capability to discern complex emotional states. Our findings underscore the strategic advantage of ensemble methods in multimodal sentiment analysis and set a new precedent for future research in effectively integrating multimodal data sources.
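For illustration, the sketch below shows a simple late-fusion ensemble that combines per-model sentiment predictions by weighted averaging of softmax outputs. Whether the paper uses weighted averaging, stacking, or another combiner is not stated here; the logits and weights are placeholders standing in for IFFSA, BFSA, and TBJE outputs.

```python
# Hedged sketch of a weighted-softmax ensemble over three sentiment models.
import torch

def ensemble_predict(logits_list, weights):
    probs = [torch.softmax(l, dim=-1) for l in logits_list]
    weights = torch.tensor(weights) / sum(weights)              # normalize ensemble weights
    combined = sum(w * p for w, p in zip(weights, probs))       # weighted average of probabilities
    return combined.argmax(dim=-1)

batch, num_classes = 8, 3
iffsa_logits = torch.randn(batch, num_classes)   # placeholder model outputs
bfsa_logits = torch.randn(batch, num_classes)
tbje_logits = torch.randn(batch, num_classes)
preds = ensemble_predict([iffsa_logits, bfsa_logits, tbje_logits], weights=[0.4, 0.3, 0.3])
print(preds)
```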
The growth of interactive and multimedia content on the Internet has made it an essential news source for people worldwide. Social media platforms enable information sharing but also facilitate the spread of fake news, and the dissemination of disinformation on social media has a significant impact on society. Conventional methods for identifying fake news often struggle to analyze the textual, visual, and combined aspects of news shared on social media. Therefore, we propose the Multimodal Approach for Fake News Identification (MuAFaNI), which uses a combined representation of text and images to assess news authenticity as fake or real. MuAFaNI uses the RoBERTa language model for text analysis and ResNet-50 for image analysis. Experiments on two prominent social media datasets, Twitter and Weibo, show that MuAFaNI outperforms state-of-the-art fake news detection techniques in terms of accuracy, precision, recall, and F1 score.
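The sketch below shows the kind of fusion-and-classification head implied by this description: a text representation and an image representation concatenated and fed to a fake/real classifier. Feature extraction is replaced by random placeholders; the dimensions (768 for a RoBERTa [CLS] vector, 2048 for ResNet-50 pooled features) follow the standard model sizes, and the hidden size is an assumption.

```python
# Sketch of a multimodal fake-news classification head over pre-extracted features.
import torch
import torch.nn as nn

class FakeNewsFusionHead(nn.Module):
    def __init__(self, text_dim: int = 768, image_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 2),            # fake vs. real
        )

    def forward(self, text_feat, image_feat):
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))

head = FakeNewsFusionHead()
text_feat = torch.randn(16, 768)             # placeholder RoBERTa [CLS] features
image_feat = torch.randn(16, 2048)           # placeholder ResNet-50 pooled features
print(head(text_feat, image_feat).shape)     # torch.Size([16, 2])
```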
During the conflict between Ukraine and Russia, hate speech targeted toward specific groups was widespread on different social media platforms. With most social platforms allowing multimodal content, the use of multimodal content to express hate speech is widespread on the Internet. Although there has been considerable research in detecting hate speech within unimodal content, the investigation into multimodal content remains insufficient. The limited availability of annotated multimodal datasets further restricts our ability to explore new methods to interpret and identify hate speech and its targets. Annotated datasets for hate speech detection during political events, such as invasions, are even more limited. To fill this gap, we introduce a comprehensive multimodal dataset consisting of 20,675 posts related to the Russia-Ukraine crisis, which were manually annotated as either 'Hate Speech' or 'No Hate Speech'. Additionally, we categorize the hate speech data into three targets: 'Individual', 'Organization', and 'Community'. Our benchmark evaluations show that there is still room for improvement in accurately identifying hate speech and its targets. We hope that the availability of this dataset and the evaluations performed on it will encourage the development of new methods for identifying hate speech and its targets during political events such as invasions and wars. The dataset and resources are made available at https://github.com/Farhan-jafri/Russia-Ukraine.
In today's digital era, memes have become a popular means of communication that often reflect societal attitudes as well as prejudices. Misogynous memes are a form of meme that explicitly discriminates against women in various ways, such as shaming or stereotyping. This research aims to identify misogynous memes through deep learning-based multimodal analysis and to determine which modality, text or image, plays the more significant role in fairness considerations. To achieve this, we utilized the GOAT-benchmarks dataset, which comprises over 6,000 diverse memes covering topics like implicit hate speech, sexism, and cyberbullying. Furthermore, we evaluated the fairness of these models by assessing their performance across different demographic groups. Our findings reveal that while both text and image modalities contribute to identifying misogynous memes, text plays the more significant role in misogyny identification, whereas the image modality contributes further in terms of fairness. This study emphasizes the importance of multimodal analysis in recognizing and mitigating biases in online content.
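As a small illustration of the per-group fairness check described above, the sketch below computes accuracy separately for each demographic group and reports the largest gap. The labels, predictions, and group names are synthetic placeholders, not values from the study.

```python
# Per-group accuracy and accuracy gap as a simple fairness diagnostic.
import numpy as np

def group_accuracies(y_true, y_pred, groups):
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = float((y_true[mask] == y_pred[mask]).mean())
    return accs

y_true = np.random.randint(0, 2, size=200)               # placeholder labels (misogynous or not)
y_pred = np.random.randint(0, 2, size=200)               # placeholder model predictions
groups = np.random.choice(["group_a", "group_b"], 200)   # placeholder demographic groups
accs = group_accuracies(y_true, y_pred, groups)
print(accs, "max gap:", max(accs.values()) - min(accs.values()))
```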
Disclaimer: This paper contains content that may be disturbing to some readers.