Multimedia

Authors and titles for recent submissions

  • Fri, 1 Aug 2025
  • Thu, 31 Jul 2025
  • Wed, 30 Jul 2025
  • Tue, 29 Jul 2025
  • Mon, 28 Jul 2025

Total of 36 entries

Fri, 1 Aug 2025 (showing 3 of 3 entries)

[1] arXiv:2507.23444 [pdf, html, other]
Title: Hybrid CNN-Mamba Enhancement Network for Robust Multimodal Sentiment Analysis
Xiang Li, Xianfu Cheng, Xiaoming Zhang, Zhoujun Li
Subjects: Multimedia (cs.MM)
[2] arXiv:2507.23779 (cross-list from cs.CV) [pdf, html, other]
Title: Phi-Ground Tech Report: Advancing Perception in GUI Grounding
Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, Chong Luo, Tianyi Chen, Justin Wagle, Tim Franklin, Baining Guo
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[3] arXiv:2507.23042 (cross-list from cs.CV) [pdf, html, other]
Title: Early Goal-Guided Multi-Scale Fusion for Real-Time Vision-Language Driving
Santosh Patapati, Trisanth Srinivasan
Comments: 6 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Multimedia (cs.MM); Robotics (cs.RO)

Thu, 31 Jul 2025 (showing 5 of 5 entries)

[4] arXiv:2507.22731 [pdf, html, other]
Title: GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation
Quanwei Yang, Luying Huang, Kaisiyuan Wang, Jiazhi Guan, Shengyi He, Fengguo Li, Hang Zhou, Lingyun Yu, Yingying Li, Haocheng Feng, Hongtao Xie
Comments: 10 pages, 5 figures, Accepted by ICCV 2025
Subjects: Multimedia (cs.MM)
[5] arXiv:2507.22676 (cross-list from cs.CL) [pdf, html, other]
Title: Listening to the Unspoken: Exploring 365 Aspects of Multimodal Interview Performance Assessment
Jia Li, Yang Wang, Wenhao Qian, Zhenzhen Hu, Richang Hong, Meng Wang
Comments: 8 pages, 4 figures, ACM MM 2025. GitHub: this https URL
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
[6] arXiv:2507.22481 (cross-list from eess.IV) [pdf, html, other]
Title: Towards Blind Bitstream-corrupted Video Recovery via a Visual Foundation Model-driven Framework
Tianyi Liu, Kejun Wu, Chen Cai, Yi Wang, Kim-Hui Yap, Lap-Pui Chau
Comments: 10 pages, 5 figures, accepted by ACMMM 2025
Subjects: Image and Video Processing (eess.IV); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[7] arXiv:2507.22367 (cross-list from cs.CL) [pdf, html, other]
Title: Traits Run Deep: Enhancing Personality Assessment via Psychology-Guided LLM Representations and Multimodal Apparent Behaviors
Jia Li, Yichao He, Jiacheng Xu, Tianhao Luo, Zhenzhen Hu, Richang Hong, Meng Wang
Comments: 8 pages, 3 figures, ACM MM 2025
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)
[8] arXiv:2507.22099 (cross-list from cs.CV) [pdf, html, other]
Title: Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?
Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM); Software Engineering (cs.SE)

Wed, 30 Jul 2025 (showing 7 of 7 entries)

[9] arXiv:2507.21926 [pdf, other]
Title: Efficient Sub-pixel Motion Compensation in Learned Video Codecs
Théo Ladune, Thomas Leguay, Pierrick Philippe, Gordon Clare, Félix Henry
Subjects: Multimedia (cs.MM); Image and Video Processing (eess.IV)
[10] arXiv:2507.21557 [pdf, html, other]
Title: PC-JND: Subjective Study and Dataset on Just Noticeable Difference for Point Clouds in 6DoF Virtual Reality
Chunling Fan, Yun Zhang, Dietmar Saupe, Raouf Hamzaoui, Weisi Lin
Comments: 13 pages, 10 figures, Journal
Subjects: Multimedia (cs.MM)
[11] arXiv:2507.21395 [pdf, html, other]
Title: Sync-TVA: A Graph-Attention Framework for Multimodal Emotion Recognition with Cross-Modal Fusion
Zeyu Deng, Yanhui Lu, Jiashu Liao, Shuang Wu, Chongfeng Wei
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[12] arXiv:2507.21741 (cross-list from cs.CV) [pdf, html, other]
Title: MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
Shaojun E, Yuchen Yang, Jiaheng Wu, Yan Zhang, Tiejun Zhao, Ziyan Chen
Comments: 9 pages
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[13] arXiv:2507.21507 (cross-list from cs.CV) [pdf, html, other]
Title: VAGU & GtS: LLM-Based Benchmark and Framework for Joint Video Anomaly Grounding and Understanding
Shibo Gao, Peipei Yang, Yangyang Liu, Yi Chen, Han Zhu, Xuyao Zhang, Linlin Huang
Comments: 21 pages, 19 figures, 8 tables
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[14] arXiv:2507.21195 (cross-list from cs.CR) [pdf, html, other]
Title: MaXsive: High-Capacity and Robust Training-Free Generative Image Watermarking in Diffusion Models
Po-Yuan Mao, Cheng-Chang Tsai, Chun-Shien Lu
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[15] arXiv:2507.20177 (cross-list from cs.CV) [pdf, html, other]
Title: Towards Universal Modal Tracking with Online Dense Temporal Token Learning
Yaozong Zheng, Bineng Zhong, Qihua Liang, Shengping Zhang, Guorong Li, Xianxian Li, Rongrong Ji
Comments: arXiv admin note: text overlap with arXiv:2401.01686
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Tue, 29 Jul 2025 (showing 13 of 13 entries)

[16] arXiv:2507.20738 [pdf, other]
Title: Dark Side of Modalities: Reinforced Multimodal Distillation for Multimodal Knowledge Graph Reasoning
Yu Zhao, Ying Zhang, Xuhui Sui, Baohang Zhou, Haoze Zhu, Jeff Z. Pan, Xiaojie Yuan
Comments: Accepted by ACM MM 2025
Subjects: Multimedia (cs.MM)
[17] arXiv:2507.20627 [pdf, other]
Title: Controllable Video-to-Music Generation with Multiple Time-Varying Conditions
Junxian Wu, Weitao You, Heda Zuo, Dengming Zhang, Pei Chen, Lingyun Sun
Comments: Accepted by the 33rd ACM International Conference on Multimedia (ACMMM 2025). The project page is available at this https URL
Subjects: Multimedia (cs.MM); Artificial Intelligence (cs.AI); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[18] arXiv:2507.19863 [pdf, html, other]
Title: Anchoring Trends: Mitigating Social Media Popularity Prediction Drift via Feature Clustering and Expansion
Chia-Ming Lee, Bo-Cheng Qiu, Cheng-Jun Kang, Yi-Hsuan Wu, Jun-Lin Chen, Yu-Fan Lin, Yi-Shiuan Chou, Chih-Chung Hsu
Comments: Accepted by ACM Multimedia 2025
Subjects: Multimedia (cs.MM)
[19] arXiv:2507.20900 (cross-list from cs.SD) [pdf, html, other]
Title: Music Arena: Live Evaluation for Text-to-Music
Yonghyun Kim, Wayne Chi, Anastasios N. Angelopoulos, Wei-Lin Chiang, Koichi Saito, Shinji Watanabe, Yuki Mitsufuji, Chris Donahue
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[20] arXiv:2507.20745 (cross-list from cs.CV) [pdf, other]
Title: Regularizing Subspace Redundancy of Low-Rank Adaptation
Yue Zhu, Haiwen Diao, Shang Gao, Jiazuo Yu, Jiawen Zhu, Yunzhi Zhuge, Shuai Hao, Xu Jia, Lu Zhang, Ying Zhang, Huchuan Lu
Comments: 10 pages, 4 figures, Accepted by ACMMM2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Multimedia (cs.MM)
[21] arXiv:2507.20730 (cross-list from cs.HC) [pdf, other]
Title: Vocalize: Lead Acquisition and User Engagement through Gamified Voice Competitions
Edvin Teskeredzic, Muamer Paric, Adna Sestic, Petra Fribert, Anamarija Lukac, Hadzem Hadzic, Kemal Altwlkany, Emanuel Lacic
Comments: Accepted to ACM Hypertext 2025
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[22] arXiv:2507.20518 (cross-list from cs.CV) [pdf, other]
Title: T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval
Yili Li, Gang Xiong, Gaopeng Gou, Xiangyan Qu, Jiamin Zhuang, Zhen Li, Junzheng Shi
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[23] arXiv:2507.20368 (cross-list from cs.CV) [pdf, html, other]
Title: MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
Shuolin Xu, Bingyuan Wang, Zeyu Cai, Fangteng Fu, Yue Ma, Tongyi Lee, Hongchuan Yu, Zeyu Wang
Comments: 8 pages, 6 figures
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[24] arXiv:2507.20300 (cross-list from cs.HC) [pdf, html, other]
Title: Talking-to-Build: How LLM-Assisted Interface Shapes Player Performance and Experience in Minecraft
Xin Sun, Lei Wang, Yue Li, Jie Li, Massimo Poesio, Julian Frommel, Koen Hinriks, Jiahuan Pei
Subjects: Human-Computer Interaction (cs.HC); Multimedia (cs.MM)
[25] arXiv:2507.20286 (cross-list from cs.CV) [pdf, html, other]
Title: T$^\text{3}$SVFND: Towards an Evolving Fake News Detector for Emergencies with Test-time Training on Short Video Platforms
Liyuan Zhang, Zeyun Cheng, Yan Yang, Yong Liu, Jinke Ma
Comments: 16 pages, 3 figures, published to DASFAA 2025
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[26] arXiv:2507.19836 (cross-list from cs.GR) [pdf, html, other]
Title: ChoreoMuse: Robust Music-to-Dance Video Generation with Style Transfer and Beat-Adherent Motion
Xuanchen Wang, Heng Wang, Weidong Cai
Comments: 10 pages, 5 figures, accepted by the 33rd ACM International Conference on Multimedia (ACM MM 2025), demo page: this https URL
Subjects: Graphics (cs.GR); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Sound (cs.SD)
[27] arXiv:2507.19835 (cross-list from cs.SD) [pdf, html, other]
Title: SonicGauss: Position-Aware Physical Sound Synthesis for 3D Gaussian Representations
Chunshi Wang, Hongxing Li, Yawei Luo
Comments: Accepted by ACMMM'25
Subjects: Sound (cs.SD); Multimedia (cs.MM)
[28] arXiv:2507.19821 (cross-list from cs.CV) [pdf, html, other]
Title: LAVA: Language Driven Scalable and Versatile Traffic Video Analytics
Yanrui Yu, Tianfei Zhou, Jiaxin Sun, Lianpeng Qiao, Lizhong Ding, Ye Yuan, Guoren Wang
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)

Mon, 28 Jul 2025 (showing 8 of 8 entries)

[29] arXiv:2507.18932 [pdf, html, other]
Title: Benchmarking Multimodal Understanding and Complex Reasoning for ESG Tasks
Lei Zhang, Xin Zhou, Chaoyue He, Di Wang, Yi Wu, Hong Xu, Wei Liu, Chunyan Miao
Subjects: Multimedia (cs.MM); Computation and Language (cs.CL)
[30] arXiv:2507.18750 [pdf, other]
Title: CatchPhrase: EXPrompt-Guided Encoder Adaptation for Audio-to-Image Generation
Hyunwoo Oh, SeungJu Cha, Kwanyoung Lee, Si-Woo Kim, Dong-Jin Kim
Subjects: Multimedia (cs.MM); Sound (cs.SD); Audio and Speech Processing (eess.AS)
[31] arXiv:2507.19225 (cross-list from cs.SD) [pdf, html, other]
Title: Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation
Fang Kang, Yin Cao, Haoyu Chen
Subjects: Sound (cs.SD); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[32] arXiv:2507.19209 (cross-list from cs.CV) [pdf, html, other]
Title: Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet
Xiaoyu Zhang, Zhifeng Bao, Hai Dong, Ziwei Wang, Jiajun Liu
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[33] arXiv:2507.19125 (cross-list from eess.IV) [pdf, html, other]
Title: Learned Image Compression with Hierarchical Progressive Context Modeling
Yuqi Li, Haotian Zhang, Li Li, Dong Liu
Comments: 17 pages, ICCV 2025
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
[34] arXiv:2507.19092 (cross-list from cs.DL) [pdf, html, other]
Title: Comparing OCR Pipelines for Folkloristic Text Digitization
Octavian M. Machidon, Alina L. Machidon
Journal-ref: 4th edition of DigitalHeritage World Congress and Expo 2025
Subjects: Digital Libraries (cs.DL); Multimedia (cs.MM)
[35] arXiv:2507.19037 (cross-list from cs.SD) [pdf, html, other]
Title: MLLM-based Speech Recognition: When and How is Multimodality Beneficial?
Yiwen Guan, Viet Anh Trinh, Vivek Voleti, Jacob Whitehill
Subjects: Sound (cs.SD); Computation and Language (cs.CL); Multimedia (cs.MM); Audio and Speech Processing (eess.AS)
[36] arXiv:2507.18940 (cross-list from cs.CL) [pdf, html, other]
Title: LLaVA-NeuMT: Selective Layer-Neuron Modulation for Efficient Multilingual Multimodal Translation
Jingxuan Wei, Caijun Jia, Qi Chen, Yujun Cai, Linzhuang Sun, Xiangxiang Zhang, Gaowei Wu, Bihui Yu
Subjects: Computation and Language (cs.CL); Multimedia (cs.MM)