AI论文速递 2026-03-09

🖼️ 计算机视觉

1. Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

➡️ 基于Transformer的-Based Inpainting

👤 Leif Van Holland, Domenic Zingsheim, Mana Takhsha

📄 High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.

📄 高质量的多摄像头三维流传输对众多AR/VR应用中的沉浸式体验至关重要。由于实时性限制导致的视角数量有限，常使渲染图像出现信息缺失与表面不完整的问题。现有方法通常依赖简单的启发式策略进行空洞填充，易导致画面不一致或视觉伪影。我们提出一种独立于底层表征、面向应用的新型修复方法，将其作为新视角渲染后的图像后处理步骤，用于补全缺失纹理。该方法设计为兼容任意标定多摄像头系统的独立模块。为此，我们引入基于时空嵌入的多视角感知Transformer网络架构，在保持细节精度的同时确保跨帧一致性。此外，分辨率无关的设计使其能适配不同相机配置，而自适应分块选择策略则平衡推理速度与质量，实现实时性能。我们在同等实时约束条件下与前沿修复技术进行对比评估，结果表明该模型在质量与速度间达到最佳平衡，在图像与视频评价指标上均优于现有方法。

📚 AI论文速递 2026-03-09

🖼️ 计算机视觉

1. Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

2. FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

3. RoboPocket: Improve Robot Policies Instantly with Your Phone

4. Accelerating Text-to-Video Generation with Calibrated Sparse Attention

5. Universal quantum computation with group surface codes

6. POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

7. Calculating trace distances of bosonic states in Krylov subspace

8. The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

9. Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions

10. Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

11. Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

12. cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots

13. NL2GDS: LLM-aided interface for Open Source Chip Design

14. Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

15. Observing and Controlling Features in Vision-Language-Action Models

🧠 大语言模型

1. Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups

2. FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

3. Accelerating Text-to-Video Generation with Calibrated Sparse Attention

4. Universal quantum computation with group surface codes

5. POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

6. The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks

7. Safe-SAGE: Social-Semantic Adaptive Guidance for Safe Engagement through Laplace-Modulated Poisson Safety Functions

8. Mirror codes: High-threshold quantum LDPC codes beyond the CSS regime

9. Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

10. Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

11. Ansatz-Free Learning of Lindbladian Dynamics In Situ

12. Local limits of uniform triangulations with boundaries in high genus

13. NL2GDS: LLM-aided interface for Open Source Chip Design

14. Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

15. Observing and Controlling Features in Vision-Language-Action Models

🎮 强化学习

1. RoboPocket: Improve Robot Policies Instantly with Your Phone

2. POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

3. Calculating trace distances of bosonic states in Krylov subspace

4. Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels

5. Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

6. Ansatz-Free Learning of Lindbladian Dynamics In Situ

7. Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

8. Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

9. SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

10. Thermodynamic Response Functions in Singular Bayesian Models

11. Kraus Constrained Sequence Learning For Quantum Trajectories from Continuous Measurement

12. Local strategies are pretty good at computing Boolean properties of quantum sequences

13. RealWonder: Real-Time Physical Action-Conditioned Video Generation

14. Residual RL--MPC for Robust Microrobotic Cell Pushing Under Time-Varying Flow

15. High-Pressure Inelastic Neutron Spectroscopy: A true test of Machine-Learned Interatomic Potential energy landscapes

🎨 多模态学习

1. Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

2. HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token

3. Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

4. NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries

5. SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning

6. Oral to Web: Digitizing 'Zero Resource'Languages of Bangladesh

7. Visual-Informed Speech Enhancement Using Attention-Based Beamforming

8. Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum

9. Mario: Multimodal Graph Reasoning with Large Language Models

10. Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series

11. UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark

12. A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset

13. Haptics in Cognition: Disruptor or Enabler of Memory?

14. VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters

15. Differentially Private Multimodal In-Context Learning

📊 新数据集

1. FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning

2. cuRoboV2: Dynamics-Aware Motion Generation with Depth-Fused Distance Fields for High-DoF Robots

3. NL2GDS: LLM-aided interface for Open Source Chip Design

4. Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

5. SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis

6. Quantum Simulation of Coupled Harmonic Oscillators: From Theory to Implementation

7. Spin-resolved microscopy of $^{87}$Sr SU($N$) Fermi-Hubbard systems

8. Spatiotemporal Pauli processes: Quantum combs for modelling correlated noise in quantum error correction

9. Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields

10. Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

11. EdgeDAM: Real-time Object Tracking for Mobile Devices

12. NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance

13. Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

14. Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry