7 topics × 10 papers | Automatically compiled by 伊利虾 🦐
📄 High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rende...
📄 We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation...
📄 Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduc...
📄 We introduce group surface codes, which are a natural generalization of the $\mathbb{Z}_2$ surface code, and equivalent to quantum double models of finite groups with specific boundary conditions. We show that group surface codes can be leveraged to perform non-Clifford gates in $\mathbb{Z}_2$ surface codes, thus enabling universal computation with well-established means of performing logical Clifford gates. Moreover, for suitably chosen groups, we demonstrate that arbitrary reversible classical gates can be implemented transversally in the group surface code. We present the logical operations...
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient v...
📄 We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenome...
📄 Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach p...
📄 The realization of quantum error correction protocols whose logical error rates are suppressed far below physical error rates relies on an intricate combination: the error-correcting code's efficiency, the syndrome extraction circuit's fault tolerance and overhead, the decoder's quality, and the device's constraints, such as physical qubit count and connectivity. This work makes two contributions towards error-corrected quantum devices. First, we introduce mirror codes, a simple yet flexible construction of LDPC stabilizer codes parameterized by a group $G$ and two subsets of $G$ whose total s...
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model thr...
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce fals...
📄 Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce R...
📄 Continuous-variable quantum systems are central to quantum technologies, with Gaussian states playing a key role due to their broad applicability and simple description via first and second moments. Distinguishing Gaussian states requires computing their trace distance, but no analytical formula exists for general states, and numerical evaluation is difficult due to the exponential cost of representing infinite-dimensional operators. We introduce an efficient numerical method to compute the trace distance between a pure and a mixed Gaussian state, based on a generalized Lanczos algorithm that ...
📄 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic ba...
📄 Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal repres...
📄 Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning ...
📄 We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to in...
📄 Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement learning based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning q...
📄 We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately...
📄 Recent studies have demonstrated that incorporating auxiliary information, such as speaker voiceprint or visual cues, can substantially improve Speech Enhancement (SE) performance. However, single-channel methods often yield suboptimal results in low signal-to-noise ratio (SNR) conditions, when there is high reverberation, or in complex scenarios involving dynamic speakers, overlapping speech, or non-stationary noise. To address these issues, we propose a novel Visual-Informed Neural Beamforming Network (VI-NBFNet), which integrates microphone array signal processing and deep neural networks (...
📄 Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasonin...
📄 Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-moda...
📄 Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three distinct types of exogenous factors that influence tempor...
📄 We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation...
📄 Effective robot autonomy requires motion generation that is safe, feasible, and reactive. Current methods are fragmented: fast planners output physically unexecutable trajectories, reactive controllers struggle with high-fidelity perception, and existing solvers fail on high-DoF systems. We present cuRoboV2, a unified framework with three key innovations: (1) B-spline trajectory optimization that enforces smoothness and torque limits; (2) a GPU-native TSDF/ESDF perception pipeline that generates dense signed distance fields covering the full workspace, unlike existing methods that only provide...
📄 The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automat...
📄 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic ba...
📄 Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimatio...
📄 We investigate the quantum algorithm of Babbush et al. (arXiv:2303.13012v3) for simulating coupled harmonic oscillators, which promises exponential speedups over classical methods. Focusing on linearly connected oscillator chains, we bridge the gap between theory and implementation by developing and comparing three concrete realizations of the algorithm. First, we implement a sparse initial state preparation combined with product-formula (Suzuki-Trotter) Hamiltonian simulation. Second, we implement a fully quantum, oracle-based framework in which classical data are accessed via oracles, the Ha...
📄 Quantum-gas microscopes provide direct access to the phases of the Hubbard model, bringing microscopic insight into the complex competition between interactions, SU(2) magnetism, and doping. Alkaline-earth(-like) fermions extend this spin-1/2 paradigm by realizing higher symmetries and giving access to SU(N) Hubbard models, with rich phase diagrams to be unveiled. Despite its fundamental interest, a microscopic exploration of SU(N) quantum systems has remained elusive. Here we report the realization of a quantum-gas microscope for fermionic $^{87}$Sr. Our imaging scheme, based on cooling and f...
📄 Correlated noise is a critical failure mode in quantum error correction (QEC), as temporal memory and spatial structure concentrate faults into error bursts that undermine standard threshold assumptions. Yet, a fundamental gap persists between the stochastic Pauli models ubiquitous in QEC and the microscopic, non-Markovian descriptions of physical device dynamics. We close this gap by introducing Spatiotemporal Pauli Processes (SPPs). By applying a multi-time Pauli twirl -- operationally realised by Pauli-frame randomisation -- to a general process tensor, we map arbitrary multi-time, n...
📄 Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural repres...
📄 Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking with...
📄 High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rende...
📄 Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduc...
📄 We introduce group surface codes, which are a natural generalization of the $\mathbb{Z}_2$ surface code, and equivalent to quantum double models of finite groups with specific boundary conditions. We show that group surface codes can be leveraged to perform non-Clifford gates in $\mathbb{Z}_2$ surface codes, thus enabling universal computation with well-established means of performing logical Clifford gates. Moreover, for suitably chosen groups, we demonstrate that arbitrary reversible classical gates can be implemented transversally in the group surface code. We present the logical operations...
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient v...
📄 We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenome...
📄 Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach p...
📄 The realization of quantum error correction protocols whose logical error rates are suppressed far below physical error rates relies on an intricate combination: the error-correcting code's efficiency, the syndrome extraction circuit's fault tolerance and overhead, the decoder's quality, and the device's constraints, such as physical qubit count and connectivity. This work makes two contributions towards error-corrected quantum devices. First, we introduce mirror codes, a simple yet flexible construction of LDPC stabilizer codes parameterized by a group $G$ and two subsets of $G$ whose total s...
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model thr...
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce fals...
📄 We conducted a HI 21cm absorption study of a sample of 147 nearby (z < 0.1) low-power radio sources with $10\,\mathrm{mJy} < S_{1.4\,\mathrm{GHz}} < 30\,\mathrm{mJy}$ and $\log(P_{1.4\,\mathrm{GHz}}/\mathrm{W\,Hz^{-1}}) = 20.5-23.7$, using the Five-hundred-meter Aperture Spherical radio Telescope. By investigating the origin and kinematics of HI absorbing gas, we aim to study the interplay between the active galactic nucleus (AGN) and its surrounding interstellar medium. Our observations detect 12 new absorbers, combining results from the pilot survey (three absorbers out of 26 sources), yield...
📄 The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise fundamental statistical questions about when synthetic data can be used in a valid, reliable, and principled manner. This paper reviews the current landscape of synthetic data generation and use from a statistical perspective, with the goal of clarifying the assumptions under which synthetic data can meaningfully support downstream discovery, inference, and...
📄 Thermonuclear X-ray bursts from the surface of accreting neutron stars are the most common astrophysical explosions in our galaxy. They provide a unique window into the physics of neutron stars, the physics of matter under extreme conditions, and the physics of astrophysical thermonuclear explosions. X-ray bursts are powered by a broad range of nuclear reactions that need to be understood to interpret observations. The relevant nuclei are mostly neutron-deficient and unstable, and thus experimental information and theoretical understanding are limited and an active area of research in nuclear s...
📄 Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness un...
📄 The processes governing protostellar mass growth remain debated, although episodic accretion is now understood as a key feature of protostellar evolution across all masses. Luminosity bursts have been observed in both low- and high-mass protostars, but the overall statistics remain limited, especially for high-mass objects. Over the past decade, numerical simulations of high-mass core collapse have provided a theoretical framework for interpreting protostellar variability, yet additional observational constraints are required to determine the characteristics and importance of bursts. In this w...
📄 The Microchannel X-ray Telescope on board the Space-based multi-band astronomical Variable Objects Monitor (SVOM) satellite detects and localizes the X-ray afterglow of gamma-ray bursts. One year after the launch, this paper presents the in-flight performance of the scientific analyses conducted by the on-board computer. After summarizing the analysis steps, the paper reviews the on-board results obtained with 15 gamma-ray burst afterglows detected by the telescope between October 2024 and August 2025. For all bursts, the localization uncertainty is estimated to be below 2 arcmin, as required ...
📄 Our understanding of the early Universe has long been limited by biased galaxy samples selected through various color criteria. With deep JWST infrared imaging, mass-complete galaxy samples can now be studied up to $z \sim 8$ for the first time. However, recent work has revealed systematic uncertainties in measuring physical properties of galaxies based solely on JWST/NIRCam and HST photometry, due to their limited wavelength coverage. This highlights the need for supplementary data, particularly in the rest-frame UV and near-infrared. Here we present the ULTIMATE-deblending project, which wil...
📄 We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately...
📄 While it is well known that galaxies are composites of many emission processes, quantifying the various contributions remains challenging. In this work, we use unsupervised machine learning based clustering algorithms to evaluate the agreement between the clustering tools and astrophysical classifications, and hence quantify the fractional contributions of star formation processes and nuclear black hole activity to the total galaxy energy budget of radio sources. We perform clustering on the multiwavelength (optical, infrared (IR), and radio) active galactic nuclei (AGN) diagnostic spaces, usi...
📄 Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce R...
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient v...
📄 Continuous-variable quantum systems are central to quantum technologies, with Gaussian states playing a key role due to their broad applicability and simple description via first and second moments. Distinguishing Gaussian states requires computing their trace distance, but no analytical formula exists for general states, and numerical evaluation is difficult due to the exponential cost of representing infinite-dimensional operators. We introduce an efficient numerical method to compute the trace distance between a pure and a mixed Gaussian state, based on a generalized Lanczos algorithm that ...
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model thr...
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce fals...
📄 Characterizing the dynamics of open quantum systems at the level of microscopic interactions and error mechanisms is essential for calibrating quantum hardware, designing robust simulation protocols, and developing tailored error-correction methods. Under Markovian noise/dissipation, a natural characterization approach is to identify the full Lindbladian generator that gives rise to both coherent (Hamiltonian) and dissipative dynamics. Prior protocols for learning Lindbladians from dynamical data assumed pre-specified interaction structure, which can be restrictive when the relevant noise chan...
📄 We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in d...
📄 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic ba...
📄 Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimatio...
📄 Singular statistical models (including mixtures, matrix factorization, and neural networks) violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold and singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of ...