7 Major Themes × 10 Papers | 伊利虾 🦐
📄 High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures with a novel, application-targeted inpainting method that is independent of the underlying representation and runs as an image-based post-processing step after novel-view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors on both image- and video-based metrics.
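The adaptive patch selection idea can be sketched in a few lines: rank fixed-size patches by how many hole pixels they contain, and send only the densest ones through the (expensive) inpainting network. The scoring rule, patch size, and budget below are our own illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def select_patches(mask, patch=64, max_patches=16):
    """Pick the patches with the most missing pixels, up to a budget.

    mask: (H, W) boolean array, True where the rendered view has holes.
    Returns (row, col) patch origins, densest holes first.
    """
    H, W = mask.shape
    scored = []
    for r in range(0, H - patch + 1, patch):
        for c in range(0, W - patch + 1, patch):
            n_missing = int(mask[r:r + patch, c:c + patch].sum())
            if n_missing > 0:
                scored.append((n_missing, (r, c)))
    scored.sort(reverse=True)                 # densest holes first
    return [origin for _, origin in scored[:max_patches]]

# toy 128x128 frame with one 40x40 hole in the upper-right quadrant
mask = np.zeros((128, 128), dtype=bool)
mask[10:50, 70:110] = True
print(select_patches(mask))                   # -> [(0, 64)]
```

Shrinking `max_patches` trades quality for speed, which is the kind of knob such a strategy would use to stay inside a real-time budget.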
📄 We introduce FaceCam, a system that generates video with customizable camera trajectories from monocular human-portrait video input. Recent camera-control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored, scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies, synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on the Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, and identity and motion preservation.
📄 Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using a single consumer smartphone. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments with only a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.
📄 Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
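The block-skipping mechanics can be illustrated with a small numpy sketch (this is not CalibAtt's compiled kernels; the block size and threshold are our illustrative choices): a calibration pass marks query/key block pairs whose average attention mass falls below a threshold, and inference masks those blocks out before the softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def calibrate(scores, block, thresh):
    """Offline pass: mark query/key block pairs whose average attention
    mass falls below `thresh` as skippable (illustrative criterion)."""
    attn = softmax(scores)
    nb = scores.shape[0] // block
    keep = np.zeros((nb, nb), dtype=bool)
    for i in range(nb):
        for j in range(nb):
            blk = attn[i*block:(i+1)*block, j*block:(j+1)*block]
            keep[i, j] = blk.mean() >= thresh
    return keep

def sparse_attention(q, k, v, keep, block):
    """Attention that skips the blocks marked skippable at calibration."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.kron(keep, np.ones((block, block), dtype=bool))
    return softmax(np.where(mask, scores, -np.inf)) @ v

rng = np.random.default_rng(0)
T, d, block = 8, 4, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
scores = q @ k.T / np.sqrt(d)
keep = calibrate(scores, block, thresh=0.02)
out = sparse_attention(q, k, v, keep, block)
```

In the real method the keep-patterns are collected per layer, head, and diffusion timestep on calibration inputs and then reused for all prompts, which is what makes the approach training-free.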
📄 We introduce group surface codes, which are a natural generalization of the $\mathbb{Z}_2$ surface code, and equivalent to quantum double models of finite groups with specific boundary conditions. We show that group surface codes can be leveraged to perform non-Clifford gates in $\mathbb{Z}_2$ surface codes, thus enabling universal computation with well-established means of performing logical Clifford gates. Moreover, for suitably chosen groups, we demonstrate that arbitrary reversible classical gates can be implemented transversally in the group surface code. We present the logical operations in terms of a set of elementary logical operations, which include transversal logical gates, a means of transferring encoded information into and out of group surface codes, and preparation and readout. By composing these elementary operations, we implement a wide variety of logical gates and provide a unified perspective on recent constructions in the literature for sliding group surface codes and preparing magic states. We furthermore use tensor networks inspired by ZX-calculus to construct spacetime implementations of the elementary operations. This spacetime perspective also allows us to establish explicit correspondences with topological gauge theories. Our work extends recent efforts in performing universal quantum computation in topological orders without the braiding of anyons, and shows how certain group surface codes allow us to bypass the restrictions set by the Bravyi-König theorem, which limits the computational power of topological Pauli stabilizer models.
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, prior work proposed Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformations. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations at significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
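The "spectrum-preserving" property is easy to check numerically: multiplying a weight matrix by orthogonal factors on both sides leaves its singular values untouched. A minimal verification (ours, not POET-X code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))               # a toy weight matrix

# Random orthogonal factors obtained via QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))

W_new = R @ W @ Q.T                           # orthogonal equivalence transform

s_old = np.linalg.svd(W, compute_uv=False)
s_new = np.linalg.svd(W_new, compute_uv=False)
print(np.allclose(s_old, s_new))              # True: singular values unchanged
```

Training only the orthogonal factors therefore cannot blow up or collapse the weight spectrum, which is the stability argument behind POET; POET-X's contribution is making these transformations cheap enough to apply at scale.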
📄 Continuous-variable quantum systems are central to quantum technologies, with Gaussian states playing a key role due to their broad applicability and simple description via first and second moments. Distinguishing Gaussian states requires computing their trace distance, but no analytical formula exists for general states, and numerical evaluation is difficult due to the exponential cost of representing infinite-dimensional operators. We introduce an efficient numerical method to compute the trace distance between a pure and a mixed Gaussian state, based on a generalized Lanczos algorithm that avoids explicit matrix representations and uses only moment information. The technique extends to non-Gaussian states expressible as linear combinations of Gaussian states. We also show how it can yield lower bounds on the trace distance between mixed Gaussian states, offering a practical tool for state certification and learning in continuous-variable quantum systems.
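To give a flavor of the matrix-free idea, here is a textbook Lanczos pass on a toy Hermitian operator: the operator is touched only through matrix-vector products, and with as many steps as dimensions the resulting tridiagonal matrix reproduces the full spectrum. This is the standard finite-dimensional algorithm, not the paper's generalized variant for Gaussian states.

```python
import numpy as np

def lanczos(matvec, v0, m):
    """m-step Lanczos tridiagonalization of a Hermitian operator, accessed
    only through matrix-vector products (no explicit matrix required)."""
    n = v0.size
    V = np.zeros((n, m))
    alpha = np.zeros(m)
    beta = np.zeros(m - 1)
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = matvec(V[:, j])
        alpha[j] = V[:, j] @ w
        # full reorthogonalization: cheap at toy scale, keeps T accurate
        w = w - V[:, :j + 1] @ (V[:, :j + 1].T @ w)
        if j < m - 1:
            beta[j] = np.linalg.norm(w)
            V[:, j + 1] = w / beta[j]
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
A = (A + A.T) / 2                      # toy Hermitian "operator"
T = lanczos(lambda x: A @ x, rng.standard_normal(6), 6)
# With m = n the tridiagonal matrix shares A's full spectrum.
print(np.allclose(np.linalg.eigvalsh(T), np.linalg.eigvalsh(A)))
```

The point of the paper's construction is that for Gaussian states the matrix-vector products can be evaluated from first and second moments alone, sidestepping the infinite-dimensional operator entirely.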
📄 We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
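Both phenomena have simple operational probes. The sketch below flags outlier (token, channel) activations against a median-based cutoff and measures the attention mass a single token attracts; the 50x-median threshold and the synthetic data are hypothetical stand-ins, not the paper's definitions.

```python
import numpy as np

def massive_activation_tokens(hidden, ratio=50.0):
    """Flag (token, channel) pairs whose magnitude dwarfs the typical one.
    hidden: (T, d) activations from one layer.  |h| > ratio * median(|h|)
    is a hypothetical operational threshold."""
    mags = np.abs(hidden)
    return np.argwhere(mags > ratio * np.median(mags))

def sink_mass(attn, token=0):
    """Average attention mass all queries place on one token (e.g. BOS)."""
    return float(attn[:, token].mean())

rng = np.random.default_rng(0)
hidden = rng.standard_normal((8, 16))
hidden[0, 3] = 500.0                      # plant one massive activation
print(massive_activation_tokens(hidden))  # -> [[0 3]]

attn = np.full((8, 8), 1.0 / 8)           # uniform attention: no sink
attn[:, 0], attn[:, 1:] = 0.65, 0.05      # token 0 now absorbs 65% of mass
print(sink_mass(attn))                    # -> 0.65
```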
📄 Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to navigate semantically rich, dynamic environments with context-dependent safety margins while maintaining rigorous safety guarantees.
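Safe-SAGE's safety layer builds on Poisson safety functions with MPC and CBF stages; the scalar control-barrier-function filter below only illustrates the mechanism being modulated, namely how a class-dependent gain changes conservatism. The 1-D dynamics and all numbers are our toy assumptions.

```python
def cbf_filter(x, u_des, x_obs, alpha):
    """Minimal 1-D control barrier function filter for x_dot = u.

    h(x) = x_obs - x (distance to an obstacle at x_obs, approached from
    below).  Safety condition h_dot + alpha*h >= 0 gives
    u <= alpha*(x_obs - x), so the scalar QP has the closed form below.
    A smaller alpha for semantically critical obstacles (e.g. people)
    makes the filter brake earlier: the kind of context-dependent
    conservatism a semantic guidance field would modulate.
    """
    return min(u_des, alpha * (x_obs - x))

# Far from the obstacle the desired command passes through unchanged...
print(cbf_filter(x=0.0, u_des=1.0, x_obs=10.0, alpha=1.0))   # 1.0
# ...near it the filter clips velocity; a cautious alpha clips sooner.
print(cbf_filter(x=9.5, u_des=1.0, x_obs=10.0, alpha=1.0))   # 0.5
print(cbf_filter(x=9.5, u_des=1.0, x_obs=10.0, alpha=0.2))   # 0.1
```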
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.
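The three-stage recipe (cheap labels, supervised pretraining, self-supervised refinement) can be demonstrated end-to-end on a toy surrogate problem; the task, model, and hyperparameters below are our illustrative choices, not the paper's benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.5, 1.5, 200)                 # problem parameters
y_cheap = np.sqrt(p) + 0.05 * rng.standard_normal(p.size)  # imperfect labels

# Stages 1-2: collect cheap labels, then supervised pretraining.  The
# surrogate is a deliberately tiny linear model f(p) = w1*p + w0.
X = np.stack([p, np.ones_like(p)], axis=1)
w = np.linalg.lstsq(X, y_cheap, rcond=None)[0]

def residual(w):
    """Self-supervised loss: the surrogate should satisfy f(p)^2 = p."""
    f = X @ w
    return float(np.mean((f**2 - p) ** 2))

loss_pre = residual(w)

# Stage 3: self-supervised refinement by gradient descent on the residual,
# requiring no labels at all.
for _ in range(500):
    f = X @ w
    w -= 0.05 * X.T @ (4 * (f**2 - p) * f) / p.size
print(loss_pre, residual(w))   # refinement lowers the residual
```

Pretraining on the noisy labels only needs to land the surrogate inside the right basin of attraction; the label-free residual loss then does the final polishing, mirroring the paper's argument for why modest numbers of inexact labels suffice.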
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
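A linear probe of the kind evaluated for lie detection is just a logistic-regression classifier on hidden-state vectors. The sketch below trains one on synthetic activations in which truthful and false responses are shifted along a latent direction; a real probe would be trained on actual model activations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 32, 400
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)

# Synthetic "hidden states": truthful answers shifted one way along a
# latent direction, false answers the other way (stand-in for real
# model activations).
labels = rng.integers(0, 2, n)                       # 1 = truthful
states = rng.standard_normal((n, d)) + np.outer(2 * labels - 1, 2.0 * direction)

# Linear probe = logistic regression trained by gradient descent.
w = np.zeros(d)
for _ in range(200):
    prob = 1 / (1 + np.exp(-(states @ w)))
    w -= 0.1 * states.T @ (prob - labels) / n

acc = np.mean((states @ w > 0) == (labels == 1))
print(acc)     # high separability -> near-perfect probe accuracy
```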
📄 Effective robot autonomy requires motion generation that is safe, feasible, and reactive. Current methods are fragmented: fast planners output physically unexecutable trajectories, reactive controllers struggle with high-fidelity perception, and existing solvers fail on high-DoF systems. We present cuRoboV2, a unified framework with three key innovations: (1) B-spline trajectory optimization that enforces smoothness and torque limits; (2) a GPU-native TSDF/ESDF perception pipeline that generates dense signed distance fields covering the full workspace (unlike existing methods, which only provide distances within sparsely allocated blocks), running up to 10x faster with 8x less memory than the state of the art at manipulation scale and reaching up to 99% collision recall; and (3) scalable GPU-native whole-body computation, namely topology-aware kinematics, differentiable inverse dynamics, and map-reduce self-collision, that achieves up to 61x speedup while also extending to high-DoF humanoids (where previous GPU implementations fail). On benchmarks, cuRoboV2 achieves 99.7% success under a 3kg payload (where baselines achieve only 72--77%), 99.6% collision-free IK on a 48-DoF humanoid (where prior methods fail entirely), and 89.5% retargeting constraint satisfaction (vs. 61% for PyRoki); these collision-free motions yield locomotion policies with 21% lower tracking error than PyRoki and 12x lower cross-seed variance than mink. A ground-up codebase redesign for discoverability enabled LLM coding assistants to author up to 73% of new modules, including hand-optimized CUDA kernels, demonstrating that well-structured robotics code can unlock productive human--LLM collaboration. Together, these advances provide a unified, dynamics-aware motion generation stack that scales from single-arm manipulators to full humanoids.
📄 The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automated synthesis and layout. Evaluations on ISCAS'85 and ISCAS'89 benchmark designs demonstrate up to 36% area reduction, 35% delay reduction, and 70% power savings compared to baseline designs, highlighting its potential to democratize ASIC design and accelerate hardware innovation.
📄 We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor can report, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multi-hop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning activation probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
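Probe-guided early exit reduces to a simple stopping rule over the per-step probe confidences; the threshold, patience, and trace below are illustrative values of ours, not the paper's.

```python
def early_exit_step(confidences, thresh=0.9, patience=3):
    """Return the CoT step at which to stop: the first index where the
    probe's answer confidence has stayed above `thresh` for `patience`
    consecutive steps; None if it never stabilizes."""
    run = 0
    for t, c in enumerate(confidences):
        run = run + 1 if c >= thresh else 0
        if run >= patience:
            return t
    return None

# probe confidence over a 10-step chain of thought (synthetic trace):
trace = [0.2, 0.4, 0.95, 0.6, 0.92, 0.93, 0.97, 0.98, 0.99, 0.99]
print(early_exit_step(trace))   # stops at step 6, saving 3 of 10 steps
```

The patience requirement guards against the one-off confidence spike at step 2, which a naive single-threshold rule would have mistaken for a settled answer.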
📄 Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($\pi_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
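For a single linearly encoded feature, the minimal-norm intervention that places a readout at a desired value has a closed form, which conveys the flavor of the steering step (the paper's optimal-control formulation is richer; this single-constraint version is our simplification):

```python
import numpy as np

def minimal_intervention(h, w, target):
    """Smallest L2 edit to hidden state h so the linear readout w . h
    equals `target`: delta = (target - w.h) * w / ||w||^2, the least-norm
    solution of one linear constraint.  A toy stand-in for the paper's
    optimal-control-grounded minimal linear intervention."""
    return (target - w @ h) * w / (w @ w)

rng = np.random.default_rng(0)
h = rng.standard_normal(8)          # a hidden representation
w = rng.standard_normal(8)          # probe / feature direction
h_steered = h + minimal_intervention(h, w, target=3.0)
print(round(float(w @ h_steered), 6))   # readout now sits at 3.0
```

Because the edit is the least-norm solution, everything orthogonal to the feature direction is left untouched, which is the intuition for why such interventions can steer one behavior while preserving the model's closed-loop capabilities.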
📄 High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.
📄 高质量的多相机三维流传输对于许多增强现实/虚拟现实应用中的沉浸式体验至关重要。由于实时性限制,视角数量有限往往导致渲染图像中出现信息缺失和表面不完整的问题。现有方法通常依赖简单的启发式算法进行空洞填充,这可能导致不一致性或视觉伪影。我们提出了一种新颖的、面向应用的修复方法,在完成新视角渲染后作为基于图像的后处理步骤,独立于底层表示来完成缺失纹理的填充。该方法设计为独立模块,兼容任何经过标定的多相机系统。为此,我们引入了一种基于Transformer的多视角感知网络架构,利用时空嵌入确保跨帧一致性的同时保留细节特征。此外,我们的分辨率无关设计可适配不同相机配置,而自适应分块选择策略在推理速度与质量之间取得平衡,实现了实时性能。我们在相同实时性约束下将本方法与最先进的修复技术进行比较评估,结果表明我们的模型在质量与速度之间达到了最佳平衡,在图像和视频评价指标上均优于现有方法。
📄 We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.
📄 我们推出FaceCam系统,该系统可为单目人像视频输入生成具有可定制相机轨迹的视频。当前基于大型视频生成模型的相机控制方法虽展现出良好前景,但由于尺度模糊的相机表征或三维重建误差,在人像视频中常出现几何畸变和视觉伪影。为突破这些限制,我们提出一种面向人像的尺度感知相机变换表征方法,无需依赖三维先验即可提供确定性条件约束。我们利用多视角影棚采集数据与真实场景单目视频训练视频生成模型,并引入两种相机控制数据生成策略:合成相机运动与多镜头拼接,从而在训练阶段充分利用静态相机数据,同时在推理阶段泛化至动态连续的相机轨迹。在Ava-256数据集及多样化真实场景视频上的实验表明,FaceCam在相机可控性、视觉质量、身份特征与运动保持方面均表现出卓越性能。
📄 Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
📄 近期扩散模型虽能生成高质量视频,但存在运行速度缓慢的问题。这些模型所采用的大型基于Transformer的主干网络受限于时空注意力机制。本文发现,在不同输入中,大量词元间连接始终产生可忽略的注意力分数,且其模式在多个查询中反复出现。因此,跳过这些情况下的注意力计算几乎不会影响生成结果。这一现象在局部词元块间的连接中同样存在。基于此,我们提出CalibAtt——一种无需训练的方法,通过校准稀疏注意力加速视频生成。CalibAtt通过离线校准步骤识别具有跨输入稳定性的块级稀疏与重复模式,并将这些模式编译为针对每层、每个注意力头和扩散时间步的优化注意力操作。在推理阶段,我们以硬件高效的方式密集计算选中的输入相关连接,同时跳过未选中的连接。在Wan 2.1 14B、Mochi 1及多种分辨率的少步蒸馏模型上的大量实验表明,CalibAtt可实现最高1.58倍的端到端加速,在保持视频生成质量与文本-视频对齐度的同时,优于现有免训练加速方法。
📄 We introduce group surface codes, which are a natural generalization of the $\mathbb{Z}_2$ surface code, and equivalent to quantum double models of finite groups with specific boundary conditions. We show that group surface codes can be leveraged to perform non-Clifford gates in $\mathbb{Z}_2$ surface codes, thus enabling universal computation with well-established means of performing logical Clifford gates. Moreover, for suitably chosen groups, we demonstrate that arbitrary reversible classical gates can be implemented transversally in the group surface code. We present the logical operations in terms of a set of elementary logical operations, which include transversal logical gates, a means of transferring encoded information into and out of group surface codes, and preparation and readout. By composing these elementary operations, we implement a wide variety of logical gates and provide a unified perspective on recent constructions in the literature for sliding group surface codes and preparing magic states. We furthermore use tensor networks inspired by ZX-calculus to construct spacetime implementations of the elementary operations. This spacetime perspective also allows us to establish explicit correspondences with topological gauge theories. Our work extends recent efforts in performing universal quantum computation in topological orders without the braiding of anyons, and shows how certain group surface codes allow us to bypass the restrictions set by the Bravyi-K{ö}nig theorem, which limits the computational power of topological Pauli stabilizer models.
📄 我们提出了群表面码,这是$\mathbb{Z}_2$表面码的自然推广,等价于具有特定边界条件的有限群量子双模型。我们证明,群表面码可用于在$\mathbb{Z}_2$表面码中执行非克利福德门操作,从而通过成熟的逻辑克利福德门实现方式达成通用计算。此外,对于适当选择的群,我们展示了任意可逆经典门可以在群表面码中横向实现。我们通过一组基本逻辑操作来描述这些逻辑运算,包括横向逻辑门、将编码信息传入和传出群表面码的方法,以及制备和读取操作。通过组合这些基本操作,我们实现了多种逻辑门,并为文献中关于滑动群表面码和制备魔术态的最新构造提供了统一视角。我们还利用受ZX演算启发的张量网络构建了基本操作的时空实现。这种时空视角也使我们能够与拓扑规范理论建立明确的对应关系。我们的工作扩展了近期在拓扑序中不依赖任意子编织实现通用量子计算的努力,并展示了某些群表面码如何让我们绕过Bravyi-König定理的限制——该定理限制了拓扑泡利稳定子模型的计算能力。
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, prior work proposed Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformations. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations at significantly reduced computational cost. POET-X retains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
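As a concrete illustration of why orthogonal equivalence transformations are spectrum-preserving, here is a minimal numpy sketch (not the POET or POET-X implementation; the matrix shapes and the QR-based sampling of orthogonal factors are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n, rng):
    # QR of a Gaussian matrix gives a random orthogonal factor
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform distribution

# A weight matrix and its orthogonal equivalence transform W' = R @ W @ Q
W = rng.standard_normal((8, 8))
R = random_orthogonal(8, rng)
Q = random_orthogonal(8, rng)
W_prime = R @ W @ Q

# The singular-value spectrum of W is exactly preserved
sv_before = np.linalg.svd(W, compute_uv=False)
sv_after = np.linalg.svd(W_prime, compute_uv=False)
assert np.allclose(sv_before, sv_after)
```

Because the singular values of $W$ are fixed under such transforms, an optimizer that only updates the orthogonal factors $R$ and $Q$ trains within a spectrum-preserving family — the stability property POET-X inherits from POET.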
📄 We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
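To make the "extreme outliers in a few channels" notion concrete, here is a toy detector for massive activations; the 100x-median threshold and tensor shapes are illustrative assumptions, not the paper's criterion:

```python
import numpy as np

def find_massive_activations(hidden, ratio=100.0):
    """Flag (token, channel) pairs whose magnitude exceeds `ratio` times
    the median absolute activation of the whole tensor.

    hidden: array of shape (seq_len, d_model), one layer's hidden states.
    """
    mags = np.abs(hidden)
    scale = np.median(mags)
    return np.argwhere(mags > ratio * scale)

# Synthetic hidden states: ordinary values plus one extreme outlier,
# mimicking a massive activation at token 0, channel 3
rng = np.random.default_rng(1)
hidden = rng.standard_normal((16, 8))
hidden[0, 3] = 2000.0

outliers = find_massive_activations(hidden)
print(outliers)  # -> [[0 3]]
```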
📄 Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to navigate semantically rich, dynamic environments with context-dependent safety margins while maintaining rigorous safety guarantees.
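For readers unfamiliar with the baseline being generalized, the following is a textbook single-constraint control barrier function filter for a single-integrator robot — a hedged sketch, not Safe-SAGE's Poisson-safety-function or multi-layer filter. It illustrates the semantic blindness the abstract describes: the filter treats every obstacle identically, regardless of what it is.

```python
import numpy as np

def cbf_filter(x, u_des, obstacle, radius, alpha=1.0):
    """Single-constraint control barrier function filter (closed form).

    Safe set: h(x) = ||x - obstacle||^2 - radius^2 >= 0.
    Solves min ||u - u_des||^2 s.t. grad_h(x) . u >= -alpha * h(x),
    which for one affine constraint is a halfspace projection.
    """
    h = np.dot(x - obstacle, x - obstacle) - radius**2
    grad_h = 2.0 * (x - obstacle)
    slack = grad_h @ u_des + alpha * h
    if slack >= 0.0:          # desired input already safe
        return u_des
    # project u_des onto the constraint boundary
    return u_des - slack * grad_h / (grad_h @ grad_h)

# Robot at (2, 0) heading straight at a unit-radius obstacle at the origin:
# the filter slows the approach just enough to keep h decreasing safely
x = np.array([2.0, 0.0])
u_des = np.array([-1.0, 0.0])
u_safe = cbf_filter(x, u_des, obstacle=np.zeros(2), radius=1.0)
```

Here `u_safe` comes out as `(-0.75, 0)`: the same braking behavior would be applied whether the obstacle were a wall or a pedestrian, which is the limitation Safe-SAGE's semantic modulation addresses.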
📄 The realization of quantum error correction protocols whose logical error rates are suppressed far below physical error rates relies on an intricate combination: the error-correcting code's efficiency, the syndrome extraction circuit's fault tolerance and overhead, the decoder's quality, and the device's constraints, such as physical qubit count and connectivity. This work makes two contributions towards error-corrected quantum devices. First, we introduce mirror codes, a simple yet flexible construction of LDPC stabilizer codes parameterized by a group $G$ and two subsets of $G$ whose total size bounds the check weight. These codes contain all abelian two-block group algebra codes, such as bivariate bicycle (BB) codes. At the same time, they are manifestly not CSS in general, thus deviating substantially from most prior constructions. Fixing a check weight of 6, we find $[[ 60, 4, 10 ]], [[ 36, 6, 6 ]], [[ 48, 8, 6 ]]$, and $[[ 85, 8, 9 ]]$ codes, all of which are not CSS; we also find several weight-7 codes with $kd > n$. Next, we construct syndrome extraction circuits that trade overhead for provable fault tolerance. These circuits use 1-2, 3, and 6 ancillae per check, and respectively are partially fault-tolerant (FT), provably FT on weight-6 CSS codes, and provably FT on \emph{all} weight-6 stabilizer codes. Using our constructions, we perform end-to-end quantum memory experiments on several representative mirror codes under circuit-level noise. We achieve an error pseudothreshold on the order of $0.2\%$, approximately matching that of the $[[ 144, 12, 12 ]]$ BB code under the same model. These findings position mirror codes as a versatile candidate for fault-tolerant quantum memory, especially on smaller-scale devices in the near term.
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.
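A toy instance of the three-stage recipe (cheap labels, supervised pretraining, self-supervised refinement) on a scalar problem; the quadratic objective, polynomial features, and step sizes are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: for each parameter theta, the surrogate should output the
# minimizer of f(x; theta) = (x^2 - theta)^2, i.e. x*(theta) = sqrt(theta).
theta = rng.uniform(0.5, 2.0, size=200)
phi = np.stack([np.ones_like(theta), theta, theta**2], axis=1)  # features

def objective(w):
    x = phi @ w
    return np.mean((x**2 - theta) ** 2)

# Stage 1: "cheap" imperfect labels (noisy approximate solutions)
y_noisy = np.sqrt(theta) + 0.1 * rng.standard_normal(theta.shape)

# Stage 2: supervised pretraining by least squares on the imperfect labels
w = np.linalg.lstsq(phi, y_noisy, rcond=None)[0]
loss_pretrained = objective(w)

# Stage 3: self-supervised refinement -- gradient descent directly on
# f(x; theta), which needs no labels at all
for _ in range(500):
    x = phi @ w
    grad = (4.0 * x * (x**2 - theta)) @ phi / len(theta)
    w -= 0.01 * grad
loss_refined = objective(w)

assert loss_refined <= loss_pretrained
```

The key point mirrors the paper's criterion: the noisy least-squares fit only needs to land inside a basin of attraction of the self-supervised objective, after which label-free gradient descent finishes the job.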
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
📄 Characterizing the dynamics of open quantum systems at the level of microscopic interactions and error mechanisms is essential for calibrating quantum hardware, designing robust simulation protocols, and developing tailored error-correction methods. Under Markovian noise/dissipation, a natural characterization approach is to identify the full Lindbladian generator that gives rise to both coherent (Hamiltonian) and dissipative dynamics. Prior protocols for learning Lindbladians from dynamical data assumed pre-specified interaction structure, which can be restrictive when the relevant noise channels or control imperfections are not known in advance. In this paper, we present the first sample-efficient protocol for learning sparse Lindbladians without assuming any a priori structure or locality. Our protocol is ancilla-free, uses only product-state preparations and Pauli-basis measurements, and achieves near-optimal time resolution, making it compatible with near-term experimental capabilities. The final sample complexity depends on linear-system conditioning, which we find empirically to be moderate for a broad class of physically motivated models. Together, this provides a systematic route to scalable characterization of open-system quantum dynamics, especially in settings where the error mechanisms of interest are unknown.
📄 We study the local limits of uniform random triangulations with boundaries in the regime where the genus is proportional to the number of faces. Budzinski and Louf proved in 2020 that when there are no boundaries, the local limits exist and are the Planar Stochastic Hyperbolic Triangulations (PSHT) introduced by Curien. We show that when the triangulations considered have size n and boundaries of total length p that tends to infinity with n while p = o(n), the local limits around a typical boundary edge are the half-plane hyperbolic triangulations defined by Angel and Ray. This provides, for the first time, a construction of these hyperbolic half-plane triangulations as local limits of large genus triangulations. We also prove that under the condition p = o(n), the local limit when rooted at a uniformly chosen oriented edge is given by the PSHT. Contrary to the proof of Budzinski and Louf, ours does not rely on the Goulden-Jackson recurrence relation, but only on coarse combinatorial estimates. We therefore expect that the proof can be adapted to local limits in similar models.
📄 The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automated synthesis and layout. Evaluations on ISCAS'85 and ISCAS'89 benchmark designs demonstrate up to 36% area reduction, 35% delay reduction, and 70% power savings compared to baseline designs, highlighting its potential to democratize ASIC design and accelerate hardware innovation.
📄 We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task-difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor can detect, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning on difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
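A toy version of probe-guided early exit: train a linear probe on activations, then stop decoding once the probe's confidence crosses a threshold. The synthetic "CoT trajectory", dimensions, and 0.95 threshold are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Train a linear probe (logistic regression via gradient descent) to
# decode a binary "final answer" from hidden activations.
d = 16
w_true = rng.standard_normal(d)
X = rng.standard_normal((500, d))
y = (X @ w_true > 0).astype(float)

w = np.zeros(d)
for _ in range(300):
    p = sigmoid(X @ w)
    w -= 0.1 * (X.T @ (p - y)) / len(y)

def early_exit_step(trajectory, threshold=0.95):
    """Return the first CoT step whose probe confidence exceeds
    `threshold` (or the last step if none does)."""
    for t, h in enumerate(trajectory):
        p = sigmoid(h @ w)
        if max(p, 1.0 - p) >= threshold:
            return t
    return len(trajectory) - 1

# Synthetic CoT: activations drift toward a confident answer over 20 steps
direction = w_true / np.linalg.norm(w_true)
trajectory = [0.3 * t * direction + 0.1 * rng.standard_normal(d)
              for t in range(20)]
stop = early_exit_step(trajectory)
# The probe becomes confident well before step 20, so the tokens after
# `stop` could be skipped with little loss in accuracy.
```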
📄 Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($π_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
📄 Scaling imitation learning is fundamentally constrained by the efficiency of data collection. While handheld interfaces have emerged as a scalable solution for in-the-wild data acquisition, they predominantly operate in an open-loop manner: operators blindly collect demonstrations without knowing the underlying policy's weaknesses, leading to inefficient coverage of critical state distributions. Conversely, interactive methods like DAgger effectively address covariate shift but rely on physical robot execution, which is costly and difficult to scale. To reconcile this trade-off, we introduce RoboPocket, a portable system that enables Robot-Free Instant Policy Iteration using a single consumer smartphone. Its core innovation is a Remote Inference framework that visualizes the policy's predicted trajectory via Augmented Reality (AR) Visual Foresight. This immersive feedback allows collectors to proactively identify potential failures and focus data collection on the policy's weak regions without requiring a physical robot. Furthermore, we implement an asynchronous Online Finetuning pipeline that continuously updates the policy with incoming data, effectively closing the learning loop in minutes. Extensive experiments demonstrate that RoboPocket adheres to data scaling laws and doubles the data efficiency compared to offline scaling strategies, overcoming their long-standing efficiency bottleneck. Moreover, our instant iteration loop also boosts sample efficiency by up to 2$\times$ in distributed environments, with only a small number of interactive corrections per person. Project page and videos: https://robo-pocket.github.io.
📄 Continuous-variable quantum systems are central to quantum technologies, with Gaussian states playing a key role due to their broad applicability and simple description via first and second moments. Distinguishing Gaussian states requires computing their trace distance, but no analytical formula exists for general states, and numerical evaluation is difficult due to the exponential cost of representing infinite-dimensional operators. We introduce an efficient numerical method to compute the trace distance between a pure and a mixed Gaussian state, based on a generalized Lanczos algorithm that avoids explicit matrix representations and uses only moment information. The technique extends to non-Gaussian states expressible as linear combinations of Gaussian states. We also show how it can yield lower bounds on the trace distance between mixed Gaussian states, offering a practical tool for state certification and learning in continuous-variable quantum systems.
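The paper's method is a generalized Lanczos algorithm tailored to Gaussian moment data; as background, here is the plain Lanczos iteration it builds on, which extracts spectral information from matrix-vector products alone. The dense test operator and iteration count are illustrative, and no reorthogonalization is done, so this is a sketch rather than production code:

```python
import numpy as np

def lanczos(matvec, dim, k, rng):
    """Plain Lanczos iteration (no reorthogonalization): builds a k x k
    tridiagonal matrix whose eigenvalues (Ritz values) approximate the
    extreme eigenvalues of a Hermitian operator accessed only through
    matrix-vector products."""
    alphas, offdiag = [], []
    v_prev = np.zeros(dim)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    beta = 0.0
    for _ in range(k):
        w = matvec(v)
        alpha = v @ w
        w = w - alpha * v - beta * v_prev
        alphas.append(alpha)
        beta = np.linalg.norm(w)
        offdiag.append(beta)
        v_prev, v = v, w / beta
    T = (np.diag(alphas)
         + np.diag(offdiag[:-1], 1)
         + np.diag(offdiag[:-1], -1))
    return np.linalg.eigvalsh(T)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2.0                      # Hermitian test operator
ritz = lanczos(lambda x: A @ x, dim=50, k=25, rng=rng)
exact = np.linalg.eigvalsh(A)
# The extreme Ritz values closely track the true extreme eigenvalues
```

The point relevant to the abstract is that `matvec` never requires a dense (let alone infinite-dimensional) representation of the operator; in the paper's setting the analogous products are computed from first and second moments of the Gaussian states.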
📄 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
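The abstract does not detail ReMA's memory mechanism; as a purely illustrative toy (class name, relevance scoring, and string "summaries" are all assumptions), a bounded, iteratively updated belief state might look like:

```python
class RecursiveBelief:
    """Toy bounded belief state, updated chunk by chunk so that context
    never saturates. Relevance scores and string 'summaries' stand in
    for what a real multimodal agent would compute with an MLLM."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.memory = []                      # (relevance, summary) pairs

    def update(self, summary, relevance):
        # Fold the new observation into the belief, then compress:
        # keep only the `capacity` most relevant items
        self.memory.append((relevance, summary))
        self.memory.sort(reverse=True)
        self.memory = self.memory[: self.capacity]

    def answer(self):
        # Reasoning happens over the compact belief, not the raw stream
        return [summary for _, summary in self.memory]

belief = RecursiveBelief(capacity=2)
for summary, score in [("breakfast", 0.2), ("meeting", 0.9),
                       ("commute", 0.1), ("deadline", 0.8)]:
    belief.update(summary, score)
print(belief.answer())  # -> ['meeting', 'deadline']
```

This also suggests why end-to-end MLLMs hit a working-memory bottleneck: without such compression, a month-scale stream must fit in one context window.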
📄 Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .
📄 从右删失生存数据中估计异质性处理效应(HTE)在精准医疗和个性化政策制定等高风险应用中至关重要。然而,由于删失、未观测的反事实结果以及复杂的识别假设,生存分析环境为HTE估计带来了独特挑战。尽管从因果生存森林到生存元学习器及结果插补方法等领域已取得进展,但评估实践仍存在碎片化和不一致的问题。我们推出了SurvHTE-Bench,这是首个针对删失结果HTE估计的综合基准。该基准涵盖:(i)一套模块化的合成数据集,包含已知的真实效应,系统性地改变因果假设和生存动态;(ii)半合成数据集,将真实世界协变量与模拟处理和结果相结合;(iii)来自双胞胎研究(已知真实效应)和HIV临床试验的真实世界数据集。在合成、半合成和真实世界场景中,我们首次对不同条件和现实假设违反情况下的生存HTE方法进行了严格比较。SurvHTE-Bench为因果生存方法的公平、可复现和可扩展评估奠定了基础。本基准的数据和代码可在以下网址获取:https://github.com/Shahriarnz14/SurvHTE-Bench。
📄 Singular statistical models, including mixtures, matrix factorization, and neural networks, violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes marginal likelihood behavior through invariants such as the real log canonical threshold (RLCT) and the singular fluctuation, these quantities remain difficult to interpret operationally. At the same time, widely used criteria such as WAIC and WBIC appear disconnected from the underlying singular geometry. We show that posterior tempering induces a one-parameter deformation of the posterior distribution whose associated observables generate a hierarchy of thermodynamic response functions. A universal covariance identity links derivatives of tempered expectations to posterior fluctuations, placing WAIC, WBIC, and the singular fluctuation within a unified response framework. Within this framework, classical quantities from singular learning theory acquire natural thermodynamic interpretations: the RLCT governs the leading free-energy slope, the singular fluctuation corresponds to the curvature of the tempered free energy, and WAIC measures predictive fluctuation. We formalize an observable algebra that quotients out non-identifiable directions, allowing structurally meaningful order parameters to be constructed in singular models. Across canonical singular examples, including symmetric Gaussian mixtures, reduced-rank regression, and overparameterized neural networks, we empirically demonstrate phase-transition-like behavior under tempering. Order parameters collapse, susceptibilities peak, and complexity measures align with structural reorganization in posterior geometry. Our results suggest that thermodynamic response theory provides a natural organizing framework for interpreting complexity, predictive variability, and structural reorganization in singular Bayesian learning.
📄 奇异统计模型——包括混合模型、矩阵分解和神经网络——由于参数不可识别性和退化的费希尔几何,违反了正则渐近理论。尽管奇异学习理论通过实对数典范阈值和奇异涨落等不变量刻画了边缘似然的行为,但这些量在操作上仍难以解释。与此同时,广泛使用的准则如WAIC和WBIC似乎与底层的奇异几何脱节。我们证明,后验回火诱导了后验分布的单参数形变,其相关可观测量生成了一组热力学响应函数的层级结构。一个普适的协方差恒等式将回火期望的导数与后验涨落联系起来,从而将WAIC、WBIC和奇异涨落纳入统一的响应框架。在此框架内,奇异学习理论中的经典量获得了自然的热力学解释:实对数典范阈值主导自由能的主要斜率,奇异涨落对应回火自由能的曲率,而WAIC度量预测涨落。我们形式化了一个可观测量代数,该代数商去了不可识别方向,使得在奇异模型中能够构造具有结构意义的序参量。在一系列典型奇异示例中——包括对称高斯混合模型、降秩回归和过参数化神经网络——我们通过实验展示了回火下类似相变的行为:序参量坍缩、敏感性峰值出现,且复杂度度量与后验几何的结构重组相一致。我们的结果表明,热力学响应理论为解释奇异贝叶斯学习中的复杂度、预测变异性和结构重组提供了一个自然的组织框架。
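Concretely, in the standard notation of singular learning theory (a sketch of the usual conventions, which need not match the paper's exact definitions), the tempered posterior and the covariance identity linking free-energy derivatives to posterior fluctuations read:

```latex
% Tempered posterior at inverse temperature \beta (prior \varphi, empirical loss L_n):
\[
  p_\beta(w \mid D_n) \;\propto\; \varphi(w)\, e^{-n\beta L_n(w)},
  \qquad
  F_n(\beta) \;=\; -\log \int \varphi(w)\, e^{-n\beta L_n(w)}\, dw .
\]
% Covariance identity: free-energy derivatives are posterior moments of the loss,
% so the slope is a tempered expectation and the curvature a posterior variance.
\[
  \partial_\beta F_n(\beta) \;=\; n\,\mathbb{E}_\beta\!\left[L_n(w)\right],
  \qquad
  \partial_\beta^2 F_n(\beta) \;=\; -\,n^2\,\operatorname{Var}_\beta\!\left[L_n(w)\right].
\]
```

Watanabe's asymptotics $F_n(1) = n L_n(\hat w) + \lambda \log n + O_p(\log\log n)$ identify the RLCT $\lambda$ with the leading $\log n$ slope, and WBIC is $\mathbb{E}_\beta[n L_n]$ evaluated at $\beta = 1/\log n$, which is how the abstract's "response framework" connects these criteria to the tempered free energy.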
📄 Real-time reconstruction of conditional quantum states from continuous measurement records is a fundamental requirement for quantum feedback control, yet standard stochastic master equation (SME) solvers require exact model specification and known system parameters, and are sensitive to parameter mismatch. While neural sequence models can fit these stochastic dynamics, unconstrained predictors can violate physicality constraints such as positivity or trace preservation, leading to unstable rollouts and unphysical estimates. We propose a Kraus-structured output layer that converts the hidden representation of a generic sequence backbone into a completely positive, trace-preserving (CPTP) quantum operation, yielding physically valid state updates by construction. We instantiate this layer across diverse backbones (RNN, GRU, LSTM, TCN, ESN, and Mamba), with Neural ODE as a comparative baseline, on stochastic trajectories characterized by parameter drift. Our evaluation reveals distinct trade-offs between gating mechanisms, linear recurrence, and global attention. Across all models, Kraus-LSTM achieves the strongest results, improving state estimation quality by 7% over its unconstrained counterpart while guaranteeing physically valid predictions in non-stationary regimes.
📄 从连续测量记录中实时重构条件量子态是量子反馈控制的基本要求,然而标准的随机主方程(SME)求解器需要精确的模型设定和已知的系统参数,且对参数失配敏感。虽然神经序列模型可以拟合这些随机动力学,但无约束的预测器可能违反物理性(如正定性或迹约束),导致不稳定的推演和非物理估计。我们提出了一种克劳斯结构输出层,可将通用序列主干网络的隐藏表示转换为完全正定保迹(CPTP)量子操作,从而通过构造得到物理有效的状态更新。我们在多种主干网络(RNN、GRU、LSTM、TCN、ESN和Mamba)中实例化了该层,并以神经ODE作为对比基线,在参数漂移的随机轨迹上进行测试。评估结果表明,门控机制、线性递归和全局注意力之间存在不同的权衡。在所有模型中,克劳斯-LSTM取得了最佳结果,在非稳态条件下将状态估计质量提升了7%,同时保证了物理有效的预测。
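The "physically valid by construction" idea behind such a Kraus-structured layer can be sketched in a few lines of NumPy. This is an illustrative parameterization, not the paper's actual layer: raw matrices read off an arbitrary hidden vector are renormalized so that their Kraus sum is exactly the identity, which makes the induced map CPTP no matter what the backbone outputs.

```python
import numpy as np

def kraus_layer(h, d=2, K=3):
    """Map an unconstrained hidden vector h (length 2*K*d*d) to Kraus
    operators of a CPTP map: A_k = B_k M^{-1/2} with M = sum_k B_k^† B_k,
    so that sum_k A_k^† A_k = I holds by construction."""
    n = K * d * d
    B = (h[:n] + 1j * h[n:2 * n]).reshape(K, d, d)   # raw complex matrices
    M = sum(b.conj().T @ b for b in B)               # Hermitian, positive definite
    w, U = np.linalg.eigh(M)
    M_inv_sqrt = U @ np.diag(w ** -0.5) @ U.conj().T # inverse matrix square root
    return B @ M_inv_sqrt                            # right-multiply every B_k

def apply_channel(A, rho):
    """rho' = sum_k A_k rho A_k^† (the CPTP state update)."""
    return sum(a @ rho @ a.conj().T for a in A)

rng = np.random.default_rng(0)
d, K = 2, 3
h = rng.normal(size=2 * K * d * d)     # stands in for a backbone's hidden state
A = kraus_layer(h, d, K)

rho = np.array([[0.7, 0.2 + 0.1j],     # a valid single-qubit density matrix
                [0.2 - 0.1j, 0.3]])
rho_out = apply_channel(A, rho)

assert np.allclose(sum(a.conj().T @ a for a in A), np.eye(d))  # trace preserving
assert np.isclose(np.trace(rho_out).real, 1.0)                 # trace kept
assert np.all(np.linalg.eigvalsh(rho_out) >= -1e-12)           # positivity kept
```

In a trained model the normalization would sit on top of the backbone's last hidden state; the point of the construction is that validity never depends on what the network has learned.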
📄 Quantum memory is a scarce and costly resource, yet little is known about which learning tasks remain feasible under severe memory constraints. We study the problem of computing global properties of quantum sequences when quantum systems must be measured individually, without storing or jointly processing them. In our setting, a bit string $x \in \{0,1\}^n$ is encoded into an $n$-qubit product state $|ψ_{x_1}\rangle \otimes \cdots \otimes |ψ_{x_n}\rangle$, and the goal is to infer $f(x) \in \{0,1\}$ from measurements of this quantum encoding. We consider a simple local strategy, which we call the greedy strategy, that applies the same optimal single-system measurement independently to each subsystem and then infers $f(x)$ from the outcomes. Our main result gives a complete characterization of when the greedy strategy is optimal: it achieves the same maximum success probability as an unrestricted global measurement if and only if the target Boolean function is affine (in all but finitely many cases). We establish a universal performance guarantee for general Boolean functions, showing that the success probability of the greedy strategy is always at least the square of the optimal global success probability, in direct analogy with the Barnum-Knill bound for the pretty good measurement. These results demonstrate that even under extreme memory constraints, simple local measurement strategies can remain provably competitive for learning global properties of quantum sequences.
📄 量子存储器是一种稀缺且昂贵的资源,然而在严格的内存限制下哪些学习任务仍然可行,目前尚知之甚少。我们研究了在必须单独测量量子系统、而不存储或联合处理它们的情况下,如何计算量子序列的全局性质的问题。在我们的设定中,一个比特串 $x \in \{0,1\}^n$ 被编码为一个 $n$ 量子比特的乘积态 $|ψ_{x_1}\rangle \otimes \cdots \otimes |ψ_{x_n}\rangle$,目标是通过测量这个量子编码来推断 $f(x) \in \{0,1\}$。我们考虑一种简单的局部策略,称之为贪婪策略,该策略对每个子系统独立应用相同的最优单系统测量,然后根据测量结果推断 $f(x)$。我们的主要结果完整刻画了贪婪策略何时是最优的:当且仅当目标布尔函数是仿射的(除了有限个例外情况),它能够达到与无限制全局测量相同的最大成功概率。我们为一般布尔函数建立了一个普适的性能保证,表明贪婪策略的成功概率始终至少是最优全局成功概率的平方,这与“相当好测量”的 Barnum-Knill 界直接类似。这些结果表明,即使在极端的内存限制下,简单的局部测量策略对于学习量子序列的全局性质仍然可以保持可证明的竞争力。
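The squared-probability guarantee can be checked numerically on a small non-affine example. The sketch below assumes one particular reading of the greedy strategy, the equal-prior Helstrom measurement applied to each qubit followed by maximum-likelihood inference, and uses AND on two bits (a non-affine function, so greedy is not expected to match the global optimum); the encoding and priors are illustrative, not the paper's exact setup.

```python
import numpy as np

# Two non-orthogonal single-qubit encodings |psi_0>, |psi_1>, overlap cos(theta)
theta = np.pi / 5
psi = [np.array([1.0, 0.0]), np.array([np.cos(theta), np.sin(theta)])]
proj = [np.outer(p, p) for p in psi]

xs = [(a, b) for a in (0, 1) for b in (0, 1)]
f = {x: x[0] & x[1] for x in xs}   # f = AND(x1, x2), non-affine

# Optimal global measurement: Helstrom bound between the two labelled
# ensembles (uniform prior over x folded into the unnormalized matrices)
rho = {0: np.zeros((4, 4)), 1: np.zeros((4, 4))}
for x in xs:
    rho[f[x]] += 0.25 * np.kron(proj[x[0]], proj[x[1]])
p_global = 0.5 * (1 + np.abs(np.linalg.eigvalsh(rho[1] - rho[0])).sum())

# Greedy: the same equal-prior Helstrom measurement on every qubit,
# then maximum-likelihood inference of f from the outcome string
w, U = np.linalg.eigh(0.5 * (proj[0] - proj[1]))
Pi0 = sum(np.outer(U[:, i], U[:, i]) for i in range(2) if w[i] > 0)
p_out = lambda o, a: float(psi[a] @ (Pi0 if o == 0 else np.eye(2) - Pi0) @ psi[a])
p_greedy = 0.0
for o in xs:  # iterate over all outcome strings
    like = {x: 0.25 * p_out(o[0], x[0]) * p_out(o[1], x[1]) for x in xs}
    guess = max((sum(v for x, v in like.items() if f[x] == y), y) for y in (0, 1))[1]
    p_greedy += sum(v for x, v in like.items() if f[x] == guess)

print(f"global: {p_global:.4f}  greedy: {p_greedy:.4f}")
assert p_global >= p_greedy - 1e-12   # greedy is one particular strategy
assert p_greedy >= p_global ** 2      # the squared-probability guarantee
```

For this non-affine target the greedy success probability falls strictly below the global optimum, consistent with the affine-iff-optimal characterization, while still clearing the Barnum-Knill-style square bound.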
📄 Current video generation models cannot simulate the physical consequences of 3D actions such as forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision that RealWonder will open new opportunities for applying video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available on our project website: https://liuwei283.github.io/RealWonder/
📄 当前的视频生成模型无法模拟三维动作(如力与机器人操控)的物理效应,因为它们缺乏对动作如何影响三维场景的结构性理解。我们提出了RealWonder——首个基于单张图像、以动作为条件的实时视频生成系统。我们的核心思路是将物理模拟作为中间桥梁:不直接编码连续动作,而是通过物理模拟将其转化为视频模型能够处理的视觉表征(光流与RGB图像)。RealWonder整合了三个组件:单图像三维重建、物理模拟,以及仅需4步扩散过程的蒸馏视频生成器。该系统在480×832分辨率下达到13.2帧/秒的生成速度,支持对刚性物体、可变形体、流体和颗粒材料进行力作用、机器人操作及相机控制的交互式探索。我们预见RealWonder将为视频模型在沉浸式体验、AR/VR和机器人学习领域的应用开辟新机遇。代码与模型权重已在项目网站开源:https://liuwei283.github.io/RealWonder/
📄 Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot-cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Experiments show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.
📄 微流体流动中的接触密集型微操作具有挑战性,因为微小扰动可能破坏推动接触并引发显著的横向漂移。本研究利用磁控滚动微型机器人在时变泊肃叶流下跟踪路径点采样的参考曲线,探索平面细胞推动操作。我们提出了一种混合控制器,通过SAC训练得到的残差策略对名义模型预测控制进行增强。该策略输出有界的二维速度校正量,且校正动作受接触门控机制约束——仅在机器人-细胞接触时施加残差动作,从而保持可靠的接近行为并稳定学习过程。所有方法均采用相同的驱动接口与速度范围以确保公平比较。实验表明,在非稳态流动下,该方法相较于纯模型预测控制与PID控制展现出更强的鲁棒性与跟踪精度,并能够从训练所用的三叶草曲线泛化至未见的圆形与方形轨迹。通过对残差边界进行扫描分析,确定中等程度的校正限制为最佳平衡点,该设定被应用于所有基准测试中。
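A minimal sketch of the contact-gated, bounded residual control law described above (the residual bound, speed limit, and gating signal are illustrative placeholders; the paper selects its actual residual bound via a sweep):

```python
import numpy as np

def hybrid_command(u_mpc, residual, in_contact, res_bound=0.3, speed_max=1.0):
    """Nominal MPC command plus a contact-gated, bounded residual correction.

    The learned correction is clipped to res_bound and zeroed whenever the
    robot is not in contact with the cell, so the approach phase is governed
    purely by the nominal controller; the combined command is then projected
    back into the shared speed envelope."""
    correction = (np.clip(residual, -res_bound, res_bound)
                  if in_contact else np.zeros_like(u_mpc))
    u = u_mpc + correction
    speed = np.linalg.norm(u)
    if speed > speed_max:
        u = u * (speed_max / speed)
    return u

u_mpc = np.array([0.6, 0.2])
res = np.array([0.9, -0.5])            # raw policy output (illustrative)
print(hybrid_command(u_mpc, res, in_contact=False))  # no contact: pure MPC command
print(hybrid_command(u_mpc, res, in_contact=True))   # residual clipped to +-0.3
```

Zeroing the correction outside contact is what lets the residual policy learn only the hard in-contact regime without destabilizing the approach phase.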
📄 Machine-learned interatomic potentials (MLIPs) promise to provide near density-functional theory accuracy at a fraction of the computational cost, offering a transformative route toward genuinely predictive chemistry. Yet their predictive validity beyond the training regime remains largely untested experimentally. Here we use pressure-dependent broadband inelastic neutron spectroscopy (INS) as a direct experimental probe of MLIP transferability. Employing a newly developed high-pressure superalloy clamp cell, we measure INS spectra of crystalline 2,5-diiodothiophene at 10 K under ambient conditions and at 1.5 GPa. A MACE-based MLIP, fine-tuned on targeted DFT data, reproduces the experimental spectra across 0–1200 cm$^{-1}$ at both pressures and remains thermodynamically stable under rigorous molecular dynamics validation at 300 K. The model captures systematic pressure-induced blue shifts arising from steric stiffening and reproduces an anomalous red shift at 453 cm$^{-1}$ driven by pressure-modified intermolecular interactions, providing direct validation of its many-body character. This constitutes the first experimental demonstration of MLIP transferability across distinct thermodynamic states using neutron spectroscopy, and establishes high-pressure INS as a stringent benchmark for predictive machine-learned potentials.
📄 机器学习原子间势(MLIPs)有望以远低于传统计算成本的代价,提供接近密度泛函理论精度的结果,为真正实现预测性化学开辟了变革性路径。然而,其在训练范围之外的预测有效性仍缺乏实验验证。本研究采用压力依赖的宽带非弹性中子散射(INS)作为MLIP可迁移性的直接实验探针。利用新开发的高压超合金夹持腔,我们测量了2,5-二碘噻吩晶体在10K温度下常压及1.5GPa高压条件下的INS谱。基于MACE架构的MLIP模型通过针对性DFT数据微调后,在两种压力下均能准确复现0-1200 cm$^{-1}$范围内的实验谱图,并在300K严格分子动力学验证中保持热力学稳定性。该模型成功捕捉了由空间位阻强化引起的系统性压力蓝移,并重现了453 cm$^{-1}$处由压力调控分子间相互作用驱动的反常红移,直接验证了其多体相互作用特性。这项工作首次通过中子散射实验证明了MLIP在不同热力学状态间的可迁移性,确立了高压INS作为预测性机器学习势函数的严格基准测试方法。
📄 Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model's internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
📄 幻觉问题始终是视觉语言模型面临的一项持续挑战,这类模型常会描述不存在的物体或捏造事实。现有的检测方法通常在文本生成后运行,导致干预成本高昂且时机滞后。我们研究是否能在生成任何词元之前,通过单次前向传播探查模型内部表征来预测幻觉风险。在涵盖多样化视觉语言任务及八种现代视觉语言模型(包括Llama-3.2-Vision、Gemma-3、Phi-4-VL和Qwen2.5-VL)的实验中,我们考察了三类内部表征:(一)未经过多模态融合的纯视觉特征;(二)文本解码器内的视觉词元表征;(三)生成前融合视觉与文本信息的查询词元表征。基于这些表征训练的探查器无需解码即可实现强大的幻觉检测性能,在Gemma-3-12B、Phi-4-VL 5.6B和Molmo 7B上最高达到0.93的AUROC值。对于多数模型,后期查询词元状态的预测性最强,而少数架构中纯视觉特征或中间层特征占主导地位(例如Qwen2.5-VL-7B使用纯视觉特征时AUROC约0.79)。这些结果表明:(1)幻觉风险可在生成前被检测;(2)最具信息量的层级和模态因架构而异;(3)轻量级探查器有望实现早期弃权、选择性路由和自适应解码,从而提升安全性与效率。
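A pre-generation probe of this kind is essentially a linear classifier on frozen hidden states, scored by AUROC. The sketch below uses synthetic stand-in features and a simple class-mean probe; the dimensions, effect size, and probe choice are invented for illustration and are not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for pre-generation internal states: d-dimensional hidden vectors
# with a weak linear signal separating hallucinated (y=1) from grounded (y=0)
# responses.
d, n = 64, 4000
y = rng.integers(0, 2, size=n)
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
X = rng.normal(size=(n, d)) + 0.8 * y[:, None] * w_true

# Lightweight linear probe: class-mean difference direction, fitted on the
# first half of the data and scored on the held-out second half (no decoding)
half = n // 2
mu1 = X[:half][y[:half] == 1].mean(axis=0)
mu0 = X[:half][y[:half] == 0].mean(axis=0)
scores = X[half:] @ (mu1 - mu0)
y_te = y[half:]

# AUROC via the Mann-Whitney statistic: probability that a random positive
# example outscores a random negative one
pos, neg = scores[y_te == 1], scores[y_te == 0]
auroc = (pos[:, None] > neg[None, :]).mean()
print(f"probe AUROC: {auroc:.3f}")
```

A probe like this runs in a single forward pass plus one dot product per example, which is what makes early abstention or routing decisions cheap relative to full decoding.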
📄 Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.
📄 建立共同基础——即一套共享的信念和相互认可的事实——是协作的根本,但对当前人工智能系统而言仍是一项挑战,尤其是在多模态、多方参与的场景中,协作各方往往掌握不同的信息。我们提出了分布式部分信息谜题(DPIP),这是一种在认知不对称条件下引发丰富多模态交流的协作建构任务。我们构建了记录这些互动的多模态数据集,该数据集经过标注并在语音、手势与动作模态间实现时间对齐,以支持对命题内容与信念动态的推理。随后,我们评估了两种建模共同基础(CG)的范式:(1)采用前沿的大型语言模型(LLMs),通过多模态信息更新推断共享信念;(2)基于动态认知逻辑(DEL)构建的公理化流程,以增量方式执行相同任务。在已标注的DPIP数据上的实验结果表明,该任务对现代LLMs同时追踪任务进展与信念状态的能力构成了挑战。
📄 We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
📄 我们专注于基于密集意图描述检索美甲设计图像的任务,这类描述代表了用户对美甲设计多层次、细粒度的意图。该任务具有挑战性,因为此类描述不仅涉及无限制的手绘元素和预制装饰物,还包含视觉特征、主题风格及整体印象。除文字描述外,我们假设用户可通过调色板指定零到多种颜色进行配色查询,从而表达微妙且连续的色彩层次。现有的视觉-语言基础模型往往难以有效融合此类描述与配色信息。为此,我们提出NaiLIA——一种面向美甲设计图像的多模态检索方法,能够在检索过程中全面对齐密集意图描述与配色查询。该方法引入基于置信度的松弛损失函数,可驱动未标注图像与描述语义对齐。为评估NaiLIA,我们构建了包含10,625张图像的基准数据集,这些图像采集自多元文化背景人群,并由超过200名标注者提供了长文本密集意图描述。实验结果表明,NaiLIA在检索性能上显著优于现有标准方法。
📄 Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement-learning-based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner raises F1 to 70.22%, compared with 59.83% zero-shot and 68.23% after supervised fine-tuning. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
📄 多模态讽刺检测需要通过跨模态推理,化解文本、声学和视觉线索之间的语用不一致性。为实现基于基础模型的稳健讽刺推理,我们提出SarcasmMiner——一个基于强化学习的后训练框架,旨在抑制多模态推理中的幻觉现象。我们将讽刺检测重构为结构化推理任务,并采用双轨蒸馏策略:高质量教师轨迹初始化学生模型,而完整轨迹集则用于训练生成式奖励模型(GenRM)以评估推理质量。通过解耦的准确性与推理质量奖励,采用群体相对策略优化(GRPO)对学生模型进行优化。在MUStARD++数据集上,SarcasmMiner将F1分数从零样本学习的59.83%、监督微调的68.23%提升至70.22%。这些结果表明,推理感知的奖励建模能同时增强模型性能与多模态语义基础。
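GRPO's critic-free advantages are group-relative: each reward is standardized within the group of rollouts sampled for the same prompt. A minimal sketch, with the "decoupled" accuracy and GenRM-quality rewards standardized separately before combination; the combination weight and all reward values are invented for illustration, and this is only one plausible reading of the paper's decoupling.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: standardize each reward within
    the group of rollouts sampled for the same prompt (no learned critic)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical group of 4 sampled reasoning traces for one sarcasm example:
acc = np.array([1.0, 0.0, 1.0, 0.0])        # binary accuracy reward
quality = np.array([0.9, 0.6, 0.4, 0.2])    # GenRM-style reasoning-quality score

# Decoupled rewards: standardize each signal separately, then combine with a
# weight, so neither signal's scale dominates the policy update.
adv = grpo_advantages(acc) + 0.5 * grpo_advantages(quality)
print(np.round(adv, 3))
```

Standardizing per signal before mixing is what keeps a dense quality score from drowning out a sparse binary accuracy reward (or vice versa) in the advantage estimate.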
📄 We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85,792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2,224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
📄 我们推出“多语言云语料库”,这是孟加拉国首个国家级、平行、多模态的少数民族及原住民语言数据集。尽管孟加拉国拥有约40种分属四大语系的少数民族语言,其中14种被列为濒危语言,但这些以口语为主、在计算语言学上属于“零资源”的语言长期缺乏跨语系的系统性数字语料库。本语料库包含85,792条结构化文本条目(每条含孟加拉语刺激文本、英语译文及国际音标转写)与约107小时的转写音频,覆盖藏缅、印欧、南亚和达罗毗荼四大语系的42种语言变体,以及两种未分类语系语言。数据通过为期90天的系统性田野调查采集,覆盖孟加拉国9个地区,动员16名采集员、77名发音人及43名校验员,采用包含2,224个独立项目的三层级结构化采集模板:孤立词汇项(22个语义域的475个词)、语法结构(21个类别共887个句子,含动词变位范式)和引导性话语(46个对话场景的862个提示句)。田野调查后处理包括10名语言学家完成的音标转写及6名评审员的独立审核。完整数据集通过“多语言云”平台(multiling.cloud)公开,提供所有语言变体的可检索标注音频与文本数据。本文详述语料库设计、田野调查方法、数据集结构及分语言覆盖情况,并探讨其对濒危语言记录、低资源自然语言处理及语言多样化发展中国家数字保存的启示。
📄 Recent studies have demonstrated that incorporating auxiliary information, such as speaker voiceprint or visual cues, can substantially improve Speech Enhancement (SE) performance. However, single-channel methods often yield suboptimal results in low signal-to-noise ratio (SNR) conditions, when there is high reverberation, or in complex scenarios involving dynamic speakers, overlapping speech, or non-stationary noise. To address these issues, we propose a novel Visual-Informed Neural Beamforming Network (VI-NBFNet), which integrates microphone array signal processing and deep neural networks (DNNs) using multimodal input features. The proposed network leverages a pretrained visual speech recognition model to extract lip movements as input features, which are used for voice activity detection (VAD) and target speaker identification. The system handles both static and moving speakers through a supervised end-to-end beamforming framework equipped with an attention mechanism. Experimental results demonstrate that the proposed audiovisual system achieves better SE performance and robustness than several baseline methods in both stationary and dynamic speaker scenarios.
📄 近期研究表明,融入说话人声纹或视觉线索等辅助信息可显著提升语音增强性能。然而,单通道方法在低信噪比、高混响环境,或涉及动态说话人、重叠语音、非平稳噪声的复杂场景中往往表现欠佳。为解决这些问题,我们提出一种新颖的视觉引导神经波束成形网络,该网络通过多模态输入特征融合了麦克风阵列信号处理与深度神经网络。该网络利用预训练的视觉语音识别模型提取唇部运动特征,用于语音活动检测和目标说话人识别。通过引入配备注意力机制的监督式端到端波束成形框架,系统能够同时处理静态与动态说话人场景。实验结果表明,相较于多种基线方法,所提出的视听系统在静态与动态说话人场景中均实现了更优的语音增强性能和鲁棒性。
📄 Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose Wiki-R1, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model's evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce controllable curriculum data generation, which manipulates the retriever to produce samples at desired difficulty levels, and a curriculum sampling strategy that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
📄 基于知识的视觉问答(KB-VQA)要求模型通过整合外部知识来回答关于图像的问题,由于知识检索的噪声以及知识库本身具有结构化、百科全书式的特性,这一任务面临重大挑战。这些特点造成了与预训练多模态大语言模型(MLLMs)之间的分布差异,使得在后续训练阶段难以实现有效的推理和领域适应。本文提出 **Wiki-R1**,一种基于数据生成的课程强化学习框架,系统性地激励 MLLMs 在 KB-VQA 任务中进行推理。Wiki-R1 构建了一系列与模型能力演进相匹配的训练分布,从而弥合了从预训练到 KB-VQA 目标分布之间的差距。我们引入了 **可控课程数据生成**,通过操纵检索器生成具有指定难度级别的样本,以及一种 **课程采样策略**,该策略选择在强化学习更新过程中可能产生非零优势的信息丰富样本。样本难度通过观测到的奖励进行估计,并传播到未观测样本中以指导学习。在两个 KB-VQA 基准测试(Encyclopedic VQA 和 InfoSeek)上的实验表明,Wiki-R1 取得了新的最先进成果:在 Encyclopedic VQA 上将准确率从 35.5% 提升至 37.1%,在 InfoSeek 上从 40.1% 提升至 44.1%。项目页面详见 https://artanic30.github.io/project_pages/WikiR1/。
📄 Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
📄 近年来,大语言模型(LLMs)的进展为多模态推理开辟了新途径。然而,现有方法大多仍依赖预训练的视觉语言模型(VLMs)分别编码图像-文本对,忽略了现实世界多模态数据天然形成的关系结构。这促使我们在多模态图(MMGs)上进行推理,其中每个节点都具有文本和视觉属性,边则提供结构线索。要在保持图拓扑的同时,基于LLM对此类异质多模态信号进行推理,面临两大关键挑战:解决跨模态一致性弱的问题以及处理异质模态偏好。为此,我们提出Mario——一个统一框架,能同时应对上述两个挑战,实现基于LLM的高效多模态图推理。Mario包含两个创新阶段:首先,采用图条件化VLM设计,通过图拓扑引导的细粒度跨模态对比学习,联合优化文本与视觉特征;其次,引入模态自适应图指令调优机制,将对齐后的多模态特征组织为图感知的指令视图,并利用可学习路由器为每个节点及其邻域筛选出对LLM最有效的信息模态配置。在多个多模态图基准测试上的大量实验表明,无论是监督学习还是零样本场景下的节点分类与链接预测任务,Mario均持续优于当前最先进的图模型。代码将在https://github.com/sunyuanfu/Mario公开。
📄 Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three types of exogenous factors that influence temporal dynamics through distinct interaction modes. Based on this empirical insight, we propose Aura, a universal framework that explicitly organizes and encodes heterogeneous external information according to its interaction mode with the target time series. Specifically, Aura utilizes a tailored tripartite encoding mechanism to embed heterogeneous features into well-established time series models, ensuring seamless integration of non-sequential context. Extensive experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, demonstrate that Aura consistently outperforms all baselines and exhibits superior adaptability. Our findings highlight Aura's potential as a general-purpose enhancement for aviation safety and reliability.
📄 时间序列预测在各类工业应用中的需求日益增长,其预测准确性对科学决策至关重要。在实际场景中,可靠的预测不仅需要数值时间序列数据,还需整合多样化的外生因素。这类外生信息往往具有多维甚至多模态特性,会引入异质性交互关系,而单模态时间序列模型难以有效捕捉此类复杂关联。本文以航空维修场景为切入点,识别出三种通过不同交互模式影响时序动态的异质外生因素。基于这一实证发现,我们提出了Aura——一个通用框架,能够根据外生信息与目标时间序列的交互模式,显式地组织并编码异质外部信息。具体而言,Aura采用定制化的三方编码机制,将异质特征嵌入成熟的时间序列模型,确保非序列上下文信息的无缝融合。基于中国南方航空波音777和空客A320机队长达三年的大规模工业数据集实验表明,Aura在所有基线模型中均取得最先进的预测性能,并展现出卓越的适应能力。本研究凸显了Aura作为通用增强框架在提升航空安全与可靠性方面的潜力。
📄 In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
📄 在现实世界的多模态应用中,系统通常需要理解用户任意组合和交错的多模态输入,同时还需生成任意交错的多媒体形式输出。这种能力定义了在统一理解与生成范式下任意到任意交错多模态学习的目标,为推进多模态大语言模型(MLLMs)的发展带来了新的挑战与机遇。为促进和评估这一能力,本文提出了UniM基准数据集——首个统一的任意到任意交错多模态数据集。UniM包含31,000个高质量实例,涵盖30个领域和7种代表性模态:文本、图像、音频、视频、文档、代码和3D,每个实例均需要多种交织的推理与生成能力。我们进一步推出UniM评估套件,从三个维度评估模型性能:语义正确性与生成质量、响应结构完整性以及交错连贯性。此外,我们提出了UniMA模型作为基线,这是一种具备可追溯推理能力的智能体模型,专为结构化交错生成而设计。综合实验表明UniM任务具有较高难度,同时揭示了推进统一任意到任意多模态智能发展的关键挑战与方向。项目页面详见:https://any2any-mllm.github.io/unim。
📄 This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
📄 本研究提出了一种用于检测紧急车辆蓝色警示灯的先进系统,该系统基于ABLDataset开发——这是一个包含欧洲紧急车辆在不同气候与地理条件下图像的精选数据集。系统采用四台鱼眼摄像头配置,每台摄像头水平视场角为180度,安装在车辆侧面。通过校准过程实现了检测目标的方位角定位。此外,研究对主流深度神经网络算法进行了比较分析,包括YOLO(v5、v8和v10)、RetinaNet、Faster R-CNN和RT-DETR。最终选择RT-DETR作为基础模型,并通过引入颜色注意力模块进行增强,在测试集上实现了94.7%的准确率和94.1%的召回率,现场测试检测距离可达70米。该系统还通过几何变换估算紧急车辆相对于本车中心的接近角度。该设计可集成到结合视觉与声学数据的多模态系统中,已展现出高效性能,为增强高级驾驶辅助系统(ADAS)和提升道路安全提供了具有前景的解决方案。
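The azimuthal localization step above can be illustrated with a toy calculation. This sketch assumes an idealized equidistant fisheye model in which the 180-degree horizontal field of view maps linearly onto image columns; the function name and camera yaw values are hypothetical, not taken from the paper:

```python
import math

def detection_azimuth(u_px, image_width, camera_yaw_deg, fov_deg=180.0):
    """Map a detection's horizontal pixel coordinate to a vehicle-frame
    azimuth, assuming an ideal equidistant fisheye projection where the
    horizontal FOV is spread linearly across the image width."""
    # Angular offset from the camera's optical axis, in degrees.
    offset = (u_px / image_width - 0.5) * fov_deg
    # Rotate into the vehicle frame and wrap to [0, 360).
    return (camera_yaw_deg + offset) % 360.0

# Four side-mounted cameras (hypothetical yaw angles relative to the car's nose).
CAMERA_YAWS = {"front_left": 45.0, "rear_left": 135.0,
               "rear_right": 225.0, "front_right": 315.0}

# A blue-light detection centred in the front-left camera points at 45 degrees.
angle = detection_azimuth(u_px=960, image_width=1920,
                          camera_yaw_deg=CAMERA_YAWS["front_left"])
```

A real calibration would replace the linear mapping with the lens's measured projection curve; the sketch only shows how a pixel column becomes an approach angle.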
📄 This exploratory pilot study investigates the impact of haptic perception --specifically tactile sensitivity (touch) and kinaesthetic intensity (movement)-- on learning, operationalized as information retention (immediate recall) through handwriting. Participants (N=20) were randomly assigned to one of four experimental groups in a 2x2 factorial design, manipulating touch (via glove use) and movement (via increased writing pressure). Information retention was measured using an immediate recall test, while mental effort (reaction time in a secondary task) and perceived workload (NASA-TLX) were examined as mediating variables. Bayesian binomial regression revealed moderate evidence that increased writing pressure negatively influenced recall (85-88% probability of negative effect), whereas glove use alone demonstrated no clear effect. Bayesian mediation analysis found no strong evidence that mental effort or perceived workload mediated these effects, as all 95% credible intervals included zero, indicating substantial uncertainty. These findings suggest that increased kinaesthetic demands may slightly impair immediate recall, independent of perceived workload or mental effort. Importantly, the manipulation of touch alone does not appear to influence information retention. The study contributes to understanding the nuanced relationship between embodied interactions and cognitive outcomes, with implications for designing sensor-based multimodal learning environments.
📄 这项探索性先导研究考察了触觉感知——具体包括触觉敏感性(触摸)与动觉强度(运动)——对学习的影响,其中学习通过手写过程中的信息保持(即时回忆)来操作化定义。研究采用2×2因子设计,将参与者(N=20)随机分配至四个实验组之一,通过佩戴手套(操控触觉)与增加书写压力(操控运动)进行变量控制。信息保持通过即时回忆测试进行测量,同时将心理努力(次要任务反应时)与感知负荷(NASA-TLX量表)作为中介变量进行考察。贝叶斯二项回归分析显示,有中等程度证据表明增加书写压力对回忆产生负面影响(负面效应的概率为85-88%),而单独使用手套则未表现出明确影响。贝叶斯中介分析发现,心理努力与感知负荷均未对这些效应产生显著中介作用,所有95%可信区间均包含零值,表明存在较大不确定性。这些发现提示,动觉需求的增加可能轻微损害即时回忆,且这种影响独立于感知负荷或心理努力。值得注意的是,单独操控触觉似乎并不影响信息保持。本研究有助于深化对具身交互与认知结果之间复杂关系的理解,并为设计基于传感器的多模态学习环境提供了启示。
📄 Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
📄 大型多模态模型(LMMs)在视觉语言理解方面已展现出强大的性能,然而许多现有方法依赖于大规模架构和粗粒度监督,这限制了其生成详细图像描述的能力。本研究提出VisionPangu——一个拥有17亿参数的紧凑型多模态模型,旨在通过高效的多模态对齐和高质量监督来提升细节化图像描述能力。该模型通过轻量级MLP投影器,将InternVL衍生的视觉编码器与OpenPangu-Embedded语言主干网络相结合,并采用受LLaVA启发的指令微调流程。通过引入DOCCI数据集中人工撰写的密集描述,VisionPangu在不依赖激进模型扩增的情况下,显著提升了语义连贯性与描述丰富度。实验结果表明,紧凑型多模态模型能够生成更具结构性和细节的描述,同时保持具有竞争力的性能表现。代码与模型权重已公开于:https://www.modelscope.cn/models/asdfgh007/visionpangu。
📄 Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
📄 视觉语言模型正越来越多地应用于医学影像和个人照片等敏感领域,然而现有的上下文学习差分隐私方法仅限于少样本、纯文本场景,因为隐私成本会随处理的标记数量增加而上升。我们提出差分隐私多模态任务向量(DP-MTV),这是首个在形式化$(\varepsilon, \delta)$-差分隐私保证下,通过将数百个示例聚合为激活空间中的紧凑任务向量,实现多样本多模态上下文学习的框架。DP-MTV将私有数据划分为互不相交的数据块,应用逐层裁剪以限制敏感度,并对聚合结果添加校准噪声;仅需单次噪声添加即可支持无限次推理查询。我们在三种视觉语言模型架构的八个基准测试上进行评估,该方法支持在有或无辅助数据的情况下部署。在$\varepsilon=1.0$的隐私约束下,DP-MTV在VizWiz数据集上达到50%准确率(非隐私方法为55%,零样本为35%),在有意义的隐私保护条件下保留了上下文学习带来的大部分性能增益。
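The clip-average-noise aggregation described in the abstract can be sketched in a few lines. This is a toy Gaussian-mechanism illustration with hypothetical names and 2-D stand-ins for activation-space task vectors; it omits the per-layer structure and the formal privacy accounting of the actual method:

```python
import math
import random

def dp_aggregate_task_vectors(vectors, clip_norm=1.0, sigma=0.5, seed=0):
    """Clip each chunk-level task vector to an L2 bound, average, and add
    Gaussian noise once to the aggregate. With n disjoint chunks, changing
    one chunk moves the mean by at most clip_norm / n, so the noise is
    calibrated to that sensitivity (toy sketch, not the paper's code)."""
    rng = random.Random(seed)
    n, dim = len(vectors), len(vectors[0])
    clipped = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in v])
    mean = [sum(v[i] for v in clipped) / n for i in range(dim)]
    sensitivity = clip_norm / n
    # Single noise addition: the released vector can answer unlimited queries.
    return [m + rng.gauss(0.0, sigma * sensitivity) for m in mean]

chunks = [[3.0, 4.0], [0.6, 0.8], [0.0, 0.0]]   # toy 2-D "task vectors"
private_vec = dp_aggregate_task_vectors(chunks, clip_norm=1.0, sigma=0.0)
```

Because noise is added once to the aggregate rather than per query, inference on the released vector incurs no further privacy cost, which is the property the abstract highlights.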
📄 Effective robot autonomy requires motion generation that is safe, feasible, and reactive. Current methods are fragmented: fast planners output physically unexecutable trajectories, reactive controllers struggle with high-fidelity perception, and existing solvers fail on high-DoF systems. We present cuRoboV2, a unified framework with three key innovations: (1) B-spline trajectory optimization that enforces smoothness and torque limits; (2) a GPU-native TSDF/ESDF perception pipeline that generates dense signed distance fields covering the full workspace, unlike existing methods that only provide distances within sparsely allocated blocks, up to 10x faster and in 8x less memory than the state-of-the-art at manipulation scale, with up to 99% collision recall; and (3) scalable GPU-native whole-body computation, namely topology-aware kinematics, differentiable inverse dynamics, and map-reduce self-collision, that achieves up to 61x speedup while also extending to high-DoF humanoids (where previous GPU implementations fail). On benchmarks, cuRoboV2 achieves 99.7% success under 3kg payload (where baselines achieve only 72--77%), 99.6% collision-free IK on a 48-DoF humanoid (where prior methods fail entirely), and 89.5% retargeting constraint satisfaction (vs. 61% for PyRoki); these collision-free motions yield locomotion policies with 21% lower tracking error than PyRoki and 12x lower cross-seed variance than mink. A ground-up codebase redesign for discoverability enabled LLM coding assistants to author up to 73% of new modules, including hand-optimized CUDA kernels, demonstrating that well-structured robotics code can unlock productive human--LLM collaboration. Together, these advances provide a unified, dynamics-aware motion generation stack that scales from single-arm manipulators to full humanoids.
📄 有效的机器人自主性需要安全、可行且反应灵敏的运动生成。现有方法较为零散:快速规划器输出的轨迹往往物理上不可执行,反应式控制器难以处理高精度感知,而现有求解器无法应对高自由度系统。我们提出cuRoboV2——一个具有三项关键创新的统一框架:(1)采用B样条轨迹优化,确保平滑性并满足扭矩限制;(2)构建GPU原生的TSDF/ESDF感知流水线,生成覆盖整个工作空间的稠密符号距离场(现有方法仅能在稀疏分配的区块内提供距离信息),在操作尺度上比现有最优方法快10倍、内存占用减少8倍,碰撞召回率高达99%;(3)可扩展的GPU原生全身计算模块(包括拓扑感知运动学、可微逆动力学及映射归约自碰撞检测),实现最高61倍加速,并能扩展至高自由度人形机器人(此前GPU方案均无法实现)。在基准测试中,cuRoboV2在3kg负载下成功率高达99.7%(基线方法仅为72-77%),在48自由度人形机器人上实现99.6%的无碰撞逆运动学求解(现有方法完全失效),重定向约束满足率达89.5%(PyRoki仅为61%);这些无碰撞运动生成的步态策略跟踪误差比PyRoki降低21%,跨种子方差仅为mink的1/12。通过彻底重构代码库以提升可发现性,大语言模型编程助手可完成高达73%的新模块开发(包括手动优化的CUDA内核),证明结构良好的机器人代码能实现高效的人机(LLM)协作。这些进展共同构成了一个统一的、具备动力学感知的运动生成框架,可从单臂机械臂扩展到完整人形机器人。
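The dense ESDF idea - a signed distance to the nearest obstacle at every cell of the workspace, not only inside sparsely allocated blocks - can be shown with a brute-force toy on a small 2-D grid. The paper's pipeline is GPU-native and 3-D; the names and layout here are purely illustrative:

```python
import math

def esdf(grid):
    """Dense Euclidean signed distance field for a small 2-D occupancy grid:
    positive distance to the nearest obstacle in free space, negative
    distance to the nearest free cell inside obstacles (brute force)."""
    h, w = len(grid), len(grid[0])
    occ = [(r, c) for r in range(h) for c in range(w) if grid[r][c]]
    free = [(r, c) for r in range(h) for c in range(w) if not grid[r][c]]

    def nearest(r, c, cells):
        return min(math.hypot(r - rr, c - cc) for rr, cc in cells)

    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if grid[r][c]:
                out[r][c] = -nearest(r, c, free) if free else 0.0
            else:
                out[r][c] = nearest(r, c, occ) if occ else math.inf
    return out

# 1 = obstacle; the field is dense over the whole workspace.
field = esdf([[0, 0, 0],
              [0, 1, 0],
              [0, 0, 0]])
```

A GPU implementation replaces the quadratic scan with parallel wavefront propagation, but the output contract (a signed distance defined everywhere) is the same.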
📄 The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automated synthesis and layout. Evaluations on ISCAS'85 and ISCAS'89 benchmark designs demonstrate up to 36% area reduction, 35% delay reduction, and 70% power savings compared to baseline designs, highlighting its potential to democratize ASIC design and accelerate hardware innovation.
📄 硬件设计日益复杂,高层次规范与寄存器传输级(RTL)实现之间的鸿沟不断扩大,这阻碍了快速原型设计与系统开发的进程。我们提出NL2GDS(自然语言到版图)——一种创新框架,该框架利用大语言模型(LLM)将自然语言硬件描述转化为可综合的RTL,并通过开源OpenLane ASIC流程生成完整的GDSII版图。NL2GDS采用模块化流程:捕捉非形式化的设计意图、使用多LLM引擎生成硬件描述语言并进行验证、协调自动化综合与版图生成。在ISCAS'85和ISCAS'89基准设计上的评估显示,相较于基线设计,该框架可实现高达36%的面积缩减、35%的延迟降低以及70%的功耗节约,彰显了其在推动ASIC设计普及与加速硬件创新方面的潜力。
📄 While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
📄 尽管视频理解数据集已扩展至小时级时长,但这些数据集通常由密集拼接的片段组成,与自然、无脚本的日常生活存在差异。为弥合这一差距,我们推出了MM-Lifelong数据集,专为多模态终身理解而设计。该数据集包含181.1小时的影像素材,按日、周、月的时间尺度进行结构化组织,以捕捉不同时间密度下的信息。大量评估揭示了当前范式的两大关键缺陷:端到端多模态大语言模型因上下文饱和而遭遇工作记忆瓶颈,而代表性智能体基线方法在稀疏的月尺度时间线中导航时会出现全局定位崩溃。为解决这些问题,我们提出了递归多模态智能体(ReMA),该模型通过动态记忆管理迭代更新递归信念状态,其性能显著优于现有方法。最后,我们设计了可分离时间偏差与领域偏差的数据集划分方案,为未来监督学习和分布外泛化研究提供了严谨的基础。
📄 Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .
📄 从右删失生存数据中估计异质性处理效应(HTE)在精准医疗和个性化政策制定等高风险应用中至关重要。然而,由于删失、未观测的反事实以及复杂的识别假设,生存分析环境为HTE估计带来了独特挑战。尽管从因果生存森林到生存元学习器和结果插补方法等领域已取得进展,但评估实践仍然零散且不一致。我们推出了SurvHTE-Bench——首个针对删失结果的HTE估计综合基准。该基准涵盖:(i)一套模块化的合成数据集,包含已知真实效应,系统性地改变因果假设和生存动态;(ii)半合成数据集,将真实世界协变量与模拟处理和结果相结合;(iii)来自双胞胎研究(已知真实效应)和HIV临床试验的真实世界数据集。在合成、半合成和真实世界场景中,我们首次对不同条件下及现实假设违反情况下的生存HTE方法进行了严格比较。SurvHTE-Bench为因果生存方法的公平、可复现和可扩展评估奠定了基础。本基准的数据和代码可在以下网址获取:https://github.com/Shahriarnz14/SurvHTE-Bench。
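As a reference point for the survival meta-learners the benchmark evaluates, here is a minimal T-learner sketch on toy uncensored data: fit an outcome model per treatment arm and difference the predictions at a query covariate. Censoring handling (e.g. IPCW weighting or outcome imputation) is deliberately omitted, and all names and data are illustrative:

```python
def t_learner_cate(data, x_query, k=2):
    """Toy T-learner: a k-NN regressor of outcome on covariate is fit
    separately for treated and control units, and their difference at
    x_query is the CATE estimate (censoring handling omitted; real
    survival meta-learners reweight or impute censored outcomes first)."""
    def knn_mean(points, x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    treated = [(x, y) for x, t, y in data if t == 1]
    control = [(x, y) for x, t, y in data if t == 0]
    return knn_mean(treated, x_query) - knn_mean(control, x_query)

# (covariate, treatment, survival-time surrogate): the effect grows with x.
toy = [(0.0, 0, 1.0), (0.0, 1, 1.5), (1.0, 0, 1.0), (1.0, 1, 2.0),
       (2.0, 0, 1.0), (2.0, 1, 2.5)]
cate_low = t_learner_cate(toy, 0.0)
cate_high = t_learner_cate(toy, 2.0)
```

The benchmark's point is precisely that such estimators behave very differently once censoring and assumption violations enter; this sketch only fixes the estimand being compared.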
📄 We investigate the quantum algorithm of Babbush et al. (arXiv:2303.13012v3) for simulating coupled harmonic oscillators, which promises exponential speedups over classical methods. Focusing on linearly connected oscillator chains, we bridge the gap between theory and implementation by developing and comparing three concrete realizations of the algorithm. First, we implement a sparse initial state preparation combined with product-formula (Suzuki-Trotter) Hamiltonian simulation. Second, we implement a fully quantum, oracle-based framework in which classical data are accessed via oracles, the Hamiltonian is block-encoded, and time evolution is performed using QSVT-based Hamiltonian simulation. Third, we propose an efficient alternative that combines the sparse state-preparation routine of the first approach with the oracle and block-encoding-based simulation pipeline of the second. We provide these implementations on Classiq, a high-level quantum design platform, and report appropriate resource benchmarks. Our simulation results show that the complex initial state preparation proposed by Babbush et al. can be circumvented at least in the linear-chain case. Finally, we illustrate two physical applications - extracting normal modes and simulating coarse-grained energy propagation - demonstrating how the algorithm connects to measurable observables. Our results clarify the resource requirements of the algorithm and provide concrete pathways toward practical quantum advantage.
📄 我们研究了Babbush等人(arXiv:2303.13012v3)提出的用于模拟耦合谐振子的量子算法,该算法有望实现相对于经典方法的指数级加速。聚焦于线性耦合的谐振子链,我们通过开发并比较该算法的三种具体实现方案,弥合了理论与实现之间的鸿沟。首先,我们实现了稀疏初始态制备与乘积公式(Suzuki-Trotter)哈密顿模拟的结合。其次,我们实现了一个完全量子化、基于预言机的框架,其中经典数据通过预言机访问,哈密顿量被块编码,并采用基于量子奇异值变换(QSVT)的哈密顿模拟进行时间演化。第三,我们提出了一种高效替代方案,将第一种方法的稀疏态制备流程与第二种方法的预言机及块编码模拟管线相结合。我们在高级量子设计平台Classiq上提供了这些实现,并报告了相应的资源基准测试结果。模拟结果表明,至少在谐振子线性链情形下,可以规避Babbush等人提出的复杂初始态制备过程。最后,我们通过两个物理应用示例——提取简正模态和模拟粗粒度能量传播,展示了该算法如何与可观测物理量建立联系。我们的研究明确了该算法的资源需求,并为实现实用化量子优势提供了具体路径。
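The product-formula (Suzuki-Trotter) idea in the first implementation has a direct classical analogue: split the chain Hamiltonian into kinetic and potential parts and alternate their exact flows. A second-order (Strang) splitting for a three-site chain, with hypothetical parameter choices, looks like:

```python
def trotter_step(x, p, dt, k=1.0, m=1.0):
    """One second-order (Strang/Suzuki-Trotter) step for a linear chain of
    unit masses coupled by nearest-neighbour springs: H = T + V is split and
    the step alternates a half kick, a full drift, and a half kick."""
    n = len(x)

    def force(x):
        f = [0.0] * n
        for i in range(n - 1):                 # spring between sites i, i+1
            s = k * (x[i + 1] - x[i])
            f[i] += s
            f[i + 1] -= s
        return f

    f = force(x)
    p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]     # half kick (V)
    x = [xi + dt * pi / m for xi, pi in zip(x, p)]       # full drift (T)
    f = force(x)
    p = [pi + 0.5 * dt * fi for pi, fi in zip(p, f)]     # half kick (V)
    return x, p

def energy(x, p, k=1.0, m=1.0):
    kin = sum(pi * pi for pi in p) / (2 * m)
    pot = sum(0.5 * k * (x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1))
    return kin + pot

x, p = [0.1, 0.0, -0.1], [0.0, 0.0, 0.0]
e0 = energy(x, p)
for _ in range(1000):
    x, p = trotter_step(x, p, dt=0.01)
drift = abs(energy(x, p) - e0)
```

The quantum algorithm applies the same splitting to the unitary evolution; the classical toy just makes the bounded O(dt^2) energy error of the second-order formula visible.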
📄 Quantum-gas microscopes provide direct access to the phases of the Hubbard model, bringing microscopic insight into the complex competition between interactions, SU(2) magnetism, and doping. Alkaline-earth(-like) fermions extend this spin-1/2 paradigm by realizing higher symmetries and giving access to SU(N) Hubbard models, with rich phase diagrams to be unveiled. Despite its fundamental interest, a microscopic exploration of SU(N) quantum systems has remained elusive. Here we report the realization of a quantum-gas microscope for fermionic $^{87}$Sr. Our imaging scheme, based on cooling and fluorescence on the narrow intercombination line at 689 nm, enables spin-resolved single-atom detection. By implementing a spin-selective optical pumping protocol, we determine the occupation of each of the 10 spin states in a single experimental realization, a crucial capability for probing site-resolved magnetic correlations. We benchmark our method by observing single-particle Larmor precession across the full spin-9/2 ground-state manifold. These results establish $^{87}$Sr quantum-gas microscopy as a powerful approach to study exotic magnetism in the SU(N) Fermi-Hubbard model, and provide a new detection tool for studies in quantum simulation, computation, and metrology.
📄 量子气体显微镜为直接观测哈伯德模型的相提供了途径,使人们能够从微观层面理解相互作用、SU(2)磁性与掺杂之间复杂的竞争关系。碱土(类碱土)金属费米子通过实现更高的对称性,将这一自旋-1/2范式拓展至SU(N)哈伯德模型,展现出有待揭示的丰富相图。尽管SU(N)量子系统具有重要的基础研究价值,其微观探测始终面临挑战。本文报道了基于费米子$^{87}$Sr的量子气体显微镜的实现。我们的成像方案基于689 nm窄互组跃迁线的冷却与荧光收集,实现了自旋分辨的单原子探测。通过采用自旋选择的光泵浦方案,我们可在单次实验中确定全部10个自旋态的占据数,这是探测格点尺度磁关联的关键能力。我们通过观测跨越整个自旋-9/2基态流形的单粒子拉莫尔进动,验证了该方法的可靠性。这些成果确立了$^{87}$Sr量子气体显微镜作为研究SU(N)费米-哈伯德模型中奇异磁性的有力工具,并为量子模拟、计算与计量学研究提供了新的探测手段。
📄 Correlated noise is a critical failure mode in quantum error correction (QEC), as temporal memory and spatial structure concentrate faults into error bursts that undermine standard threshold assumptions. Yet, a fundamental gap persists between the stochastic Pauli models ubiquitous in QEC and the microscopic, non-Markovian descriptions of physical device dynamics. We close this gap by introducing Spatiotemporal Pauli Processes (SPPs). By applying a multi-time Pauli twirl -- operationally realised by Pauli-frame randomisation -- to a general process tensor, we map arbitrary multi-time, non-Markovian dynamics to a multi-time Pauli process. This process is represented by a process-separable comb, or equivalently, a well-defined joint probability distribution over Pauli trajectories in spacetime. We show that SPPs inherit efficient tensor network representations whose bond dimensions are bounded by the environment's Liouville-space dimension. To interpret these structures, we develop transfer operator diagnostics linking spectra to correlation decay, and exact hidden Markov representations for suitable classes of SPPs. We demonstrate the framework via surface code memory and stability simulations of up to distance 19 for (i) a temporally correlated "storm" model that tunes correlation length at fixed marginal error rates, and (ii) a genuinely spatiotemporal 2D quantum cellular automaton bath that maps exactly to a nonlinear probabilistic cellular automaton under twirling. Tuning coherent bath interactions drives the system into a pseudo-critical regime, exhibiting critical slowing down and macroscopic error avalanches that cause a complete breakdown of surface code distance scaling. Together, these results justify SPPs as an operationally grounded, scalable toolkit for modelling, diagnosing, and benchmarking correlated noise in QEC.
📄 相关噪声是量子纠错(QEC)中的关键失效模式,因为时间记忆与空间结构会将故障集中为错误爆发,从而破坏标准阈值假设。然而,QEC中普遍采用的随机泡利模型与物理器件动力学的微观非马尔可夫描述之间,始终存在根本性差距。我们通过引入时空泡利过程(Spatiotemporal Pauli Processes, SPPs)来弥合这一差距。通过对一般过程张量施加多时间泡利扭转(multi-time Pauli twirl,在操作上通过泡利框架随机化实现),我们将任意多时间、非马尔可夫动力学映射为多时间泡利过程。该过程可由过程可分离梳表示,或等价地由时空泡利轨迹上的明确定义联合概率分布描述。我们证明SPPs继承了高效的张量网络表示,其键维数受环境刘维尔空间维度的限制。为解释这些结构,我们开发了将谱与关联衰减相联系的转移算子诊断方法,并为特定类别的SPPs构建了精确隐马尔可夫表示。我们通过表面码存储模拟和稳定性模拟(距离达19)来演示该框架,包括:(i)在固定边缘错误率下调节关联长度的时序相关"风暴"模型;(ii)真正时空二维量子元胞自动机浴,其在扭转(twirling)下精确映射为非线性概率元胞自动机。调节相干浴相互作用会使系统进入伪临界区域,表现出临界减速和宏观错误雪崩,导致表面码距离标度律完全崩溃。这些结果共同证明,SPPs可作为一套基于操作、可扩展的工具包,用于建模、诊断和基准测试QEC中的相关噪声。
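A minimal illustration of the temporally correlated "storm" style model: a hidden calm/storm state evolves as a Markov chain and modulates the per-qubit error rate each round, yielding a joint distribution over spacetime Pauli trajectories. The rates, names, and X-only error model are illustrative assumptions, not the paper's calibration:

```python
import random

def sample_storm_trajectory(n_qubits, n_rounds, p_calm=0.001, p_storm=0.2,
                            p_enter=0.05, p_exit=0.5, seed=1):
    """Sample a spacetime Pauli trajectory from a toy two-state hidden-Markov
    'storm' bath: a global calm/storm state evolves in time and sets the
    per-qubit X-error rate for each round (illustrative stand-in for an SPP
    with an exact hidden Markov representation)."""
    rng = random.Random(seed)
    storm = False
    rounds = []
    for _ in range(n_rounds):
        # Markov transition of the hidden bath state.
        storm = rng.random() < ((1 - p_exit) if storm else p_enter)
        rate = p_storm if storm else p_calm
        rounds.append(["X" if rng.random() < rate else "I"
                       for _ in range(n_qubits)])
    return rounds

traj = sample_storm_trajectory(n_qubits=5, n_rounds=20)
```

Tuning p_enter and p_exit changes the temporal correlation length while the marginal error rate can be held fixed, which is exactly the knob the paper's storm model exposes.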
📄 Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
📄 高光谱图像(HSI)在环境监测到国家安全等领域具有广泛应用,可用于材料检测与识别。长波红外(LWIR)高光谱图像能够用于气体羽流检测与分析。通常情况下,仅能获取少量感兴趣场景的图像并进行独立分析。若能将多幅图像信息融合为统一连贯的表征,则可通过提供更丰富的场景几何结构与光谱特性上下文信息,从而提升分析效能。神经辐射场(NeRFs)通过创建场景体属性的潜在神经表征,实现了新视角渲染与几何重建,为高光谱三维场景重建提供了可行路径。本研究探索利用NeRFs从长波红外高光谱图像构建三维场景重建的可能性,并证明该模型可应用于气体羽流检测这一基础下游分析任务。基于物理原理的DIRSIG软件套件被用于生成包含强六氟化硫气体羽流的简易设施合成多视角长波红外高光谱数据集。我们基于标准Mip-NeRF架构构建的方法,融合了高光谱NeRF与稀疏视角NeRF的前沿技术,并引入新型自适应加权均方误差损失函数。最终提出的NeRF方法所需训练图像数量较标准Mip-NeRF减少约50%,仅需30张训练图像即可实现平均39.8 dB的峰值信噪比。应用自适应相干估计器对NeRF渲染测试图像进行气体羽流检测时,与真实测试图像生成的检测掩模相比,平均曲线下面积达到0.821。
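The adaptive weighted MSE is named but not specified in the abstract; one plausible toy form reweights each element by its own error magnitude so that hard spectral bands dominate the loss. The weighting below is an assumption for illustration only, not the paper's exact loss:

```python
def adaptive_weighted_mse(pred, target, alpha=1.0):
    """Toy adaptive weighted MSE over flattened spectra: each element's
    squared error is upweighted by 1 + alpha * |error|, then normalised by
    the total weight (illustrative form; the paper's scheme may differ)."""
    errs = [p - t for p, t in zip(pred, target)]
    weights = [1.0 + alpha * abs(e) for e in errs]
    wsum = sum(weights)
    return sum(w * e * e for w, e in zip(weights, errs)) / wsum

# Larger errors are penalised super-quadratically relative to plain MSE.
loss_easy = adaptive_weighted_mse([1.0, 1.0], [1.0, 1.1])
loss_hard = adaptive_weighted_mse([1.0, 1.0], [1.0, 2.0])
```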
📄 Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the model's intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
📄 可信度是基于大语言模型(LLM)构建的智能体AI系统的核心研究挑战。为增强可信度,通常通过检索外部知识并利用LLM核验陈述与检索证据的一致性,来检验来自不同来源(包括人类撰写的文本、网络内容和模型输出)的自然语言陈述的事实性。因此,这类方法受限于检索错误和外部数据可用性,同时未能充分利用模型内在的事实核查能力。我们提出无需检索的事实核查任务,专注于对任意自然语言陈述进行验证,无论其来源如何。为研究这一设定,我们引入了一个聚焦泛化能力的综合评估框架,测试其对以下方面的鲁棒性:(一)长尾知识,(二)陈述来源的多样性,(三)多语言性,以及(四)长文本生成。通过对9个数据集、18种方法和3个模型的实验,我们发现基于logit的方法通常逊色于利用模型内部表征的方法。基于这一发现,我们提出了INTRA方法,该方法通过挖掘内部表征间的交互关系,达到了最先进的性能并具备强大的泛化能力。更广泛而言,我们的工作确立了无需检索的事实核查作为一个前景广阔的研究方向,它能够与基于检索的框架互补,提升可扩展性,并使得此类系统可作为训练过程中的奖励信号,或作为集成至生成流程的组件使用。
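A common logit-based baseline of the kind the experiments compare against scores a claim by the mean log-probability its tokens received under the model. A self-contained toy with hand-written logit rows (no real LM involved) looks like:

```python
import math

def mean_logprob(logit_rows, token_ids):
    """Logit-based verification score: mean log-probability of the claim's
    tokens, computed with a numerically stable log-softmax over each logit
    row (toy stand-in for a real language model's output logits)."""
    total = 0.0
    for logits, tok in zip(logit_rows, token_ids):
        m = max(logits)
        logz = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[tok] - logz
    return total / len(token_ids)

# Two 3-token "claims" over a 4-word vocabulary: confident vs. uncertain.
confident = mean_logprob([[5.0, 0.0, 0.0, 0.0]] * 3, [0, 0, 0])
uncertain = mean_logprob([[1.0, 1.0, 1.0, 1.0]] * 3, [0, 0, 0])
```

The paper's finding is that such output-level scores are often weaker signals than internal representations, which is what motivates the representation-interaction method INTRA.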
📄 Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
📄 边缘设备上的单目标跟踪(SOT)是一项关键的计算机视觉任务,需要在遮挡、干扰物干扰和快速运动等条件下,在视频帧中实现准确且连续的目标定位。然而,当前最先进的干扰物感知记忆机制主要基于分割式跟踪器,依赖掩码预测和注意力驱动的记忆更新,这带来了巨大的计算开销,限制了其在资源受限硬件上的实时部署;与此同时,轻量级跟踪器虽能维持高吞吐量,但在出现视觉相似干扰物时容易发生跟踪漂移。为解决这些挑战,我们提出了EdgeDAM——一种轻量级的检测引导跟踪框架,在严格的边缘约束下重新设计了面向边界框跟踪的干扰物感知记忆机制。EdgeDAM引入了两项关键策略:(1)双缓冲区干扰物感知记忆,通过集成近期感知记忆以保持时间一致的目标假设,以及干扰物解析记忆以显式存储困难负样本候选,并在恢复过程中抑制其被重新选中;(2)置信度驱动切换与保持框稳定机制,该机制依据跟踪器可靠性和时间一致性准则,在遮挡期间自适应激活检测与记忆引导的重新识别,同时通过保持框机制临时冻结并扩展估计框以抑制干扰物污染。在包括专注干扰场景的DiDi数据集在内的五个基准测试上的大量实验表明,该方法在遮挡和快速运动下具有更强的鲁棒性,同时在移动设备上保持实时性能,在DiDi数据集上达到88.2%的准确率,并在iPhone 15上实现25 FPS。代码将公开。
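The Distractor-Resolving Memory's penalize-re-selection idea can be sketched with scalar stand-in features: candidates that sit close to a feature stored in the distractor memory are down-weighted during recovery. The scoring form, the 1-D features, and the constants are illustrative assumptions, not EdgeDAM's actual similarity model:

```python
def select_target(candidates, target_feat, distractor_feats, penalty=2.0):
    """Toy distractor-resolving re-identification: score each candidate by
    similarity to the target template, minus a hinge penalty for sitting
    close to any feature stored in the distractor memory."""
    best, best_score = None, float("-inf")
    for idx, feat in enumerate(candidates):
        score = -abs(feat - target_feat)          # template similarity
        if distractor_feats:
            closeness = max(0.0,
                            1.0 - min(abs(feat - d) for d in distractor_feats))
            score -= penalty * closeness          # avoid known hard negatives
        if score > best_score:
            best, best_score = idx, score
    return best

# Candidate 0 matches the template best, but it lies on a stored distractor.
cands = [1.0, 1.3]
no_memory = select_target(cands, target_feat=1.0, distractor_feats=[])
with_memory = select_target(cands, target_feat=1.0, distractor_feats=[1.0])
```

Without the memory the tracker would relock onto the distractor; with it, the slightly worse-matching but unpenalised candidate wins, which is the drift-avoidance behaviour the abstract describes.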
📄 Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh's National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
📄 针对低资源语言的阅读理解系统在处理不可回答问题方面面临显著挑战。当上下文中缺乏正确答案时,这些系统往往会产生不可靠的回应。为解决这一问题,我们推出了NCTB-QA——一个大规模孟加拉语问答数据集,包含从孟加拉国国家课程与教科书委员会出版的50本教材中提取的87,805个问答对。与现有孟加拉语数据集不同,NCTB-QA保持了可回答问题(57.25%)与不可回答问题(42.75%)的均衡分布,同时包含含有干扰选项的对抗性设计实例。我们对三种基于Transformer的模型(BERT、RoBERTa、ELECTRA)进行基准测试,并通过微调实现了显著性能提升:BERT模型的F1分数获得313%的相对提升(从0.150增至0.620)。通过BERTScore衡量的语义答案质量在所有模型中也显著提高。实验结果表明,NCTB-QA可作为孟加拉语教育问答领域具有挑战性的基准数据集。本研究证明,在低资源场景中,领域特异性微调对实现稳健性能至关重要。
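The F1 score cited above is the standard token-overlap F1 used in extractive QA evaluation (and the 313% figure is consistent: (0.620 - 0.150) / 0.150 is roughly 3.13). A minimal implementation of that metric:

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 as used in extractive QA: harmonic mean of precision
    and recall over the multiset intersection of predicted and reference
    tokens (whitespace tokenisation for brevity)."""
    pred, ref = prediction.split(), reference.split()
    if not pred or not ref:
        return float(pred == ref)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

f1 = token_f1("the national curriculum board", "national curriculum board")
```

For Bangla text the same multiset computation applies once a suitable tokeniser is substituted for the whitespace split.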
📄 Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on 'scattered acceptance' - committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
📄 扩散语言模型(DLMs)具备高度并行文本生成的潜力,但其实际推理速度常受限于次优的解码调度策略。标准方法依赖"分散接受"机制——在序列的离散位置上提交高置信度标记。这种做法会无意间割裂键值(KV)缓存,破坏内存局部性,并迫使模型在不稳定的标记边界上进行代价高昂的重复修复。为此,我们提出最长稳定前缀(LSP)调度器,这是一种基于整体前缀吸收、无需训练且与模型无关的推理范式。在每个去噪步骤中,LSP通过单次前向传播评估标记稳定性,动态识别连续左对齐的稳定预测块,并在原子化提交前将其边界对齐至自然语言或结构分隔符。这种前缀优先的拓扑结构带来双重优势:系统层面,它将碎片化的KV缓存更新转化为高效的连续追加;算法层面,它在几何收缩的活跃后缀上保持双向前瞻能力,显著降低标记翻转率与去噪器调用次数。基于LLaDA-8B和Dream-7B的广泛实验表明,在数学推理、代码生成、多语言(CJK)任务及创意写作等严格基准测试中,LSP将推理速度提升最高达3.4倍,同时保持或略微提升输出质量。通过根本性重构提交拓扑结构,LSP弥合了DLMs理论并行性与实际硬件效率之间的鸿沟。
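The core LSP commitment rule - take the longest left-aligned run of confident tokens and snap the boundary back to a natural delimiter before an atomic commit - can be sketched as follows. The threshold, the delimiter set, and the wait-for-a-delimiter fallback are illustrative choices, not the paper's exact policy:

```python
def longest_stable_prefix(tokens, confidences, threshold=0.9,
                          delimiters=(".", ",", ";", "\n", " ")):
    """Find the longest left-aligned run of tokens whose confidence clears
    the threshold, then snap the commit boundary back to the last delimiter
    inside that run (toy version of the LSP commitment rule)."""
    run = 0
    while run < len(tokens) and confidences[run] >= threshold:
        run += 1
    # Snap back to the most recent delimiter token within the stable run.
    for i in range(run - 1, -1, -1):
        if tokens[i] in delimiters:
            return tokens[:i + 1]
    # Simplification: commit only if the whole sequence is stable,
    # otherwise wait for a delimiter to stabilise.
    return tokens[:run] if run == len(tokens) else []

toks = ["Hello", " ", "world", ".", "Next", "claim"]
conf = [0.99, 0.98, 0.97, 0.95, 0.92, 0.40]
committed = longest_stable_prefix(toks, conf)
```

Committing only a contiguous prefix is what turns fragmented KV-cache writes into a single append: everything left of the boundary is final, and only the shrinking suffix stays active.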
📄 Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.
📄 建立共同基础——即一套共享的信念和相互认可的事实——是协作的根本,但对当前的人工智能系统而言仍是一项挑战,尤其是在多模态、多方参与的协作场景中,参与者各自掌握不同的信息。我们提出了分布式部分信息谜题(DPIP),这是一种在认知不对称条件下引发丰富多模态交流的协作建构任务。我们构建了一个多模态交互数据集,该数据集经过标注,并在语音、手势和动作模态间实现了时间对齐,以支持对命题内容和信念动态的推理。随后,我们评估了两种建模共同基础(CG)的范式:(1)采用先进的大型语言模型(LLMs),通过多模态信息更新推断共享信念;(2)基于动态认知逻辑(DEL)构建的公理化流程,以增量方式执行相同任务。在已标注的DPIP数据上的实验结果表明,该任务对现代LLMs同时追踪任务进展和信念状态的能力构成了挑战。
📄 Contact-rich micromanipulation in microfluidic flow is challenging because small disturbances can break pushing contact and induce large lateral drift. We study planar cell pushing with a magnetic rolling microrobot that tracks a waypoint-sampled reference curve under time-varying Poiseuille flow. We propose a hybrid controller that augments a nominal MPC with a learned residual policy trained by SAC. The policy outputs a bounded 2D velocity correction that is contact-gated, so residual actions are applied only during robot--cell contact, preserving reliable approach behavior and stabilizing learning. All methods share the same actuation interface and speed envelope for fair comparisons. Experiments show improved robustness and tracking accuracy over pure MPC and PID under nonstationary flow, with generalization from a clover training curve to unseen circle and square trajectories. A residual-bound sweep identifies an intermediate correction limit as the best trade-off, which we use in all benchmarks.
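The contact-gating described above reduces, at the action level, to a small piece of glue logic between the nominal controller and the learned policy. A minimal sketch (function names, the per-axis clipping, and the bound value are illustrative, not the paper's implementation):

```python
import numpy as np

def hybrid_action(u_mpc, residual, in_contact, bound=0.5):
    """Combine a nominal MPC velocity command with a learned residual.

    The residual correction (a 2D velocity) is clipped to a fixed bound and
    applied only while the robot is in contact with the cell, so the approach
    phase is governed by the nominal controller alone.
    """
    u_mpc = np.asarray(u_mpc, dtype=float)
    residual = np.asarray(residual, dtype=float)
    if not in_contact:
        # No contact: pure MPC, which keeps approach behavior reliable.
        return u_mpc
    # In contact: add the bounded correction (per-axis clip for simplicity).
    correction = np.clip(residual, -bound, bound)
    return u_mpc + correction
```

Because the residual is both bounded and gated, the learned policy can only perturb the nominal command during the contact phase, which is what stabilizes training in this kind of residual-RL setup.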
📄 Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
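The offline-calibration idea can be sketched in a few lines: average attention maps over calibration inputs, pool them into blocks, keep only the high-mass blocks, and reuse that mask at inference. The pooling, thresholding, and masking details below are assumptions for illustration, not CalibAtt's actual kernels:

```python
import numpy as np

def calibrate_block_mask(attn_maps, block=4, keep_ratio=0.5):
    """Offline pass: average token-level attention maps over calibration
    inputs, pool into (block x block) tiles, and keep the highest-mass
    tiles. Returns a boolean block mask reused at inference time."""
    avg = np.mean(attn_maps, axis=0)                     # (T, T) averaged map
    nb = avg.shape[0] // block
    tiles = avg[:nb * block, :nb * block].reshape(nb, block, nb, block).mean(axis=(1, 3))
    k = max(1, int(keep_ratio * tiles.size))
    thresh = np.sort(tiles.ravel())[-k]
    return tiles >= thresh                               # (nb, nb) block mask

def block_sparse_attention(q, k_, v, mask, block=4):
    """Inference: compute attention densely inside kept blocks, skip the rest
    (here via masking; a real kernel would never materialize skipped blocks)."""
    scores = q @ k_.T / np.sqrt(q.shape[-1])
    full = np.kron(mask.astype(int), np.ones((block, block), dtype=int)).astype(bool)
    scores = np.where(full[:scores.shape[0], :scores.shape[1]], scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

In the real method the mask is compiled per layer, head, and diffusion timestep; the toy above uses one mask to show the calibrate-then-skip structure.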
📄 We introduce group surface codes, which are a natural generalization of the $\mathbb{Z}_2$ surface code, and equivalent to quantum double models of finite groups with specific boundary conditions. We show that group surface codes can be leveraged to perform non-Clifford gates in $\mathbb{Z}_2$ surface codes, thus enabling universal computation with well-established means of performing logical Clifford gates. Moreover, for suitably chosen groups, we demonstrate that arbitrary reversible classical gates can be implemented transversally in the group surface code. We present the logical operations in terms of a set of elementary logical operations, which include transversal logical gates, a means of transferring encoded information into and out of group surface codes, and preparation and readout. By composing these elementary operations, we implement a wide variety of logical gates and provide a unified perspective on recent constructions in the literature for sliding group surface codes and preparing magic states. We furthermore use tensor networks inspired by ZX-calculus to construct spacetime implementations of the elementary operations. This spacetime perspective also allows us to establish explicit correspondences with topological gauge theories. Our work extends recent efforts in performing universal quantum computation in topological orders without the braiding of anyons, and shows how certain group surface codes allow us to bypass the restrictions set by the Bravyi-König theorem, which limits the computational power of topological Pauli stabilizer models.
📄 Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, prior work proposed Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformations. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations at significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers such as AdamW run out of memory under the same settings.
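The invariant behind spectrum-preserving training of this kind is that multiplying a weight by orthogonal factors leaves its singular values untouched. A toy update illustrating that invariant (the Cayley parameterization is one standard way to stay exactly orthogonal and is an assumption here, not necessarily POET's):

```python
import numpy as np

def cayley(S):
    """Map a skew-symmetric matrix S to an orthogonal matrix via the
    Cayley transform (I - S)^{-1} (I + S)."""
    I = np.eye(S.shape[0])
    return np.linalg.solve(I - S, I + S)

def poet_style_update(W, GP, GQ, lr=0.05):
    """Sketch of a spectrum-preserving step in the spirit of POET (the
    parameterization is illustrative): the weight is only ever transformed
    as P @ W @ Q.T with P, Q orthogonal, so training cannot change the
    singular values of W."""
    P = cayley(lr * (GP - GP.T))   # skew-symmetrize arbitrary update directions
    Q = cayley(lr * (GQ - GQ.T))
    return P @ W @ Q.T
```

The weight matrix moves, but its spectrum (and hence quantities like its condition number) is frozen at initialization, which is the source of the stability benefit the abstract refers to.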
📄 We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
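An operational probe for attention sinks is simply the attention mass a head assigns to a designated token. A minimal version, using the common first-token convention (this definition is a standard convention in the literature, not necessarily the paper's exact metric):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sink_mass(scores):
    """Average fraction of attention mass that each query assigns to the
    first token. `scores` are raw pre-softmax logits, shape (queries, keys);
    values near 1 indicate a strong attention sink at position 0."""
    attn = softmax(scores, axis=-1)
    return attn[:, 0].mean()
```

Sweeping this quantity per head and layer is one cheap way to map where sinks form and whether they track the tokens that carry massive activations.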
📄 Traditional safety-critical control methods, such as control barrier functions, suffer from semantic blindness, exhibiting the same behavior around obstacles regardless of contextual significance. This limitation leads to the uniform treatment of all obstacles, despite their differing semantic meanings. We present Safe-SAGE (Social-Semantic Adaptive Guidance for Safe Engagement), a unified framework that bridges the gap between high-level semantic understanding and low-level safety-critical control through a Poisson safety function (PSF) modulated using a Laplace guidance field. Our approach perceives the environment by fusing multi-sensor point clouds with vision-based instance segmentation and persistent object tracking to maintain up-to-date semantics beyond the camera's field of view. A multi-layer safety filter is then used to modulate system inputs to achieve safe navigation using this semantic understanding of the environment. This safety filter consists of both a model predictive control layer and a control barrier function layer. Both layers utilize the PSF and flux modulation of the guidance field to introduce varying levels of conservatism and multi-agent passing norms for different obstacles in the environment. Our framework enables legged robots to navigate semantically rich, dynamic environments with context-dependent safety margins while maintaining rigorous safety guarantees.
📄 The realization of quantum error correction protocols whose logical error rates are suppressed far below physical error rates relies on an intricate combination: the error-correcting code's efficiency, the syndrome extraction circuit's fault tolerance and overhead, the decoder's quality, and the device's constraints, such as physical qubit count and connectivity. This work makes two contributions towards error-corrected quantum devices. First, we introduce mirror codes, a simple yet flexible construction of LDPC stabilizer codes parameterized by a group $G$ and two subsets of $G$ whose total size bounds the check weight. These codes contain all abelian two-block group algebra codes, such as bivariate bicycle (BB) codes. At the same time, they are manifestly not CSS in general, thus deviating substantially from most prior constructions. Fixing a check weight of 6, we find $[[ 60, 4, 10 ]], [[ 36, 6, 6 ]], [[ 48, 8, 6 ]]$, and $[[ 85, 8, 9 ]]$ codes, all of which are not CSS; we also find several weight-7 codes with $kd > n$. Next, we construct syndrome extraction circuits that trade overhead for provable fault tolerance. These circuits use 1-2, 3, and 6 ancillae per check, and respectively are partially fault-tolerant (FT), provably FT on weight-6 CSS codes, and provably FT on \emph{all} weight-6 stabilizer codes. Using our constructions, we perform end-to-end quantum memory experiments on several representative mirror codes under circuit-level noise. We achieve an error pseudothreshold on the order of $0.2\%$, approximately matching that of the $[[ 144, 12, 12 ]]$ BB code under the same model. These findings position mirror codes as a versatile candidate for fault-tolerant quantum memory, especially on smaller-scale devices in the near term.
📄 To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects "cheap" imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.
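The basin-of-attraction argument can be seen in a toy version of the three-stage strategy: the inexact label only needs to place the model in the right basin, after which label-free refinement on the self-supervised residual recovers the true solution. The objective and constants below are purely illustrative:

```python
import numpy as np

def pretrain_then_refine(cheap_label, residual_grad, lr=0.05, steps=200):
    """Three-stage sketch: (1) collect a cheap, imperfect label; (2)
    'supervised pretraining' here is just initializing at that label; (3)
    self-supervised refinement descends an unsupervised residual, with no
    further labels needed."""
    w = float(cheap_label)          # stages 1-2: start from the inexact label
    for _ in range(steps):          # stage 3: label-free gradient refinement
        w -= lr * residual_grad(w)
    return w

# Toy self-supervised objective: solve w**2 = 2, residual r(w) = (w**2 - 2)**2.
grad = lambda w: 4 * w * (w**2 - 2)
```

Starting from the rough label 1.5 converges to the positive root; the sign of the label picks the basin, which is exactly why only modest label quality is required.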
📄 Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
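A linear probe of the kind mentioned is just logistic regression on hidden activations. A self-contained sketch on synthetic activations (the data layout, training loop, and hyperparameters are assumptions for illustration):

```python
import numpy as np

def train_linear_probe(H, y, lr=0.5, steps=500):
    """Fit a logistic-regression probe on activations H (n, d) with binary
    labels y via full-batch gradient descent. Trained on generic honesty
    data, such a probe can then flag responses whose activations resemble
    known falsehoods."""
    w = np.zeros(H.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # predicted P(lie)
        g = p - y                                # logistic-loss gradient signal
        w -= lr * (H.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

def probe_predict(H, w, b):
    return (H @ w + b) > 0
```

The appeal over prompting-based detection is cost: once trained, the probe is a single dot product per response.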
📄 Characterizing the dynamics of open quantum systems at the level of microscopic interactions and error mechanisms is essential for calibrating quantum hardware, designing robust simulation protocols, and developing tailored error-correction methods. Under Markovian noise/dissipation, a natural characterization approach is to identify the full Lindbladian generator that gives rise to both coherent (Hamiltonian) and dissipative dynamics. Prior protocols for learning Lindbladians from dynamical data assumed pre-specified interaction structure, which can be restrictive when the relevant noise channels or control imperfections are not known in advance. In this paper, we present the first sample-efficient protocol for learning sparse Lindbladians without assuming any a priori structure or locality. Our protocol is ancilla-free, uses only product-state preparations and Pauli-basis measurements, and achieves near-optimal time resolution, making it compatible with near-term experimental capabilities. The final sample complexity depends on linear-system conditioning, which we find empirically to be moderate for a broad class of physically motivated models. Together, this provides a systematic route to scalable characterization of open-system quantum dynamics, especially in settings where the error mechanisms of interest are unknown.
📄 We study the local limits of uniform random triangulations with boundaries in the regime where the genus is proportional to the number of faces. Budzinski and Louf proved in 2020 that when there are no boundaries, the local limits exist and are given by the Planar Stochastic Hyperbolic Triangulations (PSHT). We show that when the triangulations considered have size n and boundaries of total length p, where p tends to infinity with n and p = o(n), the local limits around a typical boundary edge are the half-plane hyperbolic triangulations defined by Angel and Ray. This provides, for the first time, a construction of these hyperbolic half-plane triangulations as local limits of large genus triangulations. We also prove that under the condition p = o(n), the local limit when rooted on a uniformly chosen oriented edge is given by the PSHT. Contrary to the proof of Budzinski and Louf, the latter does not rely on the Goulden-Jackson recurrence relation, but only on coarse combinatorial estimates. Thus, we expect that the proof can be adapted to local limits in similar models.
📄 The growing complexity of hardware design and the widening gap between high-level specifications and register-transfer level (RTL) implementation hinder rapid prototyping and system design. We introduce NL2GDS (Natural Language to Layout), a novel framework that leverages large language models (LLMs) to translate natural language hardware descriptions into synthesizable RTL and complete GDSII layouts via the open-source OpenLane ASIC flow. NL2GDS employs a modular pipeline that captures informal design intent, generates HDL using multiple LLM engines and verifies them, and orchestrates automated synthesis and layout. Evaluations on ISCAS'85 and ISCAS'89 benchmark designs demonstrate up to 36% area reduction, 35% delay reduction, and 70% power savings compared to baseline designs, highlighting its potential to democratize ASIC design and accelerate hardware innovation.
📄 We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
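Probe-guided early exit amounts to stopping generation as soon as the probe's confidence in the decoded final answer crosses a threshold. A schematic version over a per-token confidence trace (the threshold and the trace itself are illustrative):

```python
import numpy as np

def early_exit_step(probe_confidences, threshold=0.95):
    """Return the index of the first CoT token at which the probe is
    confident in the final answer, i.e. where generation could stop.
    Returns the full length if the probe never becomes confident."""
    conf = np.asarray(probe_confidences)
    hits = np.nonzero(conf >= threshold)[0]
    return int(hits[0]) if hits.size else len(conf)

def token_savings(probe_confidences, threshold=0.95):
    """Fraction of CoT tokens that early exit would avoid generating."""
    n = len(probe_confidences)
    return 1.0 - early_exit_step(probe_confidences, threshold) / n
```

On easy recall-style questions the confidence trace saturates early, yielding large savings; on genuinely hard multihop questions it stays low until late, which is consistent with the MMLU-versus-GPQA-Diamond gap reported above.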
📄 Vision-Language-Action Models (VLAs) have shown remarkable progress towards embodied intelligence. While their architecture partially resembles that of Large Language Models (LLMs), VLAs exhibit higher complexity due to their multi-modal inputs/outputs and often hybrid nature of transformer and diffusion heads. This is part of the reason why insights from mechanistic interpretability in LLMs, which explain how the internal model representations relate to their output behavior, do not trivially transfer to VLA counterparts. In this work, we propose to close this gap by introducing and analyzing two main concepts: feature-observability and feature-controllability. In particular, we first study features that are linearly encoded in representation space, and show how they can be observed by means of a linear classifier. Then, we use a minimal linear intervention grounded in optimal control to accurately place internal representations and steer the VLA's output towards a desired region. Our results show that targeted, lightweight interventions can reliably steer a robot's behavior while preserving closed-loop capabilities. We demonstrate on different VLA architectures ($\pi_{0.5}$ and OpenVLA) through simulation experiments that VLAs possess interpretable internal structure amenable to online adaptation without fine-tuning, enabling real-time alignment with user preferences and task requirements.
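For a single linear feature, a minimal intervention of the kind described has a closed form: the smallest edit to a hidden state that sets the feature readout to a target value lies along the probe direction. A sketch (the single-feature, single-layer setting is a simplification of the paper's optimal-control formulation):

```python
import numpy as np

def minimal_steering(h, w, target):
    """Smallest-norm edit to a hidden state h so that the linear feature
    readout w . h' equals `target`. This is the closed-form solution of
    min ||h' - h||  subject to  w . h' = target: the correction is the
    residual (target - w.h) projected back along w."""
    h = np.asarray(h, dtype=float)
    w = np.asarray(w, dtype=float)
    delta = (target - w @ h) / (w @ w) * w
    return h + delta
```

Because the edit is minimal in norm, directions orthogonal to the probed feature are untouched, which is the intuition behind steering behavior while preserving closed-loop capabilities.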
📄 We conducted a HI 21cm absorption study of a sample of 147 nearby (z < 0.1) low-power radio sources with $10\,\mathrm{mJy} < S_{1.4\,\mathrm{GHz}} < 30\,\mathrm{mJy}$ and $\log(P_{1.4\,\mathrm{GHz}}/\mathrm{W\,Hz^{-1}}) = 20.5-23.7$, using the Five-hundred-meter Aperture Spherical radio Telescope (FAST). By investigating the origin and kinematics of HI absorbing gas, we aim to study the interplay between the active galactic nucleus (AGN) and its surrounding interstellar medium. Our observations detect 12 new absorbers, combining results from the pilot survey (three absorbers out of 26 sources), yielding a detection rate of $\sim10.2^{+3.1}_{-2.0}\%$. The detection rate in our sample is lower than in higher-power samples, which is likely due to emission dilution and the dominance of extended sources, indicating a gas-rich and star-forming-dominated population in low-power sources. Among new detections, most line profiles are narrow and show velocities close to systemic ones, consistent with rotating disks, while four show disturbed kinematics indicative of inflows or outflows. The fraction of outflow candidates rises with radio power, while the fraction of inflow ones remains constant, suggesting the effect of radio emission on driving HI outflows. In our sample, compact sources show a higher HI detection rate than extended sources. Contrary to expectations from higher-power samples, MIR-bright sources at low radio power do not exhibit a higher HI detection rate or more disturbed kinematics. In low-power radio sources, blueshifted absorption occurs only in Seyferts and low-ionization nuclear emission-line regions (LINERs), indicating the connection between atomic outflows and the ionization state of AGN.
📄 Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene's geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
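The abstract names an adaptive weighted MSE loss without giving its form. One plausible reading, offered purely as an illustration and not as the paper's actual loss, reweights per-band squared errors by their own current magnitude, so poorly fit spectral bands, e.g. narrow gas-absorption features, dominate the objective:

```python
import numpy as np

def adaptive_weighted_mse(pred, target, eps=1e-8):
    """Illustrative adaptive weighted MSE over hyperspectral pixels.
    pred, target: arrays of shape (pixels, bands). Per-band mean squared
    errors are normalized into weights, so the loss adapts to concentrate
    on whichever bands are currently fit worst."""
    err = (pred - target) ** 2
    band_err = err.mean(axis=0)                  # mean error per spectral band
    weights = band_err / (band_err.sum() + eps)  # adaptive, sums to ~1
    return float((err * weights).sum(axis=1).mean())
```

Relative to a plain MSE, this up-weights the few bands carrying the plume signature instead of averaging them away against the many well-fit background bands.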
📄 The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise fundamental statistical questions about when synthetic data can be used in a valid, reliable, and principled manner. This paper reviews the current landscape of synthetic data generation and use from a statistical perspective, with the goal of clarifying the assumptions under which synthetic data can meaningfully support downstream discovery, inference, and prediction. We survey major classes of modern generative models, their intended use cases, and the benefits they offer, while also highlighting their limitations and characteristic failure modes. We additionally examine common pitfalls that arise when synthetic data are treated as surrogates for real observations, including biases from model misspecification, attenuated uncertainty, and difficulties in generalization. Building on these insights, we discuss emerging frameworks for the principled use of synthetic data. We conclude with practical recommendations, open problems, and cautions intended to guide both method developers and applied researchers.
📄 Thermonuclear X-ray bursts from the surface of accreting neutron stars are the most common astrophysical explosions in our galaxy. They provide a unique window into the physics of neutron stars, the physics of matter under extreme conditions, and the physics of astrophysical thermonuclear explosions. X-ray bursts are powered by a broad range of nuclear reactions that need to be understood to interpret observations. The relevant nuclei are mostly neutron deficient and unstable, and thus experimental information and theoretical understanding is limited and an active area of research in nuclear science. We review the current status of the nuclear physics of X-ray bursts, with special emphasis on new experimental and theoretical information on a large number of reaction rates. As such we provide an overview of the broad experimental and theoretical methods currently used to advance the nuclear physics of X-ray bursts. The new information is used to update the public JINA REACLIB database with 32 new reaction rates based on experimental information, and a new dataset of theoretical statistical model reaction rates where no experimental information is available. Using several models for X-ray bursts that are powered by mixed hydrogen and helium burning, we take advantage of the updated nuclear data to review the current understanding of the nuclear reaction sequences in such X-ray bursts, the modeling of light curves, and predictions of the composition of nuclear ashes.
📄 Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
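The simplest instance of the model merging studied above is task-vector arithmetic; TSV-M and the proposed BoostedTSV-M refine it by operating on the singular values of the task vectors. A hedged sketch of the baseline scheme only, with toy arrays standing in for checkpoint weights:

```python
import numpy as np

def merge_task_vectors(base, finetuned, scale=1.0):
    """Baseline task-vector merging: average the task vectors
    (finetuned - base) and add them back onto the base weights.
    TSV-M-style methods instead decompose and rescale the singular
    values of these task vectors; that step is omitted here."""
    task_vectors = [w - base for w in finetuned]
    return base + scale * np.mean(task_vectors, axis=0)
```

In practice this is applied per parameter tensor across all domain-specific checkpoints, yielding a single merged model.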
📄 The processes governing protostellar mass growth remain debated, although episodic accretion is now understood as a key feature of protostellar evolution across all masses. Luminosity bursts have been observed in both low- and high-mass protostars, but the overall statistics remain limited, especially for high-mass objects. Over the past decade, numerical simulations of high-mass core collapse have provided a theoretical framework for interpreting protostellar variability, yet additional observational constraints are required to determine the characteristics and importance of bursts. In this work, we analyse data from the GASTON-GP programme, which mapped a 2.4 square degree region of the Galactic plane (centred at l = 24 deg) at 1.15 and 2.00 mm using NIKA2 on the IRAM 30 m telescope. The survey obtained 11 epochs over four years, offering the first opportunity to study millimetre variability in a large sample of massive protostellar sources. From the combined dataset, we constructed catalogues of 2925 compact sources at 1.15 mm and 1713 at 2.00 mm. Using a dedicated relative calibration scheme, we generated millimetre light curves for around 200 high-signal-to-noise sources and identified one variable candidate. However, it is not protostellar. Consequently, we report no robust detections of variable protostellar sources in the GASTON field. This is the direct consequence of observational limitations (i.e., sensitivity and resolution) combined with the lack of any 100-fold luminosity bursts during the observations, which is consistent with estimates inferred from isolated core collapse simulations. This study highlights the need for future high-resolution, high-cadence surveys to constrain the accretion histories of massive protostars.
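A common way to screen multi-epoch light curves for variability, shown here purely as a generic baseline rather than the paper's dedicated relative-calibration scheme, is a reduced chi-square test against a constant-flux model:

```python
import numpy as np

def variability_chi2(flux, err):
    """Reduced chi-square of a light curve against a constant
    (inverse-variance weighted mean) model; values well above 1
    flag variable candidates for follow-up inspection."""
    w = 1.0 / err**2
    mean = np.sum(w * flux) / np.sum(w)
    return np.sum(((flux - mean) / err) ** 2) / (len(flux) - 1)
```

With 11 epochs per source, one would compute this statistic for each high-signal-to-noise light curve and inspect the tail of the distribution.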
📄 The Microchannel X-ray Telescope on board the Space-based multi-band astronomical Variable Objects Monitor (SVOM) satellite detects and localizes the X-ray afterglow of gamma-ray bursts. One year after launch, this paper presents the in-flight performance of the scientific analyses conducted by the on-board computer. After summarizing the analysis steps, the paper reviews the on-board results obtained with 15 gamma-ray burst afterglows detected by the telescope between October 2024 and August 2025. For all bursts, the localization uncertainty is estimated to be below 2 arcmin, as required by the mission design. On average, the measured position is found to be 40 arcsec away from the position measured by other experiments with better sky resolution. Moreover, we show that the on-board analysis provides a precise sky location for the burst only a few seconds after the beginning of the observation. Taking advantage of an efficient very-high-frequency antenna network, this information is quickly collected on the ground and disseminated to other observation facilities. This low-latency strategy is critical for the multi-wavelength and multi-instrument follow-up program of SVOM.
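The quoted offsets (40 arcsec on average, against a 2 arcmin requirement) are great-circle separations between two sky positions. A minimal way to compute such a separation, assuming positions given in degrees:

```python
import math

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions (degrees in,
    arcseconds out), using the numerically stable haversine form."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(math.sqrt(h))) * 3600.0
```

Production pipelines would typically use a library such as astropy for this, but the formula above is the underlying computation.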
📄 Our understanding of the early Universe has long been limited by biased galaxy samples selected through various color criteria. With deep JWST infrared imaging, mass-complete galaxy samples can now be studied up to $z \sim 8$ for the first time. However, recent work has revealed systematic uncertainties in measuring physical properties of galaxies based solely on JWST/NIRCam and HST photometry, due to their limited wavelength coverage. This highlights the need for supplementary data, particularly in the rest-frame UV and near-infrared. Here we present the ULTIMATE-deblending project, which will eventually deliver self-consistent UV-to-radio photometry for galaxies detected in deep JWST surveys, including both NIRCam and MIRI data. In this first paper, we release a 50-band photometric catalog spanning CFHT/U to JWST/MIRI F1800W, covering a total of 627.1 arcmin$^2$ across two JWST/PRIMER fields. We detail the reduction of JWST imaging data, the photometric procedures, and the SED-fitting methodology used to derive galaxy properties. Compared with photometry including only HST and JWST bands, the inclusion of deblended low-resolution photometry from ground-based telescopes improves the accuracy of photometric redshifts by $\sim$40%, while reducing the outlier fraction by $\sim$60%. This galaxy sample can serve as a key reference for statistical studies of galaxy formation and evolution in the early Universe. All catalogs and JWST mosaics from the ULTIMATE-deblending project will be made publicly available.
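The photometric-redshift accuracy and outlier-fraction figures quoted above are conventionally computed from normalized residuals. A sketch using the widely used $\sigma_{\rm NMAD}$ scatter and a $|\Delta z|/(1+z) > 0.15$ outlier cut (the exact definitions adopted by the authors may differ):

```python
import numpy as np

def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Common photo-z quality metrics from the normalized residuals
    dz = (z_phot - z_spec) / (1 + z_spec): the sigma_NMAD scatter
    (1.4826 * median absolute deviation) and the fraction of
    catastrophic outliers with |dz| > outlier_cut."""
    z_phot, z_spec = np.asarray(z_phot), np.asarray(z_spec)
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    sigma_nmad = 1.4826 * np.median(np.abs(dz - np.median(dz)))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return sigma_nmad, outlier_frac
```

Comparing these two numbers with and without the deblended ground-based bands is how improvements like the quoted $\sim$40% and $\sim$60% are typically measured.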
📄 We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
📄 While it is well known that galaxies are composites of many emission processes, quantifying the various contributions remains challenging. In this work, we use unsupervised machine-learning clustering algorithms to evaluate the agreement between the clustering tools and astrophysical classifications, and hence quantify the fractional contributions of star formation processes and nuclear black hole activity to the total galaxy energy budget of radio sources. We perform clustering on the multiwavelength (optical, infrared (IR), and radio) active galactic nuclei (AGN) diagnostic spaces, using data from the G09 and G23 fields of the Galaxy and Mass Assembly (GAMA) survey, the Evolutionary Map of the Universe (EMU) survey, and the Wide-field Infrared Survey Explorer (WISE). We find that the statistical clustering recovers $\approx$90% of the star forming galaxies (SFGs) and $\approx$80% of the AGN. We define a new IR-radio AGN diagnostic scheme that identifies radio AGN among IR-classified SFGs and AGN, corresponding to a KMeans cluster with approximately 90% reliability. We demonstrate the superior power of radio AGN selection in higher dimensions using a three-dimensional space composed of directly observable parameters ($\rm W_1-W_2$ colour, $\rm W_2$ magnitude, and the 1.4 GHz radio flux density). This novel three-dimensional diagnostic shows immense potential for radio AGN selection, being close to 90% reliable and 90% complete. We also publish a catalogue of radio sources in the EMU survey with associated probabilities of being optically active, through which we emphasise the philosophy of considering a galaxy to be composed of various fractions rather than a binary classification into SFGs and AGN.
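A minimal KMeans (Lloyd's algorithm) over a diagnostic space such as the three-dimensional one above can be sketched as follows; this is a generic illustration, not the authors' pipeline, and real analyses would use a library implementation:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Minimal Lloyd's k-means over rows of X, e.g. sources embedded in
    a (W1-W2 colour, W2 magnitude, radio flux) space. Deterministic
    farthest-point initialisation keeps the sketch reproducible."""
    centers = [X[0]]
    for _ in range(k - 1):
        # next seed: the point farthest from all current centers
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        # assign each point to its nearest center, then recompute means
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dist2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Agreement between the resulting clusters and astrophysical SFG/AGN labels is what the fractional-contribution analysis above quantifies.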
📄 Urban mobility models are essential tools for understanding and forecasting how people and goods move within cities, which is vital for transportation planning. The spatial scale at which urban mobility is analysed is a crucial determinant of the insights gained from any model, as it can affect model performance. It is therefore important that urban mobility models be assessed at appropriate spatial scales that reflect the underlying dynamics. In this study, we systematically evaluate the performance of three popular urban mobility models, namely the gravity, radiation, and visitation models, across spatial scales. The results show that while the visitation model consistently performs better than its gravity and radiation counterparts, the three models do not differ much when assessed at an appropriate spatial scale common to all of them. Interestingly, at scales where all models perform badly, the visitation model suffers the most. Furthermore, models based on conventional administrative boundaries may not perform as well as those based on distance-based clustering. The cross-examination of urban mobility models across spatial scales also reveals the spatial organisation of the urban structure.
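The gravity and radiation models have standard textbook forms; the sketches below use a power-law gravity kernel and the parameter-free radiation formula, which may differ from the exact calibrations used in the study:

```python
def gravity_flow(m_i, m_j, d_ij, k=1.0, beta=2.0):
    """Power-law gravity model: flow between two zones scales with both
    masses (e.g. populations) and decays with distance,
    T_ij = k * m_i * m_j / d_ij**beta."""
    return k * m_i * m_j / d_ij**beta

def radiation_flow(T_i, m_i, m_j, s_ij):
    """Parameter-free radiation model: T_i is the total outflow from
    zone i and s_ij is the population inside the circle of radius d_ij
    centred on i, excluding the two endpoint zones."""
    return T_i * (m_i * m_j) / ((m_i + s_ij) * (m_i + m_j + s_ij))
```

Because $s_{ij}$ depends on which intermediate zones fall inside the circle, the radiation model is directly sensitive to the spatial scale of the zoning, which is one reason scale matters in the comparison above.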
📄 The dramatic slowdown of dynamics in supercooled liquids approaching the glass transition remains one of the central unresolved problems in condensed matter physics. We review approaches that attribute this slowdown to growing thermodynamic or structural length scales and discuss their difficulties in accounting for recent numerical results. These limitations motivate the present review, which critically examines alternative theories in which the glassy slowdown is instead controlled by localized excitations and their elastic interactions. After reviewing key phenomenology with a focus on the fragility of liquids, dynamical heterogeneities, thermodynamics-dynamics correlation, and the effect of kinetic rules and swap algorithms, we compare elastic descriptions based on homogeneous and local heterogeneous elasticity to excitation-based theories incorporating nonlinear responses. Results are compiled to relate global and local elastic moduli, the Debye-Waller factor, and the density of excitations, leading to a quantitative theory testable in experiments. The thermal evolution of the excitation spectrum provides a parameter-free account of the activation energy, while their elastic interactions quantitatively reproduce dynamical heterogeneities via thermal avalanche processes. Synthesized together, these results lead to a framework where the evolution of the excitation spectrum, rather than the growth of a thermodynamic length scale, governs fragility in simple glass-forming liquids -- yet mean-field concepts of dynamical transitions remain central to describing excitations and building a real-space picture of relaxation.
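The fragility referred to above has a standard quantitative definition. Writing the relaxation time in super-Arrhenius form,

```latex
\tau(T) = \tau_0 \exp\!\left[\frac{E(T)}{k_B T}\right],
\qquad
m \equiv \left.\frac{d \log_{10}\tau}{d\,(T_g/T)}\right|_{T = T_g},
```

strong liquids (temperature-independent activation energy $E$) give small $m$, while fragile liquids, whose effective $E(T)$ grows on cooling, give large $m$. In the framework reviewed here, this growth of $E(T)$ is tied to the thermal evolution of the excitation spectrum rather than to a growing thermodynamic length scale.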
📄 While aerospace engineering can benefit greatly from collaborative knowledge management, its infrastructure is still fragmented. Bridging this divide is essential to reduce the current practice of redundant work and to address the challenges posed by the rapidly growing volume of aviation data. This study presents an accessible platform, built on Wikibase, to enable collaborative sharing and curation of aerospace engineering knowledge, initially populated with data from a recent systematic literature review. As a solid foundation, the Aerospace.Wikibase provides over 700 terms related to processes, software and data, openly available for future extension. Linking project-specific concepts to persistent, independent infrastructure enables aerospace engineers to collaborate on universal knowledge without risking the appropriation of project information, thereby promoting sustainable solutions to modern challenges while acknowledging the limitations of the industry.
📄 Large Language Models (LLMs) are increasingly deployed in resume screening pipelines. Although explicit PII (e.g., names) is commonly redacted, resumes typically retain subtle sociocultural markers (languages, co-curricular activities, volunteering, hobbies) that can act as demographic proxies. We introduce a generalisable stress-test framework for hiring fairness, instantiated in the Singapore context: 100 neutral job-aligned resumes are augmented into 4100 variants spanning four ethnicities and two genders, differing only in job-irrelevant markers. We evaluate 18 LLMs in two realistic settings: (i) Direct Comparison (1v1) and (ii) Score & Shortlist (top-scoring rate), each with and without rationale prompting. Even without explicit identifiers, models recover demographic attributes with high F1 and exhibit systematic disparities, with models favouring markers associated with Chinese and Caucasian males. Ablations show language markers suffice for ethnicity inference, whereas gender relies on hobbies and activities. Furthermore, prompting for explanations tends to amplify bias. Our findings suggest that seemingly innocuous markers surviving anonymisation can materially skew automated hiring outcomes.
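The disparities reported in the Direct Comparison (1v1) setting can be quantified as per-group win rates. A sketch of that bookkeeping (the group labels here are placeholders, not the study's demographic categories):

```python
from collections import defaultdict

def pairwise_win_rates(comparisons):
    """Win rate per demographic group from 1v1 resume comparisons.
    Each item is (group_a, group_b, winner) with winner "a" or "b";
    under a fair model every group's rate should sit near 0.5."""
    wins, totals = defaultdict(int), defaultdict(int)
    for group_a, group_b, winner in comparisons:
        totals[group_a] += 1
        totals[group_b] += 1
        wins[group_a if winner == "a" else group_b] += 1
    return {g: wins[g] / totals[g] for g in totals}
```

Systematic deviation of a group's rate from 0.5 across otherwise identical resume variants is the kind of disparity the framework above measures.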
📄 These are course notes for the 'Introduction to holography' Master level course at University of Cologne. The goal of the course is to give a pedagogical introduction to holography. Holography is a popular approach to quantum gravity, in which a theory of gravity can be described by a lower-dimensional boundary theory that itself has no gravity. The most concrete known example of a holographic model is the AdS/CFT correspondence, where the gravitational theory has a negative cosmological constant (the universe is asymptotically Anti-de Sitter) and the boundary theory is a conformal field theory. Symmetry plays a very important role in this duality. We therefore start the course with a review of Poincaré symmetry in quantum field theory, before moving on in the second chapter to conformal symmetry in conformally invariant quantum field theories or CFTs. Then we move to the basics of AdS physics in chapters 3 and 4, which will already reveal hints to the existence of a duality with CFT. After gathering the basic ingredients (CFT and AdS), in the second half of the course we are ready to formulate the AdS/CFT correspondence (chapter 5), including finite temperature AdS/CFT (chapter 6), which involves black holes and their thermodynamics in the gravitational theory (chapter 7). We end the course with an introduction to entanglement in AdS/CFT and the origin of statements that 'gravity emerges from entanglement' in holography.
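As an anchor for the AdS physics of chapters 3 and 4, the AdS$_{d+1}$ geometry is most often written in Poincaré coordinates,

```latex
ds^2 = \frac{L^2}{z^2}\left(-dt^2 + d\vec{x}^{\,2} + dz^2\right),
```

with AdS radius $L$ and the conformal boundary, where the dual CFT lives, at $z \to 0$.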