Nex-N2

Thinking, built for action

Leading performance in real-world productivity scenarios

Nex-N2-Pro keeps pace with top-tier models and performs especially well on coding and long-horizon tasks.

Reasoning and action, one way of thinking

Nex-N2 unifies the Thinking paradigm across all task types. Whether the task is Search, Coding, or Agentic Tool Calling, the model's chain of thought follows the same structural pattern: goal decomposition, state tracking, strategy adjustment, and self-verification. This consistency pays off most in mixed tasks, such as a single coding task interleaved with searches and tool calls.

Adaptive reasoning based on task complexity

Nex-N2 can decide on its own whether to engage Thinking and can dynamically adjust its reasoning intensity. Compared with forcing Thinking on for every turn, Adaptive Thinking maintains the task completion rate while significantly reducing reasoning token consumption, so compute is allocated more efficiently.

[Figure legend: Adaptive Thinking, Forced-on, Forced-off]

Under Adaptive Thinking, Nex-N2-mini performs significantly better than with the chain of thought forced off, and it matches or slightly exceeds the configuration that forces the chain of thought on for every turn, while saving roughly 20% in overall token cost.

Across three task types, Nex-N2 exhibits three reasoning profiles, each fitted to the structure of the task. In search tasks, it reasons most at the start, when breaking down the search strategy, and again at the end, when synthesizing evidence. In SWE-style tasks, reasoning is densest during the bug-localization and fix-verification phases. In OpenClaw open-ended long-horizon tasks, reasoning deepens progressively as the task advances and peaks when results are integrated at the finish. In each case, reasoning concentrates where uncertainty is high and key decisions are needed, which makes it more efficient.

Built with Nex-N2

Evaluation

Category	Benchmark	Nex-N2-mini	Nex-N2-Pro	GPT-5.5	Opus 4.7	Kimi-K2.6	GLM-5.1	MiniMax M3	DeepSeek-V4-Pro
Agent	BrowseComp	74.1	83.7	84.4	79.8	83.2	79.3	83.5	83.4
	GDPval	1402	1585	1769	1753	1481	1535	–	1554
	Toolathlon	33.3	51.9	55.6	52.8	50.0	40.7	–	52.8
	WildClawBench	47.7	53.5	58.2	62.2	–	48.2	–	43.7
	Widesearch	62.0	75.6	–	–	80.8	–	–	–
	TAU3	65.9	71.1	–	–	–	70.6	–	–
Coding & SWE	SWE-Bench Pro	50.2	58.8	58.6	64.3	58.6	58.4	59.0	55.4
	Terminal-Bench 2.1	60.7	75.3	83.4	69.7	–	58.7	66.0	72.0
	DeepSWE	8.0	33.6	70	54	24	18	–	8
	SWE-Bench Verified	74.4	80.8	82.9	87.6	80.2	–	80.5	80.6
	SWE Atlas QnA	31.5	37.9	45.4	45.2	–	–	37.9	–
	SWE Atlas RF	30.0	32.9	44.8	48.6	–	–	–	–
	SWE Atlas TW	23.3	40.0	42.6	38.2	–	–	30.8	–
General & Reasoning	GPQA Diamond	82.6	90.7	93.6	94.2	90.5	86.2	–	90.1
	IFEval	89.1	94.0	–	–	94.5	94.5	–	91.9
	Apex	9.4	36.5	–	–	24.0	11.5	–	38.3

BrowseComp: We tested two context-management strategies. A simple context-summarization strategy scored 77, while the discard-all strategy, the same one used by DeepSeek-V3.2, scored 83.7.
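The two context strategies named in the BrowseComp note can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Message` type, the whitespace-based `count_tokens`, the `budget` parameter, and the `summarizer` callback are hypothetical, and the exact rules used in the evaluation are not stated in this document. Discard-all is shown in its commonly described form: once the context exceeds the budget, the earlier assistant and tool history is dropped and only the task itself is kept.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str      # "system", "user", "assistant", or "tool"
    content: str


# Roles that carry the task itself rather than the agent's working history.
TASK_ROLES = ("system", "user")


def count_tokens(messages: List[Message]) -> int:
    """Crude whitespace count; a stand-in for a real tokenizer (assumption)."""
    return sum(len(m.content.split()) for m in messages)


def discard_all(messages: List[Message], budget: int) -> List[Message]:
    """If the context is over budget, drop all assistant/tool history."""
    if count_tokens(messages) <= budget:
        return list(messages)
    return [m for m in messages if m.role in TASK_ROLES]


def summarize_context(
    messages: List[Message],
    budget: int,
    summarizer: Callable[[List[Message]], str],
) -> List[Message]:
    """If over budget, replace assistant/tool history with one summary turn."""
    if count_tokens(messages) <= budget:
        return list(messages)
    task = [m for m in messages if m.role in TASK_ROLES]
    history = [m for m in messages if m.role not in TASK_ROLES]
    return task + [Message("assistant", summarizer(history))]


if __name__ == "__main__":
    history = [
        Message("system", "You are a browsing agent."),
        Message("user", "Find the paper's venue."),
        Message("assistant", "search: paper venue"),
        Message("tool", "result " * 50),
    ]
    trimmed = discard_all(history, budget=20)
    print([m.role for m in trimmed])  # ['system', 'user']
```

The difference between the two is what survives an overflow: discard-all restarts from the task statement alone, while summarization carries a compressed trace of earlier steps forward.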

Terminal-Bench 2.1: Harbor/NexAU; 4-hour timeout, 8 CPU / 16 GB RAM; temp=0.7, top_p=0.95, top_k=40, max_tokens=64K, 256K context; average of 5 runs.

DeepSWE: Harbor/MiniSWE; 3-hour timeout, 2 CPU / 8 GB RAM; temp=0.7, top_p=0.95, top_k=40, max_tokens=64K, 256K context; average of 3 runs.

SWE-Bench Pro / SWE-Bench Verified: Evaluated on the optimized benchmark using Harbor/NexAU; temp=0.7, top_p=0.95, top_k=40, max_tokens=64K, 256K context.

SWE Atlas QnA / RF / TW: Harbor/NexAU; temp=0.7, top_p=0.95, top_k=40, max_tokens=64K, 256K context.
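The settings in the notes above share one set of sampling parameters and differ only in harness, resources, and run count. The sketch below collects them in a small config object and averages per-run scores, as the "average of N runs" notes describe. The `EvalConfig` class, the `CONFIGS` dictionary, and `reported_score` are hypothetical names, not part of Harbor or NexAU, and reading 64K and 256K as multiples of 1024 is an assumption.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass(frozen=True)
class EvalConfig:
    harness: str
    runs: Optional[int] = None            # None where the notes give no run count
    timeout_hours: Optional[float] = None
    cpus: Optional[int] = None
    ram_gb: Optional[int] = None
    # Sampling settings shared by every note above.
    temperature: float = 0.7
    top_p: float = 0.95
    top_k: int = 40
    max_tokens: int = 64 * 1024           # "64K", assumed to mean 65,536
    context_tokens: int = 256 * 1024      # "256K", assumed to mean 262,144


CONFIGS = {
    "Terminal-Bench 2.1": EvalConfig(
        "Harbor/NexAU", runs=5, timeout_hours=4, cpus=8, ram_gb=16
    ),
    "DeepSWE": EvalConfig(
        "Harbor/MiniSWE", runs=3, timeout_hours=3, cpus=2, ram_gb=8
    ),
    "SWE-Bench Pro": EvalConfig("Harbor/NexAU"),
    "SWE-Bench Verified": EvalConfig("Harbor/NexAU"),
    "SWE Atlas": EvalConfig("Harbor/NexAU"),
}


def reported_score(per_run_scores: List[float], config: EvalConfig) -> float:
    """Average per-run scores, checking the count against the config."""
    if config.runs is not None and len(per_run_scores) != config.runs:
        raise ValueError(
            f"expected {config.runs} runs, got {len(per_run_scores)}"
        )
    return round(mean(per_run_scores), 1)
```

Keeping the shared sampling parameters as defaults makes it easy to see at a glance which benchmarks deviate from them; here, none do.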

GDPval: Scores are based on an internal reproduction of the GDPval-AA evaluation.

We did not specifically enhance the multimodal capabilities of Nex-N2; it retains only its basic vision capabilities.

Open-source models

Nex-N2-Pro

Base · Qwen3.5-397B-A17B

Suited for complex reasoning, multi-agent orchestration, and advanced software engineering tasks.

Nex-N2-mini

Base · Qwen3.5-35B-A3B-Base

Suited for high-speed instruction following, real-time tool execution, and cost-effective large-scale deployment.