"Informed AI News" is a publication aggregation platform that surfaces only the most valuable information, aiming to eliminate information asymmetry and break through information cocoons.

Enhancing Mathematical Reasoning with Step-Controlled DPO

Step-Controlled DPO (SCDPO) extends Direct Preference Optimization (DPO) to improve the mathematical reasoning of large language models. It adds stepwise error supervision: negative samples are built by keeping the first steps of a correct solution and letting the reasoning go wrong from a chosen step onward. Training on these pairs sharpens the model's ability to detect errors, which improves reasoning accuracy. SCDPO improves performance across the models it was applied to, most notably on mathematical tasks. A 20B model trained with SCDPO scores 88.5% on GSM8K and 58.1% on MATH, rivaling top open-source LLMs.
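
To make the idea concrete, here is a minimal Python sketch of how such a preference pair might be assembled and scored with the standard DPO loss. It is an illustration under stated assumptions, not the authors' implementation: `PreferencePair`, `build_step_controlled_pair`, and the `flawed_continuation` callback are hypothetical names, the log-probabilities are plain floats supplied by the caller, and `beta=0.1` is an arbitrary default.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # the fully correct solution
    rejected: str    # correct up to error_step, flawed afterwards
    error_step: int  # index of the first step that may be wrong


def build_step_controlled_pair(
    prompt: str,
    correct_steps: List[str],
    k: int,
    flawed_continuation: Callable[[str, List[str]], List[str]],
) -> PreferencePair:
    """Keep the first k correct steps, then append a flawed continuation.

    flawed_continuation is a hypothetical stand-in for whatever sampler
    produces erroneous reasoning from a correct prefix.
    """
    if not 0 <= k < len(correct_steps):
        raise ValueError("k must index a step inside the correct solution")
    prefix = correct_steps[:k]
    rejected_steps = prefix + flawed_continuation(prompt, prefix)
    return PreferencePair(
        prompt=prompt,
        chosen="\n".join(correct_steps),
        rejected="\n".join(rejected_steps),
        error_step=k,
    )


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable form of log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

A caller would build one pair per chosen error step and feed the sequence log-probabilities from the policy and the frozen reference model into `dpo_loss`. The loss falls as the policy prefers the correct solution over the flawed one more strongly than the reference model does.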
