Enhancing Mathematical Reasoning with Step-Controlled DPO

Step-Controlled DPO (SCDPO) extends Direct Preference Optimization (DPO) to improve the reasoning ability of large language models. It adds stepwise error supervision by constructing negative samples that begin with correct reasoning steps and become erroneous from a chosen step onward. Training on these samples helps the model locate reasoning errors and improves the accuracy of its solutions. Applied to several models, SCDPO improves performance, most notably on mathematical tasks. A 20B model trained with SCDPO scores 88.5% on GSM8K and 58.1% on MATH, which is competitive with the top open-source LLMs.
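The summary gives no implementation details, so the following is only a minimal sketch of the idea under stated assumptions. It builds one preference pair whose rejected solution shares a correct prefix with the chosen one and diverges at a given step, then scores the pair with the standard DPO objective. The names `PreferencePair`, `build_step_controlled_pair`, `continue_with_errors`, and `toy_flawed_continuation` are hypothetical, and the toy continuation merely stands in for whatever process actually produces the erroneous steps.

```python
import math
from typing import Callable, List, NamedTuple


class PreferencePair(NamedTuple):
    prompt: str
    chosen: str      # fully correct solution
    rejected: str    # correct before `error_step`, flawed from there on
    error_step: int  # index of the first erroneous step


def build_step_controlled_pair(
    problem: str,
    correct_steps: List[str],
    error_step: int,
    continue_with_errors: Callable[[str, List[str]], List[str]],
) -> PreferencePair:
    """Keep the first `error_step` correct steps, then append a flawed tail."""
    if not 0 <= error_step < len(correct_steps):
        raise ValueError("error_step must index an existing step")
    prefix = correct_steps[:error_step]
    # Assumption: the callable returns a continuation that contains an error.
    flawed_tail = continue_with_errors(problem, prefix)
    return PreferencePair(
        prompt=problem,
        chosen="\n".join(correct_steps),
        rejected="\n".join(prefix + flawed_tail),
        error_step=error_step,
    )


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * log-ratio margin))."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def toy_flawed_continuation(problem: str, prefix: List[str]) -> List[str]:
    # Hypothetical stand-in: a hard-coded wrong step, not a model sample.
    return ["So 3 * 4 = 13.", "The answer is 13."]


steps = ["There are 3 boxes with 4 apples each.", "So 3 * 4 = 12.", "The answer is 12."]
pair = build_step_controlled_pair(
    "How many apples are in 3 boxes of 4?", steps, 1, toy_flawed_continuation
)
```

Here `pair.rejected` keeps the first correct step and goes wrong at step index 1, so the only difference between the chosen and rejected solutions lies after the shared prefix. The sequence log-probabilities passed to `dpo_loss` would come from the policy and reference models in a real training loop; they are plain floats here to keep the sketch self-contained.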

