The researchers empirically proved a devastating side effect of Reinforcement Learning from Human Feedback (RLHF), the popular method the industry uses to make AI assistants accurate and polite. RL-finetuning visually increases answer accuracy but simultaneously degrades the model's baseline robustness and breaks logical consistency (Chain-of-Thought). Simply put, the algorithm learns to give the "correct" final answer to satisfy the evaluator, but loses the ability to reason sequentially at the slightest deviation in context. This self-critical work from Apple is a signal to the entire B2B market: blindly finetuning models to meet business KPIs makes the system fragile and unsuitable for critical processes.
Source: Apple ML Research / CVPR
R&DAppleRLHFVLMSafety