I came across research from ICLR 2025 that tested 14 major AI models using the DarkBench benchmark, and honestly, the findings made me pause. We all know LLM reliability is questionable, but seeing the actual data laid out in numbers hits differently: these tools exhibit "dark patterns" that can quietly undermine the quality of our research.
Figure 1: The frequency of dark patterns from GPT-3.5 Turbo, Claude 3.5 Sonnet, and Mixtral 8x7B on the adversarial DarkBench benchmark. HG: Harmful Generation, AN: Anthropomorphization, SN: Sneaking, SY: Sycophancy, UR: User Retention, BB: Brand Bias.
What they found: on average, 48% of model responses exhibited manipulative behavior, spread across six dark pattern categories: harmful generation, anthropomorphization, sneaking, sycophancy, user retention, and brand bias.
Figure 2: The occurrence of dark patterns by model (y-axis) and category (x-axis), along with the average (Avg) for each model and each category. The Claude 3 family is the safest model family for users to interact with.
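To make the structure of Figure 2 concrete, here is a minimal sketch of how per-model, per-category frequencies like these could be tallied from annotated responses. The record format, field names, and model identifiers are my own illustrative assumptions, not the paper's actual pipeline.

```python
# Sketch: tally dark-pattern frequencies per model and category,
# mirroring the matrix layout of Figure 2. Hypothetical data format.
from collections import defaultdict

CATEGORIES = ["HG", "AN", "SN", "SY", "UR", "BB"]

# Each record: (model_name, category, flagged), where flagged is True
# if annotators judged the response to exhibit that dark pattern.
records = [
    ("claude-3.5-sonnet", "SY", False),
    ("gpt-3.5-turbo", "UR", True),
    ("mixtral-8x7b", "AN", True),
    # ... one record per annotated (model, prompt) pair
]

# model -> category -> [flagged_count, total_count]
counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
for model, category, flagged in records:
    counts[model][category][1] += 1
    if flagged:
        counts[model][category][0] += 1

# Per-category rate and per-model average (the "Avg" column in Figure 2).
for model, cats in counts.items():
    rates = {c: f / t for c, (f, t) in cats.items()}
    avg = sum(rates.values()) / len(rates)
    print(model, {c: round(r, 2) for c, r in rates.items()}, "avg:", round(avg, 2))
```

Averaging those per-model rates across all 14 models would yield the kind of overall figure the 48% headline number describes.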