We abandoned our A/B test after rolling out our LLM-based summaries feature to wave 1 workspaces at week 20. Now I'm staring at post-launch metrics and need a causal effect estimate, something concrete for the report.
Has anyone else hit walls with traditional A/B testing when integrating AI features? How did you handle it?
I'm leaning toward difference-in-differences (DiD) in Python but could use some tips or case studies. Any success stories using DiD to measure LLM impacts would be a game changer! Here's roughly where I've gotten so far:
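For context, this is the minimal sketch I'm playing with, using `statsmodels`. The column names (`workspace_id`, `week`, `engagement`) and the synthetic panel are placeholders for our real data; the `treated:post` interaction coefficient is the DiD estimate of the rollout effect:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulated weekly panel: 200 workspaces observed for 30 weeks.
# Wave 1 (the first half) gets the feature at week 20.
n_workspaces, n_weeks, rollout_week = 200, 30, 20
df = pd.DataFrame(
    [(w, t) for w in range(n_workspaces) for t in range(n_weeks)],
    columns=["workspace_id", "week"],
)
df["treated"] = (df["workspace_id"] < n_workspaces // 2).astype(int)
df["post"] = (df["week"] >= rollout_week).astype(int)

# Outcome: workspace fixed effect + common time trend + a true +2.0
# lift for treated workspaces after rollout, plus noise.
workspace_fe = rng.normal(0, 3, n_workspaces)
df["engagement"] = (
    workspace_fe[df["workspace_id"]]
    + 0.1 * df["week"]
    + 2.0 * df["treated"] * df["post"]
    + rng.normal(0, 1, len(df))
)

# Two-by-two DiD: "treated * post" expands to treated + post +
# treated:post, and the treated:post coefficient is the effect
# estimate. Cluster standard errors at the workspace level since
# observations within a workspace are correlated over time.
model = smf.ols("engagement ~ treated * post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["workspace_id"]}
)
print(model.summary().tables[1])
```

The part I'm least sure about is the parallel-trends assumption: wave 1 and later-wave workspaces need to have been trending similarly before week 20, which I'm planning to sanity-check by plotting pre-rollout group averages per week.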
Article:
https://www.freecodecamp.org/news/why-ab-testing-breaks-in-ai-rollouts-and-how-to-fix-it/