What I learned and didn't learn from the "bible" of A/B testing...
Ronny Kohavi’s book is a must-read for anyone dealing with experimentation, but the ‘application’ is, unsurprisingly, messier than theory. Turns out, experimentation is also a soft skill!
Even though I went to school for economics and applied math, my first exposure to A/B testing happened in 2017, when I was doing a so-called “modern” online MBA through Quantic. The material was very high-level, but it was enough for me to pass the job interview at Peloton and kick off some experiments while there.
The basic intuition that their materials captured is fairly straightforward:
If there is one and only one difference between two groups (which is achieved by randomly assigning users to groups and making sure the product is really introducing a single change), then we can reason about the impact of this change. Any externalities - what happened with the market, in the media, due to other departments’ launches - don’t muddle the assessment, given that they would affect both groups the same.
It is not enough to say that the metric increased in group B and didn’t increase in group A; we need to apply statistical testing to ensure that the difference we’re seeing isn’t just a product of sampling - i.e. of running the test at a specific moment in time
If we have enough users across the A and B groups and our metric is straightforward (e.g. conversion rate, which is just a representation of a binary outcome for each ‘test subject’), we can use an online calculator to check whether our results are indeed statistically significant (see the sketch after this list)
It is important to figure out up front how big a population your test needs. And you should wait to report the results until you actually have that sample size, so you aren’t reporting an accidental fluctuation as a statistically significant result
(Of course, the points above can be a bit more statistically rigorous, and some of them can be softened e.g. if you’re Bayesian, but I’m going with the basics here, bear with me!)
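To make that ‘online calculator’ step concrete, here is a minimal sketch of the same significance check in Python - the conversion counts are made up, and statsmodels is just one of many ways to run a two-proportion z-test:

```python
# Minimal significance check for a conversion-rate A/B test (made-up numbers).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]      # hypothetical successes in A and B
visitors = [10_000, 10_000]   # hypothetical users exposed in A and B

z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"A: {conversions[0] / visitors[0]:.2%}, B: {conversions[1] / visitors[1]:.2%}, p-value: {p_value:.3f}")
# If p_value < 0.05, the difference is unlikely to be a product of sampling alone.
```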
If you have the above down pat, you’re already doing better than ~70% of people that do A/B testing! Congratulations!
Book Smarts…
Now is a good time to tackle Ronny Kohavi’s book Trustworthy Online Controlled Experiments, which is rightfully dubbed the Bible of A/B testing. (I also recently learned that Ronny & his coauthors give all proceeds from the sales to charities, so it is a good purchase if the library waitlist is too long.)
I will confess, when I tackled this book initially, I got maybe 60% into it and ran out of gusto. It is not a dense book by any means (doesn’t feel like you’re reading an economics journal article, if you know what I mean), but it requires undivided attention. So when I joined Spring Health, we organized a book club to keep each other accountable for reading and discussing a few chapters a month.
This book is full of gems, from discussing experimentation culture, to experiment design, to technical specifics of experiment setup, to running experimentation programs at scale, and even ethics.
Here are a few things that I found the most impactful:
Information about things that can (and will) go wrong and ways to get ahead of them. For example, from this book I learned about Sample Ratio Mismatch (when you deploy a test at 50/50 but see uneven group sizes in your analysis). Turns out, you can run a statistical test to see if the difference in group sizes is big enough to be worrisome (see the sketch below). The book also goes into detail on A/A testing - i.e. practicing an A/B test rollout (randomization & data collection) without introducing any UX changes. I highly recommend doing it if you are deploying your first A/B test or getting into complex conditional logic for when users do and don’t need to be included in the test. When we ran one of those at Spring Health, we found that our data logging was not capturing test group assignments at the right time, and our assignments weren’t sticking across mobile and web.
Another resource I like on the topic of various experimentation bugs is this 5-minute talk from NormConf.
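For reference, here is a minimal sketch of that SRM check, assuming an intended 50/50 split (the group counts are hypothetical):

```python
# Sample Ratio Mismatch check: is the observed split suspiciously far from 50/50?
from scipy.stats import chisquare

observed = [50_912, 50_088]          # hypothetical users who actually landed in A and B
expected = [sum(observed) / 2] * 2   # what a true 50/50 split would produce

stat, p_value = chisquare(observed, f_exp=expected)
print(f"SRM check p-value: {p_value:.4f}")
```

A very small p-value here (a common rule of thumb is p < 0.001) means the assignment itself is likely broken, so don’t trust the metric readout until you’ve found the bug.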
This book also helped me better understand the role of ‘variance’ in the ultimate success of a test. Reducing variance improves the sensitivity of the test (its ability to detect the treatment effect when it exists). One simple way to do it is to consider whether the outcome metric can be made binary - e.g. instead of # of workouts per week on Peloton, it can be ‘did the user work out or not’ or ‘% of users who worked out at least X times’ - so you don’t fall victim to users who like to stack four Emma Lovewell Crush Your Classes every day.
And fancier methods like CUPED are really there to reduce variance and thus let us make decisions quicker - i.e. with a smaller sample.
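For the curious, here is a bare-bones sketch of the CUPED idea in Python, using each user’s pre-experiment value of the same metric as the covariate (the function and array names are my own illustration, not from the book):

```python
import numpy as np

def cuped_adjust(metric: np.ndarray, pre_metric: np.ndarray) -> np.ndarray:
    """metric: outcome during the experiment; pre_metric: the same metric before it started."""
    # theta is the regression slope of the in-experiment metric on its pre-experiment value
    theta = np.cov(pre_metric, metric)[0, 1] / np.var(pre_metric, ddof=1)
    # subtract out the part explained by pre-experiment behavior:
    # lower variance, same expected difference between groups
    return metric - theta * (pre_metric - pre_metric.mean())
```

The adjusted values are then compared between groups with the usual test; less variance means a smaller sample for the same MDE.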
Kohavi’s book has one of the best explanations of the gradual ramp for experiments. Turns out, rolling out an A/B test at 10% doesn’t mean that 90% of users are in control and 10% in treatment. It means that 10% of users are in the experiment, split 50/50 - i.e. 5% in treatment and 5% in control.
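A hypothetical sketch of what that allocation logic can look like (the hashing scheme and function name are my own illustration, not how any particular platform does it):

```python
import hashlib

def assign_group(user_id: str, ramp: float = 0.10) -> str:
    # deterministic hash -> a number in [0, 1), so assignment is sticky per user
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 10_000 / 10_000
    if bucket >= ramp:
        return "not_in_experiment"   # at a 10% ramp, ~90% of users are untouched
    return "treatment" if bucket < ramp / 2 else "control"   # ~5% and ~5%
```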
Lastly, both in the book and in his interview on Lenny’s podcast, Ronny mentions that the vast majority of experiments don’t show the expected improvements! This is a great thing to remember when you’re in a stretch of several non-‘stat sig’ tests and your product manager starts questioning whether it is humanly possible to move the metric and whether you need a new analysis methodology.
No kidding, I was once asked in an interview by a marketer: ‘If you got A/B test results that aren’t statistically significant, would you reanalyze the same test some other way to get to significance?’
Anyways, remember, the only ‘failed’ test is the one without learnings, and a ‘flat’ test doesn’t equal a test without learnings.
… “Street” Smarts
Reading this book feels almost like reading a textbook - it is so comprehensive and written with so much precision. You feel more than empowered to go and run an experimentation program. But of course, reality starts to diverge from the theory pretty quickly. The author draws heavily on his experience at larger companies like Microsoft, Amazon, and Airbnb, and while the book covers the ‘crawl -> walk -> run’ stages of experimentation maturity, sometimes it takes real effort to get to crawling in the first place - and not even for technical reasons.
When I joined my first B2B product team, after my stints doing experimentation in a B2C context, the team was hoping to launch a redesigned part of the onboarding flow as a test - their first ever. There were quite a few new customers joining soon, meaning that there would be lots of new users going through onboarding, so getting to the right sample size would be trivial.
Except… the sales team told us we couldn’t test on any customers that had joined in the previous 90 days, leaving us with a much longer timeline to reach statistical significance. In my mind, it was so odd - why not try to accelerate the path for our new users to get a better experience? Wouldn’t the customers be excited to get in on some innovation?
Turns out, I needed to learn more about the workings of this product’s sales and enablement cycles. The customers’ implementation leads were getting extensive demos of what the ‘sign up’ experience would look like for their users. Make it even subtly different, and if their employees (the end users) got confused, it would cause lots of swirl internally - eventually reaching our sales and support teams.
‘But how would we run our A/B test without sufficient sample?’, we asked.
‘Can customer X be ‘A’ and customer Y be ‘B’?’, the enablement team asked.
‘But this is not how any of this works…’ I thought to myself, frustratedly.
Eventually, we learned more about the risks around new customer launch, led an A/B testing ‘lunch and learn’ for the product and sales teams, and met in the middle - we were able to work with some new customers, but not others.
Unfortunately, the A/B test ended up being flat, which is a story for another day (and another learning here: for the first ever A/B test in a company, it is better to start with something that is more likely to be a shoo-in, to create momentum!)
Just take the MDE and plug it into the calculator!
Another thing that books and articles make sound academic and precise is A/B test design.
You have the metric baseline and the historical ‘traffic’ to the experiment entry point. You define the Minimum Detectable Effect (MDE) by looking at how much your target metric moved with past tests or launches and combining that with ‘practical significance’. Then you use 95% confidence and 80% power (and don’t touch those, they’re an industry standard! /s), plug it all into the calculator… boom, here is your sample size.
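Here is roughly what that ‘calculator’ step looks like in Python, with a made-up 10% baseline conversion rate and a 1 p.p. MDE:

```python
# Sample size per group for a conversion-rate test:
# 10% baseline, 1 p.p. absolute MDE, 95% confidence, 80% power (all hypothetical).
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline, mde = 0.10, 0.01
effect = proportion_effectsize(baseline + mde, baseline)   # Cohen's h for the two proportions
n_per_group = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)
print(f"~{n_per_group:,.0f} users per group")
```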
In reality though, the Minimum Detectable Effect is like… folklore.
If you haven’t experimented on the same metric before, it is really hard to assess how sensitive it will be. If the metric is fairly far from the business impact (e.g. some engagement metric with a long lag before it propagates into churn rate movement), assessing practical significance also involves a bunch of assumptions… that each require a few experiments in the first place.
The way it often works in practice is that you present the product manager with options: “If we run this for 4 weeks, we can detect a 5 p.p. effect with 95% confidence; if we want to learn faster, we can go for a larger MDE or be OK with a higher chance of a false positive.” Having this discussion in a way that doesn’t leave your product partners feeling pelted with stats is an art and a science in itself.
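One way to frame that conversation is to flip the calculation around and show what MDE each run length buys you. A rough sketch, assuming a 10% baseline and a hypothetical 25,000 eligible users per group per week:

```python
# For each candidate run length, solve for the smallest detectable lift
# at 95% confidence / 80% power, given weekly traffic per group (hypothetical numbers).
import numpy as np
from statsmodels.stats.power import NormalIndPower

baseline, weekly_users_per_group = 0.10, 25_000

for weeks in (2, 4, 8):
    n = weeks * weekly_users_per_group
    # smallest detectable effect size (Cohen's h) at this sample size
    h = NormalIndPower().solve_power(effect_size=None, nobs1=n, alpha=0.05, power=0.8)
    # convert Cohen's h back into an absolute lift in conversion rate
    lift = np.sin(np.arcsin(np.sqrt(baseline)) + h / 2) ** 2 - baseline
    print(f"{weeks} weeks: can detect a lift of ~{lift * 100:.2f} p.p. or more")
```

Laying the options out like this keeps the decision about trade-offs (speed vs. sensitivity) with the product team, rather than turning it into a stats lecture.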
Really, all it comes down to is: we need practice to figure it out, putting some ‘reps’ under our belt while we still have more questions than solutions. It may take months or even years for things to really click the way the book describes, and it can feel frustrating that things aren’t as clear-cut and scientific. But why not treat it as an experiment too? Very meta, but this approach takes a lot of the stress away!
Anyway, here are some reading materials…
To take it full circle… I still like reading, especially materials that include ‘case studies’. So of course I will share some resources if you are on the path to becoming a better ‘experimentator’!
If you want a CliffsNotes edition of Ronny’s book, hit me up in the comments - I can share the 23 pages of notes I compiled for our book club meetings!
I really like the book Statistics Done Wrong - it is written around research studies rather than product experimentation, but the stats are the same
If you follow some folks from Eppo on LinkedIn - particularly Sven and Evan - they share great recaps of their journal club pursuits and quality blog posts on experimentation-related topics
There are some great courses out there too - Reforge’s Experimentation + Testing course and Stephanie Pancoast’s Practical A/B Testing course on Uplimit are both very high quality!
As for me, catch me at the Data Mishaps Night on March 7 talking about… data pipelines unraveling! See you then!