On today’s episode of #Growth, host Matt Bilotti is talking all about testing – A/B testing to be exact. He breaks down the best ways to test, teaches us all about Bayesian Statistics and chats through this and more with Guy Yalif, co-founder and CEO of Intellimize.
Subscribe & Tune In
Matt Bilotti: Hello and welcome to another episode of Hashtag #Growth. I am your host Matt Bilotti and I am super excited to be joined by one of the most knowledgeable guys I know on the topic of A/B testing to talk all about A/B testing and his name is Guy Yalif and I’m so excited to have him. Thanks for joining Guy.
Guy Yalif: Matt, thanks for having me. Great to be here.
Matt: Today I want to talk about A/B testing. I know people are pretty familiar with the basics: you run two things at the same time, one does better than the other, and then you pick the winner and run forward with it. But more recently I’ve been learning about this thing called Bayesian statistics through one of the tools that we use at Drift, Intellimize, which Guy is the CEO of. It incorporates this Bayesian statistics approach, and I was very confused at the beginning as to what it meant and how exactly it worked.
As I learned more and more, I realized that it really feels like the future of A/B testing for a lot of different reasons and excited to go a little bit further into that. So, Guy, maybe if you want to give a quick intro on your background and then we can dive right into the topic.
Guy: Sounds great Matt and yes, it’ll be interesting to dig further into it. I am an aerospace engineer. I spent half of college coding AI to design airplanes. Thought I was going to do that for the rest of my life and love the idea. I then spent 10 years as a product guy and 10 years as a marketing guy before starting Intellimize with two longtime machine learning friends.
We automatically optimize websites by personalizing the experience for each individual visitor and Bayesian statistics is one of several techniques we use to help do that better.
Matt: Very cool. I also had no idea that you did the aerospace stuff. That’s amazing. All right, why don’t we jump in quick, high level for … I gave the basics of A/B testing, maybe if you would just want to give us a quick run through of from your perspective, what is A/B testing and why does it matter and how do you measure it and then we can jump a little bit further into Bayesian after that.
Guy: Sounds great. A/B testing is a great way for us as marketers, as growth professionals with a number on our head. We have to deliver more revenue, more customers, more leads to sales every day. It’s a great way for us to do data-driven marketing, and at its core, just as you said, we have an idea and we want to know if that experience is better to show everyone going forward than what’s on our site currently, so that we can go deliver more revenue, more customers and so on.
Just as you said, we show the current idea that’s on our site. We show the new idea and we flip a coin randomly allocating everyone to one or the other, 50-50 and we use math whose goal is to tell us, “Hey, are these two ideas performing the same? Are they not performing the same?” If they’re not performing the same, we then go look at the conversion rates of each one of those two ideas and assume that the higher observed conversion rate is in fact the higher performer.
We go to engineering, we ask them to code it into the base site, and we show it to everyone forevermore. The stats on which that’s based are the stats we all learned in college.
Matt: Very cool. Now, getting a little bit deeper, people know the term statistical significance. When I first started doing A/B testing, both on the website and in the product, all across our business, there was this concept of statistical significance. What exactly does that mean, and what is the difference between 90% and 95%?
Guy: Statistical significance is a measure of how likely it is that what we’re seeing is due to chance. If we’re seeing our new idea, let’s call it variation B, performing better than our original site, variation A, and we have 90% statistical significance, it’s saying that with 90% confidence these two variations do not perform alike, and we’re going to assume variation B is better.
There’s a 10% chance that actually this is just random noise. If you up that to 95%, now there’s only a 5% chance what we’re seeing is really just a bunch of noise. Why does this matter? Because if you’re an organization running a bunch of A/B tests, eventually you’re going to run into one of those noisy ones.
You’re really saying, how comfortable am I drawing a conclusion when there’s really no conclusion to be drawn at all? If I run 20 tests at 90% statistical significance, then on average two of those tests will look like they have a conclusion when there isn’t one at all. Make sense?
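To put numbers on Guy's point, here is a quick simulation sketch (an editor's illustration, not from the episode): run 20 A/A tests where both arms truly convert at the same rate, so any "significant" result is pure noise, and count how often a standard two-proportion z-test declares 90% significance anyway.

```python
import math
import random

random.seed(42)

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test p-value for a difference in conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    # Normal CDF via math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 20 A/A tests: both arms share the same true 10% conversion rate,
# so every "significant" result is a false positive.
false_positives = 0
for _ in range(20):
    a = sum(random.random() < 0.10 for _ in range(5000))
    b = sum(random.random() < 0.10 for _ in range(5000))
    if two_proportion_p_value(a, 5000, b, 5000) < 0.10:
        false_positives += 1

print(false_positives)  # on average about 2 of the 20 come up "significant"
```

The traffic and conversion numbers here are made up; the point is only that a 10% significance threshold means roughly two false alarms per 20 truly null tests.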
Matt: Yes, makes a lot of sense. One of the things that we were talking about before you jumped on here was about this concept that as you’re running the test, you can be tricking yourself by checking it along the way. Checking it an hour from now versus four hours from now and explain how that can mess up your results and what exactly that means in the frame of statistical significance.
Guy: It’s a great question, and it’s something just about everybody I know, including myself, is guilty of when running an A/B test. The statistics we’re using for these are called frequentist statistics. That label on its own is not as important, but the core underlying concepts are ones we all learned in college.
Those are predicated on the notion that we’re going to say, “Hey, sometime from now, let’s say it’s a week from now, I want to test whether these two variations perform the same.” That’s really what I’m doing. A week from now, I’m going to with 95% confidence know whether or not these two variations perform differently.
It turns out that with the existing math, if we’re checking our tests all along the way, they may go above 90% statistical significance in some interim period and then drop back down before we reach the week mark, which is when we’re supposed to be looking at the results given the math.
This results in false positives where we think we’ve got a winner delivering lift. But in reality when we deploy this winner to production, there’s no lift to be had. We could commit to waiting for a week but in reality, in a business setting, we likely feel pressure to call a winner quickly or keep the test running that’s inconclusive longer.
By checking the test every day, we’re more likely to have false positives or type one errors, declaring a winner when in fact had we given it enough time, there wouldn’t be a winner.
Matt: Okay. This right here is at the crux of why I’ve been learning about this Bayesian statistics thing. It sounds like it’s really one of those things that helps address this core problem that most people aren’t thinking about. At least when I started A/B testing, I was like, “Whatever, that’s going to happen, but I’m 90% sure, so just run with it.”
It’s really easy to just run forward. Can you talk about Bayesian statistics? It’s a really complicated topic, but maybe explain it with some high-level metaphor that’s easy to grasp, and then we can talk a little bit about how it helps you avoid false winners.
Guy: Absolutely, and because we all ignore that in the existing stats, we have the situation where at the end of the quarter or the end of the year, those above us in the organization say, “If I added up all the wins you declared for the last quarter a year, man we should have doubled our revenue, but it only went up 20%. How did that happen?”
Matt: Been there.
Guy: Totally, right?
Matt: What is Bayesian statistics in a frame of a metaphor?
Guy: It is intended to mitigate this challenge of checking results early, and it’s different from frequentist statistics in several ways. Frequentist statistics require that fixed time horizon, that week. Let’s pick a really simplistic example: “I want to decide, does the sun rise in the east or does it rise in the west?”
With frequentist statistics, I may say, look, I need to wait a month to figure that out. I’ve got to wait a month before I check. With Bayesian statistics, you can look at the results at any point in time and make a decision based on the probability that it’s rising in the east or the west. That’s one.
Two, frequentist statistics use only what I’ve seen in the test. With frequentist statistics, I pretend I’ve never seen the sun rise before; I observe it for 30 days and eventually conclude, hey, it probably rises in the east, because 30 out of 30 times it rose in the east.
With Bayesian statistics, you can use what you’ve learned before as a hypothesis that you then refine, and that gives you a couple of benefits. One, you can often reach a conclusion more quickly, which is something all of us want, because you start with an initial hypothesis. With the sun, it’s straightforward. On your website, you probably have an idea of your average conversion rate, so you can start your test with that as a hypothesis rather than relearning the thing you already know.
Second, depending on how you set things up, you can probably do a better job of figuring out what’s driving lift if you incorporate what you’ve learned before. You can decipher better and more quickly: did this variation perform better because it itself is better, or was it really the visitor?
Maybe, I don’t know, a super wealthy person tends to buy more. Or the context of the visitor: this ad campaign does much better. We too found this useful in helping decipher what lift is really driven by the variation versus other things. The third way Bayesian statistics is different is that it takes into account measurement error.
It recognizes that what I’m seeing may not in fact be the ground truth. With the sun rising in the east, it’s straightforward: it’s always going to rise in the east, and you’re always going to see that. With a variation, I may, for example, have a variation whose true conversion rate, meaning if I let it run forever, is 20%. I may show the variation to 10 visitors, and let’s say four of them convert. In a frequentist world, you would say the conversion rate is 40%, and I’d make decisions on that.
In a Bayesian world, I would say the conversion rate is 40% plus or minus some really large range, I’ll make it up, 30%. You’ve got some really large confidence interval, and if I show the variation to more people, I would expect the observed conversion rate to trend towards 20%. In a Bayesian world, at the same time, that confidence interval would get tighter.
My prediction of the true conversion rate would get tighter and tighter, so that I gain more and more confidence that it’s within the range I’ve set up. In that way, it’s quite different from the thinking we all go through with frequentist statistics, where we just say, look, that conversion rate is 40%, period. Makes sense?
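The shrinking-interval behavior Guy describes falls naturally out of a Beta-Binomial model, a common (though not the only) Bayesian setup for conversion rates. This sketch is an editor's illustration assuming a flat Beta(1, 1) prior and Guy's 4-of-10 example:

```python
import random

random.seed(7)

def credible_interval(conversions, misses, draws=100_000):
    """95% credible interval for a conversion rate under a Beta(1, 1) prior,
    estimated by sampling from the Beta posterior."""
    samples = sorted(
        random.betavariate(1 + conversions, 1 + misses) for _ in range(draws)
    )
    return samples[int(0.025 * draws)], samples[int(0.975 * draws)]

wide = credible_interval(4, 6)         # 4 of 10 visitors converted: observed 40%
tight = credible_interval(2000, 8000)  # 2,000 of 10,000: same model, far more data
print(f"10 visitors:     {wide[0]:.2f} to {wide[1]:.2f}")
print(f"10,000 visitors: {tight[0]:.3f} to {tight[1]:.3f}")
```

With only 10 visitors the interval is huge; with 10,000 it collapses to a narrow band around the true 20%, which is the tightening Guy is describing.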
Matt: Yes, it makes sense. At the end of the day, it’s really a difference between, with frequentist statistics, the A/B testing most of us use or know, looking at the conversion rate and saying it’s better or not, whereas a Bayesian type system lets you get closer and closer. You start with a range and narrow it down. Rather than watching a number go up and down, you watch a range close down, and you can confidently know the range is getting tighter, versus asking, are you sure the average it’s at right now is the true average?
Guy: Matt, you’re exactly right, and to put a finer point on it, the Bayesian approach, in doing just what you said, is a sequential one. What do I mean? You can come in with a hypothesis, let’s say that the sun rises in the west. For some reason I believe that. With Bayesian statistics, I’ll update my understanding of that hypothesis after every new data point.
I might see the sun rise in the east a couple of times, and you know what? My hypothesis will start shifting. I’ll start believing, hey, it’s more possible that it rises in the east. As I see it rise in the east more and more often, I will update my prediction, update my belief about what is actually happening, step by step, so that at any point I could stop and say, “Well, does it rise in the east or does it rise in the west?” That question, unlike in frequentist statistics, is not, does it cross a certain p-value and then I’ve got a yes or no answer.
It’s not like that. I’ve got these two probability distributions. The probability of it rising in the west is getting lower and lower, worse and worse. The probability of it rising in the east is getting higher and higher, tighter and tighter, and I can make the call about how much these probability distributions overlap. I can decide whether I’m comfortable with these probabilities concluding, “You know what? It more likely rises in the east.” And I can do that at any point in time.
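The overlap question Guy raises is often answered by sampling from both posterior distributions and counting how often one beats the other. A minimal sketch, again assuming Beta(1, 1) priors and made-up traffic numbers:

```python
import random

random.seed(3)

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000):
    """Monte Carlo estimate of P(true rate of B > true rate of A),
    with Beta(1, 1) priors on both conversion rates."""
    wins = 0
    for _ in range(draws):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# A: 100 of 1,000 converted (~10%); B: 125 of 1,000 (~12.5%)
p = prob_b_beats_a(100, 1000, 125, 1000)
print(p)  # high probability B is better, but not certainty
```

Unlike a p-value threshold, this number can be read at any point in the test: you decide for yourself what probability of B beating A is enough to call it.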
Matt: This is interesting, and I feel it’s one of the harder points to grasp. I don’t know if I’ve fully got it either. Is there a way to explain it for DG, who’s the star of our core Seeking Wisdom podcast? If he’s listening in, he would say that he’s not a math guy. What is a really easy way to explain that your hypothesis is changing automatically? How exactly does that happen?
Guy: There are many implementations and not everybody agrees. Very smart people disagree on the exact right implementation but the core principle of that iterative approach is to say, “Look, my best guess is that this headline is going to convert at 10%.” As I see people come in and convert or not convert, I’m going to adjust that 10% on the fly after every visitor sees it to better approximate what I think the actual conversion rate of that headline is.
I can get better and better at making guesses. We do it all the time. When I’m riding a bike as a little kid who’s never ridden a bike before, I start riding, I go straight for a little while, and then I fall over to the left, and I think, all right, now I’ve learned: I fell over to the left, so I’m going to lean a little bit more to the right. I ride a little more to the right, and then I fall over to the right.

I learn from that, and I keep adjusting with every turn of the pedal, overcompensating a little one way, then a little the other way, eventually finding the right balance. That notion of iteratively dialing in and narrowing down how I should maintain my balance on this bicycle is similar to the iterative approach in Bayesian statistics.
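The per-visitor updating Guy describes can be sketched with a Beta-Binomial model, where the prior encodes the "best guess" he mentions. The prior strength and the true rate below are assumptions for illustration:

```python
import random

random.seed(2)

# Prior belief: the headline converts around 10%.
# Beta(10, 90) behaves like 100 visitors' worth of pseudo-data at a 10% rate.
converted, missed = 10, 90
true_rate = 0.15  # unknown in real life; fixed here so the simulation has a target

for _ in range(2000):  # the belief updates after every single visitor
    if random.random() < true_rate:
        converted += 1
    else:
        missed += 1

estimate = converted / (converted + missed)
print(f"{estimate:.3f}")  # drifts from the 0.10 prior toward the true 0.15
```

At any point in that loop you could stop and read off the current best estimate, which is the sequential, check-anytime property Guy contrasts with the fixed-horizon frequentist test.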
Matt: Got it. In the context of how you would normally have your A/B test, you have your A as the control and your B as the variant. Does that change much with the Bayesian approach? Is there still an A/B, or is there an A/B/C/D/E/F? How does that work?
Guy: You can use frequentist or Bayesian statistics on an A/B test, an A/B/C/D test, a multivariate test. You can use it for all of them. At its core, the Bayesian approach is trying to solve a different problem. This does get a little wonky. You almost said it before: the null hypothesis. When you’re using frequentist-based statistics, the actual question you’re answering is, do A and B perform the same?
If they don’t perform the same, you don’t actually know which one’s performing better. You assume, let’s pretend B does better, that B is the higher performer, but actually the true performance of B has some distribution around it and hopefully it’s above A. What you do know is they don’t perform the same.
With Bayesian statistics, you’re not trying to answer the question, do they perform the same or not. You’re trying to answer, how well does each one perform, and how accurately can I predict that? If I can only predict it poorly, there’s a lot of uncertainty and overlap in the predictions, and I’m not sure. As I get better and better at guessing, hey, it’s really in this tight range, I can make better decisions.
Additionally, Bayesian statistics tries to minimize a notion of expected loss. What does that mean? If there’s a 10% probability that B is 10% better than A, in Bayesian statistics that’ll be treated the same as a 1% probability that B is 100% better than A, because those two multiply out to the same thing.
In Bayesian, you’re not just looking at the average difference, which is what you do in frequentist. Here you’re also saying what’s expected, what do I think will happen? You’re combining the probability that B is better than A along with the notion that if B is better than A, how much better will it be?
You’re combining those two notions, and that has the effect of protecting you on the downside, so that you don’t put into production ideas that turn out not to really work. The con, the thing you pay for, is that you do have the potential for more false positives. You have the potential for something that is only okay to make it out into production.
It may not produce a big lift, but you are protecting yourself against something that really sucks ever making it out into production. That’s one of the implications of taking a Bayesian approach: it’s not a panacea. It doesn’t solve everything. It has pros and cons, and this is one of the cons. In general, I happen to believe the pros clearly outweigh the cons.
Matt: If I’m at a company and I’m looking at this Bayesian thing, I’m sitting here saying, “Guy, Matt, this sounds really cool, but where do I get started with this?” It seems really easy for people to just continue with the frequentist stuff. Not to say that’s the biggest mistake you’ll ever make, and I’m sure there are people out there who strongly believe it’s not a mistake to continue with it, but where do they go, and how do they think about it?
Guy: I think you’re spot on that continuing with frequentist statistics is something many will do. In fact, it’s got organizational credibility in most companies. Many of us just spent time driving buy-in on data-driven marketing and A/B testing to begin with, and these are the stats we’re reusing.
The motivation for doing this would be to not have that difficult moment at the end of the quarter or the end of the year where someone looks at us and says, “Hey, if I add up all your tests, we should have doubled revenue.” And in fact we’re really up, I don’t know, 10, 20, 30%. That’s the motivation for doing this.
To do it, you could hire a very smart stats person, or you could use one of the many tools out there that is starting to use, or has been using, a Bayesian approach to its statistics. In return, it will happen less often that you bring a test to production and it’s not actually a winner, because it’s okay to peek early.
In return for getting answers sooner and being protected on the downside, the cost is that you now have to get buy-in again in [inaudible 00:18:48]. You will shift thinking from a p-value to some probabilistic view and some minimum acceptable downside. That’ll take some time, but the benefits can be significant if it’s a path you want to walk down.
Matt: I think it’s funny that you use the example of sitting there at the end of the year and someone saying, “Hey, if I add up all your tests, you said this thing was a 30% increase and that thing was a 20% increase. Where’s the number?” I have been there, not only with someone asking me that, but also when we sat down and looked at all of our A/B tests and said, these things don’t add up. If this one is this much better and that one’s better, shouldn’t the total be higher? I’ve totally been there, and I’m sure other people have been there too.
It really is an interesting challenge to get your organization or your team rallied around this new concept, this new approach. It might mean changing tools, it might mean reeducating people. It’s definitely a real part of the challenge.
Guy: I am with you, and I have been there too. Those conversations are not pleasant. They’re challenging trust-wise, and we could theoretically all solve it by waiting the fixed amount of time we’re supposed to wait for a frequentist test to run, but the reality of day-to-day business pressures allows hardly any of us to do that.
Matt: Okay, let’s say we’ve got a bunch of people listening who are saying, all right, I want to get started with this, sounds good. One, how do I start these conversations and get our organization to move to this? Two, how do I actually start implementing it? And three, where can I go to learn more if I maybe didn’t fully grasp everything? It is a tough subject. Where can I find that information?
Guy: In my humble opinion, you should begin by thinking through how much depth you want to bring the organization through. Even in this conversation we’ve varied in altitude between simple metaphors, like learning to ride a bike, and pretty detailed stuff about maximum acceptable errors, and a bunch of places in between. It may not be the case that you need to move the whole organization to the full depth. Many of them may not care to.
They may not understand that about the frequentist approach you have now, either. But you can decide to share with others: hey, we are going to shift our thinking to one that is more probabilistic, one where we need to decide the minimum acceptable downside for a test so that we can ship more tests with upside more quickly.
That logic, that set of trade-offs, you can start talking people through. On implementation, if you have a statistician in house who can do this, they can probably also help you a lot with driving the buy-in, if it’s something they believe in too. Otherwise, find the tools that can help you do this; maybe it’s your existing tool.
On learning more, if you Google Bayesian statistics, you will see a lot of information. There is not unanimity. One, on whether Bayesian is better than frequentist: there are very smart people who believe in each. And two, even among folks who believe in Bayesian, there’s not one universal way to implement it that is appropriate for every situation.
If you Google it, you’re going to see everything from high-level explanations down to detailed graphed simulations. My suggestion would be to consume some of that and, just as you said, go through it over and over to build the intuition. If you want to dig in further, feel free to email me at email@example.com, and my hunch is you’ll say the same, Matt, so that we can continue the conversation and dig in much further.
Matt: Yes, I think it’s a fun way to have, I feel like every time I talk to someone about Bayesian and this future of A/B testing, I’m always learning something new whether I’m finally connecting another piece of the dotted line or someone points me to another interesting piece of content.
I’m super excited about this topic. I know it has saved us from a bunch of accidental A/B calls where we would’ve said B was better, but because we were using a Bayesian approach, we could be more certain that a thing was actually better, and better within a range that sits higher than the other range. It’s more of a true north for the conversion rates.
I’m a fan. Guy, I want to say thank you so much for joining today. I’ve learned a bunch, and I hope our listeners have too. I was going to say viewers, but no one’s viewing, you’re listening to this. For all of you out there listening, I’m always open to feedback. My email is firstname.lastname@example.org.
I really appreciate you tuning in. It means a lot, and if you have ideas for future topics, future guests, whatever it might be, feel free to send me a note. If you liked it, six-star reviews in Seeking Wisdom fashion, and I’ll catch you on the next episode. Guy, thank you again so much for joining today. I will talk to you again soon.
Guy: Thanks everyone for listening. Matt, thanks for the privilege of joining you. Looking forward to continuing the conversation offline.
Matt: All right, take care. Bye.