What exactly is Bias for Action?
One of Amazon’s leadership principles that is often misunderstood
Every Amazonian knows what the leadership principles are and how they are applied day in and day out. The ultimate goal of every leadership principle is to improve the customer experience and raise the bar for customer obsession.
When I talk to potential candidates or friends about our leadership principles, Bias for Action comes up almost every time, and it is usually badly misunderstood.
Speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk taking.
Many people are, or were, under the impression that when an issue comes up, we just fix it and forget it until the next issue pops up, and then repeat the cycle. That is not true. Yes, speed matters, but that doesn’t mean compromising on the high bar or on customer experience. It is a delicate balance.
Bias for Action is always followed by Dive Deep and Think Big.
To understand the bigger picture, remember that Bias for Action is always followed by Dive Deep and Thinking Long Term. Here is a simple example that I hope makes this clearer.
As a Data Engineer, my responsibility is to ensure that data is available in a timely manner and at the highest quality. At Amazon’s scale, I constantly have to look for opportunities and ways to scale the infrastructure. However, it’s not a one- or two-day task.
Last year I was faced with a peculiar problem. Suddenly, queries running on our Redshift ETL and reporting clusters were taking far too long to execute, just as peak preparation was getting started. Users were frustrated because they weren’t able to get their reports in a timely manner. I was tasked with fixing the issue.
Now, I could have taken my sweet time, come up with a plan, and told my customers that it would take XYZ hours/days/months to fix the issue. But that would have had a significant impact on the business and the customer experience, and I would have ended up losing my customers’ trust. This is how I used the power of Bias for Action.
The quick solution (Bias for Action).
When queries on Redshift take too long, it generally means they aren’t getting the resources they need to execute in a timely manner. There could be several underlying reasons, but from the initial data I had, I was able to figure out that the causes were data volume and poorly optimized queries. I could not tune each and every query, nor shrink or apply data retention to so many tables. So, what was the quick fix?
I increased the Redshift capacity. This was a calculated risk, but I knew I could scale back down once I understood the exact causes and how to mitigate them long term. The only downside was the additional Redshift cost. In return, users started getting their reports on time again. The fix took one day to implement. What followed next is the important part.
Now that things were more or less back to normal, I started diving deep into why performance had suddenly degraded. I did not have enough insight into what might have caused it, so I started building insights into our infrastructure that would tell me:
- How many queries are executing in a given time period.
- Which queries are running longer than expected.
- How many tables there are, how often each table is queried, the state of each table’s statistics, its growth, and its overall health.
- Which ad hoc queries users are running.
- What ETL tools we are using.
- How many scheduled ETL jobs run on a given day, their average run times, success and failure rates, the amount of data each process queries and retrieves, and so on.
- Status of data refresh and how often SLAs are missed.
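A few of the questions above can be sketched in code. The following is a minimal illustration only, not the reports my team actually built: the `QueryRecord` fields are a hypothetical schema for records exported from the cluster's query history, and treating an SLA miss as "a scheduled job ran longer than a fixed duration" is a simplifying assumption.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueryRecord:
    """One executed query (hypothetical schema, for illustration only)."""
    query_id: int
    user: str
    start: datetime
    end: datetime
    scheduled: bool  # True for scheduled ETL jobs, False for ad hoc queries

    @property
    def duration(self) -> timedelta:
        return self.end - self.start


def queries_per_hour(records):
    """Count how many queries started in each clock hour."""
    return Counter(
        r.start.replace(minute=0, second=0, microsecond=0) for r in records
    )


def slowest_queries(records, threshold):
    """Return the queries that ran longer than `threshold`, slowest first."""
    slow = [r for r in records if r.duration > threshold]
    return sorted(slow, key=lambda r: r.duration, reverse=True)


def sla_miss_rate(records, sla):
    """Fraction of scheduled jobs whose run time exceeded `sla` (assumed definition)."""
    jobs = [r for r in records if r.scheduled]
    if not jobs:
        return 0.0
    return sum(r.duration > sla for r in jobs) / len(jobs)
```

Running functions like these over each day's records and storing the results is one simple way to build the kind of trend data described next.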
Since this was for internal purposes only, I created some reports that refresh at a regular cadence so that my team can monitor continuously. Now I had all the data I needed to see how things were moving along. I let the processes run for a month so I could look for trends and make an informed decision. With all that data, I was able to pinpoint every issue, why it happened, and how it happened. This took me almost six weeks. This is what Dive Deep is about: getting to the bottom of the problem. It was time for the next step.
With all the data points in hand, I went through them meticulously and identified the scaling issues and the design changes we had to make so that our infrastructure scales for the longer term. This scaling effort will take time, but it can be done in parallel without impacting our customers, and the changes are already underway.
Now, given this situation, how would it have sounded if I had told my customers to wait six months to get their data on time? Instead, I temporarily increased our cluster capacity. This bought me enough time to thoroughly investigate the underlying issues and come up with a solution that will scale for the longer term. The customer experience stays the same, while my team improves its operational excellence and pays down tech debt. That is a win-win, because we now have more time to engage with our customers and provide better solutions. This is how we raise the bar every day.