Rakuten Rewards (formerly Ebates) is an online loyalty program with nearly 15 million active members and more than 3,500 online merchants and services. This is our story, but a similar scenario is playing out at organizations across industries and around the world.
Before we moved to the Snowflake cloud data warehouse, we were using on-premises technology (SQL Server, followed by Hadoop), and we were still struggling to meet our service-level agreements (SLAs). Technical maintenance work was consuming an increasing share of our engineering resources, and we were risking the creation of data silos (something we had worked hard for years to prevent). We were spending too much time debugging and tuning our environment, which kept us from performing critical business analysis and data solution engineering.
We needed to be able to handle a variety of different data workloads, but we didn’t want to separate the data. We needed to be able to load the data fast (e.g. as close to real-time as possible) to hit our SLAs; and we needed to allow for high volumes of concurrent use. And of course, we needed to do all of this cost-effectively to stay within our budget targets. Moving to the cloud and using the Snowflake data warehouse enabled us to maximize ROI and to scale with the rest of Rakuten Rewards.
Four techniques we used to maximize our Snowflake investment
Outlined below are some specific ways that we have been able to manage costs on Snowflake successfully at Rakuten Rewards. These approaches are not specific to Rakuten Rewards and can be replicated across many businesses that are struggling with these same issues.
1 Create a Snowflake warehouse for each business unit
Each business unit (e.g. sales, marketing, product), ETL component, and third-party tool such as MessageGears or Segment should have its own Snowflake warehouse. Warehouses in Snowflake are separate processing clusters; I tend to think of them as groups of CPU and memory. A warehouse in Snowflake does not limit data access or produce silos of data: data remains centrally stored and can be accessed through any warehouse, as long as the user has the necessary permissions. Having more distinct warehouses is helpful because it allows you to isolate workloads. Better still, having multiple distinct warehouses allows you to tune the size (and therefore cost) of each warehouse to reflect the workload it is processing. For example, the sales team might have its own warehouse, completely separate from the product team's warehouses, which are similarly separate from the marketing team's. Each warehouse can be sized to meet the needs of the specific team or task, and the workload of one team does not interfere with that of the other teams around them.
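As a sketch, a per-unit setup like the one described above could be provisioned with Snowflake's `CREATE WAREHOUSE` DDL. The warehouse names and sizes below are illustrative, not our actual configuration:

```python
# Sketch: one Snowflake warehouse per business unit or tool.
# Warehouse names and sizes below are hypothetical examples.

def warehouse_ddl(name: str, size: str, auto_suspend_secs: int = 60) -> str:
    """Build a CREATE WAREHOUSE statement. AUTO_SUSPEND stops an idle
    warehouse from accruing credits; AUTO_RESUME wakes it on demand."""
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} "
        f"WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_secs} "
        f"AUTO_RESUME = TRUE;"
    )

# One warehouse per team or tool, each sized to its own workload.
units = {"SALES_WH": "SMALL", "MARKETING_WH": "MEDIUM", "ETL_WH": "LARGE"}
statements = [warehouse_ddl(name, size) for name, size in units.items()]
```

Because each team runs on its own cluster, a heavy ETL job on the ETL warehouse cannot slow down an analyst's query on the sales warehouse, even though both read the same centrally stored data.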
2 Control cloud data warehouse sizes
Instead of giving each team control over changing warehouse sizes, we gave each team multiple warehouses of different sizes. For example, you might give the sales team three warehouses: a small, a medium, and a large. Why do I advise against giving individual teams control over their own warehouse sizes? It can be very tempting to increase the size of a warehouse, especially if you have a slow-running job or two. This increases the CPU and memory available to the job, and can speed the job up. However, the bigger the warehouse, the bigger the cost per hour. So, if a team has control over warehouse size, and increases the size of the warehouse to handle a particularly tricky job, but then forgets to turn it back down… your costs can start to ramp up very quickly indeed.
The other benefit to centralizing control over warehouse size is that you can proactively work on resizing the warehouses to fit their workloads. For example, we have found multiple workloads where we were able to significantly downsize the warehouse without impacting job run times, saving money with no effect on performance. Often these jobs require heavy I/O and little compute. However, this is not a straightforward linear process, and smaller is not always better. We have also had situations where we increased the warehouse size and saved money: if you have jobs with heavy compute needs, you may find that a larger cluster finishes the job so much faster that you save money compared to running the same job on a smaller cluster. In the end, the bigger warehouse can pay for itself in speed, even though it is much more expensive per hour. Setting up your warehouses in a way that allows for experimentation and easy tuning is critical.
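The "bigger can be cheaper" point comes down to simple arithmetic. Snowflake bills credits while a warehouse runs, and each size step roughly doubles the hourly credit rate, so a warehouse that costs twice as much per hour breaks even if it finishes the job in less than half the time. A hypothetical compute-bound job:

```python
# Back-of-the-envelope cost check. Rates follow Snowflake's published
# pattern of roughly doubling credits per size step (SMALL = 2/hour,
# MEDIUM = 4/hour, LARGE = 8/hour); the job runtimes are hypothetical.

CREDITS_PER_HOUR = {"SMALL": 2, "MEDIUM": 4, "LARGE": 8}

def job_cost(size: str, runtime_hours: float) -> float:
    """Credits consumed = credit rate while running x hours running."""
    return CREDITS_PER_HOUR[size] * runtime_hours

# A compute-bound job that scales well: 4 hours on SMALL vs 45 minutes
# on LARGE. The LARGE run costs 4x per hour but finishes >5x faster.
small_cost = job_cost("SMALL", 4.0)   # 8 credits
large_cost = job_cost("LARGE", 0.75)  # 6 credits
```

The break-even only holds for jobs that actually parallelize; the I/O-heavy jobs mentioned above are exactly the ones where downsizing wins instead.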
We also found it important to set a kill switch on queries. Our team has a kill switch that activates at two hours; any query running for more than two hours is automatically cancelled. This maximum run time is a reflection of the size of the data sets we manage, and the amount of work we have done to create a fully featured analytics layer on top of the raw data lake. It is rare that our colleagues try to run a workload for more than two hours, and in the very rare cases where it does come up, we proactively help them tune the code. In reality this kill switch fires very infrequently, and we’ve never had a situation where we couldn’t tune the code.
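The text above does not say how the kill switch is wired up; one way to get this behavior in Snowflake is the `STATEMENT_TIMEOUT_IN_SECONDS` parameter, which can be set per warehouse and automatically cancels any statement that runs past the limit. A minimal sketch (the warehouse name is hypothetical):

```python
# Sketch: a two-hour query kill switch via Snowflake's
# STATEMENT_TIMEOUT_IN_SECONDS parameter, settable per warehouse.
# The warehouse name is a hypothetical example.

TWO_HOURS = 2 * 60 * 60  # 7200 seconds

def timeout_ddl(warehouse: str, seconds: int = TWO_HOURS) -> str:
    """Any query on this warehouse exceeding the limit is cancelled."""
    return (
        f"ALTER WAREHOUSE {warehouse} "
        f"SET STATEMENT_TIMEOUT_IN_SECONDS = {seconds};"
    )

kill_switch = timeout_ddl("SALES_WH")
```

Applying this once per warehouse enforces the policy uniformly, with no per-user setup to forget.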
What do you need to make all of this work? It's a good idea to establish a centralized cost management and oversight team. Staffing this team doesn't need to be a full-time commitment if your company is not in a position to fund that. This team helps to create a centralized overview of Snowflake cost management and can act deliberately to prevent runaway costs.
3 Increase metadata visibility
Another critical part of our cost management approach was to visualize metadata tables with Tableau. Snowflake provides a large amount of very helpful metadata, easily available in a series of Snowflake tables and views. Hooking Tableau up to these tables and views enables you to better visualize all of your costs and analyze your workloads down to the query level. Furthermore, you can build and distribute standardized reporting for your cost oversight team, making it easy to spot spikes or anomalies in your costs or usage patterns.
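As a sketch of the kind of query a Tableau workbook could sit on top of, the following reads Snowflake's documented `ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY` view to break credit consumption down by warehouse and day:

```python
# Sketch: credits per warehouse per day over the last 30 days, read from
# Snowflake's ACCOUNT_USAGE metadata views. The view and column names
# (warehouse_name, start_time, credits_used) are Snowflake's own; the
# 30-day window is an arbitrary example.

DAILY_CREDITS_BY_WAREHOUSE = """
SELECT
    warehouse_name,
    DATE_TRUNC('day', start_time) AS usage_day,
    SUM(credits_used)             AS credits
FROM snowflake.account_usage.warehouse_metering_history
WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY warehouse_name, usage_day
ORDER BY usage_day, credits DESC;
"""
```

Pointing a standardized Tableau report at a query like this puts spikes in front of the oversight team automatically, rather than waiting for someone to dig through raw usage history.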
With this enhanced visibility, you can make monitoring costs and usage much easier for the oversight team. Issues can be spotted immediately and efficiently, enabling fast action.
4 Pre-aggregate and join granular, raw data
Aggregations and joins are inevitably expensive. There are a variety of ways to try to control these costs, but one approach that has worked well for us is derived data warehouse tables. We've saved a meaningful percentage of our total costs just by doing that. Effectively, we scanned our workload for repetitive joins and aggregations, and then, instead of processing those joins repeatedly during the workday, we run the joins and aggregates a few times a day, store the results in a table, and ask users to use those tables instead of going back to the raw data. It's rarely a challenge when we ask people to move their jobs to these tables, since their SQL becomes easier and faster.
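A minimal sketch of the pattern: materialize a join and aggregate that users repeat all day into a derived table, and refresh it a few times a day with a scheduled Snowflake TASK. All table, column, and warehouse names here are hypothetical:

```python
# Sketch: pre-join and pre-aggregate raw data into a derived table,
# refreshed on a schedule, so users query the small result instead of
# re-running the join. All object names are hypothetical examples.

REFRESH_SQL = """\
CREATE OR REPLACE TABLE analytics.daily_order_summary AS
SELECT o.order_date,
       c.region,
       SUM(o.amount) AS total_amount,
       COUNT(*)      AS order_count
FROM raw.orders o
JOIN raw.customers c ON c.customer_id = o.customer_id
GROUP BY o.order_date, c.region;
"""

# A Snowflake TASK re-runs the refresh every six hours.
TASK_DDL = (
    "CREATE OR REPLACE TASK refresh_daily_order_summary\n"
    "  WAREHOUSE = ETL_WH\n"
    "  SCHEDULE = 'USING CRON 0 */6 * * * UTC'\n"
    "AS\n" + REFRESH_SQL
)
```

Users who previously joined orders to customers themselves now select from `daily_order_summary` directly, so the expensive join runs four times a day instead of hundreds.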
Working with AtScale takes this to another level. The AtScale tool does this same type of evaluation, but it does so automatically, and it drops any aggregates that are not being used. The result is a set of pre-joined, pre-aggregated tables that can power a large volume of reporting work for the company.
Implementing these techniques can help provide an enterprise-level business intelligence platform with the power needed to unlock meaningful insights and data solutions, without trading off control or cost.
And this isn't the end: there are further tips and tricks for maximizing productivity while controlling costs that there is not space to review here.
The cloud is essential to modern business, but benefits like scalability and flexibility aren’t achieved in a vacuum. Picking the right cloud data warehouse for your organization and keeping your eye on productivity, usage patterns and costs is critical. Once you choose a data warehouse, make the effort to test and iterate on how you can optimize the usage of your data warehouse. Maximize ROI by reducing the unnecessary costs associated with business analysis. Together, these are the keys to unleashing the full value of the cloud.