Our Data team recently bought Snowflake (a cloud-based data warehouse), and as we have started using it, we have developed some really interesting design ideas that I’m excited to share. But first, a bit of context.

Data Lake vs. Data Warehouse

If your enterprise has a lot of data and needs to do analytical queries, you probably have a data lake or a data warehouse. The chart below does a good job of covering the high-level differences between the two. An important takeaway from the chart is that data lakes and warehouses really serve different purposes and users. What tends to happen is that enterprises pick one option, which leads to pain for certain users/use cases—or the enterprise has both a lake and a warehouse, which leads to complexity. What is needed is technology that can flexibly support both data lake and data warehouse use cases at scale. Enter the data lakehouse.

[Chart: high-level differences between data lakes and data warehouses]

Describing a Data Lakehouse

Simply described, a data lakehouse is a data lake where either a schema-on-read or a schema-on-write approach can be used, and when schema-on-write is used, the declared structure dictates how the data is stored at the physical storage layer. Using schema-on-write in this way should also improve query performance.
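To make the distinction concrete, here is a minimal sketch in Snowflake SQL; the table and column names are hypothetical, not drawn from any particular system.

    -- Schema-on-read: land events as raw semi-structured data (VARIANT)
    -- and interpret the structure at query time.
    CREATE TABLE raw_events (payload VARIANT);

    SELECT
        payload:user_id::STRING    AS user_id,
        payload:event_type::STRING AS event_type
    FROM raw_events;

    -- Schema-on-write: declare the structure up front so it is enforced on
    -- load and dictates how the data is physically stored.
    CREATE TABLE modeled_events (
        user_id     STRING,
        event_type  STRING,
        occurred_at TIMESTAMP_NTZ
    );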

There are technologies that facilitate something like a data lakehouse. PrestoDB and Redshift Spectrum are examples that support federation, which allows heterogeneous data sources to be queried through the same interface. Data virtualization technologies also provide support for federation. While federation abstracts away the complexity of data integration (good), supporting schema-on-read versus schema-on-write still requires different underlying technologies (not good). Having a single technology that natively supports both schema-on-read and schema-on-write simplifies things (very nice). As we have adopted Snowflake, it has been exciting to see a data technology that natively supports the idea of a data lakehouse.

Motivation for a Data Lakehouse

[Photo by Don Graham. Used under CC BY-SA 2.0.]

Jellyvision’s flagship product is the online benefits counselor ALEX. Users visit the site, answer some questions, and get a benefits recommendation appropriate to their unique situation. Most users will only experience one path through the tool, but the User Research & Analytics team needs to know them all, and the many thousands upon thousands of ways users can experience the application create unique challenges for understanding user behavior. We chose to develop a custom Business Intelligence tool to handle this diversity of experience. At its core, the tool helps the User Research & Analytics team understand, in aggregate, how users answer questions as they navigate through our products. While previous tools provided a narrow range of pre-calculated roll-ups, we designed our new BI tool for on-demand aggregation, to empower our analysts to explore whatever patterns are relevant in the moment.

To keep that workflow productive, the analysis experience should feel interactive, so we wanted question/answer queries to return in single-digit seconds (ideally under 2 seconds). For most of our KPIs, a data lake has no trouble meeting the two-second threshold. For our question/answer data, however, the volume can be 10x that of other KPIs, and the logic involves more than simple roll-ups. This is the perfect use case for a data lakehouse: we want to tune performance for a specific use case without having to over-provision hardware.

Snowflake supports both unstructured/semi-structured formats and views, which allows it to be used as a data lake with a schema-on-read approach. Our data lake approach is to create a single master table for all events and to create views that will be used by applications and business users. Where we need to improve query performance, we can migrate data for specific use cases into tables that serve as smaller lakes. For a handful of use cases, we expect to take an additional step and model the data so that Snowflake starts using schema-on-write. The use of views allows all of these steps to be taken without having to change logic for downstream systems or the people who use them.
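As a rough sketch of this pattern in Snowflake SQL (the names here are illustrative, not our production schema):

    -- A single master table: every event lands here as semi-structured data.
    CREATE TABLE all_events (
        received_at TIMESTAMP_NTZ,
        payload     VARIANT
    );

    -- Views expose just the fields applications and business users need,
    -- reading the schema at query time (schema-on-read).
    CREATE VIEW question_answers AS
    SELECT
        payload:session_id::STRING  AS session_id,
        payload:question_id::STRING AS question_id,
        payload:answer::STRING      AS answer,
        received_at
    FROM all_events
    WHERE payload:event_type::STRING = 'question_answered';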

Ultimately, Snowflake’s data lakehouse characteristics let us deploy structural changes iteratively, exactly where they are needed. Deploying changes this way reduces effort and helps us deliver results faster.
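Continuing the illustrative example above, if the question/answer use case needs more performance, the data behind the view can be modeled into a dedicated table and the view re-pointed, with no changes to downstream queries:

    -- Model the hot use case into a typed table (schema-on-write).
    CREATE TABLE question_answers_modeled (
        session_id  STRING,
        question_id STRING,
        answer      STRING,
        received_at TIMESTAMP_NTZ
    );

    -- Re-point the existing view; downstream systems keep querying the same name.
    CREATE OR REPLACE VIEW question_answers AS
    SELECT session_id, question_id, answer, received_at
    FROM question_answers_modeled;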



