homegenius Improves Speed and Quality of Data Pipelines with Snowpark for Python
The homegenius family of companies is a technology group that utilizes the latest tools in AI to create smarter real estate experiences for homebuyers, real estate agents, and other real estate industry participants.
The family of companies, a part of Radian Group, relies on data science and machine learning to build and deploy models using a variety of data from different sources to bring technology to all parts of the home-buyer experience.
homegenius’ data scientists found themselves stuck spending their time managing data instead of unlocking its potential, a challenge faced by many organizations.
Before Snowflake, the homegenius companies ran their data processes on legacy SQL servers, and those processes took days or weeks to complete. Hours, and sometimes days, were required to acquire and load data into the appropriate locations before data scientists could even start their work. It was not uncommon for data scientists to spend significant time developing models, only to find a problem in the data during validation that required additional time to fix and rerun. This became a bottleneck for a growing advanced analytics team developing cutting-edge technologies.
"The data scientists would spend a week or more working on their models, only to discover issues with the data," said Noah Goodrich, Principal Data Engineer. "They had to backtrack and identify the problem in our process."
homegenius’ data challenges
homegenius’ data engineering team had three big data challenges to solve, according to Goodrich. The data science team needed faster data transformations, higher-quality data validations, and a shorter turnaround time for pipeline testing.
The team first tried to use a different vendor to store and process billions of rows. “After 6 weeks trying to make their solution work, we converted to Snowflake in a day. Everything just worked,” Goodrich said. “We had spent thousands of dollars on huge clusters that took many hours to kind of complete our biggest jobs. In comparison, a 5XL warehouse in Snowflake completed that same workload in less than 10 minutes. For me personally, that’s the biggest thing that Snowflake has delivered to our organization — the speed at which we can process and deliver things has gone from days and weeks to as little as minutes,” Goodrich added.
The data engineers are now able to conduct preemptive, automated data validation so the data scientists can say with confidence that all their data is correct, updated, and in a healthy state.
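The article does not describe the team's actual checks, so the following is only a minimal sketch of what preemptive, automated validation can look like. It runs in plain Python over in-memory rows; the column names (`listing_id`, `price`), the price bounds, and every function name are hypothetical, and in a real pipeline the same rules would more likely run as queries inside the warehouse.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_not_null(rows, column):
    """Fail if any row is missing a value for `column`."""
    missing = sum(1 for r in rows if r.get(column) is None)
    return CheckResult(f"not_null({column})", missing == 0, f"{missing} null value(s)")


def check_unique(rows, column):
    """Fail if any value of `column` appears more than once."""
    values = [r.get(column) for r in rows]
    dupes = len(values) - len(set(values))
    return CheckResult(f"unique({column})", dupes == 0, f"{dupes} duplicate value(s)")


def check_range(rows, column, low, high):
    """Fail if any non-null value of `column` falls outside [low, high]."""
    bad = sum(
        1 for r in rows
        if r.get(column) is not None and not (low <= r[column] <= high)
    )
    return CheckResult(f"range({column})", bad == 0, f"{bad} out-of-range value(s)")


def run_checks(rows, checks):
    """Run every check and return only the failures (empty list means healthy)."""
    results = [check(rows) for check in checks]
    return [r for r in results if not r.passed]


# Hypothetical rule set for a listings table.
CHECKS = [
    lambda rows: check_not_null(rows, "price"),
    lambda rows: check_unique(rows, "listing_id"),
    lambda rows: check_range(rows, "price", 0, 100_000_000),
]
```

Running `run_checks(rows, CHECKS)` before a dataset is handed to data scientists surfaces problems at load time rather than a week into model development, which is the failure mode Goodrich describes above.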
Enter Snowpark — bringing data processing to the next level in the same Snowflake environment
Goodrich and the data engineering team wanted to work with Python as their programming language of choice, both for easier testing and for faster research and development. Early on, the team was pulling data into PySpark and then pushing the results back into Snowflake, but experienced challenges with pipeline speed and quality.
They needed a data platform with near-unlimited scale, concurrency, and performance, with a single unified view of data to easily discover and securely share governed data and execute diverse analytics workloads. The data engineers turned to Snowpark, a set of libraries and runtimes in Snowflake that allows data engineers, data scientists, and data developers to securely process and deploy Python code. The move significantly improved performance because it kept the data and the processing on a single platform.
“We wanted to switch to Snowpark for performance reasons and it was so easy to do,” Goodrich said. “Converting our PySpark code to Snowpark was as simple as a change in an import statement.”
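One way to picture the import swap Goodrich describes: both `pyspark.sql.functions` and `snowflake.snowpark.functions` expose a `col` function, and DataFrames in both libraries offer `filter` and `select`. The sketch below is illustrative rather than homegenius' code. The table columns (`LISTING_ID`, `LIST_PRICE`) and the helper names are hypothetical, and it assumes the pipeline sticks to methods the two APIs share, which will not hold for every PySpark feature.

```python
import importlib

# The only library-specific detail: which module supplies `col` and friends.
BACKENDS = {
    "pyspark": "pyspark.sql.functions",
    "snowpark": "snowflake.snowpark.functions",
}


def load_functions(backend):
    """Import the functions module for the chosen backend (must be installed)."""
    return importlib.import_module(BACKENDS[backend])


def priced_listings(df, F):
    """Keep rows with a positive list price.

    `df` is a PySpark or Snowpark DataFrame and `F` is the matching
    functions module; the body is identical for both libraries.
    """
    return df.filter(F.col("LIST_PRICE") > 0).select("LISTING_ID", "LIST_PRICE")
```

With a DataFrame `df` obtained from a Snowpark session, the call would be `priced_listings(df, load_functions("snowpark"))`. Because Snowpark pushes the work down to Snowflake, the data no longer has to leave the platform, which matches the performance motivation given above.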
For homegenius’ data science programs, Snowflake has significantly sped up their work. The company can now rebuild datasets from scratch, even large ones, in as little as 30 minutes. This saves the family of companies days, or even weeks, compared to their previous solution.
Snowpark, at its core, supports collaboration among data engineers, data scientists, and other teams by allowing them to work on the same data within a unified Snowflake platform. This streamlines the architecture by natively supporting the programming language each team member chooses. It helps teams like Goodrich's build scalable, optimized pipelines, apps, and machine learning workflows while keeping costs low and maintenance requirements minimal. In some cases, the homegenius family of companies is now able to merge data from Snowflake and external API sources and process it natively on a single Snowflake platform.
"The biggest benefit Snowflake brought is the speed at which we can deliver," said Goodrich. "We've transitioned from processing tasks that took days to now completing them within minutes."
Snowpark also made it easy for the homegenius companies to implement better unit tests, allowing the team to validate queries against full data sets within minutes. Snowflake has opened research and development opportunities for the companies’ data scientists as well — helping them conduct experiments on data sets that were previously too large to fully utilize.
"Once I had that proof of concept built out and had transitioned everything to using Snowpark for our data pipeline and processing, we were finally able to move forward quickly with some initiatives that had stalled," Goodrich said. "One large personal regret was not relying on Snowflake earlier to empower our team to be even more successful."