Feature engineering is often complex and time-intensive. A subset of data preparation for machine learning workflows within data engineering, feature engineering is the process of using domain knowledge to transform raw data into features that ML algorithms can understand. No matter how much algorithms continue to improve, feature engineering remains a difficult process that requires human insight and domain expertise. In the end, the quality of feature engineering often determines the quality of a machine learning model.
Examples of Feature Engineering
Assume two input variables, customers’ salaries and ages, and a target variable, the likelihood of purchasing a product. In this dataset, salaries range between $30,000 and $200,000, and ages are between 10 and 90. You understand that an age difference of 20 years is more significant than a salary difference of $20, but to an algorithm, they are just numbers to fit a curve through. For ML systems to produce useful outcomes, you need to account for these differences in scale and significance. In this case, you could do so by scaling (e.g., mapping both salary and age onto a range between 0.0 and 1.0). You can also use binning (also known as bucketing) to place values into one of a fixed number of value ranges (e.g., salary bands labeled 0 to 6).
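As a minimal sketch of both techniques, the following Python snippet applies min-max scaling and binning with pandas; the column names, sample values, and number of bands are hypothetical.

```python
import pandas as pd

# Hypothetical customer records with the ranges described above
df = pd.DataFrame({
    "salary": [30000, 85000, 120000, 200000],
    "age": [10, 35, 52, 90],
})

# Scaling: map each variable onto the range 0.0 to 1.0 (min-max normalization)
for column in ["salary", "age"]:
    low, high = df[column].min(), df[column].max()
    df[column + "_scaled"] = (df[column] - low) / (high - low)

# Binning (bucketing): place each salary into one of seven bands labeled 0-6
df["salary_band"] = pd.cut(df["salary"], bins=7, labels=range(7))

print(df)
```

After scaling, a 20-year age gap and a $20 salary gap are no longer numerically comparable in the raw units; each variable contributes on the same 0.0-1.0 scale.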
Another example of a feature might be a customer score (derived from raw data) for a churn model or a calculated variable called “length of time customer.” These may be based on structured data. Similarly, user activity may arrive as semi-structured data that can define a calculated feature such as “is active in last month.” And to take advantage of every data type, unstructured data needs to be processed, normalized, and converted into numeric values that a machine learning algorithm can understand.
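To make the derived-feature idea concrete, this sketch computes a “length of time customer” value and an “is active in last month” flag from hypothetical signup records and an event log; all table and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical raw data: one row per customer, plus an activity event log
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2021-03-01", "2023-11-15"]),
})
events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "event_time": pd.to_datetime(["2024-01-05", "2024-06-02", "2023-12-01"]),
})

now = pd.Timestamp("2024-06-30")

# Derived feature: length of time as a customer, in days
customers["tenure_days"] = (now - customers["signup_date"]).dt.days

# Derived feature: active in the last month (any event in the past 30 days)
last_event = events.groupby("customer_id")["event_time"].max()
customers["active_last_month"] = (
    customers["customer_id"].map(last_event) >= now - pd.Timedelta(days=30)
)

print(customers)
```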
Automated Feature Engineering
Data scientists spend a lot of time on feature engineering, constructing new derivative attributes that better represent the problem being solved. Domain experience is often required, as is an understanding of the parameter requirements for each model. If data scientists hit compute bottlenecks, this iterative process is prolonged, costing valuable time and resources.
A significant trend in machine learning products is support for automated feature engineering, in which tooling can automatically develop features. Some tools address tasks such as imputing missing values for specific algorithms, calculating aggregate statistics (such as the mean), or calculating ratios. Others are more sophisticated; some work only on relational data, while others are best suited to image data.
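As an illustration of what such tooling automates, the sketch below imputes a missing value with the column mean and then mechanically generates ratio features for every pair of numeric columns; the dataset and column names are hypothetical, and real tools apply far more strategies than this.

```python
import itertools
import pandas as pd

# Hypothetical dataset with a missing value
df = pd.DataFrame({
    "income": [52000.0, None, 71000.0],
    "debt": [12000.0, 8000.0, 30000.0],
})

# Impute missing values with the column mean
df["income"] = df["income"].fillna(df["income"].mean())

# Automatically generate a ratio feature for every ordered pair of columns
base_columns = list(df.columns)
for a, b in itertools.permutations(base_columns, 2):
    df[f"{a}_per_{b}"] = df[a] / df[b]

print(df)
```

Candidate features generated this way are exactly what the domain expert then reviews, keeping only those likely to carry predictive value.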
Although automation does not replace the data scientist, it can assist and cut down on time spent developing new features. The domain expert can review the features provided by the tool and select the features that may provide predictive value. Given the productivity benefits, many organizations are adopting a hybrid approach that utilizes manual and automated feature engineering.
Snowflake for Feature Engineering
Data scientists need powerful compute resources for feature engineering. Many tools available today for data preparation and feature engineering are either highly inefficient or overly complicated to operate, resulting in brittle, expensive, and time-consuming data pipelines. With Snowflake, data engineers and data scientists can perform feature engineering on large, petabyte-scale datasets without the need for sampling.
Snowpark is a developer framework for Snowflake that allows data engineers, data scientists, and data developers to build pipelines that feed ML models and applications faster and more securely, within a single platform, using SQL, Python, Java, and Scala. Snowpark supports feature engineering, model training, and inference. Additionally, Snowpark-optimized warehouses have compute nodes with 16x the memory and 10x the local cache of standard warehouses, enabling data scientists and other data teams to further streamline ML pipelines with compute infrastructure that can execute memory-intensive operations such as statistical analysis, feature engineering transformations, model training, and inference within Snowflake at scale.
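As a rough sketch of how the earlier salary-scaling example might look in Snowpark for Python, the snippet below pushes the transformation down to run inside Snowflake; the CUSTOMERS table, column names, output table, and connection parameters are all hypothetical.

```python
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, min as min_, max as max_

# Hypothetical connection parameters; substitute your account details
connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
}
session = Session.builder.configs(connection_parameters).create()

customers = session.table("CUSTOMERS")  # hypothetical source table

# Compute the salary range, then scale the column to 0.0-1.0; the work
# executes inside Snowflake rather than on the client
bounds = customers.select(
    min_(col("SALARY")).alias("LO"),
    max_(col("SALARY")).alias("HI"),
).collect()[0]

features = customers.with_column(
    "SALARY_SCALED",
    (col("SALARY") - bounds["LO"]) / (bounds["HI"] - bounds["LO"]),
)
features.write.save_as_table("CUSTOMER_FEATURES", mode="overwrite")
```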
To learn more about the Snowflake Data Cloud, visit our page.