5 Steps to Data Diversity: More Diverse Data Makes for Smarter AI
In an iconic Top Gun scene, Charlie tells Maverick that a maneuver is impossible. Maverick replies, “The data on the MiG is inaccurate.” In the more recent sequel, despite his extensive firsthand knowledge, Maverick is told “the future’s coming and you’re not in it.” Flying may be more automated now, but accurate and diverse data are as important to aviation safety as ever, and likely more so. In two recent airplane accidents, automated systems aboard Boeing 737 MAX aircraft made critical flight control decisions based on inaccurate data from a single sensor input; the erroneous data led to catastrophe. Relying on limited data sources increases risk.
With more of our lives being influenced by automated decisions based on probabilistic models, the quality and accuracy of data are paramount. Good decisions need as much relevant input as possible, and the same is true for AI models. Getting a second opinion is common practice for humans; it should be standard practice for automation, too. We need data diversity.
Data diversity is one means of mitigating the risk of your AI models capturing internal bias and “hallucinating” or, plainly speaking, making mistakes. A few years ago, industry analysts predicted a shift in focus toward data that provides a more complete situational analysis, or 360-degree view, and coined the term “wide data” to differentiate it from “big data.” I find the term “diverse data” easier to understand. Diverse data comes from sources previously inaccessible or untapped, from partners and customers, from data providers or from automation itself. Diverse data provides a broader view and helps avoid the potential blinders traditional sources can perpetuate.
To ensure your AI models are trained on data that is as broad and diverse as possible, here are five best practices for greater data diversity:
1. Break down internal silos to access cross-functional sources. Historically, data was isolated within applications or systems distributed across an organization. Data marts built to serve specific analytic purposes perpetuated the distributed nature of data. The first step to breaking down internal silos is to establish enterprise-wide data repositories and governance policies that streamline and facilitate appropriate access. To further encourage data use and reuse, adopt data product thinking, with processes to facilitate the design and delivery of data products and teams to build and deploy them. An end-user-facing data catalog or marketplace can improve discoverability and access.
2. Transform unstructured data to expand available internal data. To ensure that all data is made available, organizations must adopt tools to transform unstructured data into usable formats. Documents, emails, images, videos and voice recordings provide valuable input for training. For example, retailers and their suppliers analyze product reviews to identify customer sentiment and, ideally, understand causality. It’s not enough to know that someone bought a product; they want to know why. That growing interest in the “why” drives demand for more data. Transcripts of interactions with customer service (or of cockpit voice recordings) help build a more complete context for prediction or causal inference.
3. Collaborate with partners to access different data sources. Data collaboration enables organizations to expand access to data across their business ecosystem. With the vagaries of consumer demand and challenges to global supply, retailers struggle to predict demand and optimize inventories. They need to gather real-time insights from every corner of their supply network to get the full context. Retailers like Aldi and Instacart share data with suppliers to improve demand forecasting, prevent the dreaded out-of-stock scenario and improve marketing. Vehicle manufacturers like Scania share data with fleet operators to improve maintenance and product design. Even patient data can be shared to accelerate diagnostics, personalize treatments and improve outcomes. Data clean rooms facilitate privacy-preserving data collaboration for use cases in industries such as healthcare and life sciences.
4. Acquire and access third-party external data sources. Stories of institutional bias in housing, hiring, loans and policing are not new, and will likely continue to surface. There is a growing realization that AI models capture historical biases. Expanding data sources can help. If an HR department wants to identify a profile for a specific role in its organization, for example, using only internal data would capture the characteristics of past employees in that role. To get a more representative picture of who they might hire, the HR team would want to incorporate external data to mitigate that existing bias. For example, ADP Payroll and Demographic Data or Workforce Data Analytics from Revelio offer potential sources of diverse data to enable broader representation. Models could either be trained on these external sources or leverage them as references via retrieval-augmented generation (RAG). Stay tuned for more to come on that from Snowflake.
5. Consider creating synthetic data. Another approach is to create synthetic data to balance representation. If bias is expected or observed, new data can be created to increase under-represented characteristics. For example, an online AI video editor developed a diversity fine-tuned (DFT) model to improve minority representation. The model was trained on synthetic data, constructed from diverse text prompts, that varies in perceived skin tone and gender. These text prompts are built from multiplicative combinations of ethnicities, genders, professions, age groups and so on, resulting in diverse synthetic data (a simple sketch of this prompt-combination approach appears below). Compared to baselines, DFT models generate more people with perceived darker skin tones and more women. A request for an image of a businessperson would be more likely to include, for example, women in headscarves or a doctor with darker skin.
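To make the multiplicative-combination idea concrete, here is a minimal Python sketch. The attribute lists and prompt template are purely illustrative, not the ones used in the DFT work described above:

```python
from itertools import product

# Illustrative attribute lists; a real project would use richer,
# carefully reviewed categories and terminology.
ethnicities = ["East Asian", "Black", "Hispanic", "Middle Eastern", "White"]
genders = ["woman", "man", "non-binary person"]
professions = ["doctor", "business executive", "pilot", "teacher"]
age_groups = ["young", "middle-aged", "older"]

# Multiplicative combination: every ethnicity x gender x profession x age group
# becomes one text prompt for generating balanced synthetic training examples.
prompts = [
    f"A photo of a {age} {ethnicity} {gender} working as a {profession}"
    for ethnicity, gender, profession, age in product(
        ethnicities, genders, professions, age_groups
    )
]

print(len(prompts))  # 5 * 3 * 4 * 3 = 180 prompts
print(prompts[0])    # "A photo of a young East Asian woman working as a doctor"
```

Each prompt then drives a generative model, so under-represented combinations can appear in the training set as often as you choose.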
While Snowflake wasn’t involved in the aforementioned DFT modeling, synthetic data can be created in Snowflake at scale with automated SQL creation based on a standardized specification. And with Snowpark for Python running Faker, a Python library for generating realistic yet synthetic data, you can build training prompts with any combination and any distribution of attributes directly inside Snowflake, using just SQL.
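As a rough local illustration (not Snowflake-specific code; the column names, roles and weights are invented for the example), this is the kind of Faker-based generator you might wrap for use inside Snowflake:

```python
import random
from faker import Faker  # pip install Faker

fake = Faker()

# Invented weights: deliberately over-sample roles that are
# under-represented in the historical data.
role_weights = {"engineer": 0.3, "nurse": 0.3, "sales manager": 0.2, "pilot": 0.2}

def synthetic_employee_row() -> dict:
    """Return one realistic-looking but entirely synthetic record."""
    return {
        "name": fake.name(),
        "email": fake.email(),
        "city": fake.city(),
        "role": random.choices(
            list(role_weights), weights=list(role_weights.values()), k=1
        )[0],
        "age": fake.random_int(min=21, max=67),
    }

rows = [synthetic_employee_row() for _ in range(1_000)]
print(rows[0])
```

In practice you would adapt this to however you run Python in Snowflake (for example, a Snowpark stored procedure that writes the generated rows to a table), so the synthetic records can then be queried and sampled with SQL alongside your other training data.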
Diversify your data
Data diversity is one element of an effective and responsible AI strategy. Check out some of the other essentials for the AI journey.
We’re excited that many Snowflake customers are innovating responsibly with AI. Take a look at our AI-focused data leaders to watch in 2024.