Unlocking the Power of Structured Data in GenAI: A Comparative Study
In the rapidly evolving world of Artificial Intelligence (AI), data quality and organization have become pivotal. Our recent study sheds light on how well-structured data can significantly enhance the performance of Generative AI (GenAI) applications. Leveraging the capabilities of ChatGPT-4, we embarked on a series of experiments to demonstrate how different data structures impact the efficacy of a GenAI chatbot. Our findings were compelling, revealing that better-organized data sources dramatically reduce hallucinations and improve the accuracy of responses.
The Experiment Setup
Our objective was to create a chatbot that could hold meaningful conversations grounded in a specific dataset. To achieve this, we designed a Retrieval-Augmented Generation (RAG) demo in which the chatbot, powered by ChatGPT-4, could answer any user prompt by dynamically pulling in context from a relational database. This RAG setup allowed the chatbot to use additional data as context during response generation, enhancing the relevance and accuracy of its outputs. For the demonstration, we used Streamlit to build an interactive demo app and tested it on data platforms from Snowflake, Databricks, and Microsoft. We briefly tried the Snowflake Arctic and DBRX LLMs, but ultimately chose OpenAI GPT-4 for consistency across the experiment.
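To give a feel for the setup, here is a minimal sketch of what such a Streamlit front end can look like. The helper names, model id, prompt wording, and the stubbed retrieval step are our illustrative assumptions, not the exact demo code.

```python
# rag_demo.py -- minimal sketch of a Streamlit RAG front end.
# Helper names, the model id, and the stubbed retrieval step are assumptions.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def retrieve_context(question: str) -> str:
    # Placeholder: the retrieval step (query generation against the
    # warehouse) is sketched in the RAG section below.
    return ""


def answer_with_context(question: str, context: str) -> str:
    """Ask GPT-4 to answer using only the retrieved database context."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only this context:\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


st.title("Vehicle dealership chatbot (RAG demo)")
question = st.text_input("Ask a question about the dealership data")
if question:
    st.write(answer_with_context(question, retrieve_context(question)))
```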
To effectively manage and organize the data for this experiment, we used VaultSpeed to design and build a Data Vault integration layer. Additionally, VaultSpeed Studio was employed to derive a virtual star schema and flat tables on top. This approach allowed us to maintain consistent and well-integrated data before it was transformed into the respective structures.
We prepared three types of data inputs:
1. Source Lake: Here, data was simply dumped in the data platform as an exact copy of the source without any preprocessing, integration, or organization.
2. Star Schema Dimensional Model: This data source was organized into a star schema, which is a common database structure used in data warehousing. It was derived virtually on top of our Data Vault model using VaultSpeed Studio.
3. Flat Table: This data source involved transforming the data into a single, denormalized table, also derived virtually using VaultSpeed Studio (see the query sketch directly after this list).
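To make the practical difference between the two derived structures concrete, the sketch below pairs the kind of query a star schema requires with its flat-table equivalent. Every table and column name here is invented for illustration; this is not our actual model.

```python
# Illustrative only: table and column names are invented for this sketch.
# Star schema: the fact table must be joined to its dimensions at query time.
STAR_SCHEMA_SQL = """
SELECT d.dealership_name, p.product_name, SUM(f.sales_amount) AS revenue
FROM fact_sales f
JOIN dim_dealership d ON f.dealership_key = d.dealership_key
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY d.dealership_name, p.product_name;
"""

# Flat table: the same information pre-joined into one denormalized table,
# so a generated query needs no joins at all.
FLAT_TABLE_SQL = """
SELECT dealership_name, product_name, SUM(sales_amount) AS revenue
FROM sales_flat
GROUP BY dealership_name, product_name;
"""
```

The fewer joins the model has to generate, the less room there is for it to pick a wrong key or table, which foreshadows the results reported below.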
Adding the RAG mechanism to our experimental setup was pivotal. It let the language model leverage specific, structured data from the underlying databases, which directly shaped the chatbot's ability to provide precise, contextually relevant answers.
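We won't reproduce the demo code in full, but a retrieval step of this kind can be sketched roughly as follows, with sqlite3 standing in for the warehouse connector. The schema hint, prompt wording, and function signature are all illustrative assumptions.

```python
# Sketch of the retrieval step: ask the model for a SQL query, run it, and
# return the rows as textual context. sqlite3 stands in for the warehouse.
import sqlite3
from openai import OpenAI

client = OpenAI()

# Assumed flat-table schema, passed to the model as a hint.
SCHEMA_HINT = "sales_flat(dealership_name, product_name, sales_amount, sold_at)"


def retrieve_context(question: str, db_path: str = "dealership.db") -> str:
    """Generate SQL for the question, execute it, and serialize the result."""
    sql = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": f"Write one SQL query for the schema {SCHEMA_HINT}. "
                        "Return only SQL, no explanation."},
            {"role": "user", "content": question},
        ],
    ).choices[0].message.content.strip()
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(sql).fetchall()
    return f"Query used: {sql}\nRows returned: {rows}"
```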
Integration of Diverse Data Sources
For our demonstration, we simulated a fictitious vehicle dealership environment to test the chatbot's performance in a complex, real-world setting. This setup required integrating data from a variety of sources, including an Enterprise Resource Planning (ERP) system, a Customer Relationship Management (CRM) tool, and a visitor tracking system. By consolidating these diverse data streams into our Data Vault, we created a comprehensive dataset reflecting the multifaceted operations typical of a vehicle dealership. This not only enhanced the realism of our testing environment but also let us thoroughly assess how well our RAG-based chatbot could handle intricate queries across different data structures and sources. The integration was crucial for demonstrating the chatbot's ability to navigate and retrieve information from a rich, multi-layered data environment, mirroring the challenges faced by real-world business applications.
All sources integrated in a Data Vault model
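For readers new to Data Vault modeling, the sketch below shows the generic hub-and-satellite shape such an integration layer takes. This is textbook Data Vault DDL held in strings, not the code VaultSpeed generates, and all names are illustrative.

```python
# Generic Data Vault shapes for the customer concept shared by the ERP and
# CRM sources. Illustrative DDL only; not VaultSpeed's generated output.
HUB_CUSTOMER = """
CREATE TABLE hub_customer (
    customer_hkey CHAR(32) PRIMARY KEY,   -- hash of the business key
    customer_id   VARCHAR(50) NOT NULL,   -- business key shared across sources
    load_date     TIMESTAMP   NOT NULL,
    record_source VARCHAR(20) NOT NULL    -- e.g. 'ERP', 'CRM'
);
"""

SAT_CUSTOMER_CRM = """
CREATE TABLE sat_customer_crm (
    customer_hkey   CHAR(32) REFERENCES hub_customer,
    load_date       TIMESTAMP,
    loyalty_card_no VARCHAR(20),          -- the attribute queried in prompt 3
    email           VARCHAR(100),
    PRIMARY KEY (customer_hkey, load_date)
);
"""
```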
Methodology
We designed three distinct prompts, ranging from simple factual questions to more complex queries requiring contextual understanding and reasoning, and executed each prompt 15 times against each of the three data structures, for 135 runs in total.
The responses were then evaluated and ranked on a scale from 1 to 5 (a sketch of the evaluation loop follows the scale):
- 1 (Poor): The answer was incorrect or irrelevant.
- 2 (Fair): The answer was partially correct or somewhat relevant.
- 3 (Good): The answer was somewhat accurate but could be more detailed or precise.
- 4 (Very Good): The answer was detailed and accurate, with minor improvements possible.
- 5 (Excellent): The answer was perfectly accurate and highly relevant.
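The loop itself is simple. In the sketch below, ask_chatbot() is a stand-in for the demo app pointed at one data structure, and scoring is done by hand against the scale above; both stubs are our assumptions about the harness, not its actual code.

```python
# Sketch of the evaluation harness: 3 prompts x 3 structures x 15 runs each.
from statistics import mean, stdev

PROMPTS = [
    "What was the total sales revenue for the last quarter?",
    "List the top three sold products in each dealership.",
    "What is the customer loyalty card number of Jodi Edmundson?",
]
STRUCTURES = ["source_lake", "star_schema", "flat_table"]
RUNS = 15


def ask_chatbot(prompt: str, structure: str) -> str:
    """Stand-in for the RAG app configured against one data structure."""
    return f"(answer from {structure} for: {prompt})"  # replace with real call


def score(answer: str) -> int:
    """Manual 1-5 rating, following the scale above."""
    return int(input(f"Rate 1-5 -> {answer}: "))


results = {s: [] for s in STRUCTURES}
for structure in STRUCTURES:
    for prompt in PROMPTS:
        for _ in range(RUNS):
            results[structure].append(score(ask_chatbot(prompt, structure)))

for structure, scores in results.items():
    print(structure, round(mean(scores), 2), round(stdev(scores), 2))
```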
Breakdown of Prompts and Responses
1. Prompt: "What was the total sales revenue for the last quarter?"
- Source Lake: The responses often included unrelated figures or failed to provide a concrete number.
- Star Schema: The chatbot provided a closer estimate but occasionally confused sales figures from different periods, suggesting some difficulty in accurately querying related data.
- Flat Table: The answer was precise and accurate, matching the actual sales revenue and demonstrating the effectiveness of a denormalized table for straightforward querying.
2. Prompt: "List the top three sold products in each dealership."
- Source Lake: The response was vague and listed incorrect products.
- Star Schema: The chatbot identified top products correctly for some dealerships but not all.
- Flat Table: The response was most often accurate and comprehensive, correctly listing the top three sold products in each dealership.
3. Prompt: "What is the customer loyalty card number of Jodi Edmundson?"
- Source Lake: The response was often incorrect or returned no results.
- Star Schema: The chatbot provided the correct card number in some instances but not consistently, indicating partial but variable success in data retrieval.
- Flat Table: The answer was accurate and consistent across almost all executions.
Example: a prompt where the Source Lake produced an unsatisfying answer (1), while the Flat Table provided a correct result (5).
Findings
The results were very revealing:
- Source Lake: Despite ChatGPT-4's impressive ability to generate queries that made a good attempt at finding the answer, this data source proved challenging for the chatbot. The average score hovered around 1.8, with a noticeable frequency of scores at the lower end (1 and 2), indicating significant inaccuracies and irrelevant responses. The variance in scores was also high, with outliers frequently occurring above 3, suggesting that while performance was generally poor, it could occasionally spike to better outcomes.
- Star Schema Dimensional Model: Showing marked improvement over the Source Lake, the Star Schema Model achieved an average score of approximately 2.5. The distribution of scores was more centered around 2 and 3, with fewer instances of the lowest score (1) and occasional higher scores (4), reflecting a moderate level of accuracy and relevance. Outliers in this category typically hovered around 4, suggesting instances where the chatbot performed quite well.
- Flat Table: This structure demonstrated superior performance, with an average score of 3.5, indicating a robust ability to generate accurate and relevant responses. The scores were predominantly 3 and above, with a significant portion achieving the highest score of 5. This high scoring indicates that the Flat Table data structure facilitated the most consistent and accurate responses from the chatbot, with minimal outliers and a tight score distribution around the upper quartile.
Overview of the results
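Assuming the runs are logged to a simple CSV, the per-structure statistics behind this overview can be reproduced with a few lines of pandas. The file name and column layout are our assumptions about how such runs might be logged.

```python
# Summarize the score distribution per data structure. The scores.csv layout
# (structure, prompt, run, score) is an assumed logging format.
import pandas as pd

scores = pd.read_csv("scores.csv")
summary = scores.groupby("structure")["score"].describe()
print(summary[["mean", "std", "25%", "50%", "75%", "max"]])
```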
These findings underline the critical importance of data structure in enhancing the performance of AI applications. The simpler the data organization, particularly as seen with the Flat Table, the higher the accuracy and reliability of the chatbot's responses. This consistency is vital for applications requiring high levels of precision and reliability in automated decision-making.
Conclusion
Our study underscores the critical importance of data organization in AI applications. The stark contrast between the chatbot's performance across different data structures highlights a key insight: well-prepared data sources enhance the accuracy and relevance of AI-generated responses. While GPT-4 produced impressive results even against the Source Lake, the variation in its answers there was significantly higher than with the other two structures.
Kurt Janssens, AI Product Owner at VaultSpeed, adds: "These results showcase the significant potential of data automation in enhancing Retrieval-Augmented Generation systems. By automating the structuring and integration of data, we not only streamline the developmental pipeline but also boost the operational efficiency and accuracy of AI applications. Remarkably, the same app, built in the same number of hours, will deliver substantially better results thanks to these automated processes."
Implications for Businesses
For businesses leveraging AI for data-driven decision-making, these findings are particularly relevant. Investing time and resources in structuring data can lead to more reliable insights, reducing the risk of making decisions based on inaccurate information. Whether it’s a data warehouse in a star schema or a denormalized flat table, the structure of your data can be the difference between insightful analytics and misguided conclusions.
Dirk Vermeiren, CTO, comments on the results: "The profound impact of structured data on AI performance is evident. A well-organized data foundation is not just beneficial; it’s essential for fueling effective and reliable AI systems."
Future Directions
While our study focused on a specific GenAI application, the implications extend to various AI and machine learning contexts. Future research could explore:
- The impact of other data formats used to feed the chatbot app.
- The scalability of structured data benefits across different AI applications.
- The role of automated data preparation tools in enhancing data quality.
In conclusion, as AI continues to transform industries, the foundational role of data structure cannot be overstated. Our study provides a roadmap for businesses and developers aiming to harness GenAI's full potential through well-organized, structured data.