Description

Every data practitioner knows that data quality is of the utmost importance, hence the prevalent expression “garbage in, garbage out.” Characterizing the signal-to-noise ratio of data is difficult, and data quality problems are particularly acute when working with natural language data, which can contain sparse information and lack context or substance. Traditional approaches to assessing the quality of natural language data rely on problematic heuristics, such as character counts or entropy-based measures, that do not directly indicate whether a message can be understood on its own. This talk will introduce a model-based approach to measuring the quality of a natural language message, which we call InfoQ. The model addresses a range of use cases, from improving the quality of data used for model training to ensuring that only the most valuable data is scored by targeted models or included in context windows for generative AI.

Key topics to be covered:

  • Model Development and Architecture: Dive into specifics on the development of this model, including how we conceptualized the problem, collected the training data, and continued to iterate.
  • Application for filtering uninformative data: Discuss how InfoQ can be used to filter out low-quality data. We’ll also walk through a use case where this application contributed to improved model performance on a generative AI-based summarization task. This approach can also help minimize hallucinations by removing low-fidelity content that may produce inaccurate inferences.
  • Application for dynamic ordering of data: Discuss how InfoQ can be used to order datasets by quality. We’ll also explore a use case where this application proved useful for a complex legal-domain task.
  • Future Iteration: Offer ways we plan to develop this model further, including continued maintenance of the training data and additional refinements.
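As a rough illustration of the filtering and ordering applications described above, here is a minimal Python sketch. The `info_q_score` function is a hypothetical stand-in for the actual InfoQ model (its real scoring logic is not described here); the function names, threshold, and length-based heuristic are all assumptions for demonstration, not the presenters’ implementation.

```python
def info_q_score(message: str) -> float:
    """Hypothetical stand-in for the InfoQ model: returns a quality
    score in [0, 1]. A trivial word-count heuristic is used here
    purely so the example runs end to end."""
    return min(len(message.split()) / 20.0, 1.0)

def filter_uninformative(messages: list[str], threshold: float = 0.3) -> list[str]:
    """Application 1: drop messages scoring below a quality threshold
    before training, scoring, or prompt construction."""
    return [m for m in messages if info_q_score(m) >= threshold]

def order_by_quality(messages: list[str]) -> list[str]:
    """Application 2: sort messages so the highest-quality content is
    considered first (e.g., when filling a bounded context window)."""
    return sorted(messages, key=info_q_score, reverse=True)

messages = [
    "ok",
    "Meeting moved to 3pm tomorrow; please review the Q3 summary doc beforehand.",
    "lol",
    "The contract's indemnification clause was revised per outside counsel's notes.",
]
kept = filter_uninformative(messages)   # short, low-signal messages removed
ranked = order_by_quality(messages)     # most informative messages first
```

In practice the scorer would be the trained InfoQ model rather than a heuristic, but the surrounding filter/sort plumbing would look much the same.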

This talk is intended for data industry professionals working in the natural language space who are interested in developing their own solutions for detecting meaningful message data and improving the quality of their targeted or generative AI models.

Details

July 11, 2024

8:40 am – 9:15 am

Delaware

Track:

AI & ML

Level:

Intermediate

Tags

Data Quality
GenAI
Models

Presenters

Nicole Basinski
Data Scientist
Aware

Bio

Nicole Basinski is a Data Scientist on the Behavioral Intelligence team at Aware, a SaaS startup founded in Columbus, Ohio. She has industry experience in the development of product-driven machine learning models in the natural language space, as well as the handling of massive datasets. Nicole obtained a BS in Data Analytics from The Ohio State University in 2021, and completed the Erdos Institute Data Science Bootcamp in 2022. She has also been involved with the DataConnect conference for three years, and is honored to be a part of the WIA community.