Behavior-Aware Data Valuation for LLMs at Scale
EB2 3001, 890 Oval Drive, Raleigh

Abstract: Large Language Models (LLMs) depend on massive datasets whose quality and influence remain largely opaque. Data valuation offers principled methods to quantify how training data contributes to model performance and behavior. Yet, scaling classical approaches such as influence functions to trillion-token corpora continues to be a…