Abstract
We present EnDive (English Dialect Variability Evaluation), a benchmark for assessing the fairness and robustness of large language models across English dialects. EnDive spans five major dialects—African American Vernacular English (AAVE), Indian English, British English, Australian English, and Standard American English—and covers tasks including sentiment analysis, natural language inference, and question answering. We find significant performance disparities across dialects, with models consistently underperforming on AAVE and Indian English inputs relative to Standard American English.