On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Link: https://arxiv.org/abs/2204.12708

PDF: https://aclanthology.org/2022.findings-naacl.168.pdf

Findings of the Association for Computational Linguistics: NAACL 2022, pages 2182 – 2194
July 10-15, 2022

Graphic:

Abstract:

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out “easy” instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to “throw the baby out with the bathwater” and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.

Author(s): Roy Schwartz, Gabriel Stanovsky

Publication Date: July 2022

Publication Site: arXiV

Analyzing Census Data in Excel

Link: https://www.census.gov/data/academy/courses/excel.html

Description

Excel is a very popular tool among all data users. It can be leveraged to unlock the value of open data of all kinds, and it is particularly well-suited to transforming, analyzing, and visualizing Census data. This course will show how to use Excel to access, manipulate, and visualize Census data.  It will also tools for doing advanced statistical analysis.

After completing this course, you will be able to:
    ✓    Access data from the Census Bureau using the American FactFinder
    ✓    Format tables for data analysis
    ✓    Perform basic and advanced analysis of Census data using Excel
    ✓    Create data visualizations such as sparklines, hierarchical charts, and histograms

Author(s): Andy Hecktman, Alexandra Barker

Date Accessed: 27 February 2021

Publication Site: U.S. Census Bureau