[109] Data Falsificada (Part 1): “Clusterfake”

Link: https://datacolada.org/109

Graphic:

Excerpt:

Two summers ago, we published a post (Colada 98: .htm) about a study reported within a famous article on dishonesty (.htm). That study was a field experiment conducted at an auto insurance company (The Hartford). It was supervised by Dan Ariely, and it contains data that were fabricated. We don’t know for sure who fabricated those data, but we know for sure that none of Ariely’s co-authors – Shu, Gino, Mazar, or Bazerman – did it [1]. The paper has since been retracted (.htm).

That auto insurance field experiment was Study 3 in the paper.

It turns out that Study 1’s data were also tampered with…but by a different person.

That’s right:
Two different people independently faked data for two different studies in a paper about dishonesty.

The paper’s three studies allegedly show that people are less likely to act dishonestly when they sign an honesty pledge at the top of a form rather than at the bottom of a form. Study 1 was run at the University of North Carolina (UNC) in 2010. Gino, who was a professor at UNC prior to joining Harvard in 2010, was the only author involved in the data collection and analysis of Study 1 [2].

Author(s): Uri Simonsohn, Leif Nelson, and Joseph Simmons

Publication Date: 17 Jun 2023

Publication Site: Data Colada

Batch-dependent safety of the BNT162b2 mRNA COVID-19 vaccine

Link: https://onlinelibrary.wiley.com/doi/full/10.1111/eci.13998

Graphic:

Excerpt:

Vaccination has been widely implemented for mitigation of coronavirus disease-2019 (Covid-19), and by 11 November 2022, 701 million doses of the BNT162b2 mRNA vaccine (Pfizer-BioNTech) had been administered and linked with 971,021 reports of suspected adverse effects (SAEs) in the European Union/European Economic Area (EU/EEA).1 Vaccine vials with individual doses are supplied in batches with stringent quality control to ensure batch and dose uniformity.2 Clinical data on individual vaccine batch levels have not been reported and batch-dependent variation in the clinical efficacy and safety of authorized vaccines would appear to be highly unlikely. However, not least in view of the emergency use market authorization and rapid implementation of large-scale vaccination programs, the possibility of batch-dependent variation appears worthy of investigation. We therefore examined rates of SAEs between different BNT162b2 vaccine batches administered in Denmark (population 5.8 million) from 27 December 2020 to 11 January 2022.

….

A total of 7,835,280 doses were administered to 3,748,215 persons with the use of 52 different BNT162b2 vaccine batches (2340–814,320 doses per batch) and 43,496 SAEs were registered in 13,635 persons, equaling 3.19 ± 0.03 (mean ± SEM) SAEs per person. In each person, individual SAEs were associated with vaccine doses from 1.531 ± 0.004 batches resulting in a total of 66,587 SAEs distributed between the 52 batches. Batch labels were incompletely registered or missing for 7.11% of SAEs, leaving 61,847 batch-identifiable SAEs for further analysis, of which 14,509 (23.5%) were classified as severe SAEs and 579 (0.9%) were SAE-related deaths. Unexpectedly, rates of SAEs per 1000 doses varied considerably between vaccine batches, with 2.32 (0.09–3.59) (median [interquartile range]) SAEs per 1000 doses, and significant heterogeneity (p < .0001) was observed in the relationship between numbers of SAEs per 1000 doses and numbers of doses in the individual batches. Three predominant trendlines were discerned, with noticeably lower SAE rates in larger vaccine batches and additional batch-dependent heterogeneity in the distribution of SAE seriousness between the batches representing the three trendlines (Figure 1). Compared to the rates of all SAEs, serious SAEs and SAE-related deaths per 1000 doses were much less frequent, and numbers of these SAEs per 1000 doses displayed considerably greater variability between batches, with lesser separation between the three trendlines (not shown).
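
The headline statistics are simple per-batch rates. A minimal TypeScript sketch, using invented batch counts rather than the Danish registry data, shows how "SAEs per 1000 doses" and the median/interquartile range across batches would be computed:

```typescript
// Hypothetical batch data: doses administered and SAE reports per batch.
// These numbers are illustrative only, not values from the study.
interface Batch { id: string; doses: number; saes: number; }

const batches: Batch[] = [
  { id: "FA1234", doses: 814320, saes: 500 },
  { id: "FB5678", doses: 120000, saes: 700 },
  { id: "FC9012", doses: 2340,   saes: 20  },
];

// Rate of suspected adverse events per 1000 doses for each batch.
const rates = batches.map(b => (b.saes / b.doses) * 1000);

// Median and interquartile range across batches (linear-interpolation quantiles).
function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos), hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

const sorted = [...rates].sort((a, b) => a - b);
console.log("SAEs per 1000 doses:", rates.map(r => r.toFixed(2)));
console.log("median:", quantile(sorted, 0.5).toFixed(2));
console.log("IQR:", quantile(sorted, 0.25).toFixed(2), "to", quantile(sorted, 0.75).toFixed(2));
```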

Author(s): Max Schmeling, Vibeke Manniche, Peter Riis Hansen

Publication Date: 30 Mar 2023

Publication Site: European Journal of Clinical Investigation

Child Mortality Rate, under age five – doc v11

Link: https://www.gapminder.org/data/documentation/gd005/

Graphic:

Excerpt:

Documentation — version 11

This page describes how Gapminder has combined data from multiple sources into one long, coherent dataset of Child mortality under age 5, for all countries for all years from 1800 to 2100.

Data » Online spreadsheet with data for countries, regions and global total — v11

SUMMARY DOCUMENTATION OF V11

Sources

— 1800 to 1950: Gapminder v7 (In some cases this is also used for years after 1950; see below.) This was compiled and documented by Klara Johansson and Mattias Lindgren from many sources, but mainly based on www.mortality.org and the series of books called International Historical Statistics by Brian R. Mitchell, which often contain historic estimates of Infant mortality rate that were converted to Child mortality through regression. See the detailed documentation of v7 below.

— 1950 to 2016: UN IGME is a data collaboration project between UNICEF, WHO, the UN Population Division and the World Bank. They released new estimates of child mortality for countries and a global estimate on September 19, 2019, and the data are available at www.childmortality.org. In this dataset, 70% of all countries have estimates between 1970 and 2018, while roughly half the countries also reach back to 1960 and 17% reach back to 1950.

— 1950 to 2100: UN POP World Population Prospects 2019 provides annual data for Child mortality rate for all countries in the annually interpolated demographic indicators, called WPP2019_INT_F01_ANNUAL_DEMOGRAPHIC_INDICATORS.xlsx, accessed on January 12, 2020.
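
The documentation above describes stitching three sources together by year range and source priority. A minimal sketch of that splicing logic, with invented values and simplified priorities rather than Gapminder's actual code, might look like this:

```typescript
// Hypothetical splicing of child-mortality series by source priority:
// Gapminder v7 for the historical period, UN IGME where available from 1950,
// and UN WPP projections elsewhere up to 2100. All values are made up.
type Series = Record<number, number>; // year -> deaths per 1000 live births

const gapminderV7: Series = { 1800: 420, 1900: 350, 1949: 150 };
const unIgme: Series      = { 1950: 140, 1980: 60, 2016: 8 };
const unWpp: Series       = { 1950: 142, 2016: 8.2, 2100: 2.5 };

// Pick a value for a given year using the priority described in the docs.
function pick(year: number): number | undefined {
  if (year < 1950) return gapminderV7[year];
  return unIgme[year] ?? unWpp[year];
}

const combined: Series = {};
for (const year of [1800, 1900, 1949, 1950, 1980, 2016, 2100]) {
  const v = pick(year);
  if (v !== undefined) combined[year] = v;
}
console.log(combined);
```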

Publication Date: accessed 22 March 2023

Publication Site: Gapminder

Insurtech Regs, ‘Dark Pattern’ Spotting on NAIC’s To-Do List

Link: https://www.thinkadvisor.com/2022/12/16/insurtech-regs-dark-pattern-spottting-on-naics-to-do-list/

Excerpt:

In August [2022], Birny Birnbaum, the executive director of the Center for Economic Justice, asked the [NAIC] Market Regulation committee to train analysts to detect “dark patterns” and to define dark patterns as an unfair and deceptive trade practice.

The term “dark patterns” refers to techniques an online service can use to get consumers to do things they would otherwise not do, according to draft August meeting notes included in the committee’s fall national meeting packet.

Dark pattern techniques include nagging; efforts to keep users from understanding and comparing prices; obscuring important information; and the “roach motel” strategy, which makes signing up for an online service much easier than canceling it.

Author(s): Allison Bell

Publication Date: 16 Dec 2022

Publication Site: Think Advisor

Bring ChatGPT INSIDE Excel to Solve ANY Problem Lightning FAST

Link: https://www.youtube.com/watch?v=kQPUWryXwag&ab_channel=LeilaGharani

Video:

Description:

OpenAI inside Excel? How can you use an API key to connect to an AI model from Excel? This video shows you how. You can download the files from the GitHub link above. Wouldn’t it be great to have a search box in Excel you can use to ask any question? Like to create dummy data, create a formula, or ask about the cast of The Sopranos. And then artificial intelligence provides the information directly in Excel – without any copy and pasting! In this video you’ll learn how to set up an API connection from Microsoft Excel to OpenAI’s ChatGPT (GPT-3) by using Office Scripts. As a bonus I’ll show you how you can parse the result if the answer from GPT-3 spans more than one line. This makes it easier to use the information in Excel.
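
For a sense of what this looks like in practice, here is a rough Office Script sketch of the idea: read a prompt from a cell, send it to OpenAI's GPT-3-era completions endpoint, and write the answer back to the sheet. The cell addresses, model name, and output layout are assumptions for illustration, not the video's exact script, and OpenAI's API has evolved since GPT-3.

```typescript
// Office Script sketch (TypeScript): send a cell's contents to OpenAI and
// write the response into the sheet. Illustrative only; supply your own key.
async function main(workbook: ExcelScript.Workbook) {
  const sheet = workbook.getActiveWorksheet();
  const prompt = sheet.getRange("A1").getValue() as string; // question typed by the user
  const apiKey = "YOUR_OPENAI_API_KEY"; // placeholder, never hard-code real keys

  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${apiKey}`
    },
    body: JSON.stringify({
      model: "text-davinci-003", // GPT-3 era model referenced in the video
      prompt: prompt,
      max_tokens: 256
    })
  });

  const data: { choices: { text: string }[] } = await response.json();
  // Split a multi-line answer into separate rows, as in the video's bonus step.
  const lines = data.choices[0].text.trim().split("\n");
  lines.forEach((line, i) => sheet.getRange(`B${1 + i}`).setValue(line));
}
```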

Author(s): Leila Gharani

Publication Date: 6 Feb 2023

Publication Site: YouTube

On the Limitations of Dataset Balancing: The Lost Battle Against Spurious Correlations

Link: https://arxiv.org/abs/2204.12708

PDF: https://aclanthology.org/2022.findings-naacl.168.pdf

Findings of the Association for Computational Linguistics: NAACL 2022, pages 2182 – 2194
July 10-15, 2022

Graphic:

Abstract:

Recent work has shown that deep learning models in NLP are highly sensitive to low-level correlations between simple features and specific output labels, leading to overfitting and lack of generalization. To mitigate this problem, a common practice is to balance datasets by adding new instances or by filtering out “easy” instances (Sakaguchi et al., 2020), culminating in a recent proposal to eliminate single-word correlations altogether (Gardner et al., 2021). In this opinion paper, we identify that despite these efforts, increasingly-powerful models keep exploiting ever-smaller spurious correlations, and as a result even balancing all single-word features is insufficient for mitigating all of these correlations. In parallel, a truly balanced dataset may be bound to “throw the baby out with the bathwater” and miss important signal encoding common sense and world knowledge. We highlight several alternatives to dataset balancing, focusing on enhancing datasets with richer contexts, allowing models to abstain and interact with users, and turning from large-scale fine-tuning to zero- or few-shot setups.
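
The "single-word correlations" the abstract refers to are easiest to see with a toy calculation. This sketch (invented examples, not data from the paper) computes how strongly one word predicts a label, which is exactly the kind of statistic dataset balancing tries to flatten:

```typescript
// Toy illustration of a single-word spurious correlation in a labelled dataset.
const examples: { text: string; label: string }[] = [
  { text: "the movie was not good", label: "negative" },
  { text: "not a bad film at all", label: "positive" },
  { text: "not worth watching", label: "negative" },
  { text: "a good story well told", label: "positive" },
];

// P(label | word appears): the per-word label skew that balancing targets.
function labelGivenWord(word: string, label: string): number {
  const withWord = examples.filter(e => e.text.split(" ").includes(word));
  if (withWord.length === 0) return NaN;
  return withWord.filter(e => e.label === label).length / withWord.length;
}

console.log('P(negative | "not") =', labelGivenWord("not", "negative")); // ≈ 0.67 here
```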

Author(s): Roy Schwartz, Gabriel Stanovsky

Publication Date: July 2022

Publication Site: arXiv

Data Challenges in Building a Facial Recognition Model and How to Mitigate Them

Link: https://www.soa.org/resources/research-reports/2023/data-facial-rec/

PDF: https://www.soa.org/49022b/globalassets/assets/files/resources/research-report/2023/dei107-facial-recognition-challenges.pdf

Graphic:

Excerpt:

This paper is an introduction to AI technology, designed to help actuaries understand how the technology works, the potential risks it could introduce, and how to mitigate those risks. The author focuses on data bias, as it is one of the main concerns with facial recognition technology. This research project was jointly sponsored by the Diversity, Equity and Inclusion Research and the Actuarial Innovation and Technology Strategic Research Programs.

Author(s): Victoria Zhang, FSA, FCIA

Publication Date: Jan 2023

Publication Site: SOA Research Institute

More and Better Uses Ahead for Governments’ Financial Data

Link: https://www.governing.com/finance/more-and-better-uses-ahead-for-governments-financial-data

Excerpt:

In its lame duck session last month, Congress tucked a sleeper section into its 4,000-page omnibus spending bill. The controversial Financial Data Transparency Act (FDTA) swiftly came out of nowhere to become federal law over the vocal but powerless objections of the state and local government finance community. Its impact on thousands of cities, counties and school districts will be a buzzy topic at conferences all this year and beyond. Meanwhile, software companies will be staking claims in a digital land rush.

The central idea behind the FDTA is that public-sector organizations’ financial data should be readily available for online search and standardized downloading, using common file formats. Think of it as “an http protocol for financial data” that enables an investor, analyst, taxpayer watchdog, constituent or journalist to quickly retrieve key financial information and compare it with other numbers using common data fields. Presently, online users of state and local government financial data must rely primarily on text documents, often in PDF format, that don’t lend themselves to convenient data analysis and comparisons. Financial statements are typically published long after the fiscal year’s end, and the widespread online availability of current and timely data is still a faraway concept.
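
The "common data fields" idea is easier to see with a concrete shape. Here is a hypothetical sketch of a machine-readable record for one government; the field names are invented for illustration and do not come from the FDTA, the SEC, or any XBRL US taxonomy:

```typescript
// Hypothetical standardized record of a few key fiscal data points.
// Field names are illustrative assumptions, not an actual taxonomy.
interface GovernmentFiscalRecord {
  entityName: string;
  fiscalYearEnd: string;      // ISO date
  totalRevenues: number;      // USD
  totalExpenditures: number;  // USD
  generalFundBalance: number; // USD
  totalLongTermDebt: number;  // USD
}

const example: GovernmentFiscalRecord = {
  entityName: "Example City",
  fiscalYearEnd: "2022-06-30",
  totalRevenues: 125_000_000,
  totalExpenditures: 118_500_000,
  generalFundBalance: 22_000_000,
  totalLongTermDebt: 310_000_000,
};

// Because every issuer would file the same fields, comparisons become trivial.
const surplus = example.totalRevenues - example.totalExpenditures;
console.log(`${example.entityName} surplus: ${surplus}`);
```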

…..

So far, so good. But the devil is in the details. The first question is just what kind of information will be required in this new system, and when. Most would agree that a complete download of every byte of data now formatted in voluminous governmental financial reports and their notes is overwhelming, unnecessary and burdensome. Thus, a far more incremental and focused approach is a wiser path. For starters, it may be helpful to keep the initial data requirements skeletal and focus initially on a dozen or more vital fiscal data points that are most important to financial statement users. Then, after that foundation is laid, the public finance industry can build out. Of course, this will require that regulators buy into a sensible implementation plan.

The debate over information content requirements should focus first on “decision-useful information.” Having served briefly two decades ago as a voting member of the Governmental Accounting Standards Board (GASB), contributing my professional background as a chartered financial analyst, I can attest that almost every one of their meetings included a board member reminding others that required financial statement information should be decision-useful. A key question, of course, is “useful to whom?”

Author(s): Girard Miller

Publication Date: 17 Jan 2023

Publication Site: Governing

Government Financial Reporting – Data Standards and the Financial Data Transparency Act

Link: https://xbrl.us/events/230124/

Date and Time of upcoming event: 3:00 PM ET Tuesday, January 24, 2023 (60 Minutes)

Description:

The U.S. Congress passed legislation on December 15, 2022 that includes requirements for the Securities and Exchange Commission to adopt data standards related to municipal securities. The Financial Data Transparency Act (FDTA) aims to improve transparency in government reporting, while minimizing disruptive changes and requiring no new disclosures. The University of Michigan’s Center for Local, State and Urban Policy (CLOSUP) has partnered with XBRL US to develop open, nonproprietary financial data standards that represent government financial reporting and could be freely leveraged to support the FDTA. The Annual Comprehensive Financial Report (ACFR) Taxonomy today represents general purpose governments, as well as some special districts, and can be expanded upon to address all types of governments that issue debt securities. CLOSUP has also conducted pilots with local entities including the City of Flint.

Attend this 60-minute session to explore government data standards, find out how governments can create their own machine-readable financial statements, and discover what impact this legislation could have on government entities. Most importantly, discover how machine-readable data standards can benefit state and local government entities by reducing costs and increasing access to time-sensitive information for policy making.

Presenters:

  • Marc Joffe, Public Policy Analyst, Public Sector Credit
  • Stephanie Leiser, Fiscal Health Project Lead, Center for Local, State and Urban Policy (CLOSUP), University of Michigan’s Ford School of Public Policy
  • Campbell Pryde, President and CEO, XBRL US
  • Robert Widigan, Chief Financial Officer, City of Flint

Publication Site: XBRL.us

The most common restaurant cuisine in every state, and a chain-restaurant mystery

Link: https://www.washingtonpost.com/business/2022/09/29/chain-restaurant-capitals/

Graphic:

Excerpt:

The places that drive the most tend to have the same high share of chain restaurants regardless of whether they voted for Trump or Biden. As car commuting decreases, chain restaurants decrease at roughly the same rate, no matter which candidate most residents supported.

If the link between cars and chains transcends partisanship, why does it look like Trump counties have more chain restaurants? It’s at least in part because he won more of the places with the most car commuters!

About 83 percent of workers commute by car nationally, but only 80 percent of folks in Biden counties do so, compared with 90 percent of workers in Trump counties. The share of car commuters ranges from 55 percent in the deep-blue New York City metro area to 96 percent around bright red Decatur, Ala.
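
The arithmetic behind that argument is easy to reproduce. A toy sketch with invented counties (not the Post's data) shows how an identical chain-share vs. car-commuting relationship in both groups still yields a higher average for the group with more car commuters:

```typescript
// Invented example: chain-restaurant share depends only on car-commute share,
// identically for both groups, yet the group averages differ because one group
// has more heavy-car-commuting counties.
const chainShare = (carCommuteShare: number) => 0.1 + 0.4 * carCommuteShare;

const bidenCounties = [0.55, 0.75, 0.85]; // car-commute shares
const trumpCounties = [0.85, 0.90, 0.96];

const avg = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;

console.log("Biden-county avg chain share:", avg(bidenCounties.map(chainShare)).toFixed(3));
console.log("Trump-county avg chain share:", avg(trumpCounties.map(chainShare)).toFixed(3));
// Same relationship in both groups; the gap comes entirely from car commuting.
```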

Author(s): Andrew Van Dam

Publication Date: 1 Oct 2022

Publication Site: WaPo

The amazing power of “machine eyes”

Link: https://erictopol.substack.com/p/the-amazing-power-of-machine-eyes

Graphic:

Excerpt:

Today’s report on AI of retinal vessel images to help predict the risk of heart attack and stroke, from over 65,000 UK Biobank participants, reinforces a growing body of evidence that deep neural networks can be trained to “interpret” medical images far beyond what was anticipated. Add that finding to last week’s multinational study of deep learning of retinal photos to detect Alzheimer’s disease with good accuracy. In this post I am going to briefly review what has already been gleaned from 2 classic medical images—the retina and the electrocardiogram (ECG)—as representative for the exciting capability of machine vision to “see” well beyond human limits. Obviously, machines aren’t really seeing or interpreting and don’t have eyes in the human sense, but they sure can be trained from hundreds of thousands (or millions) of images to come up with outputs that are extraordinary. I hope when you’ve read this you’ll agree this is a particularly striking advance, which has not yet been actualized in medical practice, but has enormous potential.

Author(s): Eric Topol

Publication Date: 4 Oct 2022

Publication Site: Eric Topol’s substack, Ground Truths