Using First Name Information to Improve Race and Ethnicity Classification




This paper uses a recent first name list to improve on a previous Bayesian classifier, the Bayesian Improved Surname Geocoding (BISG) method, which combines surname and geography information to impute missing race and ethnicity. The proposed approach is validated using a large mortgage lending dataset in which race and ethnicity are reported. The new approach, Bayesian Improved First Name Surname Geocoding (BIFSG), results in improvements in accuracy and in coverage over BISG for all major ethno-racial categories. The largest improvements occur for non-Hispanic Blacks, a group for which the BISG performance is weakest. Additionally, when estimating disparities in mortgage pricing and underwriting among ethno-racial groups with regression models, the disparity estimates based on either BIFSG or BISG proxies are remarkably close to those based on actual race and ethnicity. Following evaluation, I demonstrate the application of BIFSG to the imputation of missing race and ethnicity in the Home Mortgage Disclosure Act (HMDA) data, and in the process, offer novel evidence that race and ethnicity are somewhat correlated with the incidence of missing race/ethnicity information.
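As a rough illustration of the kind of Bayesian update the abstract describes, the sketch below combines a surname-based prior with geography and, optionally, first name evidence. It assumes the three sources are conditionally independent given race/ethnicity; the function name, the category labels, and every probability are hypothetical placeholders, not values or code from the paper.

```python
def proxy_posterior(p_race_given_surname, p_geo_given_race, p_first_given_race=None):
    """Combine name and geography evidence with Bayes' rule.

    p_race_given_surname: dict mapping category -> P(category | surname)
    p_geo_given_race:     dict mapping category -> P(location | category)
    p_first_given_race:   optional dict mapping category -> P(first name | category)

    Assumes the inputs are conditionally independent given the category.
    """
    unnormalized = {}
    for category, prior in p_race_given_surname.items():
        weight = prior * p_geo_given_race[category]
        if p_first_given_race is not None:
            # Adding the first name term gives a BIFSG-style update.
            weight *= p_first_given_race[category]
        unnormalized[category] = weight

    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("All categories have zero weight; cannot normalize.")
    return {category: w / total for category, w in unnormalized.items()}


# Hypothetical inputs, chosen only to show the mechanics.
surname_prior = {"group_a": 0.60, "group_b": 0.30, "group_c": 0.10}
geo_likelihood = {"group_a": 0.002, "group_b": 0.006, "group_c": 0.001}
first_name_likelihood = {"group_a": 0.010, "group_b": 0.030, "group_c": 0.005}

print(proxy_posterior(surname_prior, geo_likelihood))
print(proxy_posterior(surname_prior, geo_likelihood, first_name_likelihood))
```

The output is a probability for each category rather than a single label, which is why such proxies can be used as weights in downstream disparity estimates.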


Ioan Voicu
Office of the Comptroller of the Currency (OCC)

Publication Date: February 22, 2016

Publication Site: SSRN

Suggested Citation:

Voicu, Ioan, Using First Name Information to Improve Race and Ethnicity Classification (February 22, 2016). Available at SSRN.



You can’t compare results from Bayesian and frequentist methods because the results are different kinds of things. Results from frequentist methods are generally a point estimate, a confidence interval, and/or a p-value.


In contrast, the result from Bayesian methods is a posterior distribution, which is a different kind of thing from a point estimate, an interval, or a probability. It doesn’t make any sense to say that a distribution is “the same as” or “close to” a point estimate because there is no meaningful way to compute a distance between those things. It makes as much sense as comparing 1 second and 1 meter.
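To make the difference in kind concrete, here is a minimal sketch for a simple proportion. The data (7 successes in 10 trials), the uniform prior, and the grid approximation are all assumptions chosen for illustration; none of them come from the post. The frequentist result is a single number, while the Bayesian result is a probability for every candidate value on the grid.

```python
def point_estimate(successes, trials):
    """Frequentist-style point estimate of a proportion: one number."""
    return successes / trials


def grid_posterior(successes, trials, grid_size=101):
    """Grid approximation of the posterior for a proportion.

    Uses a uniform prior, so the posterior is proportional to the
    binomial likelihood. Returns the grid and a probability per grid point.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    unnormalized = [
        p ** successes * (1 - p) ** (trials - successes) for p in grid
    ]
    total = sum(unnormalized)
    return grid, [w / total for w in unnormalized]


estimate = point_estimate(7, 10)      # a single float
grid, posterior = grid_posterior(7, 10)  # 101 probabilities, one per candidate value

print(type(estimate).__name__, estimate)
print(type(posterior).__name__, len(posterior))
```

A posterior can always be reduced to a summary such as its mean or most probable value, but doing so discards the rest of the distribution.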

Author(s): Allen Downey

Publication Date: 25 April 2021

Publication Site: Probably Overthinking It

Covid, false positives and conditional probabilities…



In courtrooms, mixing up the probability of “A given B” with “B given A” is known as the “prosecutor’s fallacy”. In 1999, a court convicted Sally Clark of the murder of her two sons, in part because a medical expert claimed the chance of two accidental cot deaths was one in 73m. Even if this number was right – which it isn’t – it did not reflect the chance she was innocent. A double murder was also very rare: the relative likelihood of the two explanations was key and with new evidence and better statistical reasoning, an appeal court quashed the conviction.
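The point about relative likelihood can be sketched in a few lines. The calculation below assumes there are only two competing explanations and that the observed evidence is equally certain under both, so the answer depends only on how rare each explanation is. The one-in-73m figure is the contested number quoted above; the double-murder rate is a made-up placeholder, not an estimate from any source.

```python
def probability_of_first_explanation(rate_first, rate_second):
    """P(first explanation | evidence) when exactly two explanations compete.

    Both rates are prior probabilities of each explanation occurring.
    Assumes the evidence is certain under either explanation.
    """
    return rate_first / (rate_first + rate_second)


# Contested figure quoted in the article (used here only for illustration).
rate_double_accident = 1 / 73_000_000
# Hypothetical placeholder: NOT a real statistic.
rate_double_murder = 1 / 500_000_000

p_accident = probability_of_first_explanation(rate_double_accident, rate_double_murder)
print(f"P(accident | two deaths) under these assumptions: {p_accident:.3f}")
```

Even when an event is extremely rare, it can still be the more probable explanation if the alternative is rarer still.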

There was controversy after a recent Observer headline referred to Bayes’s theorem as “obscure”. His ideas may be little known by the public, but they are growing among scientists. Many complex analyses done during the pandemic have been “Bayesian”, including modelling lockdown effects, the ONS infection survey, and Pfizer-BioNTech’s vaccine trial. The term “credible interval”, rather than “confidence interval”, is the giveaway.

Last week, Cass Business School announced the renaming of its institution after Bayes and his theorem. The obscure tomb in nearby Bunhill Fields is worth a visit.

Author(s): David Spiegelhalter, Anthony Masters

Publication Date: 25 April 2021

Publication Site: The Guardian

The obscure maths theorem that governs the reliability of Covid testing



This is important to know when thinking about “lateral flow tests” (LFTs), the rapid Covid tests that the government has made available to everyone in England, free, up to twice a week. The idea is that in time they could be used to give people permission to go into crowded social spaces – pubs, theatres – and be more confident that they do not have, and so will not spread, the disease. They’ve been used in secondary schools for some time now.

There are concerns over LFTs. One is whether they’ll miss a large number of cases, because they’re less sensitive than the slower but more precise polymerase chain reaction (PCR) test. Those concerns are understandable, although defenders of the test say that PCR testing is too sensitive, able to detect viral material in people who had the disease weeks ago, while LFTs should, in theory, only detect people who are infectious.

But another concern is that they will tell people that they do have the disease when in fact they don’t – that they will return false positives.
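How much a positive result should be trusted depends on how common the disease is, and Bayes' theorem gives the calculation. The sketch below computes the probability of actually being infected given a positive test. The prevalence, sensitivity, and specificity values are hypothetical placeholders chosen for illustration, not measured properties of any lateral flow test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(infected | positive test) via Bayes' theorem.

    prevalence:  P(infected) in the tested population
    sensitivity: P(positive | infected)
    specificity: P(negative | not infected)
    """
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    if true_positives + false_positives == 0:
        raise ValueError("The test never returns a positive under these inputs.")
    return true_positives / (true_positives + false_positives)


# Hypothetical test characteristics, held fixed while prevalence varies.
for prevalence in (0.0005, 0.01, 0.10):
    ppv = positive_predictive_value(prevalence, sensitivity=0.8, specificity=0.999)
    print(f"prevalence={prevalence:.4f}  P(infected | positive)={ppv:.3f}")
```

With the same test, the share of positives that are false rises as the disease becomes rarer in the population being tested.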

Author(s): Tom Chivers

Publication Date: 18 April 2021

Publication Site: The Guardian