Speaking to Al Jazeera English for a piece entitled “The Power and Politics of Data Visualisation,” three contributors looked at how data is often presented as objective truth, even though the way it is presented, interpreted, and contextualized can distort its original purpose. Turning data into graphics people can understand is increasingly important, but viewers should also be better informed and more careful in recognizing the nature of uncertainty in these visualizations.
The piece looks at how important it is to be able to trust the data, yet it’s equally important that viewers understand that a visualization can be shaped by human decisions about how the data are collected, interpreted, and depicted. Dr. Cairo says, “Data visualizations are some of the best tools that we have to understand the world if we use them well and we interpret them well, but that doesn’t mean that those numbers are the whole story. We also need to use logic and scientific reasoning.”
Publication Date: 26 February 2021
Publication Site: Institute for Data Science & Computing at University of Miami
Importantly, a working definition of data science narrows the scope of research. Instead of considering all possible types of data analysis that one may wish to conduct, we look closely at the types of analyses data scientists carry out. This distinction matters because the specific steps that, say, an experimental physicist takes to analyze data differ, despite some commonalities, from the analytic steps a data scientist may take. This leads to an important follow-on question: what exactly is data science work?
There have been several industry standards for breaking down data science work. The first was the KDD (Knowledge Discovery in Databases) method, which over time was modified and expanded upon by others. From these derivations, as well as studies that interview data scientists, we created a framework that has four higher-order processes (preparation, analysis, deployment, and communication) and 14 lower-order processes. Using a red stroke outline, we also highlighted the specific areas where data visualization already plays a prominent role in data science work. In our research article we provide detailed definitions and examples of these processes.
Administering Covid-19 vaccines comes with a valuable perk for retail pharmacies: access to troves of consumer data.
Chains such as CVS Health Corp., Walmart Inc. and Walgreens Boots Alliance, Inc. are collecting data from millions of customers as they sign up for shots, enrolling them in patient systems and having recipients register customer profiles.
The retailers say they are using the information to promote their stores and services, tailor marketing and keep in touch with consumers. The companies also say the information is critical in streamlining vaccinations and improving record-keeping, while ensuring only qualified people are receiving shots.
Nomograms are a trending term in evidence-based medicine, and COVID-19 research is no exception. In this context, a nomogram is usually a web-based tool, a graphical interface, or an online calculator in which patient data on several variables are entered as input, and a single summary statistic is calculated as output, such as the likelihood of successful response to treatment. Many medical researchers and data scientists have put forward nomograms derived from multivariate clinical progression models to assist in decisions about COVID-19 triage.
Is this enthusiasm for reducing complex clinical decisions to the use of multivariate calculators a leap forward in personalized medicine, enabled by modern computing? There is a sketchy “black box” side to all this, to say nothing of the risk of incorporating statistical design errors or untenable inferential claims into a nomogram being rolled out for immediate, untested use in the middle of a pandemic. So let’s treat the history of the “number needed to treat” as a “teachable moment” in the history of nomograms in medicine. What have we learned so far?
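Mechanically, most such nomograms reduce to a fitted multivariable model evaluated at a patient's inputs. Here is a minimal sketch, assuming a toy logistic model whose variables and coefficients are invented for illustration and are not from any validated COVID-19 study:

```python
import math

# Hypothetical coefficients for illustration only -- not drawn from
# any real clinical model.
COEFS = {"intercept": -4.0, "age_per_year": 0.05, "crp_per_mg_l": 0.02}

def nomogram_risk(age: float, crp: float) -> float:
    """Return a predicted probability from a toy logistic 'nomogram'."""
    linear = (COEFS["intercept"]
              + COEFS["age_per_year"] * age
              + COEFS["crp_per_mg_l"] * crp)
    return 1.0 / (1.0 + math.exp(-linear))

print(round(nomogram_risk(70, 50), 3))  # ~0.62 with these toy coefficients
```

The "black box" concern in the excerpt is that a user of the web calculator sees only the final probability, not the coefficients, the sampling design behind them, or their uncertainty.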
Every day for almost a year, hundreds of COVID Tracking Project contributors from all walks of life have compiled, published, and interpreted vitally important COVID-19 data as a service to their fellow Americans. On March 7, the one-year anniversary of our founding, we will release our final daily update and our data compilation will stop. Documentation, analysis, and archival work will continue for another two months, and we will bring the project to a close in May.
That we were able to carry the data through a full year is a testament to the generosity of the foundations and firms that gave us the resources we needed, to the counsel of our advisory board, to The Atlantic’s support for our highly unusual organization, and above all to the devotion of our contributors. But the work itself—compiling, cleaning, standardizing, and making sense of COVID-19 data from 56 individual states and territories—is properly the work of federal public health agencies. Not only because these efforts are a governmental responsibility—which they are—but because federal teams have access to far more comprehensive data than we do, and can mandate compliance with at least some standards and requirements. We were able to build good working relationships with public health departments in states governed by both Republicans and Democrats, and these relationships helped bring much more data into public view. But ultimately, the best we could hope to do with unstandardized state data was to build a bridge over the data gaps—and the good news is that we believe we can now see the other side.
Our tracker uses data from a number of statistical bureaus, government departments and academic projects. For many of the countries, we have imported total_deaths from the Human Mortality Database, which collates detailed weekly breakdowns from official sources around the world. For other countries, you can find a full list of sources and links in a file called list_of_sources.csv, as well as spreadsheets in the /source-data/ folder.
Numerous observational studies have attempted to identify risk factors for infection with SARS-CoV-2 and COVID-19 disease outcomes. Studies have used datasets sampled from patients admitted to hospital, people tested for active infection, or people who volunteered to participate. Here, we highlight the challenge of interpreting observational evidence from such non-representative samples. Collider bias can induce associations between two or more variables that affect the likelihood of an individual being sampled, distorting associations between these variables in the sample. Analysing UK Biobank data, we found that, compared with the wider cohort, the participants tested for COVID-19 were highly selected for a range of genetic, behavioural, cardiovascular, demographic, and anthropometric traits. We discuss the mechanisms inducing these problems, and approaches that could help mitigate them. While collider bias should be explored in existing studies, the optimal way to mitigate the problem is to use appropriate sampling strategies at the study design stage.
Author(s): Gareth J. Griffith, Tim T. Morris, Matthew J. Tudball, Annie Herbert, Giulia Mancano, Lindsey Pike, Gemma C. Sharp, Jonathan Sterne, Tom M. Palmer, George Davey Smith, Kate Tilling, Luisa Zuccolo, Neil M. Davies & Gibran Hemani
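Collider bias is easy to reproduce in simulation. In this sketch (all variable names and probabilities invented for illustration), a behavioural trait and infection are independent in the population, but both raise the chance of being tested; restricting the analysis to tested people then manufactures a spurious negative association:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Two traits that are independent in the full population.
smoker = rng.random(n) < 0.3
infected = rng.random(n) < 0.1

# Being tested (the collider) depends on both traits.
p_tested = 0.05 + 0.30 * smoker + 0.50 * infected
tested = rng.random(n) < p_tested

def infection_rate(mask):
    return infected[mask].mean()

# In the full population there is no true association.
pop_diff = infection_rate(smoker) - infection_rate(~smoker)

# Conditioning on the collider (analysing only tested people)
# induces a strong negative association.
sample_diff = (infection_rate(tested & smoker)
               - infection_rate(tested & ~smoker))

print(f"population difference:    {pop_diff:+.3f}")  # ~0
print(f"within-tested difference: {sample_diff:+.3f}")  # strongly negative
```

The mechanism is "explaining away": among tested people, knowing someone smokes makes infection a less necessary explanation for why they were tested, so the two traits become negatively correlated in the sample even though they are independent in the population.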
One problem may be the way we teach statistics to data scientists and public health professionals. Multivariable regression is often mistaken for a silver bullet that magically controls away confounding for all variables at once, as long as no confounder is left out. This is what statisticians call the “Table 2 fallacy,” because the adjusted effect sizes in a multivariable model are so often reported in Table 2. Many medical professionals learn to read research articles critically for understanding without ever having been introduced to the Table 2 fallacy.
Confounding is often taught as a purely mathematical concept, but that misses the point. Throwing a large set of variously interrelated variables into a big stepwise regression model might be expected to work, if all you know about confounding is that you should “never leave a confounder out” of your analysis.
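A small simulation (with an assumed linear causal structure and made-up coefficients) makes the Table 2 fallacy concrete: in a model of Y on X adjusted for Z, the coefficient on Z recovers only Z's direct effect, not the total effect a casual reader of Table 2 might assume it reports:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Assumed causal structure: Z -> X, Z -> Y, X -> Y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
y = 1.0 * x + 0.5 * z + rng.normal(size=n)

# Multivariable model Y ~ X + Z (ordinary least squares).
design = np.column_stack([np.ones(n), x, z])
coef_x, coef_z = np.linalg.lstsq(design, y, rcond=None)[0][1:]

# coef_z estimates only Z's *direct* effect (0.5). Z's *total*
# effect, including the path through X, is 0.5 + 0.8 * 1.0 = 1.3,
# which a simple regression of Y on Z alone would estimate.
total_z = np.polyfit(z, y, 1)[0]
print(f"adjusted Z coefficient: {coef_z:.2f}")  # ~0.5
print(f"total effect of Z:      {total_z:.2f}")  # ~1.3
```

Both numbers are "correct" answers to different causal questions; the fallacy lies in reading every adjusted coefficient in one model as if it answered the same question as the primary exposure's coefficient.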
Difficult trade-offs therefore need to be made, and this is where things can be deadly controversial—pun intended—when lives and livelihoods are involved, especially on a massive scale. As Leonelli asks, “What are the priorities underpinning alternative construals of ‘life with covid’? … Whose advice should be followed, whose interests should be most closely protected, which losses are acceptable and which are not?” Such questions clearly cannot (and should not) be answered by data science or data scientists alone, but the data science community has both the ability and responsibility to establish scientific and persuasive evidence to help to reach sustainable compromises that are critical for maintaining a healthy human ecosystem.
One antidote to this is true experimentation, in which treatment is randomly assigned within the homogeneous target population. Experimentation, particularly A/B testing, has become a mainstay of industry data science, so why does observational causal inference matter?
Some situations you cannot test due to ethics or reputational risk
Even when you can experiment, understanding observational causal inference can help you better identify biases and design your experiments
Testing can be expensive. There are direct costs (e.g. testing a marketing promotion) of instituting a policy that might not be effective, implementation costs (e.g. having a tech team implement a new display), and opportunity costs (e.g. holding out a control group and not applying what you hope to be a profitable strategy as broadly as possible)
Randomized experimentation is harder than it sounds! Sometimes experiments may not go as planned, but treating the results as observational data may help salvage some information value
Data collection can take time. We may want to read long-term endpoints like customer retention or attrition after many years. When we wish we could read out an experiment that wasn’t launched three years ago, historical observational data can help us get a preliminary answer sooner
It’s not either-or but both-and. Due to the financial and temporal costs of experimentation, causal inference can also be a tool to help us better prioritize what experiments are worth running
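As a concrete instance of observational causal inference, here is a minimal inverse-propensity-weighting sketch (all numbers hypothetical): a confounder drives both treatment uptake and the outcome, so the naive treated-vs-control difference is biased, while weighting each group by its known assignment probabilities recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Hypothetical confounder (e.g. prior customer engagement) that
# drives both treatment uptake and the outcome.
engaged = rng.random(n) < 0.5
p_treat = np.where(engaged, 0.8, 0.2)
treated = rng.random(n) < p_treat
# True treatment effect is +1.0; engagement adds +2.0 on its own.
outcome = 1.0 * treated + 2.0 * engaged + rng.normal(size=n)

# Naive comparison is confounded: treated customers are mostly
# engaged ones, so the difference is inflated.
naive = outcome[treated].mean() - outcome[~treated].mean()

# Inverse-propensity weighting, using the (here, known) probability
# of each unit landing in the group it is actually in.
p = np.where(treated, p_treat, 1 - p_treat)
w = 1.0 / p
ipw = (np.average(outcome[treated], weights=w[treated])
       - np.average(outcome[~treated], weights=w[~treated]))

print(f"naive difference: {naive:.2f}")  # biased upward (~2.2)
print(f"IPW estimate:     {ipw:.2f}")    # ~1.0, the true effect
```

In practice the propensities would themselves be estimated from observed covariates, and the whole exercise rests on the untestable assumption that no important confounder is unmeasured; that caveat is exactly why the excerpt treats experiments and observational inference as complements rather than substitutes.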