P3D6: Ethics in modeling and scraping
_Remember that we will be spending 2-3 more weeks on this data. We start Python next week._
The Statistical Crisis in Science
Andrew Gelman and Eric Loken help us understand The Statistical Crisis in Science by examining p-values in the context of forking paths.
Questions
- What do p-values represent in scientific literature?
- What is the technical definition? Not even scientists can easily explain p-values.
- What do they mean by ‘forking paths’?
- What do they mean by ‘many degrees of freedom’?
- How does this paper apply to data science?
Quotes
Fisher offered the idea of p-values as a means of protecting researchers from declaring truth based on patterns in noise. In an ironic twist, p-values are now often used to lend credence to noisy claims based on small samples.
Our main point in the present article is that it is possible to have multiple potential comparisons (that is, a data analysis whose details are highly contingent on data, invalidating published p-values) without the researcher performing any conscious procedure of fishing through the data or explicitly examining multiple comparisons.
The problem resides in the one-to-many mapping from scientific to statistical hypotheses.
many degrees of freedom remain in their specific decisions: how strictly to set the criteria regarding the age of the women included, the hues considered as “red or shades of red,” the exact window of days to be considered high risk for conception, choices of potential interactions to examine, whether to combine or contrast results from different groups, and so on.
In this garden of forking paths, whatever route you take seems predetermined, but that’s because the choices are done implicitly. The researchers are not trying multiple tests to see which has the best p-value; rather, they are using their scientific common sense to formulate their hypotheses in a reasonable way, given the data they have. The mistake is in thinking that, if the particular path that was chosen yields statistical significance, this is strong evidence in favor of the hypothesis.
Our contribution is simply to note that because the justification for p-values lies in what would have happened across multiple data sets, it is relevant to consider whether any choices in analysis and interpretation are data dependent and would have been different given other possible data.
If necessary, one must step back to a sharper distinction between exploratory and confirmatory data analysis, recognizing the benefits and limitations of each.
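To make the forking-paths idea concrete, here is a minimal simulation sketch. It is not from the Gelman and Loken article; the setup is an assumption made for illustration. Every simulated study is pure noise (no true effect), and the analyst has three reasonable-looking comparisons available: everyone, subgroup A only, or subgroup B only. Taking the smallest of the three p-values is a crude stand-in for "whichever path would have looked sensible given the data", and the z-test assumes a known standard deviation of 1 so that only the standard library is needed. The function names and sample sizes are hypothetical.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for the z-test


def z_test_p(a, b):
    """Two-sided p-value for a difference in means, assuming known sd = 1."""
    se = (1 / len(a) + 1 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


def path_p_values(rng, n=40):
    """Simulate one study on pure noise and return a p-value per analysis path."""
    treated = [rng.gauss(0, 1) for _ in range(n)]
    control = [rng.gauss(0, 1) for _ in range(n)]
    half = n // 2
    return [
        z_test_p(treated, control),                # path 1: everyone
        z_test_p(treated[:half], control[:half]),  # path 2: subgroup A only
        z_test_p(treated[half:], control[half:]),  # path 3: subgroup B only
    ]


def false_positive_rates(n_studies=2000, alpha=0.05, seed=1):
    """Compare a path fixed in advance with a path picked after seeing the data."""
    rng = random.Random(seed)
    fixed = forked = 0
    for _ in range(n_studies):
        ps = path_p_values(rng)
        fixed += ps[0] < alpha     # analysis chosen before looking at the data
        forked += min(ps) < alpha  # analysis that happens to look best
    return fixed / n_studies, forked / n_studies


fixed_rate, forked_rate = false_positive_rates()
print(f"fixed path:    {fixed_rate:.3f}")
print(f"forking paths: {forked_rate:.3f}")
```

The fixed path should produce "significant" results at roughly the nominal 5% rate, while the forking version should come out noticeably higher even though no single study ran more than a handful of tests. Real forking paths are subtler than taking a minimum, so treat the gap as an illustration of the direction of the problem rather than an estimate of its size.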
On the Ethics of Web Scraping and Data Journalism
Many have thoughts on the ethics of web scraping and data journalism. Lam Thuy Vo argues that web scraping is a tool, not a crime. What ethical boundary have you drawn?
Questions
- What boundary would you define for web scraping?
Quotes
If a regular user can’t access it, [programmers] shouldn’t try to get it.
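One way to turn a boundary like this into something a scraper can check is to read a site's robots.txt before requesting a page. The sketch below uses Python's standard `urllib.robotparser`. The robots.txt text, the `example.com` URLs, the `course-scraper` user agent, and the helper names are all hypothetical, and the rules are parsed from a string so the example runs without touching the network. A robots.txt file states the site owner's preferences for automated access; it does not by itself settle the ethical question raised in the quote.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an imaginary site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def may_fetch(parser, url, user_agent="course-scraper"):
    """Return True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
for url in (
    "https://example.com/public/page.html",
    "https://example.com/private/data.csv",
):
    print(url, may_fetch(parser, url))
```

Against a real site you would point the parser at the live file with `set_url(...)` and `read()` instead of parsing a string.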