Monday, March 15, 2021

Socially Integrated Computing: A Vision for Computer Science

What is Computer Science?

Computer science as a field emerged in the mid-20th century, with the theoretical models of Turing, Church, and others, and the first general-purpose electronic computers. As ideas of computational organization and architecture were developed and notions of (hardware and software) layers of abstraction evolved, the field burgeoned and grew in many directions, spawning many subfields (architecture, programming languages, algorithmic design, complexity theory, and so on). Overall, though, the fundamental object of study in the field is the computational system, which can be thought of as a kind of machine inside a box: we feed inputs into the box, and receive outputs from the box (computed by the machine).


The evolution of the field can be understood as a progressively deeper understanding of, and a growing ability to engineer, such computational systems. This growth has proceeded along three interrelated dimensions: power, trustworthiness, and reach.


Power is the ability to perform larger and more difficult computations. Faster and larger computing machines increase power, as does the development of faster algorithms. Complexity bounds enable us to better understand the limits of computational power, and programming language structures enable effective construction of more powerful software systems.


Trustworthiness refers to the extent to which we can ensure that our systems are reliable and trusted by users at the task(s) they are designed and deployed to perform. Cybersecurity measures work to ensure trustworthiness in the face of external threats; program verification and software engineering techniques work to ensure it against (inevitable) human error. Consideration of human factors is also essential for computational systems to be trusted by users who are not privy to the system’s internal workings or design process.


Reach is about how widely computational systems touch and influence human activities. The earliest electronic computers had limited reach, serving a limited number of needs of government and large businesses, while today we carry powerful hand-held computers serving a great variety of personal and social needs, and the growth of IoT promises to increase computing’s reach even more.


Growth in each of these dimensions requires, influences, and constrains growth in the others. Increasing power or reach opens up new gaps in trustworthiness that need to be addressed, while increasing trustworthiness and reach tends to require more computational power.


Current Trends

Through this tripartite lens we can understand the implications of current trends in computing, as well as the relationships among these developments.

  • Recent successes in machine learning with big data have increased the reach of computing systems into new segments of society and the economy, thereby also highlighting new aspects of trust that need to be addressed, such as achieving fairness and dealing with bias in analytical and decision-support systems.
  • Edge computing increases computational power by redistributing computation, concomitantly enabling greater reach, and thus also requiring deeper attention to questions of trust in cybersecurity.
  • The complementary tech areas of the Internet of Things and intelligent spaces also greatly increase reach into our lives and daily activities, raising questions of trust related both to cybersecurity and to how understandable and predictable their responses are to us.
  • Similarly, digital manufacturing increases the reach of computing into the physical realm and raises new questions of trust in cybersecurity and complex logistics management systems that need to be solved, as does digital twinning.
  • Development of more realistic virtual and augmented reality systems promises a radical increase in computing’s reach, and profound social impact. This has been enabled by increases in computing power, and reaching full realism (however defined) will require even more power (cf. edge computing). It will also raise new questions of trust, particularly around privacy, as well as about the effects of trust breaches when computing is tightly integrated with everyday activities.
  • Quantum computing promises greater power, and specifically power which undermines public-key cryptography, a main pillar of current computational trustworthiness.
  • On the other hand, blockchain and other distributed ledger technologies enable the creation of distributed systems that can be trusted even if individual actors using the system are not (zero-trust information security), and that are robust to losses on the network (due to distribution). This uses the power of the network to create systems that increase trust, thus enabling increased computational reach (via new financial and other applications).


The Next Step

Where, then, is the edge of the field, where innovative research will move it forward?


At the macro level, we see a great deal of possibility in all three main themes. Computational power has reached the limits of Moore’s Law, and so is ripe for a deep conceptual shift, whether through quantum computing, parallelism, non-von Neumann architectures, or some idea yet to be discovered. Computational trust is strained (at best) by increases in the power and reach of computing: cybersecurity is an eternal arms race, catalyzed further by demands for efficiency, and increased use of automation (whether AI or otherwise) raises new and evolving questions of algorithmic trust. Computational reach has increased enormously through the ubiquity of computing devices, both personal and IoT, which spread computation to new aspects of our societies, economies, and lives.


In fact, if we consider the effects of many of the current trends identified above, we see one theme emerging overall: the growing intertwining of computational systems with human systems (individual, organizational, social). Personal devices (phones, watches, etc.), IoT, and intelligent spaces deeply connect our daily and minute-to-minute activities with computational adjuncts; as virtual/augmented reality becomes good enough for broad use, this integration will leap even further. In industry, digital twinning and digital manufacturing put computational models at the center of physical industry. And while information technology has a long history in finance, the way blockchain has enabled the creation of non-governmental currencies, and the way modern data analytics has increased the power and reach of high-frequency trading and complex financial instruments, have transformed the industry (and left regulation somewhat behind for now). Similarly, e-discovery is transforming staid law offices, computer vision is transforming food safety as well as law enforcement, and machine learning systems are being applied (rightly or wrongly) to a whole host of societal, political, and business problems.


Thus, the broad paradigm shift that is needed is a change in how we think about the object that we study in computer science. Rather than a metaphorical glass box containing just a computational mechanism, receiving inputs from and providing outputs to an external user, we need to expand the box we consider to include the user and their behavior, as part of a larger, more complex computational system, and even, beyond the user, their social/organizational context.


This system has enormously more degrees of freedom, and is not fully controllable by the computer scientist, but it cannot be ignored, due to the tight interconnections between the core computational system (hardware, algorithms, data structures) and the information flows and incentives induced in its human users. This view of the proper object of study of our field we may term socially integrated computing, in that the human and social context (understood broadly to include all relevant human-human connections and interactions) is taken as integral to the computational system to be analyzed and designed.


There is, of course, much work in this vein already, in HCI, in social network analysis, in agent-based modeling, in computational economics and mechanism design, and so on. I believe, though, that these disparate types of research work can best be understood together as aspects of a shift in how we view the field as a whole — socially integrated computing — and that such a unifying view ought to transform it, and its effect on the world, for good.

Saturday, January 19, 2019

There is always bias (or, binary numbers are not the villain)

Twain Liu just wrote a piece on Quartz, entitled "Aristotle’s binary philosophies created today’s AI bias". This article is riddled with buzzword-laden arguments by feeling, such as summarizing "the very system on which all modern technology is built" as:
1 = true = rational = right = male
0 = false = emotional = left = female
The false dichotomies built into this asseveration rattle the brain. Indeed, the entire essay has this flavor, and it is not even false.

But since the piece has gotten a fair bit of attention, I feel the need to respond to its key claim. The entire argument rests on the dual assertion that the fact that computers use binary numbers (1s and 0s) as the basis for their operation is (a) based on Aristotle's (elitist, sexist) philosophy, and (b) the fundamental reason why algorithmic systems are biased. Hence, new computer systems not based on Aristotelian "binary" logic can be universal, unbiased pure goodness.

Well.

First off, the "computers are binary and essentially invented by Aristotle" claim is a load of argle-bargle and pure applesauce. (Clickbait headlines in the Atlantic notwithstanding.) When electronic computers were first being developed in the 40s and 50s, different systems were experimented with (including ternary (three-valued) logic), but binary was the most practical for a simple reason. With binary logic, you can represent a 0 by "voltage close to 0" and 1 by "voltage close to maximum". When you introduce more possible values, the system becomes more sensitive to noise, and hence less reliable. (There are other technical reasons for binary computing, and there are some other reasons to prefer ternary systems, but this is enough for my purposes.)
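To make the noise-margin point concrete, here is a small illustrative simulation of my own (a sketch, not how the early engineers analyzed it): with the same 0-to-1 volt swing and the same additive noise, packing three nominal levels into the range instead of two shrinks the margin between adjacent levels, so more symbols get decoded incorrectly. The voltages and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 100_000, 0.18   # number of symbols and noise std. dev. (volts); arbitrary

def error_rate(levels):
    """Send random symbols at the given nominal voltages, add Gaussian noise,
    decode each received voltage to the nearest nominal level, and report
    the fraction decoded incorrectly."""
    levels = np.asarray(levels)
    sent = rng.integers(len(levels), size=n)
    received = levels[sent] + rng.normal(0.0, sigma, size=n)
    decoded = np.abs(received[:, None] - levels[None, :]).argmin(axis=1)
    return (decoded != sent).mean()

print("binary  (0V, 1V):       ", error_rate([0.0, 1.0]))
print("ternary (0V, 0.5V, 1V): ", error_rate([0.0, 0.5, 1.0]))
```

With these assumed numbers the binary error rate is a small fraction of a percent, while the ternary rate is on the order of ten percent, which is the practical reason binary won out.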

Now, to bias. Binary numbers have nothing whatsoever to do with algorithmic bias. The binary number system does not limit you to using only 1 or 0 for values you need to represent (after all, you could not specify an address to Google Maps just as a 1 or 0, say). Indeed, you can represent as many different values as you like by stringing bits together. You can have as many categories of whatever you like as you like. Any computer scientist would recognize this aspect of the claim to be laughable.
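A trivial sketch of that point: n bits distinguish 2^n values, and you can always use more bits, so an arbitrary label (a street address, say) is just a longer bit string. The address below is made up for illustration.

```python
n_bits = 8
print(2 ** n_bits)   # 256 distinct categories fit in a single byte

# An arbitrary label is just a longer string of bits:
address = "123 Example St, Springfield"   # made-up address for illustration
bits = "".join(f"{byte:08b}" for byte in address.encode("utf-8"))
print(len(bits), "bits:", bits[:24], "...")
```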

Algorithmic bias is due to the simple fact that all decision systems have biases. (Indeed, it is impossible to learn anything from experience without some sort of bias.) No real system has perfect information, and any decision made on the basis of imperfect information is biased in some way. The question is not "Can we create unbiased algorithms?" but "Do we know what our algorithm's biases are?" and "Can we mitigate the ones we do not like?"

Utopian visions like Ms. Liu's, in which the right philosophy would let us build computer systems that are universal and unbiased, pure purveyors of algorithmic goodness, are false and actually dangerous. They promote the technocratic idea that there are unbiased algorithms out there, if we could just find them, and so keep our focus narrowly on algorithmic development.

However, bias is inevitable. The way to combat pernicious bias is through continuous monitoring to discover instances of problematic bias, and the exercise of good judgment to adjust systems (whether algorithms, training data, how the systems are used, etc.) so as to mitigate the bad effects while maintaining the good ones. The proper way to combat algorithmic bias (which some are working on) is to develop better ways of detecting and characterizing such bias, along with the societal institutions and incentives that enable dealing with such deleterious biases. (And this leads into questions of value systems and politics, which cannot be avoided in this arena. There is no royal road.)

Visions of simple solutions derived from proper thinking are seductive. But the necessary condition for developing and maintaining diversity-enhancing technologies will be, I'm afraid, eternal vigilance.

Monday, January 7, 2019

Open Letter to PSU VP of Research Mark McLellan

Peter Boghossian is an assistant professor of philosophy at Portland State University. He recently, with two non-academic colleagues, published an account of an effort they made to probe peer-review methods within certain fields of inquiry that they term "grievance studies". Briefly, they wrote academic articles based on fanciful theories and hypotheses, matching as well as possible the style of writing and argumentation in the fields they addressed, and managed to get several articles accepted at leading journals. After doing so, they published their account of their effort, revealing the deception. This, they argue, has implications regarding the reliability of peer-review in those fields and perhaps regarding the legitimacy of the fields' methods themselves. I express no opinion regarding their study or conclusions.

What I am writing about is the response of Boghossian's institution, which was to investigate him for research impropriety, and ultimately to determine that Boghossian's "efforts to conduct human subjects research at PSU without a submitted nor approved protocol is a clear violation of the policies of [his] employer."

Unless the facts are substantially different from what has been published, this case raises concerns about academic freedom and freedom of inquiry. It is debatable at best whether Boghossian's work required IRB review at all, and even if it did, the situation does not rise to the level of research malfeasance. If any readers have more information about the case, please let me know.

Below is the letter I wrote about the case this morning to Prof. Mark McLellan, Vice President for Research and Graduate Studies at Portland State University. Obviously, I speak only for myself, not for my institution.


Dear Prof. McLellan,

I have read with some concern of the investigation of Prof. Peter Boghossian and the finding that he engaged in unethical research practices. This is a serious charge and as such warrants proper due process and full consideration of all relevant facts and circumstances. For the reasons I will detail below, I believe this not to have been the case here, and I urge reconsideration of this case, for the sake not only of Prof. Boghossian, but also of the reputation of Portland State and the institution of the IRB.

I also note that, generally speaking, a first accusation or offense of this kind (lack of IRB review for research that did not result in proven tangible harm) results in a warning and a discussion with the faculty member, rather than an immediate determination that they have unambiguously violated ethical norms and university policy. Consider, for example, the very long time and repeated discoveries of egregious and intentional research malfeasance (far beyond anything that Prof. Boghossian is accused of) that were necessary before Dr. Wansink was finally censured by Cornell.

In the case of Prof. Boghossian, there are three essential questions whose answers would determine whether the project was subject to IRB review, and whether the project as conducted was unethical in any way.

First, was the project "research"? I believe the answer here is indeed "yes", since the project was undertaken to develop knowledge and disseminate it, in this case about the peer-review practices in certain fields of inquiry.

Second, did the project involve "human subjects"? Clearly, the fabricated research studies used as experimental probes did not. The reviewers of these articles, while part of the phenomenon under study (the "peer-review system") also were not human subjects per PSU's Human Subjects Research Review Committee Policy, which states:
A human subject is a living individual about whom an investigator obtains data, either from intervention or interaction with the individual, or through records which contain identifiable private information.
Since the peer reviewers were entirely anonymous and not identifiable, the investigators cannot be considered to have been obtaining data about them - no private information whatsoever was gathered, and the reviewers were performing their usual professional function. Thus they cannot have been considered human subjects by this definition, and the research was not subject to IRB review.

Furthermore, even if the project had been reviewed, it would have been exempt under 45 CFR 46.101, as "Research involving the use of educational tests (cognitive, diagnostic, aptitude, achievement), survey procedures, interview procedures or observation of public behavior" with anonymous subjects. As such, at worst Prof. Boghossian should be admonished to seek IRB review for such research in the future.

Third, there was the accusation of "fabricating data", which arose because one of the fabricated research articles contained made-up statistics about canine sexual activity. Clearly, since the article was not intended to remain a part of the research literature, but to be unmasked as false, there was no intent to deceive the research community. As such, this was not fabrication or falsification of research data or results.

Taken together, the facts seem clear that Prof. Boghossian's project never warranted IRB review at all, or if it did, would have been exempt. In any case, the only potential consequence should be a discussion with him regarding the importance of undergoing IRB review for future such projects. I urge that Portland State rescind its determination that he violated university policy, and restore his professional and academic standing within the university to the status quo ante.

I would be happy of course to discuss this matter further if it would be of use.

Sincerely,

Shlomo Engelson Argamon
Professor of Computer Science
Director, Master of Data Science
Illinois Institute of Technology

Tuesday, October 17, 2017

Calls to Jettison "Statistical Significance"

I recently wrote an opinion piece for American Scientist's Macroscope blog, arguing (inter alia) that the notion of "statistical significance" harms the epistemology of science, and should therefore be jettisoned. After I'd submitted the piece, I saw other articles, letters, and posts saying the same thing, arguing from a number of technical and sociological bases. I will gather here, to the best of my ability, a list of these articles as a resource for the community. If you know of others that I've missed, please add them in the comments.

Friday, July 28, 2017

Perverse Incentives from the New "p < 0.005" Proposal

Here is a short addendum to my post of yesterday, commenting on why I think that the recently mooted proposal to require p < 0.005 for "statistical significance" is misguided and likely problematic.

There is another problem with that (admittedly well-intentioned) proposal, which arises because widespread implementation of the proposal would create a perverse incentive that could easily degrade the quality of the scientific literature by much more than it improves it.

As the authors acknowledge, lowering the alpha needed to declare "significance" would require larger sample sizes, and so data collection would entail greater difficulty and expense. They estimate that samples would need to be about 70% larger; that may well hold for studies of a single variable, though research involving interaction effects would likely need an even greater increase in the amount of data. Regardless, the manpower and expense would increase notably. (The authors see this as a potential benefit, in that "considerable resources would be saved by not performing future studies based on false premises.")
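As a sanity check on that 70% figure, here is a back-of-the-envelope calculation under simplifying assumptions of my own (a two-sided z-test of a single effect at 80% power). The required sample size scales as (z_{1-alpha/2} + z_{power})^2, so the ratio between the two thresholds does not depend on the effect size.

```python
from scipy.stats import norm

def n_scale(alpha, power=0.80):
    """Sample-size scale factor for a two-sided z-test: n is proportional to
    (z_{1-alpha/2} + z_{power})^2 for a fixed effect size."""
    return (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2

ratio = n_scale(0.005) / n_scale(0.05)
print(f"required-n ratio (0.005 vs 0.05): {ratio:.2f}")   # about 1.70, i.e. roughly 70% more data
```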

However, this extra work and expense creates a perverse incentive for researchers. The more you pay (in time, effort, money, etc.), the more you want to recoup your costs by producing publishable research. The harder such results are to produce, the harder you will look for them, after you have sunk the costs of getting the data.

Now, this is not a bad thing, if there is no way to cheat unwittingly. (Let's ignore entirely the possibility of fraud.) But if it is still possible (as it is today) to unknowingly p-hack, traipse through the garden of forking paths, HARK, etc., raising the cost of gathering a dataset will inevitably lead to a rise in p-hacked, forking pathed, and HARKed results.

The very change that was meant to improve the quality and integrity of the research literature will rather act to degrade it.

It seems to me that this underscores the fundamental need to address how incentives interact with methodologies and standards. Otherwise we are, at best, spitting in the wind, and at worst, using gasoline to put out the fire.


Wednesday, July 26, 2017

New "p < 0.005" Standard Considered Harmful

Several days ago, an article preprint was posted on PsyArXiv (to appear in Nature Human Behaviour), arguing that a significant step towards solving the reproducibility crisis in many fields would be to reduce the standard significance threshold to require p < 0.005 (rather than the current standard of 0.05) for a result to be termed "statistically significant". Results with higher p values could only be termed "suggestive" under this proposal.

I, along with a number of other scientists, was recently interviewed for a brief article in Nature | News which clearly, and concisely, sets out arguments for and against this proposal as a way to address the reproducibility crisis. I'd like to expand here on my very compressed comments as quoted in that article.

On the one hand, the proposal to reduce the "significance threshold" to 0.005 will indeed be helpful, in that the change will reduce the occurrence of false positives, and thus enhance the likelihood that effects termed "significant" will be reproducible, and so reduce the number of non-existent effects reported in the literature as real. 

However, the proposal does nothing whatsoever to change the fundamental problem, which is the entire framework of judging experimental results by a binary criterion of "significance". There are two key problems with the regnant view of "statistical significance": 
  1. It does not account for effect size - a large effect with less-significant p may have much more "real-life significance" than a tiny effect with more-significant p, and
  2. It is very easy to (even inadvertently) "p-hack" and find putatively significant effects due to the very large number of possible parameter settings for any experiment in a sufficiently rich and theoretically-underdetermined domain. 
Given that, this new recommendation may even do significant harm, in that improving research outcomes somewhat (by lowering false positive rates) without addressing the fundamental problem could lead to a false sense of security and entrench bad methodological attitudes. 

To expand on this, the problem I have with the new recommendation is that it maintains the exact same conceptual framework of statistical significance testing that has been a major driver of the replicability crisis. 

In statistical significance testing, a significance threshold (previously 0.05, per the new proposal 0.005) is defined, and then the question asked is: assuming the null hypothesis is true, is the probability of obtaining results at least as extreme as those observed (the p-value) less than that threshold? If it is, the null hypothesis is rejected, and the result is declared statistically significant.

This framework does not account at all for the question of effect size - an effect may be statistically significant, but practically speaking negligible and irrelevant. For example, often the null hypothesis is of the form "x = 0" for some model parameter x. Any such null hypothesis is almost certainly false, since the actual value of x is probably at least slightly different from 0, and so some statistically significant result can nearly always be extracted with enough data and ingenuity, even though the result may not be significant in any meaningful way in the real world.
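Here is a toy illustration of that point (the numbers are arbitrary choices of mine): with two million observations, a true mean of 0.01 against a null of 0 is overwhelmingly "significant" even at the 0.005 threshold, yet the standardized effect size is negligible.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(loc=0.01, scale=1.0, size=2_000_000)   # tiny true effect, huge sample

t, p = stats.ttest_1samp(x, popmean=0.0)               # test H0: mean = 0
cohens_d = x.mean() / x.std(ddof=1)                    # standardized effect size
print(f"p = {p:.1e}, Cohen's d = {cohens_d:.3f}")      # p is astronomically small, d is about 0.01
```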

More fundamentally, no matter what confidence level you choose, if there are enough different ways to design your experiment, it becomes highly likely that at least one of them will give a statistically significant result just by chance. And unless every single detail of the experiment is decided and registered ahead of time, it is virtually impossible to ensure, except by care and good judgement by the researchers, that no choices are made along the way that could (unwittingly) enhance the significance of the final experimental results, since there are so many different choices. This phenomenon was termed the "garden of forking paths" by Gelman and Loken, and illustrated nicely by XKCD. (This problem is inherent in any metric-based decision criterion for identifying effects, and is not specific to statistical significance testing.)
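The multiple-comparisons core of this problem is easy to quantify in a simplified setting (this sketch assumes k fully independent analysis choices, which real forking paths need not be): given k independent chances at a threshold alpha on pure noise, the probability of at least one "significant" result is 1 - (1 - alpha)^k.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, k, trials = 0.05, 20, 10_000   # threshold, analysis paths, simulated studies

# Analytic probability of at least one false positive across k independent null tests
print("analytic: ", 1 - (1 - alpha) ** k)   # about 0.64

# Simulation: each study gets k null "results", each significant with probability alpha
any_hit = (rng.random((trials, k)) < alpha).any(axis=1).mean()
print("simulated:", any_hit)
```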

Finally, the goal, and the strength, of the "p < 0.005" proposal is that it will reduce the number of false-positive results in the literature. However, by the same token, there will be fewer reports of "less statistically significant" results, which will cause overall problems for science, due to the "file-drawer" effect: the vast majority of negative results are never reported, which gives a biased picture of the state of knowledge. By lowering the p-threshold, the file-drawer bias will become even greater, since fewer results, only those with lower p-values, will be published.

So how should we proceed?

There is no simple answer to this question, as there are several interrelated problems: p-hacking, effect sizes, and literature bias, as I've mentioned above, but also the question of how the incentive structures of modern science may bias the creation and evaluation of new knowledge (the file-drawer problem is an example of this). All of these factors contribute to the "reproducibility crisis", and all need to be addressed. Any simple answer to one of these, such as changing the threshold p-value, will not solve the problem, and may in fact make things worse by making it seem solved when it is not.

Any meaningful solution must involve a combination of new methodological standards, institutional structures, and incentives. Reducing the p threshold is simple and could be one useful methodological standard, though it would perpetuate problematic concepts, as I've said. Another would be introducing some notions from Bayesian statistics, such as (i) using Bayes factors, which are more easily interpreted in most situations than a p-value, as a standard reported metric, or (ii) the use of explicit priors, to make assumptions clearer, or (iii) explicit loss functions, to account explicitly for the importance of effect size and the varying weights of different kinds of errors. 
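As an illustration of point (i), here is a minimal Bayes factor computation on made-up data (60 heads in 100 coin flips), comparing the point null H0: p = 0.5 against an alternative H1 with a uniform prior on p. For these counts the Bayes factor comes out close to 1, i.e. the data barely discriminate between the hypotheses, even though the corresponding two-sided p-value sits only a little above 0.05.

```python
import numpy as np
from math import comb
from scipy.special import betaln

heads, n = 60, 100                      # made-up data for illustration

# Marginal likelihood under H1 (p ~ Uniform(0,1)): C(n,k) * Beta(k+1, n-k+1)
log_m1 = np.log(comb(n, heads)) + betaln(heads + 1, n - heads + 1)
# Likelihood under the point null H0: p = 0.5
log_m0 = np.log(comb(n, heads)) + n * np.log(0.5)

print(f"BF10 = {np.exp(log_m1 - log_m0):.2f}")   # evidence for H1 relative to H0
```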

It is essential that the scientific community have good incentives for publishing replications, especially those that fail, and for publishing negative results.  (And there cannot be penalties for early-stage researchers for doing so!) There need to be more and better institutional structures for preregistering studies, like clinicaltrials.gov, but set up with the appropriate community standards and consequences, so that they actually work.

Just as during the early and laborious development of effective scientific practices and institutions several hundred years ago, there is no royal road to solving the crisis. A great deal of hard work lies ahead of us, but past experience has shown that it will (eventually) pay off in more solid understanding of our world.