Machine Psychology — Exploring A Paradox in Revealed Preferences
My modest goal in a working paper titled A Paradox in Machine Preference (pdf) is the invigoration of a languishing field: Thurstonian choice modeling. L. L. Thurstone formulated the Law of Comparative Judgment in 1927 and was a pioneer in psychometrics (others include Galton, Cattell, and Binet). His effort parallels one we need to undertake today to understand machine preferences.
The revealed preferences of machines, and people
In this paper I show that elicited token probabilities from BERT and RoBERTa suggest the superiority of Thurstone preference modeling over a stalwart in the literature: Luce’s Choice Axiom. That is a small start on the road to understanding how machines decide, and — to the extent that they merely reflect our own predilections — it is also a niche in observational psychology: a use of an imperfect but large new statistical mirror held up to humanity.
I taunt the field of machine psychometrics into existence with potentially poor execution, so that others may push back against my methodology and, in particular, my technique of studying revealed machine preferences. Flaws aside, the new laboratory provided by large language model interrogation is surely a rich one, and I used special-purpose language models such as BERT and RoBERTa, which make excellent companions in the game of “fill in the blanks”.
Thurstone models have languished in large part due to computational difficulties that were recently solved, at least in part, in a paper I authored titled Inferring Relative Ability From Winning Probability in Multi-Entrant Contests. Therein, I provided a means of rapidly calibrating contest models, including those involving hundreds of thousands of variables, as needed. I called it the horse race problem, and I sometimes refer to the solution as the Fast Ability Transform.
As also noted in that paper (pdf), the applications of contest models are widespread and include many forms of digital commerce, web search, and even over-the-counter trading. They can be used to model changes in market share when a competitor leaves the market. Another application is to Bayesian networks where inequality evidence is to be introduced (as with competitors to the Elo rating system, for example).
Some kinds of applications are very strongly motivated because it is clear that people, horses or firms are contesting something (as with bids for art or bonds, or luring customers with a superior product) and we may be able to measure, and update, the variables that determine who wins.
However, there are other applications where the performance variables are less well defined or can be introduced as a useful fiction. One example is the study of human choice and preference: a central area in economics and decision theory with goals that include prediction of combinatorial revealed preferences. Here it is arguably less compelling that choice is truly a contest in the same explicit sense — at least for “typical” performance distributions. (Though see my caveat in the paper about exponential contests conforming to Luce’s Choice Axiom).
Choice as a contest
Thurstone’s idea was to model choice probability as the likelihood of winning a contest when performance is stochastic. Item attractiveness is analogous to relative ability, such as a horse’s average time to complete a race.
To illustrate, let us suppose we have a green, a yellow and a blue item to choose from and this ranking reflects our ordered preference. If we furthermore know our own selection probabilities, with green being chosen 60% of the time, say, then we can arrange that these likelihoods are exactly matched by contest winning probabilities where performances are normally distributed. For example, let’s pretend that the value taken by the green auxiliary variable, whose density is shown in Figure 1, will be the lowest of the three on 60% of occasions.
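To make the construction concrete, here is a minimal Monte Carlo sketch of such a contest (with made-up locations, not figures from the paper) in which the lowest of three unit-variance normal draws wins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical locations for green, yellow, blue, chosen only for
# illustration (lower is better, mirroring race completion times).
locations = {"green": -0.6, "yellow": 0.0, "blue": 0.3}

n = 200_000
# Each item's "performance" is its location plus unit-variance noise:
# a Thurstone Type V contest with independent, equal-variance draws.
perf = {k: v + rng.standard_normal(n) for k, v in locations.items()}
samples = np.stack(list(perf.values()))   # shape (3, n)
winners = np.argmin(samples, axis=0)      # lowest performance wins

names = list(locations)
probs = {names[i]: float(np.mean(winners == i)) for i in range(3)}
print(probs)  # green, with the lowest location, should win most often
```

Tuning the three locations until green wins 60% of the time is exactly the calibration step described above.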
Choice as an urn
A different way to model choice is much simpler and requires almost no explanation. We assume that each item is represented by multiple balls in an urn. We draw one ball at random and, if the choice is red as shown, we remove all the red balls in order to make subsequent predictions regarding the preference for blue over green.
This is sometimes referred to as Luce’s Choice Axiom. The probability of selecting a particular item is proportional to its assigned utility or weight, and, crucially, this probability remains consistent regardless of the presence or absence of other items in the urn. This principle ensures that choices are made independently of irrelevant alternatives, aligning neatly with the analytically convenient assumptions made in many fields, including microeconomics, whenever combinatorial or conditional selection probabilities are called for.
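A minimal sketch of the urn, with illustrative weights of my own choosing: conditioning on a smaller choice set just renormalizes the remaining weights, so the odds between any two surviving items never change.

```python
# Luce's Choice Axiom as an urn: weights are ball counts, and removing
# an item simply renormalizes the weights of what remains.
weights = {"green": 6.0, "yellow": 3.0, "blue": 1.0}  # illustrative only

def luce_probs(weights, choice_set):
    """Selection probabilities over a choice set, proportional to weight."""
    total = sum(weights[k] for k in choice_set)
    return {k: weights[k] / total for k in choice_set}

full = luce_probs(weights, ["green", "yellow", "blue"])
# Take green's balls out of the urn; yellow-vs-blue odds are unchanged.
reduced = luce_probs(weights, ["yellow", "blue"])
print(full, reduced)
```

Independence of irrelevant alternatives is visible in the output: yellow remains three times as likely as blue whether or not green is available.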
Choosing between choice models
The two choice models described do not, of course, exhaust the possibilities, but it is not unreasonable, in my view, to regard them as natural candidates for the title of “most reasonable first approximation”. Neither has any degrees of freedom once we fix the single-item selection probabilities, so they make default, out-of-the-box predictions about any other preferences that might be revealed. It is a fair match race.
As an example of a falsifiable prediction, both models imply the probability of choosing two items from a collection. Unfortunately, almost nobody gives the Thurstone model a run because it ostensibly involves solving a system of non-linear equations looking like the following, where the a_i are locations to be solved for.
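For a Type V contest where the lowest normal draw wins, the standard system equates each target probability to an integral of that item's density against the survival functions of the others. As a hedged sketch of what solving it involves (brute-force quadrature plus a generic root-finder, not the Fast Ability Transform from the paper, and with made-up target probabilities):

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.optimize import fsolve
from scipy.stats import norm

# Hypothetical target choice probabilities (must sum to one).
p = np.array([0.6, 0.3, 0.1])

def win_probs(a):
    """P(item i draws the lowest value), X_i ~ Normal(a_i, 1) independent.

    Implements p_i = integral of phi(x - a_i) * prod_{j != i} P(X_j > x) dx.
    """
    x = np.linspace(-8.0, 8.0, 2001)
    out = np.empty(len(a))
    for i in range(len(a)):
        integrand = norm.pdf(x - a[i])
        for j in range(len(a)):
            if j != i:
                integrand = integrand * norm.sf(x - a[j])  # P(X_j > x)
        out[i] = trapezoid(integrand, x)
    return out

def residual(a_free):
    # Locations are identified only up to a shift, so pin a_0 = 0;
    # probabilities sum to one, so the first equation is redundant.
    a = np.concatenate([[0.0], a_free])
    return (win_probs(a) - p)[1:]

a_sol = np.concatenate([[0.0], fsolve(residual, np.zeros(len(p) - 1))])
print(a_sol)             # fitted locations
print(win_probs(a_sol))  # should reproduce p
```

This is fine for three items, but the cost of generic root-finding is what makes large-n calibration the real obstacle.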
Let’s set that difficulty aside and assume it has been solved (because it has been, for very large n). Instead, we turn only to the question of how the two theories might be empirically tested.
Here no experiment is perfect, but what I’ve done in the paper is use two different styles of masked prompts. Consider the following questions sent to BERT, where the answer returned will be a list of possible words to substitute for [MASK], together with their token probabilities.
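In the same spirit, and with hypothetical prompt wording rather than the paper's exact questions, such token probabilities can be elicited with the Hugging Face fill-mask pipeline:

```python
from transformers import pipeline

# A sketch of the elicitation style; the prompt wording here is
# hypothetical and not taken from the paper.
fill = pipeline("fill-mask", model="bert-base-uncased")

# The qualifier "Western" narrows the implicit choice set, so a choice
# model can try to predict the second answer from the first.
broad = fill("My favorite country is [MASK].", top_k=10)
narrow = fill("My favorite Western country is [MASK].", top_k=10)

for r in narrow[:5]:
    print(f"{r['token_str']:>12}  {r['score']:.4f}")
```

Each result is a dictionary with a candidate token and its probability, which is the raw material for the Luce and Thurstone comparisons below.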
Note that the qualifier Western reduces the choice set, creating an opportunity for the two different models of preference to infer token probabilities for the second question using token probabilities returned in answer to the first. The actual probabilities for the second question are shown in the fourth column of Table 1 below:
The Luce and Thurstonian columns present competing predictions.
And the winner is …
I’ll cut to the chase and say that, based on my preliminary experiments involving roughly a hundred question categories, “choice is a contest” beats “choice is an urn” far more often than the opposite.
This has led me to the heuristic that Thurstone Type V models (contests where variables have similar variances) are a better approximate mental model than the simpler Lucian alternative. Details are in the paper, so I won’t belabor them here, but you can get a sense of the dominance in Figure 2.
The Paradox. Do we need a new type of layer?
Now I cannot help but remark on the mild irony in this experimental result, given that both BERT and RoBERTa use Softmax functions to determine token probabilities.
Another way to say this is that they use explicit normalization, and this might seduce one into thinking that they are engineered to obey Luce’s Choice Axiom. After all, one could force a neural network to eliminate the i-th item from consideration by driving the corresponding input to zero (something we can’t do to humans in a graduate student experiment without involving the School of Medicine).
It is an elementary but possibly profound fact in this context that Softmax functions (not to mention logit models and most of statistics) are suggestive of Luce’s Choice Axiom. This is not, of course, to preclude the possibility of training models on different choice sets (and in classification problems that is usually what occurs).
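This is easy to verify directly: dropping an item from a Softmax output and renormalizing gives exactly the Softmax over the remaining logits, which is Luce's axiom in miniature. A minimal sketch with made-up logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # illustrative logits for three tokens
p_full = softmax(logits)

# "Take the balls out of the urn": drop item 0 and renormalize...
p_drop_renorm = p_full[1:] / p_full[1:].sum()
# ...which coincides exactly with the Softmax over the remaining logits.
p_drop_softmax = softmax(logits[1:])
print(p_drop_renorm, p_drop_softmax)
```

The two vectors agree to machine precision, because the ratio of any two Softmax probabilities depends only on their own logits.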
Yet, if you subscribe to my experimental finding, you should not kid yourself that any commonly used machine learning or statistical model can be extended in the most obvious way to reliably provide combinatorial or set-conditional probabilities. (By “obvious way” I refer to the repeated application of Luce’s Choice Axiom — taking balls out of the urn).
There are various parameterized hacks of Luce, if I may call them that, such as the use of power transforms or the Plackett-Luce model, but they are inelegant, physically unconvincing, and create a rather complicated mess once overlapping choice sets are introduced. As Walter Scott wrote, “Oh what a tangled web we weave …”
Now, of course, large language models do get things right, and neural networks are universal approximators, but this work occurs in a complex, opaque manner. It is surely true that the subtle flaw in the Luce-respecting Softmax functions, the one I have highlighted with this experiment, forces the rest of the network to do more work.
This leads me to wonder if a new type of neural network layer might actually be useful and, in the process, advance the set of tools we have at our disposal to partially explain large models or build surrogates for some part of what they do. Possibly it could utilize the “multiplicity calculus” I introduced in my previous paper, or someone might find a better way.
Until then, this little paradox might serve as a reminder in psychology. If even the Luce-engineered machines don’t want to obey Luce’s Choice Axiom, what makes us think the rest of us do?