# Bayesian updating of higher-order joint probabilities

Bayes’ rule uses a signal and the assumed joint probability distribution of signals and events to estimate the probability of an event of interest. Call this event a first-order event and the signal a first-order signal. Which joint probability distribution is the correct one is a second-order event, so second-order events are first-order probability distributions over first-order events and signals. The second-order signal consists of a first-order event and a first-order signal.

If a particular first-order joint probability distribution puts higher probability on the co-occurrence of this first-order event and signal than other first-order probability distributions do, then observing this event and signal increases the likelihood of this particular probability distribution. The size of the increase is found by applying Bayes’ rule to update second-order events using second-order signals, which requires assuming a joint probability distribution of second-order signals and events. This second-order distribution is over first-order joint distributions and first-order signal-event pairs.

The third-order distribution is over second-order distributions and signal-event pairs. A second-order signal-event pair is a third-order signal. A second-order distribution is a third-order event.

A joint distribution of any order n may be decomposed into a marginal distribution over events and a conditional distribution of signals given events, where both the signals and the events are of the same order n. The conditional distribution of any order n>=2 is known by definition, because the n-order event is the joint probability distribution of (n-1)-order signals and events. Thus the probability of an (n-1)-order signal-event pair (i.e., the n-order signal) given the n-order event (i.e., the (n-1)-order distribution) is the one listed in the (n-1)-order distribution.
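As a minimal sketch of the second-order update described above, suppose there are only two candidate first-order joint distributions. The candidate names, their probabilities and the 50-50 prior are made-up illustrations, not values from the text. The likelihood of the observed signal-event pair under each candidate is simply the entry that the candidate lists for that pair.

```python
def update_over_joints(prior, candidates, observed_pair):
    """Posterior over candidate first-order joint distributions.

    prior: dict name -> second-order marginal probability of the candidate
    candidates: dict name -> dict mapping (signal, event) -> probability
    observed_pair: the observed (signal, event), i.e. the second-order signal
    """
    # The conditional is known by definition: read it off the candidate itself.
    unnormalised = {
        name: prior[name] * candidates[name][observed_pair] for name in prior
    }
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("observed pair has zero probability under every candidate")
    return {name: weight / total for name, weight in unnormalised.items()}


# Hypothetical candidates: one where the signal is informative, one where it is not.
candidates = {
    "informative": {
        ("s1", "e1"): 0.4, ("s1", "e2"): 0.1,
        ("s2", "e1"): 0.1, ("s2", "e2"): 0.4,
    },
    "uninformative": {
        ("s1", "e1"): 0.25, ("s1", "e2"): 0.25,
        ("s2", "e1"): 0.25, ("s2", "e2"): 0.25,
    },
}
prior = {"informative": 0.5, "uninformative": 0.5}
posterior = update_over_joints(prior, candidates, ("s1", "e1"))
```

Observing the pair (s1, e1) shifts belief towards the candidate that puts more probability on that pair.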

The marginal distribution over events is an assumption above, but may be formulated as a new event of interest to be learned. The new signal in this case is the occurrence of the original event (not the marginal distribution). The empirical frequencies of the original events are a sufficient statistic for a sequence of new signals. To apply Bayes’ rule, a joint distribution over signals and the distributions of events needs to be assumed. The joint distribution itself may be learned from among many, over which there is a second-order joint distribution. Extending the Bayesian updating to higher orders proceeds as above. The joint distribution may again be decomposed into a conditional over signals and a marginal over events. The conditional is known by definition for all orders, now including the first, because the probability of a signal is the probability of occurrence of an original event, which is given by the marginal distribution (the new event) over the original events.

Returning to the discussion of learning the joint distributions, only the first-order events affect decisions, so only the marginal distribution over first-order events matters directly. The joint distributions of higher orders and the first-order conditional distribution only matter through their influence on updating the first-order marginal distribution.

The marginal of order n is the distribution over the (n-1)-order joint distributions. After reducing compound lotteries, the marginal of order n is the average of the (n-1)-order joint distributions. This average is itself a (n-1)-order joint distribution, which may be split into an (n-1)-order marginal and conditional, where if n-1>=2, the conditional is known. If the conditional is known, then the marginal may be again reduced as a compound lottery. Thus the hierarchy of marginal distributions of all orders collapses to the first-order joint distribution. This takes us back to the start – learning the joint distribution. The discussion above about learning a (second-order) marginal distribution (the first-order joint distribution) also applies. The empirical frequencies of signal-event pairs are the signals. Applying Bayes’ rule with some prior over joint distributions constitutes regularisation of the empirical frequencies to prevent overfitting to limited data.
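One way to see the regularising role of the prior is the following sketch. It assumes a symmetric Dirichlet prior over the cells of the joint distribution, which is my choice for illustration and is not specified in the text. The function name, the pseudo-count and the data are hypothetical.

```python
from collections import Counter


def posterior_mean_joint(pairs, support, pseudo_count=1.0):
    """Posterior mean of the joint distribution of (signal, event) pairs.

    Assumes a symmetric Dirichlet prior with `pseudo_count` per cell, so the
    empirical frequencies are shrunk towards the uniform distribution.
    """
    unknown = set(pairs) - set(support)
    if unknown:
        raise ValueError(f"pairs outside the support: {unknown}")
    counts = Counter(pairs)
    total = len(pairs) + pseudo_count * len(support)
    return {cell: (counts[cell] + pseudo_count) / total for cell in support}


support = [("s1", "e1"), ("s1", "e2"), ("s2", "e1"), ("s2", "e2")]
observed = [("s1", "e1"), ("s1", "e1"), ("s2", "e2")]  # made-up sample
estimate = posterior_mean_joint(observed, support, pseudo_count=1.0)
```

With only three observations, the raw empirical frequencies would put zero probability on two cells. The prior keeps these cells positive, and its influence shrinks as the sample grows.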

Regularisation is itself learned from previous learning tasks, specifically the risk of overfitting in similar learning tasks, i.e. how non-representative a limited data set generally is. Learning regularisation in turn requires a prior belief over the joint distributions of samples and population averages. Applying regularisation learned from past tasks to the current one uses a prior belief over how similar different learning tasks are.

# A “chicken paper” example

The Nobel prize winner Ed Prescott introduced the term “chicken paper” to describe a certain kind of economics research article to the audience at ANU in a public lecture. For background, a macroeconomics paper commonly models the economy as a game (in the game theory sense) between households, sometimes adding the government, firms or banks as additional players. A chicken paper relies on three assumptions: 1) households like chicken, 2) households cannot produce chicken, 3) the government can provide chicken. Prescott’s point was to criticize papers that prove that the intervention of the government in the economy improves welfare. For some papers, such criticism on the grounds of “assuming the result” is justified, for some, not. This applies more broadly than just in macroeconomics.

One example that I think fits Prescott’s description is Woodford (2021, forthcoming in the American Economic Review), pages 10-11: “We suppose that units are unable to credibly promise to repay, except to the extent that the government allows them to issue debt up to a certain limit, the repayment of which is guaranteed by the government. (We assume also that the government is able to force borrowers to repay these guaranteed debts, rather than bearing any losses itself.)” The “units” that Woodford refers to are households, which are also the only producers of goods in the model. Such combined producer-consumers are called yeoman farmers and are a reasonable simplification for modelling purposes.

The inefficiency that the government solves in Woodford (2021) is the one discussed in Hirshleifer (1971) section V (page 568) that public information destroys mutually beneficial trading and insurance opportunities. In Woodford (2021), a negative shock to exactly one industry out of N in the economy occurs and becomes public at time 0 before trade opens. Thus the industries cannot trade contingent claims to insure against this shock, because they are informed of the shock before trade opens. However, the government can make a transfer at time 0 to the shock-affected industry and tax it back later from all industries.

If the government also has to start its subsidizing and taxing after trade opens, it can still provide “retrospective insurance” as Woodford calls it by taxes and subsidies. Market-based “insurance” would also work: the affected industry borrows against the collateral of the government subsidy that is anticipated to arrive in the same period.

# Contraception increases high school graduation – questionable numbers

In Stevenson et al. (2021) “The impact of contraceptive access on high school graduation” in Science Advances, some numbers do not add up. In the Supplementary Material, Table S1 lists the pre-intervention Other, non-Hispanic cohort size in the 2010 US Census and 2009 through 2017 1-year American Community Survey data as 300, but Table S2 lists it as 290 = 100+70+30+90 (Black + Asian + American Indian + Other/Multiple Races). The post-intervention cohort size is 200 in Table S1, but 230 in Table S2, so the difference is in the other direction (S2 larger) and cannot be due to the same adjustment of one table for both cohorts, e.g. omitting some racial group or double-counting multiracial people. The main conclusions still hold with the adjusted numbers.
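The arithmetic can be checked with a short script. It uses only the cohort sizes quoted above, and the variable names are mine.

```python
# Cohort sizes for the Other, non-Hispanic group as quoted in the text.
table_s1 = {"pre": 300, "post": 200}
table_s2_pre_components = {
    "Black": 100,
    "Asian": 70,
    "American Indian": 30,
    "Other/Multiple Races": 90,
}
table_s2 = {"pre": sum(table_s2_pre_components.values()), "post": 230}

# Gap between the two tables for each cohort (S2 minus S1).
gaps = {cohort: table_s2[cohort] - table_s1[cohort] for cohort in table_s1}

# A single consistent adjustment (e.g. dropping one group) would move both
# cohorts in the same direction, so the gaps would share a sign.
same_direction = gaps["pre"] * gaps["post"] > 0
```

The gaps have opposite signs, which is the inconsistency noted above.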

It is interesting that the graduation rate for the Other race group is omitted from the main paper and the Supplementary Material Table S3, because by my calculations, in Colorado, the Other graduation rate decreased after the CFPI contraception access expansion, but in the Parallel Trends states (the main comparison group of US states that the authors use), the Other graduation rate increased significantly. The one missing row in the Table is exactly the one in which the results are the opposite to the rest of the paper and the conclusions of the authors.

# Animal experiments on whether pose and expression control mood

Amy Cuddy promoted power poses which she claimed boosted confidence and success. Replication of her results failed (the effects were not found in other psychology studies), then succeeded again, so the debate continues. Similarly, adopting a smiling expression is claimed to make people happier. Measuring the psychological effects of posture and expression is complicated in humans, for example due to experimenter demand effects. Animals are simpler and cheaper to experiment with, but I did not find any animal experiments on power poses on Google Scholar on 28.03.2021.

The idea of the experiment is to move the animal into a confident or scared pose and measure the resulting behaviour, stress hormones, dominance hormones, maybe scan the brain. Potentially mood-affecting poses differ between animals, but are well-known for common pets. Lifting a dog’s tail up its back is a confident pose. Moving the tail side to side or putting the chest close to the ground and butt up in a “play-with-me bow” is happy, excited. Putting the dog’s tail between the legs is scared. Moving the dog’s gums back to bare its teeth is angry. Arching a cat’s back is angry. Curling the cat up and half-closing its eyes is contented.

The main problem is that the animal may resist being moved into these poses or get stressed by the unfamiliar treatment. A period of habituation training is needed, but if the pose has an effect, then part of this effect is realised during the habituation. In this case, the measured effect size is attenuated, i.e. the pre- and post-treatment mood and behaviour look similar.

A similar experiment in people is to have a person or a robot move the limbs of the participants of the experiment into power poses instead of asking them to assume the pose. The excuse or distraction from the true purpose of the experiment may be light physical exercise, physical therapy or massage. This includes a facial massage, which may stretch the face into a smile or compress into a frown. The usual questionnaires and measurements may be administered after moving the body or face into these poses or expressions.

# Paying pharmaceutical firms for capacity is problematic

Castillo et al. 2021 (doi:10.1126/science.abg0889) make many valid points, e.g., vaccine production should be greatly expanded using taxpayer money because the quicker recovery from the pandemic more than pays for the expansion. Castillo et al. also suggest paying pharmaceutical manufacturers for the capacity they install instead of the quantity they produce. The reasoning of the authors is that producers are delaying installing capacity and the delivery of their promised vaccine quantities to save costs and to supply higher-paying buyers first, because the penalties for delaying are small. Producers refuse to sign contracts with larger penalties.

What the authors do not mention is that the same problems occur when paying for capacity. In addition, the capacity needs to be monitored, which is more difficult than checking the delivered quantity. Before large-scale production, how can the buyer detect the “Potemkin capacity” of installing cheap production lines unsuitable for large quantities? The manufacturer may later simply claim technical glitches when the production line does not work. Effective penalties are needed, which in turn requires motivating the producer to sign a contract containing these, just like for a quantity contract.

Paying in advance for capacity before the vaccine is proven to work insures firms against the risk of failure, as Castillo et al. say. The problem is that such advance payment also attracts swindlers who promise a miracle cure and then run with the money – there is adverse selection in who enters the government’s capacity contract scheme. Thus capacity contracts should be restricted to firms with a good established reputation. However, vaccines from innovative entrants may also be needed, which suggests continuing to use quantity contracts at least for some firms. If the law requires treating firms equally, then they should all be offered a similar contract.

# Identifying unmeasurable effort in contests

To distinguish unmeasurable effort from unmeasurable exogenous factors like talent or environmental interference in contests, assumptions are needed, even for partial identification when overall performance can be objectively measured (e.g., chess move quality evaluated by a computer). Combining one of the following assumptions with the additive separability of effort and the exogenous factors provides sign restrictions on coefficient estimates. Additive separability means that talent or the environment changes performance the same way at any effort level.

One such identifying assumption is that effort is greatest when it makes the most difference – against an equal opponent. By contrast, effort is lower against much better and much worse opponents.

A similar identifying assumption is that if there is personal conflict between some contest participants but not others, then effort is likely higher against a hated opponent than a neutral one.

The performance of a given contestant against an equal opponent compared to against an unequal one is a lower bound on how much effort affects performance. Similarly, the performance against a hated rival compared to against a neutral contestant is a lower bound on the effect of effort. The lower bound is not the total influence of effort, because even against an unequal neutral opponent, effort is still positive.
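The lower bound can be written out in one line. The notation below is mine, not from any cited source: P is the measured performance of contestant i against opponent type o, T is talent, E is effort and ε collects the remaining exogenous factors. The sketch assumes that the mean of ε does not vary with the opponent type.

```latex
\begin{align*}
P_{io} &= T_i + E_{io} + \varepsilon_{io}, \qquad E_{io} \ge 0,\\
\mathbb{E}[P_{i,\text{equal}}] - \mathbb{E}[P_{i,\text{unequal}}]
  &= E_{i,\text{equal}} - E_{i,\text{unequal}} \;\le\; E_{i,\text{equal}}.
\end{align*}
```

Talent cancels because of additive separability, and the inequality is strict whenever effort against the unequal opponent is positive.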

# Moon phase and sleep correlation is not quite a sine wave

Casiraghi et al. (2021) in Science Advances (DOI: 10.1126/sciadv.abe0465) show that human sleep duration and onset depend on the phase of the moon. Their interpretation is that light availability during the night caused humans to adapt their sleep over evolutionary time. Casiraghi et al. fit a sine curve to both sleep duration and onset as functions of the day in the monthly lunar cycle, but their Figure 1 A, B for the full sample and the blue and orange curves for the rural groups in Figure 1 C, D show a statistically significant deviation from a sine function. Instead of same-sized symmetric peaks and troughs, sleep duration has two peaks with a small trough between, then a large sharp trough which falls more steeply than it rises, then two peaks again. Sleep onset has a vertically reflected version of this pattern. These features are statistically significant, based on the confidence bands Casiraghi and coauthors have drawn in Figure 1.

The significant departure of sleep patterns from a sine wave calls into question the interpretation that light availability over evolutionary time caused these patterns. What fits the interpretation of Casiraghi et al. is that sleep duration is shortest right before full moon, but what does not fit is that the duration is longest right after full and new moons, but shorter during a waning crescent moon between these.

It would better summarise the data to use the first four terms of a Fourier series instead of just the first term. There seems little danger of overfitting, given N=69 and t>60.
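To illustrate what the extra Fourier terms buy, here is a sketch on synthetic data. The series below is invented and is not the Casiraghi et al. data. The sketch assumes one complete 30-day cycle sampled at equally spaced days, in which case projecting onto each harmonic gives the least-squares coefficients. Real, unevenly sampled data would need an ordinary regression instead.

```python
import math


def fit_harmonics(values, n_terms):
    """Fit a mean plus the first n_terms harmonics to one evenly sampled cycle."""
    n = len(values)
    mean = sum(values) / n
    coeffs = []
    for k in range(1, n_terms + 1):
        # Harmonics are orthogonal on an even grid, so projection = least squares.
        a = 2 / n * sum(v * math.cos(2 * math.pi * k * t / n) for t, v in enumerate(values))
        b = 2 / n * sum(v * math.sin(2 * math.pi * k * t / n) for t, v in enumerate(values))
        coeffs.append((a, b))

    def fitted(t):
        return mean + sum(
            a * math.cos(2 * math.pi * k * t / n) + b * math.sin(2 * math.pi * k * t / n)
            for k, (a, b) in enumerate(coeffs, start=1)
        )

    return fitted


def sum_squared_residuals(values, fitted):
    return sum((v - fitted(t)) ** 2 for t, v in enumerate(values))


# Synthetic series: a 30-day sine plus a smaller 15-day component.
days = 30
synthetic = [
    math.sin(2 * math.pi * t / days) + 0.5 * math.cos(2 * math.pi * 2 * t / days)
    for t in range(days)
]
ssr_one = sum_squared_residuals(synthetic, fit_harmonics(synthetic, 1))
ssr_four = sum_squared_residuals(synthetic, fit_harmonics(synthetic, 4))
```

A single sine term leaves the whole 15-day component in the residuals, while four terms capture it.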

A questionable choice of the authors is to plot the sleep duration and onset of only the 35 best-fitting participants in Figure 2. A more honest choice yielding the same number of plots would pick every other participant in the ranking from the best fit to the worst.
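The claim that the alternative selection yields the same number of plots is easy to verify, assuming the N=69 mentioned above is the number of ranked participants.

```python
n_participants = 69
ranking = list(range(1, n_participants + 1))  # rank 1 = best fit, 69 = worst

best_only = ranking[:35]    # the selection plotted by the authors
every_other = ranking[::2]  # ranks 1, 3, 5, ..., 69
```

Both selections contain 35 participants, but the second spans the whole range of fit quality.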

In the section Materials and Methods, Casiraghi et al. fitted both a 15-day and a 30-day cycle to test for the effect of the Moon’s gravitational pull on sleep. The 15-day component was weaker in urban communities than rural, but any effect of gravity should be the same in both. By contrast, the effect of moonlight should be weaker in urban communities, but the urban community data (Figure 1 C, D green curve) fits a simple sine curve better than rural. It seems strange that sleep in urban communities would correlate more strongly with the amount of moonlight, like Figure 1 shows.

# Clinical trials of other drugs in other species to predict a drug’s effect in humans

Suppose we want to know whether a drug is safe or effective for humans, but do not have data on what it does in humans, only on its effects in mice, rats, rhesus macaques and chimpanzees. In general, we can predict the effect of the drug on humans better with the animal data than without it. Information on “nearby” realisations of a random variable (effect of the drug) helps predict the realisation we are interested in. The method should weight nearby observations more than observations further away when predicting. For example, if the drug has a positive effect in animals, then the method predicts a positive effect in humans, and the larger the effect in animals, the greater the predicted effect in humans.

A limitation of weighting is that it does not take into account the slope of the effect when moving from further observations to nearer. For example, a very large effect of the drug in mice and rats but a small effect in macaques and chimpanzees predicts the same effect in humans as a small effect in rodents and a large one in monkeys and apes, if the weighted average effect across animals is the same in both cases. However, intuitively, the first case should have a smaller predicted effect in humans than the second, because moving to animals more similar to humans, the effect becomes smaller in the first case but larger in the second. The idea is similar to a proportional-integral-derivative (PID) controller in engineering.
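The difference between pure weighting and using the slope can be sketched numerically. All numbers below (similarity scores, effect sizes, and treating humans as similarity 1) are hypothetical. The weighted straight-line extrapolation is just one simple way to use the slope, not a recommended estimator.

```python
def weighted_mean(weights, values):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def weighted_line_prediction(weights, xs, ys, x_target):
    """Fit a weighted least-squares line of ys on xs and evaluate it at x_target."""
    x_bar = weighted_mean(weights, xs)
    y_bar = weighted_mean(weights, ys)
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    return y_bar + cov / var * (x_target - x_bar)


# Hypothetical similarity to humans: mouse, rat, macaque, chimpanzee.
similarity = [0.25, 0.25, 0.5, 0.75]
weights = similarity  # nearer species get more weight

large_in_rodents = [6.0, 6.0, 1.0, 1.0]   # effect shrinks towards humans
large_in_primates = [1.0, 1.0, 3.0, 3.0]  # effect grows towards humans

avg_first = weighted_mean(weights, large_in_rodents)
avg_second = weighted_mean(weights, large_in_primates)
slope_first = weighted_line_prediction(weights, similarity, large_in_rodents, 1.0)
slope_second = weighted_line_prediction(weights, similarity, large_in_primates, 1.0)
```

The weighted averages coincide by construction, but the extrapolated predictions differ in the intuitive direction: lower when the effect shrinks towards humans, higher when it grows.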

The slope of the effect of the drug is extra information that increases the predictive power of the method if the assumption that the similarity of effects decreases in genetic distance holds. Of course, if this assumption fails in the data, then imposing it may result in bias.

Assumptions may be imposed on the method using constrained estimation. One constraint is the monotonicity of the effect in some measure of distance between observations. The method may allow for varying weights by adding interaction terms (e.g., the effect of the drug times genetic similarity). The interaction terms unfortunately require more data to estimate.

Extraneous information about the slope of the effect helps justify the constraints and reduces the need for adding interaction terms, thus decreases the data requirement. An example of such extra information is whether the effects of other drugs that have been tested in these animals as well as humans were monotone in genetic distance. Using information about these other drugs imposes the assumption that the slopes of the effects of different drugs are similar. The similarity of the slopes should intuitively depend on the chemical similarity of the drugs, with more distant drugs having more different profiles of effects across animals.

The similarity of species in terms of the effects drugs have on them need not correspond to genetic similarity or the closeness of any other observable characteristic of these organisms, although these measures often coincide. The similarity of interest is how similar the effects of the drug are across these species. Estimating this similarity based on the similarity of other drugs across these animals may also be done by a weighted regression, perhaps with constraints or added interaction terms. More power for the estimation may be obtained from simultaneous estimation of the drug-effect-similarity of the species and the effect of the drug in humans. An analogy is demand and supply estimation in industrial organisation where observations about each side of the market give information about the other side. Another analogy is duality in mathematics, in this case between the drug-effect-similarity of the species and the given drug’s similarity of effects across these species.

The similarity of drugs in terms of their effects on each species need not correspond to chemical similarity, although it often does. The similarity of interest for the drugs is how similar their effects are in humans, and also in other species.

The inputs into the joint estimation of drug similarity, species similarity and the effect of the given drug in humans are the genetic similarity of the species, the chemical similarity of the drugs and the effects for all drug-species pairs that have been tested. In the matrix where the rows are the drugs and the columns the species, we are interested in filling in the cell in the row “drug of interest” and the column “human”. The values in all the other cells are informative about this cell. In other words, there is a benefit from filling in these other cells of the matrix.

Given the duality of drugs and species in the drug effect matrix, there is information to be gained from running clinical trials of chemically similar human-use-approved drugs in species in which the drug of interest has been tested but the chemically similar ones have not. The information is directly about the drug-effect-similarity of these species to humans, which indirectly helps predict the effect of the drug of interest in humans from the effects of it in other species. In summary, testing other drugs in other species is informative about what a given drug does in humans. Adapting methods from supply and demand estimation, or otherwise combining all the data in a principled theoretical framework, may increase the information gain from these other clinical trials.

Extending the reasoning, each (species, drug) pair has some unknown similarity to the (human, drug of interest) pair. A weighted method to predict the effect in the (human, drug of interest) pair may gain power from constraints that the similarity of different (species, drug) pairs increases in the genetic closeness of the species and the chemical closeness of the drugs.

Define Y_{sd} as the effect of drug d in species s. Define X_{si} as the observable characteristic (gene) i of species s. Define X_{dj} as the observable characteristic (chemical property) j of drug d. The simplest method is to regress Y_{sd} on all the X_{si} and X_{dj} and use the coefficients to predict the Y_{sd} of the (human, drug of interest) pair. If there are many characteristics i and j and few observations Y_{sd}, then variable selection or regularisation is needed. Constraints may be imposed, like X_{si}=X_i for all s and X_{dj}=X_j for all d.
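Written out, the simplest version of this regression might look as follows. The intercept α, the coefficients β and γ, the error term ε, the set O of tested (species, drug) pairs, the indices h (human) and d* (drug of interest) and the penalty weight λ are notation I introduce here. The L1 penalty is only one possible form of regularisation.

```latex
\begin{align*}
Y_{sd} &= \alpha + \sum_i \beta_i X_{si} + \sum_j \gamma_j X_{dj} + \varepsilon_{sd},\\
(\hat\alpha, \hat\beta, \hat\gamma) &= \arg\min_{\alpha,\beta,\gamma}
  \sum_{(s,d)\in\mathcal{O}} \Bigl(Y_{sd} - \alpha - \sum_i \beta_i X_{si} - \sum_j \gamma_j X_{dj}\Bigr)^2
  + \lambda \Bigl(\sum_i |\beta_i| + \sum_j |\gamma_j|\Bigr),\\
\hat Y_{hd^*} &= \hat\alpha + \sum_i \hat\beta_i X_{hi} + \sum_j \hat\gamma_j X_{d^*j}.
\end{align*}
```

The last line is the prediction for the (human, drug of interest) cell, which uses only the observable characteristics of humans and of the drug.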

Fused LASSO (least absolute shrinkage and selection operator), clustered LASSO and prior LASSO seem related to the above method.

# Leader turnover due to organisation performance is underestimated

Berry and Fowler (2021) “Leadership or luck? Randomization inference for leader effects in politics, business, and sports” in Science Advances propose a method they call RIFLE for testing the null hypothesis that leaders have no effect on organisation performance. The method is robust to serial correlation in outcomes and leaders, but not to endogenous leader turnover, as Berry and Fowler honestly point out. The endogeneity is that the organisation’s performance influences the probability that the leader is replaced (economic growth causes voters to keep a politician in office, losing games causes a team to replace its coach).

To test whether such endogeneity is a significant problem for their results, Berry and Fowler regress the turnover probability on various measures of organisational performance. They find small effects, but this underestimates the endogeneity problem, because Berry and Fowler use linear regression, forcing the effect of performance on turnover to be monotone and linear.

If leader turnover is increased by both success (get a better job elsewhere if the organisation performs well, so quit voluntarily) and failure (fired for the organisation’s bad performance), then the relationship between turnover and performance is U-shaped. Average leaders keep their jobs, bad and good ones transition elsewhere. This is related to the Peter Principle that an employee is promoted to her or his level of incompetence. A linear regression finds a near-zero effect of performance on turnover in this case even if the true effect is large. How close the regression coefficient is to zero depends on how symmetric the effects of good and bad performance on leader transition are, not how large these effects are.
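A small numerical sketch shows the point. The performance grid and turnover probabilities are invented for illustration, and `ols_slope` is a plain textbook simple-regression slope, not the specification that Berry and Fowler use.

```python
def ols_slope(xs, ys):
    """Slope coefficient from a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var


performance = [-2, -1, 0, 1, 2]  # hypothetical standardised performance

# U-shaped: both very bad and very good performance raise turnover.
u_shaped_turnover = [0.1 + 0.2 * p * p for p in performance]

# Asymmetric: only bad performance raises turnover (firing, no poaching).
firing_only_turnover = [0.1 + 0.2 * p * p if p < 0 else 0.1 for p in performance]

slope_u_shaped = ols_slope(performance, u_shaped_turnover)
slope_firing_only = ols_slope(performance, firing_only_turnover)
```

In the symmetric case the slope is zero even though turnover ranges from 0.1 to 0.9, while the firing-only case produces a clearly negative slope.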

The problem for the RIFLE method of Berry and Fowler is that the small apparent effect of organisation performance on leader turnover from OLS regression misses the endogeneity in leader transitions. Such endogeneity biases RIFLE, as Berry and Fowler admit in their paper.

The endogeneity may explain why Berry and Fowler find stronger leader effects in sports (coaches in various US sports) than in business (CEOs) and politics (mayors, governors, heads of government). A sports coach may experience more asymmetry in the transition probabilities for good and bad performance than a politician. For example, if the teams fire coaches after bad performance much more frequently than poach coaches from well-performing competing teams, then the effect of performance on turnover is close to monotone: bad performance causes firing. OLS discovers this monotone effect. On the other hand, if politicians move with equal likelihood after exceptionally good and bad performance of the administrative units they lead, then linear regression finds no effect of performance on turnover. This misses the bias in RIFLE, which without the bias might show a large leader effect in politics also.

The unreasonably large effect of governors on crime (the governor effect explains 18-20% of the variation in both property and violent crime) and the difference between the zero effect of mayors on crime and the large effect of governors that Berry and Fowler find make me suspect something is wrong with that particular analysis in their paper. In a checks-and-balances system, the governor should not have that large an influence on the state’s crime. A mayor works more closely with the local police, so would be expected to have more influence on crime.

# Dilution effect explained by signalling

A persuader who believes her first argument to be strong enough to convince everyone does not waste valuable time to add other arguments. Listeners evaluate arguments partly by the confidence they believe the speaker has in these claims. This is rational Bayesian updating because a speaker’s conviction in the correctness of what she says is positively correlated with the actual validity of the claims.

A countervailing effect is that a speaker with many arguments has spent significant time studying the issue, so knows more precisely what the correct action is. If the listeners believe the bias of the persuader to be small or against the action that the arguments favour, then the audience should rationally believe a better-informed speaker more.

An effect in the same direction as dilution is that a speaker with many arguments in favour of a choice strongly prefers the listeners to choose it, i.e. is more biased. Then the listeners should respond less to the persuader’s effort. In the limit when the speaker’s only goal is always for the audience to comply, at any time cost of persuasion, then the listeners should ignore the speaker because a constant signal carries no information.

## Modelling

The listeners choose either to do what the persuader wants or not. The persuader receives a benefit B if the listeners comply, otherwise receives zero.

The persuader always presents her first argument, otherwise reveals that she has no arguments, which ends the game with the listeners not doing what the persuader wants. The persuader chooses whether to spend time at cost c>0, c<B to present her second argument, which may be strong or weak. The persuader knows the strength of the second argument but the listeners only have the common prior belief that the probability of a strong second argument is p0. If the second argument is strong, then the persuader is confident, otherwise not.

If the persuader does not present the second argument, then the listeners receive an exogenous private signal in {1,0} about the persuader’s confidence, e.g. via her subconscious body language. The probabilities of the signals are Pr(1|confident) =Pr(0|not) =q >1/2. If the persuader presents the second argument, then the listeners learn the confidence with certainty and can ignore any signals about it. Denote by p1 the updated probability that the audience puts on the second argument being strong.

If the speaker presents a strong second argument, then p1=1. If the speaker presents a weak argument, then p1=0. If the speaker presents no second argument, then after signal 1, the audience updates their belief to p1(1) =p0*q/(p0*q +(1-p0)*(1-q)) >p0 and after signal 0, to p1(0) =p0*(1-q)/(p0*(1-q) +(1-p0)*q) <p0.

The listeners prefer to comply (take action a=1) when the second argument of the persuader is strong, otherwise prefer not to do what the persuader wants (action a=0). At the prior belief p0, the listeners prefer not to comply. Therefore a persuader with a strong second argument chooses max{B*1-c, q*B*1 +(1-q)*B*0} and presents the argument iff (1-q)*B >c. A persuader with a weak argument chooses max{B*0-c, (1-q)*B*1 +q*B*0}, always not to present the argument. If a confident persuader chooses not to present the argument, then the listeners use the exogenous signal, otherwise use the choice of presentation to infer the type of the persuader.
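The updating formulas and the presentation decisions can be sketched in a few lines. The function names and parameter values are mine. The payoffs assume, as the expressions above do implicitly, that the listeners comply after signal 1 and do not comply after signal 0.

```python
def posterior_strong(p0, q, signal):
    """Listeners' belief that the second argument is strong, given the signal."""
    if signal == 1:
        return p0 * q / (p0 * q + (1 - p0) * (1 - q))
    return p0 * (1 - q) / (p0 * (1 - q) + (1 - p0) * q)


def confident_presents(B, c, q):
    # Presenting yields B - c; staying silent yields B with probability q.
    # Equivalent to (1 - q) * B > c.
    return B - c > q * B


def unconfident_presents(B, c, q):
    # Presenting a weak argument yields -c; staying silent yields B with
    # probability 1 - q. Never true when c > 0 and B > 0.
    return -c > (1 - q) * B


# Hypothetical parameters.
p0, q, B = 0.3, 0.8, 10.0
belief_after_1 = posterior_strong(p0, q, 1)
belief_after_0 = posterior_strong(p0, q, 0)
cheap_argument = confident_presents(B, c=1.0, q=q)      # (1-q)*B = 2 > 1
expensive_argument = confident_presents(B, c=3.0, q=q)  # (1-q)*B = 2 < 3
```

With these numbers, the confident persuader presents the second argument only when the time cost is below (1-q)*B, and the unconfident persuader never presents it.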

One extension is that presenting the argument still leaves some doubt about its strength.

Another extension has many argument strength levels, so each type of persuader sometimes presents the second argument, sometimes not.

In this standard model, if the second argument is presented, then always by the confident type. As is intuitive, the second argument increases the belief of the listeners that the persuader is right. Adding countersignalling partly reverses the intuition – a very confident type of the persuader knows that the first argument already reveals her great confidence, so the listeners do what the very confident persuader wants. The very confident type never presents the second argument, so if the confident type chooses to present it, then the extra argument reduces the belief of the audience in the correctness of the persuader. However, compared to the least confident type who also never presents the second argument, the confident type’s second argument increases the belief of the listeners.