Monthly Archives: November 2021

Bayesian updating of higher-order joint probabilities

Bayes’ rule uses a signal and the assumed joint probability distribution of signals and events to estimate the probability of an event of interest. Call this event a first-order event and the signal a first-order signal. Which joint probability distribution is the correct one is a second-order event, so second-order events are first-order probability distributions over first-order events and signals. The second-order signal consists of a first-order event and a first-order signal.
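As a minimal sketch of this first-order step (the event and signal names and numbers are illustrative, not from any real data), Bayes’ rule applied to an assumed joint distribution looks like:

```python
# Assumed first-order joint distribution P(event, signal); names illustrative.
joint = {  # rows are events, columns are signals
    "rain":    {"dark_clouds": 0.30, "clear_sky": 0.10},
    "no_rain": {"dark_clouds": 0.15, "clear_sky": 0.45},
}

def posterior(joint, signal):
    """Bayes' rule: P(event | signal) = P(event, signal) / P(signal)."""
    p_signal = sum(row[signal] for row in joint.values())  # marginal of the signal
    return {event: row[signal] / p_signal for event, row in joint.items()}

print(posterior(joint, "dark_clouds"))  # rain: 2/3, no_rain: 1/3
```

The entire inference rests on the `joint` table being the correct one, which is exactly the second-order question raised next.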

If the particular first-order joint probability distribution puts higher probability on the co-occurrence of this first-order event and signal than other first-order distributions do, then observing this event and signal increases the likelihood of this particular distribution. The increase comes from applying Bayes’ rule to update second-order events using second-order signals, which requires assuming a joint probability distribution of second-order signals and events. This second-order distribution is over first-order joint distributions and first-order signal-event pairs.
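The second-order step can be sketched the same way (again with hypothetical names and numbers): the prior is over candidate first-order joint distributions, and the second-order signal is an observed first-order signal-event pair, whose likelihood under each candidate is simply the probability that candidate assigns to the pair.

```python
# Candidate first-order joint distributions over (event, signal) pairs;
# the candidate names and probabilities are illustrative assumptions.
candidates = {
    "wet_climate": {("rain", "dark_clouds"): 0.30, ("rain", "clear_sky"): 0.10,
                    ("no_rain", "dark_clouds"): 0.15, ("no_rain", "clear_sky"): 0.45},
    "dry_climate": {("rain", "dark_clouds"): 0.05, ("rain", "clear_sky"): 0.05,
                    ("no_rain", "dark_clouds"): 0.10, ("no_rain", "clear_sky"): 0.80},
}
prior = {"wet_climate": 0.5, "dry_climate": 0.5}  # second-order marginal over events

def update_second_order(prior, candidates, observation):
    """Bayes' rule one level up: the likelihood of the observed (event, signal)
    pair under each candidate is the probability that candidate assigns to it."""
    unnorm = {name: prior[name] * dist[observation]
              for name, dist in candidates.items()}
    total = sum(unnorm.values())
    return {name: p / total for name, p in unnorm.items()}

post = update_second_order(prior, candidates, ("rain", "dark_clouds"))
# the candidate putting more weight on the observed pair gains probability
```

Here the second-order conditional (likelihood) needed no separate assumption: it is read off the candidate first-order distribution itself, which is the general point made below about orders n>=2.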

The third-order distribution is over second-order distributions and signal-event pairs. A second-order signal-event pair is a third-order signal. A second-order distribution is a third-order event.

A joint distribution of any order n may be decomposed into a marginal distribution over events and a conditional distribution of signals given events, where both the signals and the events are of the same order n. The conditional distribution of any order n>=2 is known by definition: the n-order event is the joint probability distribution of (n-1)-order signals and events, so the probability of an (n-1)-order signal-event pair (i.e., the n-order signal) given the n-order event (i.e., the (n-1)-order distribution) is the one listed in that (n-1)-order distribution.
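The decomposition itself is mechanical at any order. A sketch with a joint distribution stored over (event, signal) pairs (the same illustrative numbers as above):

```python
def decompose(joint):
    """Split a joint distribution P(event, signal), given as a dict keyed by
    (event, signal) pairs, into a marginal P(event) and a conditional
    P(signal | event) = P(event, signal) / P(event)."""
    marginal = {}
    for (event, _signal), p in joint.items():
        marginal[event] = marginal.get(event, 0.0) + p  # sum out the signal
    conditional = {(signal, event): p / marginal[event]
                   for (event, signal), p in joint.items()}
    return marginal, conditional

joint = {("rain", "dark_clouds"): 0.30, ("rain", "clear_sky"): 0.10,
         ("no_rain", "dark_clouds"): 0.15, ("no_rain", "clear_sky"): 0.45}
marginal, conditional = decompose(joint)
# marginal["rain"] == 0.4; conditional[("dark_clouds", "rain")] == 0.75
```

For n>=2 the conditional half of this output is fixed by the definition of the n-order event, so only the marginal half carries free assumptions.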

The marginal distribution over events is an assumption above, but may be formulated as a new event of interest to be learned. The new signal in this case is the occurrence of the original event (not the marginal distribution). The empirical frequencies of the original events are a sufficient statistic for a sequence of new signals. To apply Bayes’ rule, a joint distribution over signals and the distributions of events needs to be assumed. The joint distribution itself may be learned from among many, over which there is a second-order joint distribution. Extending the Bayesian updating to higher orders proceeds as above. The joint distribution may again be decomposed into a conditional over signals and a marginal over events. The conditional is known by definition for all orders, now including the first, because the probability of a signal is the probability of occurrence of an original event, which is given by the marginal distribution (the new event) over the original events.
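To illustrate sufficiency of the empirical frequencies (with made-up candidate marginals): updating a prior over candidate distributions of the original events requires only the event counts, not the order of the sequence, because the multinomial likelihood depends on the counts alone.

```python
from math import prod

# Candidate marginal distributions over the original event; illustrative numbers.
candidates = {"mostly_rain": {"rain": 0.7, "no_rain": 0.3},
              "mostly_dry":  {"rain": 0.2, "no_rain": 0.8}}
prior = {"mostly_rain": 0.5, "mostly_dry": 0.5}
counts = {"rain": 3, "no_rain": 7}  # empirical frequencies: a sufficient statistic

def update_from_counts(prior, candidates, counts):
    """Bayes' rule with a multinomial likelihood: each candidate's likelihood
    is the product of its event probabilities raised to the observed counts."""
    unnorm = {name: prior[name] * prod(dist[e] ** n for e, n in counts.items())
              for name, dist in candidates.items()}
    total = sum(unnorm.values())
    return {name: p / total for name, p in unnorm.items()}

post = update_from_counts(prior, candidates, counts)
# seven dry days out of ten shift belief strongly toward "mostly_dry"
```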

Returning to the discussion of learning the joint distributions, only the first-order events affect decisions, so only the marginal distribution over first-order events matters directly. The joint distributions of higher orders and the first-order conditional distribution only matter through their influence on updating the first-order marginal distribution.

The marginal of order n is the distribution over the (n-1)-order joint distributions. After reducing compound lotteries, the marginal of order n is the average of the (n-1)-order joint distributions. This average is itself an (n-1)-order joint distribution, which may be split into an (n-1)-order marginal and conditional, where if n-1>=2, the conditional is known. If the conditional is known, then the marginal may again be reduced as a compound lottery. Thus the hierarchy of marginal distributions of all orders collapses to the first-order joint distribution. This takes us back to the start – learning the joint distribution. The discussion above about learning a (second-order) marginal distribution (the first-order joint distribution) also applies. The empirical frequencies of signal-event pairs are the signals. Applying Bayes’ rule with some prior over joint distributions constitutes regularisation of the empirical frequencies to prevent overfitting to limited data.

Regularisation is itself learned from previous learning tasks, specifically the risk of overfitting in similar learning tasks, i.e. how non-representative a limited data set generally is. Learning regularisation in turn requires a prior belief over the joint distributions of samples and population averages. Applying regularisation learned from past tasks to the current one uses a prior belief over how similar different learning tasks are.

How to learn whether an information source is accurate

Two sources may be used to check each other over time. One of these sources may be your own senses, which show whether the event that the other source predicted occurred or not. The observation of an event is really another signal about the event. It is a noisy signal because your own eyes may lie (optical illusions, deepfakes).

First, one source sends a signal about the event, then the second source sends its own. You will never know whether the event actually occurred, but the second source is the aggregate of all the future information you receive about the event, so it may be very accurate. The second source may send many signals in sequence about the event, yielding more information about the first source over time. Then the process repeats for a second event, a third, and so on. This is how belief about the trustworthiness of a source is built.

You cannot learn the true accuracy of a source, because the truth is unavailable to your senses, so you cannot compare a source’s signals to the truth. You can only learn the consistency of different sources of sensory information. Knowing the correlation between various sensory sources is both necessary and sufficient for decision making, because your objective function (utility or payoff) is your perception of successfully achieving your goals. If your senses are deceived so you believe you have achieved what you sought, but actually have not, then you get the feeling of success, but if your senses are deceived to tell you you have failed, then you do not feel success even if you actually succeeded. The problem with deception arises purely from the positive correlation between the deceit and the perception of deceit. If deceit increases the probability that you later perceive you have been deceived and are unhappy about that perception, then deceit may reduce your overall utility despite initially increasing it temporarily. If you never suspect the deception, then your happiness is as if the deception was the truth.

Your senses send signals to your brain. We can interpret these signals as information about which hypothetical state of the world has occurred – we posit that there exists a world which may be in different states with various probabilities and that there is a correlation between the signals and these states. Based on the information, you update the probabilities of the states and choose a course of action. Actions result in probability distributions over different future sensations, which may be modelled as a different sensation in each state of the world, which have probabilities attached. (Later we may remove the states of the world from the model and talk about a function from past perceptions and actions into future perceptions. The past is only accessible through memory. Memory is a current perception, so we may also remove time from the model.)

You prefer some future sensations to others. These need not be sensory pleasures. These could be perceptions of having improved the world through great toil. You would prefer to choose an action that results in preferable sensations in the future. Which action this is depends on the state of the world.

To estimate the best action (the one yielding the most preferred sensations), you use past sensory signals. The interpretation of these signals depends on the assumed or learned correlation between the signals and the state. The assumption may be instinctive from birth. The learning is really about how sensations at a point in time are correlated with the combination of sensations and actions before that point. An assumption that the correlation is stable over time enables you to use past correlation to predict future correlation. This assumption in turn may be instinctive or learned.

The events most people are interested in distinguishing are of the form “action A results in the most preferred sensations”, “action B causes the most preferred sensations”, “action A yields the least preferred sensations”. Any event that is useful to know is of a similar form by Blackwell’s theorem: information is useful if and only if it changes decisions.

The usefulness of a signal source depends on how consistent the signals it gives about the action-sensation links (events) are with your future perceptions. These future perceptions are the signals from the second source – your senses – against which the first source is checked. The signals of the second source have the form “memory of action A and a preferred sensation at present”. Optimal learning about the usefulness of the first source uses Bayes’ rule and a prior probability distribution on the correlations between the first source and the second. The events of interest in this case are the levels of correlation. A signal about these levels is whether the first source gave a signal that coincided with later sensory information.

If the first source recommended a “best action” that later yielded a preferred sensation, then this increases the probability of high positive correlation between the first source and the second on average. If the recommended action was followed by a negative sensation, then this raises the probability of a negative correlation between the sources. Any known correlation is useful information, because it helps predict the utility consequences of actions.
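This learning process can be sketched directly (the accuracy levels and the history are illustrative assumptions): the events of interest are levels of correlation between the source and one’s senses, here simplified to the probability that the source’s signal coincides with later sensory information, and each match or mismatch is one Bayes step.

```python
# Prior over how often the source's signals coincide with later sensations;
# the three levels (0.9, 0.5, 0.1) and their weights are illustrative.
belief = {0.9: 0.25, 0.5: 0.5, 0.1: 0.25}

def update(prior, matched):
    """One Bayes step on observing whether the source's signal coincided
    with later sensory information (matched=True) or not."""
    unnorm = {a: p * (a if matched else 1 - a) for a, p in prior.items()}
    total = sum(unnorm.values())
    return {a: p / total for a, p in unnorm.items()}

for matched in [True, True, False, True]:  # hypothetical history of coincidences
    belief = update(belief, matched)
# mostly-matching history shifts belief toward high positive correlation;
# a mostly-mismatching history would shift it toward the 0.1 level instead
```

Note that, as argued above, this learns only the consistency between the two sources, never the accuracy of either against an inaccessible truth.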

As a side note, consider counterfactuals. Even if an action A resulted in a preferred sensation, a different action B might have led to an even better sensation in the counterfactual universe where B was chosen instead. Of course, B might equally well have led to a worse sensation. Counterfactuals require a model to evaluate – what the output would have been after a different input depends on the assumed causal chain from inputs to outputs.

Whether two sources are separate or copies is also a learnable event.

Exaggerating vs hiding emotions

In some cultures, it was a matter of honour not to show emotions. Native American warriors famously had stony visages. Victorian aristocracy prided themselves on a stiff upper lip and an unflappable manner. Winston Churchill describes in his memoirs how the boarding school culture, enforced by physical violence, was to show no fear. In other cultures, emotions are exaggerated. Teenagers in North America from 1990 to the present are usually portrayed as drama queens, as are arts people. Everything is either fabulous or horrible to them, with no so-so experiences. I have witnessed the correctness of this portrayal in the case of teenagers. Jane Austen’s “Northanger Abbey” depicts Regency-era teenagers as exaggerating their emotions similarly to their modern-day counterparts.

In the attention economy, exaggerating emotions is profitable for getting and keeping viewers. Traditional and social media portray situations as more extreme than they really are in order to attract eyeballs and clicks. Teenagers may have a similar motivation – to get noticed by their peers. Providing drama is an effective way to do so. The notice of others may help attract sex partners or a circle of followers. People notice the strong emotions of others for evolutionary reasons: radical action is more likely to follow strong emotional displays than neutral communication. Radical action by others requires a quick, accurate response to keep one’s health and wealth or to take advantage of the radical actor.

A child with an injury or illness may pretend to suffer more than they actually do to get more care and resources from parents, especially compared to siblings. This is similar to the begging competition among bird chicks.

Exaggerating both praise and emotional punishment motivates others to do one’s bidding. Incentives are created by the difference in the consequences of different actions, so exaggerating this difference strengthens incentives, unless others see through the pretending. Teenagers may exaggerate their outward happiness and anger at what the parents do, in order to force the parents to comply with the teenager’s wishes.

On the other hand, in a zero-sum game, providing information to the other player cannot increase one’s own payoff and usually reduces it. Emotions are information about the preferences and plans of the one who shows these. In an antagonistic situation, such as negotiations or war between competing tribes, a poker face is an information security measure.

In short, creating drama is an emotional blackmail method targeting those with aligned interests. An emotionless front hides both weaknesses and strengths from those with opposed interests, so they cannot target the weakness or prepare for the precise strength.

Whether teenagers display or hide emotion is thus informative about whether they believe the surrounding people to be friends or enemies. A testable prediction is that bullied children suppress emotion and pretend not to care about anything, especially compared to a brain scan showing they actually care and especially when they are primed to recall the bullies. Another testable prediction is that popular or spoiled children exaggerate their emotions, especially around familiar people and when they believe a reward or punishment is imminent.

Signalling the precision of one’s information with emphatic claims

Chats both online and in person seem to consist of confident claims which are either extreme absolute statements (“vaccines don’t work at all”, “you will never catch a cold if you take this supplement”, “artificial sweeteners cause cancer”) or profess no knowledge (“damned if I know”, “we will never know the truth”), sometimes blaming the lack of knowledge on external forces (“of course they don’t tell us the real reason”, “the security services are keeping those studies secret, of course”, “big business is hiding the truth”). Moderate statements that something may or may not be true, especially off the center of all-possibilities-equal, and expressions of personal uncertainty (“I have not studied this enough to form an opinion”, “I have not thought this through”) are almost absent. Other than in research and official reports, I seldom encounter statements of the form “these are the arguments in this direction and those are the arguments in that direction. This direction is somewhat stronger.” or “the balance of the evidence suggests x” or “x seems more likely than not-x”. In opinion pieces in various forms of media, the author may give arguments for both sides, but in that case, concludes something like “we cannot rule out this and we cannot rule out that”, “prediction is difficult, especially now in a rapidly changing world”, “anything may happen”. The conclusion of the opinion piece does not recommend a moderate course of action supported by the balance of moderate-quality evidence.

The same person confidently claims knowledge of an extreme statement on one topic and professes certainty of no knowledge at all on another. What could be the goal of making both extreme and no-knowledge statements confidently? If the person wanted to pretend to be well-informed, then confidence helps with that, but claiming no knowledge would be counterproductive. Blaming the lack of knowledge on external forces and claiming that the truth is unknowable or will never be discovered helps excuse one’s lack of knowledge. The person can then pretend to be informed to the best extent possible (a constrained maximum of knowledge) or at least know more than others (a relative maximum).

Extreme statements suggest to an approximately Bayesian audience that the claimer has received many precise signals in the direction of the extreme statement and as a result has updated the belief far from the average prior belief in society. Confident statements also suggest many precise signals to Bayesians. The audience does not need to be Bayesian to form these interpretations – updating in some way towards the signal is sufficient, as is a behavioural belief that confidence or extreme claims demonstrate the quality of the claimer’s information. A precisely estimated zero, such as confidently saying both x and not-x are equally likely, also signals good information, as does being confident that the truth is unknowable.

Being perceived as having precise information helps influence others. If people believe that the claimer is well-informed and has interests more aligned than opposed to theirs, then it is rational to follow the claimer’s recommendation. Having influence is generally profitable. This explains the lack of moderate-confidence statements and claims of personal but not collective uncertainty.

A question that remains is why confident moderate statements are almost absent. Why not claim with certainty that 60% of the time, the drug works and 40% of the time, it doesn’t? Or confidently state that a third of the wage gap/racial bias/country development is explained by discrimination, a third by statistical discrimination or measurement error and a third by unknown factors that need further research? Confidence should still suggest precise information no matter what the statement is about.

Of course, if fools are confident and researchers honestly state their uncertainty, then the certainty of a statement shows the foolishness of the speaker. If confidence makes the audience believe the speaker is well-informed, then either the audience is irrational in a particular way or believes that the speaker’s confidence is correlated with the precision of the information in the particular dimension being talked about. If the audience has a long history of communication with the speaker, then they may have experience that the speaker is generally truthful, acts similarly across situations and expresses the correct level of confidence on unemotional topics. The audience may fail to notice when the speaker becomes a spreader of conspiracies or becomes emotionally involved in a topic and therefore is trying to persuade, not inform. If the audience is still relatively confident in the speaker’s honesty, then the speaker sways them more by confidence and extreme positions than by admitting uncertainty or a moderate viewpoint.

The communication described above may be modelled as the claimer conveying three-dimensional information with two two-dimensional signals. One dimension of the information is the extent to which the statement is true. For example, how beneficial is a drug or how harmful an additive. A second dimension is how uncertain the truth value of the statement is – whether the drug helps exactly 55% of patients or may help anywhere between 20 and 90%, between which all percentages are equally likely. A third dimension is the minimal attainable level of uncertainty – how much the truth is knowable in this question. This is related to whether some agency is actively hiding the truth or researchers have determined it and are trying to educate the population about it. The second and third dimensions are correlated. The lower the minimal attainable uncertainty, the more certain the truth value of the statement can be. It cannot be more certain than the laws of physics allow.

The two dimensions of one signal (the message of the claimer) are the extent to which the statement is true and how certain the claimer is of the truth value. Confidence emphasises that the claimer is certain about the truth value, regardless of whether this value is true or false. The claim itself is the first dimension of the signal. The reason the third dimension of the information is not part of the first signal is that the claim that the truth is unknowable is itself a second claim about the world, i.e. a second two-dimensional signal saying how much some agency is hiding or publicising the truth and how certain the speaker is of the direction and extent of the agency’s activity.

Opinion expressers in (social) media usually choose an extreme value for both dimensions of both signals. They claim some statement about the world is either the ultimate truth or completely false or unknowable and exactly in the middle, not a moderate distance to one side. In the second dimension of both signals, the opinionated people express complete certainty. If the first signal says the statement is true or false, then the second signal is not sent and is not needed, because if there is complete certainty of the truth value of the statement, then the statement must be perfectly knowable. If the first signal says the statement is fifty-fifty (the speaker does not know whether true or false), then in the second signal, the speaker claims that the truth is absolutely not knowable. This excuses the speaker’s claimed lack of knowledge as due to an objective impossibility, instead of the speaker’s limited data and understanding.