Tag Archives: Bayes’ rule

Bayesian updating of higher-order joint probabilities

Bayes’ rule uses a signal and the assumed joint probability distribution of signals and events to estimate the probability of an event of interest. Call this event a first-order event and the signal a first-order signal. Which joint probability distribution is the correct one is a second-order event, so second-order events are first-order probability distributions over first-order events and signals. The second-order signal consists of a first-order event and a first-order signal.

If a particular first-order joint probability distribution puts higher probability on the co-occurrence of this first-order event and signal than other first-order distributions do, then observing this event and signal increases the posterior probability of this particular distribution. The increase comes from applying Bayes’ rule to update second-order events using second-order signals, which requires assuming a joint probability distribution of second-order signals and events. This second-order distribution is over first-order joint distributions and first-order signal-event pairs.
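
A minimal numeric sketch of this second-order update, under assumed (hypothetical) candidate distributions and a uniform second-order prior: two candidate first-order joint distributions over a binary event and a binary signal, and one observed signal-event pair.

```python
import numpy as np

# Two candidate first-order joint distributions P(event, signal);
# rows = event in {0, 1}, columns = signal in {0, 1}. Numbers are hypothetical.
candidates = [
    np.array([[0.4, 0.1],
              [0.1, 0.4]]),   # signal tracks the event closely
    np.array([[0.25, 0.25],
              [0.25, 0.25]]), # signal is uninformative
]

# Second-order prior: belief over which candidate joint distribution is correct.
prior = np.array([0.5, 0.5])

# Second-order signal: one observed (event, signal) pair.
event, signal = 1, 1

# Bayes' rule: the posterior over candidates is proportional to
# prior times the probability the candidate assigns to the observed pair.
likelihood = np.array([c[event, signal] for c in candidates])
posterior = prior * likelihood
posterior /= posterior.sum()

print(posterior)  # the candidate that puts more weight on (1, 1) gains probability
```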

The third-order distribution is over second-order distributions and signal-event pairs. A second-order signal-event pair is a third-order signal. A second-order distribution is a third-order event.

A joint distribution of any order n may be decomposed into a marginal distribution over events and a conditional distribution of signals given events, where both the signals and the events are of the same order n. The conditional distribution of any order n>=2 is known by definition, because the order-n event is the joint probability distribution of order-(n-1) signals and events; thus the probability of an order-(n-1) signal-event pair (i.e., the order-n signal) given the order-n event (i.e., the order-(n-1) distribution) is the one listed in that order-(n-1) distribution.

The marginal distribution over events is an assumption above, but may be formulated as a new event of interest to be learned. The new signal in this case is the occurrence of the original event (not the marginal distribution). The empirical frequencies of the original events are a sufficient statistic for a sequence of new signals. To apply Bayes’ rule, a joint distribution over signals and the distributions of events needs to be assumed. The joint distribution itself may be learned from among many, over which there is a second-order joint distribution. Extending the Bayesian updating to higher orders proceeds as above. The joint distribution may again be decomposed into a conditional over signals and a marginal over events. The conditional is known by definition for all orders, now including the first, because the probability of a signal is the probability of occurrence of an original event, which is given by the marginal distribution (the new event) over the original events.

Returning to the discussion of learning the joint distributions, only the first-order events affect decisions, so only the marginal distribution over first-order events matters directly. The joint distributions of higher orders and the first-order conditional distribution only matter through their influence on updating the first-order marginal distribution.

The marginal of order n is the distribution over the order-(n-1) joint distributions. After reducing compound lotteries, the marginal of order n is the average of the order-(n-1) joint distributions. This average is itself an order-(n-1) joint distribution, which may be split into an order-(n-1) marginal and conditional, where if n-1>=2, the conditional is known. If the conditional is known, then the marginal may again be reduced as a compound lottery. Thus the hierarchy of marginal distributions of all orders collapses to the first-order joint distribution. This takes us back to the start – learning the joint distribution. The discussion above about learning a (second-order) marginal distribution (the first-order joint distribution) also applies. The empirical frequencies of signal-event pairs are the signals. Applying Bayes’ rule with some prior over joint distributions constitutes regularisation of the empirical frequencies, preventing overfitting to limited data.
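
As an illustration of the last sentence, here is a sketch assuming a Dirichlet prior over the first-order joint distribution (one convenient choice of prior, not the only one): the posterior mean shrinks the empirical signal-event frequencies towards the prior mean, with the prior strength acting as the regularisation weight. The counts are made up.

```python
import numpy as np

# Observed counts of (event, signal) pairs; rows = events, columns = signals.
# A hypothetical small sample, so raw frequencies would overfit.
counts = np.array([[8, 1],
                   [0, 3]])

# Dirichlet prior over the joint distribution: uniform prior mean,
# prior_strength = number of "pseudo-observations" the prior is worth.
prior_mean = np.full(counts.shape, 1 / counts.size)
prior_strength = 4.0

# Posterior mean of the joint distribution = regularised empirical frequencies.
posterior_mean = (counts + prior_strength * prior_mean) / (counts.sum() + prior_strength)

print(np.round(posterior_mean, 3))
# The zero cell gets positive probability and the largest cell shrinks towards
# uniform, which is the regularisation that prevents overfitting to limited data.
```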

Regularisation is itself learned from previous learning tasks, specifically the risk of overfitting in similar learning tasks, i.e. how non-representative a limited data set generally is. Learning regularisation in turn requires a prior belief over the joint distributions of samples and population averages. Applying regularisation learned from past tasks to the current one uses a prior belief over how similar different learning tasks are.

Signalling the precision of one’s information with emphatic claims

Chats both online and in person seem to consist of confident claims which are either extreme absolute statements (“vaccines don’t work at all”, “you will never catch a cold if you take this supplement”, “artificial sweeteners cause cancer”) or profess no knowledge (“damned if I know”, “we will never know the truth”), sometimes blaming the lack of knowledge on external forces (“of course they don’t tell us the real reason”, “the security services are keeping those studies secret, of course”, “big business is hiding the truth”). Moderate statements that something may or may not be true, especially off the center of all-possibilities-equal, and expressions of personal uncertainty (“I have not studied this enough to form an opinion”, “I have not thought this through”) are almost absent. Other than in research and official reports, I seldom encounter statements of the form “these are the arguments in this direction and those are the arguments in that direction. This direction is somewhat stronger.” or “the balance of the evidence suggests x” or “x seems more likely than not-x”. In opinion pieces in various forms of media, the author may give arguments for both sides, but in that case, concludes something like “we cannot rule out this and we cannot rule out that”, “prediction is difficult, especially now in a rapidly changing world”, “anything may happen”. The conclusion of the opinion piece does not recommend a moderate course of action supported by the balance of moderate-quality evidence.

The same person confidently claims knowledge of an extreme statement on one topic and professes certainty of no knowledge at all on another. What could be the goal of making both extreme and no-knowledge statements confidently? If the person wanted to pretend to be well-informed, then confidence helps with that, but claiming no knowledge would be counterproductive. Blaming the lack of knowledge on external forces and claiming that the truth is unknowable or will never be discovered helps excuse one’s lack of knowledge. The person can then pretend to be informed to the best extent possible (a constrained maximum of knowledge) or at least know more than others (a relative maximum).

Extreme statements suggest to an approximately Bayesian audience that the claimer has received many precise signals in the direction of the extreme statement and as a result has updated the belief far from the average prior belief in society. Confident statements also suggest many precise signals to Bayesians. The audience does not need to be Bayesian to form these interpretations – updating in some way towards the signal is sufficient, as is a behavioural tendency to believe that confidence or extreme claims demonstrate the quality of the claimer’s information. A precisely estimated zero, such as confidently saying both x and not-x are equally likely, also signals good information. The same goes for being confident that the truth is unknowable.
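
A small numeric sketch (all numbers invented) of why a stated belief far from the prior suggests many precise signals to a Bayesian audience: with a normal prior and normal signals, the posterior mean moves far from the prior mean only when the number and precision of the signals are large.

```python
# Normal prior on some quantity, normal signals centred on the truth.
# The posterior mean is a precision-weighted average of the prior mean and
# the signal average, with weight n * signal_tau on the signals.
def posterior_mean(prior_m, prior_tau, signal_avg, signal_tau, n):
    w = n * signal_tau
    return (prior_tau * prior_m + w * signal_avg) / (prior_tau + w)

prior_m, prior_tau = 0.0, 1.0   # society's average prior belief
signal_avg = 1.0                # the signals point away from the prior

for n, tau in [(1, 0.2), (5, 0.2), (5, 5.0), (50, 5.0)]:
    print(n, tau, round(posterior_mean(prior_m, prior_tau, signal_avg, tau, n), 3))
# One imprecise signal moves the belief little; only many precise signals move
# it close to the extreme claim, so an extreme honestly stated belief is
# evidence of abundant precise information.
```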

Being perceived as having precise information helps influence others. If people believe that the claimer is well-informed and has interests more aligned than opposed to theirs, then it is rational to follow the claimer’s recommendation. Having influence is generally profitable. This explains the lack of moderate-confidence statements and claims of personal but not collective uncertainty.

A question that remains is why confident moderate statements are almost absent. Why not claim with certainty that 60% of the time, the drug works and 40% of the time, it doesn’t? Or confidently state that a third of the wage gap/racial bias/country development is explained by discrimination, a third by statistical discrimination or measurement error and a third by unknown factors that need further research? Confidence should still suggest precise information no matter what the statement is about.

Of course, if fools are confident and researchers honestly state their uncertainty, then the certainty of a statement shows the foolishness of the speaker. If confidence makes the audience believe the speaker is well-informed, then either the audience is irrational in a particular way or believes that the speaker’s confidence is correlated with the precision of the information in the particular dimension being talked about. If the audience has a long history of communication with the speaker, then they may have experience that the speaker is generally truthful, acts similarly across situations and expresses the correct level of confidence on unemotional topics. The audience may fail to notice when the speaker becomes a spreader of conspiracies or becomes emotionally involved in a topic and therefore is trying to persuade, not inform. If the audience is still relatively confident in the speaker’s honesty, then the speaker sways them more by confidence and extreme positions than by admitting uncertainty or a moderate viewpoint.

The communication described above may be modelled as the claimer conveying three-dimensional information with two two-dimensional signals. One dimension of the information is the extent to which the statement is true. For example, how beneficial is a drug or how harmful an additive. A second dimension is how uncertain the truth value of the statement is – whether the drug helps exactly 55% of patients or may help anywhere between 20 and 90%, between which all percentages are equally likely. A third dimension is the minimal attainable level of uncertainty – how much the truth is knowable in this question. This is related to whether some agency is actively hiding the truth or researchers have determined it and are trying to educate the population about it. The second and third dimensions are correlated. The lower the minimal attainable level of uncertainty, the more certain the truth value of the statement can be. It cannot be more certain than the laws of physics allow.

The two dimensions of one signal (the message of the claimer) are the extent to which the statement is true and how certain the claimer is of the truth value. Confidence emphasises that the claimer is certain about the truth value, regardless of whether this value is true or false. The claim itself is the first dimension of the signal. The reason the third dimension of the information is not part of the first signal is that the claim that the truth is unknowable is itself a second claim about the world, i.e. a second two-dimensional signal saying how much some agency is hiding or publicising the truth and how certain the speaker is of the direction and extent of the agency’s activity.

Opinion expressers in (social) media usually choose an extreme value for both dimensions of both signals. They claim some statement about the world is either the ultimate truth or completely false or unknowable and exactly in the middle, not a moderate distance to one side. In the second dimension of both signals, the opinionated people express complete certainty. If the first signal says the statement is true or false, then the second signal is not sent and is not needed, because if there is complete certainty of the truth value of the statement, then the statement must be perfectly knowable. If the first signal says the statement is fifty-fifty (the speaker does not know whether true or false), then in the second signal, the speaker claims that the truth is absolutely not knowable. This excuses the speaker’s claimed lack of knowledge as due to an objective impossibility, instead of the speaker’s limited data and understanding.

Clinical trials of other drugs in other species to predict a drug’s effect in humans

Suppose we want to know whether a drug is safe or effective for humans, but do not have data on what it does in humans, only on its effects in mice, rats, rhesus macaques and chimpanzees. In general, we can predict the effect of the drug on humans better with the animal data than without it. Information on “nearby” realisations of a random variable (effect of the drug) helps predict the realisation we are interested in. The method should weight nearby observations more than observations further away when predicting. For example, if the drug has a positive effect in animals, then the method predicts a positive effect in humans, and the larger the effect in animals, the greater the predicted effect in humans.

A limitation of weighting is that it does not take into account the slope of the effect when moving from further observations to nearer. For example, a very large effect of the drug in mice and rats but a small effect in macaques and chimpanzees predicts the same effect in humans as a small effect in rodents and a large one in monkeys and apes, if the weighted average effect across animals is the same in both cases. However, intuitively, the first case should have a smaller predicted effect in humans than the second, because moving to animals more similar to humans, the effect becomes smaller in the first case but larger in the second. The idea is similar to a proportional-integral-derivative (PID) controller in engineering.
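
A toy illustration of the limitation, with made-up effect sizes: two patterns of animal effects with the same average but opposite slopes along genetic distance give the same weighted prediction, while a crude slope adjustment, loosely analogous to the derivative term of a PID controller, separates them.

```python
import numpy as np

# Species ordered from genetically distant to close to humans:
# mouse, rat, macaque, chimpanzee. Equal weights here so that both
# hypothetical effect patterns have the same weighted average.
weights = np.array([0.25, 0.25, 0.25, 0.25])

effects_declining = np.array([0.9, 0.8, 0.3, 0.2])  # large in rodents, small in primates
effects_rising    = np.array([0.2, 0.3, 0.8, 0.9])  # small in rodents, large in primates

for effects in (effects_declining, effects_rising):
    weighted = float(weights @ effects)
    # Crude "derivative" term: average change per step towards humans.
    slope = float(np.diff(effects).mean())
    adjusted = weighted + slope  # the slope coefficient of 1 is arbitrary
    print(round(weighted, 2), round(slope, 2), round(adjusted, 2))
# Both patterns give the same weighted average (0.55), but the slope-adjusted
# prediction is lower for the declining pattern and higher for the rising one.
```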

The slope of the effect of the drug is extra information that increases the predictive power of the method if the assumption that the similarity of effects decreases in genetic distance holds. Of course, if this assumption fails in the data, then imposing it may result in bias.

Assumptions may be imposed on the method using constrained estimation. One constraint is the monotonicity of the effect in some measure of distance between observations. The method may allow for varying weights by adding interaction terms (e.g., the effect of the drug times genetic similarity). The interaction terms unfortunately require more data to estimate.

Extraneous information about the slope of the effect helps justify the constraints and reduces the need for adding interaction terms, thus decreases the data requirement. An example of such extra information is whether the effects of other drugs that have been tested in these animals as well as humans were monotone in genetic distance. Using information about these other drugs imposes the assumption that the slopes of the effects of different drugs are similar. The similarity of the slopes should intuitively depend on the chemical similarity of the drugs, with more distant drugs having more different profiles of effects across animals.

The similarity of species in terms of the effects drugs have on them need not correspond to genetic similarity or the closeness of any other observable characteristic of these organisms, although these similarities often coincide. The similarity of interest is how similar the effects of the drug are across these species. Estimating this similarity based on the similarity of other drugs across these animals may also be done by a weighted regression, perhaps with constraints or added interaction terms. More power for the estimation may be obtained from simultaneous estimation of the drug-effect-similarity of the species and the effect of the drug in humans. An analogy is demand and supply estimation in industrial organisation where observations about each side of the market give information about the other side. Another analogy is duality in mathematics, in this case between the drug-effect-similarity of the species and the given drug’s similarity of effects across these species.

The similarity of drugs in terms of their effects on each species need not correspond to chemical similarity, although it often does. The similarity of interest for the drugs is how similar their effects are in humans, and also in other species.

The inputs into the joint estimation of drug similarity, species similarity and the effect of the given drug in humans are the genetic similarity of the species, the chemical similarity of the drugs and the effects for all drug-species pairs that have been tested. In the matrix where the rows are the drugs and the columns the species, we are interested in filling in the cell in the row “drug of interest” and the column “human”. The values in all the other cells are informative about this cell. In other words, there is a benefit from filling in these other cells of the matrix.
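
One possible way to “fill in the other cells”, sketched here with made-up numbers: treat the drug-by-species effect matrix as approximately low-rank and impute the missing entries iteratively from a truncated SVD. This is only one formalisation of exploiting the similarity structure among drugs and among species, not the only one.

```python
import numpy as np

# Rows = drugs, columns = species (..., human). np.nan marks untested pairs.
# All effect values are hypothetical.
Y = np.array([
    [0.8, 0.7, 0.5, 0.4, 0.3],
    [0.6, 0.6, 0.4, 0.3, np.nan],
    [0.2, 0.3, 0.5, 0.6, 0.7],
    [0.3, 0.3, 0.6, 0.7, np.nan],   # last row: the drug of interest
])

mask = ~np.isnan(Y)
filled = np.where(mask, Y, np.nanmean(Y))   # start from the grand mean

for _ in range(200):
    # Rank-2 approximation of the current completed matrix.
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    low_rank = (U[:, :2] * s[:2]) @ Vt[:2, :]
    # Keep the observed cells, update only the missing ones.
    filled = np.where(mask, Y, low_rank)

print(np.round(filled, 2))  # the bottom-right cell is the prediction of interest
```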

Given the duality of drugs and species in the drug effect matrix, there is information to be gained from running clinical trials of chemically similar human-use-approved drugs in species in which the drug of interest has been tested but the chemically similar ones have not. The information is directly about the drug-effect-similarity of these species to humans, which indirectly helps predict the effect of the drug of interest in humans from the effects of it in other species. In summary, testing other drugs in other species is informative about what a given drug does in humans. Adapting methods from supply and demand estimation, or otherwise combining all the data in a principled theoretical framework, may increase the information gain from these other clinical trials.

Extending the reasoning, each (species, drug) pair has some unknown similarity to the (human, drug of interest) pair. A weighted method to predict the effect in the (human, drug of interest) pair may gain power from constraints that the similarity of different (species, drug) pairs increases in the genetic closeness of the species and the chemical closeness of the drugs.

Define Y_{sd} as the effect of drug d in species s. Define X_{si} as the observable characteristic (gene) i of species s. Define X_{dj} as the observable characteristic (chemical property) j of drug d. The simplest method is to regress Y_{sd} on all the X_{si} and X_{dj} and use the coefficients to predict the Y_{sd} of the (human, drug of interest) pair. If there are many characteristics i and j and few observations Y_{sd}, then variable selection or regularisation is needed. Constraints may be imposed, like X_{si}=X_i for all s and X_{dj}=X_j for all d.
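
A minimal version of this regression, with made-up characteristics and effects; ridge regularisation stands in for the variable selection or regularisation mentioned above (a LASSO variant could replace it).

```python
import numpy as np

# Hypothetical data: 4 species (the last one standing in for humans) with 2
# characteristics X_si each, and 3 drugs (the last one the drug of interest)
# with 2 characteristics X_dj each.
X_species = np.array([[0.1, 0.9], [0.2, 0.8], [0.7, 0.3], [0.8, 0.2]])
X_drugs = np.array([[1.0, 0.2], [0.9, 0.3], [0.2, 0.9]])
true_w = np.array([0.5, -0.2, 0.8, 0.1])   # used only to simulate the Y_sd

rows, effects = [], []
rng = np.random.default_rng(0)
for s in range(4):
    for d in range(3):
        if (s, d) == (3, 2):
            continue                      # the untested (human, drug of interest) pair
        x = np.concatenate([X_species[s], X_drugs[d]])
        rows.append(x)
        effects.append(x @ true_w + rng.normal(0, 0.05))
X, y = np.array(rows), np.array(effects)

# Ridge regression: (X'X + lambda*I)^(-1) X'y; the penalty handles having
# many characteristics relative to the number of tested (species, drug) pairs.
lam = 0.1
w_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

x_new = np.concatenate([X_species[3], X_drugs[2]])   # (human, drug of interest)
print(round(float(x_new @ w_hat), 3))                # predicted effect in humans
```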

Fused LASSO (least absolute shrinkage and selection operator), clustered LASSO and prior LASSO seem related to the above method.

Dilution effect explained by signalling

Signalling confidence in one’s arguments explains the dilution effect in marketing and persuasion. The dilution effect is that the audience averages the strength of a persuader’s arguments instead of adding the strengths. More arguments in favour of a position should intuitively increase the confidence in the correctness of this position, but empirically, adding weak arguments reduces people’s belief, which is why drug advertisements on US late-night TV list mild side effects in addition to serious ones. The target audience of these ads worries less about side effects when the ad mentions more slight problems with the drug, although additional side effects, whether weak or strong, should make the drug worse.

A persuader who believes her first argument to be strong enough to convince everyone does not waste valuable time adding other arguments. Listeners evaluate arguments partly by the confidence they believe the speaker has in these claims. This is rational Bayesian updating because a speaker’s conviction in the correctness of what she says is positively correlated with the actual validity of the claims.

A countervailing effect is that a speaker with many arguments has spent significant time studying the issue, so knows more precisely what the correct action is. If the listeners believe the bias of the persuader to be small or against the action that the arguments favour, then the audience should rationally believe a better-informed speaker more.

An effect in the same direction as dilution is that a speaker with many arguments in favour of a choice strongly prefers the listeners to choose it, i.e. is more biased. Then the listeners should respond less to the persuader’s effort. In the limit where the speaker’s only goal is always for the audience to comply, at any time cost of persuasion, the listeners should ignore the speaker, because a constant signal carries no information.

Modelling

Start with the standard model of signalling by information provision and then add countersignalling.

The listeners choose either to do what the persuader wants or not. The persuader receives a benefit B if the listeners comply, otherwise receives zero.

The persuader always presents her first argument, otherwise reveals that she has no arguments, which ends the game with the listeners not doing what the persuader wants. The persuader chooses whether to spend time at cost c>0, c<B to present her second argument, which may be strong or weak. The persuader knows the strength of the second argument but the listeners only have the common prior belief that the probability of a strong second argument is p0. If the second argument is strong, then the persuader is confident, otherwise not.

If the persuader does not present the second argument, then the listeners receive an exogenous private signal in {1,0} about the persuader’s confidence, e.g. via her subconscious body language. The probabilities of the signals are Pr(1|confident) = Pr(0|not confident) = q > 1/2. If the persuader presents the second argument, then the listeners learn the confidence with certainty and can ignore any signals about it. Denote by p1 the updated probability that the audience puts on the second argument being strong.

If the speaker presents a strong second argument, then p1 = 1; if the speaker presents a weak argument, then p1 = 0. If the speaker presents no second argument, then after signal 1, the audience updates their belief to p1(1) = p0*q/(p0*q + (1-p0)*(1-q)) > p0, and after signal 0, to p1(0) = p0*(1-q)/(p0*(1-q) + (1-p0)*q) < p0.

The listeners prefer to comply (take action a=1) when the second argument of the persuader is strong, otherwise prefer not to do what the persuader wants (action a=0). At the prior belief p0, the listeners prefer not to comply; the payoffs below assume that they comply after exogenous signal 1 but not after signal 0. Therefore a persuader with a strong second argument chooses max{B*1 - c, q*B*1 + (1-q)*B*0} and presents the argument iff (1-q)*B > c. A persuader with a weak argument chooses max{B*0 - c, (1-q)*B*1 + q*B*0}, so always chooses not to present the argument. If a confident persuader chooses not to present the argument, then the listeners use the exogenous signal, otherwise they use the choice of presentation to infer the type of the persuader.
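
A numeric check of the posteriors and the presentation decisions, with hypothetical parameter values (and using the assumption built into the payoffs that the listeners comply after exogenous signal 1 but not after signal 0).

```python
# Hypothetical parameters: benefit B, presentation cost c, prior p0, signal accuracy q.
B, c, p0, q = 1.0, 0.1, 0.4, 0.8

# Listeners' posterior that the second argument is strong, given no presentation
# and the exogenous signal.
p1_after_1 = p0 * q / (p0 * q + (1 - p0) * (1 - q))
p1_after_0 = p0 * (1 - q) / (p0 * (1 - q) + (1 - p0) * q)
print(round(p1_after_1, 3), round(p1_after_0, 3))   # ~0.727 and ~0.143, around p0 = 0.4

# Persuader with a strong second argument: present iff B - c > q*B, i.e. (1-q)*B > c.
presents_if_strong = (1 - q) * B > c
# Persuader with a weak argument: -c < (1-q)*B always, so never presents.
presents_if_weak = (0 - c) > (1 - q) * B

print(presents_if_strong, presents_if_weak)   # True, False for these parameters
```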

One extension is that presenting the argument still leaves some doubt about its strength.

Another extension has many argument strength levels, so each type of persuader sometimes presents the second argument, sometimes not.

In this standard model, if the second argument is presented, then always by the confident type. As is intuitive, the second argument increases the belief of the listeners that the persuader is right. Adding countersignalling partly reverses the intuition – a very confident type of the persuader knows that the first argument already reveals her great confidence, so the listeners do what the very confident persuader wants. The very confident type never presents the second argument, so if the confident type chooses to present it, then the extra argument reduces the belief of the audience in the correctness of the persuader. However, compared to the least confident type who also never presents the second argument, the confident type’s second argument increases the belief of the listeners.

If top people have families and hobbies, then success is not about productivity

Assume:

1 Productivity is continuous and weakly increasing in talent and effort.

2 The sum of efforts allocated to all activities is bounded, and this bound is similar across people.

3 Families and hobbies take some effort, thus less is left for work. (For this assumption to hold, it may be necessary to focus on families with children in which the partner is working in a different field. Otherwise, a stay-at-home partner may take care of the cooking and cleaning, freeing up time for the working spouse to allocate to work. A partner in the same field of work may provide a collaboration synergy. In both cases, the productivity of the top person in question may increase.)

4 The talent distribution is similar for people with and without families or hobbies. This assumption would be violated if for example talented people are much better at finding a partner and starting a family.

Under these assumptions, reasonably rational people would be more productive without families or hobbies. If success is mostly determined by productivity, then people without families should be more successful on average. In other words, most top people in any endeavour would not have families or hobbies that take time away from work.

In short, if responsibilities and distractions cause lower productivity, and productivity causes success, then success is negatively correlated with such distractions. Therefore, if successful people have families with a similar or greater frequency as the general population, then success is not driven by productivity.

One counterargument is that people first become successful and then start families. In order for this to explain the similar fractions of singles among top and bottom achievers, the rate of family formation after success must be much greater than among the unsuccessful, because catching up from a late start requires a higher rate of increase.

Another explanation is irrationality of a specific form – one which reduces the productivity of high effort significantly below that of medium effort. Then single people with lots of time for work would produce less through their high effort than those with families and hobbies via their medium effort. Productivity per hour naturally falls with increasing hours, but the issue here is total output (the hours times the per-hour productivity). An extra work hour has to contribute negatively to success to explain the lack of family-success correlation. One mechanism for a negative effect of hours on output is burnout of workaholics. For this explanation, people have to be irrational enough to keep working even when their total output falls as a result.

If the above explanations seem unlikely but the assumptions reasonable in a given field of human endeavour, then reaching the top and staying there is mostly not about productivity (talent and effort) in this field. For example, in academic research.

A related empirical test of whether success in a given field is caused by productivity is to check whether people from countries or groups that rank as more corrupt on corruption indices disproportionately succeed in this field. Either conditional on entering the field or unconditionally. In academia, in fields where convincing others is more important than the objective correctness of one’s results, people from more nepotist cultures should have an advantage. The same applies to journals – the general interest ones care relatively more about a good story, the field journals more about correctness. Do people from more corrupt countries publish relatively more in general interest journals, given their total publications? Of course, conditional on their observable characteristics like the current country of employment.

Another related test for meritocracy in academia or the R&D industry is whether coauthored publications and patents are divided by the number of coauthors in their influence on salaries and promotions. If there is an established ranking of institutions or job titles, then do those at higher ranks have more quality-weighted coauthor-divided articles and patents? The quality-weighting is the difficult part, because usually there is no independent measure of quality (unaffected by the dependent variable, be it promotions, salary, publication venue).

Putting your money where your mouth is in policy debates

Climate change deniers should put their money where their mouth is by buying property in low-lying coastal areas or investing in drought-prone farmland. Symmetrically, those who believe the Earth is warming as a result of pollution should short sell climate-vulnerable assets. Then everyone eventually receives the financial consequences of their decisions and claimed beliefs. The sincere would be happy to bet on their beliefs, anticipating positive profit. Of course, the beliefs have to be somewhat dogmatic or the individuals in question risk-loving, otherwise the no-agreeing-to-disagree theorem would preclude speculative trade (opposite bets on a common event).

Governments tend to compensate people for widespread damage from natural disasters, because distributing aid is politically popular and there is strong lobbying for this free insurance. This insulates climate change deniers against the downside risk of buying flood- or wildfire-prone property. To prevent the cost of the damages from being passed to the taxpayers, the deniers should be required to buy insurance against disaster risk, or to sign contracts with (representatives of) the rest of society agreeing to transfer to others the amount of any government compensation they receive after flood, drought or wildfire. Similarly, those who short sell assets that lose value under a warming climate (or buy property that appreciates, like Arctic ports, under-ice mining and drilling rights) should not be compensated for the lost profit if the warming does not take place.

In general, forcing people to put their money where their mouth is would avoid wasting time on long useless debates (e.g. do high taxes reduce economic growth, does a high minimum wage raise unemployment, do tough punishments deter crime). Approximately rational people would doubt the sincerity of anyone who is not willing to bet on her or his beliefs, so one’s credibility would be tied to one’s skin in the game: a stake in the claim signals sincerity. Currently, it costs pundits almost nothing to make various claims in the media – past wrong statements are quickly forgotten, not impacting the reputation for accuracy much. 

The bets on beliefs need to be legally enforceable, so have to be made on objectively measurable events, such as the value of a publicly traded asset. By contrast, it is difficult to verify whether government funding for the arts benefits culture, or whether free public education is good for civil society, therefore bets on such claims would lead to legal battles. The lack of enforceability would reduce the penalty for making false statements, thus would not deter lying or shorten debates much.

An additional benefit from betting on (claimed) beliefs is to provide insurance to those harmed by the actions driven by these beliefs. For example, climate change deniers claim small harm from air pollution. Their purchases of property that will be damaged by a warming world allow climate change believers to short sell such assets. If the Earth then warms, then the deniers lose money and the believers gain at their expense. This at least partially compensates the believers for the damage caused by the actions of the deniers.

Why rational agents may react negatively to honesty

Emotional people may of course dislike an honest person, just because his truthful opinion hurt their feelings. In contrast, rational agents’ payoff cannot decrease when they get additional information, so they always benefit from honest feedback. However, rational decision makers may still adjust their attitude to be more negative towards a person making truthful, informative statements. The reason is Bayesian updating about two dimensions: the honesty of the person and how much the person cares about the audience’s feelings. Both dimensions of belief positively affect attitude towards the person. His truthful statements increase rational listeners’ belief about his honesty, but may reduce belief in his tactfulness, which may shift rational agents’ opinions strongly enough in the negative direction to outweigh the benefit from honesty.

The relative effect of information about how much the person cares, compared to news about his honesty, is greater when the latter is relatively more certain. In the limit, if the audience is completely convinced that the person is honest (or certain of his dishonesty), then the belief about his honesty stays constant no matter what he does, and only the belief about tact moves. Then telling an unpleasant truth unambiguously worsens the audience’s attitude. Thus if a reasonably rational listener accuses a speaker of “brutal honesty” or tactlessness, then it signals that the listener is relatively convinced either that the speaker is a liar or that he is a trustworthy type. Therefore an accusation of tactlessness may be taken as an insult or a compliment, depending on one’s belief about the accuser’s belief about one’s honesty.
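
A stylised numeric sketch of the two-dimensional updating, with invented likelihoods and an attitude that simply adds the two beliefs: when the honesty belief is uncertain, a blunt truthful remark can raise the attitude, but when the honesty belief starts near certainty, the remark moves mostly the caring belief and the attitude falls.

```python
def update(prior, lik_if_yes, lik_if_no):
    """Bayes' rule for a binary trait given one observation."""
    return prior * lik_if_yes / (prior * lik_if_yes + (1 - prior) * lik_if_no)

# Observation: a blunt, unpleasant but truthful remark. Invented likelihoods:
# honest types are much more likely to produce it than dishonest types,
# caring (tactful) types are somewhat less likely than uncaring types.
lik_honest, lik_dishonest = 0.6, 0.1
lik_caring, lik_uncaring = 0.2, 0.4

def attitude(p_honest, p_caring):
    return p_honest + p_caring   # increasing in both beliefs, equal weights for simplicity

for p_honest0 in (0.5, 0.99):    # uncertain vs nearly certain about honesty
    p_caring0 = 0.5
    p_honest1 = update(p_honest0, lik_honest, lik_dishonest)
    p_caring1 = update(p_caring0, lik_caring, lik_uncaring)
    print(round(attitude(p_honest0, p_caring0), 2), "->",
          round(attitude(p_honest1, p_caring1), 2))
# With the uncertain honesty belief the attitude rises (1.0 -> 1.19); with honesty
# nearly certain, the honesty belief barely moves and the attitude falls (1.49 -> 1.33).
```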

If tact takes effort, and the cost of this effort is lower for those who care about the audience’s emotions, then pleasant comments are an informative signal (in the Spence signalling sense) that the speaker cares about the feelings of others. In that case the inference that brutal honesty implies an uncaring nature is correct.

On the other hand, if the utility of rational agents only depends on the information content of statements, not directly on their positive or negative emotional tone, then the rational agents should not care about the tact of the speaker. In this case, there is neither a direct reason for the speaker to avoid unpleasant truths (out of altruism towards the audience), nor an indirect benefit from signalling tactfulness. Attitudes would only depend on one dimension of belief: the one about honesty. Then truthfulness cannot have a negative effect.

Higher order beliefs may still cause honesty to be interpreted negatively even when rational agents’ utility does not depend on the emotional content of statements. The rational listeners may believe that the speaker believes that the audience’s feelings would be hurt by negative comments (for example, the speaker puts positive probability on irrational listeners, or on their utility directly depending on the tone of the statements they hear), in which case tactless truthtelling still signals not caring about others’ emotions.

On the optimal burden of proof

All claims should be considered false until proven otherwise, because lies can be invented much faster than refuted. In other words, the maker of a claim has the burden of providing high-quality scientific proof, for example by referencing previous research on the subject. Strangely enough, some people seem to believe marketing, political spin and conspiracy theories even after such claims have been proven false. It remains to wish that everyone received the consequences of their choices (so that karma works).

Considering all claims false until proven otherwise runs into a logical problem: a claim and its opposite claim cannot be simultaneously false. The priority for falsity should be given to actively made claims, e.g. someone saying that a product or a policy works, or that there is a conspiracy behind an accident. Especially suspect are claims that benefit their maker if people believe them. A higher probability of falsity should also be attached to positive claims, e.g. that something has an effect in whatever direction (as opposed to no effect) or that an event is due to non-obvious causes, not chance. The lack of an effect should be the null hypothesis. Similarly, ignorance and carelessness, not malice, should be the default explanation for bad events.

Sometimes two opposing claims are actively made and belief in them benefits their makers, e.g. in politics or when competing products are marketed. This is the hardest case to find the truth in, but a partial and probabilistic solution is possible. Until rigorous proof is found, one should keep an open mind. Keeping an open mind creates a vulnerability to manipulation: after some claim is proven false, its proponents often try to defend it by asking its opponents to keep an open mind, i.e. ignore evidence. In such cases, the mind should be closed to the claim until its proponents provide enough counter-evidence for a neutral view to be reasonable again.

To find which opposing claim is true, the first test is logic. If a claim is logically inconsistent with itself, then it is false by syntactic reasoning alone. A broader test is whether the claim is consistent with other claims of the same person. For example, Vladimir Putin said that there were no Russian soldiers in Crimea, but a month later gave medals to some Russian soldiers, citing their successful operation in Crimea. At least one of the claims must be false, because either there were Russian soldiers in Crimea or not. The way people try to weasel out of such self-contradictions is to say that the two claims referred to different time periods, definitions or circumstances. In other words, change the interpretation of words. A difficulty for the truth-seeker is that sometimes such a change in interpretation is a legitimate clarification. Tongues do slip. Nonetheless, a contradiction is probabilistic evidence for lying.

The second test for falsity is objective evidence. If there is a streetfight and the two sides accuse each other of starting it, then sometimes a security camera video can refute one of the contradicting claims. What evidence is objective is, sadly, subject to interpretation. Videos can be photoshopped, though it is difficult and time-consuming. The objectivity of the evidence is strongly positively correlated with the scientific rigour of its collection process. “Hard” evidence is a signal of the truth, but a probabilistic signal. In this world, most signals are probabilistic.

The third test of falsity is the testimony of neutral observers, preferably several of them, because people misperceive and misremember even under the best intentions. The neutrality of observers is again up for debate and interpretation. In some cases, an observer is a statistics-gathering organisation. Just like objective evidence, testimony and statistics are probabilistic signals.

The fourth test of falsity is the testimony of interested parties, to which the above caveats apply even more strongly.

Integrating conflicting evidence should use Bayes’ rule, because it keeps probabilities consistent. Consistency helps glean information about one aspect of the question from data on other aspects. Background knowledge should be combined with the evidence, for example by ruling out physical impossibilities. If a camera shows a car disappearing behind a corner and immediately reappearing, moving in the opposite direction, then physics says that the original car couldn’t have changed direction so fast. The appearing car must be a different one. Knowledge of human interactions and psychology is part of the background information, e.g. if smaller, weaker and outnumbered people rarely attack the stronger and more numerous, then this provides probabilistic info about who started a fight. Legal theory incorporates background knowledge of human nature to get information about the crime – human nature suggests motives. Asking: “Who benefits?” has a long history in law.
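
A small sketch, with invented numbers, of combining the probabilistic signals above by Bayes’ rule: camera footage and a neutral witness, assumed conditionally independent given the truth, combined with a prior drawn from background knowledge.

```python
# Hypothesis: person A started the fight. The prior comes from background
# knowledge of human nature (motives, who usually attacks whom); it is invented here.
prior_A = 0.3

# Two probabilistic signals, both pointing at A, assumed conditionally
# independent given who actually started it. Invented reliabilities.
signals = [
    (0.9, 0.1),   # camera footage: P(points at A | A started), P(points at A | A did not)
    (0.7, 0.3),   # neutral witness
]

odds = prior_A / (1 - prior_A)
for lik_if_true, lik_if_false in signals:
    odds *= lik_if_true / lik_if_false    # multiply the odds by each likelihood ratio
posterior_A = odds / (1 + odds)

print(round(posterior_A, 3))   # 0.9: strong, but still probabilistic, evidence
```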

On simple answers

Bayes’ rule exercise: is a simple or a complicated answer to a complicated problem more likely to be correct?

Depends on the conditional probabilities: if simple questions are more likely to have simple answers and complex questions complicated, then a complicated answer is more likely to be correct for a complicated problem.
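
A worked version of the exercise with made-up probabilities: the correct answer to a complicated problem is complicated with probability 0.8, an incorrect candidate answer is complicated or simple with equal probability, and a candidate answer is correct with prior probability 0.5.

```python
# All probabilities are hypothetical and refer to a complicated problem.
p_correct = 0.5                    # prior that a proposed answer is correct
p_complex_if_correct = 0.8         # correct answers to hard problems tend to be complex
p_complex_if_incorrect = 0.5       # for incorrect answers, complexity is uninformative

def posterior_correct(answer_is_complicated):
    lik_c = p_complex_if_correct if answer_is_complicated else 1 - p_complex_if_correct
    lik_i = p_complex_if_incorrect if answer_is_complicated else 1 - p_complex_if_incorrect
    return p_correct * lik_c / (p_correct * lik_c + (1 - p_correct) * lik_i)

print(round(posterior_correct(True), 3))    # ~0.615: a complicated answer is more likely correct
print(round(posterior_correct(False), 3))   # ~0.286: a simple answer is less likely correct
```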

It seems reasonable that the complexity of the answer is correlated with the difficulty of the problem. But this is an empirical question.

If difficult problems are likely to have complex answers, then this is an argument against slogans and ideologies. These seek to give a catchy one-liner as the answer to many problems in society. No need to think – ideology has the solution. Depending on your political leaning, poverty may be due to laziness or exploitation. The foreign policy “solution” is bombing for some, eternal appeasement for others.

The probabilistic preference for complex answers in complicated situations seems to contradict Occam’s razor (among answers equally good at explaining the facts, the simplest answer should be chosen). There is no actual conflict with the above Bayesian exercise. There, the expectation of a complex answer applies to complicated questions, while a symmetric anticipation of a simple answer holds for simple problems. The answers compared are not equally good, because one fits the structure of the question better than the other.

Which ideology is more likely to be wrong?

Exercise in Bayes’ rule: is an ideology more likely to be wrong if it appeals relatively more to poor people than the rich?

More manipulable folks are more likely to lose their money, so less likely to be rich. Stupid people have a lower probability of making money. By Bayes, the rich are on average less manipulable and more intelligent than the poor.

Less manipulable people are less likely to find an ideology built on fallacies appealing. By Bayes, an ideology relatively more appealing to the stupid and credulous is more likely to be wrong. Due to such people being poor with a higher probability, an ideology embraced more by the poor than the rich is more likely to be fallacious.
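
A numeric sketch of this chain of reasoning, with hypothetical fractions: credulous people are more often poor and more often find a fallacious ideology appealing, so a randomly drawn poor supporter is (weak) evidence that the ideology is wrong.

```python
# Hypothetical population fractions. Appeal depends on credulousness and on
# whether the ideology is wrong; wealth depends on credulousness.
p_credulous = 0.5
p_poor_given_credulous, p_poor_given_shrewd = 0.7, 0.4
p_appeal = {   # P(finds the ideology appealing | credulousness, ideology is wrong?)
    ("credulous", True): 0.6, ("shrewd", True): 0.2,
    ("credulous", False): 0.4, ("shrewd", False): 0.5,
}

def p_poor_among_supporters(wrong):
    cred = p_credulous * p_appeal[("credulous", wrong)]
    shrewd = (1 - p_credulous) * p_appeal[("shrewd", wrong)]
    return (cred * p_poor_given_credulous + shrewd * p_poor_given_shrewd) / (cred + shrewd)

# Signal: a randomly drawn supporter of the ideology turns out to be poor.
prior_wrong = 0.5
lik_wrong, lik_right = p_poor_among_supporters(True), p_poor_among_supporters(False)
posterior_wrong = prior_wrong * lik_wrong / (prior_wrong * lik_wrong + (1 - prior_wrong) * lik_right)

print(round(lik_wrong, 3), round(lik_right, 3), round(posterior_wrong, 3))
# Supporters of the wrong ideology are poor more often (0.625 vs 0.533), so a poor
# supporter raises the probability that the ideology is wrong (here to ~0.54).
```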

Another exercise: is an ideology more likely to be wrong if academics like it relatively more than non-academics?

Smarter people are more likely to become academics, so by Bayes’ rule, academics are more likely to be smart. Intelligent people have a relatively higher probability of liking a correct ideology, so by Bayes, an ideology appealing to the intelligent is more likely to be correct. An ideology liked by academics is correct with a higher probability.