
Leader turnover due to organisation performance is underestimated

Berry and Fowler (2021) “Leadership or luck? Randomization inference for leader effects in politics, business, and sports” in Science Advances propose a method they call RIFLE for testing the null hypothesis that leaders have no effect on organisation performance. The method is robust to serial correlation in outcomes and leaders, but not to endogenous leader turnover, as Berry and Fowler honestly point out. The endogeneity is that the organisation’s performance influences the probability that the leader is replaced (economic growth causes voters to keep a politician in office, losing games causes a team to replace its coach).

To test whether such endogeneity is a significant problem for their results, Berry and Fowler regress the turnover probability on various measures of organisational performance. They find small effects, but this underestimates the endogeneity problem, because Berry and Fowler use linear regression, forcing the effect of performance on turnover to be monotone and linear.

If leader turnover is increased by both success (the leader gets a better job elsewhere if the organisation performs well, so quits voluntarily) and failure (the leader is fired for the organisation’s bad performance), then the relationship between turnover and performance is U-shaped. Average leaders keep their jobs, while bad and good ones move elsewhere. A linear regression finds a near-zero effect in this case even if the true effect is large. How close the regression coefficient is to zero depends on how symmetric the effects of good and bad performance on leader transition are, not on how large these effects are.
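The point about symmetry can be illustrated with a small simulation. This is only a sketch under made-up assumptions: performance is standard normal, and the turnover probability is a baseline of 0.05 plus a quadratic term for bad performance (weight `fire_weight`) and for good performance (weight `poach_weight`). None of these numbers or names come from Berry and Fowler.

```python
import random


def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate(n, fire_weight, poach_weight, seed=0):
    """Draw performance x ~ N(0, 1) and a 0/1 turnover indicator.

    Hypothetical turnover probability (an assumption of this sketch):
    0.05 + fire_weight * min(x, 0)^2 + poach_weight * max(x, 0)^2, capped at 1.
    """
    rng = random.Random(seed)
    performance, turnover = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        prob = 0.05 + fire_weight * min(x, 0.0) ** 2 + poach_weight * max(x, 0.0) ** 2
        performance.append(x)
        turnover.append(1.0 if rng.random() < min(prob, 1.0) else 0.0)
    return performance, turnover


def turnover_rate(performance, turnover, keep):
    """Average turnover among observations whose performance satisfies keep(x)."""
    kept = [t for x, t in zip(performance, turnover) if keep(x)]
    return sum(kept) / len(kept)


# Symmetric U-shape: firing and poaching respond equally strongly.
x_sym, t_sym = simulate(100_000, fire_weight=0.15, poach_weight=0.15)
# Monotone case: only bad performance raises turnover.
x_asym, t_asym = simulate(100_000, fire_weight=0.15, poach_weight=0.0)

print("OLS slope, symmetric U-shape:", ols_slope(x_sym, t_sym))
print("OLS slope, firing only:     ", ols_slope(x_asym, t_asym))
print("Turnover, average performers:", turnover_rate(x_sym, t_sym, lambda x: abs(x) < 0.5))
print("Turnover, extreme performers:", turnover_rate(x_sym, t_sym, lambda x: abs(x) > 1.5))
```

In the symmetric case, the slope should come out close to zero even though extreme performers turn over far more often than average ones. In the firing-only case, the slope should be clearly negative.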

The problem for the RIFLE method of Berry and Fowler is that the small apparent effect of organisation performance on leader turnover from OLS regression misses the endogeneity in leader transitions. Such endogeneity biases RIFLE, as Berry and Fowler admit in their paper.

The endogeneity may explain why Berry and Fowler find stronger leader effects in sports (coaches in various US sports) than in business (CEOs) and politics (mayors, governors, heads of government). A sports coach may experience more asymmetry in the transition probabilities for good and bad performance than a politician. For example, if teams fire coaches after bad performance much more frequently than they poach coaches from well-performing competing teams, then the effect of performance on turnover is close to monotone: bad performance causes firing. OLS discovers this monotone effect. On the other hand, if politicians move with equal likelihood after exceptionally good and exceptionally bad performance of the administrative units they lead, then linear regression finds no effect of performance on turnover. The regression then misses the endogeneity that biases RIFLE, and without this bias, RIFLE might show a large leader effect in politics as well.

The unreasonably large effect of governors on crime (the governor effect explains 18-20% of the variation in both property and violent crime) and the difference between the zero effect of mayors on crime and the large effect of governors that Berry and Fowler find make me suspect that something is wrong with that particular analysis in their paper. In a checks-and-balances system, the governor should not have that large an influence on the state’s crime. A mayor works more closely with the local police, so would be expected to have more influence on crime.

Dilution effect explained by signalling

Signalling confidence in one’s arguments explains the dilution effect in marketing and persuasion. The dilution effect is that the audience averages the strength of a persuader’s arguments instead of adding the strengths. More arguments in favour of a position should intuitively increase the confidence in the correctness of this position, but empirically, adding weak arguments reduces people’s belief, which is why drug advertisements on US late-night TV list mild side effects in addition to serious ones. The target audience of these ads worries less about side effects when the ad mentions more slight problems with the drug, although additional side effects, whether weak or strong, should make the drug worse.

A persuader who believes her first argument to be strong enough to convince everyone does not waste valuable time adding other arguments. Listeners evaluate arguments partly by the confidence they believe the speaker has in these claims. This is rational Bayesian updating because a speaker’s conviction in the correctness of what she says is positively correlated with the actual validity of the claims.

A countervailing effect is that a speaker with many arguments has spent significant time studying the issue, so knows more precisely what the correct action is. If the listeners believe the bias of the persuader to be small or against the action that the arguments favour, then the audience should rationally believe a better-informed speaker more.

An effect in the same direction as dilution is that a speaker with many arguments in favour of a choice strongly prefers the listeners to choose it, i.e. is more biased. Then the listeners should respond less to the persuader’s effort. In the limit when the speaker’s only goal is always for the audience to comply, at any time cost of persuasion, then the listeners should ignore the speaker because a constant signal carries no information.


Start with the standard model of signalling by information provision and then add countersignalling.

The listeners choose either to do what the persuader wants or not. The persuader receives a benefit B if the listeners comply, otherwise receives zero.

The persuader always presents her first argument, otherwise reveals that she has no arguments, which ends the game with the listeners not doing what the persuader wants. The persuader chooses whether to spend time at cost c>0, c<B to present her second argument, which may be strong or weak. The persuader knows the strength of the second argument but the listeners only have the common prior belief that the probability of a strong second argument is p0. If the second argument is strong, then the persuader is confident, otherwise not.

If the persuader does not present the second argument, then the listeners receive an exogenous private signal in {1,0} about the persuader’s confidence, e.g. via her subconscious body language. The probabilities of the signals are Pr(1|confident) =Pr(0|not) =q >1/2. If the persuader presents the second argument, then the listeners learn the confidence with certainty and can ignore any signals about it. Denote by p1 the updated probability that the audience puts on the second argument being strong.

If the speaker presents a strong second argument, then p1=1, if the speaker presents a weak argument, then p1=0, if the speaker presents no second argument, then after signal 1, the audience updates their belief to p1(1) =p0*q/(p0*q +(1-p0)*(1-q)) >p0 and after signal 0, to p1(0) =p0*(1-q)/(p0*(1-q) +(1-p0)*q) <p0.

The listeners prefer to comply (take action a=1) when the second argument of the persuader is strong, otherwise prefer not to do what the persuader wants (action a=0). At the prior belief p0, the listeners prefer not to comply. Suppose that listeners who see no second argument comply after signal 1 and do not comply after signal 0. Then a persuader with a strong second argument chooses max{B*1-c, q*B*1 +(1-q)*B*0} and presents the argument iff (1-q)*B >c. A persuader with a weak argument chooses max{B*0-c, (1-q)*B*1 +q*B*0}, so never presents the argument. If a confident persuader chooses not to present the argument, then the listeners use the exogenous signal, otherwise use the choice of presentation to infer the type of the persuader.
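The updating and the presentation decision in the model above can be checked numerically. The sketch below only restates the formulas from the text. The function names and the example numbers are my own choices for illustration, and the payoffs assume that listeners who see no second argument comply exactly after signal 1.

```python
def posterior_strong(p0, q, signal):
    """Listeners' belief that the second argument is strong after a private signal.

    p0: prior probability of a strong second argument.
    q:  signal accuracy, Pr(1|confident) = Pr(0|not) = q > 1/2.
    signal: 1 or 0, observed only when no second argument is presented.
    """
    if signal == 1:
        return p0 * q / (p0 * q + (1 - p0) * (1 - q))
    return p0 * (1 - q) / (p0 * (1 - q) + (1 - p0) * q)


def strong_type_presents(B, c, q):
    """Strong type compares B - c (present) with q * B (rely on the signal)."""
    return B - c > q * B


def weak_type_presents(B, c, q):
    """Weak type compares 0 - c (present, be found out) with (1 - q) * B."""
    return -c > (1 - q) * B


# Illustrative numbers (assumptions, not from the text): p0 = 0.5, q = 0.8.
print(posterior_strong(0.5, 0.8, 1))      # belief after signal 1, above p0
print(posterior_strong(0.5, 0.8, 0))      # belief after signal 0, below p0
print(strong_type_presents(10, 1, 0.8))   # cheap presentation: (1-q)*B > c
print(strong_type_presents(10, 3, 0.8))   # costly presentation: (1-q)*B < c
print(weak_type_presents(10, 1, 0.8))     # never presents when c > 0
```

The condition `B - c > q * B` is the same as (1-q)*B >c in the text, so a more accurate signal q makes the strong type less willing to pay for the second argument.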

One extension is that presenting the argument still leaves some doubt about its strength.

Another extension has many argument strength levels, so each type of persuader sometimes presents the second argument, sometimes not.

In this standard model, if the second argument is presented, then always by the confident type. As is intuitive, the second argument increases the belief of the listeners that the persuader is right. Adding countersignalling partly reverses the intuition – a very confident type of the persuader knows that the first argument already reveals her great confidence, so the listeners do what the very confident persuader wants. The very confident type never presents the second argument, so if the confident type chooses to present it, then the extra argument reduces the belief of the audience in the correctness of the persuader. However, compared to the least confident type who also never presents the second argument, the confident type’s second argument increases the belief of the listeners.

Tissue sampling by piggybacking on vaccination or testing campaigns

Obtaining tissue samples from a large population of healthy individuals is useful for many research and testing applications. Establishing the distribution of genes, transcriptomes, cell distributions and morphologies in a normal population allows comparing clinical laboratory findings to reference values obtained from this baseline. The genetic composition of the population can be used to estimate historical migration patterns in paleoanthropology and selective pressures in evolutionary biology.

Gathering tissue samples from many people is expensive and time-consuming, unless it happens as a byproduct of existing programs. Collecting used vaccination needles or coronavirus nasal swabs that have a few cells attached allows anonymous tissue sampling of almost the entire population. A few cells per person are enough for many analyses in modern biology. Bulk collection of needles or swabs has built-in untraceability of biological material to an individual, which should alleviate privacy concerns and reduce the bureaucratic burden of ethics approvals.

Flight cameras for environmental and traffic monitoring

Recordings from the downward-pointing cameras on commercial airliners that provide inflight belly-cam views could be downloaded after landing and used for research, for example on vegetation cover, traffic density on roads, or night-time light, which measures economic development. The flight paths are saved on flight tracking websites anyway, which makes it possible to match any moment of the video to the GPS coordinates of the aircraft at that time. The recordings are not much use for military spying because countries ban overflights of sensitive sites anyway. Thus security and privacy arguments should not stop research in this case.
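Matching a video frame to a location could work roughly as follows. This is a sketch under several assumptions of mine: the flight track is available as a time-sorted list of (seconds since take-off, latitude, longitude) points, the video timestamps use the same clock, and straight-line interpolation between neighbouring track points is accurate enough. It ignores great-circle curvature, the antimeridian and the camera’s field of view. The track format and the sample points are made up.

```python
from bisect import bisect_right


def position_at(track, t):
    """Linearly interpolate (lat, lon) at time t from a time-sorted track.

    track: list of (time_seconds, lat_degrees, lon_degrees), a hypothetical format.
    """
    times = [point[0] for point in track]
    if not times[0] <= t <= times[-1]:
        raise ValueError("timestamp outside the recorded track")
    i = bisect_right(times, t)  # index of the first track point strictly after t
    if i == len(track):         # t coincides with the last track point
        return track[-1][1], track[-1][2]
    t0, lat0, lon0 = track[i - 1]
    t1, lat1, lon1 = track[i]
    w = (t - t0) / (t1 - t0)    # fraction of the way from point i-1 to point i
    return lat0 + w * (lat1 - lat0), lon0 + w * (lon1 - lon0)


# Made-up track: one point per minute.
sample_track = [(0, 50.0, 10.0), (60, 50.5, 11.0), (120, 51.0, 13.0)]
print(position_at(sample_track, 30))   # halfway along the first segment
print(position_at(sample_track, 90))   # halfway along the second segment
```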

The resolution of the belly cameras is low and the wavelengths cover only visible light, not infrared which would be useful for vegetation measurements. The compensating upside is the frequent overflights of many parts of the globe, thus the dense temporal coverage. The videos are almost costless to obtain – just plug an external hard drive into the existing inflight entertainment system and later upload its contents at the airport. The low cost contrasts with specialised satellite and aerial surveys.

Diffraction grating of parallel electron beams

Diffraction gratings with narrow bars and bar spacing are useful for separating short-wavelength electromagnetic radiation (x-rays, gamma rays) into a spectrum, but the narrow bars and gaps are difficult to manufacture. The bars are also fragile and thus need a backing material, which may absorb some of the radiation, leaving less of it to be studied. Instead of manufacturing the grating out of a solid material composed of neutral atoms, an alternative may be to use many parallel electron beams. Electromagnetic waves do scatter off electrons, thus the grating of parallel electron beams should have a similar effect to a solid grating of molecules. My physics knowledge is limited, so this idea may not work for many reasons.

Electron beams can be made with a diameter a few nanometres across, and can be bent with magnets. Thus the grating could be made from a single beam if powerful enough magnets bend it back on itself. Alternatively, many parallel beams could be generated from multiple sources.

The negatively charged electrons repel each other, so the beams tend to bend away from each other. To compensate for this, the beam sources could target the beams to a common focus and let the repulsion forces bend the beams outward. There would exist a point at which the converging and then diverging beams are parallel. The region near that point could be used as the grating. The converging beams should start out sufficiently close to parallel that they would not collide before bending outward again.

Proton or ion beams are also a possibility, but protons and ions have a larger diameter than electrons, which tends to create a coarser grating. Also, electron beam technology is more widespread and mature (cathode ray tubes were used in old televisions), thus easier to use off the shelf.

The most liveable cities rankings are suspicious

The “most liveable cities” rankings do not publish their methodology, only vague talk about a weighted index of healthcare, safety, economy, education, etc. An additional suspicious aspect is that the top-ranked cities are all large – there are no small towns. There are many more small than big cities in the world (city sizes approximately follow Zipf’s law), so by chance alone, one would expect most of the top-ranked towns in any ranking that is not size-based to be small. The liveability rankings do not mention restricting attention to sizes above some cutoff. Even if a minimum size were required, one would expect most of the top-ranked cities to be close to this lower bound, just based on the size distribution.
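The chance argument can be illustrated with a simulation. It is a sketch under assumptions of mine: city populations are drawn from a Pareto distribution with tail exponent 1 (a common stand-in for Zipf’s law) above a minimum size of 10,000, and each city gets a liveability score that is independent of its size. The cutoff of one million for a “large” city is arbitrary.

```python
import random


def large_city_shares(n_cities, top_n, large_cutoff, min_size=10_000, seed=0):
    """Share of large cities among the top-ranked and among all cities.

    Scores are drawn independently of size, so the ranking is not size-based.
    """
    rng = random.Random(seed)
    cities = []
    for _ in range(n_cities):
        size = min_size * rng.paretovariate(1.0)  # Pareto tail with exponent 1
        score = rng.random()                      # size-independent liveability
        cities.append((score, size))
    cities.sort(reverse=True)                     # best score first
    top = cities[:top_n]
    share_top = sum(1 for _, size in top if size >= large_cutoff) / top_n
    share_all = sum(1 for _, size in cities if size >= large_cutoff) / n_cities
    return share_top, share_all


share_top, share_all = large_city_shares(50_000, top_n=100, large_cutoff=1_000_000)
print("Share of large cities in the top 100:", share_top)
print("Share of large cities overall:       ", share_all)
```

Under these assumptions about 1% of cities exceed the cutoff, and the top 100 should contain a similarly small share of them. A real ranking whose top is filled entirely with large cities is therefore very unlikely to be independent of size.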

The claimed ranking methodology includes several variables one would expect to be negatively correlated with the population of a city (safety, traffic, affordability). The only plausible positively size-associated variables are culture and entertainment, if these measure the total number of venues and events, not the per-capita number. Unless the index weights entertainment very heavily, one would expect big cities to be at a disadvantage in the liveability ranking based on the correlations, i.e. the smaller the town, the greater its probability of achieving a given liveability score and placing in the top n in the rankings. So the “best places to live” should be almost exclusively small towns. Rural areas not so much, because these usually have limited access to healthcare, education and amenities. The economy of remote regions grows less overall and the population is older, but some (mining) boom areas radically outperform cities in these dimensions. Crime is generally low, so if rural areas were included in the liveability index, then some of these would have a good chance of attaining top rank.

For any large city, there exists a small town with better healthcare, safety, economy, education, younger population, more entertainment events per capita, etc (easy examples are university towns). The fact that these do not appear at the top of a liveability ranking should raise questions about its claimed methodology.

The bias in favour of bigger cities is probably coming from sample selection and hometown patriotism. If people vote mostly for their own city and the respondents of the liveability survey are either chosen from the population approximately uniformly randomly or the sample is weighted towards larger cities (online questionnaires have this bias), then most of the votes will favour big cities.

Blind testing of bicycle fitting

Claims that getting a professional bike fit significantly improves riding comfort and speed and reduces overuse injuries seem suspicious – how can a centimetre here or there make such a large difference? A very wrong fit (e.g. an adult using a children’s bike) of course creates big problems, but most people can adjust their bike to a reasonable fit based on a few online suggestions.

Determining the actual benefit of a bike fit requires a randomised trial: have professionals determine the bike fit for a large enough sample of riders, then measure and record the objective parameters of the fit (centimetres of seatpost out of the seat tube, handlebar height from the ground, pedal crank length, etc). Then randomly change the fit by a few centimetres or leave it unchanged, without the cyclist knowing, and let the rider test the bike. Record the speed, ask the rider to rate the comfort, fatigue, etc. Repeat for several random changes in fit. Statistically test whether the average speed, comfort rating and other outcome variables across the sample of riders are better with the actual fit or with small random changes. To eliminate the placebo effect, blind testing is important – the cyclists should not know whether and how the fit has been changed.
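One simple way to do the statistical test is a paired sign-flip (randomisation) test. This is my choice of technique for illustration, not the only option. For each rider, compute the outcome with the professional fit minus the average outcome with the randomly changed fits. If the fit does not matter, each difference is equally likely to be positive or negative, so flipping the signs should often produce a total as extreme as the observed one. The numbers below are invented.

```python
from itertools import product


def paired_sign_flip_p_value(differences):
    """Exact two-sided p-value of a paired sign-flip test.

    differences: one number per rider, e.g. comfort rating with the
    professional fit minus mean rating with randomly perturbed fits.
    Enumerates all 2^n sign patterns, so it suits small samples only.
    """
    observed = abs(sum(differences))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(differences)):
        flipped = abs(sum(s * d for s, d in zip(signs, differences)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Invented data: every rider rates the professional fit higher.
clear_benefit = [0.4, 0.9, 0.3, 1.1, 0.6, 0.8, 0.5, 0.7]
# Invented data: no systematic difference.
no_benefit = [1.0, -1.0, 1.0, -1.0]

print(paired_sign_flip_p_value(clear_benefit))  # small: 2 of 256 sign patterns
print(paired_sign_flip_p_value(no_benefit))     # 1.0: observed total is zero
```

For a realistic sample of riders, full enumeration becomes too slow and one would draw random sign patterns instead.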

Another approach is to have each rider test a large sample of different bike fits, find the best one empirically, record its objective parameters and then have a sample of professional fitters (who should not know what empirical fit was found) choose the best fit. Test statistically whether the professionals choose the same fit as the cyclist.

A simpler trial that does not quite answer the question of interest checks the consistency of different bike fitters. The same person with the same bike in the same initial configuration goes to various fitters and asks them to choose a fit. After each fitting, the objective sizing of the bike is recorded and then the bike is returned to the initial configuration before the next fit. The test is whether all fitters choose approximately the same parameters. Inconsistency implies that most fitters cannot figure out the objectively best fit, but consistency does not imply that the consensus of the fitters is the optimal sizing. They could all be wrong the same way – consistency is insufficient to answer the question of interest.

Avoiding the Bulow and Rogoff 1988 result on the impossibility of borrowing

Bulow and Rogoff 1988 NBER working paper 2623 proves that countries cannot borrow, due to their inability to credibly commit to repay, if after default they can still buy insurance. The punishment of defaulting on debt is being excluded from future borrowing. This punishment is not severe enough to motivate a country to repay, by the following argument. A country has two reasons to borrow: it is less patient than the lenders (values current consumption or investment opportunities relatively more) and it is risk-averse (either because the utility of consumption is concave, or because good investment opportunities appear randomly). Debt can be used to smooth consumption or take advantage of temporary opportunities for high-return investment: borrow when consumption would otherwise be low, pay back when relatively wealthy.

After the impatient country has run up its debt to the maximum level the creditors are willing to tolerate, the impatience motive to borrow disappears, because the lenders do not allow more consumption to be transferred from the future to the present. Only the insurance motive to borrow remains. The punishment for default is the inability to insure via debt, because in a low-consumption or valuable-investment state of affairs, no more can be borrowed. Bulow and Rogoff assume that the country can still save or buy insurance by paying in advance, so “one-sided” risk-sharing (pay back when relatively wealthy, or when investment opportunities are unavailable) is possible. This seemingly one-sided risk-sharing becomes standard two-sided risk-sharing upon default, because the country can essentially “borrow” from itself the amount that it would have spent repaying debt. This amount can be used to consume or invest in the state of the world where these activities are attractive, or to buy insurance if consumption and investment are currently unattractive. Thus full risk-sharing is achieved.

More generally, if the country can avoid the punishment that creditors impose upon default (evade trade sanctions by smuggling, use alternate lenders if current creditors exclude it), then the country has no incentive to repay, in which case lenders have no incentive to lend.

The creditors know that once the country has run up debt to the maximum level they allow, it will default. Thus rational lenders set the maximum debt to zero. In other words, borrowing is impossible.

A way around the no-borrowing theorem of Bulow and Rogoff is to change one or more assumptions. In an infinite horizon game, Hellwig and Lorenzoni allow the country to run a Ponzi scheme on the creditors, thus effectively “borrow from time period infinity”, which permits a positive level of debt, sometimes even an infinite one.

Another assumption that could realistically be removed is that the country can buy insurance after defaulting. Restricting insurance need not be due to an explicit legal ban. The insurers are paid in advance, thus do not exclude the country out of fear of default. Instead, the country’s debt contract could allow creditors to seize the country’s financial assets abroad, specifically in creditor countries, and these assets could be defined to include insurance premiums already paid, or the payments from insurers to the country. The creditors have no effective recourse against the sovereign debtor, but they may be able to enforce claims against insurance firms outside the defaulting country.

Seizing premiums to or payments from insurers would result in negative profits to insurers or restrict the defaulter to one-sided risk-sharing, without the abovementioned possibility of making it two-sided. Seizing premiums makes insurers unwilling to insure, and seizing payments from insurers removes the country’s incentive to purchase insurance. Either way, the country’s benefit from risk-sharing after default is eliminated. This punishment would motivate loan repayment, in turn motivating lending.

Asking questions of yourself

To make better decisions, ask about all your activities “Am I doing this right? Is there a better way?” I would have benefited from considering such questions about many everyday tasks. For example, I brushed my teeth wrong (sawing at the roots) until late teens, brushed my teeth at the wrong time (right after a meal when the enamel is soft) until my 30s. I only learned to cut my own hair in my mid-20s, and this was the highest-return investment I ever made, because a hair clipper costs as much as a haircut, so pays for itself with the first use.

Peeling a kiwi with a spoon is far easier than slicing with a knife. All it took to learn this was one web search, but it required asking myself the question of whether I was peeling fruit optimally. Same for extracting the seed from an avocado.

Cracking the shell of a hard-boiled egg, making two holes at the ends and blowing air under the membrane before peeling is another trick I wish I had known earlier.

Microwaved food is cooler in the centre, so to avoid scalding one’s mouth, it is helpful to start eating it from the middle. Cooked food left in a covered cooking pot or transferred to a storage container while still mildly hot does not go bad at room temperature for several days – doing this experiment required posing this hypothesis. Drinking without touching the bottle with one’s mouth turns out to be quite easy and is widespread in India.

Only after learning to drive did I start meaningfully using gears on a bicycle, and it took about 15 years more to start shifting approximately correctly (pedalling cadence 60-100 rotations per minute, downshifting before stopping, avoiding cross-geared riding). Similarly for basic bike maintenance like cleaning and oiling the chain, selecting the appropriate front and rear tire pressure given one’s weight and tire widths. Seat height is one thing I figured out early, but not handlebar height.

As a teenager, I would have benefited from asking myself whether I was overtraining, whether my nutrition was reasonable, how soon to return to training after various injuries and whether to seek medical assistance with these. Questioning the competence of coaches and doing a simple web search for sports medicine resources would have prevented following some of their mistaken advice.

Sometimes asking yourself the question reveals that you are already doing the task correctly. On the internet, people claim that they do not use shampoo, just water, and their hair stays clean-smelling and more lush than using detergent. An experiment not to use shampoo was a failure for me, causing greasy hair and lots of dandruff after a few days. The optimality of shampoo may depend on individual scalp and hair characteristics. On the other hand, a single-blade disposable razor and cold water give me a better shave than multi-bladed fancy brands with foam (that get clogged), and the disposable razor stays sharp enough for a month or two of everyday shaving.

When going to teach, it may be worth asking whether the room is the correct one, even if some students show up and the room is free, because once in this situation I was in a room with the right label, but in the wrong building.

On the other hand, constantly doubting oneself is unhealthy and unhelpful. If enough evidence points one way, then it is time to make up one’s mind.

Blind testing of clothes

Inspired by blind taste testing, manufacturers’ claims about clothes could be tested by subjects blinded to what they are wearing. The test would work as follows. People put clothes on by feel with their eyes closed or in a pitch dark room and wear other clothes on top of the item to be tested. Thus the subjects cannot see what they are wearing. They then rate the comfort, warmth, weight, softness and other physical aspects of the garment. This would help consumers select the most practical clothing and keep advertising somewhat more honest than heretofore. For example, many socks are advertised as warm, but based on my experience, many of them do not live up to the hype. I would be willing to pay a small amount for data about past wearers’ experience. Online reviews are notoriously emotional and biased.

Some aspects of clothes can also be measured objectively – warmth is one of these, measured by heat flow through the garment per unit of area. Such data is unfortunately rarely reported. The physical measurements to conduct on clothes require some thought, to make these correspond to the wearing experience. For example, if clothes are thicker in some parts, then their insulation should be measured in multiple places. Some parts of the garment may usually be worn with more layers under or over it than others, which may affect the required warmth of different areas of the clothing item differently. Sweat may change the insulation properties dramatically, e.g. for cotton. Windproofness matters for whether windchill can be felt. All this needs taking into account when converting physical measurements to how the clothes feel.