Tag Archives: econometrics

Identifying unmeasurable effort in contests

To distinguish unmeasurable effort from unmeasurable exogenous factors like talent or environmental interference in contests, assumptions are needed, even for partial identification when overall performance can be objectively measured (e.g., chess move quality evaluated by a computer). Combining one of the following assumptions with the additive separability of effort and the exogenous factors provides sign restrictions on coefficient estimates. Additive separability means that talent or the environment changes performance the same way at any effort level.

One such identifying assumption is that effort is greatest when it makes the most difference – against an equal opponent. By contrast, effort is lower against much better and much worse opponents.

A similar identifying assumption is that if there is personal conflict between some contest participants but not others, then effort is likely higher against a hated opponent than a neutral one.

The performance of a given contestant against an equal opponent compared to against an unequal one is a lower bound on how much effort affects performance. Similarly, the performance against a hated rival compared to against a neutral contestant is a lower bound on the effect of effort. The lower bound is not the total influence of effort, because even against an unequal neutral opponent, effort is still positive.
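The lower-bound argument can be illustrated with a small simulation. This is only a sketch under assumed numbers: the talent level, the two effort levels, the noise and the sample size are all hypothetical, and performance is taken to be additively separable as described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # games per opponent type (hypothetical sample size)

talent = 5.0          # exogenous factor, identical in both settings
effort_equal = 2.0    # assumed effort against an equal opponent
effort_unequal = 0.5  # assumed (still positive) effort against an unequal opponent

# Additive separability: performance = talent + effort + noise.
perf_equal = talent + effort_equal + rng.normal(0.0, 1.0, n)
perf_unequal = talent + effort_unequal + rng.normal(0.0, 1.0, n)

# The observable performance gap estimates effort_equal - effort_unequal.
lower_bound = perf_equal.mean() - perf_unequal.mean()
print(f"estimated lower bound: {lower_bound:.2f}, true effect of effort: {effort_equal:.2f}")
```

Talent cancels in the difference, so the gap recovers only the difference in effort, which falls short of the full effect of effort whenever effort against the unequal opponent is positive.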

Computer vision training sets of photos are endogenous

In principle, every pixel could be independent of any other, so the number of possible photos is the number of colours raised to the power of the number of pixels, which is astronomically large. No training data set is large enough to cover these photo possibilities many times over, as required for statistical analysis, of which machine learning is a subfield. The problem is solved by restricting attention to a small subset of possible photos. In this case, there is a reasonable number of possible photos, which can be covered by a reasonably large training data set.
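To get a feel for the size of the unrestricted photo space, the count can be worked out for a small image. The sketch below assumes a hypothetical 100 x 100 greyscale image with 256 grey levels, and reports the number of decimal digits of the count rather than the count itself.

```python
import math

def photo_count_digits(n_pixels: int, n_colours: int) -> int:
    """Number of decimal digits of n_colours ** n_pixels."""
    return math.floor(n_pixels * math.log10(n_colours)) + 1

# Hypothetical small greyscale image: 100 x 100 pixels, 256 grey levels.
n_pixels = 100 * 100
n_colours = 256
digits = photo_count_digits(n_pixels, n_colours)
print(f"{n_colours}^{n_pixels} has {digits} decimal digits")
```

Even this tiny image format allows a number of pixel arrangements with tens of thousands of digits, far beyond what any data set could sample densely.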

Useful photos on any topic usually contain just one main object, such as a face, with fewer than 100 secondary objects (furniture, clothes, equipment). There is a long right tail – some useful photos have dozens of the main object, like a group photo full of faces, but I do not know of a photo with a thousand distinguishable faces. Photos of mass events may have ten thousand people, but lack the resolution to make any of these faces useful.

Only selected photos are worth analysing. Only photos sufficiently similar to these are worth putting in a computer vision training dataset. The sample selection occurs both on the input and the output side: few of the possible pixel arrangements actually occur as photos to be classified by machine vision and most of the training photos are similar to those. There are thus fewer outputs to predict than would be generated from a uniform random distribution and more inputs close to those outputs than would occur if the input data were uniformly random. Both speed up learning.

When photo resolution improves, more objects of interest may appear in photos without losing usefulness to blur. Then such photos become available in large numbers and are added to the datasets.

Clinical trials of other drugs in other species to predict a drug’s effect in humans

Suppose we want to know whether a drug is safe or effective for humans, but do not have data on what it does in humans, only on its effects in mice, rats, rhesus macaques and chimpanzees. In general, we can predict the effect of the drug on humans better with the animal data than without it. Information on “nearby” realisations of a random variable (effect of the drug) helps predict the realisation we are interested in. The method should weight nearby observations more than observations further away when predicting. For example, if the drug has a positive effect in animals, then the method predicts a positive effect in humans, and the larger the effect in animals, the greater the predicted effect in humans.

A limitation of weighting is that it does not take into account the slope of the effect when moving from further observations to nearer. For example, a very large effect of the drug in mice and rats but a small effect in macaques and chimpanzees predicts the same effect in humans as a small effect in rodents and a large one in monkeys and apes, if the weighted average effect across animals is the same in both cases. However, intuitively, the first case should have a smaller predicted effect in humans than the second, because moving to animals more similar to humans, the effect becomes smaller in the first case but larger in the second. The idea is similar to the derivative term of a proportional-integral-derivative (PID) controller in engineering.

The slope of the effect of the drug is extra information that increases the predictive power of the method if the assumption that the similarity of effects decreases in genetic distance holds. Of course, if this assumption fails in the data, then imposing it may result in bias.
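The difference between pure weighting and using the slope can be sketched numerically. Everything in the example is an assumption made for illustration: the genetic distances, the inverse-distance weights, the effect sizes, and the use of a straight-line fit extrapolated to distance zero as the slope-aware predictor.

```python
import numpy as np

# Hypothetical genetic distances to humans (arbitrary units): mouse, rat, macaque, chimpanzee.
distance = np.array([4.0, 4.0, 1.0, 1.0])
weights = 1.0 / distance  # nearer species get more weight (an assumed weighting scheme)

# Hypothetical drug effects in the four species.
case_a = np.array([5.0, 5.0, 2.0, 2.0])  # large in rodents, small in primates
case_b = np.array([1.0, 1.0, 3.0, 3.0])  # small in rodents, large in primates

def weighted_average(effects):
    return np.average(effects, weights=weights)

def extrapolate_to_human(effects):
    # Fit a straight line of effect on distance and evaluate it at distance 0 (humans).
    slope, intercept = np.polyfit(distance, effects, 1)
    return intercept

for name, effects in [("A", case_a), ("B", case_b)]:
    print(name, weighted_average(effects), extrapolate_to_human(effects))
```

The two cases are constructed to have the same weighted average, so a purely weighted predictor cannot tell them apart, while the extrapolated line predicts a smaller human effect in case A than in case B.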

Assumptions may be imposed on the method using constrained estimation. One constraint is the monotonicity of the effect in some measure of distance between observations. The method may allow for varying weights by adding interaction terms (e.g., the effect of the drug times genetic similarity). The interaction terms unfortunately require more data to estimate.

Extraneous information about the slope of the effect helps justify the constraints and reduces the need for adding interaction terms, thus decreasing the data requirement. An example of such extra information is whether the effects of other drugs that have been tested in these animals as well as humans were monotone in genetic distance. Using information about these other drugs imposes the assumption that the slopes of the effects of different drugs are similar. The similarity of the slopes should intuitively depend on the chemical similarity of the drugs, with more distant drugs having more different profiles of effects across animals.

The similarity of species in terms of the effects drugs have on them need not correspond to genetic similarity or the closeness of any other observable characteristic of these organisms, although these often coincide. The similarity of interest is how similar the effects of the drug are across these species. Estimating this similarity based on the similarity of other drugs across these animals may also be done by a weighted regression, perhaps with constraints or added interaction terms. More power for the estimation may be obtained from simultaneous estimation of the drug-effect-similarity of the species and the effect of the drug in humans. An analogy is demand and supply estimation in industrial organisation where observations about each side of the market give information about the other side. Another analogy is duality in mathematics, in this case between the drug-effect-similarity of the species and the given drug’s similarity of effects across these species.

The similarity of drugs in terms of their effects on each species need not correspond to chemical similarity, although it often does. The similarity of interest for the drugs is how similar their effects are in humans, and also in other species.

The inputs into the joint estimation of drug similarity, species similarity and the effect of the given drug in humans are the genetic similarity of the species, the chemical similarity of the drugs and the effects for all drug-species pairs that have been tested. In the matrix where the rows are the drugs and the columns the species, we are interested in filling in the cell in the row “drug of interest” and the column “human”. The values in all the other cells are informative about this cell. In other words, there is a benefit from filling in these other cells of the matrix.

Given the duality of drugs and species in the drug effect matrix, there is information to be gained from running clinical trials of chemically similar human-use-approved drugs in species in which the drug of interest has been tested but the chemically similar ones have not. The information is directly about the drug-effect-similarity of these species to humans, which indirectly helps predict the effect of the drug of interest in humans from the effects of it in other species. In summary, testing other drugs in other species is informative about what a given drug does in humans. Adapting methods from supply and demand estimation, or otherwise combining all the data in a principled theoretical framework, may increase the information gain from these other clinical trials.

Extending the reasoning, each (species, drug) pair has some unknown similarity to the (human, drug of interest) pair. A weighted method to predict the effect in the (human, drug of interest) pair may gain power from constraints that the similarity of different (species, drug) pairs increases in the genetic closeness of the species and the chemical closeness of the drugs.

Define Y_{sd} as the effect of drug d in species s. Define X_{si} as the observable characteristic (gene) i of species s. Define X_{dj} as the observable characteristic (chemical property) j of drug d. The simplest method is to regress Y_{sd} on all the X_{si} and X_{dj} and use the coefficients to predict the Y_{sd} of the (human, drug of interest) pair. If there are many characteristics i and j and few observations Y_{sd}, then variable selection or regularisation is needed. Constraints may be imposed, like X_{si}=X_i for all s and X_{dj}=X_j for all d.
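A minimal sketch of the simplest method, on simulated data, may make the setup concrete. The data-generating process, the dimensions, the noise level and the choice of ridge regression as the regulariser are all assumptions for illustration; species 0 stands in for humans and drug 0 for the drug of interest.

```python
import numpy as np

rng = np.random.default_rng(1)
n_species, n_drugs = 5, 8  # species 0 plays the role of humans, drug 0 the drug of interest
n_genes, n_chem = 3, 3     # number of characteristics i and j

X_species = rng.normal(size=(n_species, n_genes))  # X_{si}
X_drug = rng.normal(size=(n_drugs, n_chem))        # X_{dj}
beta_true = rng.normal(size=n_genes)               # simulated coefficients (unknown in practice)
gamma_true = rng.normal(size=n_chem)

def features(s, d):
    # Intercept, then species characteristics, then drug characteristics.
    return np.concatenate(([1.0], X_species[s], X_drug[d]))

# Observed cells: every (species, drug) pair except (human, drug of interest).
pairs = [(s, d) for s in range(n_species) for d in range(n_drugs) if (s, d) != (0, 0)]
X = np.array([features(s, d) for s, d in pairs])
signal = np.array([X_species[s] @ beta_true + X_drug[d] @ gamma_true for s, d in pairs])
Y = signal + rng.normal(scale=0.1, size=len(pairs))  # Y_{sd} with noise

# Ridge regression in closed form: (X'X + lam I)^{-1} X'Y.
lam = 0.01
coef = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

prediction = features(0, 0) @ coef
truth = X_species[0] @ beta_true + X_drug[0] @ gamma_true
print(f"predicted effect: {prediction:.2f}, simulated true effect: {truth:.2f}")
```

With real data the number of genes and chemical properties would far exceed the number of tested pairs, which is where the choice of penalty or variable selection starts to matter.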

Fused LASSO (least absolute shrinkage and selection operator), clustered LASSO and prior LASSO seem related to the above method.

Leader turnover due to organisation performance is underestimated

Berry and Fowler (2021) “Leadership or luck? Randomization inference for leader effects in politics, business, and sports” in Science Advances propose a method they call RIFLE for testing the null hypothesis that leaders have no effect on organisation performance. The method is robust to serial correlation in outcomes and leaders, but not to endogenous leader turnover, as Berry and Fowler honestly point out. The endogeneity is that the organisation’s performance influences the probability that the leader is replaced (economic growth causes voters to keep a politician in office, losing games causes a team to replace its coach).

To test whether such endogeneity is a significant problem for their results, Berry and Fowler regress the turnover probability on various measures of organisational performance. They find small effects, but this underestimates the endogeneity problem, because Berry and Fowler use linear regression, forcing the effect of performance on turnover to be monotone and linear.

If leader turnover is increased by both success (get a better job elsewhere if the organisation performs well, so quit voluntarily) and failure (fired for the organisation’s bad performance), then the relationship between turnover and performance is U-shaped. Average leaders keep their jobs, bad and good ones transition elsewhere. This is related to the Peter Principle that an employee is promoted to her or his level of incompetence. A linear regression finds a near-zero effect of performance on turnover in this case even if the true effect is large. How close the regression coefficient is to zero depends on how symmetric the effects of good and bad performance on leader transition are, not how large these effects are.
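The point about symmetry can be checked in a simulation. The sketch below assumes a hypothetical turnover probability that is symmetric and U-shaped in standardised performance; it is not Berry and Fowler's data or specification.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
performance = rng.normal(size=n)

# Assumed U-shaped turnover probability: very bad and very good performance both raise it.
p_turnover = np.clip(0.05 + 0.15 * performance**2, 0.0, 1.0)
turnover = (rng.random(n) < p_turnover).astype(float)

def ols(columns, y):
    # Least-squares coefficients with an intercept as the first column.
    X = np.column_stack([np.ones(len(y))] + columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

linear = ols([performance], turnover)
quadratic = ols([performance, performance**2], turnover)
print(f"linear slope: {linear[1]:.4f}")
print(f"coefficient on performance squared: {quadratic[2]:.4f}")
```

The linear slope comes out close to zero even though performance strongly drives turnover, while adding a squared term picks up the dependence.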

The problem for the RIFLE method of Berry and Fowler is that the small apparent effect of organisation performance on leader turnover from OLS regression misses the endogeneity in leader transitions. Such endogeneity biases RIFLE, as Berry and Fowler admit in their paper.

The endogeneity may explain why Berry and Fowler find stronger leader effects in sports (coaches in various US sports) than in business (CEOs) and politics (mayors, governors, heads of government). A sports coach may experience more asymmetry in the transition probabilities for good and bad performance than a politician. For example, if the teams fire coaches after bad performance much more frequently than poach coaches from well-performing competing teams, then the effect of performance on turnover is close to monotone: bad performance causes firing. OLS discovers this monotone effect. On the other hand, if politicians move with equal likelihood after exceptionally good and bad performance of the administrative units they lead, then linear regression finds no effect of performance on turnover. This misses the bias in RIFLE, which without the bias might show a large leader effect in politics also.

The unreasonably large effect of governors on crime (the governor effect explains 18-20% of the variation in both property and violent crime) and the difference between the zero effect of mayors on crime and the large effect of governors that Berry and Fowler find make me suspect that something is wrong with that particular analysis in their paper. In a checks-and-balances system, the governor should not have that large an influence on the state’s crime. A mayor works more closely with the local police, so would be expected to have more influence on crime.

If top people have families and hobbies, then success is not about productivity


Consider the following assumptions.

1. Productivity is continuous and weakly increasing in talent and effort.

2. The sum of efforts allocated to all activities is bounded, and this bound is similar across people.

3. Families and hobbies take some effort, thus less is left for work. (For this assumption to hold, it may be necessary to focus on families with children in which the partner is working in a different field. Otherwise, a stay-at-home partner may take care of the cooking and cleaning, freeing up time for the working spouse to allocate to work. A partner in the same field of work may provide a collaboration synergy. In both cases, the productivity of the top person in question may increase.)

4. The talent distribution is similar for people with and without families or hobbies. This assumption would be violated if for example talented people are much better at finding a partner and starting a family.

Under these assumptions, reasonably rational people would be more productive without families or hobbies. If success is mostly determined by productivity, then people without families should be more successful on average. In other words, most top people in any endeavour would not have families or hobbies that take time away from work.

In short, if responsibilities and distractions cause lower productivity, and productivity causes success, then success is negatively correlated with such distractions. Therefore, if successful people have families at least as frequently as the general population, then success is not driven by productivity.

One counterargument is that people first become successful and then start families. In order for this to explain the similar fractions of singles among top and bottom achievers, the rate of family formation after success must be much greater than among the unsuccessful, because catching up from a late start requires a higher rate of increase.

Another explanation is irrationality of a specific form – one which reduces the productivity of high effort significantly below that of medium effort. Then single people with lots of time for work would produce less through their high effort than those with families and hobbies via their medium effort. Productivity per hour naturally falls with increasing hours, but the issue here is total output (the hours times the per-hour productivity). An extra work hour has to contribute negatively to success to explain the lack of family-success correlation. One mechanism for a negative effect of hours on output is burnout of workaholics. For this explanation, people have to be irrational enough to keep working even when their total output falls as a result.

If the above explanations seem unlikely but the assumptions reasonable in a given field of human endeavour, then reaching the top and staying there is mostly not about productivity (talent and effort) in this field, for example in academic research.

A related empirical test of whether success in a given field is caused by productivity is to check whether people from countries or groups that score highly on corruption indices disproportionately succeed in this field, either conditional on entering the field or unconditionally. In academia, in fields where convincing others is more important than the objective correctness of one’s results, people from more nepotist cultures should have an advantage. The same applies to journals – the general interest ones care relatively more about a good story, the field journals more about correctness. Do people from more corrupt countries publish relatively more in general interest journals, given their total publications? The comparison should of course be conditional on their observable characteristics, like the current country of employment.

Another related test for meritocracy in academia or the R&D industry is whether coauthored publications and patents are divided by the number of coauthors in their influence on salaries and promotions. If there is an established ranking of institutions or job titles, then do those at higher ranks have more quality-weighted coauthor-divided articles and patents? The quality-weighting is the difficult part, because usually there is no independent measure of quality (unaffected by the dependent variable, be it promotions, salary, publication venue).

The smartest professors need not admit the smartest students

The smartest professors are likely the best at targeting admission offers to students who are the most useful for them. Other things equal, the intelligence of a student is beneficial, but there may be tradeoffs. The overall usefulness may be maximised by prioritising obedience (manipulability) over intelligence or hard work. It is an empirical question what the real admissions criteria are. Data on pre-admissions personality test results (which the admissions committee may or may not have) would allow measuring whether the admission probability increases in obedience. Measuring such effects for non-top universities is complicated by the strategic incentive to admit students who are reasonably likely to accept, i.e. unlikely to get a much better offer elsewhere. So the middle- and bottom-ranked universities might not offer a place to the highest-scoring students for reasons independent of the obedience-intelligence tradeoff.

Similarly, a firm does not necessarily hire the brightest and individually most productive workers, but rather those who the firm expects to contribute the most to the firm’s bottom line. Working well with colleagues, following orders and procedures may in some cases be the most important characteristics. A genius who is a maverick may disrupt other workers in the organisation too much, reducing overall productivity.

The most liveable cities rankings are suspicious

The “most liveable cities” rankings do not publish their methodology, only vague talk about a weighted index of healthcare, safety, economy, education, etc. An additional suspicious aspect is that the top-ranked cities are all large – there are no small towns. There are many more small than big cities in the world (city sizes approximately follow Zipf’s law), so by chance alone, one would expect most of the top-ranked towns in any ranking that is not size-based to be small. The liveability rankings do not mention restricting attention to sizes above some cutoff. Even if a minimum size were required, one would expect most of the top-ranked cities to be close to this lower bound, just based on the size distribution.
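The size-distribution argument can be made concrete with a simulation. The sketch assumes a hypothetical minimum size of 100,000, Pareto-distributed city sizes with tail exponent 1 above that cutoff, and a liveability score that is independent of size; a "big" city is defined here as one at least ten times the cutoff.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cities = 200_000
min_size = 100_000  # hypothetical size cutoff for inclusion in the ranking

# Classical Pareto sizes with tail exponent 1 (Zipf's law): P(size > x) = min_size / x.
size = min_size * (1.0 + rng.pareto(1.0, n_cities))

# Liveability score drawn independently of size (the assumption being illustrated).
score = rng.normal(size=n_cities)

big = size > 10 * min_size
top = np.argsort(score)[-100:]  # indices of the 100 highest scores

share_big_all = big.mean()
share_big_top = big[top].mean()
print(f"share of big cities overall: {share_big_all:.3f}")
print(f"share of big cities in the top 100: {share_big_top:.3f}")
```

Under these assumptions only about one city in ten is big, and a size-independent ranking inherits roughly that share at the top, so a top list made up entirely of large cities would be very unlikely by chance.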

The claimed ranking methodology includes several variables one would expect to be negatively correlated with the population of a city (safety, traffic, affordability). The only plausible positively size-associated variables are culture and entertainment, if these measure the total number of venues and events, not the per-capita number. Unless the index weights entertainment very heavily, one would expect big cities to be at a disadvantage in the liveability ranking based on the correlations, i.e. the smaller the town, the greater its probability of achieving a given liveability score and placing in the top n in the rankings. So the “best places to live” should be almost exclusively small towns. Rural areas not so much, because these usually have limited access to healthcare, education and amenities. The economy of remote regions grows less overall and the population is older, but some (mining) boom areas radically outperform cities in these dimensions. Crime is generally low, so if rural areas were included in the liveability index, then some of these would have a good chance of attaining top rank.

For any large city, there exists a small town with better healthcare, safety, economy, education, younger population, more entertainment events per capita, etc. (easy examples are university towns). The fact that these do not appear at the top of a liveability ranking should raise questions about its claimed methodology.

The bias in favour of bigger cities probably comes from sample selection and hometown patriotism. If people vote mostly for their own city and the respondents of the liveability survey are either chosen from the population approximately uniformly randomly or the sample is weighted towards larger cities (online questionnaires have this bias), then most of the votes will favour big cities.

Distinguishing discrimination in admissions from the opposite discrimination in grading

There are at least two potential explanations for why students from group A get a statistically significantly higher average grade in the same course than those from group B. The first is discrimination against A in admissions: if members of A face a stricter ability cutoff to be accepted at the institution, then conditional on being accepted, they have higher average ability. One form of a stricter ability cutoff is requiring a higher score from members of A, provided admissions test scores are positively correlated with ability.

The second explanation is discrimination in favour of group A in grading: students from A are given better grades for the same work. To distinguish this from admissions discrimination against A, one way is to compare the relative grades of groups A and B across courses. If the difference in average grades is due to ability, then it should be quite stable across courses, compared to a difference coming from grading standards, which varies with each grader’s bias for A.

Of course, there is no clear line for how much the relative grades of group A vary across courses under grading discrimination, as opposed to admissions bias. Only statistical conclusions can be drawn about the relative importance of the two opposing mechanisms driving the grade difference. The distinction is more difficult to make when there is a “cartel” in grading discrimination, so that all graders try to boost group A by the same amount, i.e. to minimise the variance in the advantage given to A. Conscious avoidance of detection could be one reason to reduce the dispersion in the relative grade improvement of A.

Another complication when trying to distinguish the causes of the grade difference is that ability may affect performance differentially across courses. An extreme case is if the same trait improves outcomes in one course, but worsens them in another, for example lateral thinking is beneficial in a creative course, but may harm performance when the main requirement is to follow rules and procedures. To better distinguish the types of discrimination, the variation in the group difference in average grades should be compared across similar courses. The ability-based explanation results in more similar grade differences between more closely related courses. Again, if graders in similar courses vary less in their bias than graders in unrelated fields, then distinguishing the types of discrimination is more difficult.
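The proposed comparison of grade gaps across courses can be sketched in a simulation with two stylised scenarios. All numbers are hypothetical: abilities are standard normal before admission, grades are ability plus noise, the admissions scenario uses a stricter cutoff for A, and the grading scenario gives A a grader-specific bonus that varies across courses.

```python
import numpy as np

rng = np.random.default_rng(0)
n_courses, n_students = 50, 200  # hypothetical numbers per group and course

def admitted(cutoff, n):
    # Abilities of n admitted students: standard normal draws above the cutoff.
    draws = rng.normal(size=20 * n)
    return draws[draws > cutoff][:n]

def grade_gaps(cutoff_a, cutoff_b, bias_mean, bias_sd):
    gaps = []
    for _ in range(n_courses):
        bias = rng.normal(bias_mean, bias_sd)  # this grader's bonus for group A
        grades_a = admitted(cutoff_a, n_students) + rng.normal(size=n_students) + bias
        grades_b = admitted(cutoff_b, n_students) + rng.normal(size=n_students)
        gaps.append(grades_a.mean() - grades_b.mean())
    return np.array(gaps)

# Scenario 1: stricter admissions cutoff for A, no grading bias.
gaps_admissions = grade_gaps(0.5, 0.0, 0.0, 0.0)
# Scenario 2: equal cutoffs, grader-specific bias in favour of A.
gaps_grading = grade_gaps(0.0, 0.0, 0.35, 0.5)

for name, gaps in [("admissions", gaps_admissions), ("grading", gaps_grading)]:
    print(f"{name}: mean gap {gaps.mean():.2f}, sd across courses {gaps.std(ddof=1):.2f}")
```

Both scenarios produce a positive average gap in favour of A, but the dispersion of the gap across courses is larger under grading discrimination. Setting the bias standard deviation near zero mimics the grading "cartel" and makes the two scenarios hard to tell apart.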

Easier combining of entertainment and work may explain increased income inequality

Many low-skill jobs (guard, driver, janitor, manual labourer) permit on-the-job consumption of forms of entertainment (listening to music or news, phoning friends) that became much cheaper and more available with the introduction of new electronic devices (first small radios, then TVs, then cellphones, smartphones). Such entertainment does not reduce productivity at the abovementioned jobs much, which is why it is allowed. On the other hand, many high-skill jobs (planning, communicating, performing surgery) are difficult to combine with any entertainment, because the distraction would decrease productivity significantly. The utility of low-skill work thus increased relatively more than that of skilled jobs when electronics spread and cheapened. The higher utility made low-skill jobs relatively more attractive, so the supply of labour at these increased relatively more. This supply rise reduced the pay relative to high-skill jobs, which increased income inequality. Another way to describe this mechanism is that as the disutility of low-skill jobs fell, so did the real wage required to compensate people for this disutility.

An empirically testable implication of this theory is that jobs of any skill level that do not allow on-the-job entertainment should have seen salaries increase more than comparable jobs which can be combined with listening to music or with personal phone calls. For example, a janitor cleaning an empty building can make personal calls, but a cleaner of a mall (or other public venue) during business hours may be more restricted. Both can listen to music on their headphones, so the salaries should not have diverged when small cassette players went mainstream, but should have diverged when cellphones with headsets became cheap. Similarly, a trucker or nightwatchman has more entertainment options than a taxi driver or mall security guard, because the latter do not want to annoy customers with personal calls or loud music. A call centre operator is more restricted from audiovisual entertainment than a receptionist.

According to the above theory, the introduction of radios and cellphones should have increased the wage inequality between areas with good and bad reception, for example between remote rural and urban regions, or between underground and aboveground mining. On the other hand, the introduction of recorded music should not have increased these inequalities as much, because the availability of records is more similar across regions than radio or phone coverage.

Laplace’s principle of indifference makes history useless

Model the universe in discrete time with only one variable, which can take values 0 and 1. The history of the universe up to time t is a vector of length t consisting of zeroes and ones. A deterministic universe is a fixed sequence. A random universe is like drawing the next value (0 or 1) according to some probability distribution every period, where the probabilities can be arbitrary and depend in arbitrary ways on the past history.

The prior distribution over deterministic universes is a distribution over sequences of zeroes and ones. The prior determines which sets are generic. I will assume the prior with the maximum entropy, which is uniform (all paths of the universe are equally likely). This follows from Laplace’s principle of indifference, because there is no information about the distribution over universes that would make one universe more likely than another. The set of infinite sequences of zeroes and ones is bijective with the interval [0,1], so a uniform distribution on it makes sense.

After observing the history up to time t, one can reject all paths of the universe that would have led to a different history. For a uniform prior, any history is equally likely to be followed by 0 or 1. The prediction of the next value of the variable is the same after every history, so knowing the history is useless for decision-making.

Many other priors besides uniform on all sequences yield the same result. For example, uniform restricted to the support consisting of sequences that are eventually constant. There is a countable set of such sequences, so the prior is improper uniform. A uniform distribution restricted to sequences that are eventually periodic, or that in the limit have equal frequency of 1 and 0 also works.

Having more variables, more values of these variables or making time continuous does not change the result. A random universe can be modelled as deterministic with extra variables. These extras can for example be the probability of drawing 1 next period after a given history.

Predicting the probability distribution of the next value of the variable is easy, because the probability of 1 is always one-half. Knowing the history is no help for this either.
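The claim that every history is followed by 0 or 1 with equal probability under the uniform prior can be verified by brute force on a finite horizon. The sketch below truncates the universe to paths of length 5 (an assumption made only so that all paths can be enumerated) and conditions on each possible history of length 3.

```python
from fractions import Fraction
from itertools import product

T = 5  # total length of each universe path (kept small so all paths can be enumerated)
t = 3  # length of the observed history

paths = list(product((0, 1), repeat=T))  # uniform prior: each path has weight 1 / 2**T

def prob_next_is_one(history):
    # Keep only paths consistent with the history, then count those with 1 next.
    consistent = [p for p in paths if p[:len(history)] == history]
    ones = sum(1 for p in consistent if p[len(history)] == 1)
    return Fraction(ones, len(consistent))

predictions = {h: prob_next_is_one(h) for h in product((0, 1), repeat=t)}
for history, prob in predictions.items():
    print(history, prob)
```

Every history yields the same conditional probability of one-half, so conditioning on the past does not change the forecast.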