Tag Archives: econometrics

Distinguishing discrimination in admissions from the opposite discrimination in grading

There are at least two potential explanations for why students from group A get a statistically significantly higher average grade in the same course than those from group B. The first is discrimination against A in admissions: if members of A face a stricter ability cutoff to be accepted at the institution, then conditional on being accepted, they have higher average ability. One form of a stricter ability cutoff is requiring a higher score from members of A, provided admissions test scores are positively correlated with ability.

The second explanation is discrimination in favour of group A in grading: students from A are given better grades for the same work. One way to distinguish this from admissions discrimination against A is to compare the relative grades of groups A and B across courses. If the difference in average grades is due to ability, then it should be quite stable across courses, whereas a difference coming from grading standards varies with each grader’s bias in favour of A.
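
As a rough illustration of this test, here is a small Python simulation (all parameters are invented for the example). It generates grades under the two stories and compares how much the A-minus-B gap in average grades varies across courses.

# Minimal simulation sketch: is the A-minus-B grade gap stable across courses
# (ability story) or does it vary with each grader's bias (grading story)?
# All numbers are illustrative assumptions, not estimates.
import numpy as np

rng = np.random.default_rng(0)
n_students, n_courses = 200, 20

def course_gaps(ability_gap, bias_sd):
    """Per-course difference in average grades, group A minus group B."""
    ability_a = rng.normal(ability_gap, 1, n_students)  # stricter admissions cutoff raises A's mean ability
    ability_b = rng.normal(0, 1, n_students)
    gaps = []
    for _ in range(n_courses):
        bias = rng.normal(0, bias_sd)                    # this course's grader bias in favour of A
        grades_a = ability_a + bias + rng.normal(0, 1, n_students)
        grades_b = ability_b + rng.normal(0, 1, n_students)
        gaps.append(grades_a.mean() - grades_b.mean())
    return np.array(gaps)

ability_story = course_gaps(ability_gap=0.5, bias_sd=0.0)
grading_story = course_gaps(ability_gap=0.0, bias_sd=0.5)
print("sd of the gap across courses, ability story:", round(ability_story.std(), 3))
print("sd of the gap across courses, grading story:", round(grading_story.std(), 3))

Under the ability story the gap varies across courses only through sampling noise, while under the grading story its spread reflects the dispersion of grader biases. If graders coordinate on a common bias, bias_sd shrinks towards zero and the two stories become hard to tell apart, as discussed below.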

Of course, there is no sharp threshold for how much the relative grades of group A should vary across courses under grading discrimination, as opposed to admissions bias. Only statistical conclusions can be drawn about the relative importance of the two opposing mechanisms driving the grade difference. The distinction is more difficult to make when there is a “cartel” in grading discrimination, so that all graders try to boost group A by the same amount, i.e. minimise the variance in the advantage given to A. Conscious avoidance of detection could be one reason to reduce the dispersion in the relative grade improvement of A.

Another complication when trying to distinguish the causes of the grade difference is that ability may affect performance differently across courses. An extreme case is when the same trait improves outcomes in one course but worsens them in another: for example, lateral thinking is beneficial in a creative course, but may harm performance when the main requirement is to follow rules and procedures. To better distinguish the types of discrimination, the variation in the group difference in average grades should be compared across similar courses. The ability-based explanation implies more similar grade differences between more closely related courses. Again, if graders in similar courses vary less in their bias than graders in unrelated fields, then distinguishing the types of discrimination is more difficult.

Easier combining of entertainment and work may explain increased income inequality

Many low-skill jobs (guard, driver, janitor, manual labourer) permit on-the-job consumption of forms of entertainment (listening to music or news, phoning friends) that became much cheaper and more available with the introduction of new electronic devices (first small radios, then TVs, then cellphones and smartphones). Such entertainment does not reduce productivity at the abovementioned jobs much, which is why it is allowed. On the other hand, many high-skill jobs (planning, communicating, performing surgery) are difficult to combine with any entertainment, because the distraction would decrease productivity significantly. The utility of low-skill work thus increased relatively more than that of skilled jobs when electronics spread and cheapened. The higher utility made low-skill jobs relatively more attractive, so the supply of labour to these jobs increased relatively more. This rise in supply reduced their pay relative to high-skill jobs, which increased income inequality. Another way to describe this mechanism is that as the disutility of low-skill jobs fell, so did the real wage required to compensate people for this disutility.
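
A minimal sketch of this mechanism, with invented linear supply and demand curves: workers value a low-skill job at its wage plus the entertainment value e it allows, so a rise in e shifts labour supply out and lowers the market-clearing wage.

# Sketch of the compensating-differential logic. Functional forms and numbers
# are illustrative assumptions only.
def equilibrium_wage(entertainment_value, demand_intercept=100.0, demand_slope=2.0, supply_slope=3.0):
    # supply: L_s = supply_slope * (wage + entertainment_value)
    # demand: L_d = demand_intercept - demand_slope * wage
    # setting L_s = L_d and solving for the wage:
    return (demand_intercept - supply_slope * entertainment_value) / (demand_slope + supply_slope)

for e in [0.0, 5.0, 10.0]:
    print(f"entertainment value {e:4.1f}: equilibrium wage {equilibrium_wage(e):.2f}")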

An empirically testable implication of this theory is that jobs of any skill level that do not allow on-the-job entertainment should have seen salaries increase more than comparable jobs which can be combined with listening to music or with personal phone calls. For example, a janitor cleaning an empty building can make personal calls, but a cleaner of a mall (or other public venue) during business hours may be more restricted. Both can listen to music on their headphones, so the salaries should not have diverged when small cassette players went mainstream, but should have diverged when cellphones with headsets became cheap. Similarly, a trucker or nightwatchman has more entertainment options than a taxi driver or mall security guard, because the latter do not want to annoy customers with personal calls or loud music. A call centre operator is more restricted from audiovisual entertainment than a receptionist.

According to the above theory, the introduction of radios and cellphones should have increased the wage inequality between areas with good and bad reception, for example between remote rural and urban regions, or between underground and aboveground mining. On the other hand, the introduction of recorded music should not have increased these inequalities as much, because the availability of records is more similar across regions than radio or phone coverage.

Laplace’s principle of indifference makes history useless

Model the universe in discrete time with only one variable, which can take values 0 and 1. The history of the universe up to time t is a vector of length t consisting of zeroes and ones. A deterministic universe is a fixed sequence. A random universe is like drawing the next value (0 or 1) according to some probability distribution every period, where the probabilities can be arbitrary and depend in arbitrary ways on the past history.
The prior distribution over deterministic universes is a distribution over sequences of zeroes and ones. The prior determines which sets are generic. I will assume the prior with the maximum entropy, which is uniform (all paths of the universe are equally likely). This follows from Laplace’s principle of indifference, because there is no information about the distribution over universes that would make one universe more likely than another. The set of infinite sequences of zeroes and ones is bijective with the interval [0,1], so a uniform distribution on it makes sense.
After observing the history up to time t, one can reject all paths of the universe that would have led to a different history. For a uniform prior, any history is equally likely to be followed by 0 or 1. The prediction of the next value of the variable is the same after every history, so knowing the history is useless for decision-making.
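This can be checked by brute force. The Python sketch below (the horizon of 12 periods is arbitrary) enumerates all paths consistent with a given history under a uniform prior and computes the probability that the next value is 1; it is one-half after every history.

# Uniform prior over all 0/1 paths of a fixed length: the predicted probability
# that the next value is 1 is 1/2 after any observed history. A brute-force sketch.
from itertools import product

def prob_next_is_one(history, horizon=12):
    t = len(history)
    consistent = [p for p in product((0, 1), repeat=horizon) if list(p[:t]) == list(history)]
    return sum(p[t] for p in consistent) / len(consistent)

print(prob_next_is_one([1, 1, 1, 1, 1]))   # 0.5
print(prob_next_is_one([0, 1, 0, 1, 0]))   # 0.5
print(prob_next_is_one([0, 0, 0, 0, 0]))   # 0.5
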
Many other priors besides the uniform on all sequences yield the same result, for example a uniform prior restricted to the support consisting of sequences that are eventually constant. There are countably many such sequences, so this uniform prior is improper. A uniform distribution restricted to sequences that are eventually periodic, or that in the limit have equal frequencies of 1 and 0, also works.
Having more variables, more values of these variables or making time continuous does not change the result. A random universe can be modelled as deterministic with extra variables. These extras can for example be the probability of drawing 1 next period after a given history.
Predicting the probability distribution of the next value of the variable is easy, because the probability of 1 is always one-half. Knowing the history is no help for this either.

Defence against bullying

Humans are social animals. For evolutionary reasons, they feel bad when their social group excludes, bullies or opposes them. Physical bullying and theft or vandalism of possessions have real consequences and cannot be countered purely in the mind. However, the real consequences are usually provable to the authorities, which makes it easier to punish the bullies and demand compensation. Psychological reasons may prevent the victim from asking the authorities to help. Verbal bullying has an effect only via psychology, because vibrations of air from the larynx or written symbols cannot hurt a human physically.

One psychological defence is diversification of group memberships. The goal is to prevent exclusion from most of one’s social network. If a person belongs to only one group in society, then losing the support of its members feels very significant. Being part of many circles means that exclusion from one group can be immediately compensated by spending more time in others.

Bullies instinctively understand that their victims can strengthen themselves by diversifying their connections, so bullies try to cut a victim’s other social ties. Abusers who beat family members forbid them from having other friends or going to social events. School bullies mock a victim’s friends to drive them away and weaken the victim’s connection to them. Dictators create paranoia against foreigners, accusing them of spying and sabotage.

When a person has already been excluded from most of their social network, joining new groups or lobbying for readmission to old ones may be hard. People prefer to interact with those who display positive emotions. The negative emotions caused by a feeling of abandonment make it difficult to present a happy and fun image to others. Also, if the “admission committee” knows that a candidate to join their group has no other options, then they are likely to be more demanding, in terms of requiring favours or conformity to the group norms. Bargaining power depends on what each side gets when the negotiations break down – the better the outside option, the stronger the bargaining position. It is thus helpful to prepare for potential future exclusion in advance by joining many groups. Diversifying one’s memberships before the alternative groups become necessary is insurance. One should keep one’s options open, which argues for living in a bigger city, exploring different cultures both online and in the real world, and not burning bridges with people who at some point excluded or otherwise acted against one.

There may be a case for forgiving bullies if they take enough nice actions to compensate. Apologetic words alone do not cancel actions, as discussed elsewhere (http://sanderheinsalu.com/ajaveeb/?p=556). Forgiving does not mean forgetting, because past behaviour is informative about future actions, and social interactions are a dynamic game. The entire sharing economy (carsharing, home-renting) is made possible by having people’s reputations follow them even if they try to escape the consequences of their past deeds. The difficulty of evading consequences motivates better behaviour. The same holds in social interactions. In the long run, it is better for everyone, except perhaps the worst people, if past deeds are rewarded or punished as they deserve. If bullying is not punished, then the perpetrators learn this and intensify their oppression in the future.

Of course, the bullies may try to punish those who reported them to the authorities. The threat to retaliate against whistleblowers shows fear of punishment, because people who do not care about the consequences would not bother threatening. The whistleblower can in turn threaten the bullies with reporting to the authorities if the bullies punish the original whistleblowing. The bullies can threaten to punish this second report, and the whistleblower threaten to punish the bullies’ second retaliation, etc. The bullying and reporting is a repeated interaction and has multiple equilibria. One equilibrium is that the bullies rule, therefore nobody dares to report them, and due to not being reported, they continue to rule. Another equilibrium is that any bullying is swiftly reported and punished, so the bullies do not even dare to start the bullying-reporting-retaliation cycle. The bullies rationally try to push the interaction towards the equilibrium where they rule. Victims and goodhearted bystanders should realise this and work towards the other equilibrium by immediately reporting any bullying against anyone, not just oneself.
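
A stylised way to see the multiplicity is to treat reporting as a coordination game among bystanders. In the sketch below the payoffs are invented and the authority is assumed to act only if everyone reports; both “nobody reports” and “everybody reports” survive a unilateral-deviation check.

# Coordination-game sketch with illustrative payoffs: reporting costs c, and the
# bullying is punished (benefit b to everyone) only if all n bystanders report.
n, k, b, c = 5, 5, 10.0, 1.0   # bystanders, reports needed, benefit of stopping the bullying, cost of reporting

def payoff(i_reports, others_reporting):
    reports = others_reporting + (1 if i_reports else 0)
    return (b if reports >= k else 0.0) - (c if i_reports else 0.0)

def is_equilibrium(everyone_reports):
    others = (n - 1) if everyone_reports else 0
    # no single player should gain by deviating unilaterally
    return payoff(everyone_reports, others) >= payoff(not everyone_reports, others)

print("nobody reports is an equilibrium:   ", is_equilibrium(False))  # True: a lone report changes nothing and costs c
print("everybody reports is an equilibrium:", is_equilibrium(True))   # True: dropping out saves c but lets the bullying continue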

To prevent insults from creating negative emotions, one should remember that the opinion of only a few other people at one point in time contains little information. Feedback is useful for improving oneself, and insults are a kind of feedback, but a more accurate measure of one’s capabilities is usually available. This takes the form of numerical performance indicators at work, studies, sports and various other tests in life. If people’s opinions are taken as feedback, then one should endeavour to survey a statistically meaningful sample of these opinions. The sample should be large and representative of society – the people surveyed should belong to many different groups.

If some people repeatedly insult one, then one should remember that the meaning of sounds or symbols that people produce (called language) is a social norm. If the society agrees on a different meaning for a given sound, then that sound starts to mean what the people agreed. Meaning is endogenous – it depends on how people choose to use language. On an individual level, if a person consistently mispronounces a word, then others learn what that unusual sound from that person means. Small groups can form their own slang, using words to denote meanings differently from the rest of society. Applying this insight to bullying, if others frequently use an insulting word to refer to a person, then that word starts to mean that person, not the negative thing that it originally meant. So one should not interpret an insulting word in a way that makes one feel bad. The actual meaning is neutral, just the “name” of a particular individual in the subgroup of bullies. Of course, in future interactions one should not forget the insulters’ attempt to make one feel bad.

To learn the real meaning of a word, as used by a specific person, one should Bayes update based on the connection of that person’s words and actions. This also helps in understanding politics. If transfers from the rich to the poor are called “help to the needy” by one party and “welfare” by another, then these phrases by the respective parties should be interpreted as “transfers from the rich to the poor”. If a politician frequently says the opposite of the truth, then his or her statements should be flipped (negation inserted) to derive their real meaning. Bayesian updating also explains why verbal apologies are usually nothing compared to actions.

Practising acting in a drama club helps to understand that words often do not have content. Their effect is just in people’s minds. Mock confrontations in a play will train a person to handle real disputes.

Learning takes time and practice, including learning how to defend against bullying and ignore insults. Successfully resisting will train one to resist better. Dealing with adversity is sometimes called “building character”. To deliberately train oneself to ignore insults, one may organise an insult competition – if the insulted person reacts emotionally, then the insulter wins, otherwise the insulter loses. As with any training and competition, the difficulty level should be adjusted for ability and experience.

The current trend towards protecting children from even verbal bullying, and preventing undergraduate students from hearing statements that may distress them, could backfire. If they are not trained to resist bullying and then experience it at some point in their lives, which seems likely, then they may be depressed for a long time or overreact to trivial insults. The analogy is living in an environment with too few microbes, which does not build immunity and causes allergies. “Safe spaces” and using only mild words are like disinfecting everything.

The bullies themselves are human, thus social animals, and feel negative emotions when excluded or ignored. If there are many victims and few bullies, then the victims should band together and exclude the bullies in turn. One force preventing this is that the victims see the bullies as the “cool kids” (attractive, rich, strong) and want their approval. The victims see other victims as “losers” or “outsiders” and help victimise them, and the other victims respond in kind. The outsiders do not understand that what counts as “cool” is often a social norm. If the majority thinks behaviour, clothing or slang A is cool, then A is cool, but if the majority agrees on B, then B is preferred. The outsiders face a coordination game: if they could agree on a new social norm, then, because they outnumber the insiders, the new norm would spread. The outsiders would become the “cool kids” themselves, and the previously cool insiders would become the excluded outsiders.

Finding new friends helps increase the number who spread one’s preferred norms, as well as insuring against future exclusion by any subset of one’s acquaintances.

If there are many people to choose from when forming new connections, then the links should be chosen strategically. People imitate their peers, so choosing those with good habits as one’s friends helps one acquire these habits oneself. Having friends who exercise, study and have a good work ethic increases one’s future fitness, education and professional success. Criminal, smoking, racist friends nudge one towards similar behaviours and values. Choosing friends is thus a game with one’s future self. The goal is to direct the future self to a path preferred by the current self. The future self in turn directs its future selves. It takes time and effort to replace one’s friends, so there is a switching cost in one’s social network choice. A bad decision in the past may have an impact for a long time.

It may be difficult to determine who is a good person and who is not. Forming a social connection and subtly testing a person may be the only way to find out their true face. For example, telling them a fake secret and asking them not to tell anyone, then observing whether the information leaks. One should watch how one’s friends behave towards others, not just oneself. There is a tradeoff between learning about more people and interacting with only good people. The more connections one forms, the greater the likelihood that some are with bad people, but the more one learns. This is strategic experimentation in a dynamic environment.

Measuring a person’s contribution to society

Sometimes it is debated whether one profession or person contributes more to society than another, for example whether a scientist is more valuable than a doctor. There are many dimensions to any job. One could compare the small and probabilistic contribution that a scientist makes to many people’s lives with the large and visible influence of a doctor on a few patients’ wellbeing. These debates can to some extent be avoided, because a simple measure of a person’s contribution to society is their income. It is an imperfect measure, as are all measures, but it is an easily obtained baseline from which to start. If the people compared are numerous, un-cartelized and employed by numerous competitive employers, then their pay equals their marginal productivity, as explained in introductory economics.

People are usually employed by one firm at a time, and full-time non-overtime work is the most common, so the employers can be thought of as buying one “full-time unit” of labour from each worker. The marginal productivity equals the total productivity in the case where only one or zero units can be supplied. So the salary equals the total productivity at work.
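
As a toy numerical check (the production function below is made up), the marginal product of a full-time worker is simply output with her minus output without her, which is also her total contribution.

# With indivisible full-time labour, marginal product = discrete change in output
# from employing one more worker. The production function is illustrative.
def output(workers):
    return 10.0 * workers ** 0.5   # made-up decreasing-returns production function

employed = 16
marginal_product = output(employed) - output(employed - 1)
print(f"marginal (= total) product of the {employed}th worker: {marginal_product:.2f}")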

Income from savings in a competitive capital market equals the value provided to the borrower of those savings. If the savings are to some extent inherited or obtained from gifts, then the interest income is to that extent due to someone else’s past productivity. Then income is greater than the contribution to society.

Other reasons why income may be a biased measure are negative externalities (criminal income measures harm to others), positive externalities (scientists help future generations, but don’t get paid for it), market power (teachers, police, social workers employed by monopsonist government get paid less than their value), transaction costs (changing a job is a hassle for the employer and the employee alike) and incomplete information (hard to measure job performance, so good workers underpaid and bad overpaid on average). In short, all the market failures covered in introductory economics.

If the income difference is large and the quantitative effect of the market failures is similar (neither person is a criminal, both work for employers whose competitive situations are alike, little inheritance), then the productivity difference is likely to be in the same direction as the salary difference. If the salary difference is small and the jobs are otherwise similar, the contribution to society is likely similar, so ranking their productivity is not that important. Comparison of people whose labour markets have different failures to a different extent is difficult.

Local and organic food is wasteful

The easiest measure of any good’s environmental impact is its price. It is not a perfect measure. Subsidies for the inputs of a product can lower its price below more environmentally friendly alternatives that are not favoured by the government. Taxes, market power, externalities and incomplete information can similarly distort relative prices, as introductory economics courses explain. However, absent additional data, a more expensive good likely requires more resources and causes more environmental damage. Remembering this saves time on debating whether local non-organic is better than non-local organic fair trade, etc.

Local and organic are marketing terms, one suggesting helping local farmers and a lower environmental impact from transport, the other claiming health benefits and a lower environmental impact from fertilizers. Organic food may use less of some category of chemicals, but this must have a tradeoff in lower yield (more land used per unit produced) or greater use of some other input, because its higher price shows more resource use overall. From the (limited) research I have read, there is no difference in the health effects of organic and non-organic food. To measure this difference, a selection bias must be taken into account – the people using organic are more health-conscious, so may be healthier to start with. On the other hand, those buying organic and local may be more manipulable, which has unknown health effects. Local food may use fewer resources for transport, but its higher price shows it uses more resources in total. One resource is the more expensive labour of rich countries (the people providing this labour consume more, thus have a greater environmental impact).

If one wants to help “local farmers” (usually large agribusinesses, not the family farms their lobbying suggests), one can give them money directly. No need to buy their goods, just make them a bank transfer and then buy whichever product is the least wasteful.

There are economies of scale in farming, so the more efficient large agricultural companies tend to outcompete family farms. The greater efficiency is also more environmentally friendly: more production for the same resources, or the same production with less. Helping the small farms avoid takeover is bad for the environment.

Fair trade and sustainable sourcing may be good things, if the rules for obtaining this classification are reasonable and enforced. But who buying fair trade or sustainable has actually checked what the meaning behind the labels is (the “fine print”), or verified with independent auditors whether the nice-sounding principles are put into practice? When a term is used in marketing, I suspect business as usual behind it.

Statistics with a single history

Only one history is observable to a person – the one that actually happened. Counterfactuals are speculation about what would have happened if choices or some other element of the past history had differed. Only one history is observable to humanity as a whole, to all thinking beings in the universe as a whole, etc. This raises the question of how to do statistics with a single history.

The history is chopped into small pieces, which are assumed similar to each other and to future pieces of history. All conclusions require assumptions. In the case of statistics, the main assumption is “what happened in the past, will continue to happen in the future.” The “what” that is happening can be complicated – a long chaotic pattern can be repeated. It should be specified what the patterns of history consist of before discussing them.

The history observable to a brain consists of the sensory inputs and memory. Nothing else is accessible. This is pointed out by the “brain in a jar” thought experiment. Memory is partly past sensory inputs, but may also depend on spontaneous changes in the brain. Machinery can translate previously unobservable aspects of the world into accessible sensory inputs, for example convert infrared and ultraviolet light into visible wavelengths. Formally, history is a function from time to vectors of sensory inputs.

The brain has a built-in ability to classify sensory inputs by type – visual, auditory, etc. This is why the inputs form a vector. For a given sense, there is a built-in “similarity function” that enables comparing inputs from the same sense at different times.

Inputs distinguished by one person, perhaps with the help of machinery, may look identical to another person. The interpretation is that there are underlying physical quantities that must differ by more than the “just noticeable difference” to be perceived as different. The brain can access physical quantities only through the senses, so whether there is a “real world” cannot be determined, only assumed. If most people’s perceptions agree about something, and machinery also agrees (e.g. measuring tape does not agree with visual illusions), then this “something” is called real and physical. The history accessible to humanity as a whole is a function from time to the concatenation of their sensory input vectors.

The similarity functions of people can also be aggregated, compared to machinery and the result interpreted as a physical quantity taking “similar” values at different times.

A set of finite sequences of vectors of sensory inputs is what I call a pattern of history. For example, a pattern can be a single sequence or everything but a given sequence. Patterns may repeat, due to the indistinguishability of physical quantities close to each other. The finer distinctions one can make, the fewer the instances with the same perception. In the limit of perfect discrimination of all variable values, history is unlikely to ever repeat. In the limit of no perception at all, history is one long repetition of nothing happening. The similarity of patterns is defined based on the similarity function in the brain.

Repeated similar patterns together with assumptions enable learning and prediction. If AB is always followed by C, then learning is easy. Statistics are needed when this is not the case. If half the past instances of AB are followed by C, half by D, then one way to interpret this is by constructing a state space with a probability distribution on it. For example, one may assume the existence of an unperceived variable that can take values c,d and assume that ABc leads deterministically to ABC and ABd to ABD. The past instances of AB can be interpreted as split into equal numbers of ABc and ABd. The prediction after observing AB is equal probabilities of C and D. This is a frequentist setup.
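
A small sketch of this frequentist pattern-counting (the toy history is invented): the forecast after a pattern is the empirical frequency of whatever followed its past occurrences.

# Predict the next symbol from a single history by counting what followed
# past occurrences of the most recent pattern. A sketch with a toy history.
from collections import Counter

def predict_after(history, pattern):
    k = len(pattern)
    followers = Counter(history[i + k] for i in range(len(history) - k)
                        if history[i:i + k] == pattern)
    total = sum(followers.values())
    return {symbol: count / total for symbol, count in followers.items()} if total else {}

history = "ABCABDABCABD"
print(predict_after(history, "AB"))   # {'C': 0.5, 'D': 0.5}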

A Bayesian interpretation puts a prior probability distribution on histories and updates it based on the observations. The prior may put probability one on a single future history after each past one. Such a deterministic prediction is easily falsified – one observation contrary to it suffices. Usually, many future histories are assumed to have positive probability. Updating requires conditional probabilities of future histories given the past. The histories that repeat past patterns are usually given higher probability than others. Such a conditional probability system embodies the assumption “what happened in the past, will continue to happen in the future.”

There is a tradeoff between the length of a pattern and the number of times it has repeated. Longer patterns permit prediction further into the future, but fewer repetitions mean more uncertainty. Much research in statistics has gone into finding the optimal pattern length given the data. A long pattern contains many shorter ones, with potentially different predictions. Combining information from different pattern lengths is also a research area. Again, assumptions determine which pattern length and combination is optimal. Assumptions can be tested, but only under other assumptions.
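
The tradeoff can be made concrete by counting, in a single toy history, how often the most recent pattern of each length has occurred before; longer patterns have fewer past repetitions to learn from. A sketch with an invented history.

# Count past occurrences of the most recent pattern of each length in one history.
def repetitions_by_length(history, max_len=6):
    counts = {}
    for k in range(1, max_len + 1):
        recent = history[-k:]
        counts[k] = sum(history[i:i + k] == recent for i in range(len(history) - k))
    return counts

history = "ABCABDABCABDABCAB"
for length, reps in repetitions_by_length(history).items():
    print(f"length {length}: most recent pattern {history[-length:]!r} seen {reps} times before")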

Causality is also a mental construct. It is based on past repetitions of an AB-like pattern, without occurrence of BA or CB-like patterns.

The perception of time is created by sensory inputs and memory, e.g. seeing light and darkness alternate, feeling sleepy or alert due to the circadian rhythm and remembering that this has happened before. History is thus a mental construct. It relies on the assumptions that time exists, there is a past in which things happened and current recall is correlated with what actually happened. The preceding discussion should be restated without assuming time exists.


Bayesian vs frequentist statistics – how to decide?

Which predicts better, Bayesian or frequentist statistics? This is an empirical question. To find out, should we compare their predictions to the data using Bayesian or frequentist statistics? What if Bayesian statistics says frequentist is better and frequentist says Bayesian is better (Liar’s paradox)? To find the best method for measuring the quality of the predictions, should we use Bayesianism or frequentism? And to find the best method to find the best method for comparing predictions to data? How to decide how to decide how to decide, as in Lipman (1991)?

Theory and data both needed for prediction

Clearly, data is required for prediction. Theory only says: “If this, then that.” It connects assumptions and conclusions. Data tells whether the assumptions are true. It allows the theory to be applied.
Theory is also required for prediction, although that is less obvious. For example, after observing a variable taking the value 1 a million times, what is the prediction for the next realization of the variable? Under the theory that the variable is constant, the next value is predicted to be 1. If the theory says there are a million 1s followed by a million 0s followed by a million 1s and so on, then the next value is 0. This theory may sound more complicated than the other, but prediction is concerned with correctness, not complexity. Also, the simplicity of a theory is a slippery concept – see the “grue-bleen example” in philosophy.
The constant sequence may sound like a more “natural” theory, but actually both the “natural” and the correct theory depend on where the data comes from. For example, the data may be generated by measuring whether it is day or night every millisecond. Day=1, night=0. Then a theory that a large number of 1s is followed by a large number of 0s, and so on, is more natural and correct than the theory that the sequence is constant.
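A small sketch of the point: both invented theories below fit a history of a million 1s exactly, yet they predict different next values, so the data alone cannot choose between them.

# Two theories consistent with the same data but disagreeing about the future.
N = 1_000_000
history = [1] * N

def constant_theory(t):
    return 1                                  # the variable is always 1

def block_theory(t, block=N):
    return 1 if (t // block) % 2 == 0 else 0  # a million 1s, then a million 0s, repeating

assert all(constant_theory(t) == history[t] for t in range(N))   # fits the data
assert all(block_theory(t) == history[t] for t in range(N))      # also fits the data
print("constant theory predicts next value:", constant_theory(N))  # 1
print("block theory predicts next value:", block_theory(N))        # 0
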
Sometimes the theory is so simple that it is not noticed, like when forecasting a constant sequence. Which is more important for prediction, theory or data? Both equally, because the lack of either makes prediction impossible. If the situation is simple, then theorists may not be necessary, but theory still is.

Evaluating the truth and the experts simultaneously

When evaluating an artwork, the guilt of a suspect or the quality of theoretical research, the usual procedure is to gather the opinions of a number of people and take some weighted average of these. There is no objective measure of the truth or the quality of the work. What weights should be assigned to different people’s opinions? Who should be counted an expert or knowledgeable witness?
A circular problem appears: the accurate witnesses are those who are close to the truth, and the truth is close to the average claim of the accurate witnesses. This can be modelled as a set of signals with unknown precision. Suppose the signals are normally distributed with mean equal to the truth (witnesses unbiased, just have poor memories). If the precisions were known, then these could be used as weights in the weighted average of the witness opinions, which would be an unbiased estimate of the truth with minimal variance. If the truth were known, then the distance of the opinion of a witness from it would measure the accuracy of that witness. But both precisions and the truth are unknown.
Simultaneously determining the precisions of the signals and the estimate of the truth may have many solutions. If there are two witnesses with different claims, we could assign the first witness infinite precision and the second finite, and estimate the truth to equal the opinion of the first witness. The truth is derived from the witnesses and the precisions are derived from the truth, so this is consistent. The same applies with witnesses switched.
A better solution takes a broader view and simultaneously estimates witness precisions and the truth. These form a vector of random variables. Put a prior probability distribution on this vector and use Bayes’ rule to update this distribution in response to the signals (the witness opinions).
The solution of course depends on the chosen prior. If one witness is assumed infinitely precise and the others finitely, then the updating rule keeps the infinite and finite precisions and estimates the truth to equal the opinion of the infinitely precise witness. The assumption of the prior seems unavoidable. At least it makes clear why the multiple solutions arise.
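
A minimal sketch of this joint updating (the grids, the uniform prior and the two witness reports are all invented): put a prior on a grid of possible truths and witness precisions, multiply by the normal likelihood of the reports, and read off the posterior mean of the truth.

# Joint Bayesian estimation of the truth and the witnesses' precisions on a grid.
# All grids, priors and reports are illustrative assumptions.
import itertools
import math

truths = [i / 10 for i in range(0, 101)]   # candidate values of the truth
precisions = [0.5, 2.0, 8.0]               # candidate witness precisions (1/variance)
reports = [3.0, 7.0]                       # the two witnesses' stated opinions

def likelihood(report, truth, precision):
    # normal density of the report around the truth with the given precision
    return math.sqrt(precision / (2 * math.pi)) * math.exp(-0.5 * precision * (report - truth) ** 2)

posterior = {}
for truth, p1, p2 in itertools.product(truths, precisions, precisions):
    prior = 1.0                            # uniform prior over the grid
    posterior[(truth, p1, p2)] = prior * likelihood(reports[0], truth, p1) * likelihood(reports[1], truth, p2)

total = sum(posterior.values())
posterior_mean_truth = sum(truth * weight for (truth, _, _), weight in posterior.items()) / total
print("posterior mean of the truth:", round(posterior_mean_truth, 2))

With a prior that treats the witnesses symmetrically, the posterior mean lands between the two reports; a prior that makes one witness far more precise pulls the estimate towards that witness’s opinion, which is the prior-dependence described above.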