# Laplace’s principle of indifference makes history useless

Model the universe in discrete time with only one variable, which can take values 0 and 1. The history of the universe up to time t is a vector of length t consisting of zeroes and ones. A deterministic universe is a fixed sequence. A random universe is like drawing the next value (0 or 1) according to some probability distribution every period, where the probabilities can be arbitrary and depend in arbitrary ways on the past history.
The prior distribution over deterministic universes is a distribution over sequences of zeroes and ones. The prior determines which sets are generic. I will assume the prior with the maximum entropy, which is uniform (all paths of the universe are equally likely). This follows from Laplace’s principle of indifference, because there is no information about the distribution over universes that would make one universe more likely than another. The set of infinite sequences of zeroes and ones is in bijection with the interval [0,1], so a uniform distribution on it makes sense.
After observing the history up to time t, one can reject all paths of the universe that would have led to a different history. For a uniform prior, any history is equally likely to be followed by 0 or 1. The prediction of the next value of the variable is the same after every history, so knowing the history is useless for decision-making.
Many other priors besides the uniform one on all sequences yield the same result, for example a uniform prior restricted to the support consisting of sequences that are eventually constant. There are countably many such sequences, so this prior is an improper uniform one. A uniform distribution restricted to sequences that are eventually periodic, or that in the limit have equal frequencies of 1 and 0, also works.
Adding more variables, allowing more values for these variables, or making time continuous does not change the result. A random universe can be modelled as deterministic with extra variables. These extras can for example be the probability of drawing 1 next period after a given history.
Predicting the probability distribution of the next value of the variable is easy, because the probability of 1 is always one-half. Knowing the history is no help for this either.
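The one-half prediction can be verified by brute force: under a uniform prior over all paths of a fixed horizon, condition on any observed history and count the continuations. A minimal Python sketch (the horizon length and the example histories are arbitrary illustrative choices):

```python
from itertools import product

def prob_next_is_one(history, horizon=10):
    """Under a uniform prior over all 0/1 paths of length `horizon`,
    compute the probability that the value after `history` is 1."""
    t = len(history)
    # All equally likely paths consistent with the observed history.
    consistent = [p for p in product((0, 1), repeat=horizon)
                  if p[:t] == tuple(history)]
    return sum(p[t] for p in consistent) / len(consistent)

# The prediction is 1/2 after every history, so history is uninformative.
print(prob_next_is_one([1, 1, 1, 1]))  # 0.5
print(prob_next_is_one([0, 1, 0, 1]))  # 0.5
```

Exactly half the consistent continuations place a 1 next, whatever the conditioning history, which is the sense in which knowing the history is useless.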

# Defence against bullying

Humans are social animals. For evolutionary reasons, they feel bad when their social group excludes, bullies or opposes them. Physical bullying and theft or vandalism of possessions have real consequences and cannot be countered purely in the mind. However, the real consequences are usually provable to the authorities, which makes it easier to punish the bullies and demand compensation. Psychological reasons may prevent the victim from asking the authorities to help. Verbal bullying has an effect only via psychology, because vibrations of air from the larynx or written symbols cannot hurt a human physically.

One psychological defence is diversification of group memberships. The goal is to prevent exclusion from most of one’s social network. If a person belongs to only one group in society, then losing the support of its members feels very significant. Being part of many circles means that exclusion from one group can be immediately compensated by spending more time in others.

Bullies instinctively understand that their victims can strengthen themselves by diversifying their connections, so bullies try to cut a victim’s other social ties. The beaters of family members forbid their family from having other friends or going to social events. School bullies mock a victim’s friends to drive them away and weaken the victim’s connection to them. Dictators create paranoia against foreigners, accusing them of spying and sabotage.

When a person has already been excluded from most of their social network, joining new groups or lobbying for readmission to old ones may be hard. People prefer to interact with those who display positive emotions. The negative emotions caused by a feeling of abandonment make it difficult to present a happy and fun image to others. Also, if the "admission committee" knows that a candidate to join their group has no other options, then they are likely to be more demanding, in terms of requiring favours or conformity to the group norms. Bargaining power depends on what each side gets when the negotiations break down – the better the outside option, the stronger the bargaining position. It is thus helpful to prepare for potential future exclusion in advance by joining many groups. Diversifying one’s memberships before the alternative groups become necessary is insurance. One should keep one’s options open, which argues for living in a bigger city, exploring different cultures both online and in the real world, and not burning bridges with people who at some point excluded or otherwise acted against one.

There may be a case for forgiving bullies if they take enough nice actions to compensate. Apologetic words alone do not cancel actions, as discussed elsewhere (http://sanderheinsalu.com/ajaveeb/?p=556). Forgiving does not mean forgetting, because past behaviour is informative about future actions, and social interactions are a dynamic game. The entire sharing economy (carsharing, home-renting) is made possible by having people’s reputations follow them even if they try to escape the consequences of their past deeds. The difficulty of evading consequences motivates better behaviour. The same holds in social interactions. In the long run, it is better for everyone, except perhaps the worst people, if past deeds are rewarded or punished as they deserve. If bullying is not punished, then the perpetrators learn this and intensify their oppression in the future.

Of course, the bullies may try to punish those who reported them to the authorities. The threat to retaliate against whistleblowers shows fear of punishment, because people who do not care about the consequences would not bother threatening. The whistleblower can in turn threaten the bullies with reporting to the authorities if the bullies punish the original whistleblowing. The bullies can threaten to punish this second report, and the whistleblower threaten to punish the bullies’ second retaliation, etc. The bullying and reporting is a repeated interaction and has multiple equilibria. One equilibrium is that the bullies rule, therefore nobody dares to report them, and due to not being reported, they continue to rule. Another equilibrium is that any bullying is swiftly reported and punished, so the bullies do not even dare to start the bullying-reporting-retaliation cycle. The bullies rationally try to push the interaction towards the equilibrium where they rule. Victims and goodhearted bystanders should realise this and work towards the other equilibrium by immediately reporting any bullying against anyone, not just oneself.

To prevent insults from creating negative emotions, one should remember that the opinion of only a few other people at one point in time contains little information. Feedback is useful for improving oneself, and insults are a kind of feedback, but a more accurate measure of one’s capabilities is usually available. This takes the form of numerical performance indicators at work, studies, sports and various other tests in life. If people’s opinions are taken as feedback, then one should endeavour to survey a statistically meaningful sample of these opinions. The sample should be large and representative of society – the people surveyed should belong to many different groups.

If some people repeatedly insult one, then one should remember that the meaning of sounds or symbols that people produce (called language) is a social norm. If the society agrees on a different meaning for a given sound, then that sound starts to mean what the people agreed. Meaning is endogenous – it depends on how people choose to use language. On an individual level, if a person consistently mispronounces a word, then others learn what that unusual sound from that person means. Small groups can form their own slang, using words to denote meanings differently from the rest of society. Applying this insight to bullying, if others frequently use an insulting word to refer to a person, then that word starts to mean that person, not the negative thing that it originally meant. So one should not interpret an insulting word in a way that makes one feel bad. The actual meaning is neutral, just the "name" of a particular individual in the subgroup of bullies. Of course, in future interactions one should not forget the insulters’ attempt to make one feel bad.

To learn the real meaning of a word, as used by a specific person, one should Bayes update based on the connection of that person’s words and actions. This also helps in understanding politics. If transfers from the rich to the poor are called "help to the needy" by one party and "welfare" by another, then these phrases by the respective parties should be interpreted as "transfers from the rich to the poor". If a politician frequently says the opposite of the truth, then his or her statements should be flipped (negation inserted) to derive their real meaning. Bayesian updating also explains why verbal apologies are usually nothing compared to actions.

Practising acting in a drama club helps one understand that words often do not have content. Their effect is just in people’s minds. Mock confrontations in a play train a person to handle real disputes.

Learning takes time and practice, including learning how to defend against bullying and ignore insults. Successfully resisting will train one to resist better. Dealing with adversity is sometimes called "building character". To deliberately train oneself to ignore insults, one may organise an insult competition – if the insulted person reacts emotionally, then the insulter wins, otherwise the insulter loses. As with any training and competition, the difficulty level should be adjusted for ability and experience.

The current trend towards protecting children from even verbal bullying, and preventing undergraduate students from hearing statements that may distress them, could backfire. If they are not trained to resist bullying, then when they experience it at some point in their life, which seems likely, they may be depressed for a long time or overreact to trivial insults. The analogy is living in an environment with too few microbes, which does not build immunity and causes allergies. "Safe spaces" and using only mild words are like disinfecting everything.

The bullies themselves are human, thus social animals, and feel negative emotions when excluded or ignored. If there are many victims and few bullies, then the victims should band together and exclude the bullies in turn. One force preventing this is that the victims see the bullies as the "cool kids" (attractive, rich, strong) and want their approval. The victims see other victims as "losers" or "outsiders" and help victimise them, and the other victims respond in kind. The outsiders do not understand that what counts as "cool" is often a social norm. If the majority thinks behaviour, clothing or slang A is cool, then A is cool, but if the majority agrees on B, then B is preferred. The outsiders face a coordination game: if they could agree on a new social norm, then their number being larger than the number of insiders would spread the new norm. The outsiders would become the "cool kids" themselves, and the previously cool insiders would become the excluded outsiders.

Finding new friends helps increase the number who spread one’s preferred norms, as well as insuring against future exclusion by any subset of one’s acquaintances.

If there are many people to choose from when forming new connections, then the links should be chosen strategically. People imitate their peers, so choosing those with good habits as one’s friends helps one acquire these habits oneself. Having friends who exercise, study and have a good work ethic increases one’s future fitness, education and professional success. Criminal, smoking, racist friends nudge one towards similar behaviours and values. Choosing friends is thus a game with one’s future self. The goal is to direct the future self to a path preferred by the current self. The future self in turn directs its future selves. It takes time and effort to replace one’s friends, so there is a switching cost in one’s social network choice. A bad decision in the past may have an impact for a long time.

It may be difficult to determine who is a good person and who is not. Forming a social connection and subtly testing a person may be the only way to find out their true face. For example, telling them a fake secret and asking them not to tell anyone, then observing whether the information leaks. One should watch how one’s friends behave towards others, not just oneself. There is a tradeoff between learning about more people and interacting with only good people. The more connections one forms, the greater the likelihood that some are with bad people, but the more one learns. This is strategic experimentation in a dynamic environment.

# Measuring a person’s contribution to society

Sometimes it is debated whether one profession or person contributes more to society than another, for example whether a scientist is more valuable than a doctor. There are many dimensions to any job. One could compare the small and probabilistic contribution to many people’s lives that a scientist makes to the large and visible influence of a doctor on a few patients’ wellbeing. These debates can to some extent be avoided, because a simple measure of a person’s contribution to society is their income. It is an imperfect measure, as are all measures, but it is an easily obtained baseline from which to start. If the people compared are numerous, un-cartelized and employed by numerous competitive employers, then their pay equals their marginal productivity, as explained in introductory economics.

People are usually employed by one firm at a time, and full-time non-overtime work is the most common, so the employers can be thought of as buying one “full-time unit” of labour from each worker. The marginal productivity equals the total productivity in the case where only one or zero units can be supplied. So the salary equals the total productivity at work.

Income from savings in a competitive capital market equals the value provided to the borrower of those savings. If the savings are to some extent inherited or obtained from gifts, then the interest income is to that extent due to someone else’s past productivity. Then income is greater than the contribution to society.

Other reasons why income may be a biased measure are negative externalities (criminal income measures harm to others), positive externalities (scientists help future generations, but don’t get paid for it), market power (teachers, police, social workers employed by monopsonist government get paid less than their value), transaction costs (changing a job is a hassle for the employer and the employee alike) and incomplete information (hard to measure job performance, so good workers underpaid and bad overpaid on average). In short, all the market failures covered in introductory economics.

If the income difference is large and the quantitative effect of the market failures is similar (neither person is a criminal, both work for employers whose competitive situations are alike, little inheritance), then the productivity difference is likely to be in the same direction as the salary difference. If the salary difference is small and the jobs are otherwise similar, the contribution to society is likely similar, so ranking their productivity is not that important. Comparison of people whose labour markets have different failures to a different extent is difficult.

# Local and organic food is wasteful

The easiest measure of any good’s environmental impact is its price. It is not a perfect measure. Subsidies for the inputs of a product can lower its price below more environmentally friendly alternatives that are not favoured by the government. Taxes, market power, externalities and incomplete information can similarly distort relative prices, as introductory economics courses explain. However, absent additional data, a more expensive good likely requires more resources and causes more environmental damage. Remembering this saves time on debating whether local non-organic is better than non-local organic fair trade, etc.

Local and organic are marketing terms, one suggesting helping local farmers and a lower environmental impact from transport, the other claiming health benefits and a lower environmental impact from fertilizers. Organic food may use less of some category of chemicals, but this must have a tradeoff in lower yield (more land used per unit produced) or greater use of some other input, because its higher price shows more resource use overall. From the (limited) research I have read, there is no difference in the health effects of organic and non-organic food. To measure this difference, selection bias must be taken into account – the people using organic are more health-conscious, so may be healthier to start with. On the other hand, those buying organic and local may be more manipulable, which has unknown health effects. Local food may use fewer resources for transport, but its higher price shows it uses more resources in total. One resource is the more expensive labour of rich countries (the people providing this labour consume more, thus have a greater environmental impact).

If one wants to help “local farmers” (usually large agribusinesses, not the family farms their lobbying suggests), one can give them money directly. No need to buy their goods, just make them a bank transfer and then buy whichever product is the least wasteful.

There are economies of scale in farming, so the more efficient large agricultural companies tend to outcompete family farms. The greater efficiency is also more environmentally friendly: more production for the same resources, or the same production with less. Helping the small farms avoid takeover is bad for the environment.

Fair trade and sustainable sourcing may be good things, if the rules for obtaining the classification are reasonable and enforced. But how many buyers of fair trade or sustainable goods have actually checked what the labels mean (the "fine print"), or verified with independent auditors whether the nice-sounding principles are put into practice? When a term is used in marketing, I suspect business as usual behind it.

# Statistics with a single history

Only one history is observable to a person – the one that actually happened. Counterfactuals are speculation about what would have happened if choices or some other element of the past history had differed. Only one history is observable to humanity as a whole, to all thinking beings in the universe as a whole, etc. This raises the question of how to do statistics with a single history.

The history is chopped into small pieces, which are assumed similar to each other and to future pieces of history. All conclusions require assumptions. In the case of statistics, the main assumption is “what happened in the past, will continue to happen in the future.” The “what” that is happening can be complicated – a long chaotic pattern can be repeated. It should be specified what the patterns of history consist of before discussing them.

The history observable to a brain consists of the sensory inputs and memory. Nothing else is accessible. This is pointed out by the “brain in a jar” thought experiment. Memory is partly past sensory inputs, but may also depend on spontaneous changes in the brain. Machinery can translate previously unobservable aspects of the world into accessible sensory inputs, for example convert infrared and ultraviolet light into visible wavelengths. Formally, history is a function from time to vectors of sensory inputs.

The brain has a built-in ability to classify sensory inputs by type – visual, auditory, etc. This is why the inputs form a vector. For a given sense, there is a built-in “similarity function” that enables comparing inputs from the same sense at different times.

Inputs distinguished by one person, perhaps with the help of machinery, may look identical to another person. The interpretation is that there are underlying physical quantities that must differ by more than the "just noticeable difference" to be perceived as different. The brain can access physical quantities only through the senses, so whether there is a "real world" cannot be determined, only assumed. If most people’s perceptions agree about something, and machinery also agrees (for example, a measuring tape disagrees with a visual illusion, so the illusion is not called real), then this "something" is called real and physical. The history accessible to humanity as a whole is a function from time to the concatenation of their sensory input vectors.

The similarity functions of people can also be aggregated, compared to machinery and the result interpreted as a physical quantity taking “similar” values at different times.

A set of finite sequences of vectors of sensory inputs is what I call a pattern of history. For example, a pattern can be a single sequence or everything but a given sequence. Patterns may repeat, due to the indistinguishability of physical quantities close to each other. The finer the distinctions one can make, the fewer the instances with the same perception. In the limit of perfect discrimination of all variable values, history is unlikely to ever repeat. In the limit of no perception at all, history is one long repetition of nothing happening. The similarity of patterns is defined based on the similarity function in the brain.

Repeated similar patterns together with assumptions enable learning and prediction. If AB is always followed by C, then learning is easy. Statistics are needed when this is not the case. If half the past instances of AB are followed by C, half by D, then one way to interpret this is by constructing a state space with a probability distribution on it. For example, one may assume the existence of an unperceived variable that can take values c,d and assume that ABc leads deterministically to ABC and ABd to ABD. The past instances of AB can be interpreted as split into equal numbers of ABc and ABd. The prediction after observing AB is equal probabilities of C and D. This is a frequentist setup.
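The frequentist pattern-counting just described can be sketched in a few lines of Python (the symbols and the history string below are hypothetical):

```python
from collections import Counter

def predict_after(pattern, history):
    """Empirical distribution of the symbol following each past
    occurrence of `pattern` in `history`."""
    n = len(pattern)
    followers = Counter(history[i + n]
                        for i in range(len(history) - n)
                        if history[i:i + n] == pattern)
    total = sum(followers.values())
    return {s: c / total for s, c in followers.items()}

# Half the past instances of AB are followed by C, half by D.
history = "ABCABDABCABD"
print(predict_after("AB", history))  # {'C': 0.5, 'D': 0.5}
```

The equal-probability prediction over C and D is exactly the frequentist reading of the split into ABc and ABd states.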

A Bayesian interpretation puts a prior probability distribution on histories and updates it based on the observations. The prior may put probability one on a single future history after each past one. Such a deterministic prediction is easily falsified – one observation contrary to it suffices. Usually, many future histories are assumed to have positive probability. Updating requires conditional probabilities of future histories given the past. The histories that repeat past patterns are usually given higher probability than others. Such a conditional probability system embodies the assumption “what happened in the past, will continue to happen in the future.”
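A minimal sketch of such Bayesian updating, with two hypothetical deterministic hypotheses about the history, shows how a single contrary observation falsifies a deterministic prediction:

```python
# Two hypothetical deterministic hypotheses for the value at each time t.
hypotheses = {
    "constant 1":  lambda t: 1,      # 1, 1, 1, ...
    "alternating": lambda t: t % 2,  # 0, 1, 0, 1, ...
}
prior = {"constant 1": 0.5, "alternating": 0.5}

def posterior(observed):
    """Bayes update: a deterministic hypothesis keeps its weight if it
    matches every observation, and is falsified (weight 0) otherwise."""
    like = {h: float(all(f(t) == x for t, x in enumerate(observed)))
            for h, f in hypotheses.items()}
    z = sum(prior[h] * like[h] for h in hypotheses)
    return {h: prior[h] * like[h] / z for h in hypotheses}

print(posterior([1, 1]))  # {'constant 1': 1.0, 'alternating': 0.0}
```

Observing the sequence 1, 1 falsifies the alternating hypothesis at its first prediction, so all posterior weight moves to the surviving hypothesis.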

There is a tradeoff between the length of a pattern and the number of times it has repeated. Longer patterns permit prediction further into the future, but fewer repetitions mean more uncertainty. Much research in statistics has gone into finding the optimal pattern length given the data. A long pattern contains many shorter ones, with potentially different predictions. Combining information from different pattern lengths is also a research area. Again, assumptions determine which pattern length and combination is optimal. Assumptions can be tested, but only under other assumptions.

Causality is also a mental construct. It is based on past repetitions of an AB-like pattern, without occurrence of BA or CB-like patterns.

The perception of time is created by sensory inputs and memory, e.g. seeing light and darkness alternate, feeling sleepy or alert due to the circadian rhythm and remembering that this has happened before. History is thus a mental construct. It relies on the assumptions that time exists, there is a past in which things happened and current recall is correlated with what actually happened. The preceding discussion should be restated without assuming time exists.

# Bayesian vs frequentist statistics – how to decide?

Which predicts better, Bayesian or frequentist statistics? This is an empirical question. To find out, should we compare their predictions to the data using Bayesian or frequentist statistics? What if Bayesian statistics says frequentist is better and frequentist says Bayesian is better (Liar’s paradox)? To find the best method for measuring the quality of the predictions, should we use Bayesianism or frequentism? And to find the best method to find the best method for comparing predictions to data? How to decide how to decide how to decide, as in Lipman (1991)?

# Theory and data both needed for prediction

Clearly, data is required for prediction. Theory only says: “If this, then that.” It connects assumptions and conclusions. Data tells whether the assumptions are true. It allows the theory to be applied.
Theory is also required for prediction, although that is less obvious. For example, after observing a variable taking the value 1 a million times, what is the prediction for the next realization of the variable? Under the theory that the variable is constant, the next value is predicted to be 1. If the theory says there are a million 1s followed by a million 0s, followed by a million 1s, etc., then the next value is 0. This theory may sound more complicated than the other, but prediction is concerned with correctness, not complexity. Also, the simplicity of a theory is a slippery concept – see the "grue-bleen example" in philosophy.
The constant sequence may sound like a more "natural" theory, but actually both the "natural" and the correct theory depend on where the data comes from. For example, the data may be generated by measuring whether it is day or night every millisecond, with day=1 and night=0. Then a theory that a large number of 1s is followed by a large number of 0s, etc., is more natural and correct than the theory that the sequence is constant.
Sometimes the theory is so simple that it is not noticed, like when forecasting a constant sequence. Which is more important for prediction, theory or data? Both equally, because the lack of either makes prediction impossible. If the situation is simple, then theorists may not be necessary, but theory still is.
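The million-ones example can be made concrete. Both theories below fit the observed data exactly, yet disagree about the next value (the block length is the hypothetical one from the text):

```python
def constant_theory(t):
    """Theory 1: the variable is always 1."""
    return 1

def block_theory(t, block=1_000_000):
    """Theory 2: a million 1s, then a million 0s, repeating forever."""
    return 1 if (t // block) % 2 == 0 else 0

n = 1_000_000
# Both theories agree with every one of the million observed 1s...
assert all(constant_theory(t) == block_theory(t) == 1 for t in (0, n // 2, n - 1))
# ...yet make opposite predictions for the very next observation.
print(constant_theory(n), block_theory(n))  # 1 0
```

The data alone cannot break the tie; only a theory about the data-generating process (such as the day/night interpretation) selects a prediction.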

# Evaluating the truth and the experts simultaneously

When evaluating an artwork, the guilt of a suspect or the quality of theoretical research, the usual procedure is to gather the opinions of a number of people and take some weighted average of these. There is no objective measure of the truth or the quality of the work. What weights should be assigned to different people’s opinions? Who should be counted an expert or knowledgeable witness?
A circular problem appears: the accurate witnesses are those who are close to the truth, and the truth is close to the average claim of the accurate witnesses. This can be modelled as a set of signals with unknown precision. Suppose the signals are normally distributed with mean equal to the truth (witnesses unbiased, just have poor memories). If the precisions were known, then these could be used as weights in the weighted average of the witness opinions, which would be an unbiased estimate of the truth with minimal variance. If the truth were known, then the distance of the opinion of a witness from it would measure the accuracy of that witness. But both precisions and the truth are unknown.
Simultaneously determining the precisions of the signals and the estimate of the truth may have many solutions. If there are two witnesses with different claims, we could assign the first witness infinite precision and the second finite, and estimate the truth to equal the opinion of the first witness. The truth is derived from the witnesses and the precisions are derived from the truth, so this is consistent. The same applies with witnesses switched.
A better solution takes a broader view and simultaneously estimates witness precisions and the truth. These form a vector of random variables. Put a prior probability distribution on this vector and use Bayes’ rule to update this distribution in response to the signals (the witness opinions).
The solution of course depends on the chosen prior. If one witness is assumed infinitely precise and the others finitely, then the updating rule keeps the infinite and finite precisions and estimates the truth to equal the opinion of the infinitely precise witness. The assumption of the prior seems unavoidable. At least it makes clear why the multiple solutions arise.
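For the case where the precisions are known, the minimum-variance unbiased estimate is the precision-weighted average. A sketch with hypothetical numbers, including the near-degenerate solution in which one witness is assigned (almost) infinite precision:

```python
def weighted_truth(opinions, precisions):
    """Precision-weighted average of the witness opinions: the
    minimum-variance unbiased estimate when precisions are known."""
    return sum(w * x for w, x in zip(precisions, opinions)) / sum(precisions)

opinions   = [5.1, 4.8, 6.0]   # hypothetical witness claims
precisions = [4.0, 1.0, 0.25]  # hypothetical; in reality these are unknown
print(weighted_truth(opinions, precisions))  # about 5.086

# The degenerate solution from the text: assign one witness near-infinite
# precision and the estimate collapses onto that witness's opinion.
print(weighted_truth([5.1, 4.8], [1e12, 1.0]))  # about 5.1
```

The circularity arises because in practice the precisions must themselves be estimated from the same opinions, which is where the prior comes in.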

# Retaking exams alters their informativeness

If only those who fail are allowed to retake an exam and it is not reported whether a grade comes from the first exam or a retake, then those who fail get an advantage. They get a grade that is the maximum of two attempts, while others get only one attempt.
A simple example has two types of exam takers, H and L, in equal proportions in the population. The type may reflect talent or preparation for the exam. There are three grades: A, B, C. The probabilities for each type to receive a given grade on any single attempt are, for H: Pr(A|H)=0.3, Pr(B|H)=0.6, Pr(C|H)=0.1, and for L: Pr(A|L)=0.2, Pr(B|L)=0.1, Pr(C|L)=0.7. The H type is more likely to get better grades, but there is noise in the grade.
After the retake, the probabilities for H to end up with each grade are Pr*(A|H)=0.33, Pr*(B|H)=0.66 and Pr*(C|H)=0.01. For L, Pr*(A|L)=0.34, Pr*(B|L)=0.17 and Pr*(C|L)=0.49. So the L type ends up with an A grade more frequently than H, due to retaking the exam 70% of the time, as opposed to H’s 10%.
If the observers of the grades are rational, they will infer by Bayes’ rule Pr(H|A)=33/67, Pr(H|B)=66/83 and Pr(H|C)=1/50.
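These numbers follow mechanically from the rule that only a C triggers a single fresh attempt, whose grade replaces the C. A sketch reproducing them, with the equal population shares of H and L assumed above:

```python
first = {
    "H": {"A": 0.3, "B": 0.6, "C": 0.1},
    "L": {"A": 0.2, "B": 0.1, "C": 0.7},
}

def after_retake(p):
    """Final-grade distribution when a C leads to one retake whose
    grade is kept (equivalently, the maximum of the two attempts)."""
    return {g: p["C"] ** 2 if g == "C" else p[g] + p["C"] * p[g]
            for g in p}

final = {t: after_retake(first[t]) for t in first}
# Posterior that a grade came from an H type, with equal priors on H and L.
posterior_H = {g: final["H"][g] / (final["H"][g] + final["L"][g])
               for g in "ABC"}
print(final["H"])   # {'A': 0.33, 'B': 0.66, 'C': 0.01} up to rounding
print(posterior_H)  # Pr(H|A)=33/67, Pr(H|B)=66/83, Pr(H|C)=1/50
```

Note that an A is (slightly) evidence of an L type here: Pr(H|A)=33/67 is below one-half, precisely because L retakes so much more often.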
It is probably to counter the advantage of retakers that some universities in the UK discount grades obtained from retaking exams (http://www.telegraph.co.uk/education/universityeducation/10236397/University-bias-against-A-level-resit-pupils.html). At the University of Queensland, those who fail a course can take a supplementary exam, but the grade is distinguished on the transcript from a grade obtained on the first try. Also, the maximum grade possible from taking a supplementary exam is one step above failure – the three highest grades cannot be obtained.

# Who discriminates against whom?

In social networks involving multiple racial, ethnic or religious groups, there are generally fewer links between groups and more within groups than uniform random matching would predict. One piece of research exploring this is Currarini, Jackson, Pin (2009).

When observing fewer intergroup links than equal-probability matching predicts, the natural question is who discriminates against whom. If group A and group B don’t form links, then is it because group A does not want to link to B or because B does not link to A? If we observe more couples where the man is white and the woman is Asian than expected from uniform random matching, is this due to the "yellow fever" of white men or a preference of Asian women for white men? It could also be caused by white men and Asian women meeting more frequently than other groups, but this particular kind of biased matching seems unlikely.

Assume both sides’ consent is needed for a link to form. Then the probability that a member of A and a member of B form a link is the product of the probabilities of A accepting B and B accepting A. We can interpret these probabilities as the preference of A for B and of B for A, and say that if the preference of A for A is stronger than the preference of A for B, then A discriminates against B. From data on undirected links alone, only the product of the probabilities can be calculated, not the separate probabilities. So based only on this data it is impossible to tell who discriminates against whom.

If there are more than two groups in the society, then for each pair of groups the same problem occurs. Under the additional assumption that a person treats all other groups the same, only his own group possibly differently from the other groups, the preference of each group for each group can be calculated. This assumption is unlikely to hold in practice though.

If only one side’s consent is needed for a link to form, then from data on these directed links, the preference of each group for each group can again be calculated. The preference of A for B is just the fraction of A’s links that are to B, divided by the fraction of B in the population.
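This calculation can be sketched as follows (the link counts and population shares are hypothetical):

```python
def preference(links_from_A, population_shares, target):
    """Preference of group A for `target`: the fraction of A's directed
    links going to `target`, divided by `target`'s population share.
    A value of 1 means no discrimination; below 1 means discrimination
    against `target`."""
    frac = links_from_A[target] / sum(links_from_A.values())
    return frac / population_shares[target]

# Hypothetical data: A's outgoing links by target group, and shares.
links_from_A = {"A": 60, "B": 40}
shares = {"A": 0.5, "B": 0.5}
print(preference(links_from_A, shares, "B"))  # 0.8 < 1: A discriminates against B
```

With directed links, each group’s preference for each other group is identified separately, unlike in the undirected case where only the product is observed.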

With additional data on who initiated a link or how much effort each side is putting into a link, the preference parameters may be identifiable. The online dating website OKCupid has some statistics on how likely each race is to initiate contact with each other race and how likely each race is to respond to an initial message by another race. If these statistics covered the whole population, then it would be easy to calculate who discriminates whom. In the case of a dating website however, the set of people using it is unlikely to be a representative sample of the population. This may change the results in a major way.

If the average attractiveness of group A in just the dating website (not in the whole population) is higher than that of other groups, then group A is likely to receive more initial contact attempts just because they are attractive. They can also afford to respond to fewer contact attempts since, being attractive, they can be pickier and make less effort to form links. If we disregard the nonrepresentative sample problem and just calculate the preferences of all groups for all other groups, then all groups will be found discriminating in favour of group A, and group A will be found discriminating against all others. But in the general population this may not be the case.

The attractiveness of group A in the dating website can differ from their average attractiveness if the website is more popular with group A and there is adverse selection into using the website. Adverse selection here means that only the people sufficiently unattractive to find a match by chance during their everyday life make the extra effort of starting to use the website to look for matches. So the average attractiveness of all groups using the website is lower than the population’s average attractiveness.

If a larger fraction of group A prefers to use the website and the users from all groups are drawn from the bottom end of the attractiveness distribution, then the website is relatively more popular with attractive members of A than with attractive members of other groups. Therefore the average attractiveness of those members of A using the website is higher than the average attractiveness of those members of other groups using the website. The higher preference of group A for using the website must be exogenous, i.e. due to something other than A’s lower average attractiveness, otherwise this preference does not cause A’s attractiveness on the website to rise. It could be that members of A are more familiar with the internet, so have a lower effort cost of using any website. Or there may be a social stigma against using online dating sites, which could be smaller in group A than in other groups.

If statistics from a nonrandom sample show discrimination, there may or may not be actual discrimination in the population, depending on the bias of the sample. It could also be that the actual discrimination is larger than the sample shows, if the sample bias goes in the opposite direction from the one described above.