Laplace’s principle of indifference makes history useless

Model the universe in discrete time with only one variable, which can take values 0 and 1. The history of the universe up to time t is a vector of length t consisting of zeroes and ones. A deterministic universe is a fixed sequence. A random universe is like drawing the next value (0 or 1) according to some probability distribution every period, where the probabilities can be arbitrary and depend in arbitrary ways on the past history.
The prior distribution over deterministic universes is a distribution over sequences of zeroes and ones. The prior determines which sets are generic. I will assume the prior with the maximum entropy, which is uniform (all paths of the universe are equally likely). This follows from Laplace’s principle of indifference, because there is no information about the distribution over universes that would make one universe more likely than another. Reading an infinite sequence of zeroes and ones as the binary expansion of a number maps the set of such sequences onto the interval [0,1], one-to-one except at the countably many numbers that have two expansions, so a uniform distribution on it makes sense.
After observing the history up to time t, one can reject all paths of the universe that would have led to a different history. Under a uniform prior, the remaining paths split evenly in probability between those that continue with 0 and those that continue with 1, so any history is equally likely to be followed by 0 or 1. The prediction of the next value of the variable is the same after every history, so knowing the history is useless for decision-making.
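The argument can be checked by brute force on a truncated universe. The following is only a sketch: it assumes a finite horizon of 6 periods instead of infinite sequences, and the names prob_next_is_one and horizon are made up for the illustration.

```python
from fractions import Fraction
from itertools import product


def prob_next_is_one(history, horizon):
    """P(next value = 1 | history) under a uniform prior on all paths of length `horizon`."""
    t = len(history)
    # Reject every path that would have led to a different history.
    consistent = [p for p in product((0, 1), repeat=horizon) if p[:t] == tuple(history)]
    ones = sum(1 for p in consistent if p[t] == 1)
    return Fraction(ones, len(consistent))


horizon = 6  # assumed truncation; the text works with infinite sequences
predictions = {
    h: prob_next_is_one(h, horizon)
    for t in range(horizon)
    for h in product((0, 1), repeat=t)
}
# Collect the distinct predictions made across all histories shorter than the horizon.
distinct_predictions = set(predictions.values())
print(distinct_predictions)
```

Every history of every length gives the same prediction of one-half, so conditioning on the history changes nothing.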
Many other priors besides uniform on all sequences yield the same result. One example is the uniform distribution restricted to the support consisting of sequences that are eventually constant. There are countably many such sequences, so this prior is an improper uniform one. A uniform distribution restricted to sequences that are eventually periodic, or that in the limit have equal frequencies of 1 and 0, also works.
Adding more variables, allowing more values of these variables or making time continuous does not change the result. A random universe can be modelled as deterministic with extra variables. These extras can, for example, be the probability of drawing 1 next period after a given history.
Predicting the probability distribution of the next value of the variable is easy, because the probability of 1 is always one-half. Knowing the history is no help for this either.

Statistics with a single history

Only one history is observable to a person – the one that actually happened. Counterfactuals are speculation about what would have happened if choices or some other element of the past history had differed. Only one history is observable to humanity as a whole, to all thinking beings in the universe as a whole, etc. This raises the question of how to do statistics with a single history.

The history is chopped into small pieces, which are assumed similar to each other and to future pieces of history. All conclusions require assumptions. In the case of statistics, the main assumption is “what happened in the past, will continue to happen in the future.” The “what” that is happening can be complicated – a long chaotic pattern can be repeated. It should be specified what the patterns of history consist of before discussing them.

The history observable to a brain consists of the sensory inputs and memory. Nothing else is accessible. This is pointed out by the “brain in a jar” thought experiment. Memory is partly past sensory inputs, but may also depend on spontaneous changes in the brain. Machinery can translate previously unobservable aspects of the world into accessible sensory inputs, for example by converting infrared and ultraviolet light into visible wavelengths. Formally, history is a function from time to vectors of sensory inputs.

The brain has a built-in ability to classify sensory inputs by type – visual, auditory, etc. This is why the inputs form a vector. For a given sense, there is a built-in “similarity function” that enables comparing inputs from the same sense at different times.

Inputs distinguished by one person, perhaps with the help of machinery, may look identical to another person. The interpretation is that there are underlying physical quantities that must differ by more than the “just noticeable difference” to be perceived as different. The brain can access physical quantities only through the senses, so whether there is a “real world” cannot be determined, only assumed. If most people’s perceptions agree about something, and machinery also agrees, then this “something” is called real and physical. A visual illusion fails this test, because a measuring tape does not agree with it. The history accessible to humanity as a whole is a function from time to the concatenation of all people’s sensory input vectors.

The similarity functions of people can also be aggregated, compared to machinery and the result interpreted as a physical quantity taking “similar” values at different times.

A set of finite sequences of vectors of sensory inputs is what I call a pattern of history. For example, a pattern can be a single sequence or everything but a given sequence. Patterns may repeat, due to the indistinguishability of physical quantities close to each other. The finer the distinctions one can make, the fewer the instances with the same perception. In the limit of perfect discrimination of all variable values, history is unlikely to ever repeat. In the limit of no perception at all, history is one long repetition of nothing happening. The similarity of patterns is defined based on the similarity function in the brain.

Repeated similar patterns together with assumptions enable learning and prediction. If AB is always followed by C, then learning is easy. Statistics are needed when this is not the case. If half the past instances of AB are followed by C and half by D, then one way to interpret this is by constructing a state space with a probability distribution on it. For example, one may assume the existence of an unperceived variable that can take the values c and d, and assume that ABc leads deterministically to ABC and ABd to ABD. The past instances of AB can be interpreted as split into equal numbers of ABc and ABd. The prediction after observing AB is equal probabilities of C and D. This is a frequentist setup.
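
The frequentist prediction can be sketched as counting what followed each past instance of the pattern. The history string and the function name next_symbol_frequencies below are invented for the illustration, and perceptions are assumed to be already coded as single letters.

```python
from collections import Counter


def next_symbol_frequencies(history, pattern):
    """Relative frequency of each symbol that followed `pattern` in `history`."""
    k = len(pattern)
    # Only instances that have a following symbol are counted,
    # so a pattern at the very end of the history is skipped.
    followers = Counter(
        history[i + k]
        for i in range(len(history) - k)
        if history[i:i + k] == pattern
    )
    total = sum(followers.values())
    if total == 0:
        return {}  # the pattern has never been followed by anything
    return {symbol: count / total for symbol, count in followers.items()}


history = "ABCABDABCABDAB"  # made-up past; the final AB is what we predict from
freqs = next_symbol_frequencies(history, "AB")
print(freqs)
```

Here AB was followed twice by C and twice by D, so the predicted probabilities after the final AB are one-half each.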

A Bayesian interpretation puts a prior probability distribution on histories and updates it based on the observations. The prior may put probability one on a single future history after each past one. Such a deterministic prediction is easily falsified – one observation contrary to it suffices. Usually, many future histories are assumed to have positive probability. Updating requires conditional probabilities of future histories given the past. The histories that repeat past patterns are usually given higher probability than others. Such a conditional probability system embodies the assumption “what happened in the past, will continue to happen in the future.”
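
A minimal sketch of such updating, assuming a prior over only four candidate histories of length 6. The prior weights and the names update and predict_next are hypothetical, chosen so that histories repeating the pattern ABC get more weight.

```python
from fractions import Fraction


def update(prior, observed):
    """Condition a prior over complete histories on the observed beginning."""
    consistent = {h: p for h, p in prior.items() if h.startswith(observed)}
    total = sum(consistent.values())
    if total == 0:
        # Every history with positive prior probability contradicts the data.
        raise ValueError("the prior is falsified by the observation")
    return {h: p / total for h, p in consistent.items()}


def predict_next(posterior, observed):
    """Posterior probability of each possible next symbol."""
    prediction = {}
    for h, p in posterior.items():
        symbol = h[len(observed)]
        prediction[symbol] = prediction.get(symbol, 0) + p
    return prediction


# Hypothetical prior: histories that repeat a past pattern get higher probability.
prior = {
    "ABCABC": Fraction(1, 2),
    "ABCABD": Fraction(1, 4),
    "ABDABD": Fraction(1, 8),
    "ABDABC": Fraction(1, 8),
}
observed = "ABCAB"
posterior = update(prior, observed)
prediction = predict_next(posterior, observed)
print(prediction)
```

After observing ABCAB, the prediction is C with probability 2/3 and D with probability 1/3. A prior that puts probability one on ABCABC alone would raise the error as soon as ABD is observed, which is the sense in which a deterministic prediction is falsified by one contrary observation.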

There is a tradeoff between the length of a pattern and the number of times it has repeated. Longer patterns permit prediction further into the future, but fewer repetitions mean more uncertainty. Much research in statistics has gone into finding the optimal pattern length given the data. A long pattern contains many shorter ones, with potentially different predictions. Combining information from different pattern lengths is also a research area. Again, assumptions determine which pattern length and which combination are optimal. Assumptions can be tested, but only under other assumptions.
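
The tradeoff can be seen by counting how often the most recent pattern has occurred before, for several pattern lengths. This is a sketch on simulated data: the random binary history, the seed and the name repetitions_of_last_context are assumptions made for the illustration.

```python
import random


def repetitions_of_last_context(history, k):
    """Number of earlier times the final k symbols occurred and were followed by a symbol."""
    context = history[-k:]
    # The range stops before the final occurrence itself, which has no follower yet.
    return sum(
        1
        for i in range(len(history) - k)
        if history[i:i + k] == context
    )


random.seed(0)  # arbitrary seed, only for reproducibility
history = "".join(random.choice("01") for _ in range(2000))
lengths = (1, 2, 4, 8, 16)
counts = {k: repetitions_of_last_context(history, k) for k in lengths}
for k in lengths:
    print(k, counts[k])
```

The count can never rise as the pattern gets longer, because every earlier occurrence of a long pattern contains an occurrence of each of its shorter endings. Longer patterns therefore leave fewer past instances to base a prediction on.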

Causality is also a mental construct. It is based on past repetitions of an AB-like pattern, without occurrence of BA or CB-like patterns.

The perception of time is created by sensory inputs and memory, e.g. seeing light and darkness alternate, feeling sleepy or alert due to the circadian rhythm and remembering that this has happened before. History is thus a mental construct. It relies on the assumptions that time exists, there is a past in which things happened and current recall is correlated with what actually happened. The preceding discussion should be restated without assuming time exists.

Theory and data both needed for prediction

Clearly, data is required for prediction. Theory only says: “If this, then that.” It connects assumptions and conclusions. Data tells whether the assumptions are true. It allows the theory to be applied.
Theory is also required for prediction, although that is less obvious. For example, after observing a variable taking the value 1 a million times, what is the prediction for the next realization of the variable? Under the theory that the variable is constant, the next value is predicted to be 1. If the theory says there are a million 1s followed by a million 0s followed by a million 1s, etc., then the next value is 0. This theory may sound more complicated than the other, but prediction is concerned with correctness, not complexity. Also, the simplicity of a theory is a slippery concept – see the “grue-bleen example” in philosophy.
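The two theories can be written as functions from time to the predicted value. This is only a sketch, with the function names constant_theory and alternating_blocks_theory invented for the illustration and time counted from zero.

```python
N = 1_000_000  # number of observations so far, all equal to 1


def constant_theory(t):
    """The variable always takes the value 1."""
    return 1


def alternating_blocks_theory(t, block=N):
    """Blocks of `block` ones and `block` zeroes alternate, starting with ones."""
    return 1 if (t // block) % 2 == 0 else 0


observed = [1] * N

# Both theories agree with every observation made so far.
fits_constant = all(constant_theory(t) == x for t, x in enumerate(observed))
fits_blocks = all(alternating_blocks_theory(t) == x for t, x in enumerate(observed))

# They disagree about the very next value.
next_constant = constant_theory(N)
next_blocks = alternating_blocks_theory(N)
print(fits_constant, fits_blocks, next_constant, next_blocks)
```

The data fit both theories perfectly, yet the predictions for the next period are 1 and 0, so the data alone cannot choose between them.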
The constant sequence may sound like a more “natural” theory, but actually both the “natural” and the correct theory depend on where the data comes from. For example, the data may be generated by measuring whether it is day or night every millisecond, with day=1 and night=0. Then a theory that a large number of 1s are followed by a large number of 0s, etc., is more natural and correct than the theory that the sequence is constant.
Sometimes the theory is so simple that it is not noticed, like when forecasting a constant sequence. Which is more important for prediction, theory or data? Both equally, because the lack of either makes prediction impossible. If the situation is simple, then theorists may not be necessary, but theory still is.