Probabilities of doppelgangers

Having been quoted in a 2016 BBC online article, "You are surprisingly likely to have a living doppelganger", I have subsequently been contacted by other journalists for comments on the relevant probabilities. So here are some comments from the viewpoint of basic mathematical probability.

Interest in this topic was apparently rekindled by a press release from the University of Adelaide, whose key assertion was that the likelihood that two people share the exact same face is in excess of [actually meaning "less than"] one in a trillion. This was based on the 2015 scientific paper "Are human faces unique? A metric approach to finding single individuals without duplicates in large samples" by Teghan Lucas and Maciej Henneberg. That paper examined a database of around 4,000 faces and made 8 measurements (e.g. ear length) of each individual, recorded to the nearest millimeter. By examining whether there were any two individuals with exactly the same values for 4 or 5 of those measurements, and then extrapolating via some kind of mathematical model, the following conclusion (edited for clarity) was stated:

(*) The probability of finding a duplicate of a given face (that is, these 8 measurements are identical) is less than 1/(Earth population), implying that this method of facial identification is as reliable as that of DNA (because it is unlikely that there exist any identical pair).
A subsequent critique, "Reply to Lucas & Henneberg: Are human faces unique?" by statisticians Ronald Meester et al., argues (again edited for clarity):
A variety of related reasons show that this central claim (*) is unsubstantiated:
(1) the absence of any mathematical model for facial profiles;
(2) problems in the determination of the match probability;
(3) the extrapolation;
(4) unsubstantiated claims in the popular press release.
I believe that most academic statisticians would agree with this critique, and therefore that (*) does not represent any kind of consensus scientific opinion.

The conceptual difficulty with any serious analysis is to specify what exactly it means to be a doppelganger, that is, to say that two faces "look very similar". If one judges by physical measurements, then any very precise measurement of two people would be different, even for identical twins (who are implicitly not counted as doppelgangers). So one needs to quantify "how similar", and that part of the Lucas & Henneberg paper is reasonable (they measured to the nearest millimeter). But judging by human perception might give quite different results.

How to think about the probabilities

Basic textbook mathematical probability will not give us a numerical answer, but does help us to clarify the issue. We need some (hugely oversimplified) model, and let's use the following.

Model. There is some unknown (and very large) number N of possible faces, meaning that any individual face is "very similar" to one of these possible faces, and that two individuals are deemed doppelgangers if their faces are "very similar" to the same one of the possible faces. Then suppose that each face in the world population (size M, approximately 8 billion) is like a random pick from the N possible faces, each equally likely.

Analysis. The key mathematical point is that there are three quite different events, with different probabilities, that one can consider within this model.
(E1) There is at least one doppelganger pair, somewhere in the world.
(E2) A typical person ("you") has at least one doppelganger.
(E3) Every person in the world has at least one doppelganger.
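To make the distinction between the three events concrete, here is a minimal simulation sketch of the model in a toy world. The function names and the toy sizes (50 possible faces, 100 people) are purely illustrative assumptions, chosen so that the code runs quickly; they say nothing about real values of N.

```python
import random
from collections import Counter

def simulate_events(n_faces, n_people, rng):
    """One random world: return (E1, E2, E3) as booleans."""
    # Each person gets a uniformly random face out of n_faces possibilities.
    faces = [rng.randrange(n_faces) for _ in range(n_people)]
    counts = Counter(faces)
    e1 = any(c >= 2 for c in counts.values())   # some doppelganger pair exists
    e2 = counts[faces[0]] >= 2                  # person 0 ("you") has a doppelganger
    e3 = all(c >= 2 for c in counts.values())   # nobody has a face to themselves
    return e1, e2, e3

def estimate(n_faces, n_people, trials=2000, seed=0):
    """Monte Carlo estimates of P(E1), P(E2), P(E3) in the toy world."""
    rng = random.Random(seed)
    totals = [0, 0, 0]
    for _ in range(trials):
        for k, happened in enumerate(simulate_events(n_faces, n_people, rng)):
            totals[k] += happened
    return [t / trials for t in totals]

# Toy sizes only (hypothetical): 50 possible faces, 100 people.
p1, p2, p3 = estimate(n_faces=50, n_people=100, trials=500)
print(f"P(E1) ~ {p1:.3f}, P(E2) ~ {p2:.3f}, P(E3) ~ {p3:.3f}")
```

In every simulated world (E3) implies (E2) for person 0, and (E2) implies (E1), so the three estimated frequencies always come out in decreasing order.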

The probabilities of these events depend on M (which we know) and on N (which we don't know). We know how to calculate such probabilities (in terms of M and N) because this mathematical structure arises often. In fact
(E1) is an instance of the birthday problem
(E2) is an instance of the binomial distribution
(E3) is a variant of the coupon collector's problem (CCP).
It is important to note that N depends on how we judge "very similar", and that we would need some kind of real world data to make a numerical estimate of N. However, we can make a start without data. For each of these events there is some critical value of N, in the following sense:

if the true value of N is much less than the critical value, then the event is very likely,
whereas if the true value of N is much larger than the critical value, then the event is very unlikely.
This is qualitatively obvious, in that the more possible faces there are, the less likely doppelgangers will be. Then within our model, we can calculate these critical values.
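The switch from "very likely" to "very unlikely" can be seen numerically. The sketch below uses the standard birthday-problem approximation for (E1) and the exact "everyone else misses your face" formula for (E2); the function names are my own, and the trial values of N are simply round numbers well below, near, and well above the critical values quoted in the next paragraph.

```python
import math

M = 8e9  # world population, as in the model

def prob_e1(n_faces, m=M):
    """Birthday approximation: about m(m-1)/2 pairs, each matching w.p. 1/N."""
    return -math.expm1(-m * (m - 1) / (2 * n_faces))

def prob_e2(n_faces, m=M):
    """Each of the other m-1 people independently misses your face w.p. 1 - 1/N."""
    return -math.expm1((m - 1) * math.log1p(-1 / n_faces))

for n in (5e17, 5e19, 5e21):
    print(f"N = {n:.0e}:  P(E1) ~ {prob_e1(n):.4f}")
for n in (1.1e8, 1.1e10, 1.1e12):
    print(f"N = {n:.1e}:  P(E2) ~ {prob_e2(n):.4f}")
```

Changing N by a factor of 100 in either direction moves each probability from close to 1 down to below 1%, which is the sense in which the critical value separates the two regimes.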

For (E1) the critical value of N is 5 x 10^{19} = 50 billion billion. To me this seems implausibly large, so I personally am very happy to believe that some doppelgangers exist.
For (E2) the critical value of N is 11 billion.
For (E3) the critical value of N is 400 million.
To me, it is hard to guess whether N is larger or smaller than the values for (E2) or (E3) above.
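For readers who want to reproduce the first two numbers, here is a short sketch. It assumes that "critical value" means the N at which the event has probability 1/2; that convention is my assumption for the purpose of the calculation, and other reasonable conventions change the answers only by a modest constant factor. The value for (E3) needs a different argument, discussed in the section on the CCP below.

```python
import math

M = 8e9  # world population

# (E1): solve 1 - exp(-M(M-1)/(2N)) = 1/2 for N (birthday approximation).
crit_e1 = M * (M - 1) / (2 * math.log(2))

# (E2): solve 1 - exp(-M/N) = 1/2 for N (approximation to 1 - (1 - 1/N)^(M-1)).
crit_e2 = M / math.log(2)

print(f"critical N for (E1): about {crit_e1:.2e}")
print(f"critical N for (E2): about {crit_e2:.2e}")
```

Both results agree with the rounded figures quoted above to within the accuracy that such a crude model deserves.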

But this model is unrealistic ...

Of course this model is very unrealistic for many reasons, in particular because of non-uniformity: the possible faces are in fact not all equally likely. Such non-uniformity would make (E1) and (E2) more likely (that is, the critical value would increase), because our "typical" person's face is likely to be one with a comparatively larger likelihood. On the other hand, non-uniformity would make (E3) less likely, because an individual with a relatively unlikely face is less likely to have a doppelganger, which for the purpose of (E3) dominates the opposite effect.
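The effect of non-uniformity on (E2) can be checked in a toy example. In the sketch below a "typical person" is a random member of the population, so their face is face i with probability p_i, and they have a doppelganger unless all the other people miss that face. The toy sizes and the particular skewed distribution (half the faces three times as likely as the other half) are hypothetical choices for illustration only, and the sketch does not address (E3).

```python
def prob_typical_person_has_match(probs, n_people):
    """P(E2) when face i has probability probs[i]: average over your own face."""
    return sum(p * (1 - (1 - p) ** (n_people - 1)) for p in probs)

# Toy world (hypothetical sizes): 1000 possible faces, 700 people.
n_faces, n_people = 1000, 700

uniform = [1 / n_faces] * n_faces

# Skewed alternative: half the faces carry weight 3, the other half weight 1.
weights = [3] * (n_faces // 2) + [1] * (n_faces // 2)
total = sum(weights)
skewed = [w / total for w in weights]

p_uniform = prob_typical_person_has_match(uniform, n_people)
p_skewed = prob_typical_person_has_match(skewed, n_people)
print(f"uniform faces: P(E2) = {p_uniform:.3f}")
print(f"skewed faces:  P(E2) = {p_skewed:.3f}")
```

In this example the skewed world gives the larger probability, in line with the reasoning above.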

Some details about (E3) and the CCP

The "400 million" comes from (a variation of) the coupon collector's result. If there are (all numbers approximate) 400 million possible faces, then the CCP result says that the number of people needed for the event
(*) each of these faces is chosen by at least one person
is around 8 billion (see next paragraph). Now for our "does everyone have a doppelganger" question, we have a somewhat different event, which can be formulated as the event
(**) each of these possible faces is chosen by either zero or at least two people.
Now the events (*) and (**) are different, but the mathematical analysis is similar, so we get approximately the same answer, that "400 million" is the critical value of N = "number of possible faces" for event (**).
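One rough way to see why the two events have critical values of the same order is a first-moment heuristic, which is my own illustration rather than part of the standard CCP account. Event (**) fails exactly when some person has a face to themselves, and within the model the expected number of such people is M(1 - 1/N)^(M-1). The sketch below evaluates this for a few values of N; the function name is hypothetical.

```python
import math

M = 8e9  # world population

def expected_loners(n_faces, m=M):
    """Expected number of people whose face is shared by nobody else."""
    # Each person is alone on their face if the other m-1 people all miss it.
    return m * math.exp((m - 1) * math.log1p(-1 / n_faces))

for n in (3e8, 3.5e8, 4e8, 5e8):
    print(f"N = {n:.1e}: expected number of people with no doppelganger "
          f"~ {expected_loners(n):.3g}")
```

Under this heuristic the expected count passes through 1 at around N = M / log(M), somewhat below 400 million but of the same order, which is all that "approximately the same answer" is meant to convey.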

To quote and edit from a standard account:
The CCP asks for the number of trials M needed to collect all N different coupons, when each coupon is equally likely to be obtained in each trial.
The solution to the coupon collector's problem is
M = N * log(N) approximately (log being the natural logarithm).
In our setting we use this solution “backwards”. We take
N = number of possible faces
M = world population.
Because we know M = 8 billion, we can use the solution to obtain N = 400 million. Again, note this is all “approximate”.
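Using the solution "backwards" means solving N * log(N) = M for N, which has no simple closed form but is easy to do numerically. Here is a minimal sketch using bisection; the function name is my own, and any root-finding method would do equally well.

```python
import math

def invert_ccp(m, iterations=200):
    """Solve N * ln(N) = m for N by bisection (N * ln(N) is increasing for N >= 1)."""
    lo, hi = 1.0, float(m)   # at N = 1 the left side is 0; at N = m it exceeds m
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if mid * math.log(mid) < m:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n_critical = invert_ccp(8e9)
print(f"N with N * ln(N) = 8 billion: about {n_critical:.3g}")
```

The answer rounds to the 400 million quoted above.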

Is it possible to get relevant data?

Actual data to address these questions is scarce. One can find many online pictures of celebrity lookalikes. Because these are within some small subset of the world population, this suggests that the existence of a "similar to that degree" doppelganger is quite common. But for a typical individual it seems hard to actually exhibit such a match. In principle one could do an experiment by randomly picking 10 ordinary people and offering online a monetary prize for the first person to exhibit a matching doppelganger. But with the modern ability to manipulate photographs, one would need to be careful about fakes.