Linear congruential generator (LCG): $ X_{n+1} = (aX_n +c)\mod m.$ Period $\le m$.
Lagged Fibonacci, KISS, xorshift family, PCG, ...
Wichmann-Hill: sum of 3 LCGs.

```python
def WH(s1, s2, s3):
    """One step of the Wichmann-Hill generator: combine three LCGs."""
    s1 = (171 * s1) % 30269
    s2 = (172 * s2) % 30307
    s3 = (170 * s3) % 30323
    r = (s1/30269 + s2/30307 + s3/30323) % 1  # Python 3 true division: r in [0, 1)
    return [r, s1, s2, s3]
```
McCullough, B.D., 2008. Microsoft Excel's 'Not The Wichmann–Hill' random number generators. Computational Statistics & Data Analysis, 52, 4587–4593. doi:10.1016/j.csda.2008.03.006
Mersenne Twister (MT19937)

- state space 19937 bits
- period $2^{19937}-1$, a Mersenne prime
- can have slow "burn in," especially for seeds with many zeros
- output for close seeds can be close
- perfectly predictable from 624 successive outputs; problems discovered in 2007
As if a random $L$-bit string $H(M)$ is assigned to each message $M$ in a way that's essentially unique.
Generate a random string $S$ of reasonable length, e.g., 20 digits.
$$ X_i = {\mbox{Hash}}(S,i),$$
interpreted as a (long) hexadecimal number.
Unbounded state space!
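A minimal sketch of this hash-based scheme, assuming SHA-256 as the hash and an arbitrary seed string; the function name `hash_prng` and the way $S$ and $i$ are combined are illustrative choices, not part of the notes:

```python
import hashlib

def hash_prng(S, i):
    """Sketch of a hash-based PRN: interpret Hash(S, i) as a number in [0, 1).

    SHA-256 stands in for a cryptographic hash; S is the random seed string.
    """
    digest = hashlib.sha256((S + ',' + str(i)).encode()).hexdigest()
    # The hex digest is a 256-bit integer; normalize it to [0, 1).
    return int(digest, 16) / 16**len(digest)

seed = "91459037816430563重"[:20]  # any 20-character seed string works
x0, x1 = hash_prng(seed, 0), hash_prng(seed, 1)
```

Because the state is the pair $(S, i)$ and $i$ can grow without bound, there is no fixed period.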
Textbook: take $X \sim U[0,1)$; define $Y \equiv 1 + \lfloor mX \rfloor$.
In practice, $X$ from a PRNG is not really $U[0,1)$: derived by normalizing a PRN that's (supposed to be) uniform on $w$-bit integers.
If $m > 2^w$, at least $m-2^w$ values will have probability 0 instead of probability $1/m$.
Unless $m$ is a power of 2, the distribution of $Y$ isn't uniform on $\{1, \ldots, m\}$.
For $m < 2^w$, the ratio of the largest to smallest selection probability is, to first order, $1+ m 2^{-w}$. (Knuth v2 3.4.1.A.)
For $m = 10^9$ and $w=32$, $1 + m 2^{-w} \approx 1.233$.
If $w=32$, then for $m>2^{32}\approx 4.29\mathrm{e}9$, some values will have probability 0.
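The bias of $Y = 1 + \lfloor mX \rfloor$ can be exhibited exactly with toy sizes (here $w=8$ and $m=10$, instead of the $w=32$ above): count how many of the $2^w$ equally likely integers map to each output value.

```python
from collections import Counter

w, m = 8, 10  # toy sizes: map w-bit integers to {1, ..., m}
# Y = 1 + floor(m * j / 2^w) for each of the 2^w equally likely integers j
counts = Counter(1 + (j * m) // 2**w for j in range(2**w))
ratio = max(counts.values()) / min(counts.values())
# Six output values get 26 preimages and four get 25, so ratio = 26/25 = 1.04,
# close to the first-order value 1 + m * 2**-w ≈ 1.039.
```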
Until relatively recently, R did not support 64-bit integers.
Generate $\lceil \log_2 m \rceil$ pseudo-random bits; discard out-of-range numbers.
`numpy.random.randint()` does this, but `numpy.random.choice()` doesn't. In R, `sample(1:m, k, replace=FALSE)` uses a variant of the faulty `1 + floor(m*X)` approach.
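A sketch of the mask-and-reject approach, using `secrets.randbits` as a stand-in for the underlying source of uniform random bits; the helper name `randbelow` is mine:

```python
import secrets  # stand-in source of uniform random bits

def randbelow(m, randbits=secrets.randbits):
    """Uniform integer on {0, ..., m-1} by mask-and-reject:
    draw just enough random bits; discard out-of-range values and redraw."""
    k = max(1, (m - 1).bit_length())  # bits needed to cover {0, ..., m-1}
    while True:
        j = randbits(k)               # uniform on {0, ..., 2^k - 1}
        if j < m:                     # accept; out-of-range values are redrawn
            return j

draw = 1 + randbelow(10)              # uniform on {1, ..., 10}
```

Each accepted value has exactly probability $1/m$; the cost is that, at worst, just under half the draws are rejected.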
If you put $N>n$ pigeons in $n$ pigeonholes, then at least one pigeonhole must contain more than one pigeon.
At most $n$ pigeons can be put in $n$ pigeonholes if at most one pigeon is put in each hole.
Stirling bounds $ e n^{n+1/2} e^{-n} \ge n! \ge \sqrt{2 \pi} n^{n+1/2} e^{-n}.$
$ \frac{2^{nH(k/n)}}{n+1} \le {n \choose k} \le 2^{nH(k/n)},$ where $H(q) \equiv -q \log_2(q) - (1-q) \log_2 (1-q)$.
For $\ell \ge 1$ and $m \ge 2$, $ { {\ell m } \choose { \ell }} \ge \frac{m^{m(\ell-1)+1}}{\sqrt{\ell} (m-1)^{(m-1)(\ell-1)}}. $
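A numerical spot-check of the Stirling and entropy bounds above (a sanity check I added, not part of the notes), for $n = 50$, $k = 10$:

```python
import math

def H(q):
    """Binary entropy H(q) = -q log2(q) - (1-q) log2(1-q)."""
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

n, k = 50, 10
fact = math.factorial(n)
# Stirling bounds: sqrt(2 pi) n^(n+1/2) e^-n <= n! <= e n^(n+1/2) e^-n
assert math.sqrt(2 * math.pi) * n**(n + 0.5) * math.exp(-n) <= fact
assert fact <= math.e * n**(n + 0.5) * math.exp(-n)
# Entropy bounds: 2^(n H(k/n)) / (n+1) <= C(n, k) <= 2^(n H(k/n))
binom = math.comb(n, k)
assert 2**(n * H(k / n)) / (n + 1) <= binom <= 2**(n * H(k / n))
```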
$n^k$ possible ordered samples of size $k$, with replacement, from $n$ items
Expression | full | scientific notation
---|---|---
$2^{32}$ | 4,294,967,296 | 4.29e9
$2^{64}$ | 18,446,744,073,709,551,616 | 1.84e19
$2^{128}$ | | 3.40e38
$2^{32 \times 624}$ | | 9.27e6010
$13!$ | 6,227,020,800 | 6.23e9
$21!$ | 51,090,942,171,709,440,000 | 5.11e19
$35!$ | | 1.03e40
$2084!$ | | 3.73e6013
${50 \choose 10}$ | 10,272,278,170 | 1.03e10
${100 \choose 10}$ | 17,310,309,456,440 | 1.73e13
${500 \choose 10}$ | | 2.4581e20
$\frac{2^{32}}{{50 \choose 10}}$ | | 0.418
$\frac{2^{64}}{{500 \choose 10}}$ | | 0.075
$\frac{2^{32}}{7000!}$ | | $<$ 1e-54,958
$\frac{2}{52!}$ | | 2.48e-68
Suppose ${\mathbb P}_0$ and ${\mathbb P}_1$ are probability distributions on a common measurable space.
If there is some set $S$ for which ${\mathbb P}_0(S) = \epsilon$ and ${\mathbb P}_1(S) = 0$, then $\|{\mathbb P}_0 - {\mathbb P}_1 \|_1 \ge 2 \epsilon$.
Thus there is a function $f$ with $|f| \le 1$ (e.g., $f = 1_S - 1_{S^c}$) such that
$${\mathbb E}_{{\mathbb P}_0}f - {\mathbb E}_{{\mathbb P}_1}f \ge 2 \epsilon.$$
Given a good source of randomness, many ways to draw a simple random sample.
One basic approach: shuffle then first $k$.
Many ways to generate pseudo-random permutation.
PIKK (permute indices and keep $k$)

For instance, if we had a way to generate independent, identically distributed (iid) $U[0,1]$ random numbers, we could do it as follows:

Algorithm PIKK
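The body of Algorithm PIKK is not spelled out above; a sketch of the standard idea, assuming iid $U[0,1)$ numbers from `numpy` (the function name `pikk` is mine):

```python
import numpy as np

def pikk(n, k, gen=np.random):
    """PIKK sketch: attach an iid U[0,1) number to each of the n indices,
    sort by those numbers, and keep the first k indices."""
    return np.argsort(gen.random(n))[:k]

sample = pikk(100, 5)  # 5 of the indices 0, ..., 99, without replacement
```

Sorting costs $O(n \log n)$, which motivates the more efficient shuffles below.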
More efficient:

Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (backwards version)

```
for i = 1, ..., n-1:
    J <- random integer uniformly distributed on {i, ..., n}
    (a[J], a[i]) <- (a[i], a[J])
```
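A direct Python translation of the pseudocode above, with 0-based indices (using `numpy`'s legacy `randint`, whose upper bound is exclusive):

```python
import numpy as np

def fykd_shuffle(a, gen=np.random):
    """In-place Fisher-Yates-Knuth-Durstenfeld shuffle, 0-based."""
    n = len(a)
    for i in range(n - 1):
        j = gen.randint(i, n)       # uniform on {i, ..., n-1}
        a[j], a[i] = a[i], a[j]     # swap the chosen item into position i
    return a

shuffled = fykd_shuffle(list(range(10)))
```

Each of the $n!$ permutations is produced by exactly one sequence of draws, so the output is uniform if the draws are.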
There's a version suitable for streaming, i.e., generating a random permutation of a list that has an (initially) unknown number $n$ of elements:
Algorithm Fisher-Yates-Knuth-Durstenfeld shuffle (streaming version)

```
i <- 0
a = []
while there are records left:
    i <- i+1
    J <- random integer uniformly distributed on {1, ..., i}
    if J < i:
        a[i] <- a[J]
        a[J] <- next record
    else:
        a[i] <- next record
```
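The streaming version translated to Python with 0-based indices (a sketch; the name `fykd_stream` is mine):

```python
import numpy as np

def fykd_stream(records, gen=np.random):
    """Streaming FYKD shuffle: random permutation of a stream whose
    length is not known in advance (0-based translation of the pseudocode)."""
    a = []
    for i, record in enumerate(records):
        j = gen.randint(0, i + 1)   # uniform on {0, ..., i}
        if j < i:
            a.append(a[j])          # displaced item moves to the end
            a[j] = record           # new record takes position j
        else:
            a.append(record)        # j == i: new record goes at the end
    return a

perm = fykd_stream(iter(range(10)))
```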
Sort using a "random" comparison function, e.g.,

```python
import numpy as np

def RandomSort(a, b):
    # returns a random sign, ignoring a and b
    return (0.5 - np.random.random())
```

But this fails the basic properties of an ordering (e.g., transitivity and consistency between comparisons), so it doesn't produce a uniformly random permutation; the output also depends on the sorting algorithm.
Notoriously used by Microsoft to offer a selection of browsers in the EU. The shuffle listed IE 5th of 5 about half the time when run in IE and about 26% of the time in Chrome, but only 6% of the time in Safari (which listed IE 4th about 40% of the time).
Random_Sample

```python
import numpy as np

def Random_Sample(n, k, gen=np.random):  # from Cormen et al. 2009
    if k == 0:
        return set()
    else:
        S = Random_Sample(n - 1, k - 1, gen)  # pass gen through the recursion
        i = gen.randint(1, n + 1)             # uniform on {1, ..., n}
        if i in S:
            S = S.union([n])
        else:
            S = S.union([i])
        return S

Random_Sample(100, 5)
```

{11, 15, 35, 43, 48}
The previous algorithms require $n$ to be known. There are reservoir algorithms that do not. Moreover, the algorithms are suitable for streaming (aka online) use: items are examined sequentially and either enter into the reservoir, or, if not, are never revisited.
Algorithm R, Waterman (per Knuth, 1997)

- Put first $k$ items into the reservoir.
- When item $k+j$ is examined, either skip it (with probability $j/(k+j)$) or swap it for a uniformly selected item in the reservoir (with probability $k/(k+j)$).
- The naive version requires at most $n-k$ pseudo-random numbers.
- Closely related to the FYKD shuffle.
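A sketch of Algorithm R in Python (0-based stream index; the function name `algorithm_R` is mine):

```python
import numpy as np

def algorithm_R(records, k, gen=np.random):
    """Waterman's Algorithm R (per Knuth 1997): reservoir sample of
    size k from a stream of unknown length."""
    reservoir = []
    for j, record in enumerate(records):
        if j < k:
            reservoir.append(record)   # first k items fill the reservoir
        else:
            t = gen.randint(0, j + 1)  # uniform on {0, ..., j}
            if t < k:                  # with probability k/(j+1),
                reservoir[t] = record  # replace a uniformly chosen item
    return reservoir

res = algorithm_R(iter(range(1000)), 5)
```

An induction shows every item ends up in the reservoir with probability $k/n$, without knowing $n$ in advance.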
Algorithm Z (Vitter)

Much more efficient than Algorithm R, using random skips; essentially linear in $k$.

Note: Vitter proposes using the (inaccurate) $J = \lfloor mU \rfloor$ approach to generate a random integer between $0$ and $m$ in both Algorithm R and Algorithm Z. Pervasive!
Package/Lang | default | other | SRS algorithm
---|---|---|---
SAS 9.2 | MT | 32-bit LCG | Floyd's ordered hash or Fan et al. 1962
SPSS 20.0 | 32-bit LCG | MT19937ar | trunc + rand indices
SPSS ≤ 12.0 | 32-bit LCG | |
STATA 13 | KISS 32 | | PIKK
STATA 14 | MT | | PIKK
R | MT | | trunc + rand indices
python | MT | | mask + rand indices
MATLAB | MT | | trunc + PIKK
StatXact | MT | |
Key. MT = Mersenne Twister. LCG = linear congruential generator. PIKK = assign a number to each of the $n$ items and sort. The KISS generator combines 4 generators of three types: two multiply-with-carry generators, the 3-shift register SHR3 and the congruential generator CONG.
STATA 10 before April 2011: 95.1% of the $2^{31}$ possible seed values resulted in the first and second draws from rnormal() having the same sign.
Know from the pigeonhole argument that the $L_1$ distance between the true and the desired distribution is big for modest sampling and permutation problems.
Know from equidistribution of MT that large ensemble frequencies will be right, but expect dependence issues
Looking for problems that occur across seeds, large enough to be visible in $O(10^5)$ replications
Examined simple random sample frequencies, derangements, partial derangements, Spearman correlation, etc.
Must be problems, but where are they?
Building library for permutation tests in Python http://statlab.github.io/permute/
Smaller library of examples in R to go with Pesarin & Salmaso's text https://github.com/statlab/permuter
Working on plug-in CS-PRNG replacement for MT in Python https://github.com/statlab/cryptorandom
Replace the standard PRNGs in R and Python with PRNGs with unbounded state spaces, and cryptographic or near-cryptographic quality
Replace `floor(1+nU)` in R's `sample()` with the bit-mask algorithm.