The 1.4 trillion dollar project.

A Google search on "1.4 trillion dollars" gets a surprisingly large number of hits, which can be traced back to some smaller number of distinct appearances of "1.4 trillion dollars" in authoritative past data or evidence-based future estimates. This observation (from an email that I now cannot find -- thanks anyway to the sender!) suggests the following line of thought. There is some distribution (in x) for the "relative frequency of times in which the sum of x dollars arises in public discussion", and we can use what we find in a Google search as a proxy for "in public discussion". It is hard to envisage any complete theoretical prediction for this distribution, but much of the vague conceptual discussion of "informationless priors" (which I will not repeat here) suggests a density proportional to 1/x in the tail. Specifically, if one looks at sums from 1.1 trillion dollars to 9.9 trillion dollars (ignoring the ".0 trillion" ones, which leaves 81 such numbers), one might predict that the frequencies of the different numbers x should be roughly proportional to 1/x.
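To make the prediction concrete, here is a minimal Python sketch (illustrative only, not part of the original project) of the 1/x relative frequencies over those 81 values:

    # The 81 sums 1.1, 1.2, ..., 9.9 with the x.0 values dropped.
    values = [round(i / 10, 1) for i in range(11, 100) if i % 10 != 0]
    assert len(values) == 81

    weights = [1 / x for x in values]
    total = sum(weights)
    predicted = {x: w / total for x, w in zip(values, weights)}

    # For example, the prediction says "1.4 trillion" should appear about
    # twice as often as "2.8 trillion", since (1/1.4)/(1/2.8) = 2.
    print(predicted[1.4] / predicted[2.8])   # approximately 2.0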

Is this prediction roughly accurate? The table below shows data collected by Amy Huang and Irvin Liu as an Undergraduate Research Project in Spring 2009. I haphazardly picked a few numbers out of the 81 possible.

x (trillion dollars)    1.4   2.8   3.3   4.2   4.7   5.6   8.4
observed frequency       26    29    19    13    10     4     5
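As a quick, rough comparison (again just a sketch: the 1/x weights are normalized over only these seven values and scaled to the observed total of 106):

    observed = {1.4: 26, 2.8: 29, 3.3: 19, 4.2: 13, 4.7: 10, 5.6: 4, 8.4: 5}

    total = sum(observed.values())           # 106
    weight = sum(1 / x for x in observed)    # normalize 1/x over these 7 values
    for x, n in observed.items():
        expected = total * (1 / x) / weight
        print(f"{x}: observed {n:2d}, 1/x prediction {expected:5.1f}")

On this scaling the predicted count for 1.4 is about 36 against the observed 26, which is one way to see the anomaly mentioned below.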

What conclusion might we draw? In brief, the data (excluding the anomalous first figure) shows a very crude fit to the predicted 1/x frequencies. There are overlapping conceptual and practical issues in obtaining this data, so we're reluctant to undertake quantitative analysis or draw any more definite conclusion. Instead we invite the reader to consider better ways to formulate and execute the project!

Details of data collection. We went through the items found in the Google search and counted those satisfying the following criteria; a sketch of the bookkeeping appears after the list.
(1) The data must refer to some explicit time period -- often this is one year, but quite often a period like "2004-2008" or "2011-2015".
(2) Either the data/estimate itself looks authoritative, or one can quickly find an authoritative source for the same number. For future estimates, we didn't attempt to use our own judgement to assess the reasonableness of a forecasting methodology; we just checked that it seemed to have been done by some reputable source, as opposed to a wild guess by a blogger!
(3) We did not double-count items referring to the same underlying data/estimate.
(4) We didn't count endpoints of intervals, such as "between 1.1 and 1.4 trillion". But we did count "more than 1.4 trillion" in contexts where this implicitly meant "a little more than 1.4 trillion", that is "between 1.4 trillion and 1.5 trillion".
(5) The stopping rule was to look at 20 pages of Google search results.
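To illustrate the bookkeeping behind criterion (3), here is a hypothetical sketch; the field names and the example entries are invented for illustration, and the real judgements were of course made by hand:

    # Each kept item is reduced to an "underlying instance" key, so that
    # several search hits citing the same data/estimate count only once.
    seen_instances = set()
    count = 0

    def record(source, period, quantity):
        """Count one qualifying hit, deduplicating on the underlying instance."""
        global count
        key = (source, period, quantity)
        if key not in seen_instances:    # criterion (3): no double counting
            seen_instances.add(key)
            count += 1

    record("some-agency", "2011-2015", "projected spending")
    record("some-agency", "2011-2015", "projected spending")   # duplicate, ignored
    print(count)   # 1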

Issues in data collection. The practical difficulty was that criteria (1-4) involve some subjective judgement, and that even maintaining a list of the "different" underlying instances requires some effort once the list exceeds 20. The conceptual difficulty is in defining a suitable "stopping rule": using "20 pages" is biased in that there may be multiple items referring to the same instance.

Rounding. Obviously a writer uses "1.4 trillion" as a rounded figure, to mean something like "between 1.35 trillion and 1.45 trillion". This is why we didn't search for "2.0 trillion": writers might round "2.03 trillion" to "2.0 trillion", "2 trillion", or "two trillion".
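Under this reading, the 1/x density really predicts a frequency for the rounded figure x proportional to the mass of the interval [x - 0.05, x + 0.05] under dt/t, that is log((x + 0.05)/(x - 0.05)). A small sketch shows this is numerically almost identical to the cruder point approximation 0.1/x over this range, so the rounding convention does not materially change the prediction:

    import math

    for x in (1.4, 2.8, 8.4):
        interval = math.log((x + 0.05) / (x - 0.05))   # mass of [x-0.05, x+0.05] under 1/t
        point = 0.1 / x                                # point approximation
        print(f"{x}: {interval:.5f} vs {point:.5f}")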

What is this measuring, anyway?

What we are measuring is clearly some complicated mixture of objective data and subjective perception. For instance, in the U.S. there is one Federal budget, 50 State budgets and thousands of city budgets, but the Federal budget is intrinsically of interest to more people than is a particular city budget. Conceptually, there is some density f(x) of the "objective number of occurrences", in which all those budget figures are weighted equally, and some other density g(x) of figures that come to an average person's attention; one might guess these densities would be related by a very rough size-biasing relationship -- g(x) proportional to x f(x). However, I guess that what we see in a Google search would be somewhat intermediate between the "objective" and the "subjective" quantities.
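One easy-to-check consequence of that guess (a sketch under the stated assumptions, not part of the project): size-biasing a 1/x-shaped density gives a flat one, so if the search results sit between f and g, their shape should sit between 1/x and uniform.

    values = [round(i / 10, 1) for i in range(11, 100) if i % 10 != 0]

    f = [1 / x for x in values]               # "objective" density, 1/x shape
    g = [x * fx for x, fx in zip(values, f)]  # size-biased: x * f(x), a flat shape

    print(min(g), max(g))   # both 1.0 up to floating-point rounding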

Aside. Somewhat related is The Secret Lives of Numbers, which says: "Since 1997, we have collected at intervals a novel set of data on the popularity of numbers: by performing a massive automated Internet search on each of the integers from 0 to 1,000,000 and counting the number of pages which contained each, we have obtained a picture of the Internet community's numeric interests and inclinations."