A Short History of Data Mining

Interactive: A look at data collection and mining throughout American History

The growth of data mining is inevitable. It reflects a fundamental change caused by technological advances that, like King Canute’s fabled tide, cannot be stopped or slowed. The opportunity (or problem, depending on one’s perspective) derives from two related yet distinct trends: increases in computing power and decreases in data storage costs.

Many are familiar with the long-term increase in the power of computers. It is most commonly characterized as Moore’s Law, named after Intel co-founder Gordon Moore, who first posited it in 1965. In its popular form, Moore’s Law predicts that computer chip capacities will double every eighteen to twenty-four months. The law has held remarkably steady for nearly thirty years, as the graph below demonstrates.

Processing Power Doubles Every 18-24 Months
The logarithmic scale makes clear that the effect of routine doubling is exponential. Processor capacity today is more than one million times what it was in 1970.
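The arithmetic behind repeated doubling can be sketched in a few lines of Python. This is only an illustration, not part of the original analysis: the function names are hypothetical, and the 1970-to-2009 span is an assumption inferred from the article's own dates.

```python
import math


def growth_factor(years: float, doubling_months: float) -> float:
    """Return the multiplicative growth after `years` if capacity
    doubles once every `doubling_months` months."""
    doublings = years * 12 / doubling_months
    return 2.0 ** doublings


def years_to_reach(factor: float, doubling_months: float) -> float:
    """Return how many years of steady doubling are needed to grow
    by `factor` (for example, 1_000_000 for a million-fold increase)."""
    doublings = math.log2(factor)
    return doublings * doubling_months / 12


# Assumption: the comparison runs from 1970 to roughly 2009.
SPAN_YEARS = 2009 - 1970

for months in (18, 24):
    factor = growth_factor(SPAN_YEARS, months)
    needed = years_to_reach(1_000_000, months)
    print(f"doubling every {months} months: "
          f"x{factor:,.0f} over {SPAN_YEARS} years; "
          f"million-fold after {needed:.1f} years")
```

Under these assumptions, a million-fold increase takes roughly thirty years at the eighteen-month pace and roughly forty years at the twenty-four-month pace, which brackets the figure quoted above.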

The power of this processing capacity, which translates almost directly into processing speed, is immense. It is what drives the information technology tools that power Google and Amazon and make Walmart’s purchasing system a reality. Though no one predicts that processing speed will double indefinitely (surely a physical impossibility), there is no current expectation that the limits of chip capacity have been reached.

To this trend one must also add the remarkable reduction in the costs of data storage. As the following chart demonstrates, data storage costs have also been decreasing at an exponential rate, almost identical to the rate of increase in chip capacity but with the opposite slope.

The Costs of Data Storage Are Decreasing Exponentially Too
What this means in practical terms is that in 1984, just twenty-five years ago, it cost roughly two hundred dollars to store a megabyte of data. By 1999 that cost had sunk to seventy-five cents. Today you can buy one hundred megabytes of data storage capacity for a penny. On eBay you can frequently purchase a terabyte storage device for your desktop for just one hundred dollars. A terabyte is roughly one trillion bytes of data, a huge volume for storing simple alphanumeric information. According to the Library of Congress, the portion of its collections that is digitized and freely available to the public on the Internet amounts to about 74 terabytes, so you can hold the entire publicly available library for an investment of $7,400 in storage costs. (Of course, this excludes the non-public content, which is a much larger holding.)
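The figures in this paragraph can be checked with a short Python sketch. The prices come from the text; the function names are hypothetical, and treating a terabyte as exactly one million megabytes is a simplifying assumption.

```python
import math

# Dollar figures quoted in the text.
COST_PER_MB_1984 = 200.00   # dollars per megabyte in 1984
COST_PER_MB_1999 = 0.75     # dollars per megabyte in 1999
TERABYTE_PRICE = 100.00     # dollars for a one-terabyte desktop drive
LIBRARY_TERABYTES = 74      # publicly available digitized collections

# Assumption: decimal units, so 1 terabyte = 1,000,000 megabytes.
MB_PER_TB = 1_000_000


def cost_per_mb_from_tb_price(tb_price: float) -> float:
    """Convert the price of a one-terabyte drive into dollars per megabyte."""
    return tb_price / MB_PER_TB


def halving_time_years(year_a: int, cost_a: float,
                       year_b: int, cost_b: float) -> float:
    """Average number of years for the cost to halve between two dates,
    assuming a steady exponential decline."""
    halvings = math.log2(cost_a / cost_b)
    return (year_b - year_a) / halvings


def library_storage_cost(terabytes: float, tb_price: float) -> float:
    """Dollars needed to buy `terabytes` of storage at `tb_price` per terabyte."""
    return terabytes * tb_price


per_mb_today = cost_per_mb_from_tb_price(TERABYTE_PRICE)
print(f"100 MB at today's price: ${per_mb_today * 100:.2f}")
print(f"Halving time 1984-1999: "
      f"{halving_time_years(1984, COST_PER_MB_1984, 1999, COST_PER_MB_1999):.2f} years")
print(f"Storage for the public collections: "
      f"${library_storage_cost(LIBRARY_TERABYTES, TERABYTE_PRICE):,.0f}")
```

On these numbers, storage costs halved a little faster than once every two years between 1984 and 1999, which is in line with the pace of doubling described for chip capacity.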

Here, too, the prospects are for ever-cheaper data storage. One can readily imagine peta-, exa-, or even yottabyte-sized personal storage devices. [A petabyte is 1000^5 bytes, an exabyte is 1000^6 bytes, and a yottabyte is 1000^8 bytes.] If that is what an individual can obtain, imagine what a large corporation or a government can purchase and maintain.

The story of technology today requires us to answer the question: “What happens when ever-quicker processing power meets ever-cheaper storage capacity?” Data is now pervasively available and pervasively searchable. For large-scale databases of the size maintained by governments or companies, the practical limitations lie in the actual search algorithms used and how they are designed to process the data, not in the chips or the storage units.

For perspective on how science, policy and other events shaped data mining in the United States, please view our interactive timeline feature.
