Freshman Honors, Sp04 - What's in Google - Prof. Futrelle

Professor Futrelle - College of Computer and Information Sciences, Northeastern U., Boston, MA

Version of 22 February 2004

What's all this information that Google indexes?

Many of us turn to Google for information about nearly everything. But there's far more to the world of information than you can find there. Much of the world's "highest-quality" knowledge is found in books and journals that you have to pay for or subscribe to, material not available on Google. Sophisticated medical information, details of legal cases and research on our ecosystem are not there. This information is important, so why isn't it available? The ultimate reasons are economic: "you get what you pay for". Collecting and indexing legal material is a labor-intensive and computer-intensive task that costs money. Currently, the money comes from the users, the attorneys and clients who need the information. The results of sophisticated biological and medical studies continue to be published in prestigious journals that try to exert tight quality control over the articles they accept. These journals cost money to run and cost money to subscribe to. (But see the note on "The Open Access movement" below.)

The structure and quality of Google information

Google has to work with what's there. Most of what it indexes is simple files, mostly HTML, but many in PDF, Word or PostScript. There is little that Google can do other than scan them for words and phrases, index those and give you back lists of matching documents. But users would be happy to have more powerful methods that would allow online systems to return answers to questions, not just more things to read. That's a tall order, and not one that will be filled soon. But there are many people working on these deeper and more challenging problems. Professor Futrelle's lab has been working on them for some years, as other pages on this Honors site detail. His focus on the figures in scientific documents, and especially the diagrams that appear in published Biology research papers, is unique in the world. But again, more on that in the rest of this site.
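
To make the "scan for words, index those and give you back lists" idea concrete, here is a minimal sketch of a word index in Python. It is only an illustration of the general technique, not a description of how Google actually works. The file names, the sample texts and the function names build_index and search are all made up for this example.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        # Split the text into runs of letters and digits, ignoring case.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index, query):
    """Return a sorted list of documents that contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    # Start with the documents for the first word, then narrow the set down.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical documents, invented purely for illustration.
docs = {
    "page1.html": "Diagrams in biology research papers",
    "page2.html": "Legal cases and medical research",
    "page3.html": "Biology diagrams explained",
}

index = build_index(docs)
print(search(index, "biology diagrams"))
print(search(index, "research"))
```

Notice that the result is only a list of documents that contain the words. Nothing here understands the question or the documents, which is exactly the limitation described above.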

The Open Access movement

Many scientists are fed up with what they see as the exorbitant cost of professional journals. The government typically pays for the research published there, but then readers or libraries have to pay again to get access to it. In addition, the publishers retain the copyrights to the material, so researchers can't even legally post their own papers on a website for others to read. The Open Access movement was started by scientists to break the lock that commercial publishers have on the scientific literature. Models differ, but in some, the authors pay a one-time fee up front, and after that there are no access restrictions. The Open Access movement was inspired in part by the open software revolution that grew out of GNU and Linux. See the links below for more information about Open Access.

