Biomed Central journals text search page - by Professor Futrelle, 28 January 2006

Updated with Yakov Kronrod's search form 6 March 2006 and by Professor Futrelle for AI class, CSG120 Spring 2008


The purpose of this search is to search for text only in papers from the Open Access journals publisher, Biomed Central (BMC), whose journals reside in a variety of domains. We do this with a bit of Javascript that automatically includes in the Google search string certain text that appears only in their papers. (View the html source for this page to see how we augment the form.) The annoying part about this form is that in the resulting google search, the trivial BMC-specific text shows up first. This could be fixed with a bit more Javascript, which would put the user's search terms to the left of the BMC-specific text.
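As a rough illustration of putting the user's terms to the left of the BMC-specific text, the query could be assembled along the following lines. This is only a sketch, not the code in this page's form: the function names are made up for the example, and BMC_MARKER is a placeholder, since the real BMC-specific text is in the html source and is not reproduced here.

```javascript
// Placeholder only: the real BMC-specific text lives in this page's html source.
const BMC_MARKER = "BMC-SPECIFIC TEXT GOES HERE";

// Hypothetical helper: user terms first, then the marker as a quoted phrase,
// so the user's own words appear on the left of the search string.
function buildBmcQuery(userTerms, marker = BMC_MARKER) {
  return userTerms.trim() + ' "' + marker + '"';
}

// Hypothetical helper: wrap the query in an ordinary google search URL.
function buildGoogleUrl(userTerms, marker = BMC_MARKER) {
  return "https://www.google.com/search?q=" +
    encodeURIComponent(buildBmcQuery(userTerms, marker));
}

console.log(buildGoogleUrl('"we used * to"'));
```

With this ordering the user's phrase would lead the search string and the BMC-specific text would trail it.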

Here's our search page for figures - searches the captions.

A note on wildcard searches in google: A few experiments showed that the number of asterisks used affects the search. The phrase "we used * to" got 17K hits, whereas "we used * * to" got 23K, with the results apparently including more text in the * * position than in the * position. Their wildcard search is pretty general. The following does what's expected: "we * to * the" got 66K results. There are user discussion boards, blogs and web pages devoted to various google search topics, so there is more wildcard wisdom and experience out there. A search on google search wildcards gets 360K hits.

Here's a more biologically/scientifically meaningful example for a BMC search:
"absence of * suggests", got 32 hits.

The following phrase links an object of interest to an approach to find information about it experimentally:
"to assess * we"
got 116 hits, including, "In order to assess RPTP dimerization, we have assayed Fluorescence Resonance Energy Transfer (FRET) between chimeric proteins of cyan- and yellow-emitting ..."
The semantics of this is that there is a goal or hypothesis to be investigated, "assessed" being related in some way to "investigate". The means or course of action is then stated. Schematically, goal => action to attempt to reach goal.

Another meaningful example is, "that * disrupted", got 83 hits, such as, "These results showed that 13-MTD disrupted the mitochondrial integrity."
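To make concrete what the * slot picks up in the two examples above, here is a small local sketch that turns such a phrase into a regular expression and pulls out the filler. It does not claim to reproduce google's matching: the helper names are invented for the example, and the rule that each * stands for one or more words (with commas allowed between words) is an assumption.

```javascript
// Hypothetical helper: turn a phrase with "*" wildcards into a RegExp.
// Assumption: each "*" stands for one or more words. This is a local
// approximation, not a description of how google treats the asterisk.
function phraseToRegExp(phrase) {
  const word = "[A-Za-z0-9-]+";   // letters, digits, hyphens (e.g. 13-MTD)
  const sep = "[\\s,;:]+";        // whitespace or light punctuation between words
  const parts = phrase.trim().split(/\s+/).map((tok) =>
    tok === "*"
      ? "(" + word + "(?:\\s+" + word + ")*?)"        // lazy: shortest filler wins
      : tok.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")    // escape literal words
  );
  return new RegExp("\\b" + parts.join(sep) + "\\b", "i");
}

// Return the text captured by each "*" for the first match, or [] if none.
function fillers(phrase, text) {
  const m = phraseToRegExp(phrase).exec(text);
  return m ? m.slice(1) : [];
}

console.log(fillers("to assess * we",
  "In order to assess RPTP dimerization, we have assayed Fluorescence Resonance Energy Transfer"));
console.log(fillers("that * disrupted",
  "These results showed that 13-MTD disrupted the mitochondrial integrity."));
```

On the two quoted sentences, the fillers come out as "RPTP dimerization" (the goal being assessed) and "13-MTD" (the thing doing the disrupting).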

To me, the searches on "to assess * we" and "that * disrupted" are more interesting than the standard entity-search studies that many papers have described, e.g., "* and related proteins", returning "IHF and related proteins", which finds the entity IHF and identifies it as a protein. Even simpler, "the protein * is" got 791 hits, "the protein * which" got 320 hits, and "the protein * that" got 428 hits.

Our goals can include finding entities, but most importantly finding constructions related to the hypotheses, their exploration, and the results seen. Many of the results of Biology are not formally hypothesis-driven, because we don't always know in advance what might result. They're more approach and technique driven. Try something and then report what you find. So an hypothesis in another science might lead to confirmation or refutation, whereas in Biology, the "hypothesis" is more wide open: Do something-or-other and see what results. Two bits of cleverness are involved here: What to do and what to be on the lookout for in the results.


AltaVista.com is mentioned as an engine that can do wildcards within words, which could be very helpful. But for a BMC search on "we used * to", it returned 312 results versus 17K for google. That doesn't mean google's better, because it does have a tendency to return multiples. Also, AltaVista only allowed me to look at three pages worth of results out of their 312. But for some searches it allowed a lot more. What's up with that?

I found this note about AltaVista stating that their wildcards only deal with word endings: "Wildcards: With simple queries you are allowed to enter a wildcard character at the end of phrases which will substitute for any combination of letters. The asterisk (*) is AltaVista's wildcard character. For example, butt* will get all occurences of butt, butts, butter, button, etc. The asterisk cannot be used at the beginning or in the middle of words. It will substitute for up to 5 additional lower case letters."
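The rule in that note can be written down as a small check, which at least makes it clear what should and should not match. This is a sketch of the rule as quoted, not of AltaVista's actual behavior, and the function name is made up for the example.

```javascript
// Hypothetical helper modelling the quoted rule only: a trailing "*" may
// stand for up to 5 additional lower case letters; no "*" elsewhere.
function matchesEndingWildcard(pattern, word) {
  if (!pattern.endsWith("*")) return pattern === word;
  const stem = pattern.slice(0, -1);
  if (!word.startsWith(stem)) return false;
  const rest = word.slice(stem.length);
  return /^[a-z]{0,5}$/.test(rest);
}

for (const w of ["butt", "butts", "butter", "button", "buttermilk"]) {
  console.log(w, matchesEndingWildcard("butt*", w));
}
```

Under this reading, butt, butts, butter and button all match butt*, while buttermilk does not, since it needs six extra letters.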

Frankly, I've had little luck with these word-ending wildcards in AltaVista so far.

Here is our BMC figure search page.