Stop Tokenizer

Using the stop words below, I tokenized Obama's most recent State of the Union address. A portion of the results follows.

Stop words:

("a", "an", "the", "and", ".", ",", " ", "because",
 "entire", "sure", "u", "!", "was", "has", "its", "through", "me", "his", "once", "carry",
 "anew", "'", "t", "let", "us", "new", "before", "come", "two", "one", "ve", "go", "8",
 "she", "her", "he", "none", "at", "been", "these", "what", "up", "were", "them", "some", "had",

Results: (portion)
41835 41841  |people|
41845 41850  |haiti|
41858 41863  |lives|
41878 41887  |americans|
41896 41903  |dropped|
41904 41914  |everything|
41921 41930  |someplace|
41940 41945  |never|
41955 41959  |pull|
41960 41966  |people|
41976 41981  |never|
41982 41987  |known|
41997 42003  |rubble|
42005 42014  |prompting|
42015 42021  |chants|
42057 42064  |another|
42065 42069  |life|
42074 42079  |saved|
42084 42090  |spirit|
42100 42109  |sustained|
42115 42121  |nation|
42126 42130  |more|
42140 42149  |centuries|
42150 42155  |lives|
42171 42177  |people|
42187 42195  |finished|
42198 42207  |difficult|
42208 42212  |year|
42237 42246  |difficult|
42247 42253  |decade|
42265 42269  |year|
42286 42292  |decade|
42293 42302  |stretches|
42324 42328  |quit|
42339 42343  |quit|
42357 42362  |seize|
42368 42374  |moment|
42380 42385  |start|
42405 42410  |dream|
42411 42418  |forward|
42427 42437  |strengthen|
42442 42447  |union|
42453 42457  |more|
42458 42463  |thank|
42469 42472  |god|
42473 42478  |bless|
42488 42491  |god|
42492 42497  |bless|
42502 42508  |united|
42509 42515  |states|
42519 42526  |america|

I added stop words based on the unnecessary words that kept appearing in earlier parses. Doing this reinforced the idea that choosing stop words is largely a matter of bias. To me it was clear which words were not meaningful, so I kept removing them. At some point, though, it became more complicated. For example, words like "own," which can be either a verb or a noun, could be interesting, as could words like "come," but I decided they were not important.
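The process above can be sketched in code. This is a minimal illustration of a stop-word tokenizer that reports character offsets in the `start end |token|` style of the results shown earlier; the stop-word set here is a small subset chosen for the example, not the full list used on the speech.

```python
import re

# Illustrative subset of the stop-word list above.
STOP_WORDS = {"a", "an", "the", "and", "because", "was", "has", "its",
              "through", "me", "his", "once", "let", "us", "new", "this", "that"}

def stop_tokenize(text):
    """Yield (start, end, token) for every non-stop word in text."""
    for match in re.finditer(r"[a-z']+", text.lower()):
        token = match.group()
        if token not in STOP_WORDS:
            yield match.start(), match.end(), token

sample = "The spirit that has sustained this nation"
for start, end, token in stop_tokenize(sample):
    print(start, end, f"|{token}|")
# prints:
# 4 10 |spirit|
# 20 29 |sustained|
# 35 41 |nation|
```

The offsets are character positions in the lowercased input, which is what makes it possible to map each surviving token back to its place in the original speech.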

It seems to me that Google, Yahoo, and Bing all have very similar stop-word lists. One article said that Yahoo! Search is now powered by Bing in the US and by Google in Japan.

This article states:

"You can differentiate by having product information. But Google scrapes it. You can differentiate through consumer & editorial reviews. But Google scrapes it. You can differentiate by brand, but Google sells branded keywords to competitors. No matter what you do, Google competes against you. You can opt out of being scraped, but then you get no search traffic (& the ecosystem is set up to pay someone else to scrape your content + wrap it in ads)."

For Google, stop-word removal is only a small piece: there are so many other, more complicated and more important algorithms for text parsing, and ad, proprietary, and information-ownership motives operate alongside the need to infer the user's intentions.

When stop tokenization is used alone, the search has no way of ordering the results in a meaningful way. Google looks for things like "natural citations" to further break down results, ranking the ones that also include the stop words as more relevant. It also has really interesting web-based signals, like looking for natural link growth, the depth and quality of links to and from a page, and a page's age.
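As a toy illustration of the first idea (this is a hypothetical sketch, not Google's actual algorithm): once stop-word removal leaves several documents matching the same content words, the dropped stop words can serve as a tie-breaker, slightly boosting documents that also contain them.

```python
# Hypothetical ranking sketch: content-word matches dominate the score,
# and matches on the removed stop words act as a small tie-breaking bonus.
def rank(query_tokens, stop_words, documents):
    content = [t for t in query_tokens if t not in stop_words]
    stops = [t for t in query_tokens if t in stop_words]

    def score(doc):
        words = set(doc.lower().split())
        base = sum(t in words for t in content)       # content-word matches
        bonus = sum(t in words for t in stops) * 0.1  # stop-word tie-breaker
        return base + bonus

    return sorted(documents, key=score, reverse=True)

docs = ["the state of the union", "union state", "a speech about health"]
print(rank(["state", "of", "the", "union"], {"of", "the"}, docs))
# prints: ['the state of the union', 'union state', 'a speech about health']
```

Both of the first two documents match the content words "state" and "union," but the first also contains the stop words "of" and "the," so it wins the tie.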

Yahoo, from what I understand, gives more search priority to ad sales than Google does, and makes a poorer distinction of link quality and depth, so it constantly presents us with tangentially connected information, which, I have to say, has its place. I love it when I go to check my Yahoo spam account, which I keep mostly for the purpose of distraction, and come across a link to a cat doing something cute, which brings me to some advertisement, and then another exciting link, and helps me pass some time while I get ready to get back to work. It has its place as a totally different kind of search: as some search engines work to meet people's needs, others work to reduce people's attention spans to fit their shifty search strategies and criteria.

The most important trend is the slow convergence and homogenization of search strategies, mostly following Google, which limits web recommendations and search results more and more. Because ranking is often based on "link growth," "viral" things become almost self-fulfilling prophecies.

As search engines become personalized, they get to limit the recommendations for each person even more. On our end, we seem to develop a relationship with our preferred search engines as well, learning their stop tokenizers and attempting to match their parsing style to fit our needs. This two-way relationship with our search engine becomes more and more specific, isolated, and limited, and we begin to internalize the search engine's biases, which further limits our use of it when searching for something outside of those biases. Maybe the key is for different search engines to keep evolving in different directions, so that the recommendations and search results on each would not respond to our behaviors in the same way.