{"id":672,"date":"2011-02-18T06:41:51","date_gmt":"2011-02-18T06:41:51","guid":{"rendered":"http:\/\/www.mariarabinovich.com\/blog\/?p=672"},"modified":"2011-02-28T04:24:20","modified_gmt":"2011-02-28T04:24:20","slug":"stop-tokenizer","status":"publish","type":"post","link":"http:\/\/www.mariarabinovich.com\/blog\/archives\/672","title":{"rendered":"Stop Tokenizer"},"content":{"rendered":"<p>Based on the following stop words, I tokenized Obama&#8217;s most recent State of the Union address. Below is a portion of the results.<\/p>\n<p>Stop words:<\/p>\n<pre>(\"a\", \"an\", \"the\",\"and\",\".\",\",\",\" \",\"because\",\r\n\r\n\"why\",\"this\",\"is\",\"of\",\"in\",\"if\",\"that\",\"that's\",\"it\",\"then\",\"than\",\"when\",\r\n\r\n\"we\",\"as\",\"from\",\"to\",\"our\",\"s\",\"have\",\"they\",\"have\",\"?\",\r\n\r\n\"all\",\"must\",\"who\",\"you\",\"on\",\"for\",\"may\",\"be\",\"\/\",\"\\\"\\'\",\"\\\"\",\"get\",\"are\",\"i\",\"am\",\"not\",\r\n\r\n\"m\",\"make\",\"makes\",\"for\",\"into\",\"but\",\"can\",\"only\",\"happen\",\"don\",\"same\",\"against\",\"nearly\",\r\n\r\n\"entire\",\"sure\", \"u\", \"!\", \"was\", \"has\", \"its\", \"through\", \"me\", \"his\",\"once\",\"carry\",\r\n\r\n\"anew\",\"'\", \"t\",\"let\", \"us\", \"new\", \"before\", \"come\",\"two\", \"one\", \"ve\", \"go\", \"8\",\r\n\r\n\"she\", \"her\", \"he\",\"none\",\"at\",\"been\",\"these\",\"what\",\"up\",\"were\",\"them\",\"some\",\"had\",\r\n\r\n\"their\",\"do\",\"by\",\"or\",\"re\",\"aren\",\"so\",\"with\",\"will\",\"my\",\"no\",\"there\",\"here\",\"went\",\r\n\r\n\"much\",\"out\",\"other\",\"each\");<\/pre>\n<p>Results (portion):<\/p>\n<pre>\r\n41835 41841\u00a0 |people|\r\n41845 41850\u00a0 |haiti|\r\n41858 41863\u00a0 |lives|\r\n41878 41887\u00a0 |americans|\r\n41896 41903\u00a0 |dropped|\r\n41904 41914\u00a0 |everything|\r\n41921 41930\u00a0 |someplace|\r\n41940 41945\u00a0 |never|\r\n41955 41959\u00a0 |pull|\r\n41960 
41966\u00a0 |people|\r\n41976 41981\u00a0 |never|\r\n41982 41987\u00a0 |known|\r\n41997 42003\u00a0 |rubble|\r\n42005 42014\u00a0 |prompting|\r\n42015 42021\u00a0 |chants|\r\n42057 42064\u00a0 |another|\r\n42065 42069\u00a0 |life|\r\n42074 42079\u00a0 |saved|\r\n42084 42090\u00a0 |spirit|\r\n42100 42109\u00a0 |sustained|\r\n42115 42121\u00a0 |nation|\r\n42126 42130\u00a0 |more|\r\n42140 42149\u00a0 |centuries|\r\n42150 42155\u00a0 |lives|\r\n42171 42177\u00a0 |people|\r\n42187 42195\u00a0 |finished|\r\n42198 42207\u00a0 |difficult|\r\n42208 42212\u00a0 |year|\r\n42237 42246\u00a0 |difficult|<\/pre>\n<pre>\r\n42247 42253\u00a0 |decade|\r\n42265 42269\u00a0 |year|\r\n42286 42292\u00a0 |decade|\r\n42293 42302\u00a0 |stretches|\r\n42324 42328\u00a0 |quit|\r\n42339 42343\u00a0 |quit|\r\n42357 42362\u00a0 |seize|\r\n42368 42374\u00a0 |moment|\r\n42380 42385\u00a0 |start|\r\n42405 42410\u00a0 |dream|\r\n42411 42418\u00a0 |forward|\r\n42427 42437\u00a0 |strengthen|\r\n42442 42447\u00a0 |union|\r\n42453 42457\u00a0 |more|<\/pre>\n<pre>42458 42463\u00a0 |thank|\r\n42469 42472\u00a0 |god|\r\n42473 42478\u00a0 |bless|\r\n42488 42491\u00a0 |god|\r\n42492 42497\u00a0 |bless|\r\n\r\n42502 42508\u00a0 |united|\r\n42509 42515\u00a0 |states|\r\n42519 42526\u00a0 |america|\r\n<\/pre>\n<p>I added stop words based on the unnecessary words that kept appearing in the previous parse. Doing this reinforced the idea that picking stop words is largely a matter of bias. To me it was clear which words were not meaningful, so I kept taking them out. At some point, though, it became more complicated. For example, words like &#8220;own&#8221;, which can be either a verb or an adjective, could be interesting, as could things like &#8220;come&#8221;, but I decided they were not important.<\/p>\n<p>It seems to me that Google, Yahoo, and Bing all have very similar stop word lists. One article said that Yahoo! search is now powered by Bing in the US and Google in Japan. 
(http:\/\/www.seobook.com\/relevancy\/)<\/p>\n<p>This article states:<\/p>\n<p>&#8220;<span style=\"color: #333331; font-family: 'Segoe UI', 'Lucida Grande', arial, verdana, tahoma; line-height: 22px;\">You can differentiate by having product information. But Google scrapes it. You can differentiate through consumer &amp; editorial reviews. But Google scrapes it. You can differentiate by brand, but Google sells branded keywords to competitors. No matter what you do, Google competes against you. You can opt out of being scraped, but then you get no search traffic (&amp; the ecosystem is set up to pay someone else to scrape your content + wrap it in ads).&#8221; <\/span><\/p>\n<p><span style=\"color: #333331; font-family: 'Segoe UI', 'Lucida Grande', arial, verdana, tahoma; line-height: 22px;\"><span style=\"font-family: Georgia, 'Times New Roman', 'Bitstream Charter', Times, serif; line-height: 19px; color: #000000;\">For Google, there are many other, more complicated and more important algorithms for text parsing, in addition to the ad \/ proprietary \/ information-ownership motives and the need to define the user&#8217;s intentions. <\/span><\/span><\/p>\n<p>When stop-word filtering is used alone, a search engine has no way of ordering the results in a meaningful way. Google looks for things like &#8220;natural citations&#8221; to further sort the results, presenting the ones that also include the stop words as more relevant. 
It also has really interesting web-based approaches, like looking for natural link growth, the depth and quality of links to and from a page, and a page&#8217;s age.<\/p>\n<p>Yahoo, from what I understand, sells more search priority through ads than Google does and is worse at distinguishing link quality and depth, so it constantly presents us with tangentially connected information, which I have to say has its place. I love it when I go to check my Yahoo spam account, which I keep mostly for the purpose of distraction, and come across a link to a cat doing something cute, which brings me to some advertisement, and then to another exciting link, and helps me pass some time while I get ready to get back to work. It has its place as a totally different kind of search: while some search engines work to meet people&#8217;s needs, others work to reduce people&#8217;s attention spans to fit their own shifty search strategies and criteria.<\/p>\n<p>The most important thing is the slow convergence \/ homogenization of search strategies, mostly following Google, which limits web recommendations and search results more and more. Because ranking is often based on &#8220;link growth&#8221;, &#8220;viral&#8221; things are almost self-fulfilling prophecies.<\/p>\n<p>As search engines become personalized, they get to limit the recommendations for each person even more. On our end, we seem to develop a relationship with our preferred search engines as well, learning their stop words and adapting our queries to their parsing style to fit our needs. This two-way relationship becomes more and more specific, isolated, and limited, and as we learn the search engine&#8217;s biases, we become less and less likely to search for anything outside of them. 
Maybe the key is for different search engines to continue to evolve in different directions, so that the recommendations and search results on each would not respond to our behaviors in the same way.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Based on the following stop words, I tokenized Obama&#8217;s most recent State of the Union address. Below is a portion of the results. Stop words: (&#8220;a&#8221;, &#8220;an&#8221;, &#8220;the&#8221;,&#8221;and&#8221;,&#8221;.&#8221;,&#8221;,&#8221;,&#8221; &#8220;,&#8221;because&#8221;, &#8220;why&#8221;,&#8221;this&#8221;,&#8221;is&#8221;,&#8221;of&#8221;,&#8221;in&#8221;,&#8221;if&#8221;,&#8221;that&#8221;,&#8221;that&#8217;s&#8221;,&#8221;it&#8221;,&#8221;then&#8221;,&#8221;than&#8221;,&#8221;when&#8221;, &#8220;we&#8221;,&#8221;as&#8221;,&#8221;from&#8221;,&#8221;to&#8221;,&#8221;our&#8221;,&#8221;s&#8221;,&#8221;have&#8221;,&#8221;they&#8221;,&#8221;have&#8221;,&#8221;?&#8221;, &#8220;all&#8221;,&#8221;must&#8221;,&#8221;who&#8221;,&#8221;you&#8221;,&#8221;on&#8221;,&#8221;for&#8221;,&#8221;may&#8221;,&#8221;be&#8221;,&#8221;\/&#8221;,&#8221;\\&#8221;\\'&#8221;,&#8221;\\&#8221;&#8221;,&#8221;get&#8221;,&#8221;are&#8221;,&#8221;i&#8221;,&#8221;am&#8221;,&#8221;not&#8221;, &#8220;m&#8221;,&#8221;make&#8221;,&#8221;makes&#8221;,&#8221;for&#8221;,&#8221;into&#8221;,&#8221;but&#8221;,&#8221;can&#8221;,&#8221;only&#8221;,&#8221;happen&#8221;,&#8221;don&#8221;,&#8221;same&#8221;,&#8221;against&#8221;,&#8221;nearly&#8221;, &#8220;entire&#8221;,&#8221;sure&#8221;, &#8220;u&#8221;, &#8220;!&#8221;, &#8220;was&#8221;, &#8220;has&#8221;, &#8220;its&#8221;, &#8220;through&#8221;, &#8220;me&#8221;, &#8220;his&#8221;,&#8221;once&#8221;,&#8221;carry&#8221;, &#8220;anew&#8221;,&#8221;&#8216;&#8221;, &#8220;t&#8221;,&#8221;let&#8221;, &#8220;us&#8221;, &#8220;new&#8221;, &#8220;before&#8221;, &#8220;come&#8221;,&#8221;two&#8221;, &#8220;one&#8221;, &#8220;ve&#8221;, &#8220;go&#8221;, &#8220;8&#8221;, &#8220;she&#8221;, &#8220;her&#8221;, 
&#8220;he&#8221;,&#8221;none&#8221;,&#8221;at&#8221;,&#8221;been&#8221;,&#8221;these&#8221;,&#8221;what&#8221;,&#8221;up&#8221;,&#8221;were&#8221;,&#8221;them&#8221;,&#8221;some&#8221;,&#8221;had&#8221;, &#8220;their&#8221;,&#8221;do&#8221;,&#8221;by&#8221;,&#8221;or&#8221;,&#8221;re&#8221;,&#8221;aren&#8221;,&#8221;so&#8221;,&#8221;with&#8221;,&#8221;will&#8221;,&#8221;my&#8221;,&#8221;no&#8221;,&#8221;there&#8221;,&#8221;here&#8221;,&#8221;went&#8221;, [&hellip;]<\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[25,26,1],"tags":[],"class_list":["post-672","post","type-post","status-publish","format-standard","hentry","category-learning-bit-x-bit","category-notes","category-uncategorized"],"_links":{"self":[{"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/posts\/672","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/comments?post=672"}],"version-history":[{"count":7,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/posts\/672\/revisions"}],"predecessor-version":[{"id":675,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/posts\/672\/revisions\/675"}],"wp:attachment":[{"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/media?parent=672"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/categories?post=672"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.mariarabinovich.com\/blog\/wp-json\/wp\/v2\/tags?post=672"}],"curies":[{"name"
:"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}