<?xml version="1.0" encoding="utf-8"?>
<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title>SemanticVoid - Latest Comments</title><link xmlns="http://www.w3.org/2005/Atom" rel="http://api.friendfeed.com/2008/03#sup" href="http://disqus.com/sup/all.sup#forumcomments-ffc15fdd" type="application/json"/><link>http://semanticvoid.disqus.com/</link><description></description><atom:link href="http://semanticvoid.disqus.com/comments.rss" rel="self"></atom:link><language>en</language><lastBuildDate>Mon, 09 Apr 2012 13:08:36 -0000</lastBuildDate><item><title>Re: Accessing the DOM from within the Firefox extension</title><link>http://semanticvoid.com/blog/2006/06/01/accessing-the-dom-from-within-the-firefox-extension/#comment-492283658</link><description>&lt;p&gt;Thank you, thank you, thank you!!! such crappy documentation out there still!!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Yurikolovsky</dc:creator><pubDate>Mon, 09 Apr 2012 13:08:36 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-422784283</link><description>&lt;p&gt;can i get the perl code for finding cosine similarity of two documents on a windows machine?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Hamsalekha Sr</dc:creator><pubDate>Sat, 28 Jan 2012 00:39:21 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-352373199</link><description>&lt;p&gt;If I have to find cosine similarity between a query and a document, Should I consider all words in the document? Or just the words which appear in the query?&lt;br&gt;Thanks. 
&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Rashmileos</dc:creator><pubDate>Mon, 31 Oct 2011 21:07:28 -0000</pubDate></item><item><title>Re: a speed gun for spam</title><link>http://semanticvoid.com/blog/2011/02/24/speed-gun-for-spam/#comment-314452215</link><description>&lt;p&gt;This is useful too. ok . n&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">asanandan anandan</dc:creator><pubDate>Fri, 06 May 2011 08:11:00 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-192442437</link><description>&lt;p&gt;If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents have a VERY different size. All the magic of the Cosine Similarity is to abstract the size of the documents.  You are interested in the similarity of the topic of the contents, not in the similarity of the size of the contents. &lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Antoine Imbert</dc:creator><pubDate>Tue, 26 Apr 2011 23:34:39 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-165365241</link><description>&lt;p&gt;An interesting variation on cosine similarity is the "Fisher metric on multinomial manifold". The idea is to treat documents as multinomial probability distributions, use KL divergence to define distance for pair of infinitesimally close distributions and take shortest path integral to define distance for arbitrary pair of distributions. Surprisingly, this has a simple closed form. It looks like cosine similarity, except you take square roots of relative frequencies, see formula 17.9 in &lt;a href="http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf" rel="nofollow"&gt;http://yaroslavvb.com/upload/s...&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Yaroslav Bulatov</dc:creator><pubDate>Mon, 14 Mar 2011 01:39:52 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-91010096</link><description>&lt;p&gt;hi its me diya, i have a question that is sample correlation coeffient is equal to the cosine vector?? if it is then how? have you any idea about this or any solution?? i need your advise.. plz let me know..&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">diya</dc:creator><pubDate>Thu, 28 Oct 2010 00:21:19 -0000</pubDate></item><item><title>Re: Mindset: Intent-driven Search</title><link>http://semanticvoid.com/blog/2005/08/29/mindset-intent-driven-search/#comment-76099013</link><description>&lt;p&gt;I would like to know more about Yahoo! Mindset ? When it will be up ?&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Arun</dc:creator><pubDate>Wed, 08 Sep 2010 07:35:18 -0000</pubDate></item><item><title>Re: stop words</title><link>http://semanticvoid.com/blog/2010/08/24/stop-words/#comment-76099549</link><description>&lt;p&gt;To my knowledge the concept of "stop words" makes much more sense for IR than for NLP.&lt;/p&gt;

&lt;p&gt;Actually most of the definitions for stop words - and these definitions vary from one system to another - come from the IR world, not the NLP world. And as far as I know, the term has been credited to Hans Peter Luhn, one of the pioneers in information retrieval, who used the concept in his designs.&lt;/p&gt;

&lt;p&gt;IR commonly defines stop words as words or multi-word terms that do not appear in indices because they are either insignificant (e.g., articles, prepositions) or so common that the number of results would be higher than the system can handle.&lt;/p&gt;

&lt;p&gt;But because of homographs, it is not that simple even for IR. Depending on the context, some sequences of tokens might be considered as sequences of stop words, or as plain meaningful words. For example, "Who" and "The" might be considered as two stop words according to this definition, but "The Who" is definitely not a stop word, especially if you are building a music index...&lt;/p&gt;
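
&lt;p&gt;As a rough sketch of the "The Who" problem, and not taken from any particular IR system, the Python below drops stop words at indexing time but first checks a short list of protected multi-word phrases. The word list, the phrase list and the function name index_terms are all hypothetical and only there for illustration.&lt;/p&gt;

```python
# Hypothetical stop-word list and protected phrases, for illustration only.
STOP_WORDS = {"the", "who", "a", "an", "of", "is"}
PROTECTED_PHRASES = {("the", "who")}


def index_terms(text):
    """Return the terms to index, keeping protected two-word phrases whole."""
    tokens = text.lower().split()
    terms = []
    i = 0
    while i != len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PROTECTED_PHRASES:
            # The phrase is meaningful even though each word is a stop word.
            terms.append(" ".join(pair))
            i += 2
        elif tokens[i] in STOP_WORDS:
            # An isolated stop word is skipped.
            i += 1
        else:
            terms.append(tokens[i])
            i += 1
    return terms


print(index_terms("The Who is a band"))
print(index_terms("Who is the singer"))
```

&lt;p&gt;With these assumed lists the first call keeps "the who" as a single term, while the second call drops "who" and "the" because they appear on their own.&lt;/p&gt;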

&lt;p&gt;For NLP, every token is meaningful when you are trying to analyze the syntax and semantics of sentences for applications such as sentiment analysis, (natural language) question answering, text-to-speech synthesis, etc.&lt;/p&gt;

&lt;p&gt;So I agree with you: the concept of stop words is more related to the inability of (some) IR systems to make correct use of them, than to anything else :)&lt;/p&gt;

&lt;p&gt;Nicolas.&lt;br&gt;PS: Great tweets and blog BTW.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Nicolas Torzec</dc:creator><pubDate>Wed, 25 Aug 2010 15:09:44 -0000</pubDate></item><item><title>Re: stop words</title><link>http://semanticvoid.com/blog/2010/08/24/stop-words/#comment-76099548</link><description>&lt;p&gt;I've always been in two minds about stop words...&lt;/p&gt;

&lt;p&gt;My experience has been that given enough data (whatever "enough" might mean, it is so subjective to the task at hand) ignoring special handling of stop words has never made a big difference to the quality of results. &lt;/p&gt;

&lt;p&gt;It's always nice to not need special handling code!&lt;/p&gt;

&lt;p&gt;Mat&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">mat kelcey</dc:creator><pubDate>Wed, 25 Aug 2010 00:52:51 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-76099261</link><description>&lt;p&gt;Hi Anand, et al,&lt;/p&gt;

&lt;p&gt;Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.&lt;/p&gt;

&lt;p&gt;Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others - the distance information has been effectively discarded (this is your normalization factor).&lt;/p&gt;

&lt;p&gt;If you're sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you're navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.&lt;/p&gt;

&lt;p&gt;Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions - when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore - all other things being equal - the less significant this difference becomes. All things *may not* be equal though and it may be that the distance from the origin is a dimension of special significance - so again, check the logic of your app).&lt;/p&gt;
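
&lt;p&gt;A minimal sketch of the relative-versus-absolute distinction, assuming plain Python lists of non-zero term counts. The two example vectors and the helper names are made up for illustration.&lt;/p&gt;

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term counts: same proportions, but one document is ten times longer.
short_doc = [1.0, 2.0, 3.0]
long_doc = [10.0, 20.0, 30.0]

print(cosine_similarity(short_doc, long_doc))   # close to 1: same direction
print(euclidean_distance(short_doc, long_doc))  # large: very different magnitudes
```

&lt;p&gt;The two vectors point in the same direction, so the cosine is 1 up to floating-point rounding, while the Euclidean distance is dominated by the difference in length.&lt;/p&gt;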

&lt;p&gt;Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors - which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might - but probably shouldn't - say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R² value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).&lt;/p&gt;
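
&lt;p&gt;The squaring trick can be checked numerically. The sketch below assumes two-dimensional vectors so that the angle can be computed independently with atan2; the vectors and the function names are illustrative only.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes neither is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def squared_cosine(a, b):
    """Squared cosine similarity, always between 0 and 1."""
    return cosine_similarity(a, b) ** 2


# Two hypothetical 2-D vectors that are 45 degrees apart.
a = [1.0, 0.0]
b = [1.0, 1.0]

# Angle computed independently from the 2-D cross and dot products.
angle = math.atan2(a[0] * b[1] - a[1] * b[0], a[0] * b[0] + a[1] * b[1])

similarity = squared_cosine(a, b)
dissimilarity = 1.0 - similarity

# The complement of the squared cosine matches the squared sine of the angle.
print(similarity, dissimilarity, math.sin(angle) ** 2)
```

&lt;p&gt;At 45 degrees both the similarity and the dissimilarity come out at about 0.5, which is the "neither similar nor dissimilar" midpoint mentioned above.&lt;/p&gt;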

&lt;p&gt;While we're on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I'll leave you to work out whether that is a good thing or not. :-)&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Justin Washtell</dc:creator><pubDate>Tue, 17 Aug 2010 19:53:34 -0000</pubDate></item><item><title>Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both</title><link>http://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-76099258</link><description>&lt;p&gt;I am interested in regression models and I have two groups of data ( not equal in sample size). I wish to measure the similarity between the two groups of data.  How can i do that. I need your advice. please if you have any idea let me know.&lt;/p&gt;

&lt;p&gt; &lt;br&gt;Regards,&lt;br&gt;Ahmed&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">ahmed</dc:creator><pubDate>Wed, 04 Aug 2010 10:01:02 -0000</pubDate></item><item><title>Re: Calculation Of Tag Popularity</title><link>http://semanticvoid.com/blog/2006/01/02/calculation-of-tag-popularity/#comment-76099088</link><description>&lt;p&gt;It is excellent post!&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">alladinnn</dc:creator><pubDate>Wed, 21 Jul 2010 06:12:45 -0000</pubDate></item><item><title>Re: Reading Less Is Reading More</title><link>http://semanticvoid.com/blog/2009/10/07/reading-less-is-reading-more/#comment-76099535</link><description>&lt;p&gt;Wow nice idea&lt;br&gt;Lets hope it will work as ur wish&lt;/p&gt;

&lt;p&gt;&lt;a href="http://helloplot.com" rel="nofollow"&gt;helloplot.com&lt;/a&gt;&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">suby cherian</dc:creator><pubDate>Mon, 19 Jul 2010 07:58:02 -0000</pubDate></item><item><title>Re: Twitter Plots: the case of the 2000 &amp;#8216;following&amp;#8217;</title><link>http://semanticvoid.com/blog/2010/05/27/twitter-plots-the-case-of-the-2000-following/#comment-314452183</link><description>&lt;p&gt;Thanks - this makes sense. Was not aware of it.&lt;/p&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Anand Kishore</dc:creator><pubDate>Sun, 30 May 2010 13:47:05 -0000</pubDate></item></channel></rss>
