SemanticVoid - Latest Commentshttp://semanticvoid.disqus.com/enWed, 30 Mar 2016 06:52:31 -0000Re: what the bleep!http://semanticvoid.com/blog/2011/03/04/what-the-bleep-2/#comment-2596562548<p>Great post. I am happy to visit your blog and learn about this wonderful information. Thanks for sharing; keep updating.</p>saki rWed, 30 Mar 2016 06:52:31 -0000Re: Interfacing Hadoop With MySQLhttp://semanticvoid.com/blog/2009/03/05/interfacing-hadoop-with-mysql/#comment-1948622495<p>Very interesting info! <a href="http://hadooptraininginhyderabad.co.in/" rel="nofollow noopener" title="http://hadooptraininginhyderabad.co.in/">Hadoop Online Training</a>.</p>mchandanaMon, 06 Apr 2015 04:34:47 -0000Re: The Book Of Mozillahttp://semanticvoid.com/blog/2006/12/06/the-book-of-mozilla/#comment-1322875515<p>The address is The Book Of Mozilla. It is easy to download Mozilla; you can see the link above in the article, or see my article at <a href="http://www.indodownload.com/" rel="nofollow noopener" title="http://www.indodownload.com/">http://www.indodownload.com/</a></p>nameMon, 07 Apr 2014 16:02:06 -0000Re: Yahoo! Mail Spam Guard Suckshttp://semanticvoid.com/blog/2005/11/11/yahoo-mail-spam-guard-sucks/#comment-744390529<p>Is anyone else being spammed by a Yahoo and/or Hotmail account that sounds like this? Yahoo did not consider an email from "Libya" a security concern; political correctness always gets in the way, but not for the safety of Americans:</p><p>Hi dear<br> My name is Aamira a young female from Libya. I just came across your email address while exploring in the Internet in search of sincere and kind heart ed man for true friendship. Something tells me you are a nice person and i would like to continue good friendship with you. If you don't mind reply me here so that i can explain to you all about myself and also send you some copies of my photograph. Please i will be waiting for your urgent response here. 
Thanks.<br>Aamira.</p>TrthseekerThu, 20 Dec 2012 21:40:27 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-647587727<p>Thumbs up, Justin. Well said.</p>RahulWed, 12 Sep 2012 06:00:28 -0000Re: Accessing the DOM from within the Firefox extensionhttp://semanticvoid.com/blog/2006/06/01/accessing-the-dom-from-within-the-firefox-extension/#comment-536456012<p>Good work, buddy. I will try it now.</p>WasimshahzadWed, 23 May 2012 06:43:57 -0000Re: Accessing the DOM from within the Firefox extensionhttp://semanticvoid.com/blog/2006/06/01/accessing-the-dom-from-within-the-firefox-extension/#comment-492283658<p>Thank you, thank you, thank you!!! There is still such crappy documentation out there!</p>YurikolovskyMon, 09 Apr 2012 13:08:36 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-422784283<p>Can I get the Perl code for finding the cosine similarity of two documents on a Windows machine?</p>Hamsalekha SrSat, 28 Jan 2012 00:39:21 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-352373199<p>If I have to find the cosine similarity between a query and a document, should I consider all the words in the document, or just the words which appear in the query?<br>Thanks.</p>RashmileosMon, 31 Oct 2011 21:07:28 -0000Re: a speed gun for spamhttp://semanticvoid.com/blog/2011/02/24/speed-gun-for-spam/#comment-314452215<p>This is useful too. ok . 
</p>asanandan anandanFri, 06 May 2011 08:11:00 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-192442437<p>If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents have VERY different sizes. All the magic of Cosine Similarity is to abstract away the size of the documents. You are interested in the similarity of the topic of the contents, not in the similarity of the size of the contents.</p>Antoine ImbertTue, 26 Apr 2011 23:34:39 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-165365241<p>An interesting variation on cosine similarity is the "Fisher metric on the multinomial manifold". The idea is to treat documents as multinomial probability distributions, use KL divergence to define distance for a pair of infinitesimally close distributions, and take the shortest path integral to define distance for an arbitrary pair of distributions. Surprisingly, this has a simple closed form. 
It looks like cosine similarity, except you take square roots of relative frequencies; see formula 17.9 in <a href="http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf" rel="nofollow noopener" title="http://yaroslavvb.com/upload/save/lebanon-axiomatic.pdf">http://yaroslavvb.com/uploa...</a></p>Yaroslav BulatovMon, 14 Mar 2011 01:39:52 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-91010096<p>Hi, it's me, Diya. I have a question: is the sample correlation coefficient equal to the cosine similarity? If it is, then how? Do you have any idea about this, or any solution? I need your advice; 
please let me know.</p>diyaThu, 28 Oct 2010 00:21:19 -0000Re: Mindset: Intent-driven Searchhttp://semanticvoid.com/blog/2005/08/29/mindset-intent-driven-search/#comment-314451558<p>I would like to know more about Yahoo! Mindset. When will it be up?</p>ArunWed, 08 Sep 2010 07:35:18 -0000Re: stop wordshttp://semanticvoid.com/blog/2010/08/24/stop-words/#comment-314452201<p>To my knowledge, the concept of "stop words" makes much more sense for IR than for NLP.</p><p>Most of the definitions of stop words - and these definitions vary from one system to another - come from the IR world, not the NLP world. And as far as I know, the term is credited to Hans Peter Luhn, one of the pioneers of information retrieval, who used the concept in his designs.</p><p>IR commonly defines stop words as words or multi-word expressions that do not appear in indices because they are either insignificant (e.g., articles, prepositions) or so common that the number of results would be higher than the system can handle.</p><p>But because of homographs, it is not that simple even for IR. Depending on the context, some sequences of tokens might be considered sequences of stop words, or plain meaningful words. 
For example, "Who" and "The" might be considered two stop words according to this definition, but "The Who" is definitely not a stop word, especially if you are building a music index...</p><p>For NLP, every token is meaningful when you are trying to analyze the syntax and semantics of sentences for applications such as sentiment analysis, (natural language) question answering, text-to-speech synthesis, etc.</p><p>So I agree with you: the concept of stop words is more related to the inability of (some) IR systems to make correct use of them than to anything else :)</p><p>Nicolas.<br>PS: Great tweets and blog, BTW.</p>Nicolas TorzecWed, 25 Aug 2010 15:09:44 -0000Re: stop wordshttp://semanticvoid.com/blog/2010/08/24/stop-words/#comment-76099548<p>I've always been in two minds about stop words...</p><p>My experience has been that given enough data (whatever "enough" might mean; it is so subjective to the task at hand), ignoring special handling of stop words has never made a big difference to the quality of results.</p><p>It's always nice not to need special handling code!</p><p>Mat</p>mat kelceyWed, 25 Aug 2010 00:52:51 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-314451926<p>Hi Anand et al.,</p><p>Interesting dialogue. 
There are some other very interesting points worth bearing in mind when working with these measures.</p><p>Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others - the distance information has been effectively discarded (this is your normalization factor).</p><p>If you're sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you're navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.</p><p>Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions - when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore - all other things being equal - the less significant this difference becomes. All things *may not* be equal though, and it may be that the distance from the origin is a dimension of special significance - so again, check the logic of your app).</p><p>Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors - which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might - but probably shouldn't - say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R² value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).</p><p>While we're on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g. normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I'll leave you to work out whether that is a good thing or not. 
:-)</p>Justin WashtellTue, 17 Aug 2010 19:53:34 -0000Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Bothhttp://semanticvoid.com/blog/2007/02/23/similarity-measure-cosine-similarity-or-euclidean-distance-or-both/#comment-314451923<p>I am interested in regression models, and I have two groups of data (not equal in sample size). I wish to measure the similarity between the two groups of data. How can I do that? I need your advice; please, if you have any idea, let me know.</p><p>Regards,<br>Ahmed</p>ahmedWed, 04 Aug 2010 10:01:02 -0000
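The points made in the comments above - Antoine's observation that cosine similarity abstracts away document size, Justin's note on squaring the value and on square-root normalization, and Yaroslav's square-root-of-relative-frequencies variant - can be sketched in a few lines of Python. This is a minimal illustration only; the function names and the toy term-count vectors are mine, not taken from any commenter's code.

```python
import math

def cosine_similarity(a, b):
    """Angle-based similarity: ignores vector magnitude (document length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Magnitude-sensitive distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sqrt_normalize(counts):
    """Yaroslav's variant: square roots of relative frequencies.

    The result is a unit-length vector whenever the relative
    frequencies sum to 1 (sum of sqrt(p)^2 = sum of p = 1).
    """
    total = sum(counts)
    return [math.sqrt(c / total) for c in counts]

# Two "documents" with identical term proportions but very different lengths:
doc_short = [1, 2, 3]
doc_long = [10, 20, 30]

print(cosine_similarity(doc_short, doc_long))   # ~1.0: same direction, size abstracted away
print(euclidean_distance(doc_short, doc_long))  # large: the size difference dominates

# Squaring the cosine gives a dissimilarity complement equal to sin^2 of the angle:
c = cosine_similarity([1, 0], [1, 1])           # cos(45 degrees)
print(c * c + (1 - c * c))                      # the two squared parts sum to 1 by construction

print(cosine_similarity(sqrt_normalize(doc_short), sqrt_normalize(doc_long)))  # ~1.0
```

As Antoine's comment predicts, the cosine of the two proportional documents is 1 while their Euclidean distance is large; which behavior you want depends on whether document length is meaningful in your application.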