SemanticVoid - Latest Comments

Re: what the bleep!

saki r — Wed, 30 Mar 2016 06:52:31 -0000

Great work. I am happy to have visited your blog and learned this wonderful information. Thanks for sharing; keep updating.

Re: Interfacing Hadoop With MySQL

mchandana — Mon, 06 Apr 2015 04:34:47 -0000

Very interesting info! Hadoop Online Training .

Re: The Book Of Mozilla

name — Mon, 07 Apr 2014 16:02:06 -0000

the address is The Book Of Mozilla it is easy to download mozilla you can see the link above in the article or go to see my article at http://www.indodownload.com/

Re: Yahoo! Mail Spam Guard Sucks

Trthseeker — Thu, 20 Dec 2012 21:40:27 -0000

Anyone being spammed by a yahoo and/or hotmail account that sounds like this as yahoo did not consider that an email from "Libya," was a security concern, as political correctiveness is always such a the way by others, but not for the safety of Americans:

Hi dear
My name is Aamira a young female from Libya. I just came across your email address while exploring in the Internet in search of sincere and kind heart ed man for true friendship. Something tells me you are a nice person and i would like to continue good friendship with you. If you don't mind reply me here so that i can explain to you all about myself and also send you some copies of my photograph. Please i will be waiting for your urgent response here. Thanks.
Aamira.

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Rahul — Wed, 12 Sep 2012 06:00:28 -0000

Thumbs up, Justin. Well said.

Re: Accessing the DOM from within the Firefox extension

Wasimshahzad — Wed, 23 May 2012 06:43:57 -0000

Good work buddy, I will try it now.

Re: Accessing the DOM from within the Firefox extension

Yurikolovsky — Mon, 09 Apr 2012 13:08:36 -0000

Thank you, thank you, thank you!!! such crappy documentation out there still!!

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Hamsalekha Sr — Sat, 28 Jan 2012 00:39:21 -0000

Can I get the Perl code for finding the cosine similarity of two documents on a Windows machine?

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Rashmileos — Mon, 31 Oct 2011 21:07:28 -0000

If I have to find the cosine similarity between a query and a document, should I consider all the words in the document, or just the words that appear in the query?
Thanks.
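
One way to see what is at stake in this question is a minimal sketch, assuming raw term counts as weights and a hypothetical `cosine` helper (neither comes from the original post). Words missing from the query add nothing to the dot product, but they still enter the document's norm, so dropping them changes the score.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy query and document; the texts are made up for illustration.
query = Counter("cosine similarity".split())
doc = Counter("cosine similarity compares the angle between document vectors".split())

# Using every word of the document in its norm.
full = cosine(query, doc)

# Keeping only the document words that also occur in the query.
doc_restricted = {t: c for t, c in doc.items() if t in query}
restricted = cosine(query, doc_restricted)

print(full, restricted)
```

With the restricted vector, any document that contains the query words in the same proportions scores the same, however much unrelated text it also contains. Keeping all document words in the norm penalises that extra text.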

Re: a speed gun for spam

asanandan anandan — Fri, 06 May 2011 08:11:00 -0000

This is useful too. ok . n

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Antoine Imbert — Tue, 26 Apr 2011 23:34:39 -0000

If the norm of the vector representing the first document is A LOT smaller than the norm of the vector representing the second document, then your documents have a VERY different size. All the magic of the Cosine Similarity is to abstract the size of the documents. You are interested in the similarity of the topic of the contents, not in the similarity of the size of the contents.
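
A small sketch of the point above, assuming made-up term counts where the second document is simply the first one repeated ten times. The helper names are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance between two dense vectors."""
    return math.dist(a, b)

# Same topic mix, very different sizes (hypothetical counts).
short_doc = [2, 1, 0, 3]
long_doc = [20, 10, 0, 30]

cos_value = cosine(short_doc, long_doc)       # ignores the size difference
dist_value = euclidean(short_doc, long_doc)   # dominated by the size difference

print(cos_value, dist_value)
```

The cosine comes out at (numerically) 1 because the two vectors point in the same direction, while the Euclidean distance is large purely because one document is longer.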

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Yaroslav Bulatov — Mon, 14 Mar 2011 01:39:52 -0000

An interesting variation on cosine similarity is the "Fisher metric on multinomial manifold". The idea is to treat documents as multinomial probability distributions, use KL divergence to define the distance for a pair of infinitesimally close distributions, and take the shortest path integral to define the distance for an arbitrary pair of distributions. Surprisingly, this has a simple closed form. It looks like cosine similarity, except you take square roots of the relative frequencies; see formula 17.9 in http://yaroslavvb.com/uploa...
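
The linked formula is truncated here, so the following is only a sketch of the commonly stated closed form for this geodesic distance, 2 * arccos(sum_i sqrt(p_i * q_i)), and it is an assumption that this matches formula 17.9. The function names are hypothetical.

```python
import math

def relative_frequencies(counts):
    """Turn raw term counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]

def sqrt_cosine(p, q):
    """Cosine similarity of the square-rooted distributions.

    Because sum(p) == 1, the vector of square roots already has length 1,
    so no further normalisation is needed.
    """
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def fisher_distance(p, q):
    """Geodesic distance under the Fisher metric (assumed closed form)."""
    # Clamp to 1.0 to guard against rounding slightly above the acos domain.
    return 2.0 * math.acos(min(1.0, sqrt_cosine(p, q)))

p = relative_frequencies([3, 1, 0])
q = relative_frequencies([1, 1, 2])
print(sqrt_cosine(p, q), fisher_distance(p, q))
```

Under this form, identical distributions are at distance 0 and distributions with no terms in common are at the maximum distance, pi.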

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

diya — Thu, 28 Oct 2010 00:21:19 -0000

Hi, it's me, Diya. I have a question: is the sample correlation coefficient equal to the cosine between the vectors? If it is, then how? Do you have any idea about this, or any solution? I need your advice. Please let me know.
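
For what it is worth, the standard relationship is that the sample (Pearson) correlation coefficient equals the cosine similarity of the two vectors after each has had its own mean subtracted. A minimal sketch with made-up data and hypothetical helper names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def center(v):
    """Subtract the mean of the vector from every component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]

def pearson(a, b):
    """Sample correlation, written as the cosine of the centered vectors."""
    return cosine(center(a), center(b))

x = [1, 2, 3, 4, 5]
y = [12, 14, 16, 18, 20]   # y = 2x + 10, a perfect linear relationship

print(pearson(x, y))   # correlation of the centered data
print(cosine(x, y))    # cosine of the raw data, which is a different number
```

So the two measures agree only when the data are already mean-centered; on raw vectors the cosine is affected by the offset while the correlation is not.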

Re: Mindset: Intent-driven Search

Arun — Wed, 08 Sep 2010 07:35:18 -0000

I would like to know more about Yahoo! Mindset. When will it be up?

Re: stop words

Nicolas Torzec — Wed, 25 Aug 2010 15:09:44 -0000

To my knowledge the concept of "stop words" makes much more sense for IR than for NLP.

Actually, most definitions of stop words (and these definitions vary from one system to another) come from the IR world, not the NLP world. As far as I know, the term is credited to Hans Peter Luhn, one of the pioneers in information retrieval, who used the concept in his designs.

IR commonly defines stop words as words or multi-word expressions that are left out of indices because they are either insignificant (e.g., articles, prepositions) or so common that a query on them would return more results than the system can handle.

But because of homographs, it is not that simple even for IR. Depending on the context, some sequences of tokens might be considered as sequences of stop words, or as plain meaningful words. For example, "Who" and "The" might be considered as two stop words according to this definition, but "The Who" is definitely not a stop word, especially if you are building a music index...
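
The "The Who" case can be sketched as follows. This is only an illustration: the stop list, the protected-phrase list and the `index_terms` function are all hypothetical, and a real indexer would need far more than whitespace tokenisation.

```python
# Hypothetical stop list and a list of phrases that must survive filtering.
STOP_WORDS = {"the", "who", "a", "an", "of", "is"}
PROTECTED_PHRASES = {("the", "who")}

def index_terms(text):
    """Return the terms to index, checking protected phrases before stop words."""
    tokens = text.lower().split()
    terms = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PROTECTED_PHRASES:
            # Keep the whole phrase as one index term.
            terms.append(" ".join(pair))
            i += 2
        elif tokens[i] in STOP_WORDS:
            # An ordinary stop word: skip it.
            i += 1
        else:
            terms.append(tokens[i])
            i += 1
    return terms

print(index_terms("The Who is a band"))   # ['the who', 'band']
print(index_terms("Who is the singer"))   # ['singer']
```

The same two tokens are dropped in one sentence and kept in the other, which is exactly why a flat stop list is not enough for something like a music index.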

For NLP, every token is meaningful when you are trying to analyze the syntax and semantics of sentences for applications such as sentiment analysis, (natural language) question answering, text-to-speech synthesis, etc.

So I agree with you: the concept of stop words is more related to the inability of (some) IR systems to make correct use of them, than to anything else :)

Nicolas.
PS: Great tweets and blog BTW.

Re: stop words

mat kelcey — Wed, 25 Aug 2010 00:52:51 -0000

I've always been in two minds about stop words...

My experience has been that given enough data (whatever "enough" might mean, it is so subjective to the task at hand) ignoring special handling of stop words has never made a big difference to the quality of results.

It's always nice to not need special handling code!

Mat

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

Justin Washtell — Tue, 17 Aug 2010 19:53:34 -0000

Hi Anand, et al,

Interesting dialogue. There are some other very interesting points worth bearing in mind when working with these measures.

Cosine Similarity and Euclidean Distance capture a lot of the same information. However, whereas Euclidean Distance measures an actual distance between the two points of interest, Cosine can be thought of as measuring their apparent distance as viewed from the origin. Think of stars in the sky if you like analogies. The stars in Taurus are all relatively close together from our point of view. But in reality some of them are probably many times closer to us than others - the distance information has been effectively discarded (this is your normalization factor).

If you're sailing a ship and navigating by the stars, then this is an appropriate thing to do. If you're navigating an interstellar spacecraft, it is probably not. And so it is with your application. Think of the meaning of the vectors with respect to your application and you will choose the most appropriate measure. But combining Cosine with Euclidean will probably not get you very far, as you are simply re-using a lot of information you already have.
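
One concrete way to see how much information the two measures share: once the vectors are scaled to unit length, the squared Euclidean distance is fully determined by the cosine, since |a - b|^2 = 2 * (1 - cos). A minimal sketch with arbitrary example vectors and hypothetical helper names:

```python
import math

def normalize(v):
    """Scale a vector linearly so that its length is 1."""
    n = math.hypot(*v)
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

a = [3, 4, 0]
b = [1, 2, 2]

ua, ub = normalize(a), normalize(b)
squared_distance = sum((x - y) ** 2 for x, y in zip(ua, ub))
from_cosine = 2 * (1 - cosine(a, b))

print(squared_distance, from_cosine)   # the two values agree
```

On raw (unnormalized) vectors the only extra ingredient Euclidean distance sees is the lengths, which is the "distance from the origin" discussed in this comment.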

Another way to think of Cosine Similarity is as measuring the *relative* proportions of the various features or dimensions - when all the dimensions between two vectors are in proportion (correlated), you get maximum similarity. Euclidean distance and its relatives (like Manhattan distance) are more concerned with absolutes. In practice though, the measures will often give similar results, especially on very high dimensional data (by normalization, Cosine Similarity effectively reduces the dimensionality of your data by 1. The higher the dimensionality of your data therefore - all other things being equal - the less significant this difference becomes. All things *may not* be equal though and it may be that the distance from the origin is a dimension of special significance - so again, check the logic of your app).

Another fact about Cosine Similarity that has been pointed out is that it is not really a measure of distance. This is an issue of semantics really. You can address this complaint to a large extent by squaring the value. Your value still ranges between 0 and 1, but it now has the property that the complement of the value (1 minus the value) is equal to the square of the *Sine* of the angle between the vectors - which is an equivalent measure of *dissimilarity*. You need to square the values for this relationship to hold. Doing this gives you something akin to a proportion of similarity or dissimilarity (when it is 0.5 you might - but probably shouldn't - say that your points are neither particularly similar nor dissimilar). It is equivalent to taking the R² value when analysing correlations (which is usually the done thing when trying to do anything cleverer than just rank the data).
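
A quick numerical check of the squared-value relationship, using two arbitrary 2-D vectors so that the angle can be computed independently with `atan2`. The helper names are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def angle_2d(a, b):
    """Angle between two 2-D vectors, from their cross and dot products."""
    cross = a[0] * b[1] - a[1] * b[0]
    dot = a[0] * b[0] + a[1] * b[1]
    return math.atan2(cross, dot)

a = (3, 1)
b = (1, 2)

cos_sq = cosine(a, b) ** 2              # squared similarity
sin_sq = math.sin(angle_2d(a, b)) ** 2  # squared dissimilarity

print(cos_sq, sin_sq, cos_sq + sin_sq)
```

These two vectors happen to be 45 degrees apart, so both squared values come out at 0.5, the "neither similar nor dissimilar" midpoint mentioned above.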

While we're on the subject of interpreting values, if your vector components comprise probabilities which sum to 1 (e.g. normalized frequency counts, which represent the probability of finding any particular word at a random location in the document), then you can normalize the vector by simply converting all of the probabilities to their square roots. This will naturally give you a vector whose *length* is 1. Note that this is not the same as scaling the vector linearly as is often done (and as Cosine Similarity does), although that also results in a vector of length 1! I'll leave you to work out whether that is a good thing or not. :-)
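
The two normalizations can be compared directly. A minimal sketch, assuming a made-up three-word probability vector:

```python
import math

def length(v):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(sum(x * x for x in v))

# Hypothetical word probabilities that sum to 1.
probs = [0.5, 0.3, 0.2]

# Square-root normalization: the squares of the components sum back to 1.
sqrt_vec = [math.sqrt(p) for p in probs]

# Linear scaling, as done implicitly by Cosine Similarity.
norm = length(probs)
scaled_vec = [p / norm for p in probs]

print(length(sqrt_vec), sqrt_vec)
print(length(scaled_vec), scaled_vec)
```

Both results have length 1, but they point in different directions: taking square roots boosts the rarer words relative to the frequent ones, whereas linear scaling keeps the original proportions.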

Re: Similarity Measure: Cosine Similarity or Euclidean Distance or Both

ahmed — Wed, 04 Aug 2010 10:01:02 -0000

I am interested in regression models, and I have two groups of data (not equal in sample size). I wish to measure the similarity between the two groups of data. How can I do that? I need your advice. If you have any idea, please let me know.

Regards,
Ahmed