Understanding Word Embeddings



Humans easily understand language in the form of free text. They are good at resolving the ambiguity and variability of an ever-changing, evolving language.

For example, consider the sentences:

1) I play soccer with my friends.

2) My friends and I play soccer.

3) I play soccer with my books.

We can easily deduce that sentences 1 and 2 mean the same thing even though the form changes, while sentence 3 does not make sense even though it is syntactically correct.

But getting computers to reach the same intuitive understanding is a tough task. Machines only understand the language of numbers; they cannot process text or characters in raw form. For a machine learning model to comprehend natural language, the text needs to be converted to a numerical representation. This numerical representation is called a word vector or word embedding.

Let's look at the first and simplest word vector:

One-Hot Vectors (OHV)

A general method for converting any categorical feature to a vector is One-Hot Encoding. The idea is to represent each word using a binary vector of dimension $|V|$ (the vocabulary size), with every value 0 except for a 1 at the index of the word.

Consider the sentence: “Its very hot today”.

The vocabulary $V$ would contain 4 words: $[‘Its’, ‘very’, ‘hot’, ‘today’]$

Each word here is a one-hot vector in $\mathbb{R}^{|V|\times 1}$ with a 1 only at the word's index in the vocabulary and zeros everywhere else. $|V|$ is the size of the vocabulary (4 in this case).
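The encoding described above can be sketched in a few lines of plain Python. This is only an illustration: the names `vocab`, `word_to_index` and `one_hot` are made up for this example, and the vocabulary is simply the words of the example sentence in order.

```python
# Minimal one-hot encoding sketch (all names here are illustrative).
sentence = "Its very hot today"
vocab = sentence.split()  # ['Its', 'very', 'hot', 'today']
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-dimensional list with a 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


for word in vocab:
    print(word, one_hot(word))
```

Every vector has length 4 and contains exactly one 1, so the dot product of any two different words is 0.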

This representation has two drawbacks:

  • Each word is represented as an independent entity with no relation to the context of the sentence.
  • Moreover, real language does not have only 4 words in its vocabulary. We need to work with millions of words, and it would be computationally impractical to represent each word as a million-dimensional sparse vector that is 0 everywhere except at a single index. We need vectors that are dense and able to capture contextual information.

There are two main approaches to building such vectors:

  1. Frequency / Count based approach
  2. Context / Prediction based approach

Frequency based approach

Language is fundamentally composed of characters. These characters form words, which in turn form sentences, paragraphs and documents. The most apparent features in a text are the counts of the words and the order in which they appear.

For example, a book on “Gandhi” would contain more occurrences of the word “Gandhi” than of the word “Cricket”.

A common way to use this is to count the occurrences of each word in a text and use those counts as features. This is called the Bag-of-Words (BoW) approach.

For example:

  • Document $D_1$ : He plays soccer and also plays violin.
  • Document $D_2$ : She mostly plays tennis.

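Vocabulary $V = [‘he’, ‘plays’, ‘soccer’, ‘and’, ‘also’, ‘violin’, ‘she’, ‘mostly’, ‘tennis’]$

The matrix $M^{D\times V}$, with one row per document and one column per word in the order of $V$, looks like this:

$$M = \begin{bmatrix} 1 & 2 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}$$

So the vector for each word is its column in $M$: for example, ‘he’ is $[1, 0]^T$, ‘plays’ is $[2, 1]^T$ and ‘tennis’ is $[0, 1]^T$.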
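The counting step can be sketched in plain Python as follows. This is a rough illustration rather than a reference implementation: the `tokenize` helper and the variable names are made up for this example, tokenization is just lowercasing and splitting on whitespace, and the vocabulary is built from every token in order of first appearance (so it includes "also").

```python
from collections import Counter

documents = [
    "He plays soccer and also plays violin.",
    "She mostly plays tennis.",
]


def tokenize(text):
    """Lowercase, drop periods and split on whitespace (a simplification)."""
    return text.lower().replace(".", "").split()


tokenized = [tokenize(doc) for doc in documents]

# Vocabulary in order of first appearance across the documents.
vocab = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocab:
            vocab.append(token)

# Document-term matrix: one row per document, one column per vocabulary word.
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[count[word] for word in vocab] for count in counts]

# The vector of a word is its column in the matrix.
word_vectors = {word: [row[j] for row in matrix] for j, word in enumerate(vocab)}

print(vocab)
for row in matrix:
    print(row)
print(word_vectors["plays"])
```

Each word vector here has only as many dimensions as there are documents, so with two documents many words end up with identical vectors.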

This BoW approach is generally modified to use TF-IDF weighting [1], which gives more importance to words that are specific to a document and downweights words that appear in almost all documents. (Ex: “a, the, are” should be given less focus as they appear in almost all documents.)
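To make the weighting concrete, here is a small sketch using one common variant, $\text{tfidf}(w, d) = \text{tf}(w, d) \cdot \log(N / \text{df}(w))$, where $\text{tf}$ is the raw count of $w$ in document $d$, $N$ is the number of documents and $\text{df}(w)$ is the number of documents containing $w$. This exact formula is an assumption made for illustration; other variants add smoothing or normalization. The function name `tf_idf` and the pre-tokenized toy documents are also made up for this example.

```python
import math
from collections import Counter

# Toy corpus, already lowercased and tokenized for simplicity.
documents = [
    "he plays soccer and also plays violin".split(),
    "she mostly plays tennis".split(),
]


def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


weights = tf_idf(documents)
for doc_weights in weights:
    print(doc_weights)
```

In this toy corpus "plays" occurs in both documents, so its weight is 0 in each of them, while a word such as "soccer" that occurs in only one document keeps a positive weight.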

To be continued…