Copyright © 2017 DataScience.US All Rights Reserved.
Vectorizing Words For Natural Language Processing: Why It’s Necessary
The ability to process natural human language is one of the things that makes machine learning algorithms so powerful. However, when doing natural language processing, words must be converted into vectors that machine learning algorithms can make use of.
If your goal is to do machine learning on text data, like movie reviews or tweets or anything else, you need to convert the text data into numbers. This process is sometimes referred to as “embedding” or “vectorization”.
In terms of vectorization, it is important to remember that it isn’t merely turning a single word into a single number. While words can be transformed into numbers, an entire document can be translated into a vector. Not only can a vector have more than one dimension, but with text data vectors are usually high-dimensional. This is because each dimension of your feature data will correspond to a word, and the language in the documents you are examining will have thousands of words.
The Bag of Words Approach
There are many different approaches to vectorizing words, and one of the simplest is known as the “Bag of Words” approach. In a Bag of Words approach, there will be many documents in a corpus under examination, and each document will have many words within it. The number of unique words across the whole corpus (the vocabulary) sets the dimensionality of the vector that needs to be created: a corpus containing 800 distinct words would translate to an 800-dimensional vector. Every unique word has its own place within that vector, since there has to be a consistent mapping to enable the transition back and forth between numbers and words.
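As a rough illustration of that word-to-position mapping, here is a minimal Python sketch. The toy corpus, the whitespace tokenization, and the names `build_vocabulary`, `word_to_index`, and `index_to_word` are all assumptions made for this example, not part of any particular library.

```python
# Hypothetical toy corpus; real corpora contain thousands of distinct words.
corpus = [
    "the movie was great",
    "the movie was terrible",
    "great acting and a great plot",
]

def build_vocabulary(documents):
    """Assign every unique word in the corpus a fixed position in the vector."""
    unique_words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(unique_words)}

word_to_index = build_vocabulary(corpus)
# The reverse mapping goes from a vector position back to its word.
index_to_word = {index: word for word, index in word_to_index.items()}

# The vocabulary size is the dimensionality of every document vector.
print(len(word_to_index))
print(word_to_index)
```

Sorting the words is only one way to make the mapping consistent; any ordering works as long as it stays the same for every document.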
The map denotes that a specific position in the vector corresponds to a specific word. If that word is found in a document, a “one” is placed at that position; if it is not found, a zero is placed there. This is closely related to “one-hot encoding”: strictly speaking, a one-hot vector represents a single word with exactly one “one”, and the document vector is what you get by combining the one-hot vectors of all its words (sometimes called a binary or multi-hot encoding). Either way, the entire document ends up represented as a series of ones and zeroes. The ones correspond to the words that appear in the document, and most positions in the vector will be zeroes since most words don’t appear in most documents.
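The binary encoding described above can be sketched in a few lines of Python. The small vocabulary and the function name `binary_vector` are hypothetical, chosen only to keep the example readable.

```python
def binary_vector(document, word_to_index):
    """Return a list of ones and zeroes marking which vocabulary words appear."""
    vector = [0] * len(word_to_index)
    for word in document.lower().split():
        if word in word_to_index:  # words outside the vocabulary are ignored
            vector[word_to_index[word]] = 1
    return vector

# Hypothetical five-word vocabulary mapping each word to a vector position.
vocab = {"acting": 0, "great": 1, "movie": 2, "plot": 3, "terrible": 4}

# "great" appears twice, but a binary vector only records presence.
encoded = binary_vector("great movie great plot", vocab)
print(encoded)  # [0, 1, 1, 1, 0]
```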
There are variations of this process that aren’t binary and instead record how many times a specific word occurs. For example, if the word “infamous” appeared four times in a document, its position would hold a four instead of a one.
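A count-based variant might look like the following sketch, which uses Python's standard `collections.Counter`. The vocabulary and the sample sentence are made up for illustration.

```python
from collections import Counter

def count_vector(document, word_to_index):
    """Return a vector holding how many times each vocabulary word occurs."""
    counts = Counter(document.lower().split())
    vector = [0] * len(word_to_index)
    for word, count in counts.items():
        if word in word_to_index:
            vector[word_to_index[word]] = count
    return vector

# Hypothetical three-word vocabulary.
vocab = {"infamous": 0, "outlaw": 1, "sheriff": 2}

counted = count_vector("infamous outlaw infamous infamous infamous", vocab)
print(counted)  # [4, 1, 0]
```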
It may also be helpful to remove “stop words” from your vectors. Stop words are words that are extremely common and aren’t very helpful for distinguishing meaning in a document. These are typically grammatical “filler” words, such as: and, or, if, the, is, etc. Removing stop words usually preserves most of the information, since the nouns and adjectives remain, while reducing the dimensionality of the vector.
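Stop word removal is usually done before the vocabulary is built. The sketch below assumes a tiny hand-written stop word set; `STOP_WORDS` and `remove_stop_words` are illustrative names, and real projects typically rely on a much longer, curated list.

```python
# A deliberately tiny, hypothetical stop word list.
STOP_WORDS = {"and", "or", "if", "the", "is", "a", "was"}

def remove_stop_words(document, stop_words=STOP_WORDS):
    """Split a document into lowercase words and drop the stop words."""
    return [word for word in document.lower().split() if word not in stop_words]

tokens = remove_stop_words("The movie was great and the plot is clever")
print(tokens)  # ['movie', 'great', 'plot', 'clever']
```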
Limitations of Bag of Words
The Bag of Words approach is naïve in the sense that it ignores how a sentence or paragraph is structured. It takes into account the frequency of words, but it doesn’t encode things like the words surrounding a given word or that word’s position in a sentence. This can be a problem because the meaning of words can be changed significantly by the presence of other words or by their position.
One of the solutions to this problem is the formation of “n-grams”. N-grams work by encoding phrases or chunks of words, typically in blocks of two, three, or four words. While this approach preserves more semantic meaning, it increases the dimensionality of vectors significantly, because every distinct phrase needs its own position. It also makes the data sparser: any particular n-gram appears in far fewer documents than the individual words it is built from, which limits what you can learn from it.
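To make the idea concrete, here is a small sketch that extracts word n-grams from a tokenized sentence. The helper name `ngrams` and the sample sentence are assumptions for this example.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive words, joined into a single phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "this movie was not good".split()

bigrams = ngrams(tokens, 2)
print(bigrams)  # ['this movie', 'movie was', 'was not', 'not good']
```

Note how the bigram “not good” keeps the negation attached to the adjective, which single-word features would lose. Each distinct phrase would then get its own position in the vector, which is where the extra dimensionality comes from.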
Bag of Words also can’t tell you how distinctive a word is, even though a word that shows up repeatedly within a certain document but rarely elsewhere can hint at its importance. Niche words like this can be helpful in distinguishing the meaning of a document. For example, if a news article from the Washington Post repeatedly used the word “neurology”, that would suggest the word is closely related to the topic of the article. However, Bag of Words cannot distinguish this.
A remedy to this problem is to use a Term Frequency-Inverse Document Frequency (TF-IDF) matrix as the approach to vectorization. TF-IDF lets you place more emphasis on rare words by assigning a weight to each word instead of a binary value or raw count. The weight combines the word’s frequency within a document with how rare the word is across the entire corpus. This gives more information about how the words are distributed across the documents overall.
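One common way to compute such a weight is sketched below. It assumes term frequency is the word's count divided by the document length, and inverse document frequency is the natural log of (number of documents / number of documents containing the word). Libraries often use smoothed variants of these formulas, so treat this as an illustration rather than a reference implementation; the corpus and the name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical three-document corpus.
corpus = [
    "the doctor studies neurology",
    "the senator gave the speech",
    "the team won the game",
]
weights = tf_idf(corpus)

# "the" appears in every document, so its weight is zero;
# "neurology" appears in only one, so it receives a positive weight.
print(weights[0]["the"], weights[0]["neurology"])
```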
Applications For Vectorized Words And Beyond
After the words in the document have been pre-processed and vectorized, they can be applied to machine learning tasks. An example of a text classification task would be determining whether an email is spam. This is a binary classification problem, and a naive Bayes classifier is a common choice for it. For a more complicated task, like machine translation or similar, a recurrent neural network could be used.
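As a rough sketch of the spam example, the code below implements a small multinomial naive Bayes classifier over word counts with add-one smoothing. The four training messages, the “spam”/“ham” labels, and the function names are invented for illustration; a real system would use far more data and usually an existing library.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """Count documents per class and words per class from (text, label) pairs."""
    class_doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        class_doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, doc_count in class_doc_counts.items():
        score = math.log(doc_count / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one smoothing avoids zero probabilities for unseen pairs.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data.
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]
model = train_naive_bayes(training)

print(classify("free money", model))      # spam
print(classify("monday meeting", model))  # ham
```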
More recently, Google’s researchers created Word2Vec, a powerful method of producing word embeddings. Rather than giving each word its own dimension, Word2Vec learns dense vectors in which words used in similar contexts end up close together, so it can capture semantic relationships between words. Whatever methods are used to embed or vectorize words for use by machine learning algorithms, the process is absolutely critical to enabling the sophisticated processing of natural language by neural networks.