From Linguistic Symbols to Vector Spaces: A Beginner’s Guide to Word Embeddings in NLP

Vector Representations

One of the primary challenges in NLP is translating everything that makes up a human language – not just words, phrases and sentences, but also the associations between them, their nuances, emotional tones, and so on – into a format that machines can understand. How do we convey all the layers of human linguistic intuition to the binary logic of machines?

This is where vector representations come into play. Commonly known as embeddings, these are mathematical representations of data in a vector space: they convert information into arrays of numbers (vectors) suitable for NLP tasks. They don’t just convert text into numerical form; they also capture the semantic and contextual relationships between words.

Word embeddings allow machine learning models to understand and perform computations with textual data. They are typically real-valued dense vectors (usually a few dozen to a few hundred dimensions), where words with similar meanings, or words that appear in similar contexts, have vectors that are close together in the vector space. This makes them suitable for tasks where semantic understanding is crucial, such as sentiment analysis, machine translation, or text classification.

[Image: screenshot from the DataStax Astra DB vector database]

Different Types of Word Embeddings

The most common way of categorizing word embeddings is by the method of generation, i.e., the technique used to produce the vectors that represent words (or larger units of text). There are two main categories:

1. Frequency-based Embeddings – rely on statistical measures to generate word embeddings. The assumption is that the meaning of a word can be inferred from its context, i.e., the words it frequently co-occurs with. By analyzing how often words appear together in large datasets, these techniques try to capture semantic and syntactic relationships.

  • Count Vectors – also known as the “Bag of Words” (BoW) technique, it counts the frequency of each word in a document while ignoring word order, so it doesn’t capture the semantic meaning of words. Text is transformed into a bag of words, and each token (word or n-gram) is represented by its frequency count. The scikit-learn library even has a CountVectorizer class, so it’s very straightforward (see the sketch below).
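A minimal bag-of-words sketch, assuming scikit-learn ≥ 1.0 (the toy corpus is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)   # sparse document-term matrix of raw counts

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per document
```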

  • TF-IDF Vectors (Term Frequency-Inverse Document Frequency) – a statistical measure that evaluates the importance of a word in a document relative to a collection of documents (corpus): words that appear often in a particular document but not in many others get the highest scores and are identified as important and unique to that document. This is useful for text classification, text clustering, and topic modeling. However, this technique doesn’t capture semantic meaning or word order either. Some NLP libraries that provide TF-IDF classes: scikit-learn – TfidfVectorizer; gensim – TfidfModel.

How it works:
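Term frequency (TF) measures how often a term appears in a document, while inverse document frequency (IDF) down-weights terms that appear in many documents. In its simplest form, tf-idf(t, d) = tf(t, d) × log(N / df(t)), where N is the total number of documents and df(t) is the number of documents containing term t. A minimal sketch with scikit-learn’s TfidfVectorizer (which applies IDF smoothing and normalization by default):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)   # sparse matrix of TF-IDF weights

# For equal raw counts, document-specific words like "cat" end up with
# higher weights than words like "sat" that appear in every document.
print(dict(zip(tfidf.get_feature_names_out(), X.toarray()[0].round(2))))
```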

  • Co-Occurrence Matrix with SVD (Latent Semantic Analysis) – a matrix that records the co-occurrence frequencies of terms in a given corpus, i.e., how often terms appear together within a given context. It can be useful for document similarity and clustering tasks, topic modelling, corpus dimensionality reduction, synonym detection, text summarization, sentiment analysis, etc. NLP libraries that provide this for easy use: scikit-learn – TruncatedSVD; gensim – LsiModel.

How it works:

The co-occurrence matrix records how many times each pair of words appears together. To simplify this huge matrix, SVD (Singular Value Decomposition), a matrix factorization technique, is used for dimensionality reduction. By reducing the dimensions, we group together terms that are contextually related. The process of simplifying the co-occurrence matrix using SVD is called Latent Semantic Analysis (LSA).
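A minimal LSA sketch, assuming scikit-learn and using a toy term-document count matrix (the same TruncatedSVD step applies to a word-word co-occurrence matrix):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stocks and bonds are investments",
]

counts = CountVectorizer().fit_transform(corpus)     # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)   # keep 2 latent dimensions ("topics")
doc_vectors = svd.fit_transform(counts)              # dense low-dimensional document vectors

print(doc_vectors.shape)   # (4, 2)
```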

2. Prediction-based Embeddings – learned by training models (neural networks) to predict words based on their surrounding context. The basic idea is the distributional hypothesis: words that appear in similar contexts tend to have similar meanings. As a result, these models can capture deeper semantic and syntactic relationships and represent how a word is used across different contexts.

  • Word2Vec – one of the most foundational models in NLP. It uses two architectures: Continuous Bag of Words (CBOW), which predicts a target word given its context, and Skip-Gram, which predicts context words given a target word. It captures a wide range of semantic relationships, can be trained on any amount of data (but it requires a lot of data to work well, so it often makes sense to use a pre-trained model), and supports arithmetic operations on word vectors that have semantic implications (for example: king – man + woman ≈ queen). It is used for text classification, information retrieval, text summarization, recommendation systems, language modeling and sequence tagging. NLP libraries: gensim – Word2Vec; spaCy and NLTK don’t have direct implementations of Word2Vec, but support loading and querying Word2Vec vectors trained by gensim or other tools; it can also be trained from scratch with TensorFlow and Keras.
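A minimal Word2Vec sketch with gensim, assuming gensim ≥ 4.0 (the tiny corpus is made up; a real model needs far more data):

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the embeddings
    window=2,         # context window size
    min_count=1,      # keep every word (only sensible for toy data)
    sg=1,             # 1 = Skip-Gram, 0 = CBOW
)

print(model.wv["king"].shape)                 # (50,)
print(model.wv.most_similar("king", topn=3))  # nearest neighbours in the vector space
```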

  • FastText – an extension of Word2Vec, but unlike Word2Vec, which considers whole words, it represents each word as a bag of character n-grams. This allows it to generate embeddings for out-of-vocabulary words (words not seen during training) and offers better representations for morphologically rich languages (like Turkish, Finnish, Russian, Hungarian). For example, the word “reading” might be broken down into n-grams like “rea”, “eadi”, “ading”, and so on, and the word’s vector is then formed by summing the vectors of these n-grams. To use this technique, you can install the fastText library or use gensim’s FastText class.
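A minimal FastText sketch with gensim’s FastText class, assuming gensim ≥ 4.0 (again on toy data, so the vectors are only illustrative):

```python
from gensim.models import FastText

sentences = [
    ["reading", "books", "is", "fun"],
    ["she", "is", "reading", "a", "book"],
]

model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

# Because word vectors are built from character n-grams, even an
# out-of-vocabulary word like "readings" still gets a vector.
print(model.wv["readings"].shape)   # (50,)
```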

  • GloVe (Global Vectors for Word Representation) – an unsupervised learning algorithm that captures global statistical information by constructing a word co-occurrence matrix from the corpus and then factorizing it (as in LSA). While it is built on global statistics (like frequency-based embeddings), the method of learning is predictive: it optimizes the dot products of word vectors to match the (log) co-occurrence counts. Pre-trained GloVe embeddings are available for various corpora, vocabulary sizes and dimensions. If your task doesn’t have a large amount of training data, you can benefit from these pre-trained embeddings: after downloading the pre-trained GloVe vectors, you can load them with gensim (e.g., via glove2word2vec).
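A minimal sketch for loading pre-trained GloVe vectors with gensim’s glove2word2vec converter, assuming the glove.6B.100d.txt file has already been downloaded from the Stanford GloVe page (newer gensim versions can also load the file directly with KeyedVectors.load_word2vec_format(..., no_header=True)):

```python
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

# Add the word2vec-style header so gensim can read the GloVe text file.
glove2word2vec("glove.6B.100d.txt", "glove.6B.100d.w2v.txt")
vectors = KeyedVectors.load_word2vec_format("glove.6B.100d.w2v.txt")

print(vectors["queen"].shape)                # (100,)
print(vectors.most_similar("king", topn=3))  # nearest neighbours
```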

Besides these, there are two newer-generation models that are more dynamic (they produce contextual embeddings that change based on the surrounding text), leverage deeper architectures (LSTMs and Transformers), and benefit from a pre-train/fine-tune paradigm:

  • ELMo (Embeddings from Language Models) – generates embeddings using a bidirectional language model trained on a large corpus. Specifically, it uses bidirectional LSTMs that model the sequence in both directions, left-to-right (past context) and right-to-left (future context), unlike Word2Vec or GloVe, where a word has the same embedding regardless of its context. So the word “times” in “two times” and “New York Times” would have different embeddings.
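A minimal sketch of contextual ELMo embeddings, assuming an older allennlp release (< 2.0) where the ElmoEmbedder helper lives in allennlp.commands.elmo and downloads the default pretrained weights on first use:

```python
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder()

# embed_sentence takes a pre-tokenized sentence and returns an array of
# shape (3 layers, num_tokens, 1024).
vectors_a = elmo.embed_sentence(["I", "read", "it", "two", "times"])
vectors_b = elmo.embed_sentence(["she", "writes", "for", "the", "New", "York", "Times"])

# The top-layer vector for "times"/"Times" differs between the two sentences,
# unlike a static Word2Vec or GloVe vector.
print(vectors_a[2, 4].shape)   # (1024,)
print(vectors_b[2, 6].shape)   # (1024,)
```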

  • BERT (Bidirectional Encoder Representations from Transformers) – based on the Transformer architecture, which uses attention mechanisms to weigh the importance of different words in a sentence when generating representations. Unlike traditional language models that predict the next word in a sequence, BERT is trained using a masked language model approach (MLM – a type of language model where certain words in a sentence are masked, i.e., replaced by a special token, and the objective is to predict the original word given the context), so it tries to predict a masked word in a sentence based on its context. BERT’s embeddings capture information from both the left and right context thanks to its bidirectional nature. While you can obtain word-level embeddings from BERT, the model produces embeddings at various levels: token-level (word/subword), sentence-level, or even paragraph-level.
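A minimal sketch of extracting contextual token embeddings from BERT with the Hugging Face transformers library, assuming transformers and PyTorch are installed (the pretrained weights are downloaded on first use):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("I read it two times", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

token_embeddings = outputs.last_hidden_state       # (1, num_tokens, 768), one vector per (sub)word token
sentence_embedding = token_embeddings.mean(dim=1)  # a simple pooled sentence-level vector

print(token_embeddings.shape)
print(sentence_embedding.shape)   # (1, 768)
```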
