Recently I talked to a handful of fellow students and scholars whose research interests involved the analysis of free-form text. Unfortunately for everyone, gaining meaningful insight into written natural language is not a trivial task by any measure. Close reading is of course an option, but ideally you would also look at textual data through a more macro-analytical, quantitative lens. Not to mention that in the age of big data, close reading is rarely feasible.
By far my favorite way to conduct exploratory data analyses on corpora is with topic models, and I have written multiple articles about how to go about this in the least painful way possible. While topic models are awesome, they are not universally the best method for all things text.
Embeddings are numerical representations of textual data, and have become the canonical approach for semantic querying of text. In this article we will look at some of the ways in which we can explore textual data with the use of embeddings.
Capturing Relations between Concepts with Word Embeddings
Word embedding models are a set of approaches that learn latent vector representations of terms in an unsupervised fashion. When learning word embeddings from natural language, one essentially obtains a map of semantic relations in an embedding space.
Word embeddings are typically trained on large corpora so that they can capture general word-to-word relations in human language. This is useful, because one can infuse general knowledge about language into models for specific applications. This is also known as transfer learning, and has been a hot topic in machine learning for quite some time.
What if, instead of wanting to transfer general knowledge into a specific model, we just want to get a mapping of the semantically specific aspects of a smaller corpus? Let’s say that we have a corpus of comments from a forum and we want to explore what kinds of associative relations can be found in them.
One way we may achieve this is by training a word embedding model from scratch on this corpus instead of using one that has been pretrained for us. In this example I am going to use the 20Newsgroups dataset as the corpus in which we will explore semantic relations.
Train a Model
Now let’s start with a word embedding model. You might be familiar with Word2Vec, which is the method that popularized the use of static word embeddings in research and practice. GloVe, on the other hand, developed by the folks over at Stanford, seems to be a better method under most circumstances, and my anecdotal experience indicates that it gives much higher quality embeddings, especially on smaller corpora.
Unfortunately GloVe is not implemented in Gensim, but luckily I have made a fully Gensim-compatible interface for the original GloVe code, which we are going to use for training the model.
Let’s install gensim, glovpy and scikit-learn (so we can fetch 20Newsgroups), as well as embedding-explorer:
pip install glovpy gensim scikit-learn embedding-explorer
We first have to load the dataset and tokenize it. For this we will use gensim’s built-in tokenizer. We are also going to filter out stop words, as they do not bear any meaningful information for the task at hand.
from gensim.utils import tokenize
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
def clean_tokenize(text: str) -> list[str]:
    """This function tokenizes texts and removes stop words from them"""
    tokens = tokenize(text, lower=True, deacc=True)
    tokens = [token for token in tokens if token not in ENGLISH_STOP_WORDS]
    return tokens
# Loading the dataset
dataset = fetch_20newsgroups(
    remove=("headers", "footers", "quotes"), categories=["sci.med"]
)
newsgroups = dataset.data
# Tokenizing the dataset
tokenized_corpus = [clean_tokenize(text) for text in newsgroups]
After this we can easily train a GloVe model on the tokenized corpus.
from glovpy import GloVe
# Training word embeddings
model = GloVe(vector_size=25)
model.train(tokenized_corpus)
We can already query this word embedding model. Let’s check, for example, which ten words are closest to “child”:
model.wv.most_similar("child", topn=10)
| word | similarity |
| --- | --- |
| age | 0.849304 |
| consistent | 0.844267 |
| adult | 0.805101 |
| range | 0.800615 |
| year | 0.798799 |
| hand | 0.792965 |
| children | 0.792113 |
| use | 0.789804 |
| restraint | 0.773764 |
| belt | 0.77003 |
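To make this kind of query less of a black box, here is a minimal sketch of how “closest words” can be computed: every other word is ranked by its cosine similarity to the query word’s vector. The three-dimensional `toy_vectors` and the helper functions are made up for illustration; they are not taken from the trained GloVe model or from Gensim’s implementation.

```python
import math

# Hypothetical toy embeddings; the GloVe model above would use 25 dimensions.
toy_vectors = {
    "child": [0.9, 0.1, 0.0],
    "children": [0.8, 0.2, 0.1],
    "adult": [0.7, 0.3, 0.0],
    "belt": [0.1, 0.9, 0.2],
    "virus": [0.0, 0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(
    word: str, vectors: dict[str, list[float]], topn: int = 10
) -> list[tuple[str, float]]:
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    # Highest similarity first
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


neighbours = most_similar("child", toy_vectors, topn=3)
for neighbour, similarity in neighbours:
    print(f"{neighbour}: {similarity:.3f}")
```

A real embedding library does the same ranking with vectorized operations over the whole vocabulary, which is why such queries stay fast even with tens of thousands of words.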
Individually investigating every word’s relation to other words becomes tedious very quickly though. Ideally we would also like to visualize relations, maybe even get some networks.
Luckily, the embedding-explorer package, which I have developed, can help us here. Working in computational humanities we often make use of word embedding models and the semantic networks built from relations in those models, and embedding-explorer helps us explore these in an interactive and visual manner. The package contains multiple interactive web applications; we will first look at the “network explorer”.
The idea behind this app is that concepts in embedding models naturally form some sort of network structure. Words that are closely related have strong links, while others might not have any. In the app you can build concept graphs based on a set of seed words you specify and two levels of free association.
At each level of association we take the five closest words in the embedding model to the words we already have, and we add them to our network, each with a connection to the word it was associated to. The strength of the connection is determined by the cosine distance of concepts in embedding space. These kinds of networks have proven useful for multiple research projects my colleagues and I have worked on.
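The expansion procedure can be sketched in a few lines of plain Python. This is only an illustration of the idea, not embedding-explorer’s actual implementation: the `build_network` function, the made-up two-dimensional vectors and the choice to store cosine similarity as the edge weight are all assumptions for the example.

```python
import math

# Made-up 2D vectors for illustration; a trained model would supply real ones.
toy_vectors = {
    "science": [1.0, 0.0],
    "physics": [0.9, 0.1],
    "education": [0.6, 0.8],
    "religion": [0.0, 1.0],
    "faith": [0.1, 0.9],
}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def closest_words(word: str, vectors: dict, n_closest: int) -> list[tuple[str, float]]:
    """Return the n_closest most similar words to `word` with their similarities."""
    scored = [
        (other, cosine_similarity(vectors[word], vector))
        for other, vector in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n_closest]


def build_network(seeds: list[str], vectors: dict, n_closest: int = 5, n_levels: int = 2):
    """Expand the seed words level by level and collect weighted edges."""
    nodes = set(seeds)
    edges = {}  # (source, target) -> cosine similarity
    frontier = list(seeds)
    for _ in range(n_levels):
        next_frontier = []
        for word in frontier:
            for neighbour, similarity in closest_words(word, vectors, n_closest):
                edges[(word, neighbour)] = similarity
                if neighbour not in nodes:
                    nodes.add(neighbour)
                    next_frontier.append(neighbour)
        # Only newly discovered words need expanding at the next level
        frontier = next_frontier
    return nodes, edges


# The toy vocabulary is tiny, so we take two neighbours per word instead of five.
nodes, edges = build_network(["science"], toy_vectors, n_closest=2, n_levels=2)
for (source, target), weight in edges.items():
    print(f"{source} -> {target}: {weight:.3f}")
```

Starting from a single seed, the second level of association is what lets words that are not directly close to the seed enter the graph through an intermediate concept.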
Let’s start up the app on our word embedding model.
from embedding_explorer import show_network_explorer
vocabulary = model.wv.index_to_key
embeddings = model.wv.vectors
show_network_explorer(vocabulary, embeddings=embeddings)
This will open a browser window where you can freely explore the semantic relations in the corpus. Here is a screenshot of me looking at what networks arise around the words “jesus”, “science” and “religion”.
Exploring Semantic Relations in our GloVe Model
We can see, for example, that the way people talk about these subjects online suggests that religion and science relate through politics, society and philosophy, which makes a lot of sense. It is also interesting to observe how education is somewhere midway between science and religion, but is clearly more connected to science. This would be interesting to explore in more detail.
Networks of N-grams with Sentence Transformers
Now what if we want to look not only at word-level relations, but also at phrases or sentences?
My suggestion is to use N-grams. N-grams are essentially just N terms that follow each other in text. For example in the sentence “I love my little cute dog” we would have 4-grams: “I love my little”, “love my little cute” and “my little cute dog”. Now the question is, how do we learn good semantic representations of N-grams?
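As a quick check of the example, here is a minimal sketch of N-gram extraction. The `extract_ngrams` helper is a name I made up for illustration, and it splits on whitespace only, which is much cruder than what a proper tokenizer does.

```python
def extract_ngrams(text: str, n: int) -> list[str]:
    """Return every run of n consecutive whitespace-separated tokens."""
    tokens = text.split()
    # A text with fewer than n tokens simply yields no N-grams
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


example_four_grams = extract_ngrams("I love my little cute dog", n=4)
for four_gram in example_four_grams:
    print(four_gram)
```

A sentence of six words gives three 4-grams, because the sliding window can start at only three positions.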
You could technically still do this with GloVe by treating a phrase or sentence as a token, but there is a catch. Since the variety of N-grams increases drastically with N, a particular N-gram might only occur once or twice, and we might not be able to learn a good representation of it.
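The sparsity problem is easy to see even on a toy example. The sketch below counts, for each N, what share of distinct N-grams occur exactly once. The sentence is made up and `singleton_share` is a hypothetical helper, so treat the numbers as an illustration rather than a property of the 20Newsgroups corpus.

```python
from collections import Counter


def singleton_share(tokens: list[str], n: int) -> float:
    """Share of distinct n-grams that occur exactly once in the token list."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    singletons = sum(1 for count in counts.values() if count == 1)
    return singletons / len(counts)


# Made-up sentence with some repeated words
tokens = (
    "the patient took the drug and the patient felt better after the drug"
).split()

shares = {n: singleton_share(tokens, n) for n in range(1, 5)}
for n, share in shares.items():
    print(f"{n}-grams seen only once: {share:.0%}")
```

Already at N=3 every N-gram in this sentence is unique, which leaves a co-occurrence based model like GloVe with almost nothing to learn from.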
How about taking the mean of word embeddings in the phrase? Well this could go a long way, but the problem is that we completely lose all information about the importance of different words, their order in the sentence and all contextual information as well.
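To see why word order is lost, consider this small sketch of mean pooling. The vectors are invented for the example and `mean_embedding` is a hypothetical helper, not part of any library used in this article.

```python
# Invented 2D word vectors, for illustration only
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [1.0, 1.0],
}


def mean_embedding(phrase: str, vectors: dict[str, list[float]]) -> list[float]:
    """Average the word vectors of a phrase, dimension by dimension."""
    tokens = phrase.split()
    n_dims = len(next(iter(vectors.values())))
    return [
        sum(vectors[token][dim] for token in tokens) / len(tokens)
        for dim in range(n_dims)
    ]


first = mean_embedding("dog bites man", toy_vectors)
second = mean_embedding("man bites dog", toy_vectors)
# Both phrases contain the same words, so their averages coincide
print(first == second)
```

The two phrases mean very different things, yet they end up at exactly the same point in embedding space, since averaging ignores the order of its inputs.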
The solution to this issue is to use sentence transformers, deep neural language models that produce contextually sensitive representations of text. They have outperformed all other approaches for a few years now, and have become the industry standard for embedding text. Training such a model takes a lot of data, which we do not have at hand, but luckily we can use a handful of good pretrained models.
Let us first extract N-grams from our corpus. I chose to go with four-grams, but you can choose any number you would like. We are going to use scikit-learn’s CountVectorizer for doing this.
from sklearn.feature_extraction.text import CountVectorizer
# First we train a model on the corpus that learns all 4-grams
# We will only take the 4000 most frequent ones into account for now,
# But you can freely experiment with this
feature_extractor = CountVectorizer(ngram_range=(4, 4), max_features=4000)
feature_extractor.fit(newsgroups)
# Then we get the vectorizer’s vocabulary
four_grams = feature_extractor.get_feature_names_out()
We will need an embedding model for text representation. As I said earlier, we are going to use a pretrained model. I chose all-MiniLM-L6-v2 as it is very stable, widely used and is quite small, so it will probably run smoothly even on your personal computer.
We will use yet another package, embetter, so that we can use sentence transformers in a scikit-learn compatible manner.
pip install embetter[text]
We can load the model in Python like this:
from embetter.text import SentenceEncoder
encoder = SentenceEncoder("all-MiniLM-L6-v2")
We can then load the model and the n-grams into embedding-explorer.
from embedding_explorer import show_network_explorer
show_network_explorer(four_grams, vectorizer=encoder)
Note that this allows us to specify any arbitrary seed instead of just the ones that are in our vocabulary of four-grams. Here’s a screenshot of me putting in two sentences and seeing what kinds of networks are built from the four-grams around them.
Exploring Phrases and Sentences in the Corpus
Interesting to observe yet again which phrases lie in the middle. It looks like law and history serve as sort of a connection between religion and science here.
Investigating Corpus-Level Semantic Structure with Document Embeddings
We have now looked at our corpus on the word and phrase level, and seen the semantic structures that naturally arise in them.
What if we wanted to gain some information about what happens on the level of documents? What documents lie close to each other, and what kinds of groups show up?
Note that one natural solution to this problem is topic modeling, which you should look into if you haven’t yet. In this article we will explore other, tangentially related conceptualizations of this task.
As before, we need to think about how we are going to represent individual documents so that their semantic content gets captured.
More traditional machine learning practice would typically use Bag-of-Words representations or would train a Doc2Vec model. These are all good options (and you could and should experiment with them), but, again, they lack contextual understanding of text. Since texts in our corpus are not too long, we can still use sentence transformers for embedding them. Let’s continue with the same embedding model we used for phrases.
Projection and Clustering
A natural way to explore semantic representations of documents is to project them into lower dimensional spaces (usually 2D) and use these projections for visualizing the documents. We can also look at how documents get clustered given some clustering approach.
Now this is all great, but the space of projection, dimensionality reduction and clustering approaches is so vast, that I constantly found myself wondering: “Would this look substantially different if I had used something else?” To counteract this issue I added another app to embedding-explorer, where you can freely and quickly explore what kinds of visualizations you would get out of all sorts of different methods.
Here’s our workflow:
1. We may or may not want to reduce the dimensionality of the embeddings before we proceed. You can choose from all sorts of dimensionality reduction methods, or you can turn it off.
2. We want to project our embeddings into 2D space so we can visualize them.
3. We might want to cluster the embeddings to see what kinds of documents get grouped together.
Clustering and Projection Workflow in embedding-explorer
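Of the three steps, clustering is the easiest to illustrate without extra dependencies. Below is a bare-bones sketch of k-means on points that we pretend have already been projected to 2D. The points, the `kmeans` function and the choice of the first k points as initial centroids are assumptions for the example; in practice you would rely on a library implementation, and the app lets you pick among several clustering methods.

```python
import math


def kmeans(points: list[tuple[float, float]], k: int, n_iter: int = 10):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(point) for point in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k), key=lambda c: math.dist(point, centroids[c]))
            for point in points
        ]
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Hypothetical 2D projections of six documents forming two obvious groups
projected = [(0.0, 0.0), (5.0, 5.0), (0.0, 1.0), (1.0, 0.0), (5.0, 6.0), (6.0, 5.0)]
labels, centroids = kmeans(projected, k=2)
print(labels)
```

On real document embeddings the groups are rarely this clean, and the result depends on the initialization and on the number of clusters, which is exactly why it helps to try several settings interactively.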
We also need some outside information about the documents when we do this (textual content, title, etc.), otherwise there isn’t much for us to interpret.
Let’s create a data frame with columns that contain:
1. The first 400 characters of each document, so we can get a feel for what the text is about.
2. The length of the text, so we can see which texts are long and which ones are short in the visualizations.
3. The group they come from in our data set.
import pandas as pd
import numpy as np
# Extracting text lengths in number of characters.
lengths = [len(text) for text in newsgroups]
# Extracting first 400 characters from each text.
text_starts = [text[:400] for text in newsgroups]
# Extracting the group each text belongs to
# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = np.array(dataset.target_names)[dataset.target]
# We build a dataframe with the available metadata
metadata = pd.DataFrame(dict(length=lengths, text=text_starts, group=group_labels))
We can then start the application with the metadata passed along so we can hover and look at information about the documents.
from embedding_explorer import show_clustering
show_clustering(
    corpus=newsgroups,
    vectorizer=encoder,
    metadata=metadata,
    hover_name="group",  # Title of hover box is going to be the group
    hover_data=["text", "length"],  # We would also like to see these on hover
)
When the app first launches, you’ll be presented with this screen:
Options in the Clustering App
After running the clustering you will be able to look at a map of all documents colored by cluster. You can hover on points to see metadata about the document…
Clustering App Screenshot
and at the bottom you can even choose how the points should be colored, labelled and sized.
Clusters with Document Sizes
Exploratory analysis of textual data is difficult. We have looked at a handful of approaches for interactive investigation using state-of-the-art machine learning technology. I hope the methods discussed in this article and the embedding-explorer Python package will be useful for you in your future research/work.
(All images in the article were taken from embedding-explorer’s documentation, which was produced by the author.)
Explore Semantic Relations in Corpora with Embedding Models was originally published in Towards Data Science on Medium.