Tutorial 19: Text analysis with gensim

Last week we saw how to use the XML module to access and clean the textual data given in the MediaWiki JSON file. Today, we'll see how to use the gensim module to actually process the text itself. Eventually you will do all of this within the wikitext.py module and will not need to call gensim directly, but it will be helpful to understand how it works for some of the later projects.

Start by loading a few functions from the module:

In [1]:
from gensim import corpora, matutils, models, similarities
from gensim.similarities.docsim import MatrixSimilarity

In this tutorial we will work with a small set of text documents rather than the longer Wikipedia example. This will make it easier to understand exactly what is going on. Here are the documents we'll use:

In [2]:
documents = ["Human machine interface for lab human computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",              
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Tokenization

The first step in most text processing tasks is to split the raw data into individual words. We have already seen how to do this with regular expressions by detecting sequences of word characters. Here we also convert the string to lower case:

In [3]:
import re
re.findall('(\\w+)', documents[0].lower())
Out[3]:
['human',
 'machine',
 'interface',
 'for',
 'lab',
 'human',
 'computer',
 'applications']

We can create a list of the words in every document by wrapping this up in a for loop:

In [4]:
word_list = []
for doc in documents:
    word_list.append(re.findall('(\\w+)', doc.lower()))
    
word_list
Out[4]:
[['human',
  'machine',
  'interface',
  'for',
  'lab',
  'human',
  'computer',
  'applications'],
 ['a',
  'survey',
  'of',
  'user',
  'opinion',
  'of',
  'computer',
  'system',
  'response',
  'time'],
 ['the', 'eps', 'user', 'interface', 'management', 'system'],
 ['system', 'and', 'human', 'system', 'engineering', 'testing', 'of', 'eps'],
 ['relation',
  'of',
  'user',
  'perceived',
  'response',
  'time',
  'to',
  'error',
  'measurement'],
 ['the', 'generation', 'of', 'random', 'binary', 'unordered', 'trees'],
 ['the', 'intersection', 'graph', 'of', 'paths', 'in', 'trees'],
 ['graph',
  'minors',
  'iv',
  'widths',
  'of',
  'trees',
  'and',
  'well',
  'quasi',
  'ordering'],
 ['graph', 'minors', 'a', 'survey']]

We have already seen how these tokenized lists, over longer documents, are useful in detecting the most frequent terms in a given document. Now, we'll see how to do much more with these words.
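
For instance, here is a minimal sketch using Python's standard collections.Counter to find the most frequent tokens in the first document (the counts here are tiny, but the same idea scales to full articles):

from collections import Counter

# Count how often each token appears in the first document;
# 'human' should come out on top since it is the only repeated word
counts = Counter(word_list[0])
counts.most_common(3)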

Lexicon

A lexicon, in linguistics proper, refers to all of the terms that a particular person has at their disposal. In the context of text processing, a lexicon refers to all of the terms across our corpus (collection of documents) that we want to consider. To construct a lexicon using gensim, we call the Dictionary object:

In [5]:
lexicon = corpora.Dictionary(word_list)

One attribute of the lexicon is a mapping from each term to a numeric id. We can see this with:

In [6]:
lexicon.token2id
Out[6]:
{'applications': 0,
 'computer': 1,
 'for': 2,
 'human': 3,
 'interface': 4,
 'lab': 5,
 'machine': 6,
 'a': 7,
 'of': 8,
 'opinion': 9,
 'response': 10,
 'survey': 11,
 'system': 12,
 'time': 13,
 'user': 14,
 'eps': 15,
 'management': 16,
 'the': 17,
 'and': 18,
 'engineering': 19,
 'testing': 20,
 'error': 21,
 'measurement': 22,
 'perceived': 23,
 'relation': 24,
 'to': 25,
 'binary': 26,
 'generation': 27,
 'random': 28,
 'trees': 29,
 'unordered': 30,
 'graph': 31,
 'in': 32,
 'intersection': 33,
 'paths': 34,
 'iv': 35,
 'minors': 36,
 'ordering': 37,
 'quasi': 38,
 'well': 39,
 'widths': 40}

We can look up a particular id by using the token2id attribute as a dictionary:

In [7]:
lexicon.token2id['human']
Out[7]:
3

To go in the other direction, index the lexicon itself with an id:

In [8]:
lexicon[3]
Out[8]:
'human'

Numeric representation of corpus

Consider the first document in our corpus:

In [9]:
print(word_list[0])
['human', 'machine', 'interface', 'for', 'lab', 'human', 'computer', 'applications']

Using our lexicon object it is possible to represent this document as a list of integer ids. We could do this through a complex double for loop, but gensim provides the doc2idx method to make this easy:

In [10]:
lexicon.doc2idx(word_list[0])
Out[10]:
[3, 6, 4, 2, 5, 3, 1, 0]
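
For reference, here is a minimal sketch of the equivalent manual lookup using the token2id mapping directly (doc2idx additionally handles words that are missing from the lexicon, which this sketch ignores):

# Map each token of the first document to its integer id by hand
[lexicon.token2id[w] for w in word_list[0]]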

Why is this numeric representation useful? For one thing, it takes up considerably less space (at least if we have a large corpus). It is also easier to program with integer ids. This specific representation is useful, as well, if you want to use deep learning models for language. We may have a chance to see these towards the end of the semester.

Bag of words

The numeric representation of the terms above does not lose any information from the original document. A bag of words representation, instead, removes all of the word order information in a document. It simply counts how often each term in the lexicon occurs within the document. This representation will be essential for nearly all of the methods we will see for processing text.

The method doc2bow of the lexicon converts a list of words into a bag of words. The bag of words is given as a list of (word id, count) tuples. Here we see, for example, that the first document uses word number 3 twice:

In [11]:
lexicon.doc2bow(word_list[0])
Out[11]:
[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)]

We can cycle through the entire set of documents to create the complete bag of words representation of the corpus.

In [12]:
bow = []
for t in word_list:
    bow.append(lexicon.doc2bow(t))
    
bow
Out[12]:
[[(0, 1), (1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)],
 [(1, 1), (7, 1), (8, 2), (9, 1), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1)],
 [(4, 1), (12, 1), (14, 1), (15, 1), (16, 1), (17, 1)],
 [(3, 1), (8, 1), (12, 2), (15, 1), (18, 1), (19, 1), (20, 1)],
 [(8, 1),
  (10, 1),
  (13, 1),
  (14, 1),
  (21, 1),
  (22, 1),
  (23, 1),
  (24, 1),
  (25, 1)],
 [(8, 1), (17, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1)],
 [(8, 1), (17, 1), (29, 1), (31, 1), (32, 1), (33, 1), (34, 1)],
 [(8, 1),
  (18, 1),
  (29, 1),
  (31, 1),
  (35, 1),
  (36, 1),
  (37, 1),
  (38, 1),
  (39, 1),
  (40, 1)],
 [(7, 1), (11, 1), (31, 1), (36, 1)]]

Term frequency matrix

A term frequency matrix is an equivalent representation of a bag of words as a matrix, a rectangular table of numeric values. The table has one row for each term in the lexicon and one column for each document (I will also call a matrix with terms in the columns and documents in the rows a term frequency matrix; context should make it clear which one we are working with). The method corpus2dense produces such a matrix for us:

In [13]:
tf_array = matutils.corpus2dense(bow, num_terms=len(lexicon.token2id))
tf_array
Out[13]:
array([[1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [2., 0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 1.],
       [0., 2., 0., 1., 1., 1., 1., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 1.],
       [0., 1., 1., 2., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 0., 0., 0.],
       [0., 1., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 1., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 1., 1., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 1., 1., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 1., 1.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0.]], dtype=float32)
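
Rows are indexed by term id and columns by document, so individual counts can be read off directly; here is a quick sketch confirming that "human" appears twice in the first document:

# Row = term id, column = document index; expect 2.0 from the bag of words above
tf_array[lexicon.token2id['human'], 0]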

You will notice that this matrix contains a lot of zero elements. The preponderance of zeros only gets worse with larger corpora. We can avoid these by instead creating a sparse matrix with corpus2csc:

In [14]:
tf_sparse_array = matutils.corpus2csc(bow)
tf_sparse_array
Out[14]:
<41x9 sparse matrix of type '<class 'numpy.float64'>'
	with 66 stored elements in Compressed Sparse Column format>

The sparse array can (mostly) be manipulated just like the dense version, but saves a lot of space and (sometimes) is much faster to operate on. We won't need either of these term frequency matrices directly until we start building predictive models in a few weeks, but they are a nice way of thinking about the bag of words model.
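
As a quick sanity check, the two representations hold the same counts; here is a minimal sketch using scipy's standard toarray method:

import numpy as np

# Expanding the sparse matrix gives back the same values as the dense one
np.allclose(tf_sparse_array.toarray(), tf_array)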

Term frequency inverse document frequency

If we take the term frequency matrix and determine the largest values in each column, we will get the most frequent terms within each document. We've seen that this can be somewhat useful for understanding the content of a Wikipedia page. However, we have also seen that many of the terms are not very interesting but are instead commonly used across the entire corpus (either grammatical words like "the", "and", and "because", or words that would appear in anything related to our topic, such as "cake" or "Virginia").
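
For example, here is a minimal sketch that picks the largest entry in each column of the term frequency matrix (ties go to the lowest term id); for most of our documents this surfaces a function word such as "of" rather than anything informative:

import numpy as np

# Most frequent term id in each document (column), mapped back to words
top_ids = np.argmax(tf_array, axis=0)
[lexicon[int(i)] for i in top_ids]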

A solution is to also weight each term as a function of how often it occurs in all documents. There are several ways to do this; I'll explain just the most common technique. Let $f_{t, d}$ count the number of times that term $t$ occurs in document $d$, $n_t$ the number of documents that use term $t$ at least once, and $D$ be the total number of documents. Then, for each term $t$ and document $d$ we have the tf-idf (term frequency inverse document frequency) of:

$$ \text{tfidf}(t, d) = f_{t, d} \times \log_2 \left( \frac{D}{n_t} \right) $$

So, the score will be higher if the term is used more frequently in a document but lower if the term is used in more documents. The idea is that terms with the highest tfidf score for a given document are the most distinguishing ones for that particular document.
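
As a quick check on the formula (a sketch assuming gensim's defaults: a base-2 logarithm and length-normalized document vectors), we can compute the raw scores for two terms of document 1 by hand. Only the ratio between terms survives the normalization, and it should match the ratio of the corresponding values in the gensim output below:

import math

D = 9                                 # total number of documents
raw_opinion  = 1 * math.log2(D / 1)   # 'opinion' occurs once here, in 1 document overall
raw_computer = 1 * math.log2(D / 2)   # 'computer' occurs once here, in 2 documents overall
raw_opinion, raw_computer, raw_opinion / raw_computer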

We can build a tfidf model using the TfidfModel class:

In [15]:
tfidf = models.TfidfModel(bow)

The model can then be applied to any particular document of interest:

In [16]:
tfidf[bow[1]]
Out[16]:
[(1, 0.3431629039936792),
 (7, 0.3431629039936792),
 (8, 0.1850178511620907),
 (9, 0.5013079568252677),
 (10, 0.3431629039936792),
 (11, 0.3431629039936792),
 (12, 0.25065397841263387),
 (13, 0.3431629039936792),
 (14, 0.25065397841263387)]

The most relevant term in document 1 is then term number 9, even though it is used only once:

In [17]:
lexicon[9]
Out[17]:
'opinion'

We can visually determine the most interesting term for a single document, but it is useful to automatically sort the list when working with it in code.

In [18]:
tf_obj = tfidf[bow[1]]
sorted(tf_obj, key=lambda x: x[1], reverse=True)[:5]
Out[18]:
[(9, 0.5013079568252677),
 (1, 0.3431629039936792),
 (7, 0.3431629039936792),
 (10, 0.3431629039936792),
 (11, 0.3431629039936792)]

And now we see the top five terms for this particular document:

In [19]:
n_terms = 5

top_terms = []
for obj in sorted(tf_obj, key=lambda x: x[1], reverse=True)[:n_terms]:
    top_terms.append("{0:s} ({1:01.03f})".format(lexicon[obj[0]], obj[1]))

print(top_terms)
['opinion (0.501)', 'computer (0.343)', 'a (0.343)', 'response (0.343)', 'survey (0.343)']

It is also possible to represent the TF-IDF object as a matrix, as shown by the following code:

In [20]:
tfidf_corpus = []
for doc in bow:
    tfidf_corpus.append(tfidf[doc])

tfidf_mat = matutils.corpus2dense(tfidf_corpus, num_terms=len(lexicon.token2id))
tfidf_mat[:40, :4]
Out[20]:
array([[0.3831578 , 0.        , 0.        , 0.        ],
       [0.26228496, 0.3431629 , 0.        , 0.        ],
       [0.3831578 , 0.        , 0.        , 0.        ],
       [0.5245699 , 0.        , 0.        , 0.32487264],
       [0.26228496, 0.        , 0.41758764, 0.        ],
       [0.3831578 , 0.        , 0.        , 0.        ],
       [0.3831578 , 0.        , 0.        , 0.        ],
       [0.        , 0.3431629 , 0.        , 0.        ],
       [0.        , 0.18501785, 0.        , 0.08757829],
       [0.        , 0.50130796, 0.        , 0.        ],
       [0.        , 0.3431629 , 0.        , 0.        ],
       [0.        , 0.3431629 , 0.        , 0.        ],
       [0.        , 0.25065398, 0.30501547, 0.47458872],
       [0.        , 0.3431629 , 0.        , 0.        ],
       [0.        , 0.25065398, 0.30501547, 0.        ],
       [0.        , 0.        , 0.41758764, 0.32487264],
       [0.        , 0.        , 0.61003095, 0.        ],
       [0.        , 0.        , 0.30501547, 0.        ],
       [0.        , 0.        , 0.        , 0.32487264],
       [0.        , 0.        , 0.        , 0.47458872],
       [0.        , 0.        , 0.        , 0.47458872],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        ]], dtype=float32)

Now the values are weighted according to how frequent the word is across the corpus.

Similarity

Consider each column of the term frequency or TF-IDF matrix. This sequence of numbers describes the terms used in a given document. In other words, we have projected each document into a high-dimensional space, with one dimension for each term in our lexicon. Now that our documents are just points in space, there are a lot of mathy things we can do with them. One example is determining which documents are similar to which other documents. We could just compute the (Euclidean) distance between two documents, but there is another metric that is popular in text analysis called cosine similarity. It measures the angle between two documents; the details are not important here, but what matters is that cosine similarity is not sensitive to how long a particular document is.
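
Concretely, the cosine similarity of two documents is the dot product of their vectors divided by the product of their lengths. Here is a minimal sketch computing it by hand for documents 1 and 8 from the TF-IDF matrix above; the result should match the gensim score we get below:

import numpy as np

# Cosine similarity between documents 1 and 8, using their TF-IDF columns
a, b = tfidf_mat[:, 1], tfidf_mat[:, 8]
float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))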

We can apply cosine similarity to either the TF matrix or the TF-IDF matrix. The latter is generally recommended. Here, we use the MatrixSimilarity class to compute a similarity score for our corpus.

In [21]:
matsim = MatrixSimilarity(tfidf_corpus, num_features=len(lexicon))

As with the TF-IDF class, we need to apply this similarity score to a particular document. Here we apply it to document 1:

In [22]:
matsim[tfidf_corpus[1]]
Out[22]:
array([0.09000647, 0.99999994, 0.15290669, 0.1351611 , 0.25229472,
       0.01603428, 0.01755149, 0.01342144, 0.3651125 ], dtype=float32)

A cosine similarity score of 1 means that two documents use terms in exactly the same proportions; notice that document 1 has a score of $0.99999994$ with itself due to rounding errors. The next most similar document is document 8, probably because both use the rare term "survey":

In [23]:
print("Document 1: '" + documents[1] + "'")
print("Document 8: '" + documents[8] + "'")
Document 1: 'A survey of user opinion of computer system response time'
Document 8: 'Graph minors A survey'

As with the term frequency measurement, we could automate detecting the closest document and use this to explore a corpus of text. And again, we could think of these similarities as a matrix, here a square one with one row and column for each document:

In [24]:
import numpy as np
np.round(matsim[tfidf_corpus], 3)
Out[24]:
array([[1.   , 0.09 , 0.11 , 0.17 , 0.   , 0.   , 0.   , 0.   , 0.   ],
       [0.09 , 1.   , 0.153, 0.135, 0.252, 0.016, 0.018, 0.013, 0.365],
       [0.11 , 0.153, 1.   , 0.28 , 0.061, 0.072, 0.078, 0.   , 0.   ],
       [0.17 , 0.135, 0.28 , 1.   , 0.006, 0.008, 0.008, 0.094, 0.   ],
       [0.   , 0.252, 0.061, 0.006, 1.   , 0.006, 0.007, 0.005, 0.   ],
       [0.   , 0.016, 0.072, 0.008, 0.006, 1.   , 0.129, 0.052, 0.   ],
       [0.   , 0.018, 0.078, 0.008, 0.007, 0.129, 1.   , 0.108, 0.1  ],
       [0.   , 0.013, 0.   , 0.094, 0.005, 0.052, 0.108, 1.   , 0.22 ],
       [0.   , 0.365, 0.   , 0.   , 0.   , 0.   , 0.1  , 0.22 , 1.   ]],
      dtype=float32)

Notice that, when rounded, the diagonal elements are all one. Why is this?
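
As mentioned above, we could automate finding the closest document; here is a minimal sketch that sorts document 1's similarity scores and skips the document itself:

import numpy as np

sims = matsim[tfidf_corpus[1]]            # similarities of document 1 to every document
order = np.argsort(sims)[::-1]            # indices from most to least similar
[int(i) for i in order if i != 1][:3]     # closest documents; 8 should come first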

Clustering

Another task we can do, given that our documents are projected into a high-dimensional numeric space, is to cluster the documents according to the words they use. We have already seen how to do this with the network data; we are now doing the same thing with the words themselves.

Due to something called the curse of dimensionality, applying classical clustering techniques to our data will not work very well: there are too many dimensions (the number of words in our lexicon) to reasonably cluster the documents. One clustering technique that does work well in this setting operates directly on a matrix of similarity scores; it is called spectral clustering, and we can apply it as follows:

In [25]:
from sklearn.cluster import SpectralClustering


scmodel = SpectralClustering(n_clusters=3, affinity='precomputed')
similarity_matrix = matsim[tfidf_corpus]
scmodel.fit_predict(similarity_matrix)
Out[25]:
array([0, 1, 0, 0, 1, 2, 2, 2, 1], dtype=int32)

The spectral clustering splits our documents into three groups and returns the id of each group. If you want to know more about the clustering algorithm itself, I'll put some notes on the main course site. It's hard to explain the technique unless you've had linear algebra.
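
To see which documents ended up together, here is a short sketch that pairs each document with the cluster label assigned above (stored by scikit-learn in the labels_ attribute):

# Print each document next to its cluster id
for cid, doc in zip(scmodel.labels_, documents):
    print(cid, doc)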

Topic Models

A topic model is a probabilistic technique for understanding topics that occur within a corpus. Here, topics are understood as groups of words that tend to co-occur within documents. For example, the words 'flour', 'oil', 'sugar', and 'oven' may all group together to form a topic about baking. By far the most popular technique for detecting topics is an approach called Latent Dirichlet Allocation, or LDA. Here we will use gensim to apply the model to our corpus:

In [26]:
from gensim.models import LdaModel
lda = LdaModel(bow, id2word=lexicon, num_topics=3, alpha='auto', iterations=50)

Here are the most highly weighted words for each of the three topics (this corpus is far too small for the output to really make much sense):

In [27]:
lda.show_topics(num_words=3)
Out[27]:
[(0, '0.091*"of" + 0.063*"response" + 0.063*"user"'),
 (1, '0.081*"trees" + 0.079*"of" + 0.077*"the"'),
 (2, '0.070*"system" + 0.070*"human" + 0.055*"of"')]

Likewise, we can see how much of each document is associated with each topic:

In [28]:
list(lda.get_document_topics(bow))
Out[28]:
[[(0, 0.028455393), (1, 0.022254106), (2, 0.94929045)],
 [(0, 0.94867134), (1, 0.018175922), (2, 0.033152707)],
 [(0, 0.037746586), (1, 0.029111285), (2, 0.9331421)],
 [(0, 0.028722594), (1, 0.02233154), (2, 0.9489459)],
 [(0, 0.9456192), (1, 0.02000756), (2, 0.034373183)],
 [(0, 0.032131664), (1, 0.9236802), (2, 0.044188153)],
 [(0, 0.032311905), (1, 0.9232552), (2, 0.04443285)],
 [(0, 0.023549015), (1, 0.01839986), (2, 0.95805115)],
 [(0, 0.8843447), (1, 0.041269775), (2, 0.07438552)]]

I have also added a reference on the course website about LDA that does not require understanding all of the underlying probability theory.

Stopwords and lexicon creation

Our final topic for this tutorial is how to manually reduce the number of terms in our lexicon. There are two reasons we may want to remove a term from the lexicon:

  1. Terms that occur too frequently, or are function words that serve a primarily grammatical purpose, may distort our results, particularly in the topic models.
  2. There will be a large number of terms that occur in only a very small number of documents. These are generally not very interesting (misspelled words or proper names) and can be removed to save time and space. They may also cause issues in TF-IDF because they become too heavily weighted.

There is an easy method filter_extremes that removes terms based on how frequently they are used in the corpus. Here we keep only those terms that are used in at least two documents and in no more than 70% of the documents.

In [29]:
print(len(lexicon))
lexicon.filter_extremes(no_below=2, no_above=0.7)
print(len(lexicon))
41
16

The lexicon now has only 16 terms from the original 41. We can also use a list of pre-defined terms that we want to remove, known as stopwords. Here is a common list of English terms that we will make use of this semester:

In [30]:
with open('ranksnl_large.txt', 'r') as fin:
    sw_list = fin.read().splitlines()
    
print(sw_list)
["'ll", "'ve", 'a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', "can't", 'cannot', 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', 'contain', 'containing', 'contains', 'could', 'couldnt', 'd', 'date', 'did', "didn't", 'different', 'do', 'does', "doesn't", 'doing', "don't", 'done', 'down', 'downwards', 'due', 'during', 'e', 'each', 'ed', 'edu', 'effect', 'eg', 'eight', 'eighty', 'either', 'else', 'elsewhere', 'end', 'ending', 'enough', 'especially', 'et', 'et-al', 'etc', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'except', 'f', 'far', 'few', 'ff', 'fifth', 'first', 'five', 'fix', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'found', 'four', 'from', 'further', 'furthermore', 'g', 'gave', 'get', 'gets', 'getting', 'give', 'given', 'gives', 'giving', 'go', 'goes', 'gone', 'got', 'gotten', 'h', 'had', 'happens', 'hardly', 'has', "hasn't", 'have', "haven't", 'having', 'he', 'hed', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'hereupon', 'hers', 'herself', 'hes', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'home', 'how', 'howbeit', 'however', 'hundred', 'i', "i'll", "i've", 'id', 'ie', 'if', 'im', 'immediate', 'immediately', 'importance', 'important', 'in', 'inc', 'indeed', 'index', 'information', 'instead', 'into', 'invention', 'inward', 'is', "isn't", 'it', "it'll", 'itd', 'its', 'itself', 'j', 'just', 'k', 'keep', 'keeps', 'kept', 'kg', 'km', 'know', 'known', 'knows', 'l', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'least', 'less', 'lest', 'let', 'lets', 'like', 'liked', 'likely', 'line', 'little', 'look', 'looking', 'looks', 'ltd', 'm', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'million', 'miss', 'ml', 'more', 'moreover', 'most', 'mostly', 'mr', 'mrs', 'much', 'mug', 'must', 'my', 'myself', 'n', 'na', 'name', 'namely', 'nay', 'nd', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'nine', 'ninety', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'now', 'nowhere', 'o', 'obtain', 'obtained', 'obviously', 'of', 'off', 'often', 'oh', 'ok', 'okay', 'old', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'or', 'ord', 'other', 'others', 'otherwise', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'owing', 'own', 'p', 'page', 'pages', 'part', 'particular', 'particularly', 'past', 'per', 'perhaps', 'placed', 'please', 'plus', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 
'predominantly', 'present', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'put', 'q', 'que', 'quickly', 'quite', 'qv', 'r', 'ran', 'rather', 'rd', 're', 'readily', 'really', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'respectively', 'resulted', 'resulting', 'results', 'right', 'run', 's', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sec', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sent', 'seven', 'several', 'shall', 'she', "she'll", 'shed', 'shes', 'should', "shouldn't", 'show', 'showed', 'shown', 'showns', 'shows', 'significant', 'significantly', 'similar', 'similarly', 'since', 'six', 'slightly', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'specifically', 'specified', 'specify', 'specifying', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 't', 'take', 'taken', 'taking', 'tell', 'tends', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', "that'll", "that've", 'thats', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', "there'll", "there've", 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'thereof', 'therere', 'theres', 'thereto', 'thereupon', 'these', 'they', "they'll", "they've", 'theyd', 'theyre', 'think', 'this', 'those', 'thou', 'though', 'thoughh', 'thousand', 'throug', 'through', 'throughout', 'thru', 'thus', 'til', 'tip', 'to', 'together', 'too', 'took', 'toward', 'towards', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 'twice', 'two', 'u', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'up', 'upon', 'ups', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'v', 'value', 'various', 'very', 'via', 'viz', 'vol', 'vols', 'vs', 'w', 'want', 'wants', 'was', 'wasnt', 'way', 'we', "we'll", "we've", 'wed', 'welcome', 'went', 'were', 'werent', 'what', "what'll", 'whatever', 'whats', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', "who'll", 'whod', 'whoever', 'whole', 'whom', 'whomever', 'whos', 'whose', 'why', 'widely', 'willing', 'wish', 'with', 'within', 'without', 'wont', 'words', 'world', 'would', 'wouldnt', 'www', 'x', 'y', 'yes', 'yet', 'you', "you'll", "you've", 'youd', 'your', 'youre', 'yours', 'yourself', 'yourselves', 'z', 'zero']

The following code removes these terms from our lexicon.

In [31]:
sw_list = set(sw_list).intersection(lexicon.token2id.keys())
ids = [lexicon.token2id[x] for x in sw_list]
lexicon.filter_tokens(ids)
len(lexicon)
Out[31]:
12

The result now contains four fewer terms.
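
Note that filtering generally reassigns the term ids, so the bag of words we built earlier no longer lines up with the reduced lexicon. In practice we would simply rebuild it; here is a minimal sketch (doc2bow silently skips words that are no longer in the lexicon):

# Rebuild the bag of words against the filtered lexicon
bow_small = []
for t in word_list:
    bow_small.append(lexicon.doc2bow(t))

bow_small[0]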