Hands-on: Is natural language processing too hard? Follow this routine and it is as easy as chopping melons and vegetables! (Python code attached)

Detecting paid posters (shuijun) on Douban, writing a sequel to "Quanyou" (Game of Thrones), an ever more impressive Google Translate...

Recently, various applications of natural language processing (NLP) have been very popular.

These NLP applications look impressively clever, almost magical, but the principles behind them are not hard to understand.

Today, the Big Data Digest editor takes a look at the most commonly used natural language processing techniques and models, and teaches you how to build a simple but magical little application.

No exaggeration: you can use similar methods to solve 90% of NLP problems.

Today's tutorial teaches natural language processing in three stages:

  • Collect, prepare, and inspect data
  • Build simple models (including deep learning models)
  • Explain and understand your model

The Python code for the entire tutorial is here:


Get started now!

Step 1: Collect data

There is no shortage of natural language data: Taobao reviews, Weibo, Baidu Baike, and so on.

But the data set we work with today comes from Twitter: the "Disasters on Social Media" dataset, generously provided by CrowdFlower. It consists of more than 10,000 disaster-related tweets.

Some of the tweets really do describe a disaster, while the rest only look related: movie reviews, jokes, and so on.

Our task is to detect which tweets are about a catastrophic event rather than an unrelated topic such as a movie. Why would we want to do this? The relevant authorities could use such a small application to get information about disasters in time!

From here on, we refer to tweets about disasters as "disaster" and to all other tweets as "irrelevant".


Note that we are using labeled data. As the NLP expert Richard Socher put it, rather than spending a month applying unsupervised learning to a pile of unlabeled data, it is better to spend a week labeling a little data and training a classifier.

Step 2: Clean the data

The first principle we follow is: "No matter how good the model is, it cannot rescue bad data." So, let's clean up the data first!

We do the following:

1. Delete all irrelevant characters, such as any non-alphanumeric characters

2. Tokenize the text by splitting it into individual words

3. Delete irrelevant tokens, such as Twitter "@" mentions or URLs

4. Convert all characters to lowercase so that words such as "hello", "Hello" and "HELLO" are treated as the same word

5. Consider merging misspellings and alternate spellings into a single representation (for example, "cool"/"kewl"/"cooool")

6. Consider lemmatization (reducing "am", "are", "is" and other words into common forms like "be")

After following these steps and checking for other errors, we can start training the model with clean labeled data!
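The first four cleaning steps above can be sketched in a few lines of Python. This is a minimal sketch using only the standard library, not the code from the tutorial's repository; the function name `clean_tweet` and the sample tweet are made up for illustration, and spelling normalization and lemmatization (steps 5 and 6) are left out because they need a dictionary or an NLP library.

```python
import re

def clean_tweet(text):
    """Lowercase a tweet, drop URLs and @mentions, and split it into word tokens."""
    text = text.lower()                         # step 4: case folding
    text = re.sub(r"https?://\S+", " ", text)   # step 3: remove URLs
    text = re.sub(r"@\w+", " ", text)           # step 3: remove @mentions
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # step 1: keep letters, digits, whitespace
    return text.split()                         # step 2: tokenize on whitespace

# Hypothetical example tweet
tokens = clean_tweet("Forest FIRE near @LaRonge, Sask. Canada!! http://t.co/abc123")
print(tokens)  # ['forest', 'fire', 'near', 'sask', 'canada']
```

URLs and mentions are removed before the generic character filter, because the filter would otherwise break them into fragments such as "http" and "t" that look like ordinary words.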

Step 3: Find a good data representation

Once the data is clean, we have to convert the words into numerical values so that the machine can work with them!

For example, in image processing we convert a picture into a matrix of numbers that represent the RGB intensity of each pixel.

A smiling face represented as a matrix of numbers

The representation in natural language processing is a bit more complicated. We will try a variety of representation methods.

One-hot encoding (bag of words)

A natural way to represent text for a computer is to encode each character individually as a number (as ASCII does). A classifier fed such a representation would have to learn the structure of words from scratch, so we work at the level of words instead.

For example, we can build a vocabulary of all the unique words in our data set and associate a unique index with each word in the vocabulary. Each sentence is then represented as a list that is as long as the number of distinct words in our vocabulary. At each index in this list, we mark how many times the given word appears in our sentence. This is called a bag-of-words model, since it is a representation that completely ignores the order of words in our sentence. This is illustrated below.

A sentence represented as a bag of words. On the left is the sentence, and on the right is its representation. Each index in the vector represents one particular word.
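The bag-of-words representation described above can be sketched as follows. This is a minimal standard-library sketch with two made-up sentences; the helper names `build_vocabulary` and `bag_of_words` are hypothetical, and a real project would more likely rely on a library vectorizer.

```python
from collections import Counter

def build_vocabulary(tokenized_sentences):
    """Map every unique word to a fixed index (sorted for reproducibility)."""
    words = sorted({w for sent in tokenized_sentences for w in sent})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(tokens, vocab):
    """Return a count vector as long as the vocabulary; unknown words are ignored."""
    vec = [0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec

sentences = [
    ["mary", "is", "hungry", "for", "apples"],
    ["john", "is", "happy", "he", "is", "not", "hungry", "for", "apples"],
]
vocab = build_vocabulary(sentences)
vectors = [bag_of_words(s, vocab) for s in sentences]
print(vectors[0])  # [1, 1, 0, 0, 1, 1, 0, 1, 0]
print(vectors[1])  # [1, 1, 1, 1, 1, 2, 1, 0, 1]
```

Shuffling the words of a sentence gives exactly the same vector, which is the sense in which the representation ignores word order.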

Visualizing the embeddings

In the "Disasters on Social Media" example, we have a vocabulary of about 20,000 words, which means that every sentence is represented as a vector of length 20,000. The vector consists mostly of zeros, because each sentence contains only a very small subset of our vocabulary.

To see whether our embeddings capture information that is relevant to our problem (that is, whether a tweet is about a disaster or not), it is a good idea to visualize them and check whether the classes look well separated. Since vocabularies are usually very large and it is impossible to display data in 20,000 dimensions, techniques like PCA help to project the data down to two dimensions. This is plotted below:


The two classes do not look very well separated. This could be a property of our embeddings, or simply a consequence of the dimensionality reduction. To see whether the bag-of-words features are of any use, we can train a classifier based on them.
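For readers curious about what the two-dimensional PCA projection does, here is a rough standard-library sketch. It finds the two directions of largest variance by power iteration, which is one simple way to compute principal components and not necessarily what the tutorial's code or any particular library does. The function `pca_2d` and the toy count matrix are assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_2d(rows, iterations=200, seed=0):
    """Project rows onto their top two principal components via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    components = []
    for _ in range(2):
        v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
        for _ in range(iterations):
            # Deflation: strip out directions that were already found.
            for c in components:
                proj = dot(v, c)
                v = [a - proj * b for a, b in zip(v, c)]
            # Apply the (unnormalized) covariance matrix as X^T (X v).
            scores = [dot(row, v) for row in centered]
            v = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
            norm = dot(v, v) ** 0.5
            if norm == 0.0:  # degenerate data: no variance left in this direction
                break
            v = [a / norm for a in v]
        components.append(v)
    return [[dot(row, c) for c in components] for row in centered]

# Toy stand-in for bag-of-words vectors: the first three rows use "disaster-like"
# columns, the last three use "irrelevant-like" columns.
rows = [[2, 1, 0, 0], [3, 1, 0, 0], [2, 2, 0, 1],
        [0, 0, 2, 1], [0, 1, 3, 1], [0, 0, 2, 2]]
points = pca_2d(rows)
for label, (x, y) in zip("DDDIII", points):
    print(label, round(x, 2), round(y, 2))
```

On this toy data the two groups land on opposite sides of the first axis. On the real tweets the separation is much less clean, as noted above.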

Step 4: Classification

When first approaching a problem, a general best practice is to start with the simplest tool that could solve the job. When it comes to classifying data, a common favorite for its versatility and interpretability is logistic regression. It is very simple to train and its results are interpretable, because you can easily extract the most important coefficients from the model.

We split our data into a training set used to fit the model and a test set used to see how well it generalizes to unseen data. After training, we get an accuracy of 75.4%. Not bad! Simply guessing the most frequent class ("irrelevant") would give us only 57%. However, even if 75% accuracy were good enough for our needs, we should never ship a model without trying to understand it.
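To make the train, test, and baseline comparison concrete, here is a small sketch of logistic regression trained by plain gradient descent on toy count vectors. It is written from scratch with the standard library so that the mechanics are visible, and is not the tutorial's actual code. The vocabulary, the eight examples, and the fixed split are all made up, so the numbers it produces have nothing to do with the 75.4% and 57% quoted above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(X, y, epochs=300, lr=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, x):
    """Return 1 ("disaster") if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy count vectors over the hypothetical vocabulary ["fire", "flood", "movie", "lol"];
# label 1 = "disaster", 0 = "irrelevant".
X = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 0],
     [0, 0, 0, 1], [0, 0, 1, 1], [2, 0, 0, 0], [0, 0, 2, 0]]
y = [1, 1, 1, 0, 0, 0, 1, 0]

# Fixed split to keep the sketch deterministic; real data should be shuffled first.
X_train, y_train, X_test, y_test = X[:6], y[:6], X[6:], y[6:]

w, b = train_logistic_regression(X_train, y_train)
predictions = [predict(w, b, x) for x in X_test]
test_accuracy = accuracy(y_test, predictions)

# Baseline: always guess the most frequent test label.
baseline_accuracy = max(y_test.count(label) for label in set(y_test)) / len(y_test)
print(test_accuracy, baseline_accuracy)
```

Comparing against the majority-class baseline matters because accuracy alone says little when one class dominates the data.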

Step 5: Check

Confusion matrix

The first step is to understand the types of errors our model makes, and which kinds of errors are least desirable. In our example, a false positive is an irrelevant tweet classified as disaster, and a false negative is a disaster tweet classified as irrelevant. If the priority is to react to every potential event, we want to lower our false negatives. If we are constrained in resources, however, we may prioritize a lower false positive rate to reduce false alarms. A good way to visualize this information is a confusion matrix, which compares the predictions our model makes with the true labels. Ideally, the matrix is a diagonal line from the upper left corner to the lower right corner (our predictions match the truth perfectly).

Confusion matrix (green is high, blue is low)

Our classifier produces proportionally more false negatives than false positives. In other words, its most common mistake is to classify a disaster tweet as irrelevant. If false positives carry a high cost for law enforcement, this could be a good bias for our classifier to have.
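The four cells of the confusion matrix can be counted directly from the true and predicted labels. The sketch below treats "disaster" as the positive class, in line with the definitions in this step. The label lists are invented for illustration and are not results from the tutorial.

```python
def confusion_counts(y_true, y_pred, positive="disaster"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            counts["tp"] += 1   # disaster tweet, flagged as disaster
        elif pred == positive:
            counts["fp"] += 1   # irrelevant tweet, flagged as disaster (false alarm)
        elif truth == positive:
            counts["fn"] += 1   # disaster tweet, missed
        else:
            counts["tn"] += 1   # irrelevant tweet, correctly ignored
    return counts

# Made-up labels for eight tweets
y_true = ["disaster", "disaster", "disaster", "irrelevant",
          "irrelevant", "irrelevant", "irrelevant", "disaster"]
y_pred = ["disaster", "irrelevant", "disaster", "irrelevant",
          "disaster", "irrelevant", "irrelevant", "irrelevant"]
counts = confusion_counts(y_true, y_pred)
print(counts)  # {'tp': 2, 'fp': 1, 'fn': 2, 'tn': 3}
```

In this made-up sample there are more false negatives than false positives, the same kind of bias described above.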

Model analysis

To validate our model and interpret its predictions, it is important to look at which words it uses to make decisions. If our data is biased, the classifier will make accurate predictions on the sample data but will not generalize well to the real world. Here, we plot the most important words for the disaster and the irrelevant class respectively. Because we can extract and rank the coefficients the model uses for its predictions, computing word importance with bag-of-words and logistic regression is straightforward.
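Ranking words by coefficient might look like the sketch below. It assumes one logistic regression coefficient per vocabulary word, with positive values pushing toward "disaster". The words are borrowed from the discussion in this step, but the coefficient values are invented and do not come from the trained model.

```python
def top_words(words, coefficients, n=2):
    """Return the n most positive (disaster) and n most negative (irrelevant) words."""
    ranked = sorted(zip(words, coefficients), key=lambda pair: pair[1])
    irrelevant = ranked[:n]        # most negative coefficients first
    disaster = ranked[::-1][:n]    # most positive coefficients first
    return disaster, irrelevant

# Hypothetical coefficients, one per vocabulary word
words = ["hiroshima", "massacre", "movie", "lol", "the"]
coefficients = [2.1, 1.7, -1.9, -1.2, 0.1]

disaster, irrelevant = top_words(words, coefficients)
print(disaster)    # [('hiroshima', 2.1), ('massacre', 1.7)]
print(irrelevant)  # [('movie', -1.9), ('lol', -1.2)]
```

Words with coefficients close to zero, such as "the" in this example, have little influence on the prediction either way.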

Bag of words: keywords

Our classifier correctly picks up on some patterns (such as "hiroshima" and "massacre"), but it is clearly overfitting on some meaningless terms (such as "heyoo" and "x1392"). Right now, our bag-of-words model deals with a huge vocabulary of different words and treats all of them equally. However, some of these words are very frequent and only contribute noise to our predictions. Next, we will try a way of representing sentences that accounts for word frequency, and see whether we can pick up more signal from our data.

Step 6: Account for vocabulary structure

To help our model focus on meaningful words, we can apply TF-IDF scoring (term frequency, inverse document frequency) on top of the bag-of-words model. TF-IDF weighs words by how rare they are in the data set, discounting words that are too frequent and only add noise. The figure below is the PCA projection of our new embeddings.

Visualizing the TF-IDF embeddings

As the figure above shows, there is a clearer boundary between the two colors. This should make it easier for our classifier to separate the two groups. Let's see whether this leads to better performance! Training another logistic regression on our new embeddings, we get an accuracy of 76.2%.

This is a very slight improvement. Has our model started to pick up on more important words? If we get a better result while preventing our model from "cheating", we can truly consider this model an upgrade.

TF-IDF: Keywords

The words the model picked up look much more relevant! Although the metrics on our test set only increased slightly, we have much more confidence in the terms our model relies on, so we would feel more comfortable deploying it in a system that interacts with customers.
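The TF-IDF weighting used in this step can be illustrated with a short sketch. It uses one common textbook variant (term frequency times the logarithm of the inverse document frequency); libraries usually apply smoothed or normalized versions, so exact numbers will differ. The function name `tfidf` and the three tiny documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf(tokenized_docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(w for doc in tokenized_docs for w in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            w: (c / len(doc)) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

docs = [
    ["fire", "in", "the", "forest"],
    ["the", "movie", "was", "fire"],
    ["flood", "in", "the", "city"],
]
weights = tfidf(docs)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(word, round(weight, 3))
```

In this example "the" occurs in every document and gets a weight of zero, while "forest", which occurs only once in the collection, gets the highest weight in the first document.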

Step 7: Clever use of semantics

Convert words to vectors

Our latest model manages to pick up on high-signal words. However, if we deploy this model, we are likely to encounter words that we did not see in the training set. The previous model cannot classify such tweets accurately, even if it saw very similar words during training.

To solve this problem, we need to capture the semantic meaning of words, which means understanding that words like "good" and "positive" are closer to each other than "apricot" and "continent". We will use a tool called Word2Vec to help us capture meaning.

Use pre-trained word vectors

Word2Vec is a technique for finding continuous embeddings for words. It learns by reading a large amount of text and memorizing which words tend to appear in similar contexts. After being trained on enough data, it generates a 300-dimensional vector for each word in a vocabulary, with words of similar meaning lying closer to each other.

The authors of the Word2Vec paper open-sourced a model that was pre-trained on a very large corpus. We can use this model to bring some knowledge of semantic meaning into our own model. The pre-trained vectors can be found in the repository associated with this article.

Sentence-level representation

A quick way to get a sentence embedding for our classifier is to average the Word2Vec vectors of all words in the sentence. This is a bag-of-words approach just like before, but this time we only lose the syntax of the sentence while keeping some semantic information.

Word2Vec sentence embedding
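Averaging word vectors into a sentence embedding can be sketched as follows. The three-dimensional vectors are invented toy values standing in for the 300-dimensional pre-trained Word2Vec vectors, and `sentence_embedding` is a hypothetical helper name. Skipping unknown words is one possible choice among several.

```python
def sentence_embedding(tokens, word_vectors, dim):
    """Average the vectors of all known words; return a zero vector if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[j] for vec in known) / len(known) for j in range(dim)]

# Toy 3-dimensional vectors (real Word2Vec vectors have 300 dimensions)
word_vectors = {
    "fire":  [1.0, 0.0, 0.0],
    "flood": [0.8, 0.2, 0.0],
    "movie": [0.0, 1.0, 0.0],
}

embedding = sentence_embedding(["fire", "flood", "unknownword"], word_vectors, dim=3)
print([round(x, 3) for x in embedding])  # [0.9, 0.1, 0.0]
```

Because the result is an average, the order of the words has no effect on the embedding, which is why syntax is lost while the overall meaning is partly kept.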

The following figure visualizes the new embeddings using the same technique as before:

Visualizing the Word2Vec embeddings

The two groups of colors look even more separated here, so our new embeddings should help our classifier find the boundary between the two classes. After training the same model a third time (logistic regression), we get an accuracy of 77.7%, our best result so far! Next, it is time to inspect our model.

The trade-off between complexity and interpretability

Since the new embeddings do not represent each word as one dimension of the vector, as our previous models did, it is harder to see which words are most relevant to our classification. We still have access to the coefficients of the logistic regression, but they relate to the 300 dimensions of our embeddings rather than to the indices of words.

For such a small gain in accuracy, losing all interpretability seems like a harsh trade-off. However, with more complex models we can use black-box explainers such as LIME to gain some insight into how the classifier works.

LIME is available on GitHub as an open-source package. A black-box explainer lets users explain the decisions of any classifier on a particular example by perturbing the input (in our case, removing words from the sentence) and seeing how the prediction changes.

Next, let's take a look at the explanations of several sentences in our data set.

However, we do not have time to explore the thousands of examples in the data set. Instead, we run LIME on a representative sample of test cases and see which words keep coming up as strong contributors. In this way, we obtain word importance scores as we did for the previous models and can validate the model's predictions.
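The perturbation idea can be illustrated without the LIME package itself. The sketch below is a simplified leave-one-word-out analysis, not the LIME algorithm (LIME samples many random perturbations and fits a local surrogate model to them). The classifier `toy_disaster_probability` is a hypothetical stand-in with hand-picked alarm words, since any function that maps tokens to a probability can be plugged in.

```python
def word_importance(tokens, predict_proba):
    """Score each word by how much the disaster probability drops when it is removed."""
    base = predict_proba(tokens)
    scores = []
    for i, word in enumerate(tokens):
        perturbed = tokens[:i] + tokens[i + 1:]   # the sentence without word i
        scores.append((word, base - predict_proba(perturbed)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

def toy_disaster_probability(tokens):
    """Hypothetical stand-in classifier: each alarm word adds 0.25 to a 0.25 baseline."""
    alarm_words = {"fire", "evacuate"}
    return min(1.0, 0.25 + 0.25 * sum(t in alarm_words for t in tokens))

tweet = ["forest", "fire", "forces", "town", "to", "evacuate"]
ranking = word_importance(tweet, toy_disaster_probability)
print(ranking[:2])  # [('fire', 0.25), ('evacuate', 0.25)]
```

Collecting such scores over a sample of test sentences and summing them per word gives a rough ranking of the words the classifier relies on most.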

Word2Vec: Keywords

The model seems to pick up highly relevant words, which suggests that it makes understandable decisions. These look like the most relevant words of all the models so far, so we would feel more comfortable deploying this one in production.

Step 8: Use an end-to-end approach to exploit syntax

We have covered quick and efficient approaches for generating compact sentence embeddings. However, by omitting word order, we discard all of the syntactic information in our sentences. If these methods do not give sufficient results, we can use a more complex model that takes whole sentences as input and predicts labels, without the need to build an intermediate representation. A common way to do this is to treat a sentence as a sequence of individual word vectors, obtained with Word2Vec or with more recent approaches such as GloVe or CoVe. This is what we will do below.

Efficient end-to-end architecture (source)

Convolutional neural networks for sentence classification train very quickly and work well as an entry-level deep learning architecture. Although convolutional neural networks (CNNs) are mainly known for their performance on image data, they have produced excellent results on text-related tasks and usually train much faster than more complex NLP approaches (such as LSTMs and encoder/decoder architectures). This model preserves word order and learns valuable information about which sequences of words predict our target classes. Unlike the previous models, it can tell the difference between "Alex eats plants" and "Plants eat Alex".

Training this model does not require much more work than the previous approaches (see the code for details) and gives us a better model than before, with an accuracy of 79.5%! As with the models above, the next step should be to explore and explain its predictions using the methods we described, to verify that it is indeed the best model to deploy to users. By now, you should be able to tackle this kind of problem yourself.
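Why a convolution can see word order while an averaged embedding cannot is easiest to show on a toy example. The sketch below uses hand-made one-hot word vectors and a single hand-set filter spanning two words. A real CNN would learn many filters from data and add nonlinearities and a classification layer, so this illustrates only the principle. All names and values are assumptions, the tokens are assumed to be lemmatized so that both sentences contain the same three words, and the sentence is assumed to be at least as long as the filter.

```python
def mean_vector(vectors):
    """Order-insensitive sentence representation: the average of the word vectors."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def conv1d_max(sequence, kernel):
    """Slide a filter spanning len(kernel) consecutive words and keep the strongest response."""
    width = len(kernel)
    activations = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        activations.append(sum(k * x
                               for k_vec, x_vec in zip(kernel, window)
                               for k, x in zip(k_vec, x_vec)))
    return max(activations)

# Toy one-hot embeddings (hypothetical)
embedding = {"alex": [1.0, 0.0, 0.0], "eat": [0.0, 1.0, 0.0], "plants": [0.0, 0.0, 1.0]}
sentence_a = [embedding[t] for t in ["alex", "eat", "plants"]]
sentence_b = [embedding[t] for t in ["plants", "eat", "alex"]]

# Hand-set filter that responds most strongly to "alex" immediately followed by "eat"
subject_verb_filter = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(mean_vector(sentence_a) == mean_vector(sentence_b))  # True
print(conv1d_max(sentence_a, subject_verb_filter))         # 2.0
print(conv1d_max(sentence_b, subject_verb_filter))         # 1.0
```

The averaged representation is identical for both sentences, while the filter response differs, which is the extra information a sequence model has access to.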


To recap the routine:

  • Start with a simple and quick model
  • Explain its predictions
  • Understand the kinds of mistakes it is making
  • Use this knowledge to determine the next step: whether to work on the data or to move to a more complex model

We applied these methods to a specific case, with models tailored to understanding short texts such as tweets, but the ideas apply widely to all kinds of problems!

As we said at the beginning, 90% of natural language processing problems can be solved with this routine. Isn't it as easy as chopping melons and vegetables?

The original publication time is: 2018-02-28

Author: Abstracts bacteria

This article is from "Big Data Digest", a partner of Yunqi Community. For relevant information, please follow the "Big Data Digest" WeChat official account.