Wanda Juan


Sentiment prediction on text reviews using BERT

Introduction

BERT is Google’s neural network-based technique for natural language processing (NLP) pre-training. BERT stands for Bidirectional Encoder Representations from Transformers.

Traditionally, NLP requires a lot of preprocessing, and model performance is often compromised by the trade-off between data size and computational cost. Luckily, Google open-sourced BERT, which was pre-trained on a large amount of internet text, including Wikipedia and news articles, making it a generic but powerful pre-trained NLP model. Applying BERT to an NLP task can save time and computation, and boost model performance even without much data.

BERT setup

Install and import BERT
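
A minimal sketch of this step, following the referenced TF-Hub tutorial (the bert-tensorflow package and the TF 1.x APIs are the tutorial's choices, and may differ from the original notebook):

!pip install bert-tensorflow

import tensorflow as tf          # TF 1.x, as in the tutorial
import tensorflow_hub as hub     # loads the pre-trained BERT module

import bert
from bert import run_classifier  # InputExample, feature conversion, input_fn_builder
from bert import optimization    # Adam optimizer with warmup used for fine-tuning
from bert import tokenization    # WordPiece tokenizer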

Data Preparation

InputExample

BERT's classifier expects four parameters when constructing an InputExample: guid, text_a, text_b and label.

class InputExample(object):
  """A single training/test example for simple sequence classification."""

  def __init__(self, guid, text_a, text_b=None, label=None):
    """Constructs a InputExample.
    Args:
      guid: Unique id for the example.
      text_a: string. The untokenized text of the first sequence. For single
        sequence tasks, only this sequence must be specified.
      text_b: (Optional) string. The untokenized text of the second sequence.
        Only must be specified for sequence pair tasks.
      label: (Optional) string. The label of the example. This should be
        specified for train and dev examples, but not for test examples.
    """
    self.guid = guid
    self.text_a = text_a
    self.text_b = text_b
    self.label = label

In our simple classifier, we can simply set text_a to the review text and label to the sentiment label, as in the sketch below.
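
A minimal sketch, assuming the reviews sit in a pandas DataFrame named reviews with hypothetical 'text' and 'sentiment' columns:

# Assumed: a pandas DataFrame `reviews` with a 'text' column and a numeric 'sentiment' label.
reviews_input_examples = reviews.apply(
    lambda row: bert.run_classifier.InputExample(
        guid=None,                  # no unique id needed for this simple classifier
        text_a=row['text'],         # the review text
        text_b=None,                # no second sequence for single-sentence tasks
        label=row['sentiment']),    # the sentiment label
    axis=1)

label_list = [0, 1]                 # assumed encoding: 0 = negative, 1 = positive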

Tokenizer & convert to features

As instructed in Google’s tutorial, we first load the tokenizer from the pre-trained BERT model on TF Hub, then tokenize and convert the InputExamples into features that BERT understands.
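
A sketch of this step, following the referenced tutorial; the Hub module URL and MAX_SEQ_LENGTH are the tutorial's choices, and reviews_feature is the name referred to later in this post:

BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
  """Fetch the vocab file and casing flag from the Hub module, then build a FullTokenizer."""
  with tf.Graph().as_default():
    bert_module = hub.Module(BERT_MODEL_HUB)
    tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
    with tf.Session() as sess:
      vocab_file, do_lower_case = sess.run(
          [tokenization_info["vocab_file"], tokenization_info["do_lower_case"]])
  return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

MAX_SEQ_LENGTH = 128   # reviews are truncated/padded to this many WordPiece tokens

# Convert the InputExamples into the numeric features BERT understands.
reviews_feature = bert.run_classifier.convert_examples_to_features(
    reviews_input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)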

Create Model

For me, this is the hardest part to understand, but Google does a great job organizing the code, so we can mostly copy and modify it, as in the sketch below. Note that the BERT model is loaded again, this time as a layer.
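
A condensed sketch, adapted from the tutorial's create_model function (the dropout keep probability and initializer values are the tutorial's defaults, not necessarily the exact ones used in the original notebook):

def create_model(is_predicting, input_ids, input_mask, segment_ids, labels, num_labels):
  """Load BERT from TF Hub as a layer and add a single classification layer on top."""
  bert_module = hub.Module(BERT_MODEL_HUB, trainable=True)
  bert_inputs = dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)
  bert_outputs = bert_module(inputs=bert_inputs, signature="tokens", as_dict=True)

  # "pooled_output" is the embedding for the whole sequence (the [CLS] token).
  output_layer = bert_outputs["pooled_output"]
  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))
  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)
    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))

    if is_predicting:
      return (predicted_labels, log_probs)

    # Standard cross-entropy loss against one-hot labels.
    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)
    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)
    return (loss, predicted_labels, log_probs)

As in the tutorial, this function is assumed to be wrapped in a model_fn by a model_fn_builder and handed to a tf.estimator.Estimator; that estimator is the one trained in the next section.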

Train

Once the estimator is in place, we build an input_fn from the features with input_fn_builder and have the model/estimator train on it.
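
A minimal sketch, assuming the estimator built above and the reviews_feature from the earlier step (num_train_steps is a hypothetical value, computed from batch size and epochs as in the tutorial):

# Wrap the training features in an input function and fine-tune the estimator on it.
train_input_fn = bert.run_classifier.input_fn_builder(
    features=reviews_feature,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=False)

estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)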

Evaluate

This is not best practice, but as a straightforward check of model fit I feed the same reviews_feature into another input_fn (this time with is_training=False) and have the trained estimator evaluate its metrics. (We shouldn't evaluate on the training data because it hides overfitting, but it still tells us whether the model is at least not underfitting.)
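
A sketch of that check, re-using reviews_feature with is_training=False:

# Same features as training, but the input_fn no longer shuffles or repeats the data.
eval_input_fn = bert.run_classifier.input_fn_builder(
    features=reviews_feature,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

estimator.evaluate(input_fn=eval_input_fn, steps=None)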

The performance looks astonishingly good:

{'auc': 0.7391304, 'eval_accuracy': 0.9877925, 'f1_score': 0.9937888, 'false_negatives': 0.0, 'false_positives': 12.0, 'global_step': 92, 'loss': 0.021924153, 'precision': 0.9876543, 'recall': 1.0, 'true_negatives': 11.0, 'true_positives': 960.0}

At a glance the performance is very good, with 98.78% accuracy, but there are more false positives (12) than true negatives (11), mainly because of the imbalanced data. To improve this, we could resample (oversample the minority class or undersample the majority class), but that is outside the scope of this post.

Prediction

If we want to make a prediction on new text (a review comment, for example), we can, but we need to preprocess the new data the same way we did the train/test datasets: first construct InputExamples, then tokenize them into features, and finally build an input_fn before feeding it to the model.
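
A sketch of a prediction helper, closely following the tutorial's getPrediction function; the helper's name and label names are assumptions, and it assumes the predicting branch of the model_fn returns 'probabilities' and 'labels', as in the tutorial:

def get_prediction(in_sentences):
  """Run the trained estimator on raw review strings; return (text, log-probabilities, label)."""
  labels = ["Negative", "Positive"]   # assumed to match the label encoding 0/1
  # convert_examples_to_features still needs a label, so a dummy 0 is passed at prediction time.
  input_examples = [
      bert.run_classifier.InputExample(guid="", text_a=s, text_b=None, label=0)
      for s in in_sentences]
  input_features = bert.run_classifier.convert_examples_to_features(
      input_examples, label_list, MAX_SEQ_LENGTH, tokenizer)
  predict_input_fn = bert.run_classifier.input_fn_builder(
      features=input_features, seq_length=MAX_SEQ_LENGTH,
      is_training=False, drop_remainder=False)
  predictions = estimator.predict(predict_input_fn)
  return [(sentence, prediction['probabilities'], labels[prediction['labels']])
          for sentence, prediction in zip(in_sentences, predictions)]

get_prediction(["very satisfied with the performance of plugin and ease of use"])

Running it on a handful of new review comments gives output like the listing below.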

[('excellent cache plugin in comparison to others work out of the box my server is openlitespeed with some additional optimization i get an average 3sec page load pagespeed score 92% at gtmetrixcom nice image optimization builtin cloudflare developer mode in one click and more',
  array([-6.444571e+00, -1.590417e-03], dtype=float32), 'Positive'),
 ("first i want to say normally i don't take the time to do reviews but for this case i thought i must say that this plugin is top class i had w3 cache great rated great quality before installed as a comparison litespeed cache plugin improved my loading speed by x2.5 times better that means a huge difference the test comparison is true plus it has great and other free optionsfeatures that other plugins in his category don't have 5 stars for litespeed cache well deserved",
  array([-4.31978 , -0.01339214], dtype=float32), 'Positive'),
 ('very satisfied with the performance of plugin and ease of use',
  array([-6.6634393e+00, -1.2775840e-03], dtype=float32), 'Positive'),
 ('so easy any questions use the plugin support forum answers everything',
  array([-6.5263772e+00, -1.4653194e-03], dtype=float32), 'Positive'),
 ('better than any plugin out there litespeed is superb',
  array([-6.7109399e+00, -1.2182918e-03], dtype=float32), 'Positive'),
 ('amazing plugin converting images to webp is simply unique',
  array([-6.2855477e+00, -1.8647201e-03], dtype=float32), 'Positive')]

Summary

Okay, now we have gently walked through text classification using BERT. The full notebook and repo are on GitHub.

Reference

  • https://github.com/google-research/bert
  • https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb