Using a Deployed API

The purpose of this article is to show an example use case for a deployed API: automatically scoring incoming data with a pre-trained, deployed model.

Business Use Case

This article considers a major hotel chain as an example. The hotel chain uses customer reviews to identify potential problems. The management wants to reduce time spent reading through large volumes of customer reviews. In this case, the deployed API will take reviews as input and assign category labels. Labeled reviews can then be sorted, summarized with a BI tool, and addressed by appropriate departments.

Training the Model

This example uses a technique called Topic Modeling (Latent Dirichlet Allocation, or LDA) to discover the topics mentioned in reviews. The algorithm searches through the review texts and summarizes them as a handful of topics, each consisting of words and phrases that customers tend to use together. The model is trained with the gensim Python library. The first step is to read in the data:

# Platform Kernel: python2
# Libraries: boto==2.48.0, pandas==0.20.3, numpy==1.13.1, gensim==2.3.0, nltk==3.2.4, re==2.2.1, requests==2.18.3

import os
import pickle
import string
import re
import pandas as pd
import numpy
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from gensim import corpora
from gensim.models import Phrases
from gensim.models.ldamodel import LdaModel
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# utility functions for reading/writing to AWS S3
# (access_key, secret_key, and env_name hold the S3 credentials and bucket name)
def get_bucket():
    conn = S3Connection(access_key, secret_key)
    return conn.get_bucket(env_name, validate=False)


def pull_file_from_s3(key, tmp_localdir=''):
    '''
    Grab a raw file from S3 and save it locally
    '''
    s3_bucket = get_bucket()
    payload = s3_bucket.get_key(key)
    if not os.path.exists(os.path.dirname(tmp_localdir+key)):
        os.makedirs(os.path.dirname(tmp_localdir+key))

    payload.get_contents_to_filename(tmp_localdir+key)
    print("Grabbed %s from S3. Local file %s is now available." % (key, tmp_localdir+key))


def write_obj_to_s3(obj, localpath, key):
    '''
    Write a pickled object to S3. localpath should be the same as key.
    The file is written to localpath first and then transferred to S3
    under key.
    '''
    if key is None:
        raise ValueError('Specify an S3 key')
    k = key.replace(' ', '-')
    print('Modified S3 key {}'.format(k))
    s3_key = Key(get_bucket())
    s3_key.key = k
    pickle.dump(obj, open(localpath, 'wb'))
    s3_key.set_contents_from_file(open(localpath, 'rb'))
    print("Sent obj %s to S3 with key '%s'" % (localpath, k))


def pull_pickle_from_s3(key, tmp_localdir=''):
    '''
    Grab a pickled object from S3 and load it
    '''
    local_path = tmp_localdir+key
    if not os.path.exists(os.path.dirname(local_path)):
        os.makedirs(os.path.dirname(local_path))

    s3_bucket = get_bucket()
    payload = s3_bucket.get_key(key)
    payload.get_contents_to_filename(local_path)
    print("Grabbed %s from S3. Local file %s is now available." % (key, local_path))
    return pickle.load(open(local_path, 'rb'))


dir_path = 'my-path-to-s3/hotel_reviews.txt'
pull_file_from_s3(dir_path, tmp_localdir='../')


f = open('../'+dir_path, 'r')
raw = f.read()
f.close()

# process the text file
lines = raw.splitlines()  # split on line breaks

# keep only the lines that contain review text and drop the <Content> tag
reviews = [line.replace('<Content>', '').strip() for line in lines if '<Content>' in line]

Next, apply standard cleaning: remove bad characters and stop words, then stem the remaining tokens. Finally, use the gensim library to convert the clean word tokens into a dictionary and a bag-of-words corpus for modeling:

# Utility functions for text processing
def default_clean(text):
    '''
    Removes default bad characters
    '''
    if not (pd.isnull(text)):
        text = filter(lambda x: x in string.printable, text)
        bad_chars = set(["@", "+", '<br>', '<br />', '/', "'", '"', '\\', '(', ')', '<p>',
                         '\n', '<', '>', '?', '#', ',', '.', '[', ']', '%', '$', '&', ';',
                         '!', ':', '-', "*", "_", "=", "}", "{"])
        for char in bad_chars:
            text = text.replace(char, " ")
        text = re.sub(r'\d+', "", text)  # drop digits

    return text


def stop_and_stem(text, stemmer = PorterStemmer()):
    '''
    Removes stopwords and does stemming
    '''
    stoplist = stopwords.words('english')

    text_stemmed = [[stemmer.stem(word) for word in document.lower().split()
                     if word not in stoplist] for document in text]
    return text_stemmed


clean_reviews = [default_clean(d).lower() for d in reviews]
stemmed = stop_and_stem(clean_reviews)

# build the dictionary and bag-of-words corpus for the model
numpy.random.seed(seed=44)
dictionary = corpora.Dictionary(stemmed)
corpus = [dictionary.doc2bow(t) for t in stemmed]

Fitting the Model

Now, fit the model with five topics. Note that in practice the number of topics is usually chosen by optimizing a goodness-of-fit metric such as topic coherence or log-perplexity; a sketch of that comparison follows the code below.

# number of topics
K=5

# Run LDA model to extract topics
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=K, passes=10)
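
As a reference point, that comparison could look like the sketch below, which uses gensim's CoherenceModel; the candidate values of K are assumptions chosen only for illustration and are not part of the original workflow.

# Minimal sketch: compare a few candidate topic counts by c_v coherence
# and log-perplexity before settling on K (candidate values are illustrative).
from gensim.models.coherencemodel import CoherenceModel

for k in [3, 5, 7, 10]:
    candidate = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, passes=10)
    cm = CoherenceModel(model=candidate, texts=stemmed,
                        dictionary=dictionary, coherence='c_v')
    print('K=%d  coherence=%.3f  log-perplexity=%.3f'
          % (k, cm.get_coherence(), candidate.log_perplexity(corpus)))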

Saving the Model

Trained models can be saved in a serialized format on AWS S3 to be accessed by the API later:

# Save the model
write_obj_to_s3(lda, 'lda_model', 'my-path-to-s3/lda_model')
write_obj_to_s3(dictionary, 'dictionary', 'my-path-to-s3/dictionary')

Deploying the Model

To deploy the trained model, write a function that takes a review as input and produces a label or list of labels as output. This function can use the serialized model we just trained and saved. The deploy function and any supporting code should be saved in a .py script, which will be deployed behind an API. This deploy script is what makes predictions when new data hits the API:

# The deploy script must be self-contained: it also needs the imports and the
# get_bucket, default_clean, and stop_and_stem helpers defined earlier.

def pull_pickle_from_s3(key, tmp_localdir=''):
    '''
    Grab a pickled object from S3 and load it
    '''
    local_path = tmp_localdir+key
    if not os.path.exists(os.path.dirname(local_path)):
        os.makedirs(os.path.dirname(local_path))

    s3_bucket = get_bucket()
    payload = s3_bucket.get_key(key)
    payload.get_contents_to_filename(local_path)
    print("Grabbed %s from S3. Local file %s is now available." % (key, local_path))
    return pickle.load(open(local_path, 'rb'))


def max_topic(scored_list):
    return max(scored_list, key=lambda item: item[1])[0]

# read in the model
dictionary = pull_pickle_from_s3('my-path-to-s3/dictionary')
lda_model = pull_pickle_from_s3('my-path-to-s3/lda_model')


# Main deploy function
def label_review(new_review):
    '''
    Take a new review as a list and assign it to one of the existing pre-trained topics
    '''

    # topic labels
    name_dict = {0: "Front Desk",
                 1: "Pool Feedback",
                 2: "Restaurant and Bar Service",
                 3: "Happy Customers",
                 4: "Complaints"}

    # transform text into the bag-of-words space
    clean_review = [default_clean(d).lower() for d in new_review]
    stemmed = stop_and_stem(clean_review)

    # Predict label
    new_vector = [dictionary.doc2bow(t) for t in stemmed]

    lda_vector = lda_model[new_vector]

    id_ = map(max_topic, lda_vector)

    print("Review Categories")
    return map(name_dict.get, id_)
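
Before deploying, it is worth exercising label_review locally in an environment where the model, dictionary, and text-cleaning helpers are loaded; the sample review and the label shown below are purely illustrative.

# Quick local sanity check (input and output are illustrative only)
sample = ["The front desk staff were slow to check us in"]
print(label_review(sample))
# e.g. ['Front Desk'] -- the actual label depends on the trained topics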

Now you are ready to deploy the model. Use the Deploy API option found under Quick Actions:

../_images/launch_deploy_api.png

Provide information in each of the prompts:

../_images/deploy_option.png

Specify your branch, deploy script name, deploy function name, and example input. Remember to declare the dependencies of your deployed function: click the Add Requirements option and enter the name of your pip requirements file. Here, deploy_requirements.txt lists all package dependencies needed for the deploy function to run, and it is recommended to keep that file in the same folder as the deploy Python script.
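
For reference, a deploy_requirements.txt for this example might simply pin the libraries already used in the training kernel; the listing below is an illustrative sketch, not a file taken from the project.

# deploy_requirements.txt -- illustrative sketch mirroring the training environment
boto==2.48.0
pandas==0.20.3
numpy==1.13.1
gensim==2.3.0
nltk==3.2.4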

../_images/deploy_in_progress.png

The review topic model is now deployed behind an API. You can access all currently running APIs in the project Outputs tab of the Platform.

Calling the API

The batch of new incoming reviews can now be passed as an input to the API. The text below shows an example of a raw review.

[We had a lovely stay ... The food is great but I wish there were more
choices ...  The pool was too crowded for me, although the service
was perfect]
A simple input construct to query the deployed model looks like this: {"new_review": ["we had a great day"]}

The call to the API needs a valid deployed model URL, cookie, and model input in JSON format. An example call is already pre-written for you in the Versions tab:

../_images/query_deployed_model.png

Note that here we are passing a list of reviews to the API as a batch. The deploy function label_review will process this input and return a list of output labels.

The request returns the category labels as a comma-separated string; the response is stored in the variable body:

# call the model API with reviews array
url = 'https://my.datascience.com/deploy/my-deployed-model-v1/'
body = requests.post(url,
                     json={"new_review": reviews},
                     cookies={'datascience-platform': 'eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqdGkiOiI5OGQ4YzU5Mi1kNzFkLTQ4ZWEtYmVlNC0zYWFiNzNiNmFkYTQiLCJzZXJ2aWNlTmFtZSI6ImRlcGxveS10b3BpYy1tb2RlbGVyLWRlcGxveS0yOTMxNS12MSIsImlhdCI6MTUwMDQ4NjI3Mn0.LM5YemhQjkke342mbBaU171o'})

labels = body.text.split(',')

Saving the Output

Finally, we can store the reviews and labels on an AWS S3 location:

# output to save to s3
out = zip(reviews, labels)

with open('../labeled_sample_reviews.txt', "w") as out_file:
    out_file.write(str(out))


# save to s3
write_obj_to_s3(out, 'labeled_reviews.txt', 'my-path-to-s3/labeled_reviews.txt')
print('Processed %s reviews' % len(reviews))

This process of reading, processing, and scoring new reviews can be run as a nightly scheduled job. The batch scoring job can write its output to a data store used by your BI tools on a regular schedule. For example, reviews labeled by the API can power a dashboard that summarizes problematic reviews and displays review category trends.
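
As a rough sketch of what such a scheduled job could look like, the script below strings the earlier pieces together; the S3 paths, API URL, and cookie value are placeholders, and the pull_file_from_s3 and write_obj_to_s3 helpers defined above are assumed to be available to the job.

# Hedged sketch of a nightly batch-scoring job; paths, URL, and cookie are placeholders.
import requests

def nightly_batch_score(review_key='my-path-to-s3/hotel_reviews.txt',
                        out_key='my-path-to-s3/labeled_reviews.txt',
                        api_url='https://my.datascience.com/deploy/my-deployed-model-v1/',
                        cookie='YOUR-PLATFORM-COOKIE'):
    # pull the newest batch of raw reviews from S3
    pull_file_from_s3(review_key, tmp_localdir='../')
    with open('../' + review_key, 'r') as f:
        lines = f.read().splitlines()
    reviews = [line.replace('<Content>', '').strip() for line in lines if '<Content>' in line]

    # score the batch through the deployed API
    response = requests.post(api_url,
                             json={'new_review': reviews},
                             cookies={'datascience-platform': cookie})
    labels = response.text.split(',')

    # persist the labeled reviews for downstream BI tools
    write_obj_to_s3(zip(reviews, labels), 'labeled_reviews.txt', out_key)
    print('Processed %s reviews' % len(reviews))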