Recsystem Interface Documentation

If you’ve worked through the Quickstart Guide you’ve seen the basics of how to write and run a recsystem for Renewal competitions.

That guide introduced the main function all recsystems have to implement: recommend(). But it did not delve into all the details that might be needed to implement a non-trivial recsystem.

This guide lists all of the other special “hook” functions you can write to make your recsystem respond to user activity from the mobile apps, and also explains how you can use the state dict to hold data specific to your recommendation algorithm.

The hook module

This is the main entry point to your recommendation system. The renewal_recsystem package handles the heavy lifting of managing network protocols and concurrency, so you can focus on writing a few functions that tell the package how to provide recommendations to users, how to collect data on users’ activity, and what additional background work the recsystem should perform.

The hook module is the Python file you pass as the argument to the python -m renewal_recsystem script. If your code is complex enough that it needs to grow beyond one file, your hook module is free to import code from other files, as well as from any Python packages installed on your system.

See the next section for a complete list of the functions understood by renewal_recsystem that you can include in your hook module.

Available hook functions

The following hook functions are pre-defined by the system. Some of them have exact names and some of them have naming patterns you can follow. All of them should be implemented with the exact call signatures defined here.

recommend

def recommend(state, user, articles, min_articles, max_articles):
    ...
    return [<list of recommendations>]

This is the only function that is required to be implemented in your hook module. It is called every time one of the users assigned to your recsystem requests a list of news recommendations.

In a future version it may also be called periodically by the Backend in order to pre-queue recommendations for users, but from the perspective of your recsystem the two cases are no different (except that you should make sure to return unique sets of recommendations on each call, for each user).

  • The state argument is explained in Recsystem state.

  • The user argument is a User object representing the user you are making recommendations for.

  • The articles argument is a pandas.DataFrame representing candidate articles to recommend to the user. To the extent possible (outside race conditions) this contains articles not already recommended to that user by any recsystem.

  • The min_articles and max_articles arguments are hints specifying how many recommendations the function should return. If fewer than min_articles are returned, the recommendation list will be discarded by the backend, as there are not enough recommendations to make a meaningful head-to-head contest with the other recsystem(s) assigned to the user. Returning more than max_articles does not currently disqualify your recommendations, but any recommendations beyond the first max_articles will be discarded.
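As a rough illustration of how these arguments fit together, here is a minimal sketch of a recommend() that returns the first few candidates. It assumes that the articles DataFrame is indexed by article ID and that recommendations are given as article IDs. Returning an empty list when there are too few candidates is only one possible choice, not documented behaviour.

```python
def recommend(state, user, articles, min_articles, max_articles):
    # Assumption: the DataFrame index holds the article IDs.
    candidate_ids = list(articles.index)

    if len(candidate_ids) < min_articles:
        # Too few candidates: a list this short would be discarded by the
        # backend anyway, so this sketch simply returns nothing.
        return []

    # Anything beyond max_articles would be dropped, so truncate here.
    return candidate_ids[:max_articles]
```

A real recsystem would rank candidate_ids by some score before truncating; the order used here is simply whatever order the DataFrame arrives in.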

article_interaction

def article_interaction(state, user, articles, article_id, interaction):
    ...

This optional hook function is called every time one of your recsystem’s assigned users interacts with an article in any way. This can be used to make real-time updates to your model of the user, e.g. any statistics you are keeping of the user’s preferences. The User.interactions attribute also keeps a tally of all the user’s interactions with all articles the user has seen. This is updated automatically whenever a user interacts with an article, before your article_interaction function is called. For example, if a user clicked on article 1234, then the following will be true:

user.interactions[1234]['clicked'] == True

So you don’t have to keep track of these basic metrics yourself.

You can see an example implementation of article_interaction in the keywords-based example recsystem. This recsystem keeps scores for keywords found in articles that each user interacts with.

  • The state argument is explained in Recsystem state. In the case of article_interaction you might use the state dict to track running statistics of the user’s preferences, such as similarity scores.

  • The user argument is the User object representing the user.

  • The articles argument is the pandas.DataFrame containing the corpus of articles available to your recsystem. This is similar to the articles argument to recommend, except that it also contains articles that have already been recommended to the user or that the user has interacted with.

  • The article_id argument is the ID of the article the user interacted with. Thus, you can look up the full record for that article by using:

    article = articles.loc[article_id]
    
  • The interaction argument is a dict specifying the type of interaction that took place. Typically it has one or two keywords specifying the type of interaction. Here are the current possibilities:

    • {'recommended': True} this is a special case that just means the user recently refreshed the app and received this article as a recommendation (but has not yet clicked on it or rated it).

    • {'clicked': True} the user clicked on the article to read it.

    • {'rating': 1, 'prev_rating': 0} the user “rated” the article’s interest to them (whether or not they read it). The rating can be either -1 (the article is not interesting), 1 (the article is interesting), or 0 (no opinion). You will only ever see {'rating': 0} if a user previously rated the article and then changed their mind. The 'prev_rating' is the user’s previous rating of the article. This can be used to recalculate scores when a user rates an article but later changes their mind (for example they might rate it 1, but then read the article, decide it wasn’t interesting, and change their rating to -1).

    • {'bookmarked': True} the user added the article to their bookmarks.

    More interaction details will be added in a future version, including the percent-read of the article, and geolocation details (if the user has allowed geolocation).
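To make the role of 'prev_rating' concrete, the following sketch keeps a net rating per article. The "net_ratings" key is a hypothetical name chosen for this example, and the sketch relies on nested state updates being merged as described in State updates.

```python
def article_interaction(state, user, articles, article_id, interaction):
    if 'rating' not in interaction:
        # Clicks, bookmarks, etc. are ignored in this sketch.
        return None

    # "net_ratings" is a made-up key; use whatever layout suits your model.
    net_ratings = state.get('net_ratings', {})

    # Subtracting the previous rating undoes the old vote before the new
    # one is applied, so changing 1 -> -1 moves the score by -2.
    delta = interaction['rating'] - interaction.get('prev_rating', 0)

    # Return only the entry that changed, not the whole state.
    return {'net_ratings': {article_id: net_ratings.get(article_id, 0) + delta}}
```

For instance, a first rating of 1 on article 1234 yields the update {'net_ratings': {1234: 1}}, and a later change to -1 (with 'prev_rating': 1) brings the stored score down to -1.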

initialize and shutdown

def initialize(state, users, articles):
    ...
def shutdown(state):
    ...

These are lifecycle hooks that are called shortly after your recsystem starts up, and before it exits cleanly (where “cleanly” means it is not terminated forcefully such as with kill -9).

These can be used for any additional steps you want to perform when your recsystem starts up or shuts down, such as initializing the state at startup or saving it at shutdown.

See State persistence and the keywords recsystem for an example of how you can load and save your state dict from a pickle file. In the future, state persistence will be handled automatically (see issue #15).

background_*

def background_<name>(state, users, articles):
    ...

If you define any function whose name begins with background_ (the rest of the name is up to you) that function is run repeatedly in the background in an infinite loop. For example if you have a function named background_work it is run (schematically) like:

while True:
    update = background_work(state, users, articles)
    state = apply_state_update(state, update)

This can be used for example to perform intensive calculations that take a long time, and that would otherwise introduce too much latency into functions like recommend(). For example, it could be performing running updates of similarity calculations between articles.

Warning

Be careful about using background_ functions for work that completes very “fast” (e.g. in less than a few milliseconds). See How to Profile Your Code in Python for tips on how to measure the execution speed of your functions.

This is because every background_ function is called repeatedly in an infinite loop, and could create a bottleneck if it is being called too often. For tasks that might be short but that you still want to call periodically, see every_<second|minute|hour|day>.
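One simple way to check whether a function falls into the “fast” category is to time it yourself. The sketch below uses only the standard library; average_call_seconds is a hypothetical helper written for this example, and background_work is a stand-in for your own hook.

```python
import time


def average_call_seconds(func, *args, repeats=100):
    """Return the mean wall-clock time of one call to func(*args)."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats


def background_work(state, users, articles):
    # Stand-in for a real background hook.
    return {'total': sum(range(1000))}


elapsed = average_call_seconds(background_work, {}, [], [])
print(f'background_work takes about {elapsed * 1000:.3f} ms per call')
```

If the reported time is well under a few milliseconds, a periodic hook is probably a better fit than a background_ function.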

every_<second|minute|hour|day>

def every_<second|minute|hour|day>(state, users, articles):
    ...

or

def every_<n>_<seconds|minutes|hours|days>(state, users, articles):
    ...

These are like background_ but allow you to define hook functions that are scheduled periodically. For example, if you write a function named every_minute_calculate_scores that function will be called once every minute.

Alternatively, you can use a name scheme like every_30_seconds_calculate_scores to run the function every 30 seconds.

The time units “seconds”, “minutes”, “hours”, and “days” are available.

The function is re-scheduled after its last call completes. So for example if you have a function that is called every second, but it takes more than a second to complete, its next call will be one second after it completed.

In other words, you won’t have multiple calls of the same periodic hook running simultaneously. A sensible period is therefore one that is at least as long as the hook function usually takes to run.
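To illustrate the naming scheme, the sketch below shows one way such function names could be mapped to a period in seconds. It is not the parser used by renewal_recsystem; period_in_seconds and the regular expression are assumptions made purely for this example.

```python
import re

# Seconds per unit, keyed by the singular form of the unit name.
_UNIT_SECONDS = {'second': 1, 'minute': 60, 'hour': 3600, 'day': 86400}

# Matches "every_<unit>" and "every_<n>_<units>", optionally followed by
# "_<anything>" (the descriptive part of the name).
_PERIODIC_NAME = re.compile(
    r'^every_(?:(\d+)_)?(second|minute|hour|day)s?(?:_|$)')


def period_in_seconds(func_name):
    """Return the period encoded in func_name, or None if there is none."""
    match = _PERIODIC_NAME.match(func_name)
    if match is None:
        return None
    count = int(match.group(1) or 1)
    return count * _UNIT_SECONDS[match.group(2)]
```

Under these assumptions, every_minute_calculate_scores maps to 60 seconds and every_30_seconds_calculate_scores maps to 30 seconds, while a name such as recommend maps to None.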

Recsystem state

Here we explain the use of the state argument that is passed as the first argument to all hook functions.

The state argument is a Python dict which may contain any number of nested dicts. It’s your recsystem’s own work area where it can store any data specific to your recsystem’s functionality. For example, say you are performing sentiment analysis on articles. You would like to periodically compute sentiment scores for articles, and you will need a place to save these scores (in order to avoid recomputing them).

You could add a key to your state named "article_sentiments" containing a dictionary mapping article IDs to the sentiment analysis results. In this case the state (or this portion of it) could look something like:

{
    "article_sentiments": {
        12345: "happy",
        12346: "sad",
        12347: "neutral"
    }
}

Note

Technical note for the curious: You may ask “Why do I need to pass this state argument around? Why can’t I just use a global variable?”

In many cases using a global variable will not work: to let your recsystem handle many events concurrently, your hook functions may be run in separate processes. If you use a global variable, changes you make to its value will not be propagated correctly to the whole system.

This is also why your hook functions should return State updates.

The keys in the state dict may be any type that can be used as a dictionary key in Python (strings, integers, tuples, etc). However, keys and values must be able to be pickled. Fortunately, this is true for most types you will likely encounter in Python data science, such as Numpy arrays and Pandas DataFrames, etc.
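If you are unsure whether something you store in the state can be pickled, a quick round trip through the pickle module will tell you. The helper name is_picklable is hypothetical and only serves this example.

```python
import pickle


def is_picklable(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


# Plain containers with string, integer and tuple keys are fine.
print(is_picklable({'article_sentiments': {12345: 'happy', (1, 2): [0.5]}}))

# Lambdas are a common example of something that cannot be pickled.
print(is_picklable({'scorer': lambda text: len(text)}))
```

Running a check like this during development catches unpicklable values (lambdas, open files, locks) before they end up in the state.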

State updates

Most of the hook functions defined may return a value referred to as a “state update” performed by that function. It informs the system which parts of the state you want your hook function to modify.

The state update is also a dict, but you should not simply modify the original state dict and return it. This could result in your hook functions overstepping each other and clobbering each other’s results. Instead, each call to a hook function should only return a dict representing the parts of the state changed by that call. This update will be automatically merged into the “real” state that will be passed to future hook calls.

Returning to the previous example, if you have a function every_10_seconds_perform_sentiment_analysis to update the sentiment analysis for new articles, and it finds a new article with ID 12456, the hook function should return a state update like:

{
    "article_sentiments": {
        12456: "mixed"
    }
}

This informs the system that there is a new key/value pair to add to "article_sentiments" and that no other part of the state needs to be touched.
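One way to picture the merge is as a recursive dictionary update. The sketch below is only a mental model consistent with the example above; merge_state_update is a hypothetical function, and the actual merge performed by renewal_recsystem may differ in its details.

```python
def merge_state_update(state, update):
    """Return a new dict with update merged recursively into state."""
    merged = dict(state)
    for key, value in (update or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: merge them instead of replacing.
            merged[key] = merge_state_update(merged[key], value)
        else:
            # New key, or a non-dict value: the update wins.
            merged[key] = value
    return merged


state = {
    'article_sentiments': {12345: 'happy', 12346: 'sad', 12347: 'neutral'}
}
update = {'article_sentiments': {12456: 'mixed'}}
merged = merge_state_update(state, update)
print(merged)
```

In this model the three existing sentiments are kept and the entry for article 12456 is added alongside them.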

Special case: recommend()

With one exception, the return value of every hook function is a state update (or no return value if you have a hook function that does not update the state). The shutdown() hook is also a corner case since any state update it returns will be ignored, as the system is shutting down.

However, the recommend() function normally returns a list of article IDs, not a state update. If you have a recommend() function in which you also want it to update the state (e.g. maybe to keep some statistics on how many recommendations it’s made to each user) it can return a tuple: (recommendations, state_update).
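A minimal sketch of the tuple form might look like the following. It assumes the articles DataFrame is indexed by article ID, and "recommendations_served" is a hypothetical key that counts recommendations across all users rather than per user.

```python
def recommend(state, user, articles, min_articles, max_articles):
    # Assumption: the DataFrame index holds the article IDs.
    recommendations = list(articles.index)[:max_articles]

    # "recommendations_served" is a made-up key for this example.
    served = state.get('recommendations_served', 0) + len(recommendations)

    # First element: the recommendations; second element: the state update.
    return recommendations, {'recommendations_served': served}
```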

State persistence

Currently (though this might change in the future) the state is not persisted automatically. That is, when you shut down your recsystem and start it up again, it will always start with an empty state ({}).

Naturally, you will probably want to be able to keep your recsystem’s data over the course of a competition. Currently, the best way to do that is to define some initialize and shutdown hooks like:

import logging
import os
import pickle

log = logging.getLogger(__name__)

STATE_FILENAME = __name__ + '.pickle'

def initialize(state, users, articles):
    # This runs every time your recsystem starts up
    if os.path.isfile(STATE_FILENAME):
        with open(STATE_FILENAME, 'rb') as fobj:
            state = pickle.load(fobj)

        log.info(f'loaded previous state from {STATE_FILENAME}')

    return state

def shutdown(state):
    # This runs every time you stop the recsystem cleanly and in most
    # cases if it crashes
    save_state(state)

def save_state(state):
    pickled_state = pickle.dumps(state)

    with open(STATE_FILENAME, 'wb') as fobj:
        fobj.write(pickled_state)

    log.info(f'saved updated state to {STATE_FILENAME}')

To protect against unfortunate catastrophes (e.g. your computer crashes) you might also want to periodically save state updates:

def every_30_seconds(state, users, articles):
    save_state(state)
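Because a crash could also happen while the pickle file is being written, you might prefer a variant that writes to a temporary file first and then renames it into place. This is an optional hardening sketch, not part of renewal_recsystem; save_state_atomically is a hypothetical name, and unlike save_state above it takes the target filename as an argument.

```python
import os
import pickle
import tempfile


def save_state_atomically(state, filename):
    # Create the temporary file next to the target so that the final
    # rename stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_name = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as fobj:
            pickle.dump(state, fobj)
        # os.replace swaps the new file in, overwriting any old one.
        os.replace(tmp_name, filename)
    except BaseException:
        # Clean up the partial temporary file, then re-raise.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

With this approach the previous pickle file is left untouched until the new one has been written completely.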

Alternatives

The state dict is provided as a quick and convenient space to store your recsystem’s data at runtime. Its use is purely optional. For example, some contestants might choose instead to use an external storage method for their recsystem’s data, such as a database (SQLite, MongoDB, Redis, etc.). This is perfectly allowed.

A combination of the two can also be used, such as using state as a cache, but using a database for longer-term persistence. The choice is yours!

Async hook functions

Todo

Explain how to write hook functions with async def instead of def, what this means, and when and why to use it.