Recsystem Interface Documentation¶
If you’ve worked through the Quickstart Guide you’ve seen the basics of how to write and run a recsystem for Renewal competitions.
That guide introduced the main function all recsystems have to implement: recommend(). But it did not delve into all the details that might be needed to implement a non-trivial recsystem.
This guide lists all of the other special “hook” functions you can write to make your recsystem respond to user activity from the mobile apps, and also explains how you can use the state dict to hold data specific to your recommendation algorithm.
The hook module¶
This is the main entry-point to your recommendation system. The renewal_recsystem package handles all the heavy lifting of managing the network protocols and concurrency, so that you can focus on writing a few functions that tell the package how you want to provide recommendations to users, how to collect data on users’ activity, and what additional background work you want the recsystem to perform.
The hook module is the Python file you pass as the argument to the python -m renewal_recsystem script. If your code is complex enough that it needs to grow beyond one file, your hook module is free to import code from other files, as well as from any Python packages installed on your system.
See the next section for a complete list of the functions understood by renewal_recsystem that you can include in your hook module.
Available hook functions¶
The following hook functions are pre-defined by the system. Some of them have exact names and some of them have naming patterns you can follow. All of them should be implemented with the exact call signatures defined here.
recommend¶
def recommend(state, user, articles, min_articles, max_articles):
    ...
    return [<list of recommendations>]
This is the only function that is required to be implemented in your hook module. It is called every time one of the users assigned to your recsystem requests a list of news recommendations.
In a future version it may also be called periodically by the Backend in order to pre-queue recommendations for users, but from the perspective of your recsystem the two cases are no different (except that you should make sure to return unique sets of recommendations on each call, for each user).
The state argument is explained in Recsystem state.
The user argument is a User object representing the user you are making recommendations for.
The articles argument is a pandas.DataFrame representing candidate articles to recommend to the user. To the extent possible (outside race conditions) this contains articles not already recommended to that user by any recsystem.
The min_articles and max_articles arguments are hints specifying how many recommendations the function should return. If fewer than min_articles are returned, this recommendation list will be discarded by the backend, as there are not enough recommendations to make a meaningful head-to-head contest with the other recsystem(s) assigned to the user. Returning more than max_articles does not currently disqualify your recommendations, but recommendations beyond the first max_articles returned will be discarded.
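As a minimal sketch of the contract described above, the following recommend() simply returns the first max_articles candidates in whatever order they arrive. It assumes the articles DataFrame is indexed by article ID (as the articles.loc[article_id] lookup elsewhere in this guide suggests) and makes no attempt at ranking; the 'title' column in the demo corpus is a made-up stand-in.

```python
import pandas as pd


def recommend(state, user, articles, min_articles, max_articles):
    # Assumption: the DataFrame index holds the article IDs.
    candidate_ids = articles.index.tolist()

    # Anything past max_articles would be discarded by the backend anyway.
    # If fewer than min_articles candidates exist, the short list is still
    # returned here and the backend will discard it.
    return candidate_ids[:max_articles]


# Usage with a tiny stand-in corpus (the 'title' column is hypothetical).
demo_articles = pd.DataFrame(
    {'title': ['a', 'b', 'c', 'd']}, index=[10, 11, 12, 13])
print(recommend({}, None, demo_articles, 2, 3))  # [10, 11, 12]
```

A real recsystem would sort or score the candidates before slicing; the slice itself is the part that respects max_articles.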
article_interaction¶
def article_interaction(state, user, articles, article_id, interaction):
    ...
This optional hook function is called every time one of your recsystem’s assigned users interacts with an article in any way. This can be used to make real-time updates to your model of the user, e.g. any statistics you are keeping of the user’s preferences. The User.interactions attribute also keeps a tally of all the user’s interactions with all articles the user has seen. This is updated automatically whenever a user interacts with an article, before your article_interaction function is called. For example, if a user clicked on article 1234, then the following will be true:
user.interactions[1234]['clicked'] == True
So you don’t have to keep track of these basic metrics yourself.
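For instance, a small helper could collect every article a user has clicked on from that tally. This is only a sketch: it assumes User.interactions behaves like a dict mapping article IDs to dicts of interaction flags (which is what the user.interactions[1234]['clicked'] lookup suggests), and clicked_article_ids and demo_user are names invented for this example.

```python
from types import SimpleNamespace


def clicked_article_ids(user):
    """Return the IDs of all articles this user has clicked on."""
    # Assumption: user.interactions maps article IDs to dicts of
    # interaction flags, as in user.interactions[1234]['clicked'].
    return [
        article_id
        for article_id, flags in user.interactions.items()
        if flags.get('clicked')
    ]


# Stand-in for a real User object, for illustration only.
demo_user = SimpleNamespace(interactions={
    1234: {'recommended': True, 'clicked': True},
    1235: {'recommended': True},
})
print(clicked_article_ids(demo_user))  # [1234]
```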
You can see an example implementation of article_interaction in the keywords-based example recsystem. This recsystem keeps scores for keywords found in articles that each user interacts with.
The state argument is explained in Recsystem state. In the case of article_interaction you might use the state dict to track running statistics of the user’s preferences, such as similarity scores.
The user argument is the User object representing the user.
The articles argument is the pandas.DataFrame containing the corpus of articles available to your recsystem. This is similar to the articles argument to recommend except it also contains articles the user has not interacted with yet.
The article_id argument is the ID of the article the user interacted with. Thus, you can look up the full record for that article by using: article = articles.loc[article_id]
The interaction argument is a dict specifying the type of interaction that took place. Typically it has one or two keys specifying the type of interaction. Here are the current possibilities:
{'recommended': True}: this is a special case that just means the user recently refreshed the app and received this article as a recommendation (but has not yet clicked on it or rated it).
{'clicked': True}: the user clicked on the article to read it.
{'rating': 1, 'prev_rating': 0}: the user “rated” the article’s interest to them (whether or not they read it). The rating can be either -1 (the article is not interesting), 1 (the article is interesting), or 0 (no opinion). You will only ever see {'rating': 0} if a user previously rated the article and then changed their mind. The 'prev_rating' is the user’s previous rating of the article. This can be used to recalculate scores in case a user rates an article but later changes their mind (for example they might rate it 1, but then read the article, decide it wasn’t interesting, and change their rating to -1).
{'bookmarked': True}: the user added the article to their bookmarks.
More interaction details will be added in a future version, including the percent-read of the article, and geolocation details (if the user has allowed geolocation).
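To illustrate how 'prev_rating' can be used, here is a hedged sketch of an article_interaction hook that keeps a net rating per article and corrects it when a user changes their mind. The 'net_ratings' state key is a hypothetical name chosen for this example, and the hook returns a state update rather than modifying state in place.

```python
def article_interaction(state, user, articles, article_id, interaction):
    if 'rating' not in interaction:
        # Not a rating event (e.g. 'clicked' or 'bookmarked'): no update.
        return None

    # Subtract the previous rating so that a changed rating is not
    # counted twice.
    delta = interaction['rating'] - interaction.get('prev_rating', 0)

    # 'net_ratings' is a hypothetical state key chosen for this example.
    current = state.get('net_ratings', {}).get(article_id, 0)
    return {'net_ratings': {article_id: current + delta}}
```

Because the update carries a total computed from the state the hook was given, concurrent calls for the same article could overwrite each other; treat this as an illustration of the bookkeeping only.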
initialize and shutdown¶
def initialize(state, users, articles):
    ...
def shutdown(state):
    ...
These are lifecycle hooks that are called shortly after your recsystem starts up, and before it exits cleanly (where “cleanly” means it is not terminated forcefully such as with kill -9).
These can be used for any additional steps you want to perform when your recsystem starts up or shuts down, such as initializing the state at startup or saving it at shutdown.
See State persistence and the keywords recsystem for an example of how you can load and save your state dict from a pickle file. In the future, however, state persistence will be handled automatically (see issue #15).
background_*¶
def background_<name>(state, users, articles):
    ...
If you define any function whose name begins with background_ (the rest of the name is up to you) that function is run repeatedly in the background in an infinite loop. For example if you have a function named background_work it is run (schematically) like:
while True:
    update = background_work(state, users, articles)
    state = apply_state_update(state, update)
This can be used for example to perform intensive calculations that take a long time, and that would otherwise introduce too much latency into functions like recommend(). For example, it could be performing running updates of similarity calculations between articles.
Warning
Be careful about using background_ functions for work that completes very quickly (e.g. in less than a few milliseconds). See How to Profile Your Code in Python for tips on how to measure the execution speed of your functions.
This is because every background_ function is called repeatedly in an infinite loop, and could create a bottleneck if it is being called too often. For tasks that might be short but that you still want to call periodically, see every_<second|minute|hour|day>.
every_<second|minute|hour|day>¶
def every_<second|minute|hour|day>(state, users, articles):
    ...
or
def every_<n>_<seconds|minutes|hours|days>(state, users, articles):
    ...
These are like background_ functions but allow you to define hook functions that are scheduled periodically. For example, if you write a function named every_minute_calculate_scores that function will be called once every minute.
Alternatively, you can use a name scheme like every_30_seconds_calculate_scores to run the function every 30 seconds. The time units “seconds”, “minutes”, “hours”, and “days” are available.
The function is re-scheduled after its last call completes. So for example if you have a function that is called every second, but it takes more than a second to complete, its next call will be one second after it completed.
In other words, you won’t have multiple calls of the same periodic hook running simultaneously. So you might choose a period that represents an upper bound on the time performance of the hook function.
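One way to keep a periodic hook comfortably within its period is to cap the amount of work done per call. The sketch below is an illustration under several assumptions: the 'article_scores' state key, the BATCH_SIZE constant and the slow_score placeholder are all invented for this example, and articles is assumed to be indexed by article ID.

```python
BATCH_SIZE = 50  # hypothetical cap on the work done per call


def slow_score(row):
    # Placeholder for an expensive per-article computation; here it just
    # counts the non-null fields of the article record.
    return int(row.notna().sum())


def every_30_seconds_score_new_articles(state, users, articles):
    # 'article_scores' is a hypothetical state key for this example.
    scores = state.get('article_scores', {})

    # Only score articles that have no score yet, at most BATCH_SIZE per
    # call, so a single call stays well under the 30 second period.
    pending = [aid for aid in articles.index.tolist() if aid not in scores]
    batch = pending[:BATCH_SIZE]

    # Return only the newly computed entries as the state update.
    return {'article_scores': {aid: slow_score(articles.loc[aid])
                               for aid in batch}}
```

Any articles left over after one batch are picked up on the next scheduled call.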
Recsystem state¶
Here we explain the use of the state argument that is passed as the first argument to all hook functions.
The state argument is a Python dict which may contain any number of nested dicts. It’s your recsystem’s own work area where it can store any data specific to your recsystem’s functionality. For example, say you are performing sentiment analysis on articles. You would like to periodically compute sentiment scores for articles, and you will need a place to save these scores (in order to avoid recomputing them).
You could add a key to your state named "article_sentiments" containing a dictionary mapping article IDs to the sentiment analysis results. In this case the state (or this portion of it) could look something like:
{
    "article_sentiments": {
        12345: "happy",
        12346: "sad",
        12347: "neutral"
    }
}
Note
Technical note for the curious: You may ask “Why do I need to pass this state argument around? Why can’t I just use a global variable?”
In many cases using a global variable will not work, because in order to keep your recsystem able to handle many events concurrently, your hook functions may be run in separate processes. If you use a global variable for this, changes you make to its value will not be propagated correctly to the whole system.
This is also why your hook functions should return State updates.
The keys in the state dict may be any type that can be used as a dictionary key in Python (strings, integers, tuples, etc). However, keys and values must be able to be pickled. Fortunately, this is true for most types you will likely encounter in Python data science, such as Numpy arrays and Pandas DataFrames, etc.
State updates¶
Most of the hook functions defined may return a value referred to as a “state update” performed by that function. It informs the system which parts of the state you want your hook function to modify.
The state update is also a dict, but you should not simply modify the original state dict and return it. This could result in your hook functions overstepping each other and clobbering each other’s results. Instead, each call to a hook function should only return a dict representing the parts of the state changed by that call. This update will be automatically merged into the “real” state that will be passed to future hook calls.
Returning to the previous example, if you have a function every_10_seconds_perform_sentiment_analysis to update the sentiment analysis for new articles, and it finds a new article with ID 12456, the hook function should return a state update like:
{
    "article_sentiments": {
        12456: "mixed"
    }
}
This informs the system that there is a new key/value pair to add to "article_sentiments" and that no other part of the state needs to be touched.
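To make the effect of such an update concrete, the sketch below applies it with a simple recursive merge. The merge_update function is a hypothetical illustration written for this guide, not the code renewal_recsystem actually uses, and the package’s real merge semantics (for example how non-dict values are replaced) may differ.

```python
def merge_update(state, update):
    """Hypothetical recursive merge; NOT the package's actual code."""
    merged = dict(state)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested dicts key by key instead of replacing them.
            merged[key] = merge_update(merged[key], value)
        else:
            merged[key] = value
    return merged


state = {
    'article_sentiments': {12345: 'happy', 12346: 'sad', 12347: 'neutral'},
}
update = {'article_sentiments': {12456: 'mixed'}}

new_state = merge_update(state, update)
print(sorted(new_state['article_sentiments']))
# [12345, 12346, 12347, 12456]
```

The existing three entries survive because the update names only the new key.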
Special case: recommend()¶
With one exception, the return value of every hook function is a state update (or no return value if you have a hook function that does not update the state). The shutdown() hook is also a corner case since any state update it returns will be ignored, as the system is shutting down.
However, the recommend() function normally returns a list of article IDs, not a state update. If you want your recommend() function to also update the state (e.g. to keep some statistics on how many recommendations it’s made to each user) it can return a tuple: (recommendations, state_update).
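A minimal sketch of the tuple form might look like the following. It keeps a single running total rather than per-user statistics, because the attributes of the User object are not described here; the 'recommendations_served' key is a hypothetical name, and articles is assumed to be indexed by article ID.

```python
def recommend(state, user, articles, min_articles, max_articles):
    # Assumption: the DataFrame index holds the article IDs.
    recommendations = articles.index[:max_articles].tolist()

    # 'recommendations_served' is a hypothetical state key for this example.
    served = state.get('recommendations_served', 0)
    state_update = {
        'recommendations_served': served + len(recommendations),
    }

    # Returning a tuple tells the system that the second element is a
    # state update to merge.
    return recommendations, state_update
```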
State persistence¶
Currently (though this might change in the future) the state is not persisted automatically. That is, when you shut down your recsystem and start it up again, it will always start with an empty state ({}).
Naturally, you will probably want to be able to keep your recsystem’s data over the course of a competition. Currently, the best way to do that is to define some initialize and shutdown hooks like:
import logging
import os
import pickle

log = logging.getLogger(__name__)
STATE_FILENAME = __name__ + '.pickle'


def initialize(state, users, articles):
    # This runs every time your recsystem starts up
    if os.path.isfile(STATE_FILENAME):
        with open(STATE_FILENAME, 'rb') as fobj:
            state = pickle.load(fobj)
        log.info(f'loaded previous state from {STATE_FILENAME}')
    return state


def shutdown(state):
    # This runs every time you stop the recsystem cleanly and in most
    # cases if it crashes
    save_state(state)


def save_state(state):
    pickled_state = pickle.dumps(state)
    with open(STATE_FILENAME, 'wb') as fobj:
        fobj.write(pickled_state)
    log.info(f'saved updated state to {STATE_FILENAME}')
To protect against unfortunate catastrophes (e.g. your computer crashes) you might also want to periodically save state updates:
def every_30_seconds(state, users, articles):
    save_state(state)
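If the process is killed while the pickle file is being written, the file on disk could be left truncated. One possible hardening, sketched here as a suggestion rather than as anything renewal_recsystem requires, is to write to a temporary file in the same directory and then rename it over the old one. The save_state_atomically name and its filename parameter are hypothetical.

```python
import os
import pickle
import tempfile


def save_state_atomically(state, filename):
    # Create the temporary file next to the target so that the final
    # rename stays on the same filesystem.
    directory = os.path.dirname(os.path.abspath(filename))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as fobj:
            pickle.dump(state, fobj)
        # os.replace swaps the new file in; the old pickle stays intact
        # until the new one has been fully written.
        os.replace(tmp_path, filename)
    except BaseException:
        # Clean up the partial temporary file, then re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A periodic hook could then call save_state_atomically(state, STATE_FILENAME) in place of save_state(state).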
Alternatives¶
The state dict is provided as a quick and convenient space to store your recsystem’s data at runtime. Its use is purely optional. For example, some contestants might choose instead to use an external storage method for their recsystem’s data, such as a database (SQLite, MongoDB, Redis, etc.). This is perfectly allowed.
A combination of the two can also be used, such as using state as a cache, but using a database for longer-term persistence. The choice is yours!
Async hook functions¶
Todo
Explain how to write hook functions with async def instead of def, what this means, and when and why to use it.