How to have fun with computing in your hobby time



This is what I have done so far with another Kaggle competition, the Event Recommendation Engine Challenge. I came in a bit late, with only ten days left before the public leaderboard closed. That turned out to be a good thing for me, as I usually find it easier to convince myself to spend spare time on competitions when they are about to finish. : ). At the time of writing, my best rank is 8th on the public board. Since I cannot spend more time on this, I have teamed up with a friend to hand it over, and decided to write up my story here before the competition ends. Hopefully someone will find it useful and make further improvements based on it.

Problem Understanding – collaborative filtering, classification, or ranking/regression?

1. collaborative filtering?

Before this I did not have much experience with recommendation systems, so at the beginning it seemed like a good idea to formalize the problem as recommendation modeling, as the title of the competition suggests. But after some reading and hands-on experiments (mainly with the collaborative filtering package in Mahout), I found that collaborative filtering might not be the best model in this case, for the following reasons:

  1. User-based or item-based collaborative filtering (also known as collective intelligence) relies heavily on the assumption that an effective similarity measure can be built among user/item groups. In practice, this usually means that for those models to be useful, there must be reasonable overlap between the transactions of different users/items. Unfortunately this is not true for this problem: the system built on the Mahout CF package could not make any recommendation for about 40% of the users, due to the lack of overlap between them. Even after enlarging the training set by including more transactions from the event_attendee and event_creator data, the transactions are still too sparse to make useful recommendations. Even worse, some users/events in the test set never appear in the training set; this is usually called cold-start (or anonymous) recommendation, and it is almost equivalent to a blind guess.
  2. Recommendations made by collaborative filtering models are usually open-list: the candidates come from the whole set of items (events, in this case). However, this is NOT what the competition requires; we are actually asked to RANK a closed list of events for a given user.
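The sparsity problem in point 1 is easy to check empirically. Below is a minimal sketch (with toy data standing in for the parsed attendance pairs; the variable names are mine, not from the author's code) of how one might measure what fraction of users share at least one event with anyone else:

```python
# Sketch: estimating how much user-item overlap exists in attendance
# data, to judge whether neighborhood-based CF is even viable.
# The (user, event) pairs below are a toy stand-in for the real data.
from collections import defaultdict
from itertools import combinations

pairs = [("u1", "e1"), ("u1", "e2"), ("u2", "e2"), ("u3", "e3")]

events_by_user = defaultdict(set)
for user, event in pairs:
    events_by_user[user].add(event)

# count users who share at least one event with some other user
users = list(events_by_user)
connected = set()
for a, b in combinations(users, 2):
    if events_by_user[a] & events_by_user[b]:
        connected.update([a, b])

coverage = len(connected) / len(users)
print(f"{coverage:.0%} of users have any overlap")  # u3 has none here
```

On the real data, a low coverage number like the ~60% the author observed means a neighborhood-based recommender simply has nothing to work with for the remaining users.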

2. classification?

Modeling it as a classification problem does not feel right either, unless we reduce the output to a binary label (interested_event/noninterested_event). That is not very satisfying, considering we are supposed to rank the events by a specific user's interest in them; in other words, the output is ordinal rather than nominal.

3. ranking/regression?

So how about ranking and/or regression modeling? I have not found any clear-cut guideline in the literature on how to distinguish ranking problems from regression problems, though they are handled slightly differently in practice. For example, their accuracy measures usually differ, reflecting the different things people care about in each. I chose to model the problem as a regression simply because (1) there are currently more regression models available than ranking models, and (2) a user's interest is not only ordinal but also numeric, i.e., the difference between interest degrees is also meaningful.

Data Understanding – what is the story behind them?

One of my favorite tricks for understanding data is to come up with a story line that connects as many pieces of the data as possible. This also helps me find potential inputs and outputs. Here are some of my thoughts.
(1) Suppose I myself am a user of the recommendation system. First I have to register as a user with my particulars (age, language, city, etc.); this is captured in the users.csv data. It describes the "static" nature of myself, because it does not depend on anyone other than me and it does not change over time.

(2) Later I make friends with some other people who are also using the service; this is captured in my friend list in the user_friends.csv file. My friend list tells more about me than my static particulars, because people usually make friends based on common interests, which are not easily seen from their profiles alone.

(3) Someone (a stranger, a friend of mine, or myself) creates an event one day, registering it with its description, location, and so on. This "static" information about events is captured in events.csv.

(4) The created events start to develop "dynamic" properties when invitations are sent out to people (e.g. by the system, by friends of friends, or by the creators). Those dynamic properties are shaped as different types of responses come back, e.g. from invited people who are not interested, or from people who were not invited but turn out to be interested. Part of this information is captured in event_attendees.csv, but for it to be useful we must connect it with other pieces, such as users' friend lists.

(5) In the end, if I have access to all the created events, I want to know which events I will be interested in. First I will consider the static properties of myself and the events: do I have enough time to schedule it? Is it in the city I live in? Is the event popular enough? Is the topic of the event interesting in general? Is it on a holiday/weekend? Then I will also consider the dynamic properties between the events and me: am I invited? Are many of my friends going, or not going? Are they invited? Am I an acquaintance of the creator? Can I meet friends of my friends at the event and make new friends? Those are potential features to think about when I start building the models.

Finding Inputs & Outputs

On the input side, here are some features that I am currently using in my code, based on my understanding described above.

  1. is_user_invited
  2. user_locale
  3. is_user_in_event_city
  4. is_user_in_event_country
  5. user_gender
  6. is_event_creator_a_friend_of_user
  7. num/ratio_of people interested/invited/not_interested/maybe_interested in an event
  8. num/ratio_of friends interested/invited/not_interested/maybe_interested in an event
  9. time_difference_between_notification_and_event_start
  10. user_age
  11. event_cluster_by_topic
  12. user_cluster_by_community
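Most of the features in this list reduce to joins and set intersections over the parsed files. As one illustration, here is a minimal sketch of feature 8, the ratio of a user's friends who responded "yes" to an event (the dictionaries are toy stand-ins for parsed user_friends.csv and event_attendees.csv; the names are mine):

```python
# Sketch of feature 8: the fraction of a user's friends who said
# "yes" to a given event. Toy data; real code would build these
# dicts from user_friends.csv and event_attendees.csv.
friends = {"u1": {"u2", "u3", "u4"}}     # user -> set of friend ids
yes_attendees = {"e1": {"u2", "u5"}}     # event -> users who said yes

def friends_yes_ratio(user, event):
    f = friends.get(user, set())
    if not f:
        return 0.0                       # no friends -> no signal
    return len(f & yes_attendees.get(event, set())) / len(f)

print(friends_yes_ratio("u1", "e1"))     # 1 of 3 friends said yes
```

The interested/not_interested/maybe_interested variants are the same computation against the other response columns.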

Most of the input features are straightforward to calculate, except the last two. The topics of events are derived from the frequencies of common terms, as given in the events data. I used k-means clustering to group related events into clusters, which are analogous to event themes in practice.
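The event-topic clustering can be sketched as follows: each event becomes a term-frequency vector (the word-count columns in events.csv), and k-means groups events with similar vocabularies. The toy matrix and k=2 below are illustrative only, not the author's actual settings:

```python
# Sketch: clustering events by their term-frequency vectors.
# Toy data: four events over a 3-term vocabulary; real events.csv
# rows carry ~100 term-count columns.
import numpy as np
from sklearn.cluster import KMeans

term_counts = np.array([
    [5, 0, 0],   # two events dominated by term 0
    [4, 1, 0],
    [0, 0, 6],   # two events dominated by term 2
    [0, 1, 5],
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(term_counts)
print(labels)    # events 0,1 share a cluster; events 2,3 share the other
```

The resulting cluster id then becomes the categorical feature event_cluster_by_topic.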

User clustering is a little different, because what we have in the data are connections between friends, which are better modeled as a graph. Finding clusters in graphs is usually studied as a community detection problem.
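A minimal sketch of that idea: build an undirected friendship graph from user_friends.csv and run a community detection algorithm over it. Greedy modularity maximization is used here as one common choice; the post does not say which algorithm the author actually used, and the edges below are toy data:

```python
# Sketch: community detection on a friendship graph.
# Two small friend circles joined by a single "bridge" edge.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("a", "c"),   # circle 1
                  ("x", "y"), ("y", "z"), ("x", "z"),   # circle 2
                  ("c", "x")])                          # weak bridge

communities = greedy_modularity_communities(G)
print([sorted(c) for c in communities])   # roughly the two triangles
```

The community id a user falls into becomes the feature user_cluster_by_community.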

Formalizing the output is even more important, in my opinion. If the levels of the output are too coarse (say just 1, 2, 3), they will not be enough to break ties when ranking. On the other hand, if they are too "refined", we risk making things up, because we do not actually have appropriate target values directly from the training set. It makes more sense to build the target step by step in a hierarchical way, forming different sub-targets and combining them later.

When it comes to finding a good target, the most useful thing is usually domain knowledge, which I do not really have in this case. Here is my best guess. I use three pieces of information in the target output: (1) whether the user is invited, (2) whether the user showed interest or no interest, and (3) the ratio of people who are both invited and show interest.

I use (1) and (2) to form the main ranks 1-6 of the output: rank 1 for people who are invited but show no interest, up to rank 6 for people who are not invited but end up showing interest. I use (3) as a refined adjustment of the output, because many people in the training set are not invited and show neither interest nor disinterest; those people are ranked 4 in terms of interest degree. Adding (3) helps distinguish users further within each major rank group, which can also be an advantage for machine-learning models that are easily biased by the dominating groups in a dataset.
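The hierarchical target can be sketched like this. The post pins down only the endpoints (rank 1, rank 4, rank 6), so the assignment of the middle ranks below is my reading of the scheme, not the author's exact code; the fractional adjustment is the event-level ratio described in (3):

```python
# Sketch of the hierarchical target: a coarse 1-6 rank from the
# (invited, interested, not_interested) flags, plus a fractional
# tie-breaker. Middle-rank assignments are guesses; only ranks
# 1, 4, and 6 are stated explicitly in the post.
def base_rank(invited, interested, not_interested):
    if interested:
        return 6 if not invited else 5   # 6 = not invited, yet interested
    if not_interested:
        return 1 if invited else 2       # 1 = invited, but no interest
    return 4 if not invited else 3       # 4 = not invited, no response

def target(invited, interested, not_interested, invited_yes_ratio):
    # (3): the event's invited-and-interested ratio breaks ties
    # within each base-rank group
    return base_rank(invited, interested, not_interested) + invited_yes_ratio

print(target(False, True, False, 0.25))  # 6.25
```

Keeping the adjustment strictly below 1 ensures it only reorders users within a rank group, never across groups.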

But this set-up of the output brings up another issue: there is now a smell of target leakage, because some information is used in both the inputs and the outputs. It should not be as bad as in a prediction problem, since our final goal is to rank different events. It could be over-fitting, but so far the results show that including the leaked-to-output information in the inputs gives better performance than excluding it.

The modeling of inputs and outputs is extensible. For example, in the future we may also want to cluster users by other criteria, such as the proportion of males/females among their friends, or even the degrees of their friends in the graph.

Putting them together – implementations

I am usually comfortable doing data munging in Python and on the command line, and exploring and trying different data models in R. One of the challenges here is the relatively large events.csv (about 1.1 GB, containing 3 million records). It is certainly doable with Mahout + Hadoop, but those solutions seem like overkill, since all the other files are small.
What I really need are computing tools that let me handle the data out-of-memory, without loading the whole dataset into RAM. I tried both pandas and HDF5 in Python, and bigmemory and biganalytics in R. All of them work well for this problem. I used "foreach" and "randomForest" in R to train the regression model for interest-rank prediction.
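On the pandas side, the simplest out-of-memory pattern is to stream the file in chunks and accumulate only the aggregates you need. A minimal sketch (the in-memory CSV and its column names are stand-ins for the real events.csv layout):

```python
# Sketch: streaming a large CSV in chunks with pandas instead of
# loading all rows at once. io.StringIO stands in for the 1.1 GB
# events.csv; column names are assumptions about its layout.
import io
import pandas as pd

csv = io.StringIO("event_id,city\n1,NYC\n2,London\n3,NYC\n")

city_counts = {}
for chunk in pd.read_csv(csv, chunksize=2):   # 2 rows per chunk here
    for city, n in chunk["city"].value_counts().items():
        city_counts[city] = city_counts.get(city, 0) + n

print(city_counts)   # {'NYC': 2, 'London': 1}
```

With a realistic chunksize (say 100,000 rows), peak memory stays bounded no matter how large the file grows.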

My code can be found here


UPDATE: The competition is over, and both the public and private scores (based on different sets of test data) are now out. It turned out there was a big gap between the two in my submissions: my highest public score was achieved with the strategy described above, whereas my highest private score came from a different target-output strategy, in which the targets were generated using (1) and (2) only, without (3). That makes sense: firstly, there is a risk of target leakage in those settings; and secondly, target strategy (3), the one based on the ratio of people who are both invited and show interest, was too refined and so risked overfitting the data. But to be frank, unless we see more data, I do not think we can tell beforehand whether such a refined strategy is too much.

A lot of other Kagglers are publishing their solutions, such as here. More discussion of solutions is expected on the forum.


