This is what I have done so far with another Kaggle competition Event Recommendation Engine Challenge. I came in a little bit late with ten days left before the public leaderboard being closed. It turns out to be a good thing for me, as I usually find it easier to convince myself of spending spare time on competitions when they are finishing. : ). At the time when this is written, my best rank is 8th in the public board. Since I cannot offer more time on this, I teamed up with a friend to pass it over, and decided to put my story here before the competition ends. Hopefully someone can find it useful and make further improvements based on it.
Problem Understanding – collaborative filtering, classification, or ranking/regression?
1. collaborative filtering?
Before this I don’t have a lot of experience of working on a recommendation system. So at the beginning it seems to be a good idea to formalize the problem as a recommendation modeling – as suggested by the title of the competition. But after some reading and hands-on experiments (mainly with the collaborative filtering package in mahout), I found that collaborative filtering might not be the best model in this case, for the following reasons:
- User-based or Item-based collaborative filtering (also known as collective intelligence) is heavily based on the assumption that an effective similarity measure can be built among user/item groups. In practice, it usually means that for those models to be useful, there must be reasonable overlapping between transactions of different users/items. Unfortunately this is not true for this problem – the system built on mahout cf package cannot make any recommendations for about 40% users, due to the lack of overlapping between them. Even after enlarging the training set by including more transactions from the event_attendee and event_creator data, they are still too sparse to make useful recommendations. Even worse, some users/events in the test set have never appeared in the training set – this is usually called anonymous recommendation and almost equivalent to a blind guess.
- Recommendations made by collaborative filtering models are usually open-lists – the candidates are from the whole set of items (events in this case). However, this is NOT what is required in the competition – what we are asked is actually to RANK a closed-list of events for a certain user.
2. classification?
Modeling it as a classification problem does not feel right either in this case unless we only formalize the output as binary – interested_event/noninterested_event. This is not very satisfying considering we are supposed to rank the events based on a specific user’s interests in them – in other words, the output is ordinal rather than nominal.
3. ranking/regression?
So how about ranking and/or regression modeling? I haven’t found any clear-cut guidelines in the literature on how to distinguish ranking and regression problems, though there are slight differences in handling them in practice. For example, their accuracy measures are usually different to reflect different things that ppl care about in them. I chose to model the problem as a regression simply because (1) currently there are more regression models available than ranking models and (2) a user’s interest is not only ordinal but also numeric, i.e., the difference of different interest degrees also makes sense.
Data Understanding – what is the story behind them?
One of my favorite tricks for understanding data is to come up with a story line connecting as many pieces of data as possible. This also helps me to find potential inputs and outputs. Here are some of my thoughts.
(1) Suppose I myself am a user of the recommendation system. First I have to register myself as a user with my particulars (age, language, city, etc…) – this is captured in the users.csv data. The data describes the “static” nature of myself because they don’t depend on anyone else other than me and they will not change along time.
(2) Later I make friends with some other people who are also using the service – this information will be captured in my friend list as in the user_friends.csv file. My friend list tells more information about me than my static particulars, because people usually make friends based on common interests, which are not easily seen by their profiles alone.
(3) Someone (a stranger, a friend of mine or myself) creates an event someday, by registering the event with its description, location and etc. Those are “static” information of events are now captured in events.csv.
(4) The created events start to develop some “dynamic” properties when invitations are being sent out to people (e.g. by system, friends of friends or creators). Those dynamic properties will be shaped as different types of responses are sent back, e.g., by invited people who are not interested, people who are not invited but turn out to be interested. Part of this information is captured in event_attendees.csv, but for it to be useful we must connect it with other pieces, such as users’ friend lists.
(5) At the end, if I have the access to all the created events, I want to know which events I will be interested in. First I will consider the static properties of myself and the events – do I have enough time to schedule? is it in the city that I stay in? is the event popular enough? Is the topic of the event interesting in general, is it on a holiday/weekend and etc. Then I will also consider the dynamic properties between the events and me – am I invited? are a lot of my friends going or not going? are they invited? am I an acquaint of the creator? can I find people to make new friends with in the event who are friends of my friends? Those will be potential features to think about when I start building the models.
Finding Inputs & Outputs
On the input side, here are some features that I am currently using in my code, based on my understanding described above.
- is_user_invited
- user_locale
- is_user_in_event_city
- is_user_in_event_country
- user_gender
- is_event_creator_a_friend_of_user
- num/ratio_of ppl interested/invited/not_interested/mayb_interested in an event
- num/ratio_of friends interested/invited/not_interested/mayb_interested in an event
- time_difference_between_notification_and_event_start
- user_age
- event_cluster_by_topic
- user_cluster_by_community
Most of the input features are straightforward to calculate, except the last two. The topics of events are calculated based on the frequencies of common terms, as given in the events data. I used the kmeans clustering to group the relevant events into clusters, which is analogous to event themes in practice.
User clustering is a little bit different because what we have in the data are connections between friends and they are better modeled as a graph. Finding clusters in graphs is usually studied as a community detection problem.
Formalizing the output is even more important in my opinion – if the levels of the output are too sparse (just as 1, 2, 3), they will not be enough to break ties when ranking them. On the other hand, if they are too “refined”, we are at a risk of making things up because we actually don’t have the information of appropriate target values directly from the training set. It makes more sense to build the target step by step in an hierarchical way by forming different sub-targets and combining them later.
When it comes to finding a good target, usually the most useful thing is the domain knowledge, which I don’t really have for this case. Here is my best guess – I use three pieces of information in the target output, which are (1) whether the user is invited, (2) whether the user showed interest/no interest, (3) the ratio of the people who are both invited and show their interests.
I use (1) and (2) to form the main ranking 1-6 of the output – rank 1 for people who are invited but show no interest, till rank 6 for people who are not invited but end up showing interest. I use (3) as a refined adjustment for the output because there are a lot of people in the training set who are not invited, and show neither interest nor no-interest – those are people ranked as 4 in terms of their interest degrees. Adding (3) will help distinguish users in major each rank group further, which could also be an advantage for some machine-learning models that are easy to be biased by the dominating groups in the dataset.
But this set-up of output brings up another issue – now there is a smell of target leakage because some information is used in both inputs and outputs, though it should not be as bad as in a prediction problem as our final goal is to rank different events. It could be over-fitting, but so far the results showed that including those leaked-to-output information in inputs result in better performances than excluding them.
The modeling of inputs and outputs are extensible. For example, in the future we may also want to consider clustering users based on other criterion, such as the proportions of male/female in their friends, or even thee degrees of their friends as in a graph.
Putting them together – implementations
I am usually comfortable with doing data munging in Python/Command-line and exploring and trying different data models in R. One of the challenges here for me is the relatively large data in events.csv (it is about 1.1G in size and contains 3 million records). It is totally doable with mahout + hadoop, but those solutions seem a little bit overkill, since all the other files are still small in size.
What I really need are some computing tools that allow me to handle data in an out-of-memory way, without the need of loading the whole dataset into memory. I have tried both pandas and hdf5 with python, and bigmemory and biganalytics in R. All of them work pretty well for the problem. I used “foreach” and “randomForest” to train the regression model for interest rank prediction.
My code can be found here
UPDATE: The competition was over and now both the public and private scores (based on different sets of test data) are out. It turned out that there was a big gap between these two types of scores in my submissions – my highest public score was achieved by using the strategy described above, whereas my highest private score was achieved by using a different target output strategy -the targets were generated using (1) and (2) only, without using (3). That makes sense because firstly there is a risk of target leakage in those settings; and secondly the target strategy (3), i.e., the one based on the ratio of the people who are both invited and show their interests, was too refined and so it was at a risk of getting overfitted to the data. But to be frank, unless we see more data, I don’t think we can tell whether such a refined strategy is too much or not beforehand.
A lot of other kagglers are publishing their solutions, such as here. More discussions about solutions are expected from the forum.
I am playing with the Traveling Santa Problem @ Kaggle with a couple of friends. It’s a very good experience to me because (1) I have had almost zero exposure to TSP before this so I am learning a lot of new stuff and (2) the discussions in the forum are very active and educational – you can even get started without any background knowledge by just reading the posts. My friends and I have been exploring different techniques for this problem, the story here is mainly about my personal exploration.
1. Understand the problem
A classical Travelling Salesman Problem (TSP) involves finding a tour of all cities with the total distance along the tour minimized. The Santa problem is special in that we need to find two tours instead of one – the two tours have the constraint that an edge in one tour cannot be used in the other. And we want to make the total distances of both tours as small as possible because the final score will be the larger distance of the two. There have been a lot of discussions about theoretical results of both TSP in general and the Santa problem in particular (especially from the posts on the forum).
My understanding of the problem is that we need two permutations of all the cities, with the constraint that no edges are repeated, and the objective to be minimized is the max of the two total distances. This sounds pretty much like a search problem to me, which means we can use classical searching algorithm such as back-tracking and A* search to build the solution from developing partial solutions step by step, except that there is only one problem here – the number of cities is 150K. This number is so large which easily makes any O(N**3) algorithm look extremely slow. So we have to be very careful about the speed and using heuristics to find approximate solutions.
2. Bottom line method
To me it is always a good idea to start with the most naive solution – in this case I decide to find the first tour (I call it tsp tour) by greedily connecting the nearest available neighbor to the last city of the partial tour that we have built. I called it greedy-search but later I found it actually corresponds to the nearest-neighbor-method , and the term “greedy method” is specially reserved for the one that finds the shortest available edge (instead of cities). For the second tour ( I call it twin tour), I am gonna use the A* search by limiting the branch candidates to the unvisited node that are not directly connected to the parent node in the 1st tsp tour. So for example, if the first tsp tour is 1->;2->;3->;4->;5, then in the search tree for the second twin tsp starting from node 1, the available branches for node 2 will be (4, 5).
I am using python to implement the code – the nearest-neighbor for finding the tsp tour is very fast (after pre-building the top 200 nearest neighbors using a KD-Tree). Here I have to make a decision because we limit the number of neighbors to 200, it is possible that at a specific step, we are running out of available neighbors for the tail of the partial path – we decide to use the nearest point to the tail node in terms of difference on x axis or y axis. Now all I have to do is to pre-sort the points based on x and y axis, instead of building the expensive nearest neighbors for a larger number. This simple strategy gives me the first tsp tour with distance of around 8Million.
But this method basically fails when I am trying to find the second twin tour through searching. The problem is simple to describe but hard to get around – building a search with depth = 150K is too much for loading it to the memory. I use the total_dist(1st tsp tour) – partial_dist(2nd twin tour so far) as the heuristic for A* search. It is definitely not admissible but what we need is an approximate instead of an exact optimal. I have tried cProfile and memory_profiler but the problem is still too slow and the memory usage is not acceptable. I ends up with using the cheapest depth-first search in the end, and the result is not good at all.
3. Divide and conquer
Although the naive method above didnt give me a satisfying solution, at least it gave me an estimate of what kind of challenges I would expect for this problem – this is good enough for a bottom line method. I keep searching for useful libraries for TSP problem and come across the R library (TSP), which is based on the Concorde implementation. The library provides several different TSP methods including NN, LKH (linkern), 2-opt, concorde and nearest-insertion. Those methods are based on building the distance matrix for the cities. Obviously a distance matrix of 150K * 150K is too large and slow – to me this is a good hint to use divide and conquer.
The shape of the sub-regions matters in this problem because after we finding the tour in each sub-region, we have to connect them into one tsp solution. I decide to use the simplest way – partitioning the space into n * n grid, because connecting subregions in grid can be as simple as using a ‘Z’ or ‘N’ shape connection. There will be some overhead of doing so. For example, for a 10 * 10 grid, we have 0-20K on both x and y axis, the worst case for connecting the grids in a ‘Z’ or ‘N” way can result in an extra distance of 11 * 20K = 0.2M. But compared to the optimal solution on the leader board (around 6.5M), it is safe to ignore this part as an easy start.
After finding the first solution, I use a very simple strategy to find the 2nd one – still based on divide-and-conquer. Since those methods are all based on distance matrices, to find the second solution that has no conflicts with the first one, I simple blocked the edges of the first path by using Inf as their corresponding entries in the distance matrix. This strateg turns out to be very effective, except sometimes it is not perfect – some methods such as NN will still use the Inf-long edge simply because it cannot find any other available neighbors in the sub-regions. But that is easy to fix. I write some python script that allow me to verify and FIX the two paths with small conflicts by simply using 2-opt exchange.
I have tried different configurations such as different size of grids, the best result I got is around 7.5M – a big improvement compared to my bottomline solution. I also found using different sizes of grid for 1st tsp tour and 2nd twin tour tends to give better result. The subregions for the 2nd twin tour should be larger than the 1st (or of different regions), simply because the two tours are actually competing with each other in terms of using nearest neighbors of a node – after finding the 1st solution, we need to make the region for the 2nd one bigger otherwise it cannot find other available short connections.
The main problem for this method is – the distance gap between the 1st tsp tour and the 2nd twin tour is usually large (about 1M). This makes sense because they are competing with each other. What we want to do is to make both tours (specially the 1st one) suboptimal, and thus they are close to each other. I have seen other ppl discuss using the same library on the forum, by running the code for several hours. For me the divide-and-conquer has several advantages over not doing so:
(1) complexity reduction – my code usually find the 1st and 2nd solution within half hour, using about 4G memory.
(2) extra flexibility – it is possible to use different methods for different sub-regions, or using different strategies to connect solutions from subregions. since it is a typical map reduce operation, i used foreach package in R to parallize them.
(3) easy to merge – by using the same sub-region grid, there is a very good characteristic of the solutions that I find – in each sub-region, the sub-solution of 1st tsp and 2nd twin tour always satisfy the constraint that there are no duplicate edges shared between them. It turns out to be a very useful characteristic later when we try to merge the solution to make them closer to each other.
4. And averaging them out …
The main problem with divide-and-conquer solution is that the distance difference between two solutions are large and the final score will be the worse of them. A simple strategy is to make the difference smaller by “averaging” them out. Again We “stole” the idea from those ppl that have had great discussions on the forum – but here we extended the original idear a little bit because we already have two solutions that have the non-conflicting characteristics in any of their corresponding partial-tours.
We average the two solutions out by breaking them into corresponding chunks (which correspond to partial solutions from different sub-regions), and re-connect them piece by piece. Along the connection process, we may exchange the chunks of 1st tsp and 2nd twin solution, when exchanging them could make their total distance difference smaller. We came up with the implementation and tested the idea, for an original solution with distances (6.0M and 7.3M), averaging them out makes the new distance = (6.7M and 6.7M). It is a big improvement in terms of the final score!
We also developed some other script that further reduces the distance of the twin solution given the constraint of the 1st one. It is based on greedy search of 2-opt exchange, but the improvement is not significant, which is usually less than 10K.
5. Some more thoughts
The best result seen on the lead board is around 6.5M for this time being, 0.2M better than ours. One explanation is that we use a very naive way of connecting solutions from different sub-regions. As mentioned above, this could result in extra overhead. An more elaborate way of doing this is to cycle the path in the subregion and try to connect them in an optimal way, hopefully the gap will be smaller.
Anyway I personally have had great fun with this problem – I have learned a lot from those kagglers. It is a good forum.
UPDATE:
My friends and I figured out a way of optimizing the in-between paths connecting the sub-solutions from each grid. The idea is based on greedy search – every time we connect the sub-solutions of two grids, we rotate each of them to make the end of first and the start of second close enough. After two sub-solutions are “merged” this way, they are viewed as a whole new sub-solution to be connected to the sub-solution of next grid. So the solution is formed like this (((g1, g2), g3) , g4) , …
By doing this “merging grids” on a 5 x 5 grid solution (after blending), and further optimize each tour by greedy search based on op 2 (described above), we made an improvement from 6.71M to 6.66M. We are still behind the so-far-best result, which is 6.5M. But probably I am gonna stop here unless there is new interesting idea to try.
The lesson that I have learned is – you probably need to spend more than 50% efforts on the improvement from 90% to 100%. So those guys who did the last 10% were really great!
The code can be found at https://github.com/dolaameng/fun_with_kaggle/tree/master/santa
