👨‍💻
Build Your First Primitive Matching Algorithm
Nov/16/2016
🏷️ Tags: [algorithm, matching, findmyreviewer]
Many students are curious about how the recommendation systems on shopping sites work, but even after watching explanatory videos on YouTube, they cannot find a proper way to kickstart their very first algorithm. Here is how we set up our very first, quick-and-dirty recommendation algorithm.
Background
Reputation is one of the most crucial things an academic journal lives on. To ensure the quality of papers, many academic journals require submitters to provide a list of recommended reviewers for their paper. Other journals take a different approach: the editors reach out to candidates of their own to review the submissions.
In both cases, however, there is an unavoidable limitation: the author and the editor only know who they know; they don’t know who they don’t know. When you are facing a specific topic that you don’t know much about, you need to find someone who is really familiar with that topic.
The traditional reviewing process has other drawbacks as well, such as uneven reviewer assignment and inefficient reviewer matching.
So we want to build an algorithm to help with this situation.
What Do We Have?
Say we have an academic paper, and our database holds a bunch of metadata about academic papers. Now we want to find some scholars who actually understand what this paper is about. What can we do?
So, for the paper we want to match, we have:
Title of the paper
Keywords suggested by the author
The abstract
Info of the authors
Names
Email addresses
Institutions
This is all we can get publicly about the paper without much effort.
Now, if you are observant enough, you may recognize this as a clustering problem: simply throw this paper into our database, cluster by keywords, find the group of authors in the same cluster, and bang! We get the result!
Unfortunately, this would not work well, for the following reasons. First of all, clustering can help, but some authors span more than one specific field. Secondly, it is really hard to determine how you should cluster these papers, since a paper often talks about several fields simultaneously. Thus, a clustering algorithm may not be a good idea in this case. But at least we have a starting point now: we want to identify, or profile, the authors by keywords.
Notice that we have the publication history of every author in our database, so we can extract a list of keywords from each author’s past publications. Now look at what we want to match against: the keywords extracted from the paper at hand. A keyword-to-keyword matching!
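As a minimal sketch, this profiling step might look like the snippet below. The record layout (papers as dictionaries with authors and keywords fields, plus the sample names) is a made-up simplification for illustration, not our actual schema:
```python
from collections import defaultdict

# Hypothetical, simplified paper records; real metadata is messier.
papers = [
    {"authors": ["Alice"], "keywords": ["Machine Learning", "KNN", "Database"]},
    {"authors": ["Bob"], "keywords": ["Machine Learning", "SVM", "CNN"]},
]

def build_author_profiles(papers):
    """Collect every keyword an author has used across their publications."""
    profiles = defaultdict(set)
    for paper in papers:
        for author in paper["authors"]:
            profiles[author].update(paper["keywords"])
    return profiles

profiles = build_author_profiles(papers)
# profiles["Alice"] -> {"Machine Learning", "KNN", "Database"}
```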
The Model
Intuitively, an author with more keywords in common with the paper is more suitable to be a reviewer for it. Let n_i denote the number of common keywords for author i. Following our intuition, we can rank each author in our database to find our best reviewer candidates:
$$\text{Score}_i = n_i$$
Congrats! You have built your very first recommendation model!
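In code, this model is just a set intersection per author. Here is a minimal sketch, reusing the hypothetical profiles dictionary built above:
```python
def score_by_common_keywords(paper_keywords, profiles):
    """Score each author by n_i, the number of keywords shared with the paper."""
    paper_keywords = set(paper_keywords)
    return {author: len(paper_keywords & keywords)
            for author, keywords in profiles.items()}

# Rank the authors against the paper's keywords.
scores = score_by_common_keywords(["Machine Learning", "KNN", "SVM"], profiles)
ranking = sorted(scores, key=scores.get, reverse=True)
```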
But when you implement this model with real data, you may find it not very useful: there are so many authors with the same number of common keywords that we cannot differentiate our best candidates from the rest of the population. Our model essentially does not help.
Say the input keywords are “Machine Learning, KNN, SVM”. “Machine Learning” has been a hot topic recently, so we can expect many authors to mention it; it is likely a common keyword for most of the authors in this field.
```python
Author_A = ["Machine Learning", "KNN", "Database"]  # Score = 2
Author_B = ["Machine Learning", "SVM", "CNN"]       # Score = 2
```
See? Both sample authors above get the same score. Even with more input keywords, the same situation remains a problem. We need to enrich our model to consider other factors that make an author truly a good candidate for review.
Suppose “KNN” is a relatively less popular topic than “SVM”. An author specializing in “KNN” may then be more worthwhile to consider. Why don’t we add a bonus to those authors with a specialty in relatively less popular topics?
$$w_j = \frac{1}{f_j}$$
where f_j is the frequency of keyword j in our database, i.e., the number of times it occurs in our database. Let us introduce an indicator function to show whether author i has the common keyword j or not:
$$I_{ij} = \begin{cases} 1 & \text{if author } i \text{ mentioned keyword } j \text{ at least once} \\ 0 & \text{otherwise} \end{cases}$$
Thus, the score for author i now becomes:
$$\text{Score}_i = \sum_j w_j I_{ij} = \sum_j \frac{I_{ij}}{f_j}$$
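A sketch of the weighted model in code might look like this, with the frequencies f_j counted from the same hypothetical paper records as before:
```python
from collections import Counter

def keyword_frequencies(papers):
    """f_j: how many papers in the database mention keyword j."""
    freq = Counter()
    for paper in papers:
        freq.update(set(paper["keywords"]))
    return freq

def score_weighted(paper_keywords, profiles, freq):
    """Score_i: sum 1 / f_j over the keywords j shared with author i."""
    paper_keywords = set(paper_keywords)
    return {author: sum(1.0 / freq[kw] for kw in paper_keywords & keywords)
            for author, keywords in profiles.items()}
```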
Let’s see how it goes:
Suppose in our database the frequency of “Machine Learning” is 10, while the frequencies of “SVM” and “KNN” are 5 and 3, respectively.
Then, we have the following result:
```python
Author_A = ["Machine Learning", "KNN", "Database"]  # Score = 0.433
Author_B = ["Machine Learning", "SVM", "CNN"]       # Score = 0.3
```
Now we have a more differentiated outcome.
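Plugging these toy frequencies into the score_weighted sketch above reproduces the numbers:
```python
freq = {"Machine Learning": 10, "SVM": 5, "KNN": 3}
profiles = {
    "Author_A": {"Machine Learning", "KNN", "Database"},
    "Author_B": {"Machine Learning", "SVM", "CNN"},
}
score_weighted(["Machine Learning", "KNN", "SVM"], profiles, freq)
# {'Author_A': 0.433..., 'Author_B': 0.3}  (1/10 + 1/3 and 1/10 + 1/5)
```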
Can We Do Better?
Yes, absolutely. For example, in this model an author specializing in “KNN” will overwhelmingly outscore an author specializing in “SVM”, yet we may only need one or two reviewers who know “KNN”.
Although this model has tons of fundamental problems to resolve, it is at least a good starting point for you to think about how to build a model for a particular purpose.
Stay tuned to our blog for later updates to see how the model becomes much more complex and accurate as we develop it.