Simple Recommendation System

Hacker News

Hacker News
How hacker news handles the concept of age?
Simple and intuitive approach of recommendations: Go by popularity
=> Most people like this, it is likely you might like this as well
Problem: For example in New York times, we would not want to put an article of three years earlier even though it is popular
Balancing popularity with age
Recommendation_System1
Go by the ratio of popularity to time
-> Specifics of Hacker News algorithm and how it balances the ratio
Hacker News Formula
Measure of the article’s popularity:Number of upvotes minus the number of downvotes minus one raised the power of 0.8
Measure of time: Age and hours raised to the power of 1.8
-gravity=1.8: Controls how fast or slow the score falls with respect to time
-penalty:multiplier to implement custom “business rules”(e.g. penalize self-posts, “controversial” posts, + ,many more)
-2: Added to the age and hours, so every link starts from two hours old. Or else, article will start as zero years old and score will reach infinity
Recommendation_System2
Effect of each parameters on the score:
-gravity = 1.8 > 0.8: Exponent for numerator is small than the denomimtor
Denominator will always grow faster than the numerator
Means the age will overtake the popularity, brining new articles closer to the top so relatively it doesn’t matter how many votes an article gets
Popularity term:
Raising the net votes to the power of 0.8 grows sublinearly
ex.0 -> 100 worth more than 1000 -> 1100
Power Law
-Article votes are distributed according to power law
-Few articles get a very high number of votes, but a majority of articles will get very few votes
-ex. Words like ‘the’ get used very often, but words like ‘frog’ don’t get used often in a single document
-If score grew linearly with popularity, they would be spread out
Most article will have 1~2 votes, while a few articles will have 500 or 1000+ votes
-Also appears in the Reddit formula
-Hacker news uses the same ranking formula for links and comments(Different from Reddit)
Hacker new algorithm pattern:
Score will grow very quickly and taper off at a slower rate due to gravity term
This is because, article receives votes early on, but don’t have much effect during the time, but after a while gravity term takes over
Interesting Tidbits
Penalty terms isn’t known, but the inferred tidbits, according to data
-Link is penalized if site is too popular (github.com, reddit.com)
-Article is about NSA(no longer the case)
-Too many comments(proxy for how controversial the link is)
-Business related penalites can also exist (who owns the company)
=> Penalties itself could be applied for any reason or no reason at all

How Reddit implements its ranking algorithm
-Difference between reddit and hacker news algorithm: Everyone can downvote
(Hacker news, to downvote, a certain amount of points need to be earned before being allowed to downvote while in Reddit everyone has equal power to upvote or downvote)
Reddit Formula
-Look quite different compared to hacker news formula, but concepts are similar
Two terms popularity and age exists the same, but instead divided by a plus sign instead of division sign
Measure of popularity
log(): Log of the absolute votes of net votes
-Also sublinear(respect to the net votes):
Have the idea of diminishing returns
max(): Take the max between one and net votes, because it is impossible to take the log of zero
signs(ups-downs):
-Can be positive or negative
If net votes are positive, term will be positive, and vice versa
More downvotes received, the more score will decrease negatively
age: time in seconds in the inception of Reddit
-Term is always positive
-The newer the link is, the more score it receives
-Reddit scores will forever increase linearly
For Hacker news, the score for news decreases over time, but for Reddit, they get bigger
Only the relative score matters, so it doesn’t matter whether the scores will become much larger or smaller
=> Similar pattern with HN
Recommendation_System3
Age grows faster than the popularity term, which means newer stories will always get pushed to the top
Reddit Algorithm Code:
https://github.com/reddit-archive/reddit/blob/master/r2/r2/lib/db/_sorts.pyx
Controversy
-CEO of Reddit editing the posts he didn’t like around the 2016 election
-Some content related to Donald Trump didn’t appear in the front page despite popularity
-Censoring conservative content
Don’t I own my site
It is true we have some freedom in editing my own site, but big sites like Reddit, Google, and etc, have some public responsibility to be neutral

Problems with Averate Rating & Explore vs. Exploit (part 1)

Ratings
-Previous: News feed items, typically having upvotes/downvotes likes
Usually have just 1 or 2 items
-Sites with 5 star ratings
ex. Amazon, Best Buy, and Walmart (Netflix used to.. but changed to binary system)
-Can feel like.. 5 star ratings, more like regression, and binary outcomes as classification
How to Recommend?
Simple approach: Sort by average rating
-Recommending items is like sorting them with a score
Problem:
Top item has 4.5 stars with 2377 reviews, while bottom item has 5 stars with one review
5 stars is bigger than 4.5 stars, but how confident we are in the rating should also be considered
Is the item with 5 stars better than the item with 4.5 stars? A: Maybe Not
Recommendation_System4
Reference: https://www.amazon.com/s?k=airpod&crid=19DKOK0P2QQDI&sprefix=airpo%2Caps%2C347&ref=nb_sb_noss_2
Confidence
Draw a box around the estimate which tells us how confident we are in the estimate
-Gives us the upper bound and the lower bound
-Typical strategy for ranking ratings is to be pessimistic and use the lower bound
Why sort by the lower bound
-ex.Have 2 items that have 4 stars on average
Item1: 3 ratings, Item2: 100 ratings
=> Result:
Recommendation_System5
Reference: https://analystprep.com/cfa-level-1-exam/quantitative-methods/confidence-intervals-2/
Item2:
Confidence interval is skinnier(More confident in its average rating)
Has more ratings, a bigger sample size
Item1:
Confidence interval is fatter(Less confident in its average rating)
-If using the upper bound, Item1 would be ranked higher than the Item2
This is not what we want, because we are more confident that Item2 is the four star item
Confidence Intervals
Normal approximation for confidence intervals
-ex. Given a random variable X, we can calculate the distribution of sample mean
Variable X: Normally distributed with mean mu and variance sigma square
X bar: Estimate of its sample mean, average of the samples I collected(Sum them all up and divide it by n)
=> Proof using probability scales:
X bar is also normally distributed, with the same mean mu, but a variance of sigma squared over n
-The more samples I collect N, the skinnier its distribution
N(number of samples) directly affect the variance of X bar
As N increases, the variance of X bar gets smaller, confidence interval gets skinnier
More samples I collect, more confident I should be
Recommendation_System6
Confidence Intervals(Normal Approximation)
95% Confidence Interval:
Interval for which the area under the curve is 95 percent of the total area
-Total area is one, since it is probability
-Calculate area of 0.95 using the CDF Function
Recommendation_System7
-X bar: sample mean
-Z left & Z right: Inverse CDF of the normal distribution that covers 2.5% of the total prob and 97.5% of the total prob respectively
Can approximate this number using 1.96
-N: Number of samples we collected
-s: Standard deviation of the data
What if X is not normally-distributed?
In practice, data might not be normally distributed
No problem! CLT(Central Limit Theoreum)
-X bar is sum of all the Xs multiplied by a constant
-Sum of all variables converges to a normal distribution. X bar is normally distributed
=> It doesn’t matter whether X is normally distributed itself
Recommendation_System8
Bernoulli CI Approximation
Can derive the confidence interval for Bernoulli as well using the normal approximation
Possible outcomes: 0 or 1
-Bernoulli distribution is the 0 1 distribution
-Used when we have a binary outcome(Upvote downvote situation)
Recommendation_System9
p hat: proportion of ones you get
-total number of upvotes divided by total number of all votes
Variance: p hat times (1-p hat) and the square root of that and plug that to the existing formula
Wilson Interval
Step further
Better approximation of the confidence interval for a Bernoulli distributed variable
Recommendation_System10
-p hat: sample mean
-z: 1.96 for the 95% confidence interval
-n: Number of samples collected
=> Entire process is the same
Sorting process is still the same - sort by lower bound - just that now the lower bound is more accurate
Extension to 5-star ratings
How to extend this to 5 star ratings?
-5-possible outcomes(If using half stars, 10)
-Can use Wilson’s interval
| Stars | Negative | Positive | Total | |———-|———-|———-|———-| | 0 | 0 | 0 | 0 | | 1 | 1 | 0 | 1 | | 2 | 0.75 | 0.25 | 1 | | 3 | 0.5 | 0.5 | 1 | | 4 | 0.25 | 0.75 | 1 | | 5 | 0 | 1 | 1 | Each rating is pseudo upvotes and pseudo downvotes
-1 star is equal to 0 upvotes and 1 downvotes
-5 star is equal to 1 upvotes and 0 downvotes
-3 star is middle, equal to 0.5 upvotes and 0.5 downvotes
=> Plug in the numbers to Wilson’s intervals
Wilson Interval
Where it’s used and possible issues it might have
Hacker news uses the same algorithm for both links and comments while Reddit uses the Wilson’s interval for comments
Reddit’s commenting system:
A great product with one 3-star rating would be penalized heavily
Penalized twice
(1)One rating it has happens to be a poor rating
(2)Confidence interval is wide, meaning the lower bound is a small number
Lower bound is good, because it accounts for number of people who rated the item
-Higher the number of raters -> smaller CI(Confidence Interval) -> higher lower bound
-Popularity increases its score

Problems with Averate Rating & Explore vs. Exploit (part 2)

More problems with average rating
Get another method of fixing the problem of ranking by average rating
What if N is very small, or 0?
Formula for the average or the sample mean:
-Problem is that if we have zero ratings, we have 0 out of 0 which is undefined -> What can we do to fix this?
Smoothening or Dampening
Seen this in a similar context of NLP for word prob also in Page Rank
-Formula generally works for in the average, but also works for probabilites
-Basic Idea:
Add a small number to the Numerator and Denominator
If we have zero ratings, we can just default to some pre-defined value
-Could make u0 to the global average, or just some middle value like 3(stars)
Problem: 3 stars might not be a good choice It might appear like all the products in your website doesn’t have very good rating
-Simple transition from u0 to true sample mean
ex1. Thousand 4 star ratings u0=3, lambda=1 -> 3.999 (Pretty close to 4)
ex2. Five 4 star ratings u0=3, lambda=1 -> 3.83
ex3. One 4 star rating -> 3.5
=> Smooth transition from the pre-defined value to the true sample mean as we collect more and more samples
-See how Baysian approach yields the same formula (Ranking method is different)
Recommendation_System12
Explore-Exploit Dilemma
Also in the subjects of A/B testing or Reinforcement Learning
Ex.Play slot machine
-Some slot machine will give you a better reward than others
All machines look identical, it is unknown which slot machine is better -> To measure the rewards, the only way to find this out is to play them
-Only two outcomes exist: lose(0) or win(1)
Like to play the slot machine where you will win the most -> Play the slot machine and calculate its win rate
-Each slot machine is a Bernoulli random variable
Recommendation_System13
:P hat equals the probability of winning which is number of wins divided by the total number of plays
-How do we decide how many times we want to play each slot machine?
Play more? or less?
Too few times are played, estimate wouldn’t be good
-ex. 3 times played, only win once => p hat is 1 out of 3
It is not an accurate estimate because the confidence interval of that estimate will be very fat
-ex. 300 times played, win 100 times => p hat is 1 out of 3
Estimate of one third will be more accurate
Traditional statistical tests can tell us whether differences in the means between two different slot machines is significantly different
-Problem:There can be only one slot machine can be optimal, others are suboptimal
Generally more times we play, the better our estimate is
Play 10 different slot machines 100 times each (1000 turns)-> 9 machines are suboptimal and 900 turns yielded a suboptimal reward
This doesn’t seem good as well
-Dilemma: Play more or less?
Explore Exploit Dilemma
In order to get a good estimate, we must collect more data. The more data is collected, the better the estimate
But on the other hand, more time is spent doing something suboptimal
Performing both explore and exploit will be ideal, but they are inherently with odds with one another
-Want to explore and collect more data, but I want to exploit what I believe to be the best award
-In order to explore, exploit would be impossible, and in order to exploit, explore will be impossible
Explore-Exploit in Recommenders
Ex.Watching a bunch of Youtube videos in how to make eggs
-Exploit: Show you tons of videos how to make eggs
-> After showing tons of videos how to make eggs, nothing else is suggested
Lots of exploiting, but no exploring
-Maybe suboptimal, because after making, eggs, I might not want to watch any more videos about making eggs anymore
-Stronger exploration component might be needed
Watch videos on ml or videos about movie trailers
Eggs have nothing to do with ml or movie trailers, can easily suggest what I don’t care about
-Exploring doesn’t guarantee something I like
Have the risk of making bad recommendations
Explore-Exploit
-How do we strike balance between two opposing forces of explore exploit?
:Smoothed average gives us one part of solution
Makes really bad products appear better than they are and good products appear worse than they are until enough confidence is earned to converge to true values
Result: Mix of probably good and probably bad products
-Top recommendations will be still be good products that have high confidence
-Next: Bayesian Solution to this problem

Bayesian Ranking(Beginner Version)

Bayesian Ranking(Beginner Version)
Advanced method of sorting items
Use the Bayesian paradigm instead of fixed scores
Brief overview & demo
Lessons
Do not want to sort by average rating, and every lesson is trying to avoid this
Explore-Exploit Dilemma
-Confident with 100 people clicking over 100 views
Difference exists between 1 person clicking over 1 view and 100 people clicking over 100 views
Bayesian Method
-Previous ranking used a fix number(score) (Already know the mean is not suitable)
-Sort by the lower confidence bound?
->Ranking by the fixed number don’t seem right
-Ranking be totally random?
Ranking itself will be totally random and useless
-Common misconception: Random means “completely disordered”
->Random variable is characterized by its distribution
ex. Normal Distribution: If it were to draw a random number, the most likely number is the values that are more likely and less likely and range of values we might get(any number on the real line)
Click Through Rates(CTR)
-Goal: Rank products based on which are more likely to be clicked(order by CTR)
Get more people to click > Takes them to the product page > Presumably make a purchase
-click=1 & view=0 -> CTR=mean
Problem: Don’t want to do this
Bayes: Bringing it all together
Treat the CTR not as a fixed number, but as a random number(i.e.with a distribution)
Recommendation_System15
-All the distributions are bounded between 0 and 1
Click through rates can only be between 0 and 1 so we should better pick a distribution that matches the requirement
Uniform distribution: All values equally likely
Completely flat
Pretty much don’t know anything about what the click through rate is
Any click through rate between 0 and 1 is equally likely
Exponential Distribution: 0 most likely, 1 least likely
Ramp that starts off at zero then goes down to one
Highest point is at zero, so the most likely click through rate is zero
Not completely sure, which is why there is a prob mass for other values of the CTR
For greater values of the CTR they are less likely
Peaked Distribution(Beta Distribution): 0.7 most likely
Pretty confident that the CTR is 0.7 but there is some probabilities for the other values
How do we learn these distributions
Recommendation_System16
ex. I have a bunch of products I want to rank. Going to give each of them their own distributions
The distributions start out flat and become more peaked as more data is collected(Clicks and views of the website)
-View: Every time a user sees a product
-Click: Every time a user clicks a product
How do we learn these distributions
Example
What happens when we collect data?
First Plot: 2/3 clicks
Distribution is pretty wide, and less confident about the true value
Second Plot: 20/30 clicks
Distribition is skinnier, as more data is collected, is more confident that true ctr really is CTR
Third Plot: 200/300 clicks
Distribution is even more skinnier, as an infinite number of data points is approached, the distribution approaches our point estimate
Recommendation_System17
How do we rank items?
If we have fixed numbers, it is easy.
Ex. One item has score=5, and one item has score=3, 5 > 3
But if having two distributions, how do we decide which one is higher?
Wrong Answer
=>Take mean of the distribution and rank by that? A: Don’t sort it by the mean
Thompson Sampling
Do this by sampling random numbers
Extreme Case1: Both Uniform
-Since both samples are equally likely to be any number between 0 and 1
Samples we draw are completely random and it is not really predictable which will come first
-That’s OK, because this means we have no data, we need to explore(Gain views and potentially clicks)
If we don’t know which of these items are more likely to be clicked on, data needs to be collected to give us a better idea
We need to show these items to users to see if they will be clicked or not. Only after the items are seen by the users, we know whether the items are good or not
Extreme Case2: Both Sharp Peaks
-When the peak is very sharp(approaching a deterministic variable), the sample will be very close to the peak
-Highly likely that the item with the highest value peak wins
ex.Consider the peak as two thirds, we make it a sample of 0.6665 or 0.6668, but there is no chance like drawing a sample like 0.2, because of this, it is highly likely that the item whose peak is located at the highest position will win
-Real-World: We’ve collected so much data, we’re very confident of CTR
We’ve already collected millions of datapoints for each items, because we created so many data, there is no question about what the true CTR is
-Because being confident that we know the true CTRs, we don’t have problems ranking by those CTRs or equivalently by samples close to those CTRs
Recommendation_System18
Mixed Case(One Fat, One Skinny)
:Mixed scenario where we have one item with the very sharp peak and one item with the very fat peak
-Sample for the sharp peak
More likely to have a score(sample) near the peak(will be most likely to be located at the peak)
-Sample with the very fat peak
Still in “exploration mode” score(sample) can be high or low
Have a sample anywhere, have a good chance it may be ranked better than the other item, but also have a good chance it may be ranked lower as well
=> Sometimes item1 will be ranked higher than item2 and sometimes vice versa. This is good, because we can collect more data
Recommendation_System19
Lesson
Bayesian method automatically balances the need to explore and exploit
-2 fat distributions: explore both (totally random ranking)
-2 skinny distributions: exploit both (nearly deterministic ranking)
Sort items deterministically by their known CTRs
Mixed: explore and exploit co-exist
-Two extremes can co-exist can have an item where one item needs more exploration because it needs more data and one item which can be exploited, because it has enough data
-Nice thing about this method is that it is completely automatic, we don’t need an AB test to figure out which item is best. Just running the algorithm and everything is ranked automatically in an optimal way
See this in action
Have three items: Awesome, Ok, and Boring
Awesome product: Product that people are most likely to click on
Ok product: Product that is less likely to click on
Boring product: Product that is even less likely to click on
Created a dummy website which just shows three items in an ordered list (Like Amazon)
Recommendation_System20
Ranking
1.Awesome
2.Boring
3.Ok
Graph: Shows us the current distribution of each item
-Have corresponding vertical bars telling us the values of the samples
Samples are random and drawn from the corresponding distributions
-Green vertical line sampled from the green distribution
-Distribution for each item is flat, know nothing about the CTR, because no one has seen this website yet, we know about their CTRs
-Items are ordered pretty randomly. Ok > Boring > Awesome
People visiting the website scenario1
1st visitor: User click the awesome item, because we are interested in awesome products
People visiting the website scenario2
2nd visitor: Refresh this page, and pretend we are the second person to ever visit this website
Distributions for all the items changed
Ranking
1.Awesome
2.Boring
3.Ok
Awesome ranked at the top and then have boring and then ok
Boring and Ok have an equal chance of being 2 and 3rd place
Recommendation_System21
Awesome sloped upwards while ok and boring items are sloped downwards
This makes sense because nobody clicked on ok and boring items
People visiting the website scenario3
3rd visitor: Boring item: Still peaked to zero, but with a steeper slope
Awesome item: Still peaked at one, but with the opposite slope
Ok item: Ok is peaked in the middle, which makes sense because upon viewing twice, we only clicked once
Click through rate is 0.5, but is wide, because 2/1 is not enough data to be confident
Ranking
1.Awesome
2.Ok
3.Boring
Samples drawn correspond to the ranking
People visiting the website scenario4

Reference:
https://github.com/reddit-archive/reddit/blob/master/r2/r2/lib/db/_sorts.pyx