A deep dive on Evaluation Metrics Formulas
Image Source. The image is free to use under the content licence of Pixabay.
In recent articles, we have presented prominent recommender system algorithms that are widely employed in both industry and literature through a different lens using Graphs. Additionally, we have shed light on emerging approaches, including knowledge graph-based recommender systems, which are gaining traction in the field of recommendation research.
In this article, we will delve into the process of assessing recommender systems. We will explore the key criteria and metrics used to evaluate their performance, including accuracy, diversity, coverage, novelty, and serendipity.
In recent years, most research on recommender systems has been evaluated, at least in part, with accuracy metrics borrowed from the information retrieval field, such as precision (P@K) and recall (R@K), whose formulas are given below:
$$P@K = \frac{1}{n}\sum_{u=1}^{n}\frac{1}{K}\sum_{j=1}^{K}\mathrm{hit}(i_j, u)$$
$$R@K = \frac{1}{n}\sum_{u=1}^{n}\frac{1}{|\mathrm{Rel}(u)|}\sum_{j=1}^{K}\mathrm{hit}(i_j, u)$$
where n and m are respectively the number of users and the number of items, i₁, i₂, …, iₖ are the items ranked from 1 to K for user u, and hit(iⱼ, u) is 1 if the recommended item iⱼ is relevant to user u and 0 otherwise. Rel(u) is the set of relevant items for user u in the test set.
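As a minimal sketch of how these two metrics can be computed, assume that recommendations and test-set relevance are held in plain dictionaries. The names `recs`, `rel`, and `precision_recall_at_k` are illustrative and do not come from any library, and treating users with an empty Rel(u) as contributing 0 to recall is an assumption of this sketch.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(
    recs: Dict[str, List[str]], rel: Dict[str, Set[str]], k: int
) -> Tuple[float, float]:
    """Return (P@K, R@K) averaged over all users in `recs`."""
    p_sum = 0.0
    r_sum = 0.0
    for user, ranked in recs.items():
        relevant = rel.get(user, set())
        # Number of hits among the top-K recommended items.
        hits = sum(1 for item in ranked[:k] if item in relevant)
        p_sum += hits / k
        # Assumption: a user without relevant test items contributes 0.
        r_sum += hits / len(relevant) if relevant else 0.0
    n = len(recs)
    return p_sum / n, r_sum / n


# Toy data: two users, three recommendations each.
recs = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
rel = {"u1": {"a", "c", "x", "y"}, "u2": {"e"}}
p_at_3, r_at_3 = precision_recall_at_k(recs, rel, k=3)
print(p_at_3, r_at_3)  # P@3 = 0.5, R@3 = 0.75
```

User u1 has 2 hits out of 3 recommendations and 4 relevant items, while u2 has 1 hit and a single relevant item, which gives the averages printed above.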
In the context of product recommendation, these metrics measure how well the system identifies the K most relevant items for a given user. In the special case where a single item is expected, as in session-based recommender systems where we want to measure the correctness of the immediate next item, we use hitrate@K, the fraction of users for whom that item appears in the top-K list:
$$hitrate@K = \frac{1}{n}\sum_{u=1}^{n}\max_{1 \le j \le K}\mathrm{hit}(i_j, u)$$
Like R@K, hitrate@K measures the correctness, or accuracy, of a recommender system.
The recommender system suggests a ranked list of movies to the user. hitrate@K equals 1 if the user decides to watch one of the movies in the top-K list (e.g. Gatsby), while MRR@K equals 0.5 because that movie is ranked second in the list. Note: the diagram was created by the author.
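The movie example can be reproduced with a short sketch. It assumes one held-out "next item" per user; the function name `hit_rate_at_k` and the placeholder titles "Movie A" and "Movie B" are hypothetical, and only "Gatsby" comes from the example above.

```python
from typing import Dict, List


def hit_rate_at_k(
    recs: Dict[str, List[str]], next_item: Dict[str, str], k: int
) -> float:
    """Fraction of users whose held-out item appears in their top-K list."""
    hits = sum(1 for user, ranked in recs.items() if next_item[user] in ranked[:k])
    return hits / len(recs)


# One user, the watched movie is ranked second.
movie_recs = {"u1": ["Movie A", "Gatsby", "Movie B"]}
watched = {"u1": "Gatsby"}

print(hit_rate_at_k(movie_recs, watched, k=3))  # 1.0
print(hit_rate_at_k(movie_recs, watched, k=1))  # 0.0
```

With K = 3 the watched movie is inside the list, so the hit rate is 1; with K = 1 it falls outside the cutoff and the hit rate drops to 0.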
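Three other metrics are widely used in the literature to assess the accuracy of a recommender system and, more specifically, to capture how well the hits are ranked in the list:
Mean Average Precision (MAP@K): This metric measures how well the recommender system orders the relevant items. A common formulation averages, for each user, the precision at every position that holds a relevant item:
$$MAP@K = \frac{1}{n}\sum_{u=1}^{n}\frac{1}{\min(K, |\mathrm{Rel}(u)|)}\sum_{j=1}^{K}P_u@j \cdot \mathrm{hit}(i_j, u), \qquad P_u@j = \frac{1}{j}\sum_{l=1}^{j}\mathrm{hit}(i_l, u)$$
Top-K Mean Reciprocal Rank (MRR@K): This metric is a special case of MAP@K in which there is only one relevant item. It measures how well the recommender system ranks the relevant item against the irrelevant ones:
$$MRR@K = \frac{1}{n}\sum_{u=1}^{n}\frac{1}{\mathrm{rank}(i_j)}$$
where iⱼ is the relevant recommended item within the top-K recommended items and rank(iⱼ) is its position in the list. A user with no relevant item in the top K contributes 0.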
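A minimal sketch of both metrics follows. The normalisation of average precision by min(K, |Rel(u)|) is one common convention and is an assumption here, since other variants exist. All function and variable names are illustrative.

```python
from typing import Callable, Dict, List, Set


def average_precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """AP@K for one user, normalised by min(K, |Rel(u)|) (assumed convention)."""
    if not relevant:
        return 0.0
    hits = 0
    score = 0.0
    for pos, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / pos  # precision at this position
    return score / min(k, len(relevant))


def reciprocal_rank_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1 / position of the first relevant item in the top K, or 0 if none."""
    for pos, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


def mean_over_users(
    metric: Callable[[List[str], Set[str], int], float],
    recs: Dict[str, List[str]],
    rel: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average a per-user metric over all users in `recs`."""
    return sum(metric(recs[u], rel.get(u, set()), k) for u in recs) / len(recs)


rank_recs = {"u1": ["a", "b", "c", "d"], "u2": ["Movie A", "Gatsby", "Movie B"]}
rank_rel = {"u1": {"a", "c"}, "u2": {"Gatsby"}}

map_at_4 = mean_over_users(average_precision_at_k, rank_recs, rank_rel, k=4)
mrr_at_4 = mean_over_users(reciprocal_rank_at_k, rank_recs, rank_rel, k=4)
print(round(map_at_4, 4), mrr_at_4)  # 0.6667 0.75
```

For user u2, who has a single relevant item ranked second, the average precision and the reciprocal rank are both 0.5, which illustrates why MRR@K can be seen as a special case of MAP@K.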
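Normalized Discounted Cumulative Gain (NDCG@K): Order matters for both MAP@K and NDCG@K. The main difference is that MAP@K measures binary relevance (an item is either of interest or not), while NDCG@K allows relevance scores in the form of real numbers. In a common formulation, with rel(iⱼ, u) the relevance score of item iⱼ for user u:
$$NDCG@K = \frac{1}{n}\sum_{u=1}^{n}\frac{DCG_u@K}{IDCG_u@K}, \qquad DCG_u@K = \sum_{j=1}^{K}\frac{\mathrm{rel}(i_j, u)}{\log_2(j+1)}$$
where IDCG@K is the ideal discounted cumulative gain, that is, the DCG@K of the best possible ranking for user u (Rel(u) contains only the relevant items of user u):
$$IDCG_u@K = \sum_{j=1}^{\min(K, |\mathrm{Rel}(u)|)}\frac{\mathrm{rel}(i^*_j, u)}{\log_2(j+1)}$$
where i*₁, i*₂, … are the items of Rel(u) sorted by decreasing relevance.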
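The sketch below computes NDCG@K for a single user. It assumes the linear-gain variant, where each gain is the relevance score divided by log2(position + 1); an exponential-gain variant also exists and would give different numbers. The names `dcg_at_k`, `ndcg_at_k`, and the toy relevance scores are illustrative.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain of a list of gains, truncated at K."""
    return sum(g / math.log2(pos + 1) for pos, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked: List[str], relevance: Dict[str, float], k: int) -> float:
    """NDCG@K for one user; `relevance` maps relevant items to graded scores."""
    gains = [relevance.get(item, 0.0) for item in ranked]
    # Ideal ranking: the user's relevant items sorted by decreasing relevance.
    ideal = sorted(relevance.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


graded = {"a": 3.0, "c": 2.0, "d": 1.0}
print(ndcg_at_k(["a", "b", "c"], graded, k=3))  # roughly 0.84
print(ndcg_at_k(["a", "c", "d"], graded, k=3))  # 1.0 for the ideal order
```

Averaging `ndcg_at_k` over all users gives the system-level NDCG@K. With binary relevance (all scores equal to 1), the function reduces to a rank-discounted hit count.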
Despite the relevance of these metrics in the assessment of recommender systems, recommending the same kind of products can sometimes be counterproductive, and accuracy alone is not sufficient in real-world applications (Netflix, YouTube, etc.). For instance, a Netflix user might be attracted to new kinds of movies and series, and a YouTube user often wants to watch new videos. The user should be surprised, and a good recommender system should be able to recommend unexpected yet attractive items. The idea of not relying solely on precision-based metrics is also supported in industry (e.g. the Discover Weekly feature on Spotify). The research literature likewise states that the purpose of an evaluation protocol is to assess the quality of the recommended items, and not only their accuracy or utility. In this context, only an online experiment, in which users of the system judge the quality of the recommendations, can evaluate them reliably. Therefore, when evaluating offline, it is necessary to consider metrics other than accuracy alone.
To draw a reliable conclusion about the quality of recommendations, the recommender system should provide suggestions that are not only accurate but also useful. Indeed, an extremely popular item may be an accurate suggestion without being interesting to the user. Serendipity, novelty, and diversity are alternative metrics that complement accuracy metrics.
The concept of serendipity in recommender systems refers to the system’s ability to recommend unexpected and appealing products to users. A metric has been proposed in the literature to measure serendipity by evaluating the precision of recommended items after filtering out those that are too obvious. The equation below outlines the computation of this metric:
$$SER@K = \frac{1}{n}\sum_{u=1}^{n}\frac{1}{K}\sum_{j=1}^{K}\mathrm{hit\_non\_pop}(i_j, u)$$
The variable hit_non_pop is similar to hit, but it treats the top-k most popular items as non-relevant, even if they are included in the test set of user u. This is because popular items are considered obvious, as they are widely known by most users.
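A minimal sketch of this idea is shown below. It assumes that popularity is measured by counting training interactions per item. The cutoff for "most popular" is exposed as a separate, hypothetical parameter `n_popular`, since the text does not say whether it equals the length of the recommendation list. All names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def serendipity_at_k(
    recs: Dict[str, List[str]],
    rel: Dict[str, Set[str]],
    train_interactions: List[Tuple[str, str]],
    k: int,
    n_popular: int,
) -> float:
    """Precision@K where the `n_popular` most popular training items never count as hits."""
    popularity = Counter(item for _user, item in train_interactions)
    popular = {item for item, _count in popularity.most_common(n_popular)}
    total = 0.0
    for user, ranked in recs.items():
        # hit_non_pop: relevant items that are not among the popular ones.
        non_obvious = rel.get(user, set()) - popular
        total += sum(1 for item in ranked[:k] if item in non_obvious) / k
    return total / len(recs)


train = [("u1", "hit_song"), ("u2", "hit_song"), ("u3", "hit_song"),
         ("u1", "indie"), ("u2", "b_side")]
ser_recs = {"u9": ["hit_song", "indie"]}
ser_rel = {"u9": {"hit_song", "indie"}}

print(serendipity_at_k(ser_recs, ser_rel, train, k=2, n_popular=1))  # 0.5
```

Both recommended items are relevant, so plain precision would be 1.0, but the most popular item is discarded as obvious and the serendipity score is 0.5.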
A novelty metric has also been introduced in the literature to assess the capability of a recommender system to suggest items that are unlikely to be known by a user. The purpose of this metric is to support recommenders in facilitating users’ discovery of new items. Note that, unlike the previous metrics, the novelty metric focuses solely on the novelty of the recommended items and does not take into account their correctness or relevance. The equation below outlines its computation:
$$NOV@K = -\frac{1}{n \cdot K}\sum_{u=1}^{n}\sum_{j=1}^{K}\log_2 P_{train}(i_j)$$
The function Pₜᵣₐᵢₙ : I → [0, 1] returns the fraction of feedback attributed to item i in the training set. This value represents the probability of observing a given item in the training set, that is, the number of ratings related to that item divided by the total number of ratings available.
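The novelty computation can be sketched as the average self-information of the recommended items. This sketch assumes a base-2 logarithm, that every recommended item appears at least once in the training set (otherwise the logarithm is undefined), and that each list holds at least K items. The names `novelty_at_k` and `train_items` are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List


def novelty_at_k(recs: Dict[str, List[str]], train_items: List[str], k: int) -> float:
    """Mean of -log2(P_train(item)) over the top-K items of every user."""
    counts = Counter(train_items)
    total_feedback = len(train_items)
    total = 0.0
    for ranked in recs.values():
        for item in ranked[:k]:
            # P_train: share of all training feedback attributed to this item.
            p_train = counts[item] / total_feedback
            total += -math.log2(p_train)
    return total / (len(recs) * k)


# One entry per rating: "a" holds half of the feedback, "c" and "d" one eighth each.
train_items = ["a", "a", "a", "a", "b", "b", "c", "d"]
nov_recs = {"u1": ["a", "b"], "u2": ["c", "b"]}

print(novelty_at_k(nov_recs, train_items, k=2))  # 2.0
```

Rarely rated items such as "c" contribute 3 bits, while the most rated item "a" contributes only 1 bit, so lists made of long-tail items score higher regardless of whether they are relevant.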
What’s next? Evaluation Protocols
In most research on recommender systems, evaluation is carried out in an offline setting, where the above-mentioned metrics are measured solely on past interactions. However, this has been shown to be insufficient in practice.
In the next blog post, we will highlight different evaluation methodologies, including offline evaluation, online evaluation, and user studies, and discuss their pros and cons.
 Mouzhi Ge, Carla Delgado-Battenfeld, and Dietmar Jannach. Beyond accuracy: Evaluating recommender systems by coverage and serendipity. In Proceedings of the Fourth ACM Conference on Recommender Systems, RecSys ’10, page 257–260, New York, NY, USA, 2010. Association for Computing Machinery.
 Jonathan L. Herlocker, Joseph A. Konstan, Loren G. Terveen, and John T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5–53, January 2004.
 Marco de Gemmis, Pasquale Lops, Giovanni Semeraro, and Cataldo Musto. An investigation on the serendipity problem in recommender systems. Inf. Process. Manage., 51(5):695–717, September 2015.
 Saúl Vargas and Pablo Castells. Rank and relevance in novelty and diversity metrics for recommender systems. In Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, page 109–116, New York, NY, USA, 2011. Association for Computing Machinery.
 Enrico Palumbo, Diego Monti, Giuseppe Rizzo, Raphaël Troncy, and Elena Baralis. entity2rec: Property-specific knowledge graph embeddings for item recommendation. Expert Syst. Appl., 151:113235, 2020.