Collaborative filtering in PHP

In the era of PyTorch and TensorFlow 2.0, one may wonder why PHP would even come to mind when mentioning machine learning. No eager execution, no automatic differentiation, no vectorization of operations, no array broadcasting. It sounds hard to even think of implementing an algorithm without all these features and tools, let alone have it reach an acceptable level of performance.

And yet, PHP still dominates the world of web development. This is notably the case for e-commerce, where four of the top five players (WooCommerce, Magento, VirtueMart and PrestaShop), all written in PHP, already account for 45% of the worldwide market share. As online retail keeps growing (it now accounts for more than three-quarters of overall retail growth), it is also becoming more competitive, and retailers are on the look out for any possible way to boost their sales. Using a recommender system to highlight the products a customer is the most likely to buy is one of them.

While any large e-commerce player will have ample opportunity to build separate machine learning modules using an adequate framework, it is not the case for the majority of small retailers whose infrastructure is in no way designed to support anything other than the monolith that manages their entire business. What to do in such cases?

The idea is to keep it simple. We’ll be implementing the classic low-rank matrix factorization algorithm, using the best tools PHP has to offer. In this case, MathPHP will take care of all the linear algebra and statistics for us.

Let’s start by defining our data. We have users, and we have products. They can be linked together by what we’re going to call a score: there are many ways to build it, but most likely it will be either a rating (e.g. 1 to 5 stars) or a purchase indicator (1 if the user bought the product, 0 else). We’ll use \(s(i, j)\) to denote the score given to product \(i\) by user \(j\), not necessarily defined for all pairs, and \(S\) the set of pairs \((i, j)\) for which \(s(i, j)\) is defined (i.e. a rating is available).

Now on to our model. The idea is to represent a product with \(n\) features, and a user with \(n\) weights, each expressing the correlation bewteen the score they would give a product and one of its the features. Therefore, a prediction of the score is given by the following:

\(\sum\limits_{k = 1}^{n} u_k p_k\)
 

Or, representing our users and products by vectors, simply \(\vec{u} \cdot \vec{p}\). In order to minimize the difference between the predicted scores and the real ones, we will be working with the following cost function:

\(J = \frac{1}{2} \sum\limits_{(i, j) \in S} (\vec{u_j} \cdot \vec{p_i} – s(i, j))^2 + \frac{\lambda}{2} \sum\limits_{i = 1}^{n_p} \|\vec{p_i}\|^2 + \frac{\lambda}{2} \sum\limits_{j = 1}^{n_u} \|\vec{u_j}\|^2\)
 

The first term sums the squares of the differences between the predictions and the targets, while the other two are regularization terms (\(L^2\) here but other methods can be used). This optimization objective yields the following gradients:

\(\frac{\partial J}{\partial \vec{p_i}} = \sum\limits_{j \mid (i, j) \in S} (\vec{u_j} \cdot \vec{p_i} – s(i, j)) \vec{u_j} + \lambda \vec{p_i}\)
 
\(\frac{\partial J}{\partial \vec{u_j}} = \sum\limits_{i \mid (i, j) \in S} (\vec{u_j} \cdot \vec{p_i} – s(i, j)) \vec{p_i} + \lambda \vec{u_j}\)
 

We can then update \(\vec{p_i}\) and \(\vec{u_j}\) using gradient descent, or one of its more advanced variations such as Adam. In the case of the former, which we’ll use for the sake of simplicity, this gives:

\(\vec{p_i} := \vec{p_i} – \alpha \frac{\partial J}{\partial \vec{p_i}}\)
 
\(\vec{u_j} := \vec{u_j} – \alpha \frac{\partial J}{\partial \vec{u_j}}\)
 

where \(\alpha\) is the chosen learning rate.

How does this translate into code? Let’s suppose we have the following:

  • $products, a collection of Product objects
  • $users, a collection of Users objects
  • $ratings, a collection of Rating objects, with the product, user and rating properties

Let’s start by importing the required classes and setting up our configuration:

We’ll be using SplObjectStorage to store our embeddings, so that we don’t have to modify or extend our e-commerce software’s classes, using objects as keys instead. We will then initialize them using values from the chosen distribution (here a normal distribution of mean 0 and standard deviation 1, though other techniques can be used).

Now onto the first epoch. We’ll use the same method to store our gradients:

We can then loop through the ratings, and for each of them, compute the corresponding gradient part for the associated product and user. After this, we sum these gradient parts so as to obtain the full gradient in the above formula, minus the regularization terms.

Finally, we perform gradient descent, subtracting the gradients from the ratings as well as those from the regularization terms:

Now all there is left to do is wrap the code for the epoch in a loop and let it run for the specified number of times. Once this is done, obtaining a prediction is as simple as this:

It is also possible to look for similar products by comparing their embeddings:

You may have noticed that if a user hasn’t, say, rated or bought a product yet, the optimization objective will only be comprised of the regularization terms, leading to the user’s representation being a vector of zeros, and therefore predicting a score of zero for any product. This problem is often solved with mean normalization, which consists in subtracting, for each product, the mean of its ratings across all users before running the algorithm (and adding it back after prediction, of course).

However, all this will achieve in those cases is predict the average score instead of zero, and normalization over small samples may cause the model to be a lot less robust. For these reasons, I would suggest to simply ignore users and products under a certain threshold of ratings.

The full implementation is as follows:

And this is all you need to implement collaborative filtering on any of the e-commerce systems built with PHP.

No, objects are not automatically passed by reference

While it makes sense for PHP beginners to first learn that objects are passed by reference rather than focus on technical details they need not comprehend yet, it worries me how many experienced developers still hold this belief.

The basic example used to justify said misconception is the following:

The function func has its sole argument passed by value, yet it somehow changed the state of $object. Is that not what passing by reference does?

Well, not exactly. Consider the following, where the object is replaced with a new one inside the function:

Here, $object remains unchanged. However, if we add a reference:

We now revert to the original behavior.

Now it has been shown that references do make a difference for objects, the best way to understand why is to clearly define both.

A reference, first, is nothing but an alias, which means that both variables point to the same location in memory and can therefore be used interchangeably.

An object variable, on the other hand, doesn’t contain the object itself but an identifier that allows the runtime to access its content. In fact, it works exactly like pointers in C++, and therefore explains PHP’s use of the same syntax as the ‘member of pointer’ operator (->), which translates to ‘point to the given property for the object of given identifier’.

What happens in the first example is that this identifier is passed by value and copied to $arg in the function scope. Then, the runtime is asked to modify the property prop of the object corresponding to the identifier. The variable $object in the global scope was never modified, but the object it points to was.

In the second example, the identifier is once again passed by value, but then, the local variable $arg is overridden with another identifier to another object, and it is the property prop of that second object that is modified, explaining why the original one remains unchanged.

The third example, finally, has the identifier passed by reference to func. A second object is once again created, and its identifier is saved in the alias $arg, which is, by definition, just another way to refer to the variable $object in the global scope, which in turn now points to a new, modified object.

Back to pointers to objects in C++, the code below shows how they can be used to obtain the exact same behavior as PHP objects:

Though, if you know how pointers work, you probably didn’t need to read this article anyway.

In C#, the behavior is once again the same, but the introduction of reference types (variables that store references to their data, as opposed to directly containing the data) abstracts the concept of pointers and allows us to use objects directly, and with the ‘member of’ operator (.) only:

Please note that C# actually supports pointers (to value types and other pointer types), though they should only be used to interact with native functions or in performance-critical code.