How BARK runs a recommendation engine for dogs

Life@BARK
7 min read · Feb 10, 2020


By Alex Lee, Senior Machine Learning Engineer

These days, lots of e-commerce businesses are leveraging recommendation engines to get customers the items they want most. Typically, to solve recommendation problems you want a rich dataset of users’ interactions with the items you will be recommending (this is called a “warm-start”). However, it takes time to build up a database like this, so many startups are stuck trying to make recommendations in a “cold-start” environment, especially in a high-growth business where many users have no purchase history.

But do you need a warm-start environment to run a high-converting recommendation engine? The answer is no, not necessarily. In this post, I’ll share how our data science team at tech@BARK was able to significantly increase the return on our recommendation engine, called “Add to Box” (ATB), in a cold-start environment by utilizing data beyond similar user preferences. Together with significant experience improvements and merchandising guidance, these changes increased add-on revenue 12x in the past year.

What is Add to Box (ATB)?

At BARK we have lots of products, but in this post I’m going to focus on just one: BarkBox, a monthly, customized subscription box of themed toys and treats for dogs. Every month, we give our users the option to add more items (in e-commerce these items are called “stock-keeping units,” or SKUs) to their box at an additional cost, and we provide recommendations on items they might like to include. This recommendation, powered by Machine Learning, is called Add to Box (ATB).

Every month, we have about 50–70 different products available for ATB that we present to users in two places:

  • Email: Before each month begins, we send our users a recommendation of 3 SKUs (an example can be seen below)
[Image: The BARK recommendation engine at work]
  • Dashboard: When users click through that email, or access their account dashboard via the website, we present them with a maximum of 15 SKUs (an example can be seen below)
[Image: The BARK recommendation engine at work]

We want to show products that get customers (and pupstomers) excited every month.

While individualized customer preferences drive much of this, seasonal factors play a large role in the offerings and, in turn, in everyone’s purchasing decisions. The ATB approach to seasonality is to offer items that are hyper-relevant for a short period of time. For us, that means pushing wearable Cupid wings during Valentine’s Day, a menorah chew toy during Hanukkah, Halloween-themed chews, and so on. Seasonal items are part of what makes the subscription continually fresh and exciting. However, seasonal items are typically only available for one month. So, in addition to having many users with little to no purchase history, we also have no data on any specific seasonal item. It would be a mistake to ignore such a huge part of the offerings and user base. So, how do we approach these problems?

Use the data you do have

One of our core principles at tech@BARK is to always leverage the data you have, and to find ways to gather the data you don’t. In this case, our challenge hinges on the fact that we don’t have as much data as we’d like, but that doesn’t mean we don’t have any.

When users sign up, they create a user profile with information on their pet and their pet’s toy, treat, and chew preferences. This information helps us understand their initial preferences and what should go immediately into their first box. However, it does not provide any insight into what other items they would like to add to their subscription.
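To make this concrete, here is a hypothetical sketch of the kind of information a sign-up profile might capture. The field names and values are illustrative only, not our actual schema:

```python
# A hypothetical sign-up profile record (fields and values are illustrative only).
from dataclasses import dataclass, field
from typing import List


@dataclass
class DogProfile:
    name: str
    size: str                 # e.g. "small", "medium", "large"
    chew_strength: str        # e.g. "gentle", "average", "destroyer"
    toy_preference: str       # e.g. "plush", "fetch", "tug"
    treat_allergies: List[str] = field(default_factory=list)


profile = DogProfile(
    name="Biscuit",
    size="medium",
    chew_strength="destroyer",
    toy_preference="plush",
    treat_allergies=["chicken"],
)
```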

Bearing that in mind, the recommendation engine is partially based on users’ explicit preferences and characteristics at sign-up time. But since most of the items we offer month-to-month are new and have never been seen before, we cannot rely on the classic collaborative filtering methodology. Collaborative filtering depends on having other users who have already rated the items being offered, in addition to other items those users have rated. To address this, we’ve opted to use Neural Factorization Machines (NFM). The power of these models lies in their ability to use metadata to enrich predictions, so the item information we collect and use in the model has expanded over time to complement this approach. Initially, there was very little item metadata to go off of; now the item metadata we collect is quite rich and tells us, for example, whether an item is shreddable, wearable, or edible. Below is an architecture diagram with a little information on how we apply Neural Factorization Machines to this problem at BARK.

[Architecture diagram: how we apply Neural Factorization Machines at BARK]

This architecture lets us make more accurate predictions without having rich user-item ratings, and allows us to leverage the data we currently have in-house.
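To make the idea concrete, below is a minimal NFM-style model sketched in PyTorch. It is illustrative only: the feature fields, embedding sizes, and single-score output head are assumptions for the example, not our production architecture.

```python
# Minimal Neural Factorization Machine sketch (illustrative; not BARK's production model).
import torch
import torch.nn as nn


class NFM(nn.Module):
    def __init__(self, num_features: int, embed_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        # First-order (linear) weights and a global bias, as in a classic FM.
        self.linear = nn.Embedding(num_features, 1)
        self.bias = nn.Parameter(torch.zeros(1))
        # Second-order feature embeddings used by the Bi-Interaction pooling layer.
        self.embed = nn.Embedding(num_features, embed_dim)
        # Deep layers on top of the pooled interaction vector.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, feature_ids: torch.Tensor) -> torch.Tensor:
        # feature_ids: (batch, n_fields) integer ids for the active features of a
        # user-item pair, e.g. dog size and chew preference from the sign-up
        # profile plus item metadata flags like shreddable / wearable / edible.
        emb = self.embed(feature_ids)                                   # (B, F, D)
        # Bi-Interaction pooling: 0.5 * ((sum_i v_i)^2 - sum_i v_i^2)
        square_of_sum = emb.sum(dim=1).pow(2)                           # (B, D)
        sum_of_square = emb.pow(2).sum(dim=1)                           # (B, D)
        pooled = 0.5 * (square_of_sum - sum_of_square)                  # (B, D)
        first_order = self.linear(feature_ids).sum(dim=1).squeeze(-1)   # (B,)
        return self.bias + first_order + self.mlp(pooled).squeeze(-1)   # (B,)
```

Because every feature is embedded the same way, a brand-new seasonal item still gets a sensible score from its metadata, which is exactly what the cold-start setting needs.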

Measuring success

Great! We have a trained model that we know performs well on our data, but how do we assess its performance in production? This is more complicated than with a traditional classifier, where we can track performance immediately upon deployment; because we’re a subscription-based company with a calendar-based cadence, we can only track performance once per month.

Although the model is trained on purchase behavior, our current approach to measuring success is to look at metrics for the overall cohort of ATB users rather than purchase decisions at an individual user-item level. Month to month, we assess the total revenue generated by the ATB program, as well as the revenue per customer. Both of these numbers give us a feel for how well we’re doing at offering products that users want to buy.
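As a rough illustration, the cohort-level rollup can be as simple as a group-by over the month’s ATB sales. The column names and table shape here are assumptions, not our actual schema:

```python
# Illustrative cohort-level ATB metrics (column names are assumptions).
import pandas as pd


def atb_monthly_metrics(purchases: pd.DataFrame, cohort_sizes: pd.Series) -> pd.DataFrame:
    """purchases: one row per ATB add-on sale with 'month', 'user_id', 'revenue' columns.
    cohort_sizes: number of subscribers offered ATB each month, indexed by month."""
    by_month = purchases.groupby("month").agg(
        total_revenue=("revenue", "sum"),
        buyers=("user_id", "nunique"),
    )
    # Revenue per customer is measured over everyone offered ATB, not just buyers.
    by_month["revenue_per_customer"] = by_month["total_revenue"] / cohort_sizes
    return by_month
```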

As we continue to improve ATB, we’re starting to reframe the domain of the problem, which in turn changes how we measure the model’s performance. Once we have a set of recommendations, instead of looking purely at whether or not they will convert, we set up the objective function to maximize revenue across the entire cohort of individuals. To achieve this, we have written a customized loss function that measures the KL divergence between the predicted revenue distribution and the actual revenue distribution. By using this loss function, we can more accurately predict the expected revenue of any one trained model.
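A minimal sketch of what such a loss could look like is below. It assumes we compare normalized revenue distributions with PyTorch’s built-in KL divergence; the exact formulation used in production may differ.

```python
# Sketch of a KL-divergence loss over revenue distributions (assumptions noted above).
import torch
import torch.nn.functional as F


def revenue_kl_loss(predicted_revenue: torch.Tensor,
                    actual_revenue: torch.Tensor,
                    eps: float = 1e-8) -> torch.Tensor:
    """Both tensors hold non-negative revenue per item (or per revenue bucket)
    for the cohort. Each is normalized into a distribution, then we take
    KL(actual || predicted)."""
    pred = predicted_revenue.clamp_min(eps)
    pred = pred / pred.sum()
    actual = actual_revenue.clamp_min(eps)
    actual = actual / actual.sum()
    # F.kl_div expects log-probabilities as its first argument.
    return F.kl_div(pred.log(), actual, reduction="sum")
```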

Infrastructure Considerations

Nowadays it isn’t enough to have a well-performing model in a one-off script or Jupyter notebook; the model needs to be deployed in a scalable and resilient manner. For ATB ML we do this through a combination of Kubernetes and Airflow. This infrastructure stack allows us to easily swap newly trained models in and out of production with no hiccups. And if there are hiccups, we have centralized logging in Airflow that we can use to hunt down errors and resolve them more efficiently.

Airflow also allows us to combine all of the data pre-processing steps for the model, the model predictions, and the file formatting into one continuous pipeline with retries built in to resolve transient errors. This, paired with code optimizations, allows the pipeline to run efficiently end to end, which is crucial when we get inventory updates or anything else that might reset parameters during our production runs.
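For a sense of the shape of such a pipeline, here is an illustrative Airflow DAG with retries. The task names, schedule, and retry settings are assumptions rather than our production configuration, and the imports follow the Airflow 2.x layout.

```python
# Illustrative Airflow DAG for a monthly recommendation pipeline (not production config).
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path


def preprocess_data():
    ...  # pull profiles, purchases, and item metadata; build model features


def run_predictions():
    ...  # load the current model and score the subscriber pool


def format_output():
    ...  # write recommendations in the formats the email and dashboard expect


with DAG(
    dag_id="atb_recommendations",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
    # Retries handle transient errors without manual intervention.
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
) as dag:
    preprocess = PythonOperator(task_id="preprocess_data", python_callable=preprocess_data)
    predict = PythonOperator(task_id="run_model_predictions", python_callable=run_predictions)
    fmt = PythonOperator(task_id="format_output_files", python_callable=format_output)

    preprocess >> predict >> fmt
```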

Lessons Learned

This recommendation engine has been a labor of love for the tech@BARK data team over the past six months. Through this process, we have learned many lessons that we try to keep in mind when embarking on any data science modeling project.

1. Know your data

This might seem painfully obvious, but it is crucial to understand the scope of the data you do and don’t have when starting any data science task. At BARK, understanding the gaps in our data has allowed us to be more clever with our feature engineering and modeling choices.

2. Remember your audience

Ultimately, we want our subscribers to enjoy receiving their boxes and this depends on them getting excited by our assortment month-to-month. This means that we have to pay attention to SKUs that are seasonally relevant, even though they present a generalization challenge.

3. Deployment matters

We could craft the perfect recommendation engine, but if we are unable to efficiently make predictions across our whole subscriber pool, then the model has no utility. Having code and deployment strategies that let us efficiently leverage the model is hugely important.

4. How do you measure success?

This is never an easy question, but it is one we should always keep in mind. If you don’t know how to evaluate a model in tangible metrics, then you have no way to assess whether the work you are doing is making a meaningful impact.

Stay tuned for further updates on our recommendation engine, as well as all of the other unique and interesting things we’re doing at tech@BARK!
