Data Analytics
Data Visualisation
User Insights

Recommender System Optimisation

Improving personalised listing experience with user data analysis

Carol Hsu
March 8, 2022

Photo by Soundtrap on Unsplash

Summary

In this project, I conducted a user analysis to gain insights into user preferences and improve Deezer's music recommender feature. Understanding user behavior is crucial in the competitive music streaming industry, where it drives both user experience and customer engagement. The findings are intended to help Deezer refine its Flow feature and stay ahead in the industry.

Background

523.9 million people globally subscribed to a music streaming service (TechCrunch, 2022)

Deezer, a French music streaming service, offers 73 million tracks and personalized features based on subscription types.

In 2016, Deezer introduced an exclusive feature called Flow, an optimized recommendation system based on the user's mood. Flow recommends new or previously listened-to tracks based on listening history, context, and time, allowing users to discover music that fits their mood, context, or specific events.

The Challenge

In today's information-rich environment, users often face decision overload, leading to delayed decisions and reduced motivation to stay with a service. To increase competitive advantage and enhance user loyalty, businesses need to develop recommender systems that automatically suggest content based on user preferences (Maasø & Hagen, 2020; Hansen et al., 2021).

The Approach

The original goal of this Kaggle challenge is to improve Deezer's recommender system by accurately predicting and suggesting tracks that users will listen to for more than 30 seconds. However, in this project, I went beyond just predicting track skips and conducted a comprehensive user preference analysis to gain insights from user age, activities, music preferences, and listening patterns to optimize the user experience of Deezer's recommendation system.

Data Source

The target variable of this dataset is is_listened. There are 7,558,834 observations with 14 predictors.

  • genre_id: ID of the genre of the song
  • media_id: ID of the song listened to by the user
  • album_id: ID of the album of the song
  • media_duration: duration of the song
  • user_gender: gender of the user
  • user_id: ID of the user
  • context_type: type of context in which the song was listened to (playlist, album, ...)
  • release_date: the release date of the song with the format YYYYMMDD
  • ts_listen: timestamp of the listening in UNIX time
  • platform_name: type of os
  • platform_family: type of device
  • user_age: age of the user
  • listen_type: whether the song was listened to in Flow or not
  • artist_id: ID of the artist of the song
  • is_listened: 1 refers to a track that has been listened to, 0 otherwise

Methods & Techniques

  • Data Preprocessing
  • Data Exploration
  • Feature Engineering and Data Analysis

Preprocessing

During the data exploration phase, we identified three main issues in the training dataset:

  1. There are 17 entries in the "release_date" field with a value of "30000101", which is not a plausible release date.
  2. There are 29,779 entries where the "ts_listen" timestamp is earlier than the "release_date", meaning the track appears to have been listened to before it was released, which is unexpected.
  3. There are 2 records where the "ts_listen" timestamp is earlier than the founding year of Deezer (2006), which is not possible.
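The three checks can be expressed as boolean flags. The following is a minimal sketch in pandas, not the notebook's actual code: the function name and flag column names are hypothetical, and it assumes that `ts_listen` holds UNIX seconds, that `release_date` is stored as a YYYYMMDD integer, and that the anomaly in the second check is a listen dated before the track's release.

```python
import pandas as pd

DEEZER_FOUNDING_YEAR = 2006          # year stated in the text
PLACEHOLDER_RELEASE_DATE = 30000101  # invalid value reported in the text


def flag_invalid_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Add one boolean column per data issue (hypothetical column names)."""
    out = df.copy()
    listen_dt = pd.to_datetime(out["ts_listen"], unit="s")
    # Express the listening date as a YYYYMMDD integer so it can be compared
    # with release_date without parsing the out-of-range year 3000.
    listen_ymd = (
        listen_dt.dt.year * 10000 + listen_dt.dt.month * 100 + listen_dt.dt.day
    )
    out["bad_release_date"] = out["release_date"] == PLACEHOLDER_RELEASE_DATE
    # Assumption: placeholder dates are reported separately, so they are
    # excluded here to avoid counting the same row under two issues.
    out["listen_before_release"] = (
        (listen_ymd < out["release_date"]) & ~out["bad_release_date"]
    )
    out["listen_before_deezer"] = listen_dt.dt.year < DEEZER_FOUNDING_YEAR
    return out
```

Whether flagged rows should be dropped or corrected depends on the analysis. Keeping the flags as columns leaves that decision open.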

Feature engineering

To gain insights into user preferences, behaviors, and listening patterns, several feature engineering techniques were applied. This involved extracting time-related features from the "ts_listen" field, such as year, month, day, weekday, is_weekend, hour, minutes, and seconds. Additionally, season and sessions were derived from the month and hour, categorizing them into four seasons and six time sessions.
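As an illustration, the time features could be derived along these lines. This is a sketch under stated assumptions: the text does not give the season or session boundaries, so the month-to-season mapping (northern-hemisphere meteorological seasons) and the six four-hour sessions are hypothetical choices, and timestamps are interpreted as UTC.

```python
import pandas as pd

# Assumed mapping: northern-hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}
# Assumed session boundaries: six four-hour blocks starting at 00:00.
SESSION_EDGES = [0, 4, 8, 12, 16, 20, 24]
SESSION_LABELS = [
    "midnight", "early morning", "morning", "afternoon", "evening", "night",
]


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive calendar and session features from the UNIX timestamp."""
    out = df.copy()
    ts = pd.to_datetime(out["ts_listen"], unit="s")  # interpreted as UTC
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.weekday  # Monday=0 ... Sunday=6
    out["is_weekend"] = out["weekday"] >= 5
    out["hour"] = ts.dt.hour
    out["minutes"] = ts.dt.minute
    out["seconds"] = ts.dt.second
    out["season"] = out["month"].map(SEASON_BY_MONTH)
    # right=False makes each bin include its lower edge: [0, 4), [4, 8), ...
    out["session"] = pd.cut(
        out["hour"], bins=SESSION_EDGES, right=False, labels=SESSION_LABELS
    ).astype(str)
    return out
```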

User-related features were also created by aggregating user_id, ts_listen, user_age, media_duration, and media_id. This included features such as

  • listen_diff, which represents the duration of time a user listens to music
  • listen_percent, which indicates the percentage of a song that is listened to
  • time_gap, which represents the time gap before the next listening session
  • listen_start, which represents the time when a user starts listening to music
  • listen_end, which represents the time when a user stops listening to music.

These features were engineered to provide a deeper understanding of user behaviors and listening patterns for further analysis and modeling.
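The exact definitions of these features are not spelled out here, so the following is only one plausible reading, sketched in pandas. It assumes that a track's listening time is the time until the same user's next track starts, capped at the track's duration, and that `ts_listen` and `media_duration` are both in seconds. The function name is hypothetical.

```python
import numpy as np
import pandas as pd


def add_listening_features(df: pd.DataFrame) -> pd.DataFrame:
    """Per-user listening features under the assumptions stated above."""
    out = df.sort_values(["user_id", "ts_listen"]).copy()
    out["listen_start"] = out["ts_listen"]
    # Start time of the same user's next track (NaN for the user's last row).
    next_start = out.groupby("user_id")["ts_listen"].shift(-1)
    gap_to_next = next_start - out["listen_start"]
    # Assumption: a user cannot listen longer than the track lasts.
    out["listen_diff"] = np.minimum(gap_to_next, out["media_duration"])
    out["listen_percent"] = 100 * out["listen_diff"] / out["media_duration"]
    out["listen_end"] = out["listen_start"] + out["listen_diff"]
    # Idle time between the end of this listen and the start of the next one.
    out["time_gap"] = next_start - out["listen_end"]
    return out
```

Under this reading, the last track of each user has no following timestamp, so its derived values stay missing rather than being guessed.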

More details can be found in the Deezer data analysis result.

Data Analysis

Feature FLOW

I analyzed the "listen_type" feature, which indicates whether a user listened to music using the FLOW function (listen_type = 1) or not (listen_type = 0). I aggregated attributes such as user_id, user_age, and media_id (songs) to calculate the average number of songs listened per user and the percentage of songs listened across different age groups.

Table 1 presents the average length of songs listened to and the percentage of each song listened to, with and without the FLOW function. The results clearly show that users who do not use the FLOW function listen to roughly three times as much of each song as users who do. Specifically, users who do not use the FLOW function listen to nearly 60% of a song, while users who use it listen to less than 20% of a song recommended by the system.

Table 1. Average media listening percentage with and without the FLOW function

Table 2 adds user age to compare listening behaviour across ten age groups.

Table 2. Media listening duration

Table 3 adds listen_type to compare listening behaviour across the ten age groups with and without the FLOW function.

Table 3. Media listening percentage with and without the FLOW function, by age group
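Aggregations of this kind can be sketched with a group-by and a pivot table. The snippet below is illustrative only: it assumes a `listen_percent` column has already been engineered, groups by raw `user_age` rather than the ten age groups (whose boundaries are not given), and uses hypothetical function names.

```python
import pandas as pd


def listen_percent_by_flow(df: pd.DataFrame) -> pd.Series:
    """Mean listening percentage for listen_type = 0 and listen_type = 1."""
    return df.groupby("listen_type")["listen_percent"].mean()


def listen_percent_by_age_and_flow(df: pd.DataFrame) -> pd.DataFrame:
    """Rows are user ages, columns are listen_type, cells are mean percentages."""
    return df.pivot_table(
        index="user_age",
        columns="listen_type",
        values="listen_percent",
        aggfunc="mean",
    )
```

The pivoted layout puts the FLOW and non-FLOW averages side by side for each age, which makes the comparison easier to read.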

User behaviour & preference analysis

  • Time: Time is a crucial factor that influences user preferences, which can vary depending on the time of day. In this analysis, I divided the 24-hour period into six sessions: midnight, early morning, morning, afternoon, evening, and night. The session graph below shows that users typically start listening to music in the morning, reach a peak in the afternoon, and then experience a drop in the evening. The hour graph provides further insights into user activity throughout the 24-hour period.
User Listen Time in Session

User Listen Time in Hour
  • Medium: The features "platform_family" and "platform_name" refer to the devices and operating systems that users use to access the Deezer app. As the data is encoded with numeric values, we are unable to determine the specific devices or operating systems used by users. However, it is observed that "platform_family" 0 and "platform_name" 0 are the most preferred mediums among users.
Medium Source - Platform Family

Medium Source - Platform Name
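The counts behind charts like these can be obtained with simple frequency tables. This is a minimal sketch that counts listening events (rows) rather than distinct users, which is an assumption about how the charts were built. The function names are hypothetical, and hours are taken in UTC.

```python
import pandas as pd


def listens_per_hour(df: pd.DataFrame) -> pd.Series:
    """Number of listening events for each hour of the day (0 to 23)."""
    hour = pd.to_datetime(df["ts_listen"], unit="s").dt.hour
    return hour.value_counts().sort_index()


def platform_usage(df: pd.DataFrame, column: str = "platform_family") -> pd.Series:
    """Listening events per encoded platform value, most frequent first."""
    return df[column].value_counts()
```

Calling `platform_usage(df, "platform_name")` gives the same breakdown for the operating-system field.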

Genre Analysis

When analyzing content, genre is a significant feature that can vary over time and be influenced by the user's surroundings. Through our analysis, we identified 6 main genres, with genre IDs 0, 7, 10, 25, 27, and 14, which were highly popular across all other attributes such as hour, session, context, platform, listen type, and user age. In other words, regardless of the time of day, user age, or context, these 6 genres were consistently favored by users. The key findings are listed below, and the graphical analysis can be seen in deezer_eda_result.

Top 10 Albums

Top 10 Artists

Top 10 Genres
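One way to check whether a genre is popular regardless of another attribute is to take the top genres within each value of that attribute and intersect the results. The sketch below illustrates the idea. The function names and the default of `k=6` are assumptions for illustration, not the notebook's method.

```python
import pandas as pd


def top_genres(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """The n most frequently listened genre IDs overall."""
    return df["genre_id"].value_counts().head(n)


def consistently_top_genres(df: pd.DataFrame, group_col: str, k: int = 6) -> set:
    """Genre IDs that rank in the top k within every value of group_col."""
    counts = df.groupby([group_col, "genre_id"]).size()
    top_sets = [
        set(group.nlargest(k).index.get_level_values("genre_id"))
        for _, group in counts.groupby(level=group_col)
    ]
    return set.intersection(*top_sets) if top_sets else set()
```

For example, `consistently_top_genres(df, "session")` would return the genres that stay in the top six in every time session, assuming a `session` column exists.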

Key Findings

Feature Flow

  • The number of songs listened to, the length of songs, and the percentage of songs listened increased gradually with age. Younger users were more likely to skip songs compared to the 30-year-old age group.
  • Users aged 30 were more likely to finish songs recommended by the system.
  • Users aged 30 listened to nearly two times more songs than users aged above 20.
  • The majority of users listening in the flow skipped more songs than users who were not in the flow, except for users aged 19 and 30.

User Behaviour

  • The number of active users increased dramatically between 5 am and 6 am.
  • The highest number of listeners showed up between 4 pm and 6 pm, with figures above 500,000.
  • The number of users decreased steadily in the evening and dropped to 200,000 at 11 pm.

Genre Preference

  • Genre_id 0 was the most popular genre in the top 10 ranking.
  • Genre_id 0, 7, 10, 25, 27, 14, 734, 297, and 2744 were the most popular genres.
  • Popular genres were favoured across most sessions, with some exceptions: genre_id 2744 was not popular at night or midnight, genre_id 50 was preferred at night, and genre_id 3645 at midnight.
  • The number of users listening to genre 0 was four times higher outside the FLOW, whereas a wider variety of genres appeared when users were listening in the FLOW.
  • Genre preference differed between user age groups. Genre_id 0 dominated genre preference across all age groups, with users aged 19 as its main audience.

Conclusion

In conclusion, time is a critical factor that influences user preferences in music listening. Genre preferences may vary by user age, platform, and listening environment. To improve the FLOW feature and reduce the bounce rate, I suggest a context-based recommendation system that takes personalized features into account. Such a system can provide relevant content based on user preferences, listening habits, and contextual factors, and could enhance user satisfaction with Deezer's recommendation system.

Project Reflection

It was a great experience to work on a dataset that contains millions of entries. Ideally, it would be good to start data processing in a database due to the simplicity of programming. In addition, performing data queries can help us have a quick glance at the data and gain a better understanding when performing statistical calculations. On the other hand, many categorical attributes in the given dataset are replaced with numeric labels. It would be helpful to have the original labels of each categorical variable, as they can assist analysts in forming problem statements or hypotheses, and provide better interpretation when analyzing data.

Code File

GitHub: Python Notebook

Reference

  • Adiyansjah, Gunawan, A. A., & Suhartono, D. (2019). Music Recommender System Based on Genre using Convolutional Recurrent Neural Networks. Procedia Computer Science, 157, 99-109.
  • Hansen, C., Mehrotra, R., Hansen, C., Brost, B., Maystre, L., & Lalmas, M. (2021). Shifting Consumption towards Diverse Content on Music Streaming Platforms. Proceedings of the 14th ACM International Conference on Web Search and Data Mining, 238-246.
  • Maasø, A., & Hagen, A. N. (2020). Metrics and Decision-Making in Music Streaming. Popular Communication, 18(1), 18-31.
