In this project, I conducted a user analysis to understand user preferences and improve Deezer's music recommender feature. In the competitive music streaming industry, understanding user behavior is crucial for improving the user experience and increasing customer engagement. The findings are intended to inform changes to Deezer's Flow feature so that the service can stay ahead in the industry.
Globally, 523.9 million people subscribed to a music streaming service (TechCrunch, 2022).
Deezer, a French music streaming service, offers 73 million tracks and personalized features that depend on the subscription type.
In 2016, Deezer introduced an exclusive feature called Flow, a recommendation system tuned to the user's mood. Flow recommends new or previously played tracks based on the user's listening history, context, and time, so that users can discover music that fits their mood, context, or specific events.
In today's information-rich environment, users often face decision overload, which leads to delayed decisions and reduced motivation to stay with a service. To increase competitive advantage and enhance user loyalty, businesses need to develop recommender systems that automatically suggest content based on user preferences (Maasø & Hagen, 2020; Hansen et al., 2021).
The original goal of this Kaggle challenge is to improve Deezer's recommender system by accurately predicting and suggesting tracks that users will listen to for more than 30 seconds. However, in this project, I went beyond just predicting track skips and conducted a comprehensive user preference analysis to gain insights from user age, activities, music preferences, and listening patterns to optimize the user experience of Deezer's recommendation system.
The target variable of this dataset is is_listened. There are 7,558,834 observations with 14 predictors.
During the data exploration phase, we identified three main issues in the training dataset:
To gain insights into user preferences, behaviors, and listening patterns, several feature engineering techniques were applied. This involved extracting time-related features from the "ts_listen" field, such as year, month, day, weekday, is_weekend, hour, minutes, and seconds. Additionally, season and sessions were derived from the month and hour, categorizing them into four seasons and six time sessions.
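The time-related features described above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: it assumes that ts_listen holds Unix timestamps in seconds (interpreted as UTC), that seasons follow the Northern-hemisphere meteorological convention, and that the six sessions are equal four-hour blocks. The session labels and the function name add_time_features are hypothetical.

```python
import pandas as pd

# Assumption: Northern-hemisphere meteorological seasons.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Assumption: six equal four-hour sessions; the labels are illustrative only.
SESSION_BINS = [0, 4, 8, 12, 16, 20, 24]
SESSION_LABELS = ["night", "early_morning", "morning",
                  "afternoon", "evening", "late_evening"]


def add_time_features(df: pd.DataFrame, ts_col: str = "ts_listen") -> pd.DataFrame:
    """Return a copy of df with calendar features derived from ts_col."""
    out = df.copy()
    # Assumption: the column stores Unix time in seconds.
    ts = pd.to_datetime(out[ts_col], unit="s")
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["weekday"] = ts.dt.weekday                      # Monday = 0, Sunday = 6
    out["is_weekend"] = (ts.dt.weekday >= 5).astype(int)
    out["hour"] = ts.dt.hour
    out["minutes"] = ts.dt.minute
    out["seconds"] = ts.dt.second
    out["season"] = out["month"].map(SEASON_BY_MONTH)
    # right=False makes each bin [start, end), so hour 23 falls in [20, 24).
    out["session"] = pd.cut(out["hour"], bins=SESSION_BINS,
                            right=False, labels=SESSION_LABELS)
    return out
```

Calling add_time_features(train) on the training frame would leave the original frame untouched and return a new one with the extra columns, which keeps the raw data available for later checks.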
User-related features were also created by aggregating user_id, ts_listen, user_age, media_duration, and media_id. This included features such as
These features were engineered to provide a deeper understanding of user behaviors and listening patterns for further analysis and modeling.
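As a rough sketch of this kind of per-user aggregation, the snippet below groups the listening log by user_id and summarizes the other four fields. The specific output features (n_listens, n_unique_tracks, mean_media_duration, first_listen, last_listen) and the function name build_user_features are illustrative assumptions, not necessarily the features built in this project.

```python
import pandas as pd


def build_user_features(df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate a listening log into one row per user.

    The output columns are hypothetical examples of user-level features.
    """
    return (
        df.groupby("user_id")
        .agg(
            user_age=("user_age", "first"),              # age is constant per user
            n_listens=("media_id", "count"),             # total plays
            n_unique_tracks=("media_id", "nunique"),     # distinct tracks played
            mean_media_duration=("media_duration", "mean"),
            first_listen=("ts_listen", "min"),
            last_listen=("ts_listen", "max"),
        )
        .reset_index()
    )
```

The resulting table can be joined back onto the listening log with a merge on user_id if the user-level features are needed next to each play.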
More details are available in the Deezer data analysis results.
I analyzed the "listen_type" feature, which indicates whether a user listened to music using the FLOW function (listen_type = 1) or not (listen_type = 0). I aggregated attributes such as user_id, user_age, and media_id (songs) to calculate the average number of songs listened per user and the percentage of songs listened across different age groups.
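The two aggregations described above could look roughly like this in pandas. It is a sketch under assumptions: it groups by raw user_age rather than by the age groups used in the tables, and the function names and the output column pct_of_listens are hypothetical.

```python
import pandas as pd


def songs_per_user_by_listen_type(df: pd.DataFrame) -> pd.Series:
    """Average number of plays per user, split by listen_type (1 = FLOW)."""
    per_user = (
        df.groupby(["listen_type", "user_id"])["media_id"]
        .count()
        .rename("n_songs")
        .reset_index()
    )
    return per_user.groupby("listen_type")["n_songs"].mean()


def listen_share_by_age(df: pd.DataFrame) -> pd.DataFrame:
    """Percentage of plays contributed by each age, within each listen_type."""
    counts = df.groupby(["listen_type", "user_age"])["media_id"].count()
    totals = counts.groupby(level="listen_type").transform("sum")
    share = counts / totals * 100
    return share.rename("pct_of_listens").reset_index()
```

Computing the shares within each listen_type, rather than over the whole dataset, keeps the FLOW and non-FLOW distributions comparable even when one mode has far more plays than the other.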
Table 1 presents the average length of songs listened to and the percentage of each song listened to, with and without the FLOW function. The results show that users who do not use the FLOW function listen to roughly three times as much of each song as users who do. Specifically, users who do not use the FLOW function listen to nearly 60% of a song, while users who use the FLOW function listen to less than 20% of a song recommended by the system.
User age is added to Table 2 to compare user listening behaviour across ten age groups.
listen_type is added to Table 3 to compare user listening behaviour across ten age groups, with and without the FLOW function.
When analyzing content, genre is a significant feature that can vary over time and be influenced by the user's surroundings. Through our analysis, we identified 6 main genres, with genre IDs 0, 7, 10, 25, 27, and 14, that were highly popular across all other attributes: hour, session, context, platform, listen type, and user age. In other words, regardless of the time of day, user age, or context, these 6 genres were consistently favored by users. The key findings are listed below, and the graphical analysis can be seen in deezer_eda_result.
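One way to check whether a set of genres is popular regardless of another attribute is to take the top-k genres within each value of that attribute and intersect the results. The sketch below illustrates that idea only; it is not necessarily how the 6 genres above were identified. The column name genre_id, the example grouping column platform_name, and the function name consistently_top_genres are assumptions.

```python
import pandas as pd


def consistently_top_genres(df: pd.DataFrame, group_col: str, k: int = 6) -> set:
    """Return the genres ranked in the top k by play count for every value of group_col."""
    # Play counts per (group value, genre) pair.
    counts = df.groupby([group_col, "genre_id"]).size()
    top_sets = []
    for _, grp in counts.groupby(level=group_col):
        top = grp.nlargest(k).index.get_level_values("genre_id")
        top_sets.append({int(g) for g in top})
    # Genres that appear in the top k of every group.
    return set.intersection(*top_sets) if top_sets else set()
```

For example, consistently_top_genres(train, "platform_name") would return the genres that rank in the top 6 on every platform; repeating the call for hour, session, context, listen type, and user age and intersecting the results gives the genres that are popular across all of them.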
In conclusion, time is a critical factor that influences user preferences in music listening. Genre preferences may vary by user age, platform, and listening environment. To improve the FLOW feature and reduce bounce rate, a context-based recommendation system is suggested, considering personalized features. This can provide relevant content based on user preferences, listening habits, and contextual factors. By incorporating these factors, it can potentially enhance user satisfaction with Deezer's recommendation system.
It was a great experience to work on a dataset that contains millions of entries. Ideally, it would be good to start data processing in a database due to the simplicity of programming. In addition, performing data queries can help us have a quick glance at the data and gain a better understanding when performing statistical calculations. On the other hand, many categorical attributes in the given dataset are replaced with numeric labels. It would be helpful to have the original labels of each categorical variable, as they can assist analysts in forming problem statements or hypotheses, and provide better interpretation when analyzing data.