“Rock vs. Pop” : ‘Independent Two-Sample Hypothesis Testing’ based on letters

Apr 21, 2022

Want to see who wins this challenge? Scroll down ↓


In this article, I carried out a statistical analysis of the difference between two music genres, based on the frequency of each letter in the song lyrics of the artists on the “best Rock” and “best Pop” lists from ranker.com. For this, I applied “Independent Two-Sample Hypothesis Testing” by establishing the H0 (null) and H1 hypotheses. The H0 hypothesis states that there is no significant difference between the two groups in the average frequency of a given letter in their lyrics. I used the letters of the English alphabet for the analysis.

A brief note on the independent two-sample test: it is used to check whether the difference between the means of two independent groups is statistically significant or could simply be due to chance. One of two techniques, parametric or non-parametric, is applied to the data, depending on whether the data satisfy the test assumptions. The result lets us judge whether the two groups differ in the population, whose true values we do not know.
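The choice between the parametric and non-parametric route can be sketched as a small helper. This is only an illustration of a common decision rule, not the code used in this study: the function name `compare_two_groups` is hypothetical and the input data are synthetic.

```python
import numpy as np
from scipy import stats

def compare_two_groups(a, b, alpha=0.05):
    """Pick and run a two-sample test based on the assumption checks."""
    # Shapiro-Wilk on each group: p > alpha means normality is not rejected
    normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    if not normal:
        # Normality rejected: fall back to the non-parametric test
        _, p = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "mannwhitneyu", p
    # Levene: p > alpha means equal variances are not rejected
    equal_var = stats.levene(a, b)[1] > alpha
    # equal_var=False makes ttest_ind run Welch's t-test instead
    _, p = stats.ttest_ind(a, b, equal_var=equal_var)
    return "ttest_ind", p

# Synthetic, right-skewed samples used only to exercise the helper
rng = np.random.default_rng(42)
skewed_a = rng.exponential(scale=10, size=200)
skewed_b = rng.exponential(scale=10, size=200)

test_name, p_value = compare_two_groups(skewed_a, skewed_b)
print(test_name, round(p_value, 3))
```

With strongly skewed input such as this, the normality check fails and the helper takes the mannwhitneyu branch, which mirrors the path followed in the rest of the article.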

The results obtained from this study were as follows:

The normality test gave p value = 0.000 (to three decimal places) for each letter. This means H0 is rejected: the assumption of normality is not satisfied.
In the homogeneity of variance check, the p value was greater than 0.05. This means H0 could not be rejected and the assumption was met.
Although the variances were homogeneous, the non-parametric “mannwhitneyu” test was applied because the assumption of normality was not met. As an example, the parametric “ttest” was also applied, as if all the assumptions had been met.
According to the mannwhitneyu test, H0 could not be rejected for any letter except U and V in the lyrics of the Rock and Pop music genres. This means that, at the 95% confidence level, there is no statistically significant difference between the two genres in the average frequency of these letters.
Since the p values for U and V are below 0.05, H0 is rejected for them. There is a statistically significant difference in the mean frequency of these two letters at the 95% confidence level.

Keywords: Independent two-sample hypothesis, QQ plot, ttest, mannwhitneyu, pandas, H0 Hypothesis

You can visit my GitHub page for all datasets, functions and code.

A- PREPARATION OF THE DATASET

1- To create the datasets of the music genres, the “letter_of_songs” function needs to be run. For this, you need to get a Genius API key and pass it in the api_key parameter. You can check my “Long Live Rockn Roll” post to find out how.

2- I created two lists with the names of 100 Pop and 100 Rock artists and passed these lists to the function. Genius.com did not return a response for some of the names, so the datasets do not contain exactly 100 artists per genre; a few are missing. You can find the latest version of the lists on my GitHub page.

pop_list = ["Tina Turner", "Frank Sinatra", "Queen", "Elton John", "Stevie Wonder", "Bee Gees", "David Bowie", "ABBA", "Cyndi Lauper", "The Beach Boys", "Michael Jackson", "John Lennon", ...]

rock_list = ["Led Zeppelin", "Queen", "The Beatles", "Pink Floyd", "The Rolling Stones", "Jimi Hendrix", "AC/DC", "The Who", "Guns N' Roses", "Elvis Presley", "Van Halen", "The Doors", ...]

3- I ran the “letter_of_songs” function separately for the pop and rock genres to create the datasets. The dataset variables consist of “artist name”, “artist ID”, “song name”, “song ID”, “lyric” and one column for each letter from “A” to “Z”. I fetched 5 songs for each artist.
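The letter columns can be illustrated with a small counting helper. This is a minimal sketch of the idea only: `count_letters` is a hypothetical name, not the “letter_of_songs” function itself, and it assumes counting is case-insensitive and ignores anything outside A to Z.

```python
from collections import Counter
import string

def count_letters(lyric: str) -> dict:
    """Return the count of each English letter A-Z in a lyric (case-insensitive)."""
    # Keep only characters that are English letters after upper-casing
    counts = Counter(ch for ch in lyric.upper() if ch in string.ascii_uppercase)
    # Always return all 26 letters so every song yields the same columns
    return {letter: counts.get(letter, 0) for letter in string.ascii_uppercase}

row = count_letters("We will rock you")
print(row["O"], row["W"], row["Z"])  # prints: 2 2 0
```

Returning a fixed set of 26 keys means each song becomes one row with identical columns, which is what makes the later per-letter tests straightforward.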

Running the function to create the datasets

Importing libraries and reading the output Excel files

4- I examined the datasets and checked for outliers and missing values. Some bands have fewer than 5 songs. Interestingly, in the Rock dataset the data for “The Beatles” was returned in place of the band “Heart”. I removed these rows from the data. There are no missing values or outliers.

5- I added the genre information (pop, rock) to the datasets in a variable called “genre”. I then combined the two datasets.
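A minimal pandas sketch of this step is given here. The two tiny frames and their column names are made up for illustration and stand in for the real Excel outputs.

```python
import pandas as pd

# Tiny made-up frames standing in for the two Excel outputs
df_pop = pd.DataFrame({"artist_name": ["ABBA"], "song_id": [1], "A": [120], "B": [15]})
df_rock = pd.DataFrame({"artist_name": ["AC/DC"], "song_id": [2], "A": [98], "B": [22]})

# Label each dataset with its genre before combining
df_pop["genre"] = "pop"
df_rock["genre"] = "rock"

# Stack the rows; ignore_index gives the combined frame a clean 0..n-1 index
df = pd.concat([df_pop, df_rock], ignore_index=True)
print(df)
```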

6- I changed the ID variables to the “object” type so that they are not treated as numeric variables. I created a new variable called “sum”. This variable gives the total number of letters used in each song.

Grouping by “genre” and summing the numeric variables
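The type change, the “sum” variable and the groupby can be sketched as follows. The column names `song_id` and `artist_id` and all the values are assumptions made for the example, not the actual dataset.

```python
import string
import pandas as pd

letters = list(string.ascii_uppercase)

# Made-up data: two songs with arbitrary letter counts
data = {"song_id": [101, 102], "artist_id": [1, 2], "genre": ["pop", "rock"]}
for i, letter in enumerate(letters):
    data[letter] = [i, 2 * i]
df = pd.DataFrame(data)

# Cast the ID columns to object so they are excluded from numeric summaries
df = df.astype({"song_id": "object", "artist_id": "object"})

# Total number of letters per song (row-wise sum over the 26 letter columns)
df["sum"] = df[letters].sum(axis=1)

# Per-genre totals of every letter and of the overall letter count
totals = df.groupby("genre")[letters + ["sum"]].sum()
print(totals[["A", "E", "Z", "sum"]])
```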

7- Some artists appear in both datasets, because ranker.com lists them as both pop and rock artists. This is not a problem for the test.

B- APPLICATION OF THE HYPOTHESIS TEST

1: Set up the Hypotheses
- H0: μ1 = μ2 (There is no significant difference in the letter frequency average in pop and rock lyrics)
- H1: μ1 ≠ μ2 (significant difference)

2: Assumption Check
* 2.1 Normality Assumption Check (shapiro)

Hypotheses
- H0: Has a normal distribution
- H1: Does not have a normal distribution

Applying the shapiro test to all letters

Example of the p value output, up to the letter E

Comment: The p value for every letter is 0.000 (the output example shows the letters up to E). Since this value is less than 0.05 (alpha), we reject H0. That is, the assumption of normality is not met. A QQ plot can be used to examine this graphically.
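The normality check across letters could look roughly like this. It is a sketch on synthetic, right-skewed counts (the real song data are not reproduced here), and it assumes the test is run separately for each genre.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
letters = ["A", "B", "C", "D", "E"]

# Synthetic, right-skewed counts for illustration only
df = pd.DataFrame({l: rng.exponential(scale=50, size=400).round() for l in letters})
df["genre"] = np.repeat(["pop", "rock"], 200)

rows = []
for letter in letters:
    for genre, group in df.groupby("genre"):
        # stats.shapiro returns the W statistic and the p value
        w_stat, p = stats.shapiro(group[letter])
        rows.append({"letter": letter, "genre": genre, "W": w_stat, "p_value": p})

shapiro_results = pd.DataFrame(rows)
# p < 0.05 means H0 (normal distribution) is rejected
shapiro_results["reject_H0"] = shapiro_results["p_value"] < 0.05
print(shapiro_results.round(3))
```

Rounding to three decimals is what turns very small p values into the 0.000 seen in the output.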

Assessment of the normality assumption with a QQ plot

Example QQ plot code for the letter E

QQ plot output. The distribution is not normal.

For the data to be normally distributed, the blue dots should lie along the red line, which represents the theoretical normal distribution.

In the QQ plot output, the vertical axis represents the sample distribution and the horizontal axis represents the theoretical distribution.
The frequency values of the letter E, taken here as an example, do not follow the red line, which shows that the letter E is not normally distributed.
The plots for all the other letters produced results similar to the graph above.
In short, the sample distribution does not match the theoretical normal distribution, so the normality assumption is not met.
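The points behind such a plot can be computed with scipy. This is a sketch on synthetic data, and it assumes `scipy.stats.probplot` as the tool, which may differ from the plotting function used for the figure above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic, right-skewed stand-in for the per-song counts of the letter E
e_counts = rng.exponential(scale=120, size=300)

# probplot returns the theoretical quantiles, the ordered sample values,
# and a least-squares line (slope, intercept, r) fitted through them
(theoretical_q, sample_q), (slope, intercept, r) = stats.probplot(e_counts, dist="norm")

print(f"correlation with the normal line: r = {r:.3f}")
```

Passing `plot=plt` (with `import matplotlib.pyplot as plt`) to `probplot` draws the sample points and the fitted line. The further the points bend away from that line, the weaker the case for normality.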

* 2.2 Homogeneity of Variance (levene)
- Homogeneity of variance evaluates whether the spread (variance) of a variable is similar in the two groups.
Hypotheses
- H0: variances are homogeneous
- H1: variances are not homogeneous

Applying the levene test to all letters

Comment: H0 cannot be rejected because the p value for each letter is greater than 0.05. This means that the assumption of homogeneity of variance is met; that is, the variances of the two groups are similar to each other. But since the assumption of normality is not met, we need to use the non-parametric mannwhitneyu test instead of the ttest.
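A sketch of the variance check for each letter is shown here, again on synthetic data in which both genres are drawn from the same distribution. The frame layout is an assumption for the example.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(2)
letters = ["A", "B", "C", "D", "E"]

# Synthetic data: both genres come from the same distribution
df = pd.DataFrame({l: rng.exponential(scale=50, size=400) for l in letters})
df["genre"] = np.repeat(["pop", "rock"], 200)

pop, rock = df[df["genre"] == "pop"], df[df["genre"] == "rock"]

# stats.levene returns (statistic, p value); index 1 is the p value
levene_results = pd.DataFrame(
    [{"letter": l, "p_value": stats.levene(pop[l], rock[l])[1]} for l in letters]
)
# p > 0.05 means H0 (homogeneous variances) cannot be rejected
levene_results["variances_homogeneous"] = levene_results["p_value"] > 0.05
print(levene_results.round(3))
```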

3: Application of the Hypothesis Test

3.1 Independent two-sample t-test (parametric test), used if the assumptions are met (stats.ttest_ind). Shown here for demonstration only!

Applying the ttest to the letter E, with the resulting p value
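For reference, a minimal call to `stats.ttest_ind` looks like this. The two samples are synthetic stand-ins for the letter E counts in each genre, so the resulting p value says nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for the per-song counts of the letter E in each genre
pop_e = rng.exponential(scale=120, size=250)
rock_e = rng.exponential(scale=120, size=250)

# equal_var=True is the classic Student t-test, which assumes homogeneous variances
t_stat, p_value = stats.ttest_ind(pop_e, rock_e, equal_var=True)

# H0 (equal means) is rejected only when p falls below alpha = 0.05
reject_h0 = p_value < 0.05
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, reject H0: {reject_h0}")
```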

3.2 mannwhitneyu test (non-parametric test), used if the assumptions are not met (stats.mannwhitneyu)

Applying the mannwhitneyu test to the letter E, with the resulting p value

Letters with a p value less than 0.05.

Comment: The p value was greater than alpha for all letters except two. For those letters H0 cannot be rejected, so at the 95% confidence level there is no statistically significant difference between the pop and rock genres in their average frequency. For the letters U and V, H0 is rejected and there is a significant difference.
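Running the test over all letters and keeping those below alpha could be sketched as follows. The data are synthetic, and one column is shifted on purpose so that the filter has something to return; the flagged letters here are therefore not a reproduction of the study's result.

```python
import string
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
letters = list(string.ascii_uppercase)

# Synthetic per-song letter counts for each genre
pop = pd.DataFrame({l: rng.exponential(scale=40, size=250) for l in letters})
rock = pd.DataFrame({l: rng.exponential(scale=40, size=250) for l in letters})
# Deliberately shift one synthetic column so the sketch has something to flag
rock["U"] = rng.exponential(scale=120, size=250)

rows = []
for l in letters:
    # Two-sided test: H0 says the two groups come from the same distribution
    u_stat, p = stats.mannwhitneyu(pop[l], rock[l], alternative="two-sided")
    rows.append({"letter": l, "U_stat": u_stat, "p_value": p})

mwu = pd.DataFrame(rows)
significant = mwu.loc[mwu["p_value"] < 0.05, "letter"].tolist()
print(significant)
```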

D- EVALUATION OF DATA SETS WITHOUT TEST

I created a function called “letter_perc” to compare each letter in Pop and Rock lyrics, both as a percentage and as an average, in tabular form and graphically. According to the output of this function:

The mean and percentage values of the two genres are very close to each other.
When I sort the letters by percentage within the Pop and Rock genres, the ranking is pretty much the same.

df_perc with the mean and percentage variables of the letters

Comparison of the Pop and Rock percentages for each letter
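A table of this kind can be built with a groupby. The helper below is a sketch under assumptions: `letter_share` is a hypothetical name, not the “letter_perc” function itself, and the percentage is taken as a letter's share of all letters within a genre.

```python
import pandas as pd

def letter_share(df: pd.DataFrame, letters: list) -> pd.DataFrame:
    """Per-genre mean count and percentage share of each letter."""
    grouped = df.groupby("genre")[letters]
    # Mean count per song; transpose so that letters become the rows
    mean = grouped.mean().T.add_suffix("_mean")
    # Share of each letter in the genre's total letter count, in percent
    totals = grouped.sum()
    perc = (totals.div(totals.sum(axis=1), axis=0) * 100).T.add_suffix("_perc")
    return mean.join(perc)

# Made-up counts for two letters only
demo = pd.DataFrame(
    {"genre": ["pop", "pop", "rock"], "A": [1, 3, 4], "B": [2, 2, 6]}
)
df_perc = letter_share(demo, ["A", "B"])
print(df_perc.sort_values("pop_perc", ascending=False))
```

Sorting by `pop_perc` and by `rock_perc` in turn gives the two rankings that are compared in the text.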

E- RESULTS

Both the statistical test results and the observations made without a test point the same way: in approximately 500 songs sampled from the Pop and Rock music genres, there is no statistically significant difference in the average frequency of each letter (except the letters U and V) at the 95% confidence level.
The following interpretation may be deduced from this: the words used in the lyrics of the Pop and Rock music genres, and the letter frequencies that follow from them, are similar.

Acknowledgment:

I would like to thank Mustafa Vahit Keskin, Mehmet Akturk, Ozan Güner, Mehmet Tuzcu and Arif Eker, who encouraged me to write this article and shared their knowledge generously, and my DSMLBC8 group5 friends from Veri Bilim Okulu and Miuul.

References:

lyricsgenius (pypi.org)

Text in Matplotlib Plots, Matplotlib 3.5.1 documentation (matplotlib.org)

Genius | Song Lyrics & Knowledge (genius.com)

Welcome to Python.org (www.python.org)

pandas (pandas.pydata.org)

Contact:

Hakan SARITAŞ

linkedin : www.linkedin.com/in/hakansaritas

GitHub: hakansaritas (HAKAN SARITAŞ) · GitHub

kaggle: Hakan Saritas