Creating a synthetic dataset in pandas: a music shop chain example

Need a dataset but can’t find one? This article may be for you.

5 min read · Mar 26, 2022

Hi all. Today’s topic is creating your own dataset. I know it sounds weird, but it is cool.

Whether finding a dataset is easy these days depends on what you actually want. But hold on a second! Imagine creating your own dataset to practice on. You can also use it for your own rule-based projects and more. You don’t need permission, you don’t need to download anything, and you don’t need to waste time searching for the right data. As a bonus, your mind gets a workout designing algorithms.

Keywords: python, pandas, numpy, dataset, function.

The full code is available in my GitHub repository.

In this scenario, I am the owner of a chain of music shops that sells instruments in different cities around the world. The dataset holds customer information in these variables:

AGE: the customer’s age.
CITY: the city where the music shop is located.
SEX: the customer’s gender.
INSTRUMENT: the instrument that was bought.
PRICE: the amount ($) the customer paid for the instrument.

The dataset has 5000 samples.

Let’s Start…

1-Main Function

Define a main function. I named this function “create_dataset()”. All the other code sits inside this function.

def create_dataset():

2-Create Variables as NumPy arrays except PRICE

The AGE variable ranges from 15 to 70 years.
np.random.randint(min_value, max_value (exclusive), sample_size) returns random integers.

AGE = np.random.randint(15, 71, 5000)
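
The snippets in this article assume NumPy and pandas are imported as np and pd. Here is a minimal, self-contained sketch of the AGE step. The seed call is my own addition (it is not in the original code) so that repeated runs give the same numbers.

```python
import numpy as np

# Assumption: a fixed seed, added only so the sketch is repeatable.
np.random.seed(42)

# The upper bound is exclusive, so 71 is needed to allow age 70.
AGE = np.random.randint(15, 71, 5000)

# Quick sanity check on range and size.
print(AGE.min(), AGE.max(), AGE.shape)
```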

The SEX variable consists of Male and Female.
To create categorical (object) variables as NumPy arrays, use np.random.choice().
p is the probability associated with each entry in the given list (sex_list).
With p, the values are not drawn equally often: each value appears in roughly the share given by its probability.
The p values must sum to 1.
The number of p values must match the number of items in the list. For example, if your list has two items, p must also have two values.

sex_list = ["Male","Female"]
SEX = np.random.choice(sex_list,5000, p=[0.6, 0.4])
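
To see the effect of p, you can count how often each value was drawn. The sketch below also shows what happens when the probabilities do not sum to 1. The seed and the variable names shares and bad_p_rejected are illustrative additions, not part of the original code.

```python
import numpy as np

# Assumption: a fixed seed, added only so the sketch is repeatable.
np.random.seed(0)

sex_list = ["Male", "Female"]
SEX = np.random.choice(sex_list, 5000, p=[0.6, 0.4])

# Observed shares should land close to the requested probabilities.
values, counts = np.unique(SEX, return_counts=True)
shares = dict(zip(values.tolist(), (counts / counts.sum()).tolist()))
print(shares)

# Probabilities that do not sum to 1 are rejected with a ValueError.
try:
    np.random.choice(sex_list, 5000, p=[0.6, 0.3])
    bad_p_rejected = False
except ValueError:
    bad_p_rejected = True
print(bad_p_rejected)
```

With 5000 samples the observed shares will not be exactly 0.6 and 0.4, but they should be close.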

INSTRUMENT: “Guitar”, “Violin”, “Harmonica”, “Drum”

inst_list = ["Guitar","Violin","Harmonica","Drum"]
INSTRUMENT = np.random.choice(inst_list, 5000,
                              p=[0.3, 0.4, 0.2, 0.1])

CITY :

city_list = ["Izmir","Vancouver","Paris","Tokyo"]
CITY = np.random.choice(city_list,5000, p=[0.3, 0.2, 0.1, 0.4])

3-Create a DataFrame for AGE, SEX, INSTRUMENT, CITY variables

We gather all the variables above with zip into a list called “list_of_tuples”.
Then we turn list_of_tuples into a pandas DataFrame.

list_of_tuples = list(zip(AGE, SEX, INSTRUMENT, CITY))

df_0 = pd.DataFrame(list_of_tuples,
                    columns=["AGE", "SEX", "INSTRUMENT", "CITY"])
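
As a side note, the same frame can be built without zip by passing a dictionary of arrays to pd.DataFrame. This is an alternative I am suggesting, not the route used in this article. The sketch builds both versions side by side; the names df_zip and df_dict are only for illustration.

```python
import numpy as np
import pandas as pd

AGE = np.random.randint(15, 71, 5000)
SEX = np.random.choice(["Male", "Female"], 5000, p=[0.6, 0.4])
INSTRUMENT = np.random.choice(["Guitar", "Violin", "Harmonica", "Drum"],
                              5000, p=[0.3, 0.4, 0.2, 0.1])
CITY = np.random.choice(["Izmir", "Vancouver", "Paris", "Tokyo"],
                        5000, p=[0.3, 0.2, 0.1, 0.4])

# Route used in the article: zip the arrays into row tuples.
df_zip = pd.DataFrame(list(zip(AGE, SEX, INSTRUMENT, CITY)),
                      columns=["AGE", "SEX", "INSTRUMENT", "CITY"])

# Alternative (my suggestion): map column names to arrays directly.
df_dict = pd.DataFrame({"AGE": AGE, "SEX": SEX,
                        "INSTRUMENT": INSTRUMENT, "CITY": CITY})

print(df_zip.shape, df_dict.shape)
print(df_dict.head())
```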

4-Create PRICE list

I define the average low (price1) and high (price2) prices for each instrument in “price_dict”, a dict whose values follow the order of the instruments in inst_list. You can use different numbers.
For example, for the guitar the low average price is $100 and the high average price is $1800.

price_dict = {"price1" : [100, 150, 50, 300],
              "price2" : [1800, 2000, 800, 2500]}
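
Because the prices are matched to the instruments by position, it can help to print the pairs and check them. A small sketch follows; the name price_ranges is only for illustration and is not used elsewhere in the article.

```python
inst_list = ["Guitar", "Violin", "Harmonica", "Drum"]
price_dict = {"price1": [100, 150, 50, 300],
              "price2": [1800, 2000, 800, 2500]}

# Pair each instrument with its (low, high) range by position.
price_ranges = dict(zip(inst_list,
                        zip(price_dict["price1"], price_dict["price2"])))

for name, (low, high) in price_ranges.items():
    print(f"{name}: {low}$ to {high}$")
```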

5-Create a new DataFrame for INSTRUMENT and PRICE

I had to build a new dataframe for PRICE because the price range is different for each instrument. I wrote a function named price_fill that returns a dataframe containing the instruments and their PRICE values.
A dictionary is more convenient here than separate lists.
np.random.uniform produces float numbers in the given range.
len(df_0[df_0["INSTRUMENT"] == "Guitar"]) gives the exact number of samples for which a PRICE must be created.

Call the function and save the result as df_1.
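
The body of price_fill is not shown in this text, so below is one possible sketch based only on the notes above. It is an assumption, not necessarily the version in the repository. In particular, the article calls price_fill(inst_list, price_dict) inside create_dataset(), where df_0 is already in scope. To keep the sketch self-contained I pass df_0 as a third, hypothetical parameter and use a small stand-in frame.

```python
import numpy as np
import pandas as pd


def price_fill(inst_list, price_dict, df_0):
    """Hypothetical sketch: one block of rows per instrument with uniform prices."""
    parts = []
    for i, inst in enumerate(inst_list):
        # Number of rows df_0 holds for this instrument.
        n = len(df_0[df_0["INSTRUMENT"] == inst])
        low, high = price_dict["price1"][i], price_dict["price2"][i]
        # Float prices drawn uniformly between low and high.
        prices = np.random.uniform(low, high, n)
        parts.append(pd.DataFrame({"INSTRUMENT": inst, "PRICE": prices}))
    # Stack the blocks and renumber the rows so the index lines up with df_0.
    return pd.concat(parts, ignore_index=True)


inst_list = ["Guitar", "Violin", "Harmonica", "Drum"]
price_dict = {"price1": [100, 150, 50, 300],
              "price2": [1800, 2000, 800, 2500]}

# Stand-in for df_0 (illustration only): just the INSTRUMENT column.
df_0 = pd.DataFrame(
    {"INSTRUMENT": np.random.choice(inst_list, 1000, p=[0.3, 0.4, 0.2, 0.1])})

df_1 = price_fill(inst_list, price_dict, df_0)
print(df_1.groupby("INSTRUMENT")["PRICE"].agg(["count", "min", "max"]))
```

In this sketch the rows of df_1 come out grouped by instrument, which is why shuffling the rows later makes sense.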

df_1 = price_fill(inst_list, price_dict)

6-Merge two dataframes

Merge the two dataframes (df_0, df_1) on their index and name the result df.
“suffixes” appends labels to the duplicate INSTRUMENT column names, so we can tell which copy came from df_0 (“_del”) and which came from df_1 (no suffix).

df = pd.merge(df_0, df_1, suffixes=("_del",""), left_index=True, right_index=True)
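
To make the suffix behaviour concrete, here is a tiny sketch with two three-row frames. The frames left and right and their values are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for df_0 and df_1.
left = pd.DataFrame({"AGE": [20, 35, 50],
                     "INSTRUMENT": ["Drum", "Guitar", "Violin"]})
right = pd.DataFrame({"INSTRUMENT": ["Guitar", "Guitar", "Violin"],
                      "PRICE": [950.0, 400.0, 1200.0]})

# Rows are matched by index; the overlapping column name gets a suffix
# on the left side only, because the right-hand suffix is empty.
merged = pd.merge(left, right, suffixes=("_del", ""),
                  left_index=True, right_index=True)

print(list(merged.columns))
print(merged)
```

The copy coming from the left frame is renamed INSTRUMENT_del, which makes it easy to drop afterwards.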

I used the sample function to shuffle the rows of the dataframe so that the dataset looks more realistic.

df = df.sample(frac=1, ignore_index=True)

After merging I have two INSTRUMENT variables and need to delete the first one (INSTRUMENT_del), because it was only used to produce the frequency of each instrument. The second INSTRUMENT variable comes from df_1, which holds the instruments together with their prices.

df = df.drop("INSTRUMENT_del", axis=1)

7-Save as csv

To reuse the same numbers in the future, save the dataframe as a csv file or in any other format you like.

df.to_csv('export_dataframe.csv', index=False)
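
A quick way to confirm the export worked is to read the file back. The sketch below uses a two-row stand-in frame and a temporary folder so that it does not leave files behind. Both are my additions for illustration.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for df (illustration only).
df = pd.DataFrame({"AGE": [20, 35],
                   "SEX": ["Male", "Female"],
                   "CITY": ["Izmir", "Tokyo"],
                   "INSTRUMENT": ["Guitar", "Drum"],
                   "PRICE": [950.5, 1200.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "export_dataframe.csv")
    # index=False keeps the row numbers out of the file.
    df.to_csv(path, index=False)
    df_loaded = pd.read_csv(path)

print(df_loaded)
```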

Overview of df

**fig 2** : Type and column information of df

**fig 3**: Unique values of the variables

**fig 4**: Overview of numeric variables

**fig 5**: Value counts of the object type variables
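
The figures themselves are not reproduced in this text. Overviews like the ones described in the captions can be obtained with standard pandas calls, as sketched below. Whether these are exactly the calls behind the figures is my assumption, and the small frame is a stand-in for df.

```python
import numpy as np
import pandas as pd

# Small stand-in for df (illustration only).
np.random.seed(1)
n = 200
df = pd.DataFrame({
    "AGE": np.random.randint(15, 71, n),
    "SEX": np.random.choice(["Male", "Female"], n, p=[0.6, 0.4]),
    "CITY": np.random.choice(["Izmir", "Vancouver", "Paris", "Tokyo"], n),
    "INSTRUMENT": np.random.choice(["Guitar", "Violin", "Harmonica", "Drum"], n),
    "PRICE": np.random.uniform(50, 2500, n),
})

df.info()                  # type and column information
print(df.nunique())        # number of unique values per variable
print(df.describe().T)     # overview of the numeric variables

# Value counts of the object type variables.
for col in ["SEX", "CITY", "INSTRUMENT"]:
    print(df[col].value_counts())
```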

You can change the names of the variables and the values described above. If you improve the code, please let me know; I will be happy to hear about it. Next time I will create a persona project with this imaginary dataset ;).

I was inspired by the Veri Bilim Okulu lessons while creating this dataset and its variables.

Thank you for your patience and for reading…

I would like to thank Mustafa Vahit Keskin, Mehmet Akturk, Ozan Güner, Mehmet Tuzcu and Arif Eker, who encouraged me to write this article and generously shared their knowledge, and my DSMLBC8-Group5 friends from Veri Bilim Okulu.

Contact:
Hakan SARITAŞ
linkedin : www.linkedin.com/in/hakansaritas
GitHub: hakansaritas
kaggle: Hakan Saritas

References:

pandas: pandas.pydata.org
NumPy: numpy.org