Creating a synthetic dataset in pandas: a music shop chain example

Need a dataset and cannot find one? This article may be for you.

Hakan Sarıtaş
Mar 26, 2022

Hi all. Today’s topic is creating your own dataset. I know it sounds weird, but it is cool.

Whether finding a dataset is easy these days depends on what you want. But hold on a second: imagine creating your own dataset to practice on. You can also use it for your own rule-based projects and more. You don’t need permission, you don’t need to download anything, and you don’t need to waste time hunting for the right data. On top of that, your mind gets a workout designing algorithms.

Keywords: Python, pandas, NumPy, dataset, function.

In this scenario, I am the owner of a chain of music shops that sells instruments in different cities around the world. The dataset holds customer information in these variables:

AGE: customer age.
CITY: city where the music shop is located.
SEX: customer’s gender information.
INSTRUMENT: which instrument was bought.
PRICE: how much money ($) the customer paid for the instrument.

The dataset has 5000 samples.

Let’s Start…

1-Main Function

Define a main function. I named it “create_dataset()”. All of the other code goes inside this function.

import numpy as np
import pandas as pd

def create_dataset():

2-Create Variables as NumPy arrays except PRICE

  • The AGE variable ranges from 15 to 70 years.
  • np.random.randint(min_value, max_value (exclusive), sample_size) returns random integers.
AGE = np.random.randint(15, 71, 5000)
  • The SEX variable takes the values Male and Female.
  • To create object (string) variables as NumPy arrays, use np.random.choice().
  • p is the probability associated with each entry in the given list (sex_list).
  • p controls how often each value is drawn, so the values end up with different counts. Without p, every value is equally likely.
  • The p values must sum to 1.
  • p must have as many values as the list has items. For example, if your list has two items, p must also have two values.
sex_list = ["Male","Female"]
SEX = np.random.choice(sex_list,5000, p=[0.6, 0.4])
  • INSTRUMENT : “Guitar”, ”Violin”, ”Harmonica”, ”Drum”
inst_list = ["Guitar","Violin","Harmonica","Drum"]
INSTRUMENT = np.random.choice(inst_list, 5000,
p=[0.3, 0.4, 0.2, 0.1])
  • CITY :
city_list = ["Izmir","Vancouver","Paris","Tokyo"]
CITY = np.random.choice(city_list,5000, p=[0.3, 0.2, 0.1, 0.4])
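To see what p does in practice, here is a minimal sketch that draws the SEX variable and compares the observed shares with the requested probabilities. The seed value is my own assumption, added only so the check is repeatable; the article itself does not set one.

```python
import numpy as np

# Hypothetical seed, only so this demo is repeatable.
np.random.seed(42)

sex_list = ["Male", "Female"]
p = [0.6, 0.4]

# p needs one entry per item and must sum to 1,
# otherwise np.random.choice raises a ValueError.
assert len(p) == len(sex_list)
assert abs(sum(p) - 1.0) < 1e-9

SEX = np.random.choice(sex_list, 5000, p=p)

# Observed share of each value: close to p, but not exactly equal to it.
values, counts = np.unique(SEX, return_counts=True)
shares = {str(v): float(c) / len(SEX) for v, c in zip(values, counts)}
print(shares)
```

With 5000 samples the shares usually land within a percentage point or two of 0.6 and 0.4. The same check works for INSTRUMENT and CITY.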

3-Create a DataFrame for AGE, SEX, INSTRUMENT, CITY variables

  • Collect all of the variables above with zip into a list called “list_of_tuples”.
  • Turn list_of_tuples into a pandas DataFrame.
list_of_tuples = list(zip(AGE, SEX, INSTRUMENT, CITY))

df_0 = pd.DataFrame(list_of_tuples,
columns=["AGE", "SEX", "INSTRUMENT", "CITY"])
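As a side note, the same DataFrame can probably be built without the intermediate list by passing a dict of arrays. The sketch below uses three made-up rows instead of the 5000-row arrays and shows that both routes give the same columns and values. It is an alternative I am suggesting, not the method used above.

```python
import numpy as np
import pandas as pd

# Small stand-in arrays (hypothetical values) instead of the 5000-row ones.
AGE = np.array([25, 40, 63])
SEX = np.array(["Male", "Female", "Male"])
INSTRUMENT = np.array(["Guitar", "Drum", "Violin"])
CITY = np.array(["Izmir", "Tokyo", "Paris"])

columns = ["AGE", "SEX", "INSTRUMENT", "CITY"]

# The article's approach: zip the arrays into row tuples.
df_zip = pd.DataFrame(list(zip(AGE, SEX, INSTRUMENT, CITY)), columns=columns)

# Alternative: one dict entry per column, no intermediate list.
df_dict = pd.DataFrame({"AGE": AGE, "SEX": SEX,
                        "INSTRUMENT": INSTRUMENT, "CITY": CITY})

print(df_zip)
print(df_dict)
```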

4-Create PRICE list

  • “price_dict” is a dict holding the average low (price1) and high (price2) price of each instrument, in the same order as inst_list. You can use different numbers.
  • Example: for the guitar, the average minimum price is $100 and the average maximum price is $1800.
price_dict = {"price1" : [100, 150, 50, 300],
"price2" : [1800, 2000, 800, 2500]}

5-Create a new DataFrame for INSTRUMENT and PRICE

  • PRICE needs its own DataFrame because the price range differs for each instrument. A function named price_fill returns a DataFrame containing the instruments and their PRICE values.
  • A dictionary is more convenient than a list here.
  • random.uniform produces float numbers in a given range.
  • len(df_0[df_0["INSTRUMENT"] == "Guitar"]) gives the exact number of rows for which a PRICE must be generated.
  • Call the function and save the result as df_1.
df_1 = price_fill(inst_list, price_dict)
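The body of price_fill is not shown here, so below is a minimal sketch of what it could look like, reconstructed from the bullets above. It is not necessarily the original. My assumptions: np.random.uniform stands in for “random.uniform”, df_0 is read from the surrounding scope (as the two-argument call suggests), and the small df_0 built here is only a stand-in for the one from step 3.

```python
import numpy as np
import pandas as pd

inst_list = ["Guitar", "Violin", "Harmonica", "Drum"]
price_dict = {"price1": [100, 150, 50, 300],
              "price2": [1800, 2000, 800, 2500]}

# Stand-in for df_0 from step 3; only the INSTRUMENT column matters here.
df_0 = pd.DataFrame(
    {"INSTRUMENT": np.random.choice(inst_list, 5000, p=[0.3, 0.4, 0.2, 0.1])}
)


def price_fill(inst_list, price_dict):
    """Hypothetical reconstruction: one block of priced rows per instrument."""
    parts = []
    for i, inst in enumerate(inst_list):
        # Number of rows in df_0 that hold this instrument.
        n = len(df_0[df_0["INSTRUMENT"] == inst])
        # n uniform floats between this instrument's low and high price.
        prices = np.random.uniform(price_dict["price1"][i],
                                   price_dict["price2"][i], n)
        parts.append(pd.DataFrame({"INSTRUMENT": inst, "PRICE": prices}))
    # Fresh 0..N-1 index so the result can be merged with df_0 by index.
    return pd.concat(parts, ignore_index=True)


df_1 = price_fill(inst_list, price_dict)
print(df_1.head())
```

Note that df_1 comes out grouped by instrument (all guitars first, then violins, and so on), which is why shuffling the rows afterwards makes sense.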

6-Merge two dataframes

  • Merge the two DataFrames (df_0, df_1) into one named df.
  • “suffixes” appends a label to duplicated column names. Here the INSTRUMENT column from df_0 becomes INSTRUMENT_del and the one from df_1 keeps its name, so we can tell which is which.
df = pd.merge(df_0, df_1, suffixes=("_del",""), left_index=True, right_index=True)
  • I used the sample function to shuffle the rows so the dataset looks more realistic.
df = df.sample(frac=1, ignore_index=True)
  • After merging there are two INSTRUMENT columns. The first one (from df_0) must be deleted because it was only used to produce the frequency of each instrument. The second one comes from df_1, which holds the instruments with their prices.
df = df.drop("INSTRUMENT_del", axis=1)
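To make the merge, shuffle and drop steps easier to follow, here is a minimal sketch on two tiny made-up frames. The values are hypothetical; only the calls mirror the ones above.

```python
import pandas as pd

# Tiny hypothetical stand-ins for df_0 and df_1.
df_0 = pd.DataFrame({"AGE": [25, 40, 63],
                     "INSTRUMENT": ["Violin", "Guitar", "Guitar"]})
df_1 = pd.DataFrame({"INSTRUMENT": ["Guitar", "Guitar", "Violin"],
                     "PRICE": [450.0, 1200.0, 900.0]})

# Index-on-index merge: row i of df_0 is paired with row i of df_1.
df = pd.merge(df_0, df_1, suffixes=("_del", ""),
              left_index=True, right_index=True)
print(list(df.columns))  # ['AGE', 'INSTRUMENT_del', 'INSTRUMENT', 'PRICE']

# Shuffle the rows, then drop the INSTRUMENT column that came from df_0.
df = df.sample(frac=1, ignore_index=True)
df = df.drop("INSTRUMENT_del", axis=1)
print(df)
```

Because the rows are paired by position, a customer’s AGE ends up next to whatever instrument sits at the same index in df_1. That is fine here since all variables are generated independently, and the count of each instrument is unchanged.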

7-Save as CSV

  • To reuse the same numbers later, save the DataFrame as CSV or any other format you like.
df.to_csv("export_dataframe.csv", index=False)
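Another way to get the same numbers again is to fix the random seed before generating anything. The sketch below shows both options side by side. The seed value and the temporary file location are my own assumptions for the demo.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Option 1: fix the seed so the same random values come back on every run.
np.random.seed(0)  # hypothetical seed value
first = np.random.randint(15, 71, 5)
np.random.seed(0)
second = np.random.randint(15, 71, 5)
print((first == second).all())  # True

# Option 2 (the one used above): write the data once, then read it back.
df = pd.DataFrame({"AGE": first,
                   "CITY": ["Izmir", "Paris", "Tokyo", "Izmir", "Vancouver"]})
path = os.path.join(tempfile.gettempdir(), "export_dataframe.csv")
df.to_csv(path, index=False)
df_loaded = pd.read_csv(path)
print(df_loaded)
```

Seeding keeps the script self-contained, while the CSV file is easier to share with someone who only wants the data.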

Overview of df

fig 1: Overview of df
fig 2: Type and column information of df
fig 3: Unique values of the variables
fig 4: Overview of numeric variables
fig 5: Value counts of the object type variables
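The figures themselves are not reproduced here. Views like these can be produced with standard pandas calls, sketched below on a small made-up frame. Which exact calls were used for the figures is my assumption.

```python
import pandas as pd

# Small stand-in frame with the article's columns (values are made up).
df = pd.DataFrame({
    "AGE": [25, 40, 63, 18],
    "SEX": ["Male", "Female", "Male", "Female"],
    "INSTRUMENT": ["Guitar", "Drum", "Violin", "Guitar"],
    "CITY": ["Izmir", "Tokyo", "Paris", "Izmir"],
    "PRICE": [450.0, 1200.5, 900.0, 300.25],
})

print(df.head())        # fig 1: first rows of df
df.info()               # fig 2: types and column information
print(df.nunique())     # fig 3: number of unique values per variable
print(df.describe().T)  # fig 4: summary of the numeric variables

# fig 5: value counts of the object type variables
for col in ["SEX", "INSTRUMENT", "CITY"]:
    print(df[col].value_counts())
```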

You can change the names of the variables and the values described above. If you improve the code, please let me know; I will be happy. Next time I will create a persona project with this imaginary dataset ;).

I was inspired by the Veri Bilim Okulu lessons while creating this dataset and its variables.


Hakan Sarıtaş

Data Scientist | Data Analyst | Data Visualization | Data Interpretation | Data Processing | NLP | Machine Learning | Marine Geophysicist | Vocal in Les Devins