TP1 - Some Python exercises to get back into the swing of things


Part 1. Basic Python (external libraries are prohibited)


Exercise 1.

Write a Python function that:

  1. Reads the file users.csv line by line
  2. Prints the sentence <name> is <age> years old. for each line

Here is a demo of how to read a file line by line:

with open("path to the file", "r") as finput: # "r" means read mode and finput is a variable (its name is free)
    for line in finput:
        print(line)

To split a string s according to a separator sep, use the split method (s.split(sep)). This method returns a list.
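
For instance, a minimal illustration of split (the actual separator used in users.csv may differ):

line = "1;Alice;25"        # hypothetical line, the real users.csv format may differ
fields = line.split(";")   # -> ["1", "Alice", "25"]
print(fields[1], "is", fields[2], "years old.")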

Exercise 2.

Write a Python function that:

  1. Loads the user information from users.csv into a dictionary dUsers
  2. Returns dUsers

dUsers must follow the following format:

dUsers = {
    id: {
        "name" : name,
        "age" : age,
        "sex" : sex,
        "interests" : ["interest1", "interest2"]
    }
}
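
For example, once dUsers is built, individual fields can be accessed like this (the ids and values below are purely illustrative):

dUsers = {
    "1": {"name": "Alice", "age": 25, "sex": "F", "interests": ["python", "chess"]}  # illustrative content only
}
print(dUsers["1"]["name"])            # Alice
print(len(dUsers["1"]["interests"]))  # 2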

Exercise 3.

Using the data structure you have just created, write two Python functions that:

  1. Return the name of the oldest user
  2. Return the number of users who like python

Exercise 4.

We now consider the links between users and thus manipulate the links.csv file. Create a Python function that returns an adjacency list dRel with the following format:

dRel = {
   id : [list of connected users]
}
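
A common way to build such a structure is collections.defaultdict; the sketch below only illustrates the idea with hand-written ids (the real data comes from links.csv):

from collections import defaultdict

dRel = defaultdict(list)
dRel["1"].append("2")   # user 1 is connected to user 2 (illustrative ids)
dRel["2"].append("1")   # add the reverse edge if the links are undirected
print(dRel["1"])        # ['2']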

Exercise 5.

Write two Python functions that:

  1. Return the name of the user who has the largest number of friends
  2. Return the name of the user who has the largest number of friends of the opposite sex

Exercise 6 (optional)

The purpose of this last exercise is to write a basic recommender system that implements the following principle: for each user, the system should recommend the majority interest among his/her friends. Obviously, if this majority interest is already shared by the user, the second most frequent interest is recommended instead. A sketch of the counting idea is given below.
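
A minimal sketch of the counting part, using collections.Counter on a toy list of a user's friends' interests:

from collections import Counter

friends_interests = ["python", "chess", "python", "running"]  # toy data
counts = Counter(friends_interests)
print(counts.most_common(2))  # [('python', 2), ('chess', 1)]
# Recommend counts.most_common(1)[0][0], unless the user already shares it,
# in which case fall back to the next entry of most_common().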

Part 2. An introduction to Python for data scientists


There are 5 major steps in any data science / machine learning project:

  • Data exploration
  • Data formatting
  • Model validation
  • Prediction
  • Result submission

A brief introduction about how these steps can be handled in Python (>= 3.6) is given below.

Essential libraries


Pandas

Pandas is a library written for the Python programming language that allows data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Its main features include:

  • The DataFrame object, to manipulate data easily and efficiently with indexes that can be strings;
  • Tools to read and write structured data in memory from and to different formats: CSV files, text files, Microsoft Excel spreadsheets, SQL databases...;
  • Intelligent, label-based data alignment, handling of missing data (NaN = not a number), and sorting of data according to various criteria;
  • Reshaping and pivot tables;
  • Merging and joining of large volumes of data;
  • Time series analysis.

Documentation link: https://pandas.pydata.org/pandas-docs/stable/
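
A tiny illustration (with made-up values) of the DataFrame object and of the CSV output mentioned above:

import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 32]})  # made-up data
print(df.head())                    # inspect the first rows
df.to_csv("demo.csv", index=False)  # write the frame to a CSV file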


Numpy

NumPy is an extension of the Python programming language, designed to manipulate multidimensional arrays as well as mathematical functions operating on these arrays. It offers types and operations that are much more efficient than those of the standard library, and provides vectorized shortcuts for processing data in bulk.

Documentation link: https://docs.scipy.org/doc/
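
A tiny illustration of the array type and vectorized operations mentioned above:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
print(a * 2 + 1)          # element-wise arithmetic: [3. 5. 7.]
print(a.mean(), a.std())  # aggregations over the whole array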


Matplotlib

Matplotlib is a Python library designed to plot and visualize data in graphical form. It can be combined with the NumPy and SciPy scientific computing libraries.

Documentation link: https://matplotlib.org/contents.html


Scikit-learn

Scikit-learn is a free Python library dedicated to machine learning. It is developed by many contributors, particularly in the academic world, including French research institutions such as Inria and Télécom ParisTech. It includes implementations of random forests, logistic regression, classification algorithms, and support vector machines. It is designed to work well with other free Python libraries, including NumPy and SciPy.

Documentation link: http://scikit-learn.org/stable/


Let's start coding!

Headers

Here are the first lines of almost every data science Python script. They import the libraries you will use in what follows.

In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns # Seaborn is a Python data visualization library based on matplotlib
import numpy as np
%matplotlib inline 

Import the data

In machine learning competitions, two files are usually given: a training file that is used to train the machine learning algorithm and a test file that is used to measure the performance of the algorithm.

Instructions: read the pandas documentation and find out how to read the two CSV files. Then, print the first ten lines of the train data frame using the head function.
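
A minimal sketch, assuming the files are named train.csv and test.csv (adjust the paths to the files you were actually given):

train = pd.read_csv("train.csv")  # hypothetical file name
test = pd.read_csv("test.csv")    # hypothetical file name
train.head(10)                    # first ten lines of the training data frame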

Data exploration

Let's talk about the context: we have to predict house prices. As you should know, this is a SUPERVISED machine learning problem, because a target variable (SalePrice) has to be predicted. Since we have to predict a numerical value, it is a regression problem, so you will use regression algorithms.

Instructions:

  • Print the column names of the training data frame using the columns attribute;
  • Print the number of lines and columns of the training and test data frames using the shape attribute.
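
For reference, both attributes are accessed like this:

print(train.columns)  # column names of the training data frame
print(train.shape)    # (number of lines, number of columns)
print(test.shape)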

Analysis of the target variable

Instructions:

  • Apply the describe function on the SalePrice column
  • Call the seaborn distplot function on the SalePrice column
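
A minimal sketch of the two calls (note that distplot has been replaced by histplot/displot in recent seaborn versions):

print(train['SalePrice'].describe())  # summary statistics of the target variable
sns.distplot(train['SalePrice'])      # distribution of the target (use sns.histplot with recent seaborn)
plt.show()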

Relationship between numerical features and the target variable

The piece of code below shows how to plot a scatter plot of the two numerical variables GrLivArea and SalePrice (the target variable).

Instructions. Modify this piece of code to display the relationship between every numerical feature and the target variable (you should use a loop); a possible sketch is given after the code block.

Hint. To determine whether a variable (column of the data frame) is numerical, you can have a look at the following Stack Overflow post.

In [ ]:
# scatter plot GrLivArea/SalePrice
var = 'GrLivArea'

# A new data frame is created with only the desired columns (the two we would like to display)
price_surface = pd.concat([train['SalePrice'], train[var]], axis=1) 
price_surface.plot.scatter(x=var, y='SalePrice', ylim=(0,800000))
plt.ylabel("Price")
plt.xlabel("Living area")
plt.show()
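
A possible sketch of the requested loop, using select_dtypes to keep only the numerical columns (one of the answers to the Stack Overflow question mentioned in the hint):

numerical_cols = train.select_dtypes(include='number').columns
for var in numerical_cols:
    if var == 'SalePrice':
        continue  # skip the target itself
    pair = pd.concat([train['SalePrice'], train[var]], axis=1)
    pair.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
    plt.show()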

Relationship between categorical features and the target variable

The piece of code below shows how to plot a boxplot of the categorical variable SaleCondition w.r.t. the target variable.

Instructions. Modify this piece of code to display the relationship between every categorical feature and the target variable (you should use a loop); a possible sketch is given after the code block.

Hint. To determine whether a variable (column of the data frame) is categorical, you can have a look at the following Stack Overflow post.

In [ ]:
var = 'SaleCondition'
pair = pd.concat([train['SalePrice'], train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=pair)
fig.axis(ymin=0, ymax=800000);
plt.xticks(rotation=90);
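
A possible sketch of the requested loop, using select_dtypes to keep only the object (string) columns:

categorical_cols = train.select_dtypes(include='object').columns
for var in categorical_cols:
    pair = pd.concat([train['SalePrice'], train[var]], axis=1)
    f, ax = plt.subplots(figsize=(16, 8))
    fig = sns.boxplot(x=var, y="SalePrice", data=pair)
    fig.axis(ymin=0, ymax=800000)
    plt.xticks(rotation=90)
    plt.show()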

Calculate correlations between variables

The best way to get a complete view of your dataset fairly quickly is to make a heatmap representing the correlations between the variables. The code below shows how to do that. Have a look at the documentation to determine which method is used by default to calculate the correlations.

In [ ]:
#correlation matrix
corrmat = train.corr()

f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=1, vmin=-1, square=True);

We now focus on the 10 features that are the most correlated with the target feature.

In [ ]:
k = 10 #Number of features to consider

# We keep only the k most (negatively or positively) correlated features
cols = abs(corrmat).nlargest(k, 'SalePrice')['SalePrice'].index

cm = np.corrcoef(train[cols].values.T)

sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

Data preparation

Most machine learning algorithms do not deal with missing data (NaN). One of the first challenges to address is to manage these missing values by replacing them with estimates.

We first check the ratio of missing values per feature.

In [ ]:
# missing data
# the isnull method outputs a data frame of the same shape as train and, for each element of this matrix,
# returns a boolean: True if the value is missing (NaN), False if not
# Then we count the number of null values per column
total = train.isnull().sum().sort_values(ascending=False)
percent = (train.isnull().sum()/train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

We can see that the first 5 variables contain too many missing values; it is better not to use them.

The train and the test sets are merged so that the same formatting is applied to both. This process is very common.

In [ ]:
data = pd.concat([train, test], axis=0, sort=False) # merge the two datasets by stacking their rows
data = data.reset_index(drop=True) # reset_index returns a new data frame, so the result must be reassigned
data.head()

Instructions. Remove the features (a possible sketch is given after the documentation link below):

  • With little correlation to the target SalePrice (correlation between -0.4 and 0.4)
  • With too many missing values (more than 40%)

Pandas method to succeed in the task: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
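
A possible sketch, reusing corrmat and missing_data computed above; the 0.4 correlation and 40% missing-value thresholds come from the instructions, and the Id column is deliberately kept because it is needed later for the submission file:

low_corr = corrmat['SalePrice'][abs(corrmat['SalePrice']) < 0.4].index   # weak correlation with the target
too_many_nan = missing_data[missing_data['Percent'] > 0.4].index         # more than 40% missing values
to_drop = set(low_corr) | set(too_many_nan)
to_drop.discard('Id')  # keep Id, it is needed later to build the submission file
data = data.drop(columns=[c for c in to_drop if c in data.columns])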

Replace NaN values

Now you have to replace the missing values with sensible estimates.

Pandas method to succeed in the task: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html

For example, you can replace missing values with the most frequent value, or the mean, median...

Instructions. Replace the NaN values of the other variables.

In [ ]:
# Replace NaN values in LotFrontage with the mean
data['LotFrontage'] = data['LotFrontage'].fillna(data['LotFrontage'].mean())

# Replace NaN values in Alley with a new 'NOACCESS' category (no alley access)
data['Alley'] = data['Alley'].fillna('NOACCESS')

Converting categorical features into numerical features

Very few machine learning algorithms can take categorical variables directly as inputs: most of them need numerical values. It is thus necessary to convert categorical features into numerical ones.

There exist several methods to do so:

  • Label encoding (for example, replace the values [right, left, walkers] with [0, 1, 2])
  • One hot encoding (for example, replace the values [right, left, walkers] with 3 binary variables)
  • One advanced method: target encoding (look it up)

Resources :

Instructions. Apply one hot encoding to all categorical features; a minimal sketch is given below.
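
A minimal sketch using pandas get_dummies, applied to the merged data frame so that the training and test sets end up with the same columns:

data = pd.get_dummies(data)  # every object/categorical column is replaced by binary indicator columns
data.head()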

Data normalization

Now that the DataFrame is ready, it is good practice to normalize the data if you use machine learning algorithms such as SVM or KNN.

Instructions. Apply the MinMaxScaler to normalize the data.

Resources. This Stack Overflow entry should be of interest: https://stackoverflow.com/questions/26414913/normalize-columns-of-pandas-data-frame
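
A minimal sketch using scikit-learn's MinMaxScaler, assuming that Id (needed for the submission file) and SalePrice (the target) should be left untouched:

from sklearn.preprocessing import MinMaxScaler

feature_cols = [c for c in data.columns if c not in ('SalePrice', 'Id')]
scaler = MinMaxScaler()
data[feature_cols] = scaler.fit_transform(data[feature_cols])  # rescale every feature to [0, 1]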

Split the data into train and test.

In [ ]:
is_test = data['SalePrice'].isnull()  # Mask used to separate the training set from the test set:
# in the test set we do not know the value of the target variable, so it is NaN there
train = data[~is_test]  # the tilde is the negation
test = data[is_test].drop('SalePrice', axis = 'columns')

Do some sanity checks

Always check your code before training.

assert raises an error if the condition is false.

In [ ]:
assert len(train) == 1460 # Check the size of the training set
assert len(test) == 1459 # Check the size of the test set
assert train.isnull().sum().sum() == 0 # Check that no NaN values remain in the training set
assert test.isnull().sum().sum() == 0 # Check that no NaN values remain in the test set

Model validation

In [ ]:
# X are the training data and Y the prices to predict
X_train = train.drop(['SalePrice','Id'], axis = 'columns')
Y_train = train['SalePrice']

Performance is measured using the RMSE, which is the square root of the mean squared deviation between the predicted values and the true values. $$\sqrt{\frac{1}{n} \sum^n_{i=1}(\hat{y}_i - y_i)^2}$$ The goal is to minimize this evaluation metric.

After data formatting, evaluating the model is the most important step: it is necessary to evaluate your model.

Validating the model gives us information on its performance and tells us whether new additions or modifications to the data have enabled the model to predict better. It also tells us whether there is overfitting (the worst enemy in machine learning).

In [ ]:
def rmse(predictions, targets):
    """Implementation of RMSE
    
    Arguments:
      predictions {np array} -- Predicted value
      targets {np array} -- True value
    
    Returns:
      float -- RMSE score
    """
    return np.sqrt(np.mean((predictions-targets)**2))

The validation method we will use is cross-validation.

Cross-validation is, in machine learning, a method for estimating the reliability of a model based on a sampling technique.

Suppose you have a statistical model with one or more unknown parameters, and a training data set on which you can fit the model. The training process optimizes the model parameters so that the model matches the data as closely as possible. If an independent validation sample is then taken from the same population, it will generally turn out that the model does not perform as well on the validation sample as it did during training: this is called overfitting. Cross-validation is a way to predict the effectiveness of a model on a hypothetical validation set when an independent, explicit validation set is not available.

k-fold cross-validation: the original sample is divided into k sub-samples, then one of the k sub-samples is selected as the validation set and the other k-1 sub-samples constitute the training set. A performance score is computed on the validation set, then the operation is repeated by selecting another validation sub-sample from among those that have not yet been used for validation. The operation is repeated k times so that, in the end, each sub-sample has been used exactly once as a validation set. The mean of the k root mean square errors is finally computed to estimate the prediction error.

Declaration of the model

In [ ]:
from sklearn.linear_model import LinearRegression
model = LinearRegression() # Try to use some others!!

Cross-validation

In [ ]:
from sklearn.model_selection import KFold


# Split the dataset into 5 folds using a fixed seed (for reproducibility purposes)
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
CV = KFold(n_splits = 5, shuffle = True, random_state = 42) # random_state only takes effect when shuffle=True


# Lists to store the scores of each fold
fit_score = [] 
val_score = []

verbose = False # Set it to True if you want additional info to be displayed

# enumerate is a built-in Python function: https://docs.python.org/3/library/functions.html#enumerate
for i, (fit_index,val_index) in enumerate(CV.split(X_train,Y_train)):
    
    X_fit = X_train.iloc[fit_index]
    Y_fit = Y_train.iloc[fit_index]
    X_val = X_train.iloc[val_index]
    Y_val = Y_train.iloc[val_index]
    
    
    model.fit(X_fit,Y_fit)
    
    pred_fit = model.predict(X_fit)
    pred_val = model.predict(X_val)
    
    if verbose :
        print(f'Rmse fit for fold {i+1} : {rmse(pred_fit,Y_fit):.3f}')
        print(f'Rmse val for fold {i+1} : {rmse(pred_val,Y_val):.3f}')
    
    fit_score.append(rmse(pred_fit,Y_fit))
    val_score.append(rmse(pred_val,Y_val))

fit_score = np.array(fit_score)
val_score = np.array(val_score)

print(f'RMSE score for fit :{np.mean(fit_score):.3f} ± {np.std(fit_score):.3f}')
print(f'RMSE score for val :{np.mean(val_score):.3f} ± {np.std(val_score):.3f}')

Instructions. Some areas for improvement:

  • Try other scikit-learn models to start with, such as random forest and SVM
  • Try to train a model with the LightGBM regressor with early stopping (out of the scope of this course)
  • Try out-of-fold bagging, which will improve your final prediction (out of the scope of this course)

Prediction and generation of the output file

At this point, you have trained k models and have an idea of how effective your solution is (the features used, the algorithm and its parameters). We now train the model on all the training data because, previously, we only used $\frac{4}{5}$ of the data in each cross-validation fold.

In [ ]:
model.fit(X_train,Y_train)
pred = model.predict(test.drop(['Id'],axis = 'columns'))

If you want to participate in a machine learning competition (e.g., Kaggle), you need to submit your predictions and thus first write them to a file. You will find below a piece of code to achieve this goal.

In [ ]:
submission = pd.DataFrame()
submission['Id'] = np.array(test['Id'])
submission['SalePrice'] = pred
submission.head()
In [ ]:
filename = f'submission_{np.mean(val_score):.3f}_{np.std(val_score):.3f}.csv'
submission.to_csv(f'submission/{filename}', index=False)  # assumes the submission/ directory already exists