Exploratory Data Analysis (EDA), Feature Selection, and Machine Learning Prediction on Time Series Data

oluyede Segun (jr)
Published in Analytics Vidhya
12 min read · Oct 13, 2021


INTRODUCTION

Dataset description: Swedish crime statistics from 1950 to 2015.

Attribute information of the dataset:

  • crimes.total: total number of reported crimes.
  • crimes.penal.code: total number of reported crimes against the criminal code.
  • crimes.person: total number of reported crimes against a person.
  • murder: total number of reported murders.
  • sexual.offences: total number of reported sexual offences.
  • rape: total number of reported rapes.
  • assault: total number of reported aggravated assaults.
  • stealing.general: total number of reported crimes involving stealing or robbery.
  • robbery: total number of reported armed robberies.
  • burglary: total number of reported armed burglaries.
  • vehicle.theft: total number of reported vehicle thefts.
  • house.theft: total number of reported thefts inside a house.
  • shop.theft: total number of reported thefts inside a shop.
  • out.of.vehicle.theft: total number of reported thefts from a vehicle.
  • criminal.damage: total number of reported criminal damages.
  • other.penal.crimes: number of other reported penal crime offences.
  • fraud: total number of reported frauds.
  • narcotics: total number of reported narcotics abuses.
  • drunk.driving: total number of reported drunk driving incidents.
  • Year: the year.
  • population: the total estimated population of Sweden at the time.

Download Dataset: https://www.kaggle.com/mguzmann/swedishcrime

GOAL: In this project we will train a machine learning model to predict the murder rate in Sweden using the Swedish crime dataset. We will also perform exploratory data analysis (EDA) and feature selection by accomplishing the following on the dataset:

1. Load and view dataset

2. Data visualization

3. Data preprocessing (data encoding, handling missing values, handling outliers (detection, removal, and replacement), and normalization)

4. Feature selection with filter, wrapper, and embedded methods

5. Compare training without feature selection and with feature selection (filter method (chi-square), wrapper method (RFE), and embedded method (Lasso))

6. Time series or regression algorithm comparison (Naïve Bayes, k-nearest neighbor, support vector machines, convolutional neural network, and recurrent neural network (RNN/LSTM))

7. Save trained model

1. Load and view dataset

View the first five rows and the shape of the dataset, check for missing data, and finally check the dataset's data types.
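A minimal sketch of this step with pandas; the CSV file name is an assumption, so use whatever name you saved the Kaggle file under:

```python
import pandas as pd

# Load the dataset (assuming the Kaggle CSV was saved as "swedish_crime.csv")
df = pd.read_csv("swedish_crime.csv")

print(df.head())          # first five rows
print(df.shape)           # (rows, columns)
print(df.isnull().sum())  # missing values per column
print(df.dtypes)          # data type of each column
```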

2. Data visualization

We can plot some visualizations using line charts, density plots, scatter plots, bar charts, and histograms.

line chart showing crime and burglary per year
density chart showing assault per year
scatterplot showing fraud per year
bar charts showing house theft per year
histogram showing rape per year
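One way the five plots listed above could be produced with pandas plotting and matplotlib (a sketch, assuming the DataFrame `df` from the loading step):

```python
import matplotlib.pyplot as plt

# Line chart: total crimes and burglary per year
df.plot(x="Year", y=["crimes.total", "burglary"], kind="line", title="Crimes and burglary per year")
plt.show()

# Density plot: assault
df["assault"].plot(kind="density", title="Assault density")
plt.show()

# Scatter plot: fraud per year
df.plot.scatter(x="Year", y="fraud", title="Fraud per year")
plt.show()

# Bar chart: house theft per year
df.plot.bar(x="Year", y="house.theft", title="House theft per year")
plt.show()

# Histogram: rape
df["rape"].plot(kind="hist", title="Rape histogram")
plt.show()
```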

3. Data preprocessing (data encoding, handling missing values, handling outliers (detection, removal, and replacement), and normalization)

3.1 Handling missing values

Replace the missing data with the mean of each column.

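A one-line sketch of mean imputation with pandas:

```python
# Replace missing values in each numeric column with that column's mean
df = df.fillna(df.mean(numeric_only=True))
print(df.isnull().sum().sum())  # should now be 0
```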

3.2 Encoding data

Since there are no object (string) columns in the dataset, there is no need to encode any columns.

check column data types

3.3 Outlier detection

3.3.1 visualization method

Outlier detection can be done visually with box plots, distribution plots, and other plots; box plots and distribution plots are used below.

boxplot outlier detection for the rape and crimes columns
distribution plot outlier detection for the crimes and rape columns
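A seaborn sketch of both plots; `histplot` with a KDE curve is used here in place of the older `distplot`:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Box plots for the crimes.total and rape columns
sns.boxplot(data=df[["crimes.total", "rape"]])
plt.show()

# Distribution plots for the same columns
sns.histplot(df["crimes.total"], kde=True)
plt.show()
sns.histplot(df["rape"], kde=True)
plt.show()
```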

3.3.2 Z-score detection

Another method for outlier detection is the Z-score:

The Z-score, also known as the standard score, tells how many standard deviations a data point is from the mean. It helps us understand whether a value is greater or smaller than the mean and how far away from the mean it is.

We calculate the Z-score for each column and set a threshold (commonly 3); any data point whose absolute Z-score exceeds the threshold is considered quite different from the other data points and is flagged as an outlier.
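A sketch of Z-score detection with scipy, using a threshold of 3:

```python
import numpy as np
from scipy import stats

# Z-score of every value, computed column by column
z = np.abs(stats.zscore(df.values))

# Flag rows where any column's |z| exceeds the threshold
threshold = 3
outlier_rows = (z > threshold).any(axis=1)
print(df[outlier_rows])
```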

3.3.3 Interquartile range

The IQR is used to measure variability by dividing a dataset into quartiles. Q1, Q2, and Q3, called the first, second, and third quartiles, are the values that split the dataset:

  • Q1 represents the 25th percentile of the data.
  • Q2 represents the 50th percentile of the data.
  • Q3 represents the 75th percentile of the data.

The IQR is the range between the first and third quartiles: IQR = Q3 - Q1. Data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are outliers.

Interquartile range on crimes dataset shows there are outliers
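A pandas sketch of the IQR rule applied to every column:

```python
# Interquartile range per column
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# True wherever a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = (df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))
print(outliers.sum())  # number of outlying values per column
```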

3.3.4 Removing outliers

In the previous sections we saw how to detect outliers using the Z-score and the interquartile range; now we want to remove or filter the outliers and get clean data.

We now have a new shape of (47, 21) after removal with the Z-score.

We now have a new shape of (51, 21) after removal with the interquartile range.

We can also replace outliers with the median value of each column, as shown below.

Outlier removal and replacement
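A sketch of both removal strategies and of median replacement; the shapes in the comments are the ones reported above, and the exact rows dropped depend on the imputation done earlier:

```python
import numpy as np
from scipy import stats

# Z-score removal: keep rows where every column's |z| is at most 3
z = np.abs(stats.zscore(df.values))
df_z = df[(z <= 3).all(axis=1)]
print(df_z.shape)    # the original run reports (47, 21)

# IQR removal: keep rows with no value outside the 1.5 * IQR fences
Q1, Q3 = df.quantile(0.25), df.quantile(0.75)
IQR = Q3 - Q1
mask = ~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)
df_iqr = df[mask]
print(df_iqr.shape)  # the original run reports (51, 21)

# Replacement: overwrite outlying values with the column median instead of dropping rows
df_med = df.copy()
for col in df_med.columns:
    low, high = Q1[col] - 1.5 * IQR[col], Q3[col] + 1.5 * IQR[col]
    df_med.loc[(df_med[col] < low) | (df_med[col] > high), col] = df_med[col].median()
```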

3.4 Normalization with min-max scaler

Min-max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.

min-max normalization applied to the crimes dataset with outliers replaced by median values
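A sketch with scikit-learn's MinMaxScaler, applied here to the median-replaced frame `df_med` from the previous step:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df_med), columns=df_med.columns)
print(df_scaled.min().min(), df_scaled.max().max())  # 0.0 and 1.0
```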

4. Feature selection with filter, wrapper, and embedded methods

Feature selection, also known as attribute selection, is the process of extracting the most relevant features from the dataset before applying machine learning algorithms, for better model performance. Feature selection usually leads to better learning performance, higher accuracy, lower computational cost, and better model interpretability.

4.1 Filter method

In the filter method, features are selected based on statistical measures. It is independent of the learning algorithm and requires less computational time. Examples are information gain, the chi-square test, Fisher score, correlation coefficient, variance threshold, and ANOVA.

4.1.1 Chi-square

Calculate the chi-square statistic between each feature and the target and select the desired number of features with the best chi-square scores. It determines whether the association between two categorical variables in the sample reflects their real association in the population.

chi-square feature selection
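A sketch with SelectKBest; chi2 needs non-negative features, so the min-max scaled frame is used, and the murder counts are treated as discrete labels (an assumption about how the original notebook framed the problem):

```python
from sklearn.feature_selection import SelectKBest, chi2

X = df_scaled.drop(columns=["murder"])  # non-negative features
y = df["murder"]                        # treated as discrete labels for the chi2 test

selector = SelectKBest(score_func=chi2, k=10)
selector.fit(X, y)
print(X.columns[selector.get_support()])  # the 10 best-scoring features
```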

4.1.2 Pearson correlation

Correlation is a measure of the linear relationship between two or more variables. Through correlation, we can predict one variable from another. The logic behind using correlation for feature selection is that good variables are highly correlated with the target.

Correlated features to murder column
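A sketch of ranking features by their Pearson correlation with the murder column; the 0.5 cut-off is an illustrative choice, not a rule:

```python
# Pearson correlation of every column with the murder column
corr_with_murder = df.corr()["murder"].sort_values(ascending=False)
print(corr_with_murder)

# Keep features whose absolute correlation with murder exceeds a chosen cut-off
selected = corr_with_murder[abs(corr_with_murder) > 0.5].index.tolist()
print(selected)
```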

4.1.3 Information gain

It is commonly used in the construction of decision trees from a training dataset, by evaluating the information gain for each variable, and selecting the variable that maximizes the information gain, which in turn minimizes the entropy and best splits the dataset into groups for effective classification.

Information gain of each columns and how they relate to murder column
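One way to compute an information-gain-style score for each column against the murder target is mutual information; a sketch using scikit-learn's mutual_info_regression (since murder is numeric):

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

X = df.drop(columns=["murder"])
y = df["murder"]

# Mutual information between each feature and the murder column
mi = pd.Series(mutual_info_regression(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False))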

4.1.4 ANOVA

ANOVA is an acronym for analysis of variance and is used for determining whether the means from two or more samples of data (often three or more) come from the same distribution or not. The results of this test can be used for feature selection where those features that are independent of the target variable can be removed from the dataset.

anova feature selection on crimes dataset
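A sketch of ANOVA-style univariate selection with SelectKBest; f_regression is used here because the target is numeric (f_classif would be the choice for a categorical target):

```python
from sklearn.feature_selection import SelectKBest, f_regression

X = df.drop(columns=["murder"])
y = df["murder"]

# F-test between each feature and the target; keep the 10 best
selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X, y)
print(X.columns[selector.get_support()])
```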

4.2 Wrapper method

The wrapper methodology treats the selection of feature sets as a search problem, where different combinations are prepared, evaluated, and compared. A predictive model is used to evaluate each combination of features and assign a model performance score. RFE, forward selection, and backward elimination are used on the dataset.

4.2.1 RFE

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model’s coef_ or feature_importances_ attributes, and by recursively eliminating a small number of features per loop, RFE attempts to eliminate dependencies and collinearity that may exist in the model.

RFE feature selection on crimes dataset
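A sketch of RFE with a linear regression as the ranking model; keeping 10 features is an illustrative choice:

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X = df.drop(columns=["murder"])
y = df["murder"]

# Recursively drop the weakest feature until 10 remain
rfe = RFE(estimator=LinearRegression(), n_features_to_select=10)
rfe.fit(X, y)
print(X.columns[rfe.support_])              # selected features
print(dict(zip(X.columns, rfe.ranking_)))   # rank 1 = selected
```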

4.2.2 Forward selection

Forward selection is an iterative method in which we start with no features in the model. In each iteration, we add the feature that best improves the model, until adding a new variable no longer improves performance.

forward selection feature selection on crimes dataset
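A sketch using scikit-learn's SequentialFeatureSelector with direction="forward"; the original notebook may have used a different library, and the target size of 10 features is an assumption:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X = df.drop(columns=["murder"])
y = df["murder"]

# Start from an empty set and greedily add the feature that most improves the CV score
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])
```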

4.2.3 Backward elimination

In backward elimination, we start with all the features and remove the least significant feature at each iteration, which improves the performance of the model. We repeat this until no further improvement is observed on removing a feature.

backward elimination feature selection on crimes dataset
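The same selector can run in the opposite direction; a sketch of backward elimination under the same assumptions as the forward-selection snippet:

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X = df.drop(columns=["murder"])
y = df["murder"]

# Start from all features and greedily remove the least useful one at each step
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=10,
                                direction="backward", cv=5)
sfs.fit(X, y)
print(X.columns[sfs.get_support()])
```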

4.3 Embedded method

These methods combine the benefits of the wrapper and filter methods by including interactions between features while maintaining reasonable computational cost. Embedded methods are iterative in the sense that the selection happens during model training: at each training iteration they extract the features that contribute the most.

4.3.1 Lasso Regularization

Lasso (L1) regularization consists of adding a penalty to the parameters of the machine learning model to avoid over-fitting. In linear model regularization, the penalty is applied to the coefficients that multiply each of the predictors. Among the different types of regularization, Lasso has the property of shrinking some of the coefficients exactly to zero, so the corresponding features can be removed from the model.

L1 regularization feature selection on crimes dataset
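A sketch using SelectFromModel with a Lasso estimator; the alpha value is an illustrative choice and standardizing the features first keeps the penalty comparable across columns:

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["murder"])
y = df["murder"]

# Lasso drives some coefficients to exactly zero; SelectFromModel keeps the rest
X_std = StandardScaler().fit_transform(X)
selector = SelectFromModel(Lasso(alpha=0.1))
selector.fit(X_std, y)
print(X.columns[selector.get_support()])
```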

4.3.2 Ridge Regression

L2 or ridge regression, on the other hand, is useful when you have collinear/codependent features. Ridge regression adds the squared magnitude of the coefficients as a penalty term to the loss function.

Ridge regression feature selection on crimes dataset
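Since ridge shrinks but rarely zeroes coefficients, a common approach is to rank features by the absolute value of their coefficients; a sketch with an illustrative alpha:

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["murder"])
y = df["murder"]

# Fit ridge on standardized features and rank by |coefficient|
ridge = Ridge(alpha=1.0)
ridge.fit(StandardScaler().fit_transform(X), y)
coefs = pd.Series(abs(ridge.coef_), index=X.columns)
print(coefs.sort_values(ascending=False))
```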

4.3.3 Random forest importance

Feature importance refers to a class of techniques for assigning scores to input features to a predictive model that indicates the relative importance of each feature when making a prediction.

Random forest is a kind of bagging algorithm that aggregates a specified number of decision trees. The tree-based strategies used by random forests naturally rank features by how well they improve the purity of a node, in other words by the decrease in impurity (Gini impurity) over all trees.

Random forest importance on crimes dataset
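A sketch with a random forest regressor and its impurity-based importances; 100 trees is an illustrative setting:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

X = df.drop(columns=["murder"])
y = df["murder"]

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# One impurity-based importance score per feature
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```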

4.3.4 Principal Component Analysis (PCA)

PCA is a dimensionality reduction method that can be described and implemented using the tools of linear algebra. The basic idea when using PCA as a tool for feature selection is to select variables according to the magnitude (from largest to smallest in absolute value) of their coefficients (loadings).

PCA on crimes dataset
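A sketch of PCA on the standardized features; keeping 5 components is an illustrative choice, and the loadings matrix shows which original column dominates each component:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=["murder"])

# Standardize, then project onto the first 5 principal components
pca = PCA(n_components=5)
pca.fit(StandardScaler().fit_transform(X))

print(pca.explained_variance_ratio_)            # variance captured per component
loadings = pd.DataFrame(pca.components_.T, index=X.columns)
print(loadings.abs().idxmax())                  # feature with the largest loading per component
```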

5. Compare training without feature selection and with feature selection (filter method (chi-square), wrapper method (RFE), and embedded method (Lasso))

5.1 Naïve Bayes training without feature selection

We have an accuracy of 0.7 after training with naïve Bayes without feature selection.

Naïve Bayes training without feature selection
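A sketch of the baseline naïve Bayes run. Because murder is a numeric count, this sketch assumes the target was binned into two classes (above/below its median) so that accuracy is a meaningful metric; the exact target encoding and split in the original notebook may differ, so the reported accuracies will not reproduce exactly:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X = df.drop(columns=["murder"])
# Assumption: turn murder into a binary label (above/below its median)
y = (df["murder"] > df["murder"].median()).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

nb = GaussianNB()
nb.fit(X_train, y_train)
print(accuracy_score(y_test, nb.predict(X_test)))
```

Sections 5.2.1 to 5.2.3 repeat this same pattern, with X restricted to the columns chosen by chi-square, RFE, and Lasso respectively.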

5.2 Naïve Bayes training with feature selection

5.2.1 Chi-square

We have an accuracy of 0.714 after training naïve Bayes with the filter method (chi-square) feature selection.

Naïve Bayes training with chi-square feature selection

5.2.2 RFE

We have an accuracy of 0.643 after training naïve Bayes with the wrapper method (RFE) feature selection.

Naïve Bayes training with RFE feature selection

5.2.3 LASSO

We have an accuracy of 0.785 after training naïve Bayes with the embedded method (LASSO) feature selection.

Naïve Bayes training with LASSO feature selection

6. Time series or regression algorithm comparison (Naïve Bayes, k-nearest neighbor, support vector machines, convolutional neural network, and RNN (LSTM))

6.1 Naïve Bayes

Naïve Bayes is an algorithm based on Bayes’ theorem with an assumption of independence among predictors. In simple terms, a naïve Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. When that assumption roughly holds, naïve Bayes can perform well compared to models like logistic regression, and it needs less training data.

naive bayes training with lasso feature selection

6.2 K nearest neighbor

KNN algorithm can be used for both classification and regression problems. The KNN algorithm uses ‘feature similarity’ to predict the values of any new data points. This means that the new point is assigned a value based on how closely it resembles the points in the training set.

KNN training with lasso feature selection
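A sketch of KNN as a regressor on the murder counts; the feature set is assumed to be the Lasso-selected columns from section 4.3.1 (here the full feature matrix stands in for it), and mean squared error is used as the score:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Replace X with the Lasso-selected feature columns from section 4.3.1
X = df.drop(columns=["murder"])
y = df["murder"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each prediction is the average of the 5 most similar training points
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)
print(mean_squared_error(y_test, knn.predict(X_test)))
```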

6.3 Support Vector Machine (SVM)

Support vector machines can also be used as a regression method (SVR), maintaining the main feature that characterizes the algorithm: the maximal margin. In the case of regression, a margin of tolerance (epsilon) is set, and the algorithm tries to fit a function whose errors stay within that margin while remaining as flat as possible.

SVM training with lasso feature selection
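A sketch of support vector regression under the same assumptions as the KNN snippet; the kernel, C, and epsilon values are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Replace X with the Lasso-selected feature columns from section 4.3.1
X = df.drop(columns=["murder"])
y = df["murder"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# epsilon is the width of the tube within which errors are not penalized
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X_train, y_train)
print(mean_squared_error(y_test, svr.predict(X_test)))
```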

6.4 Convolutional Neural Network

Convolutional Neural Network (CNN) models are mainly used for two-dimensional arrays like image data. However, we can also apply CNN with regression data analysis. In this case, we apply a one-dimensional convolutional network and reshape the input data according to it. Keras provides the Conv1D class to add a one-dimensional convolutional layer into the model.

CNN training with lasso feature selection
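A sketch of a one-dimensional CNN regression in Keras; the layer sizes, epochs, and batch size are illustrative choices, and the feature matrix is again assumed to be the Lasso-selected subset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Flatten, Dense

X = df.drop(columns=["murder"]).values.astype("float32")
y = df["murder"].values.astype("float32")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Conv1D expects 3-D input (samples, steps, channels): treat each feature as a step
n_features = X_train.shape[1]
X_train = X_train.reshape(-1, n_features, 1)
X_test = X_test.reshape(-1, n_features, 1)

model = Sequential([
    Conv1D(filters=64, kernel_size=2, activation="relu", input_shape=(n_features, 1)),
    Flatten(),
    Dense(32, activation="relu"),
    Dense(1),                                   # one continuous output for regression
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))  # test-set MSE
```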

6.5 RNN (LSTM)

RNNs (LSTMs) are quite good at extracting patterns from an input feature space where the data spans long sequences. Given the gated architecture of the LSTM and its ability to manipulate its memory state, LSTMs are well suited to regression and time series problems.
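A sketch of the LSTM counterpart, reusing the reshaped 3-D arrays prepared for the CNN above; the unit count and training settings are illustrative:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Reuses n_features and the reshaped X_train / X_test arrays from the CNN snippet
model = Sequential([
    LSTM(50, activation="tanh", input_shape=(n_features, 1)),
    Dense(1),                                   # regression output
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=0)
print(model.evaluate(X_test, y_test, verbose=0))  # test-set MSE
```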

7. Save trained model

Now that the machine learning model has been trained, we can save it with pickle.

save trained model with pickle
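A sketch of saving and reloading a trained model with pickle; the file name is just an example, and here the naïve Bayes classifier from section 5 stands in for whichever model you keep:

```python
import pickle

# Save the trained model to disk
with open("sweden_crime_model.pkl", "wb") as f:
    pickle.dump(nb, f)

# Load it back later for prediction
with open("sweden_crime_model.pkl", "rb") as f:
    loaded_model = pickle.load(f)
```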

CONCLUSION

This project explained the process of EDA on the Swedish crime dataset. We covered visualization; data preprocessing by handling missing data, outliers, and normalization; explained feature selection methods and compared chi-square, RFE, and Lasso training accuracy; and finally compared SVM, KNN, naïve Bayes, CNN, and LSTM.

WRITER: OLUYEDE SEGUN . A(jr)

Resources used (References) and further reading:

linkedin profile: https://www.linkedin.com/in/oluyede-segun-adedeji-jr-a5550b167/

Link to explanatory notebook: https://github.com/juniorboycoder/TIME_SEREIS_EDA_FEATURE_SELECTION_AND_PREDICITVE_ANALYSIS/blob/main/eda_and_feature_Selection_timeseries_project.ipynb

twitter profile: https://twitter.com/oluyedejun1

TAGS: #FeatureSelection #Outlier #timeseries #regression #CNN #SVM #KNN #LSTM #Naivebayes #filter #wrapper #embedded #EDA
