1: Data processing
2: Descriptive statistics
3: Visualizations
4: Machine learning models
Importing libraries
import numpy as np
import pandas as pd
import os
import sklearn
from sklearn.ensemble import RandomForestClassifier
from collections import Counter
from sklearn.tree import DecisionTreeClassifier,export_graphviz
from sklearn.model_selection import train_test_split,cross_val_predict, cross_val_score
from sklearn.feature_selection import SelectKBest, chi2
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn releases
from IPython.display import Image
import pydotplus
import seaborn as sns
import category_encoders as ce
import matplotlib.pyplot as plt
import matplotlib as mpl
from pandas.plotting import scatter_matrix
import xgboost as xgb
from sklearn.metrics import mean_squared_error,confusion_matrix,classification_report,accuracy_score
Reading dataset
dataset=pd.read_csv("C:/Users/Maryna/Desktop/winter HW/dataset.csv")
dataset.head()
a) Data overview: 1,781 records and 21 columns, of which 7 are categorical variables
print(dataset.info())
dataset.drop(["URL", "WHOIS_REGDATE", "WHOIS_UPDATED_DATE"], axis=1, inplace = True)
b) Detection and interpolation of missing values
print(dataset.isnull().sum())
dataset = dataset.interpolate()
print(dataset.isnull().sum())
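Note that interpolate() only fills numeric columns, so any gaps in the categorical columns would survive it; a minimal sketch of handling those as well (assuming a literal 'None' placeholder, matching how missing states and countries are represented later):
# Sketch: fill remaining gaps in object (categorical) columns with a 'None' placeholder
cat_cols = dataset.select_dtypes(include='object').columns
dataset[cat_cols] = dataset[cat_cols].fillna('None')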
c) Feature Engineering on Categorical Data
dataset["SERVER"].value_counts()
#Group SERVER values that make up less than 1% of rows into one "Other" bin to reduce the number of unique values
series = dataset['SERVER'].value_counts()
mask = (series / series.sum() * 100).lt(1)
dataset['SERVER'] = np.where(dataset['SERVER'].isin(series[mask].index), 'Other', dataset['SERVER'])
dataset["SERVER"].value_counts()
dataset["CHARSET"].value_counts()
#Group CHARSET values that make up less than 1% of rows into one "Other" bin to reduce the number of unique values
series = dataset['CHARSET'].value_counts()
mask = (series / series.sum() * 100).lt(1)
dataset['CHARSET'] = np.where(dataset['CHARSET'].isin(series[mask].index), 'Other', dataset['CHARSET'])
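The same 1% binning is applied to both SERVER and CHARSET; a small helper (a sketch with an assumed name, bin_rare_categories) would avoid repeating those three lines per column:
# Sketch: group categories below a percentage threshold into an "Other" bin
def bin_rare_categories(df, column, threshold_pct=1.0):
    counts = df[column].value_counts()
    rare = counts[(counts / counts.sum() * 100) < threshold_pct].index
    df[column] = np.where(df[column].isin(rare), 'Other', df[column])
    return df
# Usage, equivalent to the blocks above: dataset = bin_rare_categories(dataset, 'SERVER')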
dataset["WHOIS_COUNTRY"].value_counts()
#Consolidate duplicate country spellings into one canonical code
def replace(x):
    if x == "[u'GB'; u'UK']" or x == "United Kingdom" or x == "UK":
        return "GB"
    elif x == "Cyprus":
        return "CY"
    elif x == "us":
        return "US"
    elif x == "ru":
        return "RU"
    elif x == "se":
        return "SE"
    else:
        return x
dataset["WHOIS_COUNTRY"] = dataset["WHOIS_COUNTRY"].map(replace)
#Group countries that appear fewer than 10 times into one "Other" bin to reduce the number of unique values
counts = dataset['WHOIS_COUNTRY'].value_counts()
dataset['WHOIS_COUNTRY'] = np.where(dataset['WHOIS_COUNTRY'].isin(counts[counts < 10].index),'Other',dataset['WHOIS_COUNTRY'])
dataset["WHOIS_COUNTRY"].value_counts()
dataset["WHOIS_STATEPRO"].value_counts()
#Consolidate duplicate state/province spellings into one canonical code
state_map = {
    "California": "CA", "CALIFORNIA": "CA",
    "Arizona": "AZ",
    "New York": "NY", "NEW YORK": "NY",
    "Ohio": "OH",
    "Utah": "UT",
    "None": "NA",
    "Texas": "TX",
    "Washington": "WA",
    "va": "VA",
    "Illinois": "IL", "il": "IL",
    "District of Columbia": "MD", "DC": "MD", "Maryland": "MD",
    "New Jersey": "NJ",
    "Maine": "ME", "MAINE": "ME",
    "Quebec": "QC", "QUEBEC": "QC", "qc": "QC", "quebec": "QC",
    "Missouri": "MO",
    "Nevada": "NV",
    "WC1N": "England", "Greater London": "England", "UK": "England",
    "WEST MIDLANDS": "England", "worcs": "England", "Peterborough": "England",
    "London": "England", "HANTS": "England", "MIDDLESEX": "England",
    "Pennsylvania": "PA",
    "Florida": "FL", "FLORIDA": "FL",
}
def replace_ny_ca(x):
    return state_map.get(x, x)
dataset["WHOIS_STATEPRO"] = dataset["WHOIS_STATEPRO"].map(replace_ny_ca)
#Identifying missing State values based on the country column
country_to_region = {
    'GB': 'England', 'SE': 'Sweden', 'LU': 'Luxembourg', 'FR': 'France',
    'IL': 'Israel', 'BE': 'Belgium', 'NO': 'Norway', 'TR': 'Turkey',
    'DE': 'Germany', 'BR': 'Brazil', 'JP': 'Japan', 'AU': 'Australia',
    'PH': 'Philippines', 'CZ': 'CzechRep', 'KR': 'SKorea', 'UA': 'Ukraine',
    'HK': 'Hong Kong', 'CH': 'Switzerland', 'CY': 'Cyprus'
}
for i, v in dataset.iterrows():
    if v['WHOIS_STATEPRO'] == 'None' and v['WHOIS_COUNTRY'] in country_to_region:
        print(country_to_region[v['WHOIS_COUNTRY']])
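The loop above only prints the regions implied by the country code; a minimal sketch of actually writing them back (reusing the country_to_region mapping above, on a copy so the working dataset is untouched; dataset_filled is a hypothetical name):
# Sketch: fill missing states from the country-level mapping; 'None' marks a missing state
dataset_filled = dataset.copy()
missing_state = dataset_filled['WHOIS_STATEPRO'] == 'None'
dataset_filled.loc[missing_state, 'WHOIS_STATEPRO'] = (
    dataset_filled.loc[missing_state, 'WHOIS_COUNTRY']
    .map(country_to_region)
    .fillna('None')
)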
#Group states that appear fewer than 20 times into one "Other" bin to reduce the number of unique values
counts = dataset['WHOIS_STATEPRO'].value_counts()
dataset['WHOIS_STATEPRO'] = np.where(dataset['WHOIS_STATEPRO'].isin(counts[counts < 20].index),'Other',dataset['WHOIS_STATEPRO'])
dataset["WHOIS_STATEPRO"].value_counts()
#Binary encoding of categorical variables
dataset_ce = dataset.copy()
encoder = ce.BinaryEncoder(cols=['WHOIS_STATEPRO','WHOIS_COUNTRY', 'CHARSET', 'SERVER'])
df_binary = encoder.fit_transform(dataset_ce)
df_binary.head()
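To see what the encoder produced, the new binary columns can be counted and inspected (a quick check; the <feature>_<n> naming pattern is the one BinaryEncoder uses, as seen in columns like WHOIS_COUNTRY_1 further below):
# Sanity check: list the binary columns created for the four encoded features
encoded_cols = [c for c in df_binary.columns
                if c.startswith(('SERVER_', 'CHARSET_', 'WHOIS_COUNTRY_', 'WHOIS_STATEPRO_'))]
print(len(encoded_cols), "binary columns created")
print(df_binary[encoded_cols].head())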
dataset.describe()
%matplotlib inline
sns.set(style="darkgrid")
sns.countplot(x="Type", data=dataset)
plt.title('Benign vs Malicious Websites',fontsize=30)
plt.ylabel('Number of Occurrences',fontsize=20)
plt.xlabel('Type',fontsize=20)
plt.figure(figsize=(16, 6))
#Segregating the classes
yes = dataset[dataset.Type == 1]
no = dataset[dataset.Type == 0]
print('YES : %d No: %d'%(len(yes), len(no)))
ratio=round(len(yes)/(len(yes)+len(no))*100)
print("Malitious websites ratio is",ratio,"%")
dataset.drop(dataset.loc[dataset['WHOIS_COUNTRY']=="None"].index, inplace=True)
dataset[dataset['Type']==1].groupby('WHOIS_COUNTRY')['WHOIS_COUNTRY'].count().sort_values(ascending=False).head(5).plot(kind='bar')
sns.set(style="darkgrid")
plt.title('Frequency Distribution of Malicious Websites per Country',fontsize=15)
plt.ylabel('Number of Occurrences', fontsize=20)
plt.xlabel('Country', fontsize=20)
plt.show()
dataset.drop(dataset.loc[dataset['WHOIS_STATEPRO']=="NA"].index, inplace=True)
dataset[dataset['Type']==1].groupby('WHOIS_STATEPRO')['WHOIS_STATEPRO'].count().sort_values(ascending=False).head(6).plot(kind='bar')
sns.set(style="darkgrid")
plt.title('Frequency Distribution of Malicious Websites per State', fontsize=15)
plt.ylabel('Number of Occurrences', fontsize=20)
plt.xlabel('State', fontsize=20)
plt.show()
sns.countplot(y="Type", hue='CHARSET', data=dataset)
plt.title('Frequency Distribution of Charset',fontsize=25)
plt.ylabel('Type', fontsize=20)
plt.xlabel('Number of Occurrences', fontsize=20)
dataset.drop(dataset.loc[dataset['SERVER']=="Other"].index, inplace=True)
dataset[dataset['Type']==1].groupby('SERVER')['SERVER'].count().sort_values(ascending=False).head(5).plot(kind='barh')
sns.set(style="darkgrid")
plt.title('Frequency Distribution of Malicious Websites per Server',fontsize=15)
plt.xlabel('Number of Occurrences', fontsize=20)
plt.ylabel('Server', fontsize=20)
plt.show()
#Density distribution of SOURCE_APP_PACKETS
#The mean lies roughly between 14 and 18 and the distribution is right-skewed
sns.distplot(df_binary['SOURCE_APP_PACKETS'], hist=True, kde=True,
             color='darkblue',
             hist_kws={'edgecolor': 'black'},
             kde_kws={'linewidth': 4})
plt.xlim(0, 80)
#Positive correlations among some of the variables and right-skewed histograms are observed
scatter_matrix(df_binary[['NUMBER_SPECIAL_CHARACTERS','TCP_CONVERSATION_EXCHANGE','URL_LENGTH','DNS_QUERY_TIMES']])
plt.show()
bplot = sns.boxplot(y='DNS_QUERY_TIMES', x='WHOIS_COUNTRY',
                    data=dataset,
                    width=0.5,
                    palette="colorblind")
plt.show()
bplot = sns.boxplot(x='REMOTE_IPS', y='SERVER',
                    data=dataset,
                    width=0.5,
                    palette="colorblind")
#Split the dataset into features and the target variable
X = df_binary.drop('Type',axis=1) #Predictors
y = df_binary['Type']
#Apply the SelectKBest class with the chi-squared test to score the features and extract the top 15
bestfeatures = SelectKBest(score_func=chi2, k=15)
fit = bestfeatures.fit(X,y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)
#concat two dataframes for better visualization
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['Specs','Score'] #naming the dataframe columns
print(featureScores.nlargest(15,'Score')) #print 15 best features
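As an alternative to hard-coding the column names below, the top-scoring feature names could be pulled straight from the score table (a small sketch reusing featureScores from above):
# Sketch: derive the 15 best feature names from the chi-squared scores
top_features = featureScores.nlargest(15, 'Score')['Specs'].tolist()
print(top_features)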
feature_cols = ['SOURCE_APP_BYTES','CONTENT_LENGTH','APP_BYTES', 'REMOTE_APP_BYTES', 'DIST_REMOTE_TCP_PORT', 'URL_LENGTH',
'WHOIS_COUNTRY_1','TCP_CONVERSATION_EXCHANGE', 'NUMBER_SPECIAL_CHARACTERS', 'WHOIS_STATEPRO_1',
'REMOTE_APP_PACKETS', 'SOURCE_APP_PACKETS', 'APP_PACKETS','WHOIS_COUNTRY_4','WHOIS_STATEPRO_3']
X = df_binary[feature_cols] # Features
y = df_binary.Type # Target variable
# To understand model performance, dividing the dataset into a training set and a test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
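Since malicious sites are a clear minority (see the class ratio printed earlier), a stratified split keeps that ratio in both subsets; a hedged, optional variant of the call above:
# Optional sketch: preserve the benign/malicious ratio in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)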
# Create Decision Tree classifer object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=12)
# Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
#Create a Random Forest classifier
rf=RandomForestClassifier(n_estimators=100)
#Train the model using the training set
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
# Summarize model performance on the training or the test set
def print_score(classifier, X_train, Y_train, X_test, Y_test, train=True):
    if train == True:
        print("Training results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(Y_train, classifier.predict(X_train))))
        print('Classification Report:\n{}\n'.format(classification_report(Y_train, classifier.predict(X_train))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_train, classifier.predict(X_train))))
        res = cross_val_score(classifier, X_train, Y_train, cv=10, n_jobs=-1, scoring='accuracy')
        print('Average Accuracy:\t{0:.4f}\n'.format(res.mean()))
        print('Standard Deviation:\t{0:.4f}'.format(res.std()))
    elif train == False:
        print("Test results:\n")
        print('Accuracy Score: {0:.4f}\n'.format(accuracy_score(Y_test, classifier.predict(X_test))))
        print('Classification Report:\n{}\n'.format(classification_report(Y_test, classifier.predict(X_test))))
        print('Confusion Matrix:\n{}\n'.format(confusion_matrix(Y_test, classifier.predict(X_test))))
print_score(rf,X_train,y_train,X_test,y_test,train=False)
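A quick sketch of the forest's built-in importances complements the XGBoost importance plot further below (reusing rf and feature_cols from above):
# Sketch: rank the selected features by the random forest's impurity-based importance
importances = pd.Series(rf.feature_importances_, index=feature_cols).sort_values()
importances.plot(kind='barh', figsize=(8, 6), title='Random Forest Feature Importances')
plt.show()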
def create_graph(forest, feature_names):
    estimator = forest.estimators_[5]
    export_graphviz(estimator, out_file='tree.dot',
                    feature_names=feature_names,
                    class_names=['benign', 'malicious'],
                    rounded=True, proportion=False, precision=2, filled=True)
    # Convert to png using a system command (requires Graphviz)
    from subprocess import call
    call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=200'])
create_graph(rf, list(X))
# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
#Converting the dataset into XGBoost's optimized DMatrix data structure
data_dmatrix = xgb.DMatrix(data=X,label=y)
#Instantiating XGBoost regressor object by calling the XGBRegressor() class from the XGBoost library with the
#hyper-parameters passed as arguments
xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
max_depth = 5, alpha = 10, n_estimators = 10)
#Fitting the regressor to the training set and making predictions on the test set
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
#In order to build more robust models, let's do a k-fold cross validation where all the entries in the
#original training dataset are used for both training as well as validation
params = {"objective":"reg:linear",'colsample_bytree': 0.3,'learning_rate': 0.1,
'max_depth': 5, 'alpha': 10}
cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=3,
num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
# The cross-validated RMSE is lower than the one obtained on the single train/test split above
print((cv_results["test-rmse-mean"]).tail(1))
#Plotting feature importance
xg_reg = xgb.train(params=params, dtrain=data_dmatrix, num_boost_round=10)
plt.rcParams['figure.figsize'] = [5, 5]
xgb.plot_importance(xg_reg)
plt.show()
plt.rcParams['figure.figsize'] = [20, 10]
xgb.plot_tree(xg_reg, num_trees=0)
plt.show()