My Journey Through Machine Learning With Python

2018:
I got interested in Machine Learning in late 2018 through a blog I accidentally stumbled upon. It was an opener to the world of Artificial Intelligence and Machine Learning. As a Computer Science graduate and Full Stack Programmer of over 20 years, it was a no-brainer to look the way of Machine Learning. I was really fascinated and excited by the capability and process of Machine Learning.
2019:
In 2019 I decided to do some hands-on online Machine Learning courses after reading numerous tutorials, posts and blogs on Machine Learning with Python. The courses ranged from beginner to intermediate level. It's a pity I did not document all my activities during this time, but I did document the data pre-processing section of the Pima Indians Onset of Diabetes dataset.

That same year I was lucky to be invited to attend a one-week Data Science, AI and Machine Learning with Python event, tagged AI+ Invasion Port Harcourt, organised by Data Science Nigeria. It was an insightful and fun event, and I got to meet other people who were interested in Machine Learning.

15th February, 2019:

Prepare For Modeling by Pre-Processing Data

Today, I continued working on the Pima Indians Onset of Diabetes dataset and read more about "Prepare For Modeling by Pre-Processing Data". Data cleaning is a critically important step in any machine learning project; data scientists often claim that 80% of their time is consumed by this hectic process.

Sometimes you need to prepare your data so that it best presents the inherent structure of the problem to the modeling algorithms. There are many techniques one can use to prepare data for modeling. I am using the pre-processing capabilities provided by scikit-learn.

    For example, I tried the following techniques:
  • Standardized numerical data (e.g. to a mean of 0 and standard deviation of 1)
    using StandardScaler.
  • Normalized numerical data (e.g. rescaling each row to unit length) using Normalizer.
  • Explored more advanced feature engineering such as binarizing.

    Let's see some code:

      Standardization:

        # Preprocess data using standardization in scikit-learn
        # Standardize data (0 mean, 1 stdev)
        from sklearn.preprocessing import StandardScaler
        import pandas
        import numpy

        url = "diabetes.csv"
        names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
        dataframe = pandas.read_csv(url, names=names)
        array = dataframe.values
        # separate array into input and output components
        X = array[:, 0:8]
        Y = array[:, 8]
        scaler = StandardScaler().fit(X)
        rescaledX = scaler.transform(X)
        # summarize transformed data
        numpy.set_printoptions(precision=3)
        print(rescaledX[0:5, :])

    Output:

        [[ 0.64   0.848  0.15   0.907 -0.693  0.204  0.468  1.426]
         [-0.845 -1.123 -0.161  0.531 -0.693 -0.684 -0.365 -0.191]
         [ 1.234  1.944 -0.264 -1.288 -0.693 -1.103  0.604 -0.106]
         [-0.845 -0.998 -0.161  0.155  0.123 -0.494 -0.921 -1.042]
         [-1.142  0.504 -1.505  0.907  0.766  1.41   5.485 -0.02 ]]
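    As a quick sanity check (not part of the original exercise), what StandardScaler does can be verified on a tiny made-up array — after the transform, each column should have mean 0 and standard deviation 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data: 4 samples, 2 features (made-up numbers, not the diabetes data)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

scaler = StandardScaler().fit(X)
Z = scaler.transform(X)

# each column now has mean ~0 and standard deviation ~1
print(Z.mean(axis=0))  # ~[0. 0.]
print(Z.std(axis=0))   # ~[1. 1.]
```

    This is exactly z = (x - mean) / stdev applied per column, which is what the diabetes code above does on all eight input features.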
      Normalization:

        # Preprocess data using normalization in scikit-learn
        # Normalize data (each row rescaled to length 1)
        from sklearn.preprocessing import Normalizer
        import pandas
        import numpy

        url = "diabetes.csv"
        names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
        dataframe = pandas.read_csv(url, names=names)
        array = dataframe.values
        # separate array into input and output components
        X = array[:, 0:8]
        Y = array[:, 8]
        scaler = Normalizer().fit(X)
        normalizedX = scaler.transform(X)
        # summarize transformed data
        numpy.set_printoptions(precision=3)
        print(normalizedX)

    Output:

        [[0.034 0.828 0.403 ... 0.188 0.004 0.28 ]
         [0.008 0.716 0.556 ... 0.224 0.003 0.261]
         [0.04  0.924 0.323 ... 0.118 0.003 0.162]
         ...
         [0.027 0.651 0.388 ... 0.141 0.001 0.161]
         [0.007 0.838 0.399 ... 0.2   0.002 0.313]
         [0.008 0.736 0.554 ... 0.241 0.002 0.182]]
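    Worth noting: Normalizer rescales each row to unit length, not each feature to a 0-1 range. For the "range of 0-1" rescaling mentioned earlier, scikit-learn provides MinMaxScaler. A minimal sketch on made-up data (not the diabetes dataset):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy data: 3 samples, 2 features on very different scales
X = np.array([[1.0,  50.0],
              [2.0, 100.0],
              [3.0, 150.0]])

# rescale each column so its minimum maps to 0 and its maximum to 1
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
rescaledX = scaler.transform(X)
print(rescaledX)
# [[0.  0. ]
#  [0.5 0.5]
#  [1.  1. ]]
```

    The same pattern (fit on X, then transform) would work on the eight diabetes input features above.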
      Binarization:

        # Preprocess data using binarization in scikit-learn
        # Binarize data (values above the threshold become 1, the rest 0)
        from sklearn.preprocessing import Binarizer
        import pandas
        import numpy

        url = "diabetes.csv"
        names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
        dataframe = pandas.read_csv(url, names=names)
        array = dataframe.values
        # separate array into input and output components
        X = array[:, 0:8]
        Y = array[:, 8]
        binarizer = Binarizer(threshold=0.0).fit(X)
        binaryX = binarizer.transform(X)
        # summarize transformed data
        numpy.set_printoptions(precision=3)
        print(binaryX[0:5, :])

    Output:

        [[1. 1. 1. 1. 0. 1. 1. 1.]
         [1. 1. 1. 1. 0. 1. 1. 1.]
         [1. 1. 1. 0. 0. 1. 1. 1.]
         [1. 1. 1. 1. 1. 1. 1. 1.]
         [0. 1. 1. 1. 1. 1. 1. 1.]]


    18th May, 2021:

    On this day I officially became a Kaggler. :)

    My first competition was to build a predictive model that answers the question "what sorts of people were more likely to survive?" the Titanic disaster. I guess all first-timers go through the same process. The project was given to us to help us get familiar with the Kaggle environment and competition processes, but we were still required to do the actual work.

    My first task was to download the Titanic datasets and have a peek at them. Then I set out to examine the dataset properly using the Pandas Python library, to get a better sense of the structure, counts, missing records, and the nature of the features and of the overall dataset. ...
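    The kind of first-pass inspection I mean can be sketched with Pandas. The frame below is a tiny made-up sample with Titanic-style columns, purely for illustration (the real data comes from Kaggle's training file):

```python
import pandas as pd

# made-up sample rows with Titanic-style columns (illustration only)
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass":   [3, 1, 3],
    "Age":      [22.0, 38.0, None],
    "Sex":      ["male", "female", "female"],
})

print(df.shape)           # overall structure: (rows, columns)
print(df.dtypes)          # nature of each feature
print(df.isnull().sum())  # missing records per column
print(df.describe())      # summary statistics of numeric features
```

    Running these few calls on the real training set is usually enough to see which columns need cleaning before any modeling starts.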


    19 May, 2021

    — Fredrick Ughimi (Twitter: @fredrickughimi)
