The Basic Concepts Of Data Science

Yusuf Karatoprak
8y

The basic concepts of data science can be separated into two important parts. Some readers may object that I should also cover supervised learning, unsupervised learning, and decision tree algorithms. But my intention is not to explain every concept of data science. This article is for readers who want to get started as data scientists.

The basic concepts of Data Science can be separated into two parts.

Regression
Classification

Why do we have to learn these two concepts? Regression lets us model the relationship between two variables, here by fitting a linear equation. Classification assigns each data point to a category, and the model that makes this assignment is called a classifier.
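Since the rest of this article concentrates on regression, here is a minimal sketch of what a classifier does. It is only one simple approach (a nearest-centroid rule). The data and the names fit_centroids and classify are hypothetical and are not part of the article's data set.

```python
# Hypothetical toy data: (feature value, label). Not from the article's data set.
training = [(1.0, "small"), (2.0, "small"), (3.0, "small"),
            (8.0, "large"), (9.0, "large"), (10.0, "large")]

def fit_centroids(samples):
    """Return the mean feature value of each label."""
    sums, counts = {}, {}
    for value, label in samples:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def classify(value, centroids):
    """Assign the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(value - centroids[label]))

centroids = fit_centroids(training)
print(classify(2.5, centroids))
print(classify(7.0, centroids))
```

Regression predicts a number, while a classifier like this one returns a category.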

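A linear regression line has an equation of the form Y = b*X + a, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and “a” is the intercept. The code in this article uses the equivalent form y = m*x + b, where m is the slope and b is the intercept.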
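To make the roles of the slope and the intercept concrete, here is a minimal sketch that computes them with the standard least-squares formulas: the slope is sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2), and the intercept is mean_y - slope * mean_x. The function name fit_line is hypothetical. The data are the twelve sample rows listed later in this article.

```python
# Twelve sample rows (X, Y) taken from the CSV excerpt shown in this article.
xs = [108, 19, 13, 124, 40, 57, 23, 14, 45, 10, 5, 48]
ys = [392.5, 46.2, 15.7, 422.2, 119.4, 170.9, 56.9, 77.5, 214, 65.3, 20.9, 248.1]

def fit_line(xs, ys):
    """Return (slope b, intercept a) of the least-squares line Y = b*X + a."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return b, a

b, a = fit_line(xs, ys)
print("Y =", b, "* X +", a)

# For a least-squares line with an intercept, the residuals sum to about zero.
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]
print("sum of residuals:", sum(residuals))
```

Library routines such as NumPy's polyfit solve the same least-squares problem, so for a straight line they should agree with this hand calculation up to rounding.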

How can I do it with NumPy?

As I mentioned before, linear regression attempts to model the relationship between two variables by fitting a linear equation to the observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model.

(http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm)

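import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Read the data set: C:\pybook\Data\LinearRegressionDataSet.csv
# (raw string, so the backslashes are not treated as escape characters)
data = pd.read_csv(r"C:\pybook\Data\LinearRegressionDataSet.csv")
print(data)

# Convert the X and Y columns to NumPy arrays
# (as_matrix() was removed from pandas; to_numpy() replaces it)
x = data["X"].to_numpy()
y = data["Y"].to_numpy()
print(x)
print(y)

# Fit a first-degree polynomial: m is the slope, b is the intercept
m, b = np.polyfit(x, y, 1)

# Draw the observed points and the fitted line over x = 0..149
a = np.arange(150)
plt.scatter(x, y)
plt.plot(a, m * a + b)

# Predict Y for an X value entered by the user
z = float(input("X value ? "))
prediction = m * z + b
print(prediction)
print("y =", m, "* x +", b)
plt.scatter(z, prediction, c="red", marker=">")
plt.show()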

What is NumPy?

NumPy is a scientific computing library for Python. It provides multi-dimensional arrays and ready-made numerical routines such as polyfit, so you do not have to write them from scratch.

What is pandas?

Pandas is a software library written in the Python programming language for data manipulation and analysis. In this article, it is used for reading data from a CSV file.

What is matplotlib?

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK+.

Let’s understand the code above

First of all, we have to read the data from the CSV file by using pandas. The file contains rows like these:

X,Y
108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
57,170.9
23,56.9
14,77.5
45,214
10,65.3
5,20.9
48,248.1
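As a sketch of the reading step, the snippet below loads four of the rows above from an in-memory string instead of a file on disk. The variable csv_text is an assumption made so that the example runs without the CSV file. With a real file you would pass the path to read_csv instead.

```python
import io
import pandas as pd

# Four of the rows shown above, held in a string instead of a file on disk.
csv_text = """X,Y
108,392.5
19,46.2
13,15.7
124,422.2
"""

# read_csv accepts a file-like object as well as a file path
data = pd.read_csv(io.StringIO(csv_text))

# Select each column by its header name and convert it to a NumPy array
x = data["X"].to_numpy()
y = data["Y"].to_numpy()

print(data.shape)
print(x)
print(y)
```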

Next, we select the X and Y columns with x = data["X"] and y = data["Y"] and convert them to NumPy arrays. Fitting is the next important step: np.polyfit(x, y, 1) fits a first-degree polynomial to the X and Y values and returns the slope m and the intercept b. After that, np.arange(150) produces the 150 integers from 0 to 149, which serve as the X positions for drawing the fitted line. Running the code displays a scatter plot of the data with the fitted line, and the predicted point is marked in red.
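The fitting and prediction steps can be tried without the CSV file or a plot window. The sketch below uses the twelve sample rows shown in this article. The value z = 50 is a hypothetical input chosen only for illustration.

```python
import numpy as np

# The twelve sample rows shown in this article
x = np.array([108, 19, 13, 124, 40, 57, 23, 14, 45, 10, 5, 48], dtype=float)
y = np.array([392.5, 46.2, 15.7, 422.2, 119.4, 170.9,
              56.9, 77.5, 214, 65.3, 20.9, 248.1])

# polyfit returns coefficients from the highest degree down: [slope, intercept]
m, b = np.polyfit(x, y, 1)

a = np.arange(150)   # integers 0, 1, ..., 149
line = m * a + b     # fitted Y value at each of those X positions

z = 50               # hypothetical X value to predict for
prediction = m * z + b
print("y =", m, "* x +", b)
print("prediction for x = 50:", prediction)
```

Because a starts at 0 and increases by 1, line[50] holds the same value as the prediction for z = 50.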

Let’s do it again with scikit-learn

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression as lr
import matplotlib.pyplot as plt

data = pd.read_csv(r"C:\pybook\Data\LinearRegressionDataSet.csv")

# scikit-learn expects 2-D input, so reshape each column to (n_rows, 1);
# -1 lets NumPy infer the row count (63 in the original code)
x = data["X"].to_numpy().reshape(-1, 1)
y = data["Y"].to_numpy().reshape(-1, 1)

linearregression = lr()
linearregression.fit(x, y)
linearregression.predict(x)

# y = m * x + b
# m = coef (slope)
# b = intercept
m = linearregression.coef_[0][0]
b = linearregression.intercept_[0]

a = np.arange(150)
plt.scatter(x, y)
# plt.scatter(a, m * a + b)
plt.scatter(a, m * a + b, c="red")
plt.show()

When you look at the third line, you can see that LinearRegression is imported from the sklearn library. Reading the CSV file and reshaping the X and Y values are routine steps. We have to focus on the fitting and prediction methods:

linearregression.fit(x,y)
linearregression.predict(x)

coef_ is not a function but an attribute of the fitted model. It holds the slope value. intercept_ is also an attribute, and it holds the b value.
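Here is a minimal sketch of fit, predict, coef_ and intercept_ working together on the twelve sample rows shown in this article. It differs from the code above in one assumption: y is kept one-dimensional, so coef_ is a one-element array and intercept_ is a single number. The input value 50 is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# The twelve sample rows shown in this article; X must be 2-D for scikit-learn
x = np.array([108, 19, 13, 124, 40, 57, 23, 14, 45, 10, 5, 48],
             dtype=float).reshape(-1, 1)
y = np.array([392.5, 46.2, 15.7, 422.2, 119.4, 170.9,
              56.9, 77.5, 214, 65.3, 20.9, 248.1])

model = LinearRegression()
model.fit(x, y)                 # estimates the slope and the intercept

m = model.coef_[0]              # slope: coef_ holds one entry per feature
b = model.intercept_            # intercept: a single number because y is 1-D

# predict applies y = m * x + b to new inputs (hypothetical value 50 here)
predicted = model.predict(np.array([[50.0]]))
print("y =", m, "* x +", b)
print("prediction for x = 50:", predicted[0])
```

The slope and intercept found this way should match what np.polyfit(x, y, 1) returns for the same data, since both fit an ordinary least-squares line.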

In conclusion, linear regression and classification are basic concepts for anyone who wants to become a data scientist. Here we used the equation y = m*x + b, which is a simple way to understand regression, although there are other methods for linear regression as well. Fitting CSV columns and calling a prediction method are good first steps toward developing the perspective of a data scientist.