This tutorial helps you learn data science with Python through examples. Python is an open-source, high-level language that is widely used for general-purpose programming. It has gained high popularity in the data science world. In the advanced analytics and predictive analytics market, it is ranked among the top 3 programming languages.

Opinion is divided on which version of Python to learn: some bloggers oppose Python 2 and some are in favor of it. If you filter your search and look only at recent articles (late 2016 onwards), you will see that the majority of bloggers are in favor of Python 3. The following reasons support Python 3:
1. Python 2 has an official end date. Afterward there will be no support from the community, so it does not make sense to start learning Python 2 now.
2. What's new in Python 3: it is a language for the future. It fixed major issues with versions of the Python 2 series.
3. Python 3 was first released in 2008. Robust versions of the Python 3 series have been released over the 9 years since.

Key Takeaway
You should go for Python 3. In terms of learning Python, there are no major differences between Python 2 and Python 3. It is not too difficult to move from Python 3 to Python 2 with a few adjustments. Your focus should be on learning Python as a language.

Python for Data Science : Introduction
Python is widely used and very popular for a variety of software engineering tasks such as website development, cloud architecture and back-end development. It is equally popular in the data science world. In the advanced analytics world, there have been several debates on R vs. Python. There are some areas, such as the number of libraries for statistical analysis, where R wins over Python, but Python is catching up very fast. With the popularity of big data and data science, Python has become the first programming language of data scientists. There are several reasons to learn Python; for one, some well-known sites are developed in Python.

There are two ways to install Python. The first option comes with the Python software along with preinstalled popular libraries. With the second option, you have to manually install libraries.
Recommended: Go for the first option and download Anaconda. It saves a lot of time in learning and coding Python.

Spyder
It gives an environment in which writing Python code is user-friendly. It comes with a syntax editor where you can write programs. It has a console to check each and every line of code. Under the 'Variable explorer', you can access your created data files and functions.

Following are some data structures used in Python.

List
A list is a sequence of multiple values. It allows us to store different types of data such as integer, float, string etc. A list can contain only integers, only string values, or a mix of integer, string and float values. The index starts from 0 and ends with the number of elements minus 1. A negative sign tells Python to search for the list item from right to left.

Tuple
A tuple is similar to a list in the sense that it is a sequence of elements.

A function that you define yourself is also called a user-defined function. It helps you automate repetitive tasks and call reusable code in an easier way. Giving an argument a default value of 0 does not mean that no other value than 0 can be set.

The following packages are important for data science:
pandas: For data manipulation and data wrangling. A collection of functions to understand and explore data. It is the counterpart of the dplyr and reshape2 packages in R.
numpy: It's a package for efficient array computations. It allows us to do some operations on an entire column or table in one line. It is roughly comparable to the Rcpp package in R, which eliminates the limitation of slow speed in R.
scipy: For mathematical and scientific functions such as integration, interpolation, signal processing, linear algebra, statistics, etc. It is built on NumPy.
scikit-learn: A collection of machine learning algorithms. It is built on NumPy and SciPy. It can perform all the techniques that can be done in R using the glm, knn, randomForest, rpart and e1071 packages.
matplotlib: It's a leading package for graphics in Python.
It is equivalent to the ggplot2 package in R.
statsmodels: For statistical and predictive modeling. It includes various functions to explore data and generate descriptive and predictive analytics.
pandasql: It is equivalent to the sqldf package in R.

Most of the above packages are already preinstalled in Spyder.

A pandas DataFrame is a 2-dimensional data structure that can store data of different data types such as characters, integers, floating point values and factors.

Comparison of Data Type in Python and Pandas
The following table shows how Python and the pandas package store data.

Data Type                            Pandas        Standard Python
Character variable                   object        string
Categorical variable                 category      -
Numeric variable without decimals    int64         int
Numeric variable with decimals       float64       float
Date time variable                   datetime64    -

Important Pandas Functions
The table below shows a comparison of pandas functions with R functions for various data wrangling and manipulation tasks. It will help you memorize pandas functions. It is very handy information for programmers who are new to Python. It includes solutions for most of the frequently used data exploration tasks.

Functions                R           Python (pandas package)
Installing a package     install.

Import Required Packages
You can import required packages using the import statement. For example, we can ask Python to import the numpy and pandas packages. The 'as' keyword is used to alias a package name.

Build DataFrame
We can build a dataframe by passing the data (mydata) to the DataFrame function of the pandas package. In this dataframe, we have three variables: productcode, sales and cost.

To see number of rows and columns
You can find out the number of rows and columns of the dataframe. Here the result means 6 rows and 3 columns.

To view first 3 rows
You can display only the first 3 rows of the dataframe.

To select a column by position, use its index. In this example, we are selecting the second column. Column index starts from 0. Hence, 1 refers to the second column. We can also make use of other selection methods of df.

To summarize data frame
To summarize or explore data, you can submit a single summary command on the data frame.
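The data frame steps described above (import, build, check dimensions, view the first rows, select a column, summarize) can be sketched as follows. This is a minimal illustration rather than the tutorial's original code: the numeric values are made up for the example, and only the column names (productcode, sales, cost) and the 6 x 3 shape come from the text.

```python
import pandas as pd  # 'as' gives the package a short alias

# Illustrative values only (assumption); column names follow the text.
mydata = {
    "productcode": ["AA", "AA", "AA", "BB", "BB", "BB"],
    "sales": [1010, 1025.2, 1404.2, 1251.7, 1160, 1604.8],
    "cost": [1020, 1625.2, 1204, 1003.7, 1020, 1124],
}
df = pd.DataFrame(mydata)

# Number of rows and columns.
n_rows, n_cols = df.shape
print(n_rows, n_cols)  # 6 3

# First 3 rows.
first_three = df.head(3)
print(first_three)

# Second column by position: column index starts from 0, so 1 is 'sales'.
second_column = df.iloc[:, 1]
print(second_column.name)

# Summary of the numeric columns (count, mean, std, min, quartiles, max).
summary = df.describe()
print(summary)
```

Here `df.iloc` selects by position; selecting by name, for example `df["sales"]`, returns the same column.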
To calculate summary statistics for only a particular variable, apply the command to that column of df.

Sort Data
Here we arrange the data in ascending order by sales.

Data can also be grouped. In this case, we are calculating average sales and cost by product code.

We can use the astype function to make id a categorical variable. A frequency distribution is one of the methods to explore a categorical variable.

Generate Histogram
A histogram is one of the methods to check the distribution of a continuous variable. In the histogram of the variable 'sales', there are two values in the range 1000-1100. In each of the remaining intervals, there is only a single value. In this case, there are only 5 values. If you have a large dataset, you can plot a histogram to identify outliers in a continuous variable.

We will also cover the statsmodels library for regression techniques.

Install the required libraries
Import the following libraries before reading or exploring data, including pandas (import pandas as pd) and statsmodels.

Download and import data into Python
With the use of a Python library, we can easily get data from the web into Python.

Let's explore the data. Summary statistics help to answer the question of whether the data is skewed.

Logistic Regression Model
Logistic regression is a special type of regression where the target variable is categorical in nature and the independent variables can be discrete or continuous. In this post, we will demonstrate only binary logistic regression, in which the target variable takes only binary values. Unlike linear regression, a logistic regression model returns the probability of the target variable. It assumes a binomial distribution of the dependent variable. In other words, it belongs to the binomial family. In Python, we can write an R-style model formula y ~ x1 + x2 + x3 using the patsy and statsmodels libraries. In the formula, we need to define the variable 'position' as a categorical variable by mentioning it inside capital C, that is, C().
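As a rough sketch of the formula step, patsy's `dmatrices` function can turn an R-style formula into a target frame and a design matrix. The data below is a hypothetical toy set invented for illustration; only the variable names (admit, gre, gpa, position) and the choice of 4 as the reference category come from the text.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data (assumption); the real tutorial downloads a dataset.
df = pd.DataFrame({
    "admit": [0, 1, 1, 0, 1, 0, 1, 0],
    "gre": [380, 660, 800, 640, 520, 760, 560, 400],
    "gpa": [3.61, 3.67, 4.00, 3.19, 2.93, 3.00, 2.98, 3.08],
    "position": [3, 3, 1, 4, 4, 2, 1, 2],
})

# C(...) marks 'position' as categorical; Treatment(reference=4) makes
# category 4 the baseline, so it gets no dummy column of its own.
formula = "admit ~ gre + gpa + C(position, Treatment(reference=4))"
y, X = dmatrices(formula, df, return_type="dataframe")

print(y.columns.tolist())
print(X.columns.tolist())
```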
The dataset 'y' contains the variable admit, which is the target variable. The other dataset 'X' contains the Intercept (constant value), dummy variables for Treatment, gre and gpa. Since 4 is set as the reference category, it will be 0 against all three dummy variables.

A decision tree can have a continuous or categorical target variable. When it is continuous, it is called a regression tree. When it is categorical, it is called a classification tree. It selects a variable at each step that best splits the set of values. There are several algorithms to find the best split. Some of them are Gini, Entropy, C4. There are several advantages of a decision tree. It is simple to use and easy to understand. It requires very few data preparation steps. It can handle mixed data, both categorical and continuous variables. In terms of speed, it is a very fast algorithm.

In the above case, we have not performed variable selection. We can also select the best parameters by using grid search fine tuning.

Random Forest Model
A decision tree has the limitation of overfitting, which implies it does not generalize the pattern. It is very sensitive to a small change in training data. To overcome this problem, random forest comes into the picture. It grows a large number of trees on randomised data. It selects a random number of variables to grow each tree. It is a more robust algorithm than a decision tree. It is one of the most popular machine learning algorithms. It is commonly used in data science competitions, where it is frequently ranked among the top algorithms. It has become a part of every data science toolkit.

Grid search is a strategy to select the best parameters for an algorithm. In scikit-learn, they are passed as arguments to the constructor of the estimator classes.

The machine learning package sklearn requires all categorical variables in numeric form, so they have to be converted first. In sklearn, there is already a function for this step.

Create Dummy Variables
Suppose you want to convert categorical variables into dummy variables.
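A minimal sketch of the two conversions, using pandas only: integer codes via the category dtype (a pandas stand-in for the sklearn step mentioned above, not the tutorial's own script) and dummy variables via `pd.get_dummies`. The data frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical data (assumption): one categorical and one numeric column.
df = pd.DataFrame({
    "productcode": ["AA", "BB", "CC", "AA", "BB"],
    "sales": [1010, 1251.7, 1160, 1025.2, 1604.8],
})

# Numeric form: each category is mapped to one integer code
# (categories are sorted, so AA -> 0, BB -> 1, CC -> 2).
codes = df["productcode"].astype("category").cat.codes

# Dummy variables: one 0/1 column per category instead of a single code.
dummies = pd.get_dummies(df["productcode"], prefix="productcode", dtype=int)
df_with_dummies = pd.concat([df, dummies], axis=1)

print(codes.tolist())
print(df_with_dummies)
```

Integer codes imply an ordering between categories, which dummy variables avoid; which one is appropriate depends on the algorithm.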
It is different from the previous example as it creates dummy variables instead of converting the variable to numeric form.

In many algorithms, if missing values are not filled, the complete row is removed. If the data contains a lot of missing values, this can lead to huge data loss. There are multiple ways to impute missing values. It makes sense to replace a missing value with 0 when 0 is a meaningful value, for example, whether a customer holds a credit card product.

Outlier Treatment
There are many ways to handle or treat outliers or extreme values. One of them is a log transformation, which can be implemented in Python.

Standardization
In some algorithms, it is required to standardize variables before running the actual algorithm. Standardization refers to the process of making the mean of a variable zero and its variance (standard deviation) one.

I hope you find this tutorial helpful. I tried to cover all the important topics which a beginner must know about Python. On completion of this tutorial, you can say that you know how to program in Python and that you can implement machine learning algorithms using the sklearn package.

About Author: The author founded ListenData with a simple objective: make analytics easy to understand and follow. He has over 7 years of experience in data science and predictive modeling.

I'm mostly a user of R but want to learn Python.

I am using Python 3. It will be last supported in SymPy version 1. Use direct imports from the defining module instead. TypeError: 'bool' object is not callable. How can I handle this? Thank you.