Introduction to Pandas

PANDAS:

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. The name is also a play on the phrase "Python data analysis".

Pandas is mainly used for data analysis. It can import data from various file formats, such as comma-separated values (CSV), JSON, SQL, and Microsoft Excel. It also supports data manipulation operations such as merging, reshaping, and selecting, as well as data cleaning and data wrangling.

Objectives:

After completing this lab you will be able to:

Acquire data in various ways

Obtain insights from data with the pandas library

Data Acquisition:

A dataset can come in various formats, such as .csv, .json, or .xlsx. It can also be stored in different places: on your local machine or online.

In this section, you will learn how to load a dataset into our Jupyter Notebook.

In our case, the Automobile Dataset comes from an online source, and it is in CSV (comma-separated values) format. Let's use this dataset as an example to practice reading data.

First, import the required modules as follows:

# import pandas library

import pandas as pd

import numpy as np

Now, to read the data, we use the read_csv function of the pandas library, as shown below:

# Import pandas library

import pandas as pd

# Read the file from its path or URL and assign the result to the variable "df"

other_path = "This contains the path of the file or the link of the dataset."

Note: The header=None parameter tells pandas that the dataset contains no header row.

The header is simply the row of column names, i.e. the headings of your columns.

When header=None is specified, pandas automatically labels the columns with integers (0, 1, 2, ...) in place of the missing names.

df = pd.read_csv(other_path, header = None)
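To see what header=None actually changes, here is a minimal sketch. It assumes a small in-memory CSV string (csv_text, a made-up stand-in for the real file or URL) and reads it twice, once with header=None and once with the default settings.

```python
import io

import pandas as pd

# Hypothetical in-memory CSV text standing in for the real file or URL.
csv_text = "3,?,alfa-romero,gas\n1,?,alfa-romero,gas\n2,164,audi,gas\n"

# header=None: no line is treated as a header, so every line is data
# and the columns are labelled with integers starting from 0.
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

# Default behaviour: the first line is consumed as the column names,
# so a file without a header row loses its first row of data.
df_default = pd.read_csv(io.StringIO(csv_text))

print(list(df_no_header.columns))          # integer column labels
print(len(df_no_header), len(df_default))  # row counts of the two reads
```

The second read is there only for contrast: for a file without a header row, forgetting header=None turns the first record into column names.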

After importing the dataset, it is good practice to view the data. This can be done with the head(n) method, which shows the first n rows of the dataframe.

# show the first 5 rows using dataframe.head() method

print("The first 5 rows of the dataframe") 

df.head(5)

The first 5 rows of the dataframe

	0	1	2	3	4	5	6	7	8	9
0	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6
1	3	?	alfa-romero	gas	std	two	convertible	rwd	front	88.6
2	1	?	alfa-romero	gas	std	two	hatchback	rwd	front	94.5
3	2	164	audi	gas	std	four	sedan	fwd	front	99.8
4	2	164	audi	gas	std	four	sedan	4wd	front	99.4
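As a small sketch of how head(n) behaves, the snippet below builds a tiny made-up dataframe (df_demo, not the real Automobile Dataset) and asks for different numbers of rows.

```python
import pandas as pd

# Small hypothetical frame; the real data would come from pd.read_csv.
df_demo = pd.DataFrame(
    {
        "make": ["alfa-romero", "alfa-romero", "audi", "audi"],
        "wheel_base": [88.6, 94.5, 99.8, 99.4],
    }
)

first_two = df_demo.head(2)    # only the first 2 rows
default_head = df_demo.head()  # n defaults to 5; this frame has just 4 rows

# In a plain script the result has to be printed explicitly.
print(first_two)
print(len(default_head))
```

In a Jupyter Notebook the last expression of a cell is displayed automatically, which is why a bare df.head(5) is enough there.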

Add Headers:

Take a look at our dataset: pandas automatically set the column headers to integers starting from 0.

We can replace them by making a list of headers and assigning it to the columns attribute of the dataframe as follows:

df.columns = headers, where headers is a "list" of column names with one entry per column.
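Here is a minimal sketch of adding headers. The CSV text and the column names in headers are illustrative choices made up for this example, not the official names of the Automobile Dataset.

```python
import io

import pandas as pd

# Hypothetical sample rows and column names, for illustration only.
csv_text = "3,alfa-romero,gas,88.6\n2,audi,gas,99.8\n"
headers = ["symboling", "make", "fuel_type", "wheel_base"]

# Read without a header row, then replace the integer labels.
df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = headers  # the list must have one name per column

# Alternative: supply the names while reading, via the names parameter.
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=headers)

print(list(df.columns))
```

Both routes should give the same result; assigning to df.columns is handy when the dataframe already exists, while names= saves a step when you know the headers up front.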

Save Dataset:

Correspondingly, Pandas lets us save the dataset to CSV with the
dataframe.to_csv() method. Pass the file path and name, in quotation marks,
inside the parentheses.
For example, to save the dataframe df as automobile.csv on your local machine,
you may use the syntax below (index=False leaves the row index out of the file):
df.to_csv("automobile.csv", index=False)
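The following sketch shows the effect of index=False and a simple save-and-reload round trip. It assumes a tiny made-up dataframe (df_out) and writes into a temporary directory so nothing is left on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame used only for this example.
df_out = pd.DataFrame({"make": ["alfa-romero", "audi"], "doors": [2, 4]})

# Called without a path, to_csv returns the CSV text as a string,
# which makes it easy to compare the two index settings.
with_index = df_out.to_csv().splitlines()[0]
without_index = df_out.to_csv(index=False).splitlines()[0]
print(repr(with_index))     # header line with an empty leading index column
print(repr(without_index))  # header line with the data columns only

# Save to a file in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "automobile.csv")
    df_out.to_csv(csv_path, index=False)
    df_back = pd.read_csv(csv_path)

print(list(df_back.columns))
```

Without index=False, the row index is written as an extra unnamed first column, which then shows up as an additional column when the file is read back.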

Read/Save Other Data Formats:

Data Format	Read	Save
csv	`pd.read_csv()`	`df.to_csv()`
json	`pd.read_json()`	`df.to_json()`
excel	`pd.read_excel()`	`df.to_excel()`
hdf	`pd.read_hdf()`	`df.to_hdf()`
sql	`pd.read_sql()`	`df.to_sql()`
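As a rough sketch of two of the pairs in the table, the snippet below round-trips a small made-up dataframe through JSON and through SQL. The in-memory SQLite database and the table name cars are assumptions made for this example; Excel and HDF are left out because they need extra packages installed.

```python
import io
import sqlite3

import pandas as pd

# Hypothetical frame used only for this example.
df_src = pd.DataFrame({"make": ["alfa-romero", "audi"], "doors": [2, 4]})

# JSON: serialise to a string, then parse it back.
json_text = df_src.to_json(orient="records")
df_json = pd.read_json(io.StringIO(json_text))

# SQL: write to an in-memory SQLite database, then query it.
conn = sqlite3.connect(":memory:")
df_src.to_sql("cars", conn, index=False)  # "cars" is a made-up table name
df_sql = pd.read_sql("SELECT * FROM cars", conn)
conn.close()

print(df_json)
print(df_sql)
```

The pattern is the same in every row of the table: a df.to_*() method writes the dataframe, and the matching pd.read_*() function loads it again.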

Basic Insight of Dataset:

After reading the data into a Pandas dataframe, it is time for us to explore the dataset.
There are several ways to obtain essential insights from the data that help us better understand
our dataset.

Data Types:

Data comes in a variety of types.
The main types stored in Pandas dataframes are object, float, int, bool and datetime64.
To better understand each attribute, it is always good for us to know the data type
of each column. In Pandas, the dtypes attribute reports them:
df.dtypes
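Here is a small sketch of what dtypes can reveal. It assumes a made-up CSV sample in which, as in the rows shown earlier, a "?" appears in an otherwise numeric column. The na_values parameter used in the second read is a standard read_csv option, but treating "?" as the missing-value marker is an assumption about this data.

```python
import io

import pandas as pd

# Hypothetical sample; column 1 mixes numbers with a "?" placeholder.
csv_text = "3,?,alfa-romero,88.6\n2,164,audi,99.8\n"

df_types = pd.read_csv(io.StringIO(csv_text), header=None)
print(df_types.dtypes)  # column 1 is not numeric because of the "?"

# Assumption: "?" marks a missing value. Telling read_csv so lets
# pandas parse column 1 as numbers, with NaN for the missing entry.
df_clean = pd.read_csv(io.StringIO(csv_text), header=None, na_values="?")
print(df_clean.dtypes)
```

A column that looks numeric but is not reported as int or float is often a hint that some placeholder text is hiding in it.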

Describe:

If we would like to get a statistical summary of each column, such as the count, the
mean value, and the standard deviation, we use the describe method:
df.describe()

This shows the statistical summary of all numeric-typed (int, float) columns.
However, what if we would also like to check all the columns, including those that are
of type object? In that case, we pass include = "all":
# describe all the columns in "df"
df.describe(include = "all")
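To compare the two calls side by side, here is a minimal sketch on a made-up three-row dataframe (df_stats) with one text column and one numeric column.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df_stats = pd.DataFrame(
    {
        "make": ["audi", "audi", "bmw"],
        "wheel_base": [99.8, 99.4, 101.2],
    }
)

# Default: only numeric columns are summarised.
numeric_summary = df_stats.describe()

# include="all": text columns are summarised too, with rows such as
# unique, top and freq; statistics that do not apply show up as NaN.
full_summary = df_stats.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, the text column gets its number of distinct values, its most frequent value and that value's frequency, while numeric statistics such as the mean are left empty for it.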
