Pandas DataFrame – The ultimate guide

Pandas is a popular python library used for data manipulation and analysis. This is especially useful in finance as there’s a lot of data involved. For example, pandas can be used for analyzing stock market data and backtesting trading strategies.

The basic data structure pandas uses is called a DataFrame. This has a table structure with rows and columns for recording observations under different categories. This looks similar to an excel spreadsheet. In this post, I’ll cover some basics of pandas dataframe.

If you are using Jupyter notebook to write your code, you won’t have to install pandas because it comes preinstalled with Anaconda. However, if you are using another Python IDE or text editor, you may first have to install pandas with the following command in your terminal:

— pip install pandas

You must have at least a Python version 3.5.3 or above in order to install Pandas.

I will use Jupyter notebook for this post.

Importing Pandas

Now that you have pandas installed, import the pandas library. Importing a library loads it into the memory so you can access and work with it quickly.


import pandas as pd

The second part of the import statement (‘as pd’) can be anything of your choice e.g. ‘as pnd’ or ‘as pds’. This is done so that you can access the commands as ‘pd.command’ instead of ‘pandas.command’ every-time you use it.

Creating a DataFrame

There are several ways to load the data into a pandas dataframe.

Technique 1:

Reading a local csv, excel or tsv file into a pandas dataframe

#Reading a local csv file into a dataframe
sectors = pd.read_csv('/Users/vaishrathi/Documents/modeldata1.csv') 

Here, ‘sectors’ is the name of the new dataframe created.

‘pd.read_csv’ reads the file (modeldata1.csv) given in the ‘( )’ brackets into the new dataframe. The filetype can be anything including excel in which case you’ll type ‘pd.read_xlsx’. 

The text inside the brackets (‘/Users/vaishrathi/Documents/modeldata1.csv’) is the user path of modeldata1 file on my computer. On a mac, you will be able to access this information by right clicking the file and clicking on ‘Get info’. You will then find it in the ‘Where’ section.

To view the dataframe that you just created, simply type the name of the dataframe and press Shift+Enter

sectors

This is how it looks like.

Output

The first row of your csv, tsv or excel file becomes the column names of your dataframe.

Technique 2:

 Converting a python list, dictionary or numpy array into a pandas dataframe


import pandas as pd

#creating a list of movie titles

movie_title = ['Harley Quinn: Birds of Prey', 'Wonder Woman 1984', 'The Call of the Wild']

#creating a list with average ticket prices for the above movies

ticket_prices = [17, 17, 16]

#creating 'd' dictionary with data for the dataframe

d = {'movie name': movie_title , 'ticket price (in USD)' : ticket_prices}

movie_info = pd.DataFrame(data = d)

#printing the dataframe 

movie_info

Output

Selection of data from a DataFrame  

Now going back to the first dataframe ‘sectors’, there were a few hundred rows and multiple columns. What is you just wanted to view a part of it?

Technique 1:

Selecting columns of a dataframe using column names


#printing the "communication services" and "financials" columns of the 'sectors' dataframe using column names

sectors[['Communication Services', 'Financials']]

Output

Technique 2:

Selecting rows and columns using position


#selecting column using position value

sectors.iloc[ : , 1 ]

Output

In the above code, the colon ‘:’ stands for ‘all’. The first part of iloc takes in row position and the second is column position. So here the code returns all rows and the column in position 1 i.e. ‘S&P 500′.

The position values start from 0 and hence if we used 0 in place of 1, the code would have returned the ‘Date’ column.

Now let’s see how we can select a particular row using iloc.


#selecting row using position value

sectors.iloc[ 0 , : ]

Output

Technique 3:

Selecting a slice of the dataframe ( rows 2 to 10 and columns 5 to 9)


#selecting a slice of the dataframe

sectors.iloc[2:11,5:10]

Output

As you might have noticed above, to select rows 2 to 10, I have given ‘2:11’ as the slicing value and for columns 5 to 9, I’ve given ‘5:10’.  This is because the second value of the slice isn’t inclusive whereas the first one is. Hence I have given one number higher for the second value.

Adding and Dropping a column from a DataFrame

Adding Columns

There are many ways to add a column to an exiting dataframe.

Technique 1:


#creating a new list with values from 0 to 365

new_values = range(len(sectors))

sectors['Numbers'] = new_values

#viewing the result sectors

Output

A new column called ‘Numbers’ has been added to the existing sectors dataframe at the end of the table. Now to insert the column in a location of your choice, you can use the following:

Technique 2:


#insert a new column called New_Numbers in position 1
sectors.insert(1, 'New_Numbers', new_values, False)

#printing the dataframe
sectors

Output

Adding columns permanently modifies the original dataframe and reflects the changes.

Dropping Columns

Now if you’d like to drop a few columns, pandas again allows multiple ways to achieve this.

Technique 1:


#dropping a single column (New_Numbers) using the column name

sectors.drop(['New_Numbers'], axis = 1)

Output

Technique 2:


#Dropping multiple columns (Energy and Industrials) using column names

sectors.drop(['Energy', 'Industrials'], axis = 1)

Output

Technique 3:


#dropping columns using their position value

sectors.drop(sectors.columns[1], axis = 1)

Output

The New_Numbers column in position 1 is now dropped.

Note that dropping a column doesn’t delete the column from the original dataframe. Hence, the column that was dropped in technique 1 still shows up in technique 2’s output. If you want these dropped changes to reflect in the original dataframe, you will have to reassign the dataframe like below.


#reassigning the dataframe with 'New_Numbers' and 'Numbers' columns dropped

sectors = sectors.drop(['New_Numbers','Numbers'],axis = 1)

#printing sectors dataframe

sectors

Output

Export DataFrame to csv


#exporting sectors dataframe to csv file

sectors.to_csv('/Users/vaishrathi/Desktop/sectors_dataframe')

This creates a new csv file called ‘sectors_dataframe’ to the user path I’ve given in the brackets (my desktop).

Leave a Comment