Jaekeun Lee's Space     Projects     About Me     Blog     Writings     Tags

Data Manipulation with Pandas 1

Tags:


Pandas


Pandas is a python data analysis toolkit that provides fast, flexible and expressive data structures designed to make working with “relational” or “labeled” data in easy and intuitive way. The name pandas came from the term “panel data”. Anyone interested in Data Analysis using python should better get familar with this library.


There are many posts, books and even official documentations that explains how to make use of this wonderful library. In this post, I would like to discuss practical techniques and summarize the gist of pandas.


DataFrame


Pandas has two types of data structure:

  • Series : 1-Dimensional
  • DataFrame: 2-Dimensional


For more basic tutorial about Pandas and numpy, please check my tutorial on the GitHub

Dataframe is an expression of data into a tabular form. In pandas, three elements are used to create dataframe: column (variable name), row (data), and index. Dataframe can be created from different forms of python data types. List, dictionary, pandas series, and numpy ndarray can be used to make Dataframe. Example codes are presented below:

import pandas as pd
import numpy as np

# numpy 2d array to dataframe
nd_array = np.array([[0,1,2],[3,4,5]])
n2darray_to_df = pd.DataFrame(nd_array)

# dictionary to dataframe
my_dict = {"name": ["Jake","Mike"],"age":["26","14"],"gender":["male","female"]}
dict_to_df = pd.DataFrame(my_dict)

# pandas series to dataframe
my_series = pd.Series({"Korea":"Kimchi","Japan":"Sushi","India":"Curry"})
series_to_df = pd.DataFrame(my_series)



Accessing data in DataFrame


DataFrame is very similar to spreadsheets that you can see on Excel. Just as Microsoft Excel, you can access to specific parts of data by filtering and indexing your DataFrame. We will use this data for demonstration.


Reading Data

In order to open your csv (comma seperated version) file, all you need is one line of code after the import:

#importing pandas library
import pandas as pd               

# Reading file
df = pd.read_csv("YOUR DIRECTORY/pd_tutorial.csv") 

# Displaying a few lines on the top of the file
df.head()        

# Shows the size of your data in a tuple
df.shape



Indexing, Filtering and Selection

In order to analyze data, one should be able to properly manipulate, select and retrieve their subject data. This means that proficiency in indexing, filtering and selection determines the efficiency of work. This is the main reason why I wrote and organized this post.


1. iloc, loc

These are the most basic functions used to find data from dataframe.

  • iloc : uses integer position to find values. Labels and strings are not supported
  • loc: uses label to find values.




2. Indexing

Similar to indexing with lists, Pandas dataframes can also be indexed in a similar way.

# Retrieving a specific column
df["COLUMN NAME"]  # Returns pandas.Series
df[["COLUMN NAME"]] # Returns pandas.DataFrame

# Retrieving multiple columns
df[["COLUMN 1", "COLUMN 2"]] # Returns pandas.DataFrame

# Retrieving specific rows
df[ num1 : num2 ] # Returns rows from index num1 to num2 - 1



3. Using Conditions

You can make use of conditions to filter data of your interest too. If the outcome of respective conditions produce boolean values, the filtering would turnout fine. The example below gets crime data which occurred in district 3 with "Traffic" included in the crime description. The function *.head()* is similar to SQL query *LIMIT*. It is used to retrieve the requested amount of lines of data. In the example, 10 lines are retrieved as a result.

# dataframe[ ( CONDITION1 )   OPERATOR  ( CONDITION2 ) ]

# OPERATOR
& # for and
| # for or
^ # for xor

# CONDITION Examples
# value 4 only from a column
dataframe["COLUMN NAME"] == 4 

# values larger than 4 from a column
dataframe["COLUMN NAME"] > 4 

# value that includes keyword "APPLE" from a column
dataframe["COLUMN NAME"].str.contains("APPLE") 

# value without keyword "APPLE" from a column
~dataframe["COLUMN NAME"].str.contains("APPLE") 



So, this was probably the basic manipulations that you can do with pandas. In the next post, I will cover the “fancy” tricks.



References
  • https://towardsdatascience.com/10-python-pandas-tricks-to-make-data-analysis-more-enjoyable-cb8f55af8c30
  • https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python
  • https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/