title |
---|
pandas DataFrame |
DataFrame
This section takes a detailed look at DataFrame, the other important data type in pandas. A DataFrame represents multi-dimensional data, most commonly 2-dimensional (tabular) data with rows and columns. It can also be thought of as a collection of Series.
A DataFrame can also hold 3-dimensional data by using a MultiIndex. This replaces the old Panel object, which has since been removed from pandas. Such a DataFrame can be viewed as several 2-D DataFrames stacked under an outer index level.
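To make the MultiIndex idea concrete, here is a minimal sketch. The term labels, names, and scores are made-up illustration data, not part of the examples that follow.

```python
import pandas as pd

# Hypothetical data: two terms (outer level) x two students (inner level).
index = pd.MultiIndex.from_product(
    [['Term1', 'Term2'], ['Mark', 'Tom']],
    names=['Term', 'Name'],
)
scores = pd.DataFrame(
    [[87, 78], [78, 88], [90, 81], [75, 85]],
    index=index,
    columns=['Maths', 'Science'],
)
print(scores)

# Selecting one outer label gives back an ordinary 2-D DataFrame.
term1 = scores.loc['Term1']
print(term1)
```

Each outer label (`Term1`, `Term2`) behaves like one 2-D table, which is how the third dimension is represented.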
Basic syntax of DataFrame
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
data
: ndarray, dict, Series, or another DataFrame
index
: array-like or Index. Defaults to RangeIndex (0, 1, 2, ..., n-1). Represents the row labels.
columns
: array-like or Index. Defaults to RangeIndex (0, 1, 2, ..., n-1). Represents the column labels.
dtype
: dtype, default None. Data type to force for the DataFrame. If None, the type is inferred from the data.
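The defaults described above can be checked with a short sketch. The variable names `df_default` and `df_float` are only for illustration.

```python
import pandas as pd

# No index or columns given: both default to a RangeIndex starting at 0.
df_default = pd.DataFrame([[1, 2], [3, 4]])
print(df_default.index)
print(df_default.columns)

# dtype forces a single data type on all of the (numeric) data.
df_float = pd.DataFrame([[1, 2], [3, 4]], dtype=float)
print(df_float.dtypes)
```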
Creating DataFrame in different ways:
First, import the pandas module:
import pandas as pd
Create an empty DataFrame:
df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []
Create using a list:
input_data = [['Mark',87],['Tom',78],['Monika',97]]
df = pd.DataFrame(data = input_data)
print(df)
0 1
0 Mark 87
1 Tom 78
2 Monika 97
print('DataFrame with column and row index labels:')
roll_no = [223,224,225]
df = pd.DataFrame(data= input_data,index= roll_no,
columns=['Name','Score'])
print(df)
DataFrame with column and row index labels:
Name Score
223 Mark 87
224 Tom 78
225 Monika 97
Create using a dict:
input_data = {'Name': ['Mark','Tom','Monika'],
'Score': [87,78,97]}
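df = pd.DataFrame(data=input_data)
df['Score'] = df['Score'].astype(float) # Cast only Score to float. dtype=float in the constructor applies to every column and may raise an error on the 'Name' strings in recent pandas versions.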
print(df)
Name Score
0 Mark 87.0
1 Tom 78.0
2 Monika 97.0
Create using a list of dicts:
input_data = [{'Name': 'Mark','Score': 87},
{'Name': 'Tom','Score': 78},
{'Name': 'Monika', 'Score': 97}]
df = pd.DataFrame(data= input_data, index=roll_no)
print(df)
Name Score
223 Mark 87
224 Tom 78
225 Monika 97
Create using a dict of Series:
input_data = {'Name': pd.Series(['Mark','Tom','Monika','John']),
'Score': pd.Series([87,78,97])}
df = pd.DataFrame(input_data)
print(df)
Name Score
0 Mark 87.0
1 Tom 78.0
2 Monika 97.0
3 John NaN
Notice in the output above that the score for John is NaN (not a number). pandas fills missing values with numpy.nan, and because NaN is a float, the whole Score column becomes float.
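The following sketch rebuilds the same dict-of-Series example and inspects the missing value. The name `df_nan` is only for illustration.

```python
import pandas as pd

names = pd.Series(['Mark', 'Tom', 'Monika', 'John'])
scores = pd.Series([87, 78, 97])  # one value fewer than names

# The Series are aligned on their index labels; label 3 has no score.
df_nan = pd.DataFrame({'Name': names, 'Score': scores})

print(df_nan['Score'].dtype)   # the column is stored as float
print(df_nan['Score'].isna())  # True only for the row with no score
```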
DataFrame Manipulations:
You now know how to create a DataFrame and which kinds of input you can use to create one. Next come the manipulation operations you can perform on a DataFrame.
Column Manipulation:
The column-level operations covered here are:
- Column selection
- Column addition
- Column deletion
score_sheet = {'Name': pd.Series(['Mark','Tom','Monika','Lilly','Sam']),
'Maths': pd.Series([89,87,83,78,77]),
'Science': pd.Series([78,88,66,0,88])}
DF = pd.DataFrame(score_sheet)
print(DF)
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
Column selection:
DF['Name'] # Selecting a particular column
0 Mark
1 Tom
2 Monika
3 Lilly
4 Sam
Name: Name, dtype: object
type(DF['Maths'])
pandas.core.series.Series
Notice that each column in a DataFrame is a Series and supports all Series operations. Example:
DF['Maths'].max() # Finding the max score in maths
89
math = DF[['Name','Maths']] # Selecting multiple columns
print(math)
Name Maths
0 Mark 89
1 Tom 87
2 Monika 83
3 Lilly 78
4 Sam 77
Column addition:
DF['English'] = pd.Series([88,89,98,88,0]) # Adding a new subject English.
print(DF)
Name Maths Science English
0 Mark 89 78 88
1 Tom 87 88 89
2 Monika 83 66 98
3 Lilly 78 0 88
4 Sam 77 88 0
DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
print(DF)
Name Maths Science English Total Score
0 Mark 89 78 88 255
1 Tom 87 88 89 264
2 Monika 83 66 98 247
3 Lilly 78 0 88 166
4 Sam 77 88 0 165
Column deletion:
# Using the del statement:
del DF['English']
print(DF)
Name Maths Science Total Score
0 Mark 89 78 255
1 Tom 87 88 264
2 Monika 83 66 247
3 Lilly 78 0 166
4 Sam 77 88 165
#Using the pop method:
DF.pop('Total Score')
print(DF)
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
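One detail worth seeing in isolation: pop both removes the column and returns it as a Series. The sketch below uses a small hypothetical `sheet` DataFrame and also shows `drop(columns=...)`, which returns a new DataFrame instead of modifying the original.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Mark', 'Tom'],
                      'Maths': [89, 87],
                      'Science': [78, 88]})

# pop removes the column from sheet and hands it back as a Series.
maths = sheet.pop('Maths')
print(maths)
print(sheet.columns.tolist())

# drop(columns=...) leaves sheet unchanged and returns a new DataFrame.
names_only = sheet.drop(columns=['Science'])
print(names_only.columns.tolist())
```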
Row Manipulation:
As with columns, DataFrame supports similar operations on rows. This part covers those row-level operations in detail, using the same DataFrame DF created earlier.
Row selection:
DataFrame provides two indexers for selection: .iloc and .loc. Both are used with square brackets rather than called like methods.
- .iloc selects by integer position.
- .loc selects by label value.
First, the .iloc indexer:
DF.iloc[2] # Returns the row at position 2 (the third row).
Name Monika
Maths 83
Science 66
Name: 2, dtype: object
type(DF.iloc[3])
pandas.core.series.Series
The important thing to notice here is that the result is again a Series. Rows, like columns, are returned as a Series.
print(DF[2:4]) # Slicing the rows
Name Maths Science
2 Monika 83 66
3 Lilly 78 0
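For comparison, here is a minimal sketch of label-based selection with .loc next to .iloc. It uses a hypothetical `sheet` DataFrame indexed by roll number so that labels and positions differ.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                      'Maths': [89, 87, 83]},
                     index=[223, 224, 225])

print(sheet.loc[224])      # row whose index label is 224
print(sheet.iloc[1])       # the same row, selected by position
print(sheet.loc[223:224])  # label slices include both endpoints
print(sheet.iloc[0:1])     # position slices exclude the stop position
```

Note the difference in slicing: the .loc slice returns two rows, while the .iloc slice returns one.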
Row addition:
new_student = pd.DataFrame(data = [['Ben',79,89]],
columns=['Name','Maths','Science'],
index=[5])
DF = pd.concat([DF, new_student]) # Add the new row with pd.concat; DataFrame.append was removed in pandas 2.0.
print(DF)
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
5 Ben 79 89
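A single row can also be added by assigning to a label that does not exist yet. This is a small sketch on a hypothetical `sheet` DataFrame, not the DF used above.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Mark', 'Tom'],
                      'Maths': [89, 87],
                      'Science': [78, 88]})

# Label 2 is not in the index, so this assignment adds a new row in place.
sheet.loc[2] = ['Ben', 79, 89]
print(sheet)
```

This is convenient for one row at a time. For many rows, building a second DataFrame and combining the two in one step is usually the better choice.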
Row deletion:
# Delete with the drop method, using the index label. drop returns a new DataFrame, so assign the result back:
DF = DF.drop(3)
print(DF)
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
4 Sam 77 88
5 Ben 79 89
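The sketch below, on a hypothetical `sheet` DataFrame, shows how drop treats the object it is called on: the original keeps all of its rows, and only the returned DataFrame has the row removed.

```python
import pandas as pd

sheet = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                      'Maths': [89, 87, 83]})

dropped = sheet.drop(1)        # new DataFrame without index label 1
print(len(sheet))              # the original still has all three rows
print(dropped.index.tolist())  # remaining labels in the returned copy
```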