pandas DataFrame

DataFrame

This section takes a detailed look at DataFrame, the other important pandas data type alongside Series. In pandas, a DataFrame is the object used to represent multi-dimensional data, most commonly two-dimensional or tabular data with rows and columns. It can also be thought of as a collection of Series that share the same index.

A DataFrame can also hold 3-dimensional data by using a MultiIndex. This replaces the old Panel object, which was deprecated and has since been removed from pandas. Such a DataFrame can be viewed as several 2-D DataFrames stacked under an outer index level.
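
As a rough illustration of the MultiIndex idea, the sketch below stacks scores for two terms into one DataFrame. The term labels, names, and numbers are made up for the example and do not come from the rest of this guide.

```python
import pandas as pd

# Hypothetical data: two terms x two students, stored in one 2-D table.
index = pd.MultiIndex.from_product(
    [['Term1', 'Term2'], ['Mark', 'Tom']],
    names=['Term', 'Name'])
scores = pd.DataFrame({'Maths': [89, 87, 91, 85],
                       'Science': [78, 88, 80, 90]}, index=index)

# Selecting one outer label gives back an ordinary 2-D DataFrame.
term1 = scores.loc['Term1']
print(term1)
```

Each outer label behaves like one "sheet" of the 3-dimensional data.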

Basic syntax of DataFrame

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

data : ndarray, dict, Series, or another DataFrame. The values to store.

index : array-like or Index. Row labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) when no labels are given.

columns : array-like or Index. Column labels. Defaults to RangeIndex (0, 1, 2, ..., n-1) when no labels are given.

dtype : dtype, default None. Data type to force on the DataFrame. If None, the type is inferred from the data.
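
A small sketch of how these parameters interact is shown here. The labels 'r1', 'r2', 'a', and 'b' are arbitrary names chosen for the example.

```python
import pandas as pd

# Without index/columns, pandas labels both axes 0, 1, 2, ...
default_df = pd.DataFrame(data=[[1, 2], [3, 4]])
print(list(default_df.index), list(default_df.columns))

# Explicit labels and a forced dtype (label names are made up).
labelled_df = pd.DataFrame(data=[[1, 2], [3, 4]],
                           index=['r1', 'r2'],
                           columns=['a', 'b'],
                           dtype=float)
print(labelled_df.dtypes)
```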

Creating a DataFrame in different ways:

First, import the pandas module:

import pandas as pd

Create an empty DataFrame:

df = pd.DataFrame()
print(df)
Empty DataFrame
Columns: []
Index: []

Create using a list:

input_data = [['Mark',87],['Tom',78],['Monika',97]]
df = pd.DataFrame(data = input_data)
print(df)
        0   1
0    Mark  87
1     Tom  78
2  Monika  97
print('DataFrame with column and row index labels:')
roll_no = [223,224,225]
df = pd.DataFrame(data=input_data, index=roll_no,
                  columns=['Name', 'Score'])
print(df)
DataFrame with column and row index labels:
       Name  Score
223    Mark     87
224     Tom     78
225  Monika     97

Create using a dict:

input_data = {'Name': ['Mark','Tom','Monika'],
              'Score': [87,78,97]}
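# dtype=float in the constructor would apply to every column, including Name, so cast Score only.
df = pd.DataFrame(data=input_data).astype({'Score': float})   # Notice that Score is changed to float.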
print(df)
     Name  Score
0    Mark   87.0
1     Tom   78.0
2  Monika   97.0

Create using a list of dict:

input_data = [{'Name': 'Mark','Score': 87},
              {'Name': 'Tom','Score': 78},
              {'Name': 'Monika', 'Score': 97}]
df = pd.DataFrame(data= input_data, index=roll_no)
print(df)
       Name  Score
223    Mark     87
224     Tom     78
225  Monika     97

Create using a dict of Series:

input_data = {'Name': pd.Series(['Mark','Tom','Monika','John']),
              'Score': pd.Series([87,78,97])}
df = pd.DataFrame(input_data)
print(df)
     Name  Score
0    Mark   87.0
1     Tom   78.0
2  Monika   97.0
3    John    NaN

Notice in the output above that the score for John is NaN (not a number). pandas fills missing values with numpy.nan by default, which is also why the Score column is shown as float.
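
Missing values like this can be detected and replaced. The following is a minimal sketch using the same input; filling with 0 is an arbitrary choice for illustration, not a recommendation.

```python
import pandas as pd

input_data = {'Name': pd.Series(['Mark', 'Tom', 'Monika', 'John']),
              'Score': pd.Series([87, 78, 97])}
df = pd.DataFrame(input_data)

missing = df['Score'].isna()      # True where the score is missing
filled = df['Score'].fillna(0)    # replace missing scores with 0
print(missing.tolist())           # [False, False, False, True]
print(filled.tolist())            # [87.0, 78.0, 97.0, 0.0]
```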

DataFrame Manipulations:

You now know how to create a DataFrame and which kinds of input it accepts. Next, let's look at the manipulation operations you can perform on a DataFrame.

Column Manipulation:

The column-level operations covered here are:

  • Column selection
  • Column addition
  • Column deletion

The examples below use the following DataFrame:

score_sheet = {'Name': pd.Series(['Mark','Tom','Monika','Lilly','Sam']),
               'Maths': pd.Series([89,87,83,78,77]),
               'Science': pd.Series([78,88,66,0,88])}
DF = pd.DataFrame(score_sheet)
print(DF)
     Name  Maths  Science
0    Mark     89       78
1     Tom     87       88
2  Monika     83       66
3   Lilly     78        0
4     Sam     77       88

Column selection:

DF['Name']             # Selecting a particular column
0      Mark
1       Tom
2    Monika
3     Lilly
4       Sam
Name: Name, dtype: object
type(DF['Maths'])
pandas.core.series.Series

Notice that each column in a DataFrame is a Series, so it supports all of the Series operations. For example:

DF['Maths'].max()      # Finding the max score in maths
89
math = DF[['Name','Maths']]    # Selecting multiple columns
print(math)
     Name  Maths
0    Mark     89
1     Tom     87
2  Monika     83
3   Lilly     78
4     Sam     77

Column addition:

DF['English'] = pd.Series([88,89,98,88,0])   # Adding a new subject English.
print(DF)
     Name  Maths  Science  English
0    Mark     89       78       88
1     Tom     87       88       89
2  Monika     83       66       98
3   Lilly     78        0       88
4     Sam     77       88        0
DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
print(DF)
     Name  Maths  Science  English  Total Score
0    Mark     89       78       88          255
1     Tom     87       88       89          264
2  Monika     83       66       98          247
3   Lilly     78        0       88          166
4     Sam     77       88        0          165

Column deletion:

# Using the del statement:
del DF['English']
print(DF)
     Name  Maths  Science  Total Score
0    Mark     89       78          255
1     Tom     87       88          264
2  Monika     83       66          247
3   Lilly     78        0          166
4     Sam     77       88          165
# Using the pop method:
DF.pop('Total Score')
print(DF)
     Name  Maths  Science
0    Mark     89       78
1     Tom     87       88
2  Monika     83       66
3   Lilly     78        0
4     Sam     77       88
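
Both `del` and `pop` change the DataFrame in place. As a hedged comparison, the sketch below uses a small made-up frame to show that `pop` also returns the removed column, while `drop(columns=...)` leaves the original untouched and returns a new DataFrame.

```python
import pandas as pd

scores = pd.DataFrame({'Name': ['Mark', 'Tom'],
                       'Maths': [89, 87],
                       'Science': [78, 88]})

# pop removes the column in place and returns it as a Series.
science = scores.pop('Science')

# drop returns a new DataFrame; scores itself is not modified.
names_only = scores.drop(columns=['Maths'])
print(list(scores.columns))      # ['Name', 'Maths']
print(list(names_only.columns))  # ['Name']
```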

Row Manipulation:

As with columns, a DataFrame supports similar operations on rows. This part covers those row-level operations in detail, using the same DataFrame DF created earlier.

Row selection:

A DataFrame provides two indexers for selection: .iloc and .loc. Both are used with square brackets rather than called like methods.

  • .iloc selects by integer position.
  • .loc selects by label value.

First, let's look at .iloc.

DF.iloc[2]              # Returns the row at position 2 (the third row).
Name       Monika
Maths          83
Science        66
Name: 2, dtype: object
type(DF.iloc[3])
pandas.core.series.Series

The important thing to notice here is that the result is again a Series. Rows, like columns, are returned as Series.

print(DF[2:4])           # Slicing the rows
     Name  Maths  Science
2  Monika     83       66
3   Lilly     78        0
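
Label-based selection with .loc works in a similar way. The sketch below assumes a small frame indexed by roll numbers, like the one built in the creation examples, so that labels and positions differ.

```python
import pandas as pd

students = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                         'Score': [87, 78, 97]},
                        index=[223, 224, 225])

row = students.loc[224]               # row whose label is 224
score = students.loc[225, 'Score']    # single value by row and column label
subset = students.loc[223:224]        # label slices include both endpoints
print(row['Name'], score, len(subset))
```

Unlike positional slices, a .loc slice includes its end label, so `subset` contains two rows here.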

Row addition:

new_student = pd.DataFrame(data = [['Ben',79,89]], 
                           columns=['Name','Maths','Science'], 
                           index=[5])

DF = pd.concat([DF, new_student])       # Using pd.concat to add a new row (DataFrame.append was removed in pandas 2.0).
print(DF)
     Name  Maths  Science
0    Mark     89       78
1     Tom     87       88
2  Monika     83       66
3   Lilly     78        0
4     Sam     77       88
5     Ben     79       89
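
In pandas 2.0 and later, `DataFrame.append` is no longer available, and `pd.concat` does the same job. When the new rows carry no meaningful labels, `ignore_index=True` renumbers the result. A minimal sketch with made-up frames:

```python
import pandas as pd

existing = pd.DataFrame({'Name': ['Mark', 'Tom'], 'Maths': [89, 87]})
new_rows = pd.DataFrame({'Name': ['Ben'], 'Maths': [79]})

# ignore_index=True discards the old labels and renumbers from 0.
combined = pd.concat([existing, new_rows], ignore_index=True)
print(combined)
```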

Row deletion:

# Delete a row with the drop method, using its index label.
# drop returns a new DataFrame, so assign the result back:
DF = DF.drop(3)
print(DF)
     Name  Maths  Science
0    Mark     89       78
1     Tom     87       88
2  Monika     83       66
4     Sam     77       88
5     Ben     79       89
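
Keep in mind that `drop` does not modify the DataFrame it is called on. It returns a new one, so the result has to be assigned to a name. The sketch below, on a made-up frame, shows this and also drops several labels at once.

```python
import pandas as pd

marks = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                      'Maths': [89, 87, 83]})

trimmed = marks.drop([0, 2])      # drop several rows by label
print(len(marks), len(trimmed))   # 3 1
```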

More Information:

DataFrame