---
title: pandas DataFrame
---

## DataFrame

In this section you will take a detailed look at the other important pandas data type, the `DataFrame`. In pandas, a DataFrame is the object used to represent multi-dimensional data. It is mainly used for 2-dimensional, tabular data with rows and columns, and it can be thought of as a collection of `Series` objects that share one index.

A DataFrame can also represent 3-dimensional data by using a multi-level index (`MultiIndex`). This is the replacement for the old, now-deprecated `Panel` object. Such a DataFrame can be viewed as several 2-D DataFrames stacked together.

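
As a rough illustration of the multi-index idea, the sketch below stacks two small 2-D score tables (one per term) into a single DataFrame. The data, the level names `Term` and `Name`, and the variable names are made up for this example.

```python
import pandas as pd

# Hypothetical data: one 2-D table of scores per term, stacked into one frame.
index = pd.MultiIndex.from_product(
    [['Term 1', 'Term 2'], ['Mark', 'Tom']],
    names=['Term', 'Name'],
)
scores = pd.DataFrame(
    {'Maths': [89, 87, 91, 85], 'Science': [78, 88, 80, 90]},
    index=index,
)
print(scores)

# Selecting an outer label gives back the 2-D DataFrame for that term.
term_1 = scores.loc['Term 1']
print(term_1)
```

Here the outer index level plays the role of the third dimension, and each outer label maps to an ordinary 2-D DataFrame.
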
### Basic syntax of DataFrame

```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```

`data` : ndarray, dict, Series, or another DataFrame. The values to store.

`index` : array-like or Index. The row labels. Defaults to a `RangeIndex` (0, 1, 2, ..., n-1).

`columns` : array-like or Index. The column labels. Defaults to a `RangeIndex` (0, 1, 2, ..., n-1).

`dtype` : dtype, default None. The data type to force on the DataFrame.

`copy` : bool. Whether to copy the data from the input.

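
The following is a minimal sketch of how these parameters interact, assuming a small NumPy array as input. The names `values`, `df_default`, and `df_params` are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([[87, 90], [78, 85]])

# No index or columns given: both default to a RangeIndex starting at 0.
df_default = pd.DataFrame(data=values)
print(list(df_default.index), list(df_default.columns))

# Explicit row labels, column labels, and a forced float dtype.
df_params = pd.DataFrame(
    data=values,
    index=['Mark', 'Tom'],
    columns=['Maths', 'Science'],
    dtype=float,
)
print(df_params)
```
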
### Creating DataFrame in different ways:

As a first step, import the pandas module:

```python
import pandas as pd
```

### Create an empty DataFrame:

```python
df = pd.DataFrame()
print(df)
```

    Empty DataFrame
    Columns: []
    Index: []

### Create using a list:

```python
input_data = [['Mark', 87], ['Tom', 78], ['Monika', 97]]
df = pd.DataFrame(data=input_data)
print(df)
```

            0   1
    0    Mark  87
    1     Tom  78
    2  Monika  97

```python
print('DataFrame with column and row index labels:')
roll_no = [223, 224, 225]
df = pd.DataFrame(data=input_data, index=roll_no,
                  columns=['Name', 'Score'])
print(df)
```

    DataFrame with column and row index labels:
           Name  Score
    223    Mark     87
    224     Tom     78
    225  Monika     97

### Create using a dict:

```python
input_data = {'Name': ['Mark', 'Tom', 'Monika'],
              'Score': [87, 78, 97]}
# Convert only the Score column to float. Passing dtype=float to the
# constructor applies to every column, and recent pandas versions may
# reject it because Name holds strings.
df = pd.DataFrame(data=input_data).astype({'Score': float})
print(df)  # Notice that the score is changed to float.
```

         Name  Score
    0    Mark   87.0
    1     Tom   78.0
    2  Monika   97.0

### Create using a list of dict:

```python
input_data = [{'Name': 'Mark', 'Score': 87},
              {'Name': 'Tom', 'Score': 78},
              {'Name': 'Monika', 'Score': 97}]
df = pd.DataFrame(data=input_data, index=roll_no)  # roll_no was defined above
print(df)
```

           Name  Score
    223    Mark     87
    224     Tom     78
    225  Monika     97

### Create using a dict of Series:

```python
input_data = {'Name': pd.Series(['Mark', 'Tom', 'Monika', 'John']),
              'Score': pd.Series([87, 78, 97])}
df = pd.DataFrame(input_data)
print(df)
```

         Name  Score
    0    Mark   87.0
    1     Tom   78.0
    2  Monika   97.0
    3    John    NaN

Notice in the output above that the score for John is NaN (not a number). The `Score` Series has no entry for index 3, and pandas fills such missing values with `numpy.nan`. This is also why the column is shown as float rather than integer.

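
A short sketch of how such missing values might be detected and replaced, using `isna()` and `fillna()`. The replacement value 0 and the variable names are arbitrary choices for this illustration.

```python
import pandas as pd

df_scores = pd.DataFrame({
    'Name': pd.Series(['Mark', 'Tom', 'Monika', 'John']),
    'Score': pd.Series([87, 78, 97]),
})

# Boolean mask marking the rows whose Score is missing.
missing = df_scores['Score'].isna()
print(df_scores[missing])

# Return a new DataFrame with missing scores replaced by 0 (an arbitrary choice).
filled = df_scores.fillna({'Score': 0})
print(filled)
```
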
### DataFrame Manipulations:

You now have a good idea of how to create a DataFrame and of the different kinds of input you can use to create it. Next, look at the manipulation operations you can perform on a DataFrame.

### Column Manipulation:

The following column-level operations are discussed here:

* Column selection
* Column addition
* Column deletion

```python
score_sheet = {'Name': pd.Series(['Mark', 'Tom', 'Monika', 'Lilly', 'Sam']),
               'Maths': pd.Series([89, 87, 83, 78, 77]),
               'Science': pd.Series([78, 88, 66, 0, 88])}
DF = pd.DataFrame(score_sheet)
print(DF)
```

         Name  Maths  Science
    0    Mark     89       78
    1     Tom     87       88
    2  Monika     83       66
    3   Lilly     78        0
    4     Sam     77       88

### Column selection:

```python
DF['Name']  # Selecting a particular column
```

    0      Mark
    1       Tom
    2    Monika
    3     Lilly
    4       Sam
    Name: Name, dtype: object

```python
type(DF['Maths'])
```

    pandas.core.series.Series

Notice that each column of a DataFrame is a Series, so it supports all the Series operations. For example:

```python
DF['Maths'].max()  # Finding the max score in maths
```

    89

```python
math = DF[['Name', 'Maths']]  # Selecting multiple columns returns a DataFrame
print(math)
```

         Name  Maths
    0    Mark     89
    1     Tom     87
    2  Monika     83
    3   Lilly     78
    4     Sam     77

### Column addition:

```python
DF['English'] = pd.Series([88, 89, 98, 88, 0])  # Adding a new subject, English.
print(DF)
```

         Name  Maths  Science  English
    0    Mark     89       78       88
    1     Tom     87       88       89
    2  Monika     83       66       98
    3   Lilly     78        0       88
    4     Sam     77       88        0

```python
# A new column can also be computed from existing columns.
DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
print(DF)
```

         Name  Maths  Science  English  Total Score
    0    Mark     89       78       88          255
    1     Tom     87       88       89          264
    2  Monika     83       66       98          247
    3   Lilly     78        0       88          166
    4     Sam     77       88        0          165

### Column deletion:

```python
# Using the del statement:
del DF['English']
print(DF)
```

         Name  Maths  Science  Total Score
    0    Mark     89       78          255
    1     Tom     87       88          264
    2  Monika     83       66          247
    3   Lilly     78        0          166
    4     Sam     77       88          165

```python
# Using the pop method, which removes the column in place:
DF.pop('Total Score')
print(DF)
```

         Name  Maths  Science
    0    Mark     89       78
    1     Tom     87       88
    2  Monika     83       66
    3   Lilly     78        0
    4     Sam     77       88

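
One difference between the two approaches is that `pop` hands back the removed column as a Series, while `del` returns nothing. A small sketch, using a made-up `sheet` DataFrame rather than the `DF` above:

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
sheet = pd.DataFrame({
    'Name': ['Mark', 'Tom'],
    'Maths': [89, 87],
    'Total': [255, 264],
})

# pop removes the column in place and returns it as a Series.
total = sheet.pop('Total')
print(type(total))
print(list(sheet.columns))
```

This can be handy when a column should be moved out of the DataFrame and kept for later use.
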
### Row Manipulation:

As with columns, a `DataFrame` supports similar operations on rows. This part covers those row-level operations in detail, using the same `DataFrame` `DF` created earlier.

### Row selection:

A DataFrame provides two indexers for selecting rows: `.iloc[]` and `.loc[]`. Both are used with square brackets rather than called like methods.

* `.iloc[]` selects by integer position.
* `.loc[]` selects by index label.

First, look at `.iloc[]`.

```python
DF.iloc[2]  # Returns the row at position 2, that is, the third row.
```

    Name       Monika
    Maths          83
    Science        66
    Name: 2, dtype: object

```python
type(DF.iloc[3])
```

    pandas.core.series.Series

The important thing to notice here is that the result is again a Series. Not only columns but also rows are returned as Series.

```python
print(DF[2:4])  # Slicing the rows: positions 2 and 3 (the end is excluded)
```

         Name  Maths  Science
    2  Monika     83       66
    3   Lilly     78        0

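
The difference between position-based and label-based selection is easier to see when the index labels are not 0, 1, 2, and so on. The sketch below uses a made-up `students` DataFrame indexed by roll number.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
students = pd.DataFrame(
    {'Name': ['Mark', 'Tom', 'Monika'], 'Maths': [89, 87, 83]},
    index=[223, 224, 225],
)

by_label = students.loc[224]     # row whose index label is 224
by_position = students.iloc[1]   # row at position 1 (the second row)
print(by_label.equals(by_position))

# Label slices include both end points; position slices exclude the end.
label_slice = students.loc[223:224]
position_slice = students.iloc[0:1]
print(len(label_slice), len(position_slice))
```
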
### Row addition:

```python
new_student = pd.DataFrame(data=[['Ben', 79, 89]],
                           columns=['Name', 'Maths', 'Science'],
                           index=[5])

# DataFrame.append was removed in pandas 2.0; pd.concat adds the new row instead.
DF = pd.concat([DF, new_student])
print(DF)
```

         Name  Maths  Science
    0    Mark     89       78
    1     Tom     87       88
    2  Monika     83       66
    3   Lilly     78        0
    4     Sam     77       88
    5     Ben     79       89

### Row deletion:

```python
# Rows are deleted with the drop method, using the index label.
# drop returns a new DataFrame, so assign the result back to DF.
DF = DF.drop(3)
print(DF)
```

         Name  Maths  Science
    0    Mark     89       78
    1     Tom     87       88
    2  Monika     83       66
    4     Sam     77       88
    5     Ben     79       89

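
Because `drop` returns a new object and leaves the original untouched unless the result is assigned, it is worth checking which object is being printed. A minimal sketch with a made-up `sheet` DataFrame, also showing that `drop` can remove columns through its `columns` argument:

```python
import pandas as pd

# Hypothetical frame used only for this illustration.
sheet = pd.DataFrame({'Name': ['Mark', 'Tom', 'Lilly'], 'Maths': [89, 87, 78]})

# drop returns a new DataFrame; sheet itself still has all three rows.
without_row = sheet.drop(2)
print(len(sheet), len(without_row))

# The same method removes columns when given the columns argument.
without_col = sheet.drop(columns=['Maths'])
print(list(without_col.columns))
```
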
#### More Information:

[DataFrame](http://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html)