---
title: pandas DataFrame
---
## DataFrame
This section takes a detailed look at `DataFrame`, the other important data type in pandas. A DataFrame is mainly used to represent 2-dimensional, tabular data with rows and columns. It can also be thought of as a collection of `Series` that share the same index.
A DataFrame can also represent 3-dimensional data by using a multi-level index (`MultiIndex`). This replaces the old `Panel` object, which was deprecated and has since been removed from pandas. A 3-dimensional data set can then be stored as several 2-D tables stacked under an outer index level.
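
As a rough sketch of the multi-index idea described above, the following stacks two small 2-D score tables into one DataFrame whose rows are keyed by `(term, student)`. The level names `term` and `student` and all score values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: scores for two students across two terms.
index = pd.MultiIndex.from_product(
    [['Term1', 'Term2'], ['Mark', 'Tom']],
    names=['term', 'student'])
scores = pd.DataFrame(
    {'Maths': [89, 87, 91, 85], 'Science': [78, 88, 80, 90]},
    index=index)

# Selecting one label of the outer level gives back an ordinary 2-D DataFrame.
term1 = scores.loc['Term1']
print(term1)
```

Each outer label (`Term1`, `Term2`) acts like one 2-D "page" of the 3-dimensional data.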
### Basic syntax of DataFrame
```python
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
```
* `data` : ndarray, dict, list, Series, or another DataFrame. The values to store.
* `index` : array-like or Index. The row labels. Defaults to a `RangeIndex` (0, 1, 2, ..., n-1).
* `columns` : array-like or Index. The column labels. Defaults to a `RangeIndex` (0, 1, 2, ..., n-1).
* `dtype` : dtype, default None. The data type to force on the DataFrame. If None, it is inferred from the data.
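
To make the parameters above concrete, here is a minimal sketch that builds a DataFrame from a NumPy array, once with the defaults and once with explicit labels and a dtype. The labels `r1`, `r2`, `a`, `b`, `c` are arbitrary placeholders.

```python
import numpy as np
import pandas as pd

# A 2x3 ndarray as the data source.
values = np.array([[1, 2, 3], [4, 5, 6]])

# No index or columns given: both default to a RangeIndex starting at 0.
default_df = pd.DataFrame(data=values)

# Explicit row labels, column labels and dtype.
labelled_df = pd.DataFrame(data=values,
                           index=['r1', 'r2'],
                           columns=['a', 'b', 'c'],
                           dtype=float)
print(default_df.index)
print(labelled_df.dtypes)
```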
### Creating DataFrame in different ways:
First, import the pandas module:
```python
import pandas as pd
```
### Create an empty DataFrame:
```python
df = pd.DataFrame()
print(df)
```
Empty DataFrame
Columns: []
Index: []
### Create using a list:
```python
input_data = [['Mark',87],['Tom',78],['Monika',97]]
df = pd.DataFrame(data = input_data)
print(df)
```
0 1
0 Mark 87
1 Tom 78
2 Monika 97
```python
print('DataFrame with column and row index labels:')
roll_no = [223,224,225]
df = pd.DataFrame(data= input_data,index= roll_no,
columns=['Name','Score'])
print(df)
```
DataFrame with column and row index labels:
Name Score
223 Mark 87
224 Tom 78
225 Monika 97
### Create using a dict:
```python
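input_data = {'Name': ['Mark','Tom','Monika'],
              'Score': [87,78,97]}
# Cast only the Score column to float. Passing dtype=float to the
# constructor would also be applied to the non-numeric Name column.
df = pd.DataFrame(data=input_data).astype({'Score': float})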
print(df)
```
Name Score
0 Mark 87.0
1 Tom 78.0
2 Monika 97.0
### Create using a list of dict:
```python
input_data = [{'Name': 'Mark','Score': 87},
{'Name': 'Tom','Score': 78},
{'Name': 'Monika', 'Score': 97}]
df = pd.DataFrame(data= input_data, index=roll_no)
print(df)
```
Name Score
223 Mark 87
224 Tom 78
225 Monika 97
### Create using a dict of Series:
```python
input_data = {'Name': pd.Series(['Mark','Tom','Monika','John']),
'Score': pd.Series([87,78,97])}
df = pd.DataFrame(input_data)
print(df)
```
Name Score
0 Mark 87.0
1 Tom 78.0
2 Monika 97.0
3 John NaN
Notice in the output above that John's score is NaN (not a number). pandas fills missing values with `numpy.nan`, which is also why the `Score` column is displayed as float.
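
One possible way to detect and replace such missing entries is sketched below with `isna` and `fillna`. The fill value `0` is an arbitrary choice for illustration, not a recommendation.

```python
import pandas as pd

input_data = {'Name': pd.Series(['Mark', 'Tom', 'Monika', 'John']),
              'Score': pd.Series([87, 78, 97])}
df = pd.DataFrame(input_data)

# True where a score is missing.
missing = df['Score'].isna()
print(missing.tolist())

# Replace missing scores with a placeholder value (0 is an arbitrary choice).
filled = df.fillna({'Score': 0})
print(filled)
```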
### DataFrame Manipulations:
Now that you have seen how to create a DataFrame from different kinds of input, the next step is to look at the manipulation operations a DataFrame supports.
### Column Manipulation:
The following column-level operations are covered here:
* Column selection
* Column addition
* Column deletion
```python
score_sheet = {'Name': pd.Series(['Mark','Tom','Monika','Lilly','Sam']),
'Maths': pd.Series([89,87,83,78,77]),
'Science': pd.Series([78,88,66,0,88])}
DF = pd.DataFrame(score_sheet)
print(DF)
```
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
### Column selection:
```python
DF['Name'] # Selecting a particular column
```
0 Mark
1 Tom
2 Monika
3 Lilly
4 Sam
Name: Name, dtype: object
```python
type(DF['Maths'])
```
pandas.core.series.Series
Notice that each column of a DataFrame is a `Series` and supports all `Series` operations. For example:
```python
DF['Maths'].max() # Finding the max score in maths
```
89
```python
math = DF[['Name','Maths']] # Selecting multiple columns
print(math)
```
Name Maths
0 Mark 89
1 Tom 87
2 Monika 83
3 Lilly 78
4 Sam 77
### Column addition:
```python
DF['English'] = pd.Series([88,89,98,88,0]) # Adding a new subject English.
print(DF)
```
Name Maths Science English
0 Mark 89 78 88
1 Tom 87 88 89
2 Monika 83 66 98
3 Lilly 78 0 88
4 Sam 77 88 0
```python
DF['Total Score'] = DF['Maths'] + DF['Science'] + DF['English']
print(DF)
```
Name Maths Science English Total Score
0 Mark 89 78 88 255
1 Tom 87 88 89 264
2 Monika 83 66 98 247
3 Lilly 78 0 88 166
4 Sam 77 88 0 165
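
Assigning to `DF[...]` changes the DataFrame in place. As an alternative sketch, `DataFrame.assign` returns a new DataFrame with the extra column and leaves the original untouched. The column name `Total` and the shortened data set are chosen only for this example.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                   'Maths': [89, 87, 83],
                   'Science': [78, 88, 66]})

# assign returns a new DataFrame; DF itself keeps its original columns.
with_total = DF.assign(Total=DF['Maths'] + DF['Science'])
print(with_total)
```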
### Column deletion:
```python
# Using the del statement:
del DF['English']
print(DF)
```
Name Maths Science Total Score
0 Mark 89 78 255
1 Tom 87 88 264
2 Monika 83 66 247
3 Lilly 78 0 166
4 Sam 77 88 165
```python
#Using the pop method:
DF.pop('Total Score')
print(DF)
```
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
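
The two deletion styles above both modify the DataFrame in place. The sketch below, on a shortened data set, illustrates that `pop` also hands back the removed column, and that `drop(columns=...)` is a non-destructive alternative.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom'],
                   'Maths': [89, 87],
                   'Science': [78, 88]})

# pop removes the column in place and returns it as a Series.
science = DF.pop('Science')
print(science.tolist())

# drop(columns=...) leaves DF untouched and returns a new DataFrame.
names_only = DF.drop(columns=['Maths'])
print(list(names_only.columns))
print(list(DF.columns))
```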
### Row Manipulation:
Like columns, the rows of a `DataFrame` support similar operations. This section covers those row-level operations in detail, using the same `DataFrame` `DF` created earlier.
### Row selection:
A DataFrame provides two indexers for selection, `.iloc[]` and `.loc[]`:
* `.iloc[]` selects by integer position.
* `.loc[]` selects by label.
First, the `.iloc[]` indexer:
```python
DF.iloc[2] # Returns the row at position 2 (the third row).
```
Name Monika
Maths 83
Science 66
Name: 2, dtype: object
```python
type(DF.iloc[3])
```
pandas.core.series.Series
The important thing to notice here is that a row is also returned as a `Series`, just like a column.
```python
print(DF[2:4]) # Slicing the rows
```
Name Maths Science
2 Monika 83 66
3 Lilly 78 0
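
The label-based counterpart is not shown above, so here is a minimal sketch of `.loc` next to `.iloc`. It assumes a small DataFrame indexed by roll numbers (223 to 225) so that labels and positions differ.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika'],
                   'Maths': [89, 87, 83]},
                  index=[223, 224, 225])  # roll numbers as row labels

# .loc looks rows up by label; 224 is a label here, not a position.
by_label = DF.loc[224]

# .iloc looks rows up by position; position 1 is the same row.
by_position = DF.iloc[1]

# Label slices with .loc include the end label.
label_slice = DF.loc[223:224]
print(label_slice)
```

Unlike positional slices, a `.loc` slice includes its end label, so `DF.loc[223:224]` returns two rows.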
### Row addition:
```python
new_student = pd.DataFrame(data=[['Ben',79,89]],
                           columns=['Name','Maths','Science'],
                           index=[5])
# pd.concat adds the new row. DataFrame.append was removed in pandas 2.0.
DF = pd.concat([DF, new_student])
print(DF)
```
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
3 Lilly 78 0
4 Sam 77 88
5 Ben 79 89
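
When the new rows do not carry meaningful index labels of their own, `pd.concat` can renumber the result with `ignore_index=True`. The following is a small sketch on made-up data (the student `Ann` is hypothetical).

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom'], 'Maths': [89, 87]})
new_rows = pd.DataFrame({'Name': ['Ben', 'Ann'], 'Maths': [79, 92]})

# Without ignore_index the original labels are kept, so 0 and 1 repeat.
kept = pd.concat([DF, new_rows])

# With ignore_index=True the rows are relabelled 0..n-1.
renumbered = pd.concat([DF, new_rows], ignore_index=True)
print(renumbered)
```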
### Row deletion:
```python
# drop returns a new DataFrame without the row that has the given index label,
# so assign the result back to DF:
DF = DF.drop(3)
print(DF)
```
Name Maths Science
0 Mark 89 78
1 Tom 87 88
2 Monika 83 66
4 Sam 77 88
5 Ben 79 89
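
Because `drop` returns a new DataFrame instead of changing the original, its result has to be kept. The sketch below, on a shortened data set, removes several rows at once and then renumbers the remaining ones with `reset_index`.

```python
import pandas as pd

DF = pd.DataFrame({'Name': ['Mark', 'Tom', 'Monika', 'Lilly'],
                   'Maths': [89, 87, 83, 78]})

# A list of index labels removes several rows; DF itself is unchanged.
trimmed = DF.drop([1, 3])

# reset_index(drop=True) renumbers the remaining rows from 0.
trimmed = trimmed.reset_index(drop=True)
print(trimmed)
```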
#### More Information:
[DataFrame](http://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.html)