Data cleaning (#33923)

* Data cleaning

Data cleaning

* Updated with back ticks
pull/32931/head^2
Ganesh Pavan K 2018-12-18 04:26:36 +00:00 committed by Christopher McCormack
parent 617502cc67
commit 30b7481f1c
1 changed files with 5 additions and 0 deletions

@ -214,6 +214,11 @@ df['col1'].apply(len)
```python
del df['col1']
```
## Data Cleaning
Data cleaning is an essential step in data analysis. A common first check is for missing values: `df.isnull()` (or, equivalently, `pd.isnull(df)`) returns a boolean object of the same shape as the data, with `True` for missing values and `False` for non-missing values. To count the missing values in each column, run `df.isnull().sum()`. `df.notnull()` is the opposite of `df.isnull()`. Once you know where the missing values are, you can get rid of them with `df.dropna()`, which drops every row that contains a missing value, or `df.dropna(axis=1)`, which drops every column that contains one. A different approach is to fill the missing values: `df.fillna(x)` replaces them with `x` (any value you choose), and `s.fillna(s.mean())` replaces all null values in a Series with its mean (the mean can be swapped for almost any function from the statistics section).
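
The calls above can be tried on a small frame. This is a minimal sketch, not a recommended workflow: the column names `a`, `b`, `c` and the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
    "c": [7.0, 8.0, 9.0],
})

# Boolean DataFrame: True where a value is missing.
missing_mask = df.isnull()

# Number of missing values per column.
missing_per_column = df.isnull().sum()

# Drop every row (default) or every column (axis=1) that has a missing value.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# Fill missing values with a constant, or with the mean of a Series.
filled_zero = df.fillna(0)
s = df["a"]
filled_mean = s.fillna(s.mean())
```

None of these calls modify `df` in place; each returns a new object, so assign the result if you want to keep it.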
It is sometimes necessary to replace values with different values. For example, `s.replace(1,'one')` replaces all values equal to 1 with 'one'. It's possible to do this for multiple values at once: `s.replace([1,3],['one','three'])` replaces every 1 with 'one' and every 3 with 'three'. You can also rename specific columns by running `df.rename(columns={'old_name': 'new_name'})`, or use `df.set_index('column_one')` to change the index of the data frame. By default, these methods return a new object rather than modifying the original.
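
As a rough illustration of the replace, rename, and re-index calls described above, here is a small sketch. The data is invented, and the column names simply reuse the placeholders from the text.

```python
import pandas as pd

# Hypothetical Series used only to demonstrate replace().
s = pd.Series([1, 2, 3])

# Replace a single value, then several values at once.
one_replaced = s.replace(1, "one")
both_replaced = s.replace([1, 3], ["one", "three"])

# Hypothetical DataFrame using the placeholder column names from the text.
df = pd.DataFrame({"old_name": [10, 20], "column_one": ["x", "y"]})

# rename() and set_index() return new DataFrames; df itself is unchanged.
renamed = df.rename(columns={"old_name": "new_name"})
indexed = df.set_index("column_one")
```

Because the results are new objects, `df` still has its original column names and default index after these calls.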
## Checking for missing values
```python
df.isnull()
```