Data cleaning (#33923)

* Data cleaning
* Updated with back ticks
2018-12-18 04:26:36 +00:00 · 2018-12-18 04:26:36 +00:00 · 30b7481f1c
parent 617502cc67
commit 30b7481f1c
1 changed files with 5 additions and 0 deletions
--- a/guide/english/data-science-tools/pandas/index.md
+++ b/guide/english/data-science-tools/pandas/index.md
@@ -214,6 +214,11 @@ df['col1'].apply(len)
 ```python
 del df['col1']
 ```
+## Data Cleaning
+Data cleaning is an important step in data analysis. A common first check is for missing values: `df.isnull()` (equivalently `pd.isnull(df)`) returns a boolean object of the same shape as `df`, with `True` for missing values and `False` for non-missing values. To count the missing values in each column, run `df.isnull().sum()`. `df.notnull()` is the opposite of `df.isnull()`. Once you know where the missing values are, you can drop them with `df.dropna()`, which drops the rows that contain missing values, or `df.dropna(axis=1)`, which drops the columns that contain them. A different approach is to fill the missing values: `df.fillna(x)` replaces every missing value with `x` (any value you choose), and `s.fillna(s.mean())` replaces the missing values in a Series `s` with its mean (`mean` can be replaced with almost any function from the statistics section).
+
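As a rough illustration of the missing-value checks described above, the sketch below builds a small DataFrame and applies `isnull`, `dropna`, and `fillna` to it. The column names and values are made up for the example and are not part of the guide.

```python
import numpy as np
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Oslo", "Lima", None, "Pune"],
})

mask = df.isnull()                      # True where a value is missing
missing_per_column = mask.sum()         # number of missing values in each column
total_missing = int(mask.sum().sum())   # number of missing values overall

rows_dropped = df.dropna()              # keep only rows without missing values
cols_dropped = df.dropna(axis=1)        # keep only columns without missing values

s = df["age"]
age_filled = s.fillna(s.mean())         # replace missing ages with the mean age

print(missing_per_column)
print(age_filled.tolist())
```

Note that `dropna` and `fillna` return new objects here; `df` itself still contains the missing values unless the result is assigned back.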
+It is sometimes necessary to replace values with different values. For example, `s.replace(1,'one')` replaces every value equal to 1 with 'one'. Multiple values can be replaced at once: `s.replace([1,3],['one','three'])` replaces every 1 with 'one' and every 3 with 'three'. You can also rename specific columns with `df.rename(columns={'old_name': 'new_name'})`, or change the index of the data frame with `df.set_index('column_one')`. By default, each of these returns a new object rather than modifying the original, so assign the result if you want to keep it.
+
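The replace, rename, and re-index calls above can be tried on toy data, as in the following sketch. The Series values and the DataFrame contents are invented for the example; only the column names `old_name`, `new_name`, and `column_one` come from the text.

```python
import pandas as pd

# Hypothetical Series for illustration only.
s = pd.Series([1, 2, 3, 1])
one_replaced = s.replace(1, "one")                     # every 1 becomes 'one'
many_replaced = s.replace([1, 3], ["one", "three"])    # 1 -> 'one', 3 -> 'three'

# Hypothetical DataFrame using the column names from the text.
df = pd.DataFrame({"old_name": [10, 20], "column_one": ["a", "b"]})
renamed = df.rename(columns={"old_name": "new_name"})  # returns a renamed copy
indexed = renamed.set_index("column_one")              # 'column_one' becomes the index

print(many_replaced.tolist())
print(indexed)
```

Because each call returns a new object, `df` keeps its original column names after `rename`; the renamed version lives in `renamed`.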

 ## Checking for missing values
 ```df.isnull()```