diff --git a/guide/english/data-science-tools/pandas/index.md b/guide/english/data-science-tools/pandas/index.md
index fb19c3aebe2..577d2f067b9 100644
--- a/guide/english/data-science-tools/pandas/index.md
+++ b/guide/english/data-science-tools/pandas/index.md
@@ -214,6 +214,11 @@ df['col1'].apply(len)
 ```python
 del df['col1']
 ```
+## Data Cleaning
+Data cleaning is an essential step in data analysis. A common first check is for missing values: `df.isnull()` (or the equivalent `pd.isnull(df)`) returns a boolean object of the same shape, with `True` for missing values and `False` for non-missing values. To count the missing values in each column, run `df.isnull().sum()`. `df.notnull()` is the opposite of `df.isnull()`. Once you know where the missing values are, you can get rid of them with `df.dropna()` to drop the affected rows or `df.dropna(axis=1)` to drop the affected columns. A different approach is to fill them in: `df.fillna(x)` replaces every missing value with `x` (any value you choose), and `s.fillna(s.mean())` replaces the missing values in a Series `s` with its mean (the mean can be swapped for almost any function from the statistics section).
+
+It is sometimes necessary to replace values with different values. For example, `s.replace(1,'one')` replaces all values equal to 1 with 'one'. This also works for multiple values: `s.replace([1,3],['one','three'])` replaces every 1 with 'one' and every 3 with 'three'. You can also rename specific columns with `df.rename(columns={'old_name': 'new_name'})`, or use `df.set_index('column_one')` to change the index of the data frame. By default these methods return a new object rather than modifying the original.
+
 ## Checking for missing values
 ```df.isnull()```
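
The cleaning steps described in the added Data Cleaning section can be tried end to end with a small sketch like the one below. The sample DataFrame and its column names (`a`, `b`, `c`) are made up for illustration and are not part of the guide; only the pandas calls named in the section are used.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [4.0, 5.0, np.nan],
    "c": [7.0, 8.0, 9.0],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Drop rows (the default) or columns (axis=1) that contain a missing value.
rows_dropped = df.dropna()
cols_dropped = df.dropna(axis=1)

# Fill the missing values of one column with that column's mean.
s = df["a"]
filled = s.fillna(s.mean())

# Replace several values at once, rename a column, and change the index.
replaced = pd.Series([1, 2, 3]).replace([1, 3], ["one", "three"])
renamed = df.rename(columns={"a": "new_name"})
indexed = df.set_index("c")
```

Each call returns a new object, so `df` itself is left unchanged; assign the result back (for example `df = df.dropna()`) to keep a cleaning step.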