title
pandas

pandas

pandas is a Python library for data analysis using data frames. A data frame is a table of data, conceptually similar to a spreadsheet. Data scientists familiar with R will feel at home here. pandas is often used along with NumPy, Matplotlib's pyplot, and scikit-learn.

Importing pandas

It is a widely used convention to import the pandas library using the alias pd:

import pandas as pd

Data frames

A data frame consists of a number of rows and columns. Each column represents a feature of the data set, and so has a name and a data type. Each row represents a data point through its associated feature values. The pandas library lets you manipulate the data in a data frame in many ways; the following only scratches the surface to give you a feel for the library.
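To make the row and column picture concrete, here is a minimal sketch that builds a small data frame from a dictionary. The names and values are made up for illustration and do not come from any real data set.

```python
import pandas as pd

# Hypothetical example data: three data points (rows), three features (columns).
people = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "age": [36, 45, 28],
    "height_m": [1.65, 1.70, 1.80],
})

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # the data type of each column
```

Each dictionary key becomes a column name, and pandas infers a data type for each column from its values.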

Series

The Series is the basic data type in pandas: a one-dimensional array of values, built on top of the NumPy array object. Unlike a plain NumPy array, a Series carries axis labels (its index), so a value can be looked up by label rather than only by its numeric position. A Series can hold any valid Python object, and it can be created from a list, a dictionary, and so on.
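As a small sketch of label-based indexing, the following creates a Series with string labels and reads values back by label and by position. The labels and scores are hypothetical.

```python
import pandas as pd

# Hypothetical scores, labeled by person instead of by position.
scores = pd.Series([90, 85, 72], index=["alice", "bob", "carol"])

by_label = scores["bob"]       # look up a value by its label
by_position = scores.iloc[0]   # look up a value by its integer position

# A Series can also be built from a dictionary; the keys become the labels.
from_dict = pd.Series({"a": 1, "b": 2})
print(by_label, by_position)
print(from_dict)
```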

Loading data from a csv file

A .csv file is a comma-separated values file, a very common way to store data. To load such data into a pandas data frame, use the read_csv function:

df = pd.read_csv(file_path)

Here, file_path can be a local path to a csv file on your computer, or a URL pointing to one. The column names may be included in the csv file, or they may be passed as an argument. For more on this, and much more, take a look at the documentation.
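The two ways of supplying column names can be sketched as follows. To keep the example self-contained, io.StringIO stands in for a file on disk, and the CSV content is made up; with a real file you would pass its path or URL instead.

```python
import io
import pandas as pd

# Case 1: the first line of the (hypothetical) CSV holds the column names.
with_header = io.StringIO("name,age\nAda,36\nGrace,45\n")
df = pd.read_csv(with_header)

# Case 2: the CSV has no header line, so the names are passed as an argument.
no_header = io.StringIO("Ada,36\nGrace,45\n")
df2 = pd.read_csv(no_header, header=None, names=["name", "age"])

print(df)
print(df2)
```

Both calls should produce the same two-row data frame with the columns name and age.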

Getting an overview of a data frame

To show the first few rows of a data frame, the head method is useful (once more this should sound familiar to R programmers):

df.head()

This will show the first 5 rows of the data frame.

To show a different number of rows, pass the number of rows you want to the head method:

df.head(10)

This will show the first 10 rows of the data frame.

To show the last few rows of a data frame, use the tail method (again, this should sound familiar to R programmers):

df.tail()

This will show the last 5 rows of the data frame.
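The following sketch tries head and tail on a small, made-up data frame with 20 rows. Like head, tail accepts an optional number of rows.

```python
import pandas as pd

# Hypothetical data frame with 20 rows (n = 0..19 and its square).
df = pd.DataFrame({"n": range(20), "square": [i * i for i in range(20)]})

first_five = df.head()    # first 5 rows (the default)
first_ten = df.head(10)   # first 10 rows
last_three = df.tail(3)   # last 3 rows

print(first_five)
print(last_three)
```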

Subsetting: Getting a column by name

A data frame can be subset in many ways. One of the simplest is getting a single column. For instance, if the data frame df contains a column named age, we can extract it as follows:

ages = df["age"]
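As a runnable sketch of column extraction, the example below uses a small made-up data frame. The extracted column is a Series, so Series methods such as mean can be applied to it directly.

```python
import pandas as pd

# Hypothetical data frame with an "age" column.
df = pd.DataFrame({"name": ["Ada", "Grace", "Linus"], "age": [30, 40, 50]})

ages = df["age"]        # a single column, returned as a Series
mean_age = ages.mean()  # Series methods work on the extracted column

print(type(ages))
print(mean_age)
```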

More Information:

  1. pandas
  2. read_csv
  3. head