Getting started with Python pandas
What to do with Python’s pandas library?
Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
Ask Python itself what is pandas!
import pandas
pandas?
Here is some of the output:
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.
Main Features
“Here are just a few of the things that pandas does well:
- Easy handling of missing data in floating point as well as non-floating point data.
- Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
- Automatic and explicit data alignment: objects can be explicitly aligned
to a set of labels, or the user can simply ignore the labels and let
Series
,DataFrame
, etc. automatically align the data for you in computations. - Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.” (…)
Want to know more? Run pandas??
! To save some typing pandas can be imported as pd like import pandas as pd
. So pd??
will do it as well.
Go on with some real world example…
We shall import os
, plotly
(as plotly.express) and pandas
and load our example gapminder
dataset as DataFrame
.
Now we import os
, plotly
(as plotly.express) and pandas
modules than load our example gapminder
dataset as DataFrame
. Note: that this intro example is created as a jupyter-lab notebook, that is also exported to a python script. Results were exported to markdown and pdf. In order to render markdown correctly in Jekyll, some minor modification were needed. See the modified markdown file on github.
import os
import plotly.express as px
import pandas as pd
df = px.data.gapminder() # n.b. there is a separate module for gapminder data
type(df)
pandas.core.frame.DataFrame
Have a look at some attributes and descriptive statistics of our DataFrame.
df.head(5)
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
df.shape #without parents!
(1704, 8)
pd.set_option('display.float_format', lambda x: '%.1f' % x) # set number of digits
df.describe()
year | lifeExp | pop | gdpPercap | iso_num | |
---|---|---|---|---|---|
count | 1704.0 | 1704.0 | 1704.0 | 1704.0 | 1704.0 |
mean | 1979.5 | 59.5 | 29601212.3 | 7215.3 | 425.9 |
std | 17.3 | 12.9 | 106157896.7 | 9857.5 | 248.3 |
min | 1952.0 | 23.6 | 60011.0 | 241.2 | 4.0 |
25% | 1965.8 | 48.2 | 2793664.0 | 1202.1 | 208.0 |
50% | 1979.5 | 60.7 | 7023595.5 | 3531.8 | 410.0 |
75% | 1993.2 | 70.8 | 19585221.8 | 9325.5 | 638.0 |
max | 2007.0 | 82.6 | 1318683096.0 | 113523.1 | 894.0 |
df["country"].value_counts()
Nicaragua 12
Gambia 12
Rwanda 12
Cambodia 12
Congo, Dem. Rep. 12
..
Sudan 12
Swaziland 12
Peru 12
Bulgaria 12
Costa Rica 12
Name: country, Length: 142, dtype: int64
Filtering DataFrame
- Select the most recent year.
df["year"].max()
2007
Exclude all previous years and keep 2007 only with a query.
df.query('year == 2007')
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2007 | 43.8 | 31889923 | 974.6 | AFG | 4 |
23 | Albania | Europe | 2007 | 76.4 | 3600523 | 5937.0 | ALB | 8 |
35 | Algeria | Africa | 2007 | 72.3 | 33333216 | 6223.4 | DZA | 12 |
47 | Angola | Africa | 2007 | 42.7 | 12420476 | 4797.2 | AGO | 24 |
59 | Argentina | Americas | 2007 | 75.3 | 40301927 | 12779.4 | ARG | 32 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1655 | Vietnam | Asia | 2007 | 74.2 | 85262356 | 2441.6 | VNM | 704 |
1667 | West Bank and Gaza | Asia | 2007 | 73.4 | 4018332 | 3025.3 | PSE | 275 |
1679 | Yemen, Rep. | Asia | 2007 | 62.7 | 22211743 | 2280.8 | YEM | 887 |
1691 | Zambia | Africa | 2007 | 42.4 | 11746035 | 1271.2 | ZMB | 894 |
1703 | Zimbabwe | Africa | 2007 | 43.5 | 12311143 | 469.7 | ZWE | 716 |
142 rows × 8 columns
We got a df with 142 rows which is equal to the number of countries. And this is so True
.
len(df.country.unique()) == len(df.query('year == 2007'))
True
Which country has the lowest life expectancy in Europe in 2007? First print the value.
df.query('year == 2007 & continent == "Europe"')["lifeExp"].min()
71.777
What country has the highest life expectancy worldwide?
maxLE = df['lifeExp'].max()
df[df['lifeExp'] == maxLE]
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
803 | Japan | Asia | 2007 | 82.603 | 127467972 | 31656.06806 | JPN | 392 |
What country has the highest life expectancy in each continent?
df.groupby("continent").max("lifeExp")
year | lifeExp | pop | gdpPercap | iso_num | |
---|---|---|---|---|---|
continent | |||||
Africa | 2007 | 76.442 | 135031164 | 21951.21176 | 894 |
Americas | 2007 | 80.653 | 301139947 | 42951.65309 | 862 |
Asia | 2007 | 82.603 | 1318683096 | 113523.13290 | 887 |
Europe | 2007 | 81.757 | 82400996 | 49357.19017 | 826 |
Oceania | 2007 | 81.235 | 20434176 | 34435.36744 | 554 |
df.groupby(['continent'], sort=False)['lifeExp'].max()
continent
Asia 82.603
Europe 81.757
Africa 76.442
Americas 80.653
Oceania 81.235
Name: lifeExp, dtype: float64
Well, mission completed, but we are still lack the name of the countries :-( So copy paste from here.
idx = df.groupby(['continent'])['lifeExp'].transform(max) == df['lifeExp']
df[idx]
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
71 | Australia | Oceania | 2007 | 81.235 | 20434176 | 34435.367440 | AUS | 36 |
251 | Canada | Americas | 2007 | 80.653 | 33390141 | 36319.235010 | CAN | 124 |
695 | Iceland | Europe | 2007 | 81.757 | 301931 | 36180.789190 | ISL | 352 |
803 | Japan | Asia | 2007 | 82.603 | 127467972 | 31656.068060 | JPN | 392 |
1271 | Reunion | Africa | 2007 | 76.442 | 798094 | 7670.122558 | REU | 638 |
Now, look at the minimum values by continent! But remember to filter for the most recent data.
df2 = df.query('year == 2007')
idx = df2.groupby(['continent'])['lifeExp'].transform(min) == df2['lifeExp']
df2[idx]
country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num | |
---|---|---|---|---|---|---|---|---|
11 | Afghanistan | Asia | 2007 | 43.828 | 31889923 | 974.580338 | AFG | 4 |
647 | Haiti | Americas | 2007 | 60.916 | 8502814 | 1201.637154 | HTI | 332 |
1103 | New Zealand | Oceania | 2007 | 80.204 | 4115771 | 25185.009110 | NZL | 554 |
1463 | Swaziland | Africa | 2007 | 39.613 | 1133066 | 4513.480643 | SWZ | 748 |
1583 | Turkey | Europe | 2007 | 71.777 | 71158647 | 8458.276384 | TUR | 792 |
Ta da! It is Turkey that had the lowest life expectancy in Europe according to the example dataset.
Finally, a quick plot of the results with Pandas-only way (i.e. we shall not use plotly
’s advanced capabilities).
ax = df2[idx].plot.bar(x='country', y='lifeExp', rot=0)