SEAI 2021 - Python - Lab 1
Intro to Python
Vincenzo Nardelli - vincnardelli@gmail.com - https://github.com/vincnardelli
Lab structure:
- Intro
- 1 - Matrix computation with NumPy
- 2 - Data manipulation and analysis with Pandas
- 3 - Graphs with Matplotlib and Seaborn
- Extra topic: Git and Github (with Niccolò Salvini)
Let’s start from the basics!
3 + 5
8
12 / 7
1.7142857142857142
result = 3 + 5
result
8
print(result)
8
result = result * 3.1415
print(result)
25.132
vector = [1, 3, 8, 13]
vector * 3
[1, 3, 8, 13, 1, 3, 8, 13, 1, 3, 8, 13]
dict = {'a': 12,
'b': 34,
'c': 62,
'd': 68,
'e': 29}
dict
{'a': 12, 'b': 34, 'c': 62, 'd': 68, 'e': 29}
Unlike R, the basic version of Python does not allow operations between scalars and matrices. For this you need to convert the vector to numpy array.
The package functions must be called taking into account the library structure
import numpy as np
vector = np.array(vector)
vector
array([ 1, 3, 8, 13])
vector * 3
array([ 3, 9, 24, 39])
The procedure for the subset is similar to that of R but it must be taken into account that the numbering starts from 0 instead of 1.
vector[1]
3
vector[0]
1
Furthermore, in the case of multiple selection, the index starts from 0 (unlike R which starts from 1) and the second value representing the last element is NOT included in the subset (unlike R which is included).
vector[1:3]
array([3, 8])
vector[[False, True, True, False]]
array([3, 8])
vector < 3
array([ True, False, False, False])
vector[vector < 3]
array([1])
L = list(range(10))
L
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
type(L[0])
int
Or, similarly, a list of strings:
L2 = []
for c in L:
L2.append(str(c))
print(L2)
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
L2 = [str(c) for c in L]
L2
['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
type(L2[0])
str
Because of Python’s dynamic typing, we can even create heterogeneous lists:
L3 = [True, "2", 3.0, 4]
[type(item) for item in L3]
[bool, str, float, int]
1 - Matrix computation with NumPy
- Creating arrays from Python listis
- Creating arrays from Scratch
- NumPy Standard Data Types
- Array Attributes
- Array Indexing
- Array Slicing
- Arithmetic Operations
NumPy brings the computational power of languages like C and Fortran to Python!
Why use NumPy? It’s fast
-
In a Python object, to allow the flexible types, each item in the list must contain its own type info, reference count, and other information.
-
In a computational task we are in the special case that all variables are of the same type, much of this information is redundant: it can be much more efficient to store data in a fixed-type array.
The difference between a dynamic-type list and a fixed-type (NumPy-style) array is illustrated in the following figure:
At the implementation level, the array essentially contains a single pointer to one contiguous block of data.
The Python list, on the other hand, contains a pointer to a block of pointers, each of which in turn points to a full Python object.
Again, the advantage of the list is flexibility: because each list element is a full structure containing both data and type information, the list can be filled with data of any desired type. Fixed-type NumPy-style arrays lack this flexibility, but are much more efficient for storing and manipulating data.
import numpy as np
Creating Arrays from Python Lists
# integer array:
np.array([1, 4, 2, 5, 3])
array([1, 4, 2, 5, 3])
np.array([3.14, 4, 2, 3])
array([3.14, 4. , 2. , 3. ])
np.array([1, 2, 3, 4], dtype='float32')
array([1., 2., 3., 4.], dtype=float32)
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
Creating Arrays from Scratch
Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
Create an array filled with a linear sequence Starting at 0, ending at 20, stepping by 2
np.arange(0, 20, 2)
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)
array([0. , 0.25, 0.5 , 0.75, 1. ])
Create a 3x3 array of normally distributed random values with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))
array([[-1.02677226, 1.11060734, 0.03739026],
[-0.24285475, 0.88068307, 0.94551808],
[-0.06911716, -0.09423746, -1.25280425]])
Create a 3x3 identity matrix
np.eye(3)
array([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
NumPy Standard Data Types
Because NumPy is built in C, the types will be familiar to users of C, Fortran, and other related languages.
np.zeros(10, dtype=np.int16)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int16)
Data type | Description |
---|---|
bool_ |
Boolean (True or False) stored as a byte |
int_ |
Default integer type (same as C long ; normally either int64 or int32 ) |
intc |
Identical to C int (normally int32 or int64 ) |
intp |
Integer used for indexing (same as C ssize_t ; normally either int32 or int64 ) |
int8 |
Byte (-128 to 127) |
int16 |
Integer (-32768 to 32767) |
int32 |
Integer (-2147483648 to 2147483647) |
int64 |
Integer (-9223372036854775808 to 9223372036854775807) |
uint8 |
Unsigned integer (0 to 255) |
uint16 |
Unsigned integer (0 to 65535) |
uint32 |
Unsigned integer (0 to 4294967295) |
uint64 |
Unsigned integer (0 to 18446744073709551615) |
float_ |
Shorthand for float64 . |
float16 |
Half precision float: sign bit, 5 bits exponent, 10 bits mantissa |
float32 |
Single precision float: sign bit, 8 bits exponent, 23 bits mantissa |
float64 |
Double precision float: sign bit, 11 bits exponent, 52 bits mantissa |
complex_ |
Shorthand for complex128 . |
complex64 |
Complex number, represented by two 32-bit floats |
complex128 |
Complex number, represented by two 64-bit floats |
Array Attributes
One-dimensional array
np.random.seed(0)
x1 = np.random.randint(10, size=6)
x1
array([5, 0, 3, 3, 7, 9])
Two-dimensional array
x2 = np.random.randint(10, size=(3, 4))
x2
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
Three-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
x3
array([[[8, 1, 5, 9, 8],
[9, 4, 3, 0, 3],
[5, 0, 2, 3, 8],
[1, 3, 3, 3, 7]],
[[0, 1, 9, 9, 0],
[4, 7, 3, 2, 7],
[2, 0, 0, 4, 5],
[5, 6, 8, 4, 1]],
[[4, 9, 8, 1, 1],
[7, 9, 9, 3, 6],
[7, 2, 0, 3, 5],
[9, 4, 4, 6, 4]]])
Each array has attributes ndim
(the number of dimensions), shape
(the size of each dimension), and size
(the total size of the array):
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
Another useful attribute is the dtype
, the data type of the array (which we discussed previously in Understanding Data Types in Python):
print("dtype:", x3.dtype)
dtype: int64
Array Indexing
In a one-dimensional array, the $i^{th}$ value (counting from zero) can be accessed by specifying the desired index in square brackets, just as with Python lists:
x1
array([5, 0, 3, 3, 7, 9])
x1[0]
5
x1[4]
7
To index from the end of the array, you can use negative indices:
x1[-1]
9
x1[-2]
7
In a multi-dimensional array, items can be accessed using a comma-separated tuple of indices:
x2
array([[3, 5, 2, 4],
[7, 6, 8, 8],
[1, 6, 7, 7]])
x2[0, 0]
3
x2[2, 0]
1
x2[2, -1]
7
Values can also be modified using any of the above index notation:
x2[0, 0] = 12
x2
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
x1[0] = 3.14159 # this will be truncated!
x1
array([3, 0, 3, 3, 7, 9])
Array Slicing
Just as we can use square brackets to access individual array elements, we can also use them to access subarrays with the slice notation, marked by the colon (:
) character.
The NumPy slicing syntax follows that of the standard Python list; to access a slice of an array x
, use this:
x[start:stop:step]
If any of these are unspecified, they default to the values start=0
, stop=
size of dimension
, step=1
.
We’ll take a look at accessing sub-arrays in one dimension and in multiple dimensions.
x = np.arange(10)
x
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
First five element
x[:5]
array([0, 1, 2, 3, 4])
Elements after index 5
x[5:]
array([5, 6, 7, 8, 9])
middle sub-array
x[4:7]
array([4, 5, 6])
every other element
x[::2]
array([0, 2, 4, 6, 8])
every other element, starting at index 1
x[1::2]
array([1, 3, 5, 7, 9])
A potentially confusing case is when the step
value is negative.
In this case, the defaults for start
and stop
are swapped.
This becomes a convenient way to reverse an array:
x[::-1] # all elements, reversed
array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
x2
array([[12, 5, 2, 4],
[ 7, 6, 8, 8],
[ 1, 6, 7, 7]])
x2[:2, :3] # two rows, three columns
array([[12, 5, 2],
[ 7, 6, 8]])
x2[:3, ::2] # all rows, every other column
array([[12, 2],
[ 7, 8],
[ 1, 7]])
Finally, subarray dimensions can even be reversed together:
x2[::-1, ::-1]
array([[ 7, 7, 6, 1],
[ 8, 8, 6, 7],
[ 4, 2, 5, 12]])
Accessing array rows and columns
One commonly needed routine is accessing of single rows or columns of an array.
This can be done by combining indexing and slicing, using an empty slice marked by a single colon (:
):
print(x2[:, 0]) # first column of x2
[12 7 1]
print(x2[0, :]) # first row of x2
[12 5 2 4]
Arithmetic Operations
a = np.array([1,2,3])
b = np.array([(1.5,2,3), (4,5,6)], dtype = float)
a + b
array([[2.5, 4. , 6. ],
[5. , 7. , 9. ]])
a - b
array([[-0.5, 0. , 0. ],
[-3. , -3. , -3. ]])
a * b
array([[ 1.5, 4. , 9. ],
[ 4. , 10. , 18. ]])
a / b
array([[0.66666667, 1. , 1. ],
[0.25 , 0.4 , 0.5 ]])
np.exp(a)
array([ 2.71828183, 7.3890561 , 20.08553692])
np.log(a)
array([0. , 0.69314718, 1.09861229])
c = np.array([1.5,2], dtype = float)
c
array([1.5, 2. ])
d = np.array([4,5], dtype = float)
d
array([4., 5.])
c.dot(d)
16.0
1.5*4+2*5
16.0
2 - Data manipulation and analysis with Pandas
Pandas is a Python data analysis Library. The name is derived from the term “panel data”.
- The Pandas Series Object
- The Pandas DataFrame Object
- Construction DataFrame Objects
- Data loading
- Data indexing and selection
- Aggregation and grouping
- Simple aggregation
- GroupBy
- Aggregate, filter and transform
import pandas as pd
#pd.DataFrame?
The Pandas Series Object
A Pandas Series
is a one-dimensional array of indexed data.
It can be created from a list or array as follows:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data
0 0.25
1 0.50
2 0.75
3 1.00
dtype: float64
As we see in the output, the Series
wraps both a sequence of values and a sequence of indices, which we can access with the values
and index
attributes.
The values
are simply a familiar NumPy array:
data.values
array([0.25, 0.5 , 0.75, 1. ])
The index
is an array-like object of type pd.Index
, which we’ll discuss in more detail momentarily.
data.index
RangeIndex(start=0, stop=4, step=1)
Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:
data[1]
0.5
data[1:3]
1 0.50
2 0.75
dtype: float64
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
population = pd.Series(population_dict)
population
California 38332521
Texas 26448193
New York 19651127
Florida 19552860
Illinois 12882135
dtype: int64
By default, a Series
will be created where the index is drawn from the sorted keys.
From here, typical dictionary-style item access can be performed:
population['California']
38332521
Unlike a dictionary, though, the Series
also supports array-style operations such as slicing:
population['New York':'Illinois']
New York 19651127
Florida 19552860
Illinois 12882135
dtype: int64
population[2:5]
New York 19651127
Florida 19552860
Illinois 12882135
dtype: int64
The Pandas DataFrame Object
The next fundamental structure in Pandas is the DataFrame
.
The DataFrame
can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.
DataFrame as a generalized NumPy array
A DataFrame
is an analog of a two-dimensional array with both flexible row indices and flexible column names.
Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame
as a sequence of aligned Series
objects.
Here, by “aligned” we mean that they share the same index.
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
dtype: int64
states = pd.DataFrame({'population': population,
'area': area})
states
population | area | |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Florida | 19552860 | 170312 |
Illinois | 12882135 | 149995 |
states.index
Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')
states.columns
Index(['population', 'area'], dtype='object')
Thus the DataFrame
can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
DataFrame as specialized dictionary
Similarly, we can also think of a DataFrame
as a specialization of a dictionary.
Where a dictionary maps a key to a value, a DataFrame
maps a column name to a Series
of column data.
For example, asking for the 'area'
attribute returns the Series
object containing the areas we saw earlier:
states['area']
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
Constructing DataFrame objects
From a single Series object
pd.DataFrame(population, columns=['population'])
population | |
---|---|
California | 38332521 |
Texas | 26448193 |
New York | 19651127 |
Florida | 19552860 |
Illinois | 12882135 |
From a list of dicts
Any list of dictionaries can be made into a DataFrame
.
data = [{'a': i, 'b': 2 * i}
for i in range(3)]
pd.DataFrame(data)
a | b | |
---|---|---|
0 | 0 | 0 |
1 | 1 | 2 |
2 | 2 | 4 |
Even if some keys in the dictionary are missing, Pandas will fill them in with NaN
(i.e., “not a number”) values:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
a | b | c | |
---|---|---|---|
0 | 1.0 | 2 | NaN |
1 | NaN | 3 | 4.0 |
From a two-dimensional NumPy array
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
foo | bar | |
---|---|---|
a | 0.652790 | 0.635059 |
b | 0.995300 | 0.581850 |
c | 0.414369 | 0.474698 |
Data loading
path = "https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/data/titanic.csv"
titanic = pd.read_csv(path)
titanic.head()
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
path_xls = "https://github.com/pandas-dev/pandas/blob/master/doc/data/test.xls?raw=true"
test = pd.read_excel(path_xls)
test
Unnamed: 0 | A | B | C | D | |
---|---|---|---|---|---|
0 | 2000-01-03 | 0.980269 | 3.685731 | -0.364217 | -1.159738 |
1 | 2000-01-04 | 1.047916 | -0.041232 | -0.161812 | 0.212549 |
2 | 2000-01-05 | 0.498581 | 0.731168 | -0.537677 | 1.346270 |
3 | 2000-01-06 | 1.120202 | 1.567621 | 0.003641 | 0.675253 |
4 | 2000-01-07 | -0.487094 | 0.571455 | -1.611639 | 0.103469 |
5 | 2000-01-10 | 0.836649 | 0.246462 | 0.588543 | 1.062782 |
6 | 2000-01-11 | -0.157161 | 1.340307 | 1.195778 | -1.097007 |
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_cities')[1]
df.head()
City[a] | Country | Skyline | UN 2018 population estimates[b] | City proper[c] | Metropolitan area[d] | Urban area(Demographia)[12] | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
City[a] | Country | Skyline | UN 2018 population estimates[b] | Definition | Population | Area(km2) | Population | Area(km2) | Population | Area(km2) | |
0 | Tokyo | Japan | NaN | 37400068 | Metropolis prefecture | 13,515,271[14] | 2,191[14] | 37,274,000[15] | 13,452[15] | 37977000.0 | 8,230[e] |
1 | Delhi | India | NaN | 28514000 | Capital City | 16,753,235[16] | 1484 | 29,000,000[17] | 3,483[17] | 29617000.0 | 2,232[f] |
2 | Shanghai | China | NaN | 25582000 | Municipality | 24,183,000[18] | 6341 | NaN | NaN | 22120000.0 | 4,068[g] |
3 | São Paulo | Brazil | NaN | 21650000 | Municipality | 12,252,023[19] | 1521 | 21,734,682[20] | 7947 | 22046000.0 | 3,116[h] |
4 | Mexico City | Mexico | NaN | 21581000 | City-state | 9,209,944[21] | 1485 | 21,804,515[21] | 7,866[22] | 20996000.0 | 2386 |
#df.to_csv("scraped_data.csv")
#df.to_excel("scraped_data.xlsx")
Data Indexing and Selection
states['area']
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
Equivalently, we can use attribute-style access with column names that are strings:
states.area
California 423967
Texas 695662
New York 141297
Florida 170312
Illinois 149995
Name: area, dtype: int64
states['density'] = states['population'] / states['area']
states
population | area | density | |
---|---|---|---|
California | 38332521 | 423967 | 90.413926 |
Texas | 26448193 | 695662 | 38.018740 |
New York | 19651127 | 141297 | 139.076746 |
Florida | 19552860 | 170312 | 114.806121 |
Illinois | 12882135 | 149995 | 85.883763 |
Thus for array-style indexing, we need another convention.
Here Pandas again uses the loc
and iloc
indexers mentioned earlier.
Using the iloc
indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame
index and column labels are maintained in the result:
states.iloc[:3, :2]
population | area | |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
Similarly, using the loc
indexer we can index the underlying data in an array-like style but using the explicit index and column names:
states.loc[:'New York', :'area']
population | area | |
---|---|---|
California | 38332521 | 423967 |
Texas | 26448193 | 695662 |
New York | 19651127 | 141297 |
states.loc[states.density > 100, ['population', 'density']]
population | density | |
---|---|---|
New York | 19651127 | 139.076746 |
Florida | 19552860 | 114.806121 |
Additional indexing conventions
There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while indexing refers to columns, slicing refers to rows:
states['Florida':'Illinois']
population | area | density | |
---|---|---|---|
Florida | 19552860 | 170312 | 114.806121 |
Illinois | 12882135 | 149995 | 85.883763 |
Such slices can also refer to rows by number rather than by index:
states[1:3]
population | area | density | |
---|---|---|---|
Texas | 26448193 | 695662 | 38.018740 |
New York | 19651127 | 141297 | 139.076746 |
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
states[states.density > 100]
population | area | density | |
---|---|---|---|
New York | 19651127 | 141297 | 139.076746 |
Florida | 19552860 | 170312 | 114.806121 |
Aggregation and Grouping
An essential piece of analysis of large data is efficient summarization: computing aggregations like sum()
, mean()
, median()
, min()
, and max()
, in which a single number gives insight into the nature of a potentially large dataset.
In this section, we’ll explore aggregations in Pandas, from simple operations akin to what we’ve seen on NumPy arrays, to more sophisticated operations based on the concept of a groupby
.
Here we will use the Planets dataset, available via the Seaborn package (see Visualization With Seaborn). It gives information on planets that astronomers have discovered around other stars (known as extrasolar planets or exoplanets for short). It can be downloaded with a simple Seaborn command:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
(1035, 6)
planets.head()
method | number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|---|
0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |
This has some details on the 1,000+ extrasolar planets discovered up to 2014.
Simple Aggregation in Pandas
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
0 0.374540
1 0.950714
2 0.731994
3 0.598658
4 0.156019
dtype: float64
ser.sum()
2.811925491708157
ser.mean()
0.5623850983416314
For a DataFrame
, by default the aggregates return results within each column:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
A | B | |
---|---|---|
0 | 0.155995 | 0.020584 |
1 | 0.058084 | 0.969910 |
2 | 0.866176 | 0.832443 |
3 | 0.601115 | 0.212339 |
4 | 0.708073 | 0.181825 |
df.mean()
A 0.477888
B 0.443420
dtype: float64
By specifying the axis
argument, you can instead aggregate within each row:
df.mean(axis='columns')
0 0.088290
1 0.513997
2 0.849309
3 0.406727
4 0.444949
dtype: float64
planets.dropna().describe()
number | orbital_period | mass | distance | year | |
---|---|---|---|---|---|
count | 498.00000 | 498.000000 | 498.000000 | 498.000000 | 498.000000 |
mean | 1.73494 | 835.778671 | 2.509320 | 52.068213 | 2007.377510 |
std | 1.17572 | 1469.128259 | 3.636274 | 46.596041 | 4.167284 |
min | 1.00000 | 1.328300 | 0.003600 | 1.350000 | 1989.000000 |
25% | 1.00000 | 38.272250 | 0.212500 | 24.497500 | 2005.000000 |
50% | 1.00000 | 357.000000 | 1.245000 | 39.940000 | 2009.000000 |
75% | 2.00000 | 999.600000 | 2.867500 | 59.332500 | 2011.000000 |
max | 6.00000 | 17337.500000 | 25.000000 | 354.000000 | 2014.000000 |
The following table summarizes some other built-in Pandas aggregations:
Aggregation | Description |
---|---|
count() |
Total number of items |
first() , last() |
First and last item |
mean() , median() |
Mean and median |
min() , max() |
Minimum and maximum |
std() , var() |
Standard deviation and variance |
mad() |
Mean absolute deviation |
prod() |
Product of all items |
sum() |
Sum of all items |
These are all methods of DataFrame
and Series
objects.
GroupBy: Split, Apply, Combine
Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called groupby
operation.
The name “group by” comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: split, apply, combine.
A canonical example of this split-apply-combine operation, where the “apply” is a summation aggregation, is illustrated in this figure:
This makes clear what the groupby
accomplishes:
- The split step involves breaking up and grouping a
DataFrame
depending on the value of the specified key. - The apply step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The combine step merges the results of these operations into an output array.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
key | data | |
---|---|---|
0 | A | 0 |
1 | B | 1 |
2 | C | 2 |
3 | A | 3 |
4 | B | 4 |
5 | C | 5 |
df.groupby('key').sum()
data | |
---|---|
key | |
A | 3 |
B | 5 |
C | 7 |
The GroupBy object
planets.groupby('method')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fdced466128>
planets.groupby('method')['orbital_period']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x7fdced485240>
planets.groupby('method')['orbital_period'].median()
method
Astrometry 631.180000
Eclipse Timing Variations 4343.500000
Imaging 27500.000000
Microlensing 3300.000000
Orbital Brightness Modulation 0.342887
Pulsar Timing 66.541900
Pulsation Timing Variations 1170.000000
Radial Velocity 360.200000
Transit 5.714932
Transit Timing Variations 57.011000
Name: orbital_period, dtype: float64
Dispatch methods
Through some Python class magic, any method not explicitly implemented by the GroupBy
object will be passed through and called on the groups, whether they are DataFrame
or Series
objects.
For example, you can use the describe()
method of DataFrame
s to perform a set of aggregations that describe each group in the data:
planets.groupby('method')['year'].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
method | ||||||||
Astrometry | 2.0 | 2011.500000 | 2.121320 | 2010.0 | 2010.75 | 2011.5 | 2012.25 | 2013.0 |
Eclipse Timing Variations | 9.0 | 2010.000000 | 1.414214 | 2008.0 | 2009.00 | 2010.0 | 2011.00 | 2012.0 |
Imaging | 38.0 | 2009.131579 | 2.781901 | 2004.0 | 2008.00 | 2009.0 | 2011.00 | 2013.0 |
Microlensing | 23.0 | 2009.782609 | 2.859697 | 2004.0 | 2008.00 | 2010.0 | 2012.00 | 2013.0 |
Orbital Brightness Modulation | 3.0 | 2011.666667 | 1.154701 | 2011.0 | 2011.00 | 2011.0 | 2012.00 | 2013.0 |
Pulsar Timing | 5.0 | 1998.400000 | 8.384510 | 1992.0 | 1992.00 | 1994.0 | 2003.00 | 2011.0 |
Pulsation Timing Variations | 1.0 | 2007.000000 | NaN | 2007.0 | 2007.00 | 2007.0 | 2007.00 | 2007.0 |
Radial Velocity | 553.0 | 2007.518987 | 4.249052 | 1989.0 | 2005.00 | 2009.0 | 2011.00 | 2014.0 |
Transit | 397.0 | 2011.236776 | 2.077867 | 2002.0 | 2010.00 | 2012.0 | 2013.00 | 2014.0 |
Transit Timing Variations | 4.0 | 2012.500000 | 1.290994 | 2011.0 | 2011.75 | 2012.5 | 2013.25 | 2014.0 |
Aggregate, filter, transform
In particular, GroupBy
objects have aggregate()
, filter()
, transform()
, and apply()
methods that efficiently implement a variety of useful operations before combining the grouped data.
For the purpose of the following subsections, we’ll use this DataFrame
:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 5 |
1 | B | 1 | 0 |
2 | C | 2 | 3 |
3 | A | 3 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
Aggregation
We’re now familiar with GroupBy
aggregations with sum()
, median()
, and the like, but the aggregate()
method allows for even more flexibility.
It can take a string, a function, or a list thereof, and compute all the aggregates at once.
Here is a quick example combining all these:
df.groupby('key').aggregate(['min', np.median, max])
data1 | data2 | |||||
---|---|---|---|---|---|---|
min | median | max | min | median | max | |
key | ||||||
A | 0 | 1.5 | 3 | 3 | 4.0 | 5 |
B | 1 | 2.5 | 4 | 0 | 3.5 | 7 |
C | 2 | 3.5 | 5 | 3 | 6.0 | 9 |
Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
data1 | data2 | |
---|---|---|
key | ||
A | 0 | 5 |
B | 1 | 7 |
C | 2 | 9 |
Filtering
A filtering operation allows you to drop data based on the group properties. For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
df
key | data1 | data2 | |
---|---|---|---|
0 | A | 0 | 5 |
1 | B | 1 | 0 |
2 | C | 2 | 3 |
3 | A | 3 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
df.groupby('key').std()
data1 | data2 | |
---|---|---|
key | ||
A | 2.12132 | 1.414214 |
B | 2.12132 | 4.949747 |
C | 2.12132 | 4.242641 |
def filter_func(x):
return x['data2'].std() > 4
df.groupby('key').filter(filter_func)
key | data1 | data2 | |
---|---|---|---|
1 | B | 1 | 0 |
2 | C | 2 | 3 |
4 | B | 4 | 7 |
5 | C | 5 | 9 |
Transformation
While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine. For such a transformation, the output is the same shape as the input. A common example is to center the data by subtracting the group-wise mean:
df.groupby('key').transform(lambda x: x - x.mean())
data1 | data2 | |
---|---|---|
0 | -1.5 | 1.0 |
1 | -1.5 | -3.5 |
2 | -1.5 | -3.0 |
3 | 1.5 | -1.0 |
4 | 1.5 | 3.5 |
5 | 1.5 | 3.0 |
Example
As an example of this, in a couple lines of Python code we can put all these together and count discovered planets by method and by decade:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
decade | 1980s | 1990s | 2000s | 2010s |
---|---|---|---|---|
method | ||||
Astrometry | 0.0 | 0.0 | 0.0 | 2.0 |
Eclipse Timing Variations | 0.0 | 0.0 | 5.0 | 10.0 |
Imaging | 0.0 | 0.0 | 29.0 | 21.0 |
Microlensing | 0.0 | 0.0 | 12.0 | 15.0 |
Orbital Brightness Modulation | 0.0 | 0.0 | 0.0 | 5.0 |
Pulsar Timing | 0.0 | 9.0 | 1.0 | 1.0 |
Pulsation Timing Variations | 0.0 | 0.0 | 1.0 | 0.0 |
Radial Velocity | 1.0 | 52.0 | 475.0 | 424.0 |
Transit | 0.0 | 0.0 | 64.0 | 712.0 |
Transit Timing Variations | 0.0 | 0.0 | 0.0 | 9.0 |
3 - Graphs with Matplotlib and Seaborn
Matplotlib
https://matplotlib.org/stable/gallery/index.html
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
fig = plt.figure()
plt.plot(x, np.sin(x), '-')
plt.plot(x, np.cos(x), '--');
#fig.savefig('my_figure.png')
plt.plot(x, np.sin(x));
plt.plot(x, x + 0, linestyle='solid')
plt.plot(x, x + 1, linestyle='dashed')
plt.plot(x, x + 2, linestyle='dashdot')
plt.plot(x, x + 3, linestyle='dotted');
# For short, you can use the following codes:
plt.plot(x, x + 4, linestyle='-') # solid
plt.plot(x, x + 5, linestyle='--') # dashed
plt.plot(x, x + 6, linestyle='-.') # dashdot
plt.plot(x, x + 7, linestyle=':'); # dotted
plt.plot(x, x + 0, '-g') # solid green
plt.plot(x, x + 1, '--c') # dashed cyan
plt.plot(x, x + 2, '-.k') # dashdot black
plt.plot(x, x + 3, ':r'); # dotted red
plt.plot(x, np.sin(x), label="sin(x)")
plt.plot(x, np.cos(x), ':b', label='cos(x)')
plt.xlim(-3, 13)
plt.ylim(-2, 2);
plt.title("Sine/Cosine Curves")
plt.xlabel("x")
plt.ylabel("f(x)");
plt.legend();
x = np.linspace(0, 10, 30)
y = np.sin(x)
plt.scatter(x, y, marker='o')
<matplotlib.collections.PathCollection at 0x7fdced88b550>
rng = np.random.RandomState(0)
x = rng.randn(100)
y = rng.randn(100)
colors = rng.rand(100)
sizes = 1000 * rng.rand(100)
plt.scatter(x, y, c=colors, s=sizes, alpha=0.3,
cmap='viridis')
plt.colorbar();
data = np.random.randn(1000)
plt.hist(data);
plt.hist(data, bins=30, density=True, alpha=0.5,
histtype='stepfilled', color='steelblue',
edgecolor='none');
mean = [0, 0]
cov = [[1, 1], [1, 2]]
x, y = np.random.multivariate_normal(mean, cov, 10000).T
plt.hist2d(x, y, bins=30, cmap='Blues')
cb = plt.colorbar()
cb.set_label('counts in bin')
Multiple plots
MATLAB-style Interface
x = np.linspace(0, 10, 100)
plt.figure() # create a plot figure
# create the first of two panels and set current axis
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# create the second panel and set current axis
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
Object-oriented interface
# First create a grid of plots
# ax will be an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x));
Seaborn
https://seaborn.pydata.org/examples/index.html
import seaborn as sns
iris = sns.load_dataset("iris")
iris.head()
sepal_length | sepal_width | petal_length | petal_width | species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
sns.histplot(data=iris, x="sepal_length", hue="species", multiple="stack");
sns.kdeplot(data=iris, x="sepal_length", hue="species", shade=True, alpha=0.5);
sns.pairplot(data=iris, hue='species', height=2);