A comprehensive guide to data manipulation with pandas
Tidy Data - A Foundation for Wrangling in pandas
In a tidy data set:
Each variable is saved in its own column
Each observation is saved in its own row
Tidy data complements pandas's vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.
Select columns whose name matches regular expression
Subsets (Rows and Columns)
Use df.loc[] and df.iloc[] to select only rows, only columns or both.
Use df.at[] and df.iat[] to access a single value by row and column. First index selects rows, second index columns.
Operation
Code
Description
Rows by position
df.iloc[10:20]
Select rows 10-20
Columns by position
df.iloc[:, [1, 2, 5]]
Select columns in positions 1, 2 and 5 (first column is 0)
Columns by label
df.loc[:, 'x2':'x4']
Select all columns between x2 and x4 (inclusive)
Conditional selection
df.loc[df['a'] > 10, ['a', 'c']]
Select rows meeting logical condition, and only the specific columns
Single value by position
df.iat[1, 2]
Access single value by index
Single value by label
df.at[4, 'A']
Access single value by label
Using query()
query() allows Boolean expressions for filtering rows.
Count number of rows with each unique value of variable
df.describe()
Basic descriptive statistics for each column (or GroupBy)
Summary Functions
pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling) and produce single values for each of the groups.
Function
Description
sum()
Sum values of each object
count()
Count non-NA/null values of each object
mean()
Mean value of each object
median()
Median value of each object
min()
Minimum value in each object
max()
Maximum value in each object
std()
Standard deviation of each object
var()
Variance of each object
quantile([0.25, 0.75])
Quantiles of each object
apply(function)
Apply function to each object
Make New Columns
Operation
Code
Description
Add single column
df['Volume'] = df.Length * df.Height * df.Depth
Add single column
Assign multiple
df.assign(Area=lambda df: df.Length * df.Height)
Compute and append one or more new columns
Binning
pd.qcut(df.col, n, labels=False)
Bin column into n buckets
Vector Functions
pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series).