Pandas Data Wrangling Cheat Sheet

Python Libraries Pandas

Posted by admin on 2025-09-17 19:46:00 |

Share: Facebook | Twitter | Whatsapp | Linkedin Visits: 34


Pandas Data Wrangling Cheat Sheet

Pandas Data Wrangling Cheat Sheet

A comprehensive guide to data manipulation with pandas

Tidy Data - A Foundation for Wrangling in pandas

In a tidy data set:

  • Each variable is saved in its own column
  • Each observation is saved in its own row

Tidy data complements pandas's vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.


Creating DataFrames

Specify values for each column

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])

Specify values for each row

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])

Create DataFrame with MultiIndex

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)], 
        names=['n', 'v']))

Subset Observations (Rows)

Operation Code Description
Filter by condition df[df.Length > 7] Extract rows that meet logical criteria
Remove duplicates df.drop_duplicates() Remove duplicate rows (only considers columns)
Random sample df.sample(frac=0.5) Randomly select fraction of rows
Random sample df.sample(n=10) Randomly select n rows
Top n df.nlargest(n, 'value') Select and order top n entries
Bottom n df.nsmallest(n, 'value') Select and order bottom n entries
First n df.head(n) Select first n rows
Last n df.tail(n) Select last n rows

Subset Variables (Columns)

Operation Code Description
Multiple columns df[['width', 'length', 'species']] Select multiple columns with specific names
Single column df['width'] or df.width Select single column with specific name
Pattern matching df.filter(regex='regex') Select columns whose name matches regular expression

Subsets (Rows and Columns)

Use df.loc[] and df.iloc[] to select only rows, only columns or both.
Use df.at[] and df.iat[] to access a single value by row and column.
First index selects rows, second index columns.

Operation Code Description
Rows by position df.iloc[10:20] Select rows 10-20
Columns by position df.iloc[:, [1, 2, 5]] Select columns in positions 1, 2 and 5 (first column is 0)
Columns by label df.loc[:, 'x2':'x4'] Select all columns between x2 and x4 (inclusive)
Conditional selection df.loc[df['a'] > 10, ['a', 'c']] Select rows meeting logical condition, and only the specific columns
Single value by position df.iat[1, 2] Access single value by index
Single value by label df.at[4, 'A'] Access single value by label

Using query()

query() allows Boolean expressions for filtering rows.

df.query('Length > 7')
df.query('Length > 7 and Width < 8')
df.query('Name.str.startswith("abc")', engine="python")

Logic in Python (and pandas)

Operator Description Example
< Less than
> Greater than
== Equals
<= Less than or equals
>= Greater than or equals
!= Not equal to
df.column.isin(values) Group membership
pd.isnull(obj) Is NaN
pd.notnull(obj) Is not NaN
`&, , ~, ^` Logical and, or, not, xor
df.any(), df.all() Any, all

Regular Expressions Examples

Pattern Description
'\.' Matches strings containing a period '.'
'Length$' Matches strings ending with word 'Length'
'^Sepal' Matches strings beginning with the word 'Sepal'
'^x[1-5]$' Matches strings beginning with 'x' and ending with 1,2,3,4,5
'^(?!Species$).*' Matches strings except the string 'Species'

Reshaping Data

Change layout, sorting, reindexing, renaming

Operation Code Description
Melt pd.melt(df) Gather columns into rows
Pivot df.pivot(columns='var', values='val') Spread rows into columns
Concatenate rows pd.concat([df1, df2]) Append rows of DataFrames
Concatenate columns pd.concat([df1, df2], axis=1) Append columns of DataFrames
Sort by values (asc) df.sort_values('mpg') Order rows by values of a column (low to high)
Sort by values (desc) df.sort_values('mpg', ascending=False) Order rows by values of a column (high to low)
Rename columns df.rename(columns={'y': 'year'}) Rename the columns of a DataFrame
Sort index df.sort_index() Sort the index of a DataFrame
Reset index df.reset_index() Reset index of DataFrame to row numbers, moving index to columns
Drop columns df.drop(columns=['Length', 'Height']) Drop columns from DataFrame

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
      .rename(columns={
          'variable': 'var',
          'value': 'val'})
      .query('val >= 200')
     )

Summarize Data

Basic Information

Function Description
len(df) Number of rows in DataFrame
df.shape Tuple of # of rows, # of columns in DataFrame
df['w'].nunique() Number of distinct values in a column
df['w'].value_counts() Count number of rows with each unique value of variable
df.describe() Basic descriptive statistics for each column (or GroupBy)

Summary Functions

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling) and produce single values for each of the groups.

Function Description
sum() Sum values of each object
count() Count non-NA/null values of each object
mean() Mean value of each object
median() Median value of each object
min() Minimum value in each object
max() Maximum value in each object
std() Standard deviation of each object
var() Variance of each object
quantile([0.25, 0.75]) Quantiles of each object
apply(function) Apply function to each object

Make New Columns

Operation Code Description
Add single column df['Volume'] = df.Length * df.Height * df.Depth Add single column
Assign multiple df.assign(Area=lambda df: df.Length * df.Height) Compute and append one or more new columns
Binning pd.qcut(df.col, n, labels=False) Bin column into n buckets

Vector Functions

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series).

Function Description
shift(1) Copy with values shifted by 1
shift(-1) Copy with values lagged by 1
rank(method='dense') Ranks with no gaps
rank(method='min') Ranks. Ties get min rank
rank(method='first') Ranks. Ties go to first value
rank(pct=True) Ranks rescaled to interval [0, 1]
cumsum() Cumulative sum
cummax() Cumulative max
cummin() Cumulative min
cumprod() Cumulative product
max(axis=1) Element-wise max
min(axis=1) Element-wise min
clip(lower=-10, upper=10) Trim values at input thresholds
abs() Absolute value

Combine Data Sets

Standard Joins

Join Type Code Description
Left Join pd.merge(adf, bdf, how='left', on='x1') Join matching rows from bdf to adf
Right Join pd.merge(adf, bdf, how='right', on='x1') Join matching rows from adf to bdf
Inner Join pd.merge(adf, bdf, how='inner', on='x1') Join data. Retain only rows in both sets
Outer Join pd.merge(adf, bdf, how='outer', on='x1') Join data. Retain all values, all rows

Filtering Joins

Operation Code Description
Semi Join adf[adf.x1.isin(bdf.x1)] All rows in adf that have a match in bdf
Anti Join adf[~adf.x1.isin(bdf.x1)] All rows in adf that do not have a match in bdf

Set-like Operations

Operation Code Description
Intersection pd.merge(ydf, zdf) Rows that appear in both ydf and zdf
Union pd.merge(ydf, zdf, how='outer') Rows that appear in either or both ydf and zdf
Set Difference pd.merge(ydf, zdf, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge']) Rows that appear in ydf but not zdf

Group Data

Operation Code Description
Group by column df.groupby(by="col") Return a GroupBy object, grouped by values in column named "col"
Group by index df.groupby(level="ind") Return a GroupBy object, grouped by values in index level named "ind"

Additional GroupBy Functions

Function Description
size() Size of each group
agg(function) Aggregate group using function

All summary functions can be applied to a group. The examples can also be applied to groups, where the function is applied on a per-group basis.


Windows

Function Description
df.expanding() Return an Expanding object allowing summary functions to be applied cumulatively
df.rolling(n) Return a Rolling object allowing summary functions to be applied to windows of length n

Handling Missing Data

Operation Code Description
Drop missing df.dropna() Drop rows with any column having NA/null data
Fill missing df.fillna(value) Replace all NA/null data with value

Plotting

Plot Type Code Description
Histogram df.plot.hist() Histogram for each column
Scatter Plot df.plot.scatter(x='w', y='h') Scatter chart using pairs of points

Resources

  • Pandas API Reference: pandas.pydata.org
  • Pandas User Guide: Available at the official pandas documentation

Leave a Comment: