Pandas Data Wrangling Cheat Sheet

Python Libraries Pandas

Posted by admin on 2025-09-17 19:46:00 |

Share: Facebook | Twitter | Whatsapp | Linkedin Visits: 104

Pandas Data Wrangling Cheat Sheet

A comprehensive guide to data manipulation with pandas

Tidy Data - A Foundation for Wrangling in pandas

In a tidy data set:

Each variable is saved in its own column
Each observation is saved in its own row

Tidy data complements pandas's vectorized operations. pandas will automatically preserve observations as you manipulate variables. No other format works as intuitively with pandas.

Creating DataFrames

Specify values for each column

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=[1, 2, 3])

Specify values for each row

df = pd.DataFrame(
    [[4, 7, 10],
     [5, 8, 11],
     [6, 9, 12]],
    index=[1, 2, 3],
    columns=['a', 'b', 'c'])

Create DataFrame with MultiIndex

df = pd.DataFrame(
    {"a": [4, 5, 6],
     "b": [7, 8, 9],
     "c": [10, 11, 12]},
    index=pd.MultiIndex.from_tuples(
        [('d', 1), ('d', 2), ('e', 2)], 
        names=['n', 'v']))

Subset Observations (Rows)

Operation	Code	Description
Filter by condition	`df[df.Length > 7]`	Extract rows that meet logical criteria
Remove duplicates	`df.drop_duplicates()`	Remove duplicate rows (only considers columns)
Random sample	`df.sample(frac=0.5)`	Randomly select fraction of rows
Random sample	`df.sample(n=10)`	Randomly select n rows
Top n	`df.nlargest(n, 'value')`	Select and order top n entries
Bottom n	`df.nsmallest(n, 'value')`	Select and order bottom n entries
First n	`df.head(n)`	Select first n rows
Last n	`df.tail(n)`	Select last n rows

Subset Variables (Columns)

Operation	Code	Description
Multiple columns	`df[['width', 'length', 'species']]`	Select multiple columns with specific names
Single column	`df['width']` or `df.width`	Select single column with specific name
Pattern matching	`df.filter(regex='regex')`	Select columns whose name matches regular expression

Subsets (Rows and Columns)

Use df.loc[] and df.iloc[] to select only rows, only columns or both.
Use df.at[] and df.iat[] to access a single value by row and column.
First index selects rows, second index columns.

Operation	Code	Description
Rows by position	`df.iloc[10:20]`	Select rows 10-20
Columns by position	`df.iloc[:, [1, 2, 5]]`	Select columns in positions 1, 2 and 5 (first column is 0)
Columns by label	`df.loc[:, 'x2':'x4']`	Select all columns between x2 and x4 (inclusive)
Conditional selection	`df.loc[df['a'] > 10, ['a', 'c']]`	Select rows meeting logical condition, and only the specific columns
Single value by position	`df.iat[1, 2]`	Access single value by index
Single value by label	`df.at[4, 'A']`	Access single value by label

Using query()

query() allows Boolean expressions for filtering rows.

df.query('Length > 7')
df.query('Length > 7 and Width < 8')
df.query('Name.str.startswith("abc")', engine="python")

Logic in Python (and pandas)

Operator	Description	Example
`<`	Less than
`>`	Greater than
`==`	Equals
`<=`	Less than or equals
`>=`	Greater than or equals
`!=`	Not equal to
`df.column.isin(values)`	Group membership
`pd.isnull(obj)`	Is NaN
`pd.notnull(obj)`	Is not NaN
`&,	, ~, ^`	Logical and, or, not, xor
`df.any(), df.all()`	Any, all

Regular Expressions Examples

Pattern	Description
`'\.'`	Matches strings containing a period '.'
`'Length$'`	Matches strings ending with word 'Length'
`'^Sepal'`	Matches strings beginning with the word 'Sepal'
`'^x[1-5]$'`	Matches strings beginning with 'x' and ending with 1,2,3,4,5
`'^(?!Species$).*'`	Matches strings except the string 'Species'

Reshaping Data

Change layout, sorting, reindexing, renaming

Operation	Code	Description
Melt	`pd.melt(df)`	Gather columns into rows
Pivot	`df.pivot(columns='var', values='val')`	Spread rows into columns
Concatenate rows	`pd.concat([df1, df2])`	Append rows of DataFrames
Concatenate columns	`pd.concat([df1, df2], axis=1)`	Append columns of DataFrames
Sort by values (asc)	`df.sort_values('mpg')`	Order rows by values of a column (low to high)
Sort by values (desc)	`df.sort_values('mpg', ascending=False)`	Order rows by values of a column (high to low)
Rename columns	`df.rename(columns={'y': 'year'})`	Rename the columns of a DataFrame
Sort index	`df.sort_index()`	Sort the index of a DataFrame
Reset index	`df.reset_index()`	Reset index of DataFrame to row numbers, moving index to columns
Drop columns	`df.drop(columns=['Length', 'Height'])`	Drop columns from DataFrame

Method Chaining

Most pandas methods return a DataFrame so that another pandas method can be applied to the result. This improves readability of code.

df = (pd.melt(df)
      .rename(columns={
          'variable': 'var',
          'value': 'val'})
      .query('val >= 200')
     )

Summarize Data

Basic Information

Function	Description
`len(df)`	Number of rows in DataFrame
`df.shape`	Tuple of # of rows, # of columns in DataFrame
`df['w'].nunique()`	Number of distinct values in a column
`df['w'].value_counts()`	Count number of rows with each unique value of variable
`df.describe()`	Basic descriptive statistics for each column (or GroupBy)

Summary Functions

pandas provides a large set of summary functions that operate on different kinds of pandas objects (DataFrame columns, Series, GroupBy, Expanding and Rolling) and produce single values for each of the groups.

Function	Description
`sum()`	Sum values of each object
`count()`	Count non-NA/null values of each object
`mean()`	Mean value of each object
`median()`	Median value of each object
`min()`	Minimum value in each object
`max()`	Maximum value in each object
`std()`	Standard deviation of each object
`var()`	Variance of each object
`quantile([0.25, 0.75])`	Quantiles of each object
`apply(function)`	Apply function to each object

Make New Columns

Operation	Code	Description
Add single column	`df['Volume'] = df.Length * df.Height * df.Depth`	Add single column
Assign multiple	`df.assign(Area=lambda df: df.Length * df.Height)`	Compute and append one or more new columns
Binning	`pd.qcut(df.col, n, labels=False)`	Bin column into n buckets

Vector Functions

pandas provides a large set of vector functions that operate on all columns of a DataFrame or a single selected column (a pandas Series).

Function	Description
`shift(1)`	Copy with values shifted by 1
`shift(-1)`	Copy with values lagged by 1
`rank(method='dense')`	Ranks with no gaps
`rank(method='min')`	Ranks. Ties get min rank
`rank(method='first')`	Ranks. Ties go to first value
`rank(pct=True)`	Ranks rescaled to interval [0, 1]
`cumsum()`	Cumulative sum
`cummax()`	Cumulative max
`cummin()`	Cumulative min
`cumprod()`	Cumulative product
`max(axis=1)`	Element-wise max
`min(axis=1)`	Element-wise min
`clip(lower=-10, upper=10)`	Trim values at input thresholds
`abs()`	Absolute value

Combine Data Sets

Standard Joins

Join Type	Code	Description
Left Join	`pd.merge(adf, bdf, how='left', on='x1')`	Join matching rows from bdf to adf
Right Join	`pd.merge(adf, bdf, how='right', on='x1')`	Join matching rows from adf to bdf
Inner Join	`pd.merge(adf, bdf, how='inner', on='x1')`	Join data. Retain only rows in both sets
Outer Join	`pd.merge(adf, bdf, how='outer', on='x1')`	Join data. Retain all values, all rows

Filtering Joins

Operation	Code	Description
Semi Join	`adf[adf.x1.isin(bdf.x1)]`	All rows in adf that have a match in bdf
Anti Join	`adf[~adf.x1.isin(bdf.x1)]`	All rows in adf that do not have a match in bdf

Set-like Operations

Operation	Code	Description
Intersection	`pd.merge(ydf, zdf)`	Rows that appear in both ydf and zdf
Union	`pd.merge(ydf, zdf, how='outer')`	Rows that appear in either or both ydf and zdf
Set Difference	`pd.merge(ydf, zdf, how='outer', indicator=True).query('_merge == "left_only"').drop(columns=['_merge'])`	Rows that appear in ydf but not zdf

Group Data

Operation	Code	Description
Group by column	`df.groupby(by="col")`	Return a GroupBy object, grouped by values in column named "col"
Group by index	`df.groupby(level="ind")`	Return a GroupBy object, grouped by values in index level named "ind"

Additional GroupBy Functions

Function	Description
`size()`	Size of each group
`agg(function)`	Aggregate group using function

All summary functions can be applied to a group. The examples can also be applied to groups, where the function is applied on a per-group basis.

Windows

Function	Description
`df.expanding()`	Return an Expanding object allowing summary functions to be applied cumulatively
`df.rolling(n)`	Return a Rolling object allowing summary functions to be applied to windows of length n

Handling Missing Data

Operation	Code	Description
Drop missing	`df.dropna()`	Drop rows with any column having NA/null data
Fill missing	`df.fillna(value)`	Replace all NA/null data with value

Plotting

Plot Type	Code	Description
Histogram	`df.plot.hist()`	Histogram for each column
Scatter Plot	`df.plot.scatter(x='w', y='h')`	Scatter chart using pairs of points

Resources

Pandas API Reference: pandas.pydata.org
Pandas User Guide: Available at the official pandas documentation